CONF-88052Q--
(DE88013180)
QSAR 88
Proceedings of the
THIRD INTERNATIONAL WORKSHOP ON
QUANTITATIVE STRUCTURE-ACTIVITY RELATIONSHIPS
IN ENVIRONMENTAL TOXICOLOGY
May 22-26, 1988
Knoxville, Tennessee
Edited by: James E. Turner
M. Wendy Williams
T. Wayne Schultz
Norma J. Kwaak
-------
SIMPLIFYING COMPLEX QSAR'S IN TOXICITY STUDIES
WITH MULTIVARIATE STATISTICS
Gerald J. Niemi and James M. McKim
Environmental Research Laboratory, Duluth
U.S. Environmental Protection Agency
6201 Congdon Boulevard
Duluth, MN 55804 USA
ABSTRACT
During the past several decades many quantitative structure-activity
relationships (QSAR's) have been derived from relatively small data sets of
chemicals in a homologous series and selected empirical observations. An
alternative approach is to analyze large data sets consisting of
heterogeneous groups of chemicals and to explore QSAR's among these
chemicals for generalized patterns of chemical behavior. Exploratory
analyses using multivariate statistical procedures in an iterative fashion
have traditionally been a neglected tool in the effort to find relationships
that can lead to testable hypotheses. Hence, statistical analysis does not
need to be a device only to test a hypothesis. Moreover, multivariate
statistical analyses (e.g., principal components analysis (PCA) and factor
analyses) can simplify the complex relationships among variables. One of the
major reasons for not considering multivariate statistical routines for
"simplifying" complex relationships is a lack of understanding and routine
use of these techniques by practicing QSAR scientists. The use of
exploratory multivariate statistical techniques for simplifying complex QSAR
problems is demonstrated through the use of research data on biodegradation
and mode of toxic action. In these examples, a large number of explanatory
variables were examined to explore which variables might best explain whether
a chemical biodegrades or whether a toxic response by an organism can be
used to identify a mode of toxic action. In both cases, the procedures
reduced the number of potential explanatory variables and generated
hypotheses about biodegradation and mode of toxic action for future research
without explicitly testing an existing hypothesis.
¹ Present address: Center for Water and the Environment,
Natural Resources Research Institute, University of
Minnesota, Duluth, MN 55811 USA.
-------
INTRODUCTION
The vast majority of QSAR's developed over the past several decades were
largely derived from relatively small data sets of homologous series of
chemicals. Furthermore, the "structural" variables used to make predictions
about the "activity" variables in a "quantitative relationship" were
primarily based on "secondary structural" variables such as the n-octanol-
water partition coefficient (log P). There is nothing inherently wrong with
the development of these relationships except that certain limitations exist
in their application. These include:
1) secondary structural variables such as log P (independent variables)
are measured or calculated with error and hence these errors are
propagated into predictions of the activity variables (dependent
variables);
2) secondary structural variables are often impossible to calculate for
some compounds, which limits their application; and
3) precise definitions of what constitutes a "homologous" series are
often vague and, hence, the boundaries of a specific QSAR are also
vague.
There clearly is no simple solution to these problems. However, the
first two limitations can be overcome by considering primary structure
activity variables or variables calculated directly from the structure of the
chemical (e.g., chemical fragments or molecular connectivity indices). The
third limitation can be overcome by building QSAR's in a more global context
in which no subjective boundaries are placed on the realm of chemicals to
which the QSAR will apply. Here our objectives are the following: 1)
explore reasons why a more global perspective has not been pursued, and 2)
present two examples of how a complex QSAR problem can be simplified through
the use of multivariate statistics.
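As a hedged illustration of a variable calculated directly from chemical structure, the sketch below computes a first-order molecular connectivity index of the Randic type from a hydrogen-suppressed molecular graph. The choice of this particular index, the function name, and the two carbon skeletons are assumptions made for illustration only; they are not taken from the studies discussed here.

```python
import math

def first_order_connectivity(bonds):
    """Return a first-order molecular connectivity index.

    `bonds` is a list of (atom_i, atom_j) pairs describing the
    hydrogen-suppressed molecular graph (hypothetical atom labels).
    """
    # delta = number of non-hydrogen neighbours of each atom
    degree = {}
    for a, b in bonds:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    # Sum (delta_i * delta_j) ** -0.5 over all bonds
    return sum(1.0 / math.sqrt(degree[a] * degree[b]) for a, b in bonds)

# Carbon skeletons only; atom labels are arbitrary
n_butane = [("C1", "C2"), ("C2", "C3"), ("C3", "C4")]
isobutane = [("C1", "C2"), ("C2", "C3"), ("C2", "C4")]

chi_n_butane = first_order_connectivity(n_butane)
chi_isobutane = first_order_connectivity(isobutane)
print(round(chi_n_butane, 4), round(chi_isobutane, 4))
```

Because the index depends only on the bond list, it can be calculated for any structure that can be drawn, which is the property that makes such primary variables attractive when log P cannot be obtained.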
Limitations to a global perspective
The sheer magnitude of a QSAR problem from a global perspective is
intimidating because of the large number of structures (e.g., hundreds of
thousands) that can be considered, plus the number of potential structural
variables that theoretically can be calculated for a structure. The
magnitude of the problem has limited conceptual approaches and has
immediately forced scientists into limiting the problem, usually by dealing
with discrete homogeneous series of chemicals.
When one is working from an industrial perspective in attempts to design
a new chemical or drug, this more focused approach might be feasible because
a more limited number of solutions are possible given the availability of a
lead structure. However, from a regulatory perspective it is not feasible
because an initial evaluation must focus on an objective placement of the
chemical into the proper group of chemicals and QSAR model, from which
predictions can be made. Hence, the perspective in which scientists using
QSAR techniques must examine a problem will partly determine the approach
to appropriately bound the chemical within the global universe of potential
chemical structures.
Two additional reasons for an inhibited global perspective in
understanding QSAR are the lack of training of scientists in statistical
analysis, especially multivariate statistics, and the relatively rapid and
expanding development of computer capabilities. Regarding the former,
scientists using QSAR techniques are generally either chemists, biologists,
or biochemists. Most of these scientists have formal training in mathematics
including calculus because most undergraduate and graduate curricula require
some mathematical training. Some have formal training in one or two
elementary statistics courses in which at most two variables (e.g.,
correlation or regression) are considered in statistical tests. Few have
any training in multivariate statistical techniques, training which is
necessary to consider a multivariate, global perspective to QSAR.
In regard to the latter limitation, phenomenal progress has been made
during the past 30 years in the development, design, and use of computer
hardware and software. Yet, despite this progress, the capacity of many
computers, the cost of using them, and the availability of software capable
of handling hundreds of variables for thousands of chemical cases are still
limited. For
example, we can calculate literally hundreds of potential primary structure-
activity variables based on various mathematical routines (Basak et al.
1987), yet one of the most commonly available statistical packages, the
Statistical Package for the Social Sciences (SPSS, Nie et al. 1975) is
limited to analysis of < 100 independent variables (Niemi et al. 1985).
Therefore, we have powerful computer capabilities today, but they may not
yet be as powerful as we desire nor do enough scientists have the proper
technical training to fully utilize statistical routines or existing
computers. Progress in developing QSAR, especially global-multivariate
relationships, will be inhibited until a larger critical mass of QSAR
scientists are educated, hardware and software capabilities of computers are
improved, and costs to obtain and use these computer capabilities are
reasonable.
QSAR in biodegradation research
Development of QSAR models to predict whether chemicals microbially
degrade in aquatic environments or to determine the rate at which a chemical
degrades can be difficult because of the many interacting factors that
contribute to biodegradability (e.g., see Alexander 1981). Niemi et al.
(1987) attempted an objective multivariate statistical approach to this
problem by using a data base of 287 compounds with available 5-day BOD values
(BOD5). BOD5 was used as an approximate measure of the inherent ability
of a chemical to microbially degrade in a modern sewage treatment facility.
For each of these compounds 54 molecular connectivity indices and five
physicochemical parameters including log P (Leo and Weininger 1984) were
calculated and used as potential explanatory variables for assessing whether
the BOD5 value was relatively high or low (i.e., biodegradable or
persistent, respectively).
The first manipulation of these data was to separate the compounds into
biodegradable and persistent groups based on a natural division in the
BOD5 values. Discriminant function analysis (DFA), with the molecular
connectivity indices and the five physicochemical factors as
explanatory variables, was used in an attempt to separate these two groups. In
general, DFA is a multivariate statistical technique that identifies
whether differences in a set of explanatory variables exist between two or
more groups. Although two previous papers reported some success with this
technique (Geating 1981, Enslein et al. 1984), only 50 % of the 287 compounds
could be correctly discriminated in this exercise.
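A two-group discriminant analysis of this kind can be sketched with Fisher's linear discriminant. The example below is a minimal illustration, assuming only two explanatory variables and equal prior weights for the two groups; the descriptor values are made up and the function names are hypothetical. It is not the BMDP or SPSS procedure used in the studies cited above.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def pooled_covariance(group_a, group_b):
    """Pooled within-group covariance matrix for two groups."""
    p = len(group_a[0])
    cov = [[0.0] * p for _ in range(p)]
    for rows in (group_a, group_b):
        m = mean_vector(rows)
        for r in rows:
            for i in range(p):
                for j in range(p):
                    cov[i][j] += (r[i] - m[i]) * (r[j] - m[j])
    dof = len(group_a) + len(group_b) - 2
    return [[c / dof for c in row] for row in cov]

def fisher_discriminant(group_a, group_b):
    """Return (weights, threshold) for a two-variable, two-group DFA."""
    (a, b), (c, d) = pooled_covariance(group_a, group_b)
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    ma, mb = mean_vector(group_a), mean_vector(group_b)
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    # Discriminant weights: inverse pooled covariance times mean difference
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    # Cut point: discriminant score of the midpoint between group means
    mid = [(ma[0] + mb[0]) / 2, (ma[1] + mb[1]) / 2]
    threshold = w[0] * mid[0] + w[1] * mid[1]
    return w, threshold

def classify(x, w, threshold):
    score = w[0] * x[0] + w[1] * x[1]
    return "biodegradable" if score > threshold else "persistent"

# Hypothetical (made-up) descriptor values for eight compounds
biodegradable = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.5], [1.2, 2.2]]
persistent = [[4.0, 5.0], [4.5, 4.8], [5.0, 5.5], [4.2, 5.2]]

w, threshold = fisher_discriminant(biodegradable, persistent)
hits = sum(classify(x, w, threshold) == "biodegradable" for x in biodegradable)
hits += sum(classify(x, w, threshold) == "persistent" for x in persistent)
percent_correct = 100.0 * hits / (len(biodegradable) + len(persistent))
print(percent_correct)
```

The percentage of cases correctly classified is the same kind of summary statistic as the 50 % figure reported above, although here it is computed on well-separated artificial data.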
From the perspective of chemical structure, it is likely that there are
many different factors that contribute to the persistence or degradability
of a chemical and, hence, the chemicals need to be assessed in smaller
groups. Because there was no a priori rationale for defining these groups,
an objective multivariate technique, K-means clustering (Dixon 1981), was
used. Prior to the use of clustering, a principal components analysis (PCA)
was calculated on 45 of the molecular connectivity indices. PCA is a
technique used to reduce the number of variables to be considered in a
problem and here it was used to reduce the molecular connectivity indices to
fewer than 10 variables that still explained > 90 % of the variation in the
original variables. PCA was a necessary step here because the K-means
clustering software of the Biomedical Computer Program (BMDP, Dixon 1981) and
the PDP-11/70 computer system used at the time were limited to a maximum of
nine variables for eight clusters that could be defined for 287 cases.
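The variable-reduction role of PCA can be sketched as follows. This is a minimal illustration that assumes a PCA of the correlation matrix, extracted by power iteration with deflation so that only the Python standard library is needed; the data are hypothetical values of four indices for six compounds, and the 90 % target mirrors the threshold quoted above. It does not reproduce the BMDP routine.

```python
import math

def standardize(data):
    """Scale every column to zero mean and unit variance."""
    n, p = len(data), len(data[0])
    z = [[0.0] * p for _ in range(n)]
    for j in range(p):
        col = [row[j] for row in data]
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        for i in range(n):
            z[i][j] = (data[i][j] - mean) / sd
    return z

def correlation_matrix(z):
    """Correlation matrix of standardized data."""
    n, p = len(z), len(z[0])
    return [[sum(z[k][i] * z[k][j] for k in range(n)) / (n - 1)
             for j in range(p)] for i in range(p)]

def leading_eigenpair(matrix, iterations=500):
    """Power iteration for the largest eigenvalue and its eigenvector."""
    p = len(matrix)
    vec = [1.0 / math.sqrt(p)] * p
    value = 0.0
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)]
        value = math.sqrt(sum(v * v for v in nxt))
        if value < 1e-12:
            break
        vec = [v / value for v in nxt]
    return value, vec

def components_needed(data, target=0.90):
    """Count the principal components needed to exceed `target` variance."""
    corr = correlation_matrix(standardize(data))
    p = len(corr)
    explained = []
    while sum(explained) <= target and len(explained) < p:
        value, vec = leading_eigenpair(corr)
        explained.append(value / p)  # trace of a correlation matrix is p
        # Deflation: subtract the component just extracted
        corr = [[corr[i][j] - value * vec[i] * vec[j] for j in range(p)]
                for i in range(p)]
    return len(explained), explained

# Hypothetical values of four structural indices for six compounds
indices = [
    [1.0, 2.1, 0.9, 5.0],
    [2.0, 3.9, 2.1, 3.0],
    [3.0, 6.2, 2.9, 6.0],
    [4.0, 7.8, 4.2, 2.0],
    [5.0, 10.1, 4.8, 4.0],
    [6.0, 12.0, 6.1, 1.0],
]
n_components, explained = components_needed(indices)
print(n_components, [round(e, 3) for e in explained])
```

In this artificial case three of the four indices are nearly collinear, so two components are enough to pass the 90 % target; the same logic underlies the reduction of the connectivity indices to fewer than 10 components.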
Two additional problems were encountered in this analysis. First, there
was no a priori rationale for defining how many clusters should be identified
to improve the predictions. Secondly, compounds that were outliers in the
principal components space were often identified as single compound
clusters. To avoid the latter problem, any compounds that were > 2 standard
deviations from the mean for any of the principal components used were
identified as belonging to an "outer" space and analyzed separately from
those compounds within 2 s.d.'s for all principal components. The former
problem was solved by iterating the number of clusters to be formed over a
range of clusters and identifying the number of clusters that produced the
best discrimination of biodegradable from persistent chemicals. Hence, the
statistical process consisted of the following:
(1) PCA of 45 molecular connectivity indices that described the structure
of the compounds,
(2) identification of an "outer" and "inner" space, and
(3) iterative clustering of the outer and inner space followed by DFA of
biodegradable and persistent groups within each iterative cluster.
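Steps (2) and (3) can be sketched as follows, under several stated assumptions: the principal component scores and class labels are made up, the k-means routine is a plain textbook version seeded with the first k cases rather than the BMDP program, and a within-cluster majority vote is used as a crude stand-in for the DFA that was run within each cluster.

```python
import statistics

def split_inner_outer(scores, limit=2.0):
    """Flag cases more than `limit` s.d. from the mean on any component."""
    p = len(scores[0])
    means = [statistics.mean([r[j] for r in scores]) for j in range(p)]
    sds = [statistics.stdev([r[j] for r in scores]) for j in range(p)]
    inner, outer = [], []
    for idx, row in enumerate(scores):
        far = any(abs(row[j] - means[j]) > limit * sds[j] for j in range(p))
        (outer if far else inner).append(idx)
    return inner, outer

def kmeans(points, k, iterations=50):
    """Plain k-means; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        for i, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            assignment[i] = dists.index(min(dists))
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assignment[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

def majority_vote_accuracy(assignment, labels, k):
    """Fraction of cases matching the most common label in their cluster."""
    correct = 0
    for c in range(k):
        in_cluster = [labels[i] for i in range(len(labels)) if assignment[i] == c]
        if in_cluster:
            correct += max(in_cluster.count(v) for v in set(in_cluster))
    return correct / len(labels)

# Hypothetical scores on two principal components, with made-up labels
scores = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2],
          [3.0, 3.1], [3.2, 3.0], [3.1, 3.3], [3.3, 3.2],
          [20.0, 1.5]]
labels = ["biodegradable"] * 4 + ["persistent"] * 5

inner, outer = split_inner_outer(scores)
inner_points = [scores[i] for i in inner]
inner_labels = [labels[i] for i in inner]

# Iterate the number of clusters and keep the smallest k that scores best
accuracy = {}
for k in range(1, 4):
    assignment = kmeans(inner_points, k)
    accuracy[k] = majority_vote_accuracy(assignment, inner_labels, k)
best_k = max(accuracy, key=accuracy.get)
print(outer, accuracy, best_k)
```

A majority vote can only improve as clusters are added, so in practice the range of k would need an upper limit, as it had in the original analysis because of the software limit of eight clusters.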
The results of this iterative analysis process improved the correct
prediction of biodegradability to an overall 88 % (85 % for biodegradable
compounds and 94 % for persistent compounds). To identify the types of
structural features associated with biodegradability or persistence, the
discrimination within each of the clusters was examined and summarized into
a set of heuristic rules. When possible, each of the heuristic rules was
related to previously published knowledge on structural relationships
associated with biodegradability. After some obvious DFA misclassifications
were accounted for in the heuristic rules, the set of
heuristic rules correctly classified 93 % of compounds into the appropriate
biodegradability group (91 % for degradable chemicals and 96 % for persistent
chemicals).
In summary, the iterative multivariate statistical procedures described
above allowed for an eventual simplification of structural features
associated with the complex process of biodegradability of chemicals into a
set of heuristic rules. These rules can be viewed as tentative hypotheses
to be tested in future experimentation and modified as a result of those
subsequent experiments. Admittedly, the statistical procedures are complex,
especially to those unfamiliar with these techniques, but the eventual
results led to a simplification in understanding potential structural
features associated with biodegradability.
QSAR in mode of toxic action research
Over the past five years scientists at U.S. EPA's Environmental Research
Laboratory in Duluth have studied eight xenobiotic chemicals from the
perspective of four different biological disciplines; two chemicals for each
of four different known modes of toxic action. The major objective of this
research was to identify effective, but cost-efficient sets of toxic
responses in fish that would correctly identify specific modes of action.
These response sets were termed fish acute toxicity syndromes or FATS (McKim
et al. 1987a). The basic premise of this research was that
if an appropriate FATS could be identified for a chemical, then a reasonable
prediction of mode of action could be made for that chemical. A QSAR
equation for that mode of action could then be used to predict the
toxicity of the chemical (McKim et al. 1987a).
The four biological disciplines and number of parameters included in the
analysis were the following:
(1) 17 physiological variables measured on four individual rainbow trout
(Salmo gairdneri) exposed to each of the test chemicals (primarily
respiratory-cardiovascular) (McKim et al. 1987b, c);
(2) 14 behavioral variables measured on fathead minnows (Pimephales
promelas) exposed to each of the test chemicals in standard 96-h
LC50 assays (Drummond et al. 1986);
(3) 25 hematological variables measured on individual trout exposed to
each of the test chemicals (Snarski and Stokes, pers. comm.); and
(4) 14 biochemistry variables measured on individual trout exposed to
each of the test chemicals (Christensen, pers. comm.).
These data represent a substantial multivariate problem and one in which
substantial violations of statistical assumptions are possible as well as a
situation in which spurious results are expected. For example, the common
denominator that links the observations for each variable for each discipline
is the set of eight chemicals. Thus, the multivariate situation is that there are
70 variables for eight cases; a reversal from the ideal situation in which
one would like 70 cases for each of eight variables. However, from a
biological perspective, it is seldom that information of this detail is
available across disciplines and we argue that despite the statistical
problems these data are worthy of exploratory analysis. These data are
especially worthy from the perspective of using multivariate statistical
analysis to simplify future analyses of FATS predictions.
The initial major question of interest here is what variables can best
discriminate between the four FATS groups (each reflective of a mode of
toxic action) which are represented by the eight chemicals. The four modes
of toxic action and the associated chemicals studied were:
(1) nonpolar narcosis (tricaine methanesulfonate and 1-octanol);
(2) acetylcholinesterase inhibitors (malathion and carbaryl);
(3) uncouplers of oxidative phosphorylation (pentachlorophenol
and 2,4-dinitrophenol); and
(4) mucous membrane irritants (acrolein and benzaldehyde).
Selection of the most useful variables for discriminating between FATS
was based on the following steps: (1) identification of those variables that
provided the best discrimination (lowest alpha values) of all four FATS, (2)
identification of those variables that best discriminated between two FATS
groups, and (3) elimination of one variable from a pair of variables that
were highly correlated within a biological discipline (here defined as r >
0.85). In steps one and two above, univariate F values and associated alpha
values from an analysis of variance were used to determine the best
discriminating variables. In step three Kendall's rank correlation was
used because of the relatively low sample size. The final step in the
analysis was to use some of the variables identified in the first three
steps above in a DFA to identify the smallest number of variables that could
discriminate the four FATS.
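The univariate screening in steps one and two rests on the F ratio from a one-way analysis of variance. The sketch below is a minimal illustration: the response values, the number of replicates, and the variable names are hypothetical, and variables are ranked by F rather than by alpha because, for fixed degrees of freedom, a larger F corresponds to a lower alpha value.

```python
def one_way_anova_f(groups):
    """Return the F statistic for a one-way analysis of variance."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    # Between-group and within-group sums of squares
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical responses: one list per FATS group, four replicates each
responses = {
    "variable_A": [[1.0, 1.2, 0.9, 1.1], [3.0, 3.1, 2.9, 3.2],
                   [5.0, 5.2, 4.9, 5.1], [7.0, 7.1, 6.9, 7.2]],
    "variable_B": [[1.0, 5.0, 3.0, 7.0], [2.0, 6.0, 4.0, 6.5],
                   [1.5, 5.5, 3.5, 7.5], [2.5, 4.5, 3.0, 6.0]],
}

f_values = {name: one_way_anova_f(groups) for name, groups in responses.items()}
ranking = sorted(f_values, key=f_values.get, reverse=True)
print(ranking)
```

Applying the same function to only two of the four groups gives the pairwise comparisons of step two.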
A total of 23 variables including six physiological, five behavioral,
five hematological, and seven biochemical variables were highly significant
(p < 0.001) discriminators of the four FATS groups. In addition to these
23 variables, six physiological, three behavioral, three hematological, and
one biochemical variable were significant (p < 0.01) discriminators of the
four FATS groups (Table 1). Therefore, a total of 36 of the 70 variables
(51 %) considered here were significant discriminators of the four groups.
Table 1. Summary of three steps in reducing the number of potential
explanatory variables in discriminating four FATS among four different
biological disciplines (see text for details in reduction process).

Discipline       Original    Step 1   Step 2   Step 3   Total
                 variables
Physiological       17         12        3       -1       14
Behavioral          14          8        4       -1       11
Hematological       25          8        3        0       11
Biochemical         14          8        1       -1        8
Total               70         36       11       -3       44
In considering all six pairwise combinations of the four FATS groups,
three additional physiological, four behavioral, three hematological
variables, and one biochemical variable were significant (p < 0.01)
discriminators of at least two FATS groups (Table 1). Hence, a cumulative
total of 15 of 17 physiological variables (88 %), 12 of 14 behavioral
variables (86 %), 11 of 25 hematological variables (44 %), and 9 of 14
biochemical variables (64 %), or 47 of 70 potential explanatory variables (67
%), were significant (p < 0.01) discriminators of at least two FATS groups.
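The cumulative totals quoted above follow directly from the Step 1 and Step 2 counts in Table 1. The short check below reproduces that arithmetic; the counts are transcribed from the table, while the variable names are arbitrary.

```python
# Counts transcribed from Table 1: (original variables, Step 1, Step 2)
table_1 = {
    "Physiological": (17, 12, 3),
    "Behavioral": (14, 8, 4),
    "Hematological": (25, 8, 3),
    "Biochemical": (14, 8, 1),
}

summary = {}
for discipline, (original, step_1, step_2) in table_1.items():
    cumulative = step_1 + step_2
    summary[discipline] = (cumulative, round(100 * cumulative / original))

total_original = sum(row[0] for row in table_1.values())
total_significant = sum(count for count, _ in summary.values())
total_percent = round(100 * total_significant / total_original)

for discipline, (count, percent) in summary.items():
    print(f"{discipline}: {count} of {table_1[discipline][0]} ({percent} %)")
print(f"Total: {total_significant} of {total_original} ({total_percent} %)")
```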
Pairwise correlations between variables that were good discriminators
within a discipline (Steps 1 and 2) showed that only six variables
(Table 1) had pairwise correlation values of r > 0.85 (r² > 0.72).
Therefore, only three variables could be eliminated at Step 3 and 26 of the
original 70 variables (37 %) could be eliminated using this reduction process
(70-44=26).
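The correlation screen of Step 3 can be sketched with Kendall's rank correlation. This is a minimal illustration assuming no tied values (tau-a) and a hypothetical rule that the first-listed variable of a highly correlated pair is the one retained; the source does not state which member of each pair was dropped, and the data for eight chemicals are made up.

```python
def kendall_tau(x, y):
    """Kendall's rank correlation (tau-a; assumes no tied values)."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

def drop_correlated(variables, cutoff=0.85):
    """Keep a variable only if it is not highly correlated with one already kept."""
    kept = []
    for name in variables:
        if all(abs(kendall_tau(variables[name], variables[k])) <= cutoff
               for k in kept):
            kept.append(name)
    return kept

# Hypothetical responses of eight chemicals on three variables
variables = {
    "var_1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "var_2": [1.1, 2.3, 2.9, 4.2, 5.1, 6.4, 6.9, 8.3],  # same ranking as var_1
    "var_3": [4.0, 1.0, 7.0, 2.0, 8.0, 3.0, 6.0, 5.0],
}

kept = drop_correlated(variables)
print(kept)
```

With only eight cases there are just 28 pairs of chemicals to compare, which is one reason a rank-based measure is a reasonable choice at this sample size.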
The final step in the analysis was to conduct a stepwise DFA to identify
the best variables that could discriminate the eight chemicals among the four
FATS groups. In this process, instead of including all 44 good
discriminating variables of the four FATS groups, we selected the two best
discriminating variables from each of the four biological disciplines. This
process could still produce spurious relationships in the results because
the number of variables is equal to the number of cases. However, this
analysis is a better alternative as compared with including all 44 good
discriminating variables and it is only being calculated to explore the best
combination of variables among disciplines for potentially discriminating all
four FATS groups.
The first variable selected was oxygen consumption, a physiological
variable (McKim et al. 1987b), which discriminated the narcosis and
uncoupler FATS groups from the inhibitor and irritant FATS groups. After
this step, six of the eight chemicals were correctly classified. The second
variable selected was a behavioral variable, scoliosis (a morphological
abnormality, Drummond et al. 1986), which correctly discriminated the
remaining two chemicals and all four FATS.
Figure 1. Plot of the first two discriminant functions from a DFA of eight
chemicals in which two fish response variables (oxygen consumption and
scoliosis) were the best discriminators and correctly classified each
chemical into one of four FATS groups.
[Figure 1 plot not reproduced. Axes: DF1 (oxygen consumption, running from
"HIGH" to "LOW") and DF2 (scoliosis, labeled "LOW" and "HIGH"). Groups
shown: uncouplers, anesthetics, irritants, and AChE inhibitors.]
In summary, we established criteria for potentially reducing the number
of variables to be considered for correctly classifying chemicals into a
respective FATS group based on biological responses of fish exposed to
those chemicals. By selecting the best discriminating variables and
eliminating variables that were highly intercorrelated, 26 of 70 potential
variables (37 %) were eliminated. Our limited ability to reduce the dimensionality
further is partly due to the good selection of discriminating variables by
the scientists involved among the respective disciplines and partly due to
our lack of knowledge regarding FATS. For example, one would not want to
eliminate a variable that might prove to be a good discriminator of a FATS
not yet tested with the model. In contrast, two variables were able to
correctly discriminate eight chemicals into four FATS groups. This likely
indicates that the problem of discriminating FATS can be solved with a
relatively small set of response variables and that the response of fish to
chemical intoxication is manifested by a number of variables; each of which
is measurable at a variety of levels (physiologically, behaviorally,
hematologically, and biochemically). Discovery of the best combination of
variables to use for screening a large number of chemicals will best be
accomplished by an examination of the cost-effectiveness and the precision
and accuracy of measuring the respective variables.
Acknowledgments
We thank Mr. Robert Drummond, Ms. Virginia Snarski, Ms. Nan Stokes, and Mr.
Glenn Christensen for access to their data on the biological responses of
fish to the eight chemicals considered here. We also thank Drs. Steven
Bradbury, Subhash Basak, and Gilman Veith for their comments on this
manuscript. This paper has not been peer-reviewed by the U.S. Environmental
Protection Agency and therefore the views expressed herein do not necessarily
reflect the views of the Agency. This research was partly supported by
Cooperative Agreement No. CR81198! to Dr. Ronald Regal of the University of
Minnesota.
Literature Cited
Alexander, M. 1981. Biodegradation of chemicals of environmental concern.
Science 211: 132-139.
Basak, S.C., V.R. Magnuson, G.J. Niemi, R.R. Regal, and G.D. Veith.
1987. Topological indices: their nature, mutual relatedness, and
applications. Pages 300-305 in X.J.R. Avula, G. Leitmann, C.D. Mote, and
E. Y. Rodin, eds. Proceedings, Fifth International Conference on
Mathematical Modelling, Berkeley, CA. Pergamon Press, New York, NY, USA.
Dixon, W.J., ed. 1981. BMDP Statistical Software, 1981. University of
California Press, Berkeley, CA, USA.
Drummond, R.A., C.L. Russom, D.L. Geiger, and D.L. DeFoe. 1986. Behavioral
and morphological changes in fathead minnows, Pimephales promelas, as
diagnostic endpoints for screening chemicals according to mode of
action. Pages 415-435 in Aquatic Toxicology. Ninth Aquatic Toxicology
Symposium, American Society for Testing and Materials, Philadelphia, PA,
USA.
Enslein, K., M.E. Tomb, and T.R. Lander. 1984. Structure-activity models of
biological oxygen demand. Pages 89-109 in K.L.E. Kaiser, ed., QSAR in
Environmental Toxicology. D. Reidel, New York, NY, USA.
Geating, J. 1981. Literature study of the biodegradability of chemicals in
water. Vols. 1 and 2. EPA/600/2-81-175/176. U.S. Environmental
Protection Agency, Office of Research and Development, Cincinnati, OH.
Leo, A. and D. Weininger. 1984. CLOGP version 3.2 user reference manual.
Medicinal Chemistry Project, Pomona College, Claremont, CA, USA.
McKim, J.M., S.P. Bradbury, and G.J. Niemi. 1987a. Fish acute toxicity
syndromes and their use in the QSAR approach to hazard assessment.
Environmental Health Perspectives 71: 171-186.
McKim, J.M., P.K. Schmieder, R.W. Carlson, E.P. Hunt, and G.J. Niemi.
1987b. Use of respiratory-cardiovascular responses of rainbow trout (Salmo
gairdneri) in identifying acute toxicity syndromes in fish: part 1.
pentachlorophenol, 2,4-dinitrophenol, tricaine methanesulfonate, and
1-octanol. Environmental Toxicology and Chemistry 6: 295-312.
McKim, J.M., P.K. Schmieder, G.J. Niemi, R.W. Carlson, and T.R. Henry.
1987c. Use of respiratory-cardiovascular responses of rainbow trout (Salmo
gairdneri) in identifying acute toxicity syndromes in fish: part 2.
malathion, carbaryl, acrolein, and benzaldehyde. Environmental Toxicology
and Chemistry 6: 313-328.
Nie, N.H., C.H. Hull, J.G. Jenkins, K. Steinbrenner, and D.H. Bent. 1975.
SPSS, statistical package for the social sciences. McGraw-Hill Book
Company, New York, NY, USA.
Niemi, G.J., R.R. Regal, and G.D. Veith. 1985. Applications of molecular
connectivity indexes and multivariate analysis in environmental chemistry.
Pages 148-159 in J.J. Breen and P.E. Robinson, eds., Environmental
applications of chemometrics. ACS symposium series No. 292.
American Chemical Society, Washington, D.C., USA.
Niemi, G.J., G.D. Veith, R.R. Regal, and D.D. Vaishnav. 1987. Structural
features associated with degradable and persistent chemicals. Environmental
Toxicology and Chemistry 6: 515-527.
Veith, G.D., D.J. Call, and L.T. Brooke. Structure-toxicity relationships
for the fathead minnow: narcotic industrial chemicals. Canadian Journal
of Fisheries and Aquatic Sciences 40: 743-748.