CONF-88052Q-- (DE88013180)

QSAR 88
Proceedings of the THIRD INTERNATIONAL WORKSHOP ON QUANTITATIVE STRUCTURE-ACTIVITY RELATIONSHIPS IN ENVIRONMENTAL TOXICOLOGY
May 22-26, 1988
Knoxville, Tennessee

Edited by: James E. Turner, M. Wendy Williams, T. Wayne Schultz, Norma J. Kwaak

SIMPLIFYING COMPLEX QSAR'S IN TOXICITY STUDIES WITH MULTIVARIATE STATISTICS

Gerald J. Niemi and James M. McKim
Environmental Research Laboratory, Duluth
U.S. Environmental Protection Agency
6201 Congdon Boulevard
Duluth, MN 55804 USA

ABSTRACT

During the past several decades many quantitative structure-activity relationships (QSAR's) have been derived from relatively small data sets of chemicals in a homologous series and selected empirical observations. An alternative approach is to analyze large data sets consisting of heterogeneous groups of chemicals and to explore QSAR's among these chemicals for generalized patterns of chemical behavior. Exploratory analyses using multivariate statistical procedures in an iterative fashion have traditionally been a neglected tool in the effort to find relationships that can lead to testable hypotheses. Statistical analysis, then, need not serve only as a device to test a hypothesis. Moreover, multivariate statistical analyses (e.g., principal components analysis (PCA) and factor analysis) can simplify the complex relationships among variables. One of the major reasons that multivariate statistical routines are not considered for "simplifying" complex relationships is that practicing QSAR scientists neither understand these techniques well nor use them routinely. The use of exploratory multivariate statistical techniques for simplifying complex QSAR problems is demonstrated here with research data on biodegradation and mode of toxic action.
In these examples, a large number of explanatory variables were examined to explore which variables might best explain whether a chemical biodegrades or whether a toxic response by an organism can be used to identify a mode of toxic action. In both cases, the procedures reduced the number of potential explanatory variables and generated hypotheses about biodegradation and mode of toxic action for future research without explicitly testing an existing hypothesis.

1 Present address: Center for Water and the Environment, Natural Resources Research Institute, University of Minnesota, Duluth, MN 55811 USA.

INTRODUCTION

The vast majority of QSAR's developed over the past several decades were derived from relatively small data sets of homologous series of chemicals. Furthermore, the "structural" variables used to make predictions about the "activity" variables in a "quantitative relationship" were primarily based on "secondary structural" variables such as the n-octanol water partition coefficient (log P). There is nothing inherently wrong with the development of these relationships except that certain limitations exist in their application. These include: 1) secondary structural variables such as log P (independent variables) are measured or calculated with error, and these errors are propagated into predictions of the activity variables (dependent variables); 2) secondary structural variables are often impossible to calculate for some compounds, which limits their application; and 3) definitions of what constitutes a "homologous" series are often vague and, hence, the boundaries of a specific QSAR are also vague. There clearly is no simple solution to these problems. However, the first two limitations can be overcome by considering primary structure-activity variables, that is, variables calculated directly from the structure of the chemical (e.g., chemical fragments or molecular connectivity indices).
The third limitation can be overcome by building QSAR's in a more global context in which no subjective boundaries are placed on the realm of chemicals to which the QSAR will apply. Here our objectives are the following: 1) explore reasons why a more global perspective has not been pursued, and 2) present two examples of how a complex QSAR problem can be simplified through the use of multivariate statistics.

Limitations to a global perspective

The sheer magnitude of a QSAR problem from a global perspective is intimidating because of the large number of structures (e.g., hundreds of thousands) that can be considered, plus the number of potential structural variables that theoretically can be calculated for a structure. The magnitude of the problem has limited conceptual approaches and has immediately forced scientists into limiting the problem, usually by dealing with discrete homogeneous series of chemicals. When one is working from an industrial perspective in attempts to design a new chemical or drug, this more focused approach might be feasible because a more limited number of solutions is possible given the availability of a lead structure. However, from a regulatory perspective it is not feasible because an initial evaluation must focus on an objective placement of the chemical into the proper group of chemicals and QSAR model, from which predictions can be made. Hence, the perspective from which scientists using QSAR techniques must examine a problem will partly determine how they bound a chemical within the global universe of potential chemical structures.

Two additional reasons for an inhibited global perspective in understanding QSAR are the lack of training of scientists in statistical analysis, especially multivariate statistics, and the relatively rapid and expanding development of computer capabilities. Regarding the former, scientists using QSAR techniques are generally chemists, biologists, or biochemists.
Most of these scientists have formal training in mathematics including calculus because most undergraduate and graduate curricula require some mathematical training. Some have formal training in one or two elementary statistics courses in which at most two variables (e.g., correlation or regression) are considered in statistical tests. Few have any training in multivariate statistical techniques, training which is necessary to consider a multivariate, global perspective to QSAR. In regard to the latter limitation, phenomenal progress has been made during the past 30 years in the development, design, and use of computer hardware and software. Yet, despite this progress, the capacity of many computers, the cost of using them, and the availability of software capable of handling hundreds of variables for thousands of chemical cases are still limiting. For example, we can calculate literally hundreds of potential primary structure-activity variables based on various mathematical routines (Basak et al. 1987), yet one of the most commonly available statistical packages, the Statistical Package for the Social Sciences (SPSS, Nie et al. 1975), is limited to analysis of < 100 independent variables (Niemi et al. 1985). Therefore, we have powerful computer capabilities today, but they may not yet be as powerful as we desire, nor do enough scientists have the technical training to fully utilize statistical routines or existing computers. Progress in developing QSAR, especially global-multivariate relationships, will be inhibited until a larger critical mass of QSAR scientists is educated, hardware and software capabilities of computers are improved, and costs to obtain and use these computer capabilities are reasonable.

QSAR in biodegradation research

Development of QSAR models to predict whether chemicals microbially degrade in aquatic environments, or to determine the rate at which a chemical degrades, can be difficult because of the many interacting factors that contribute to biodegradability (e.g., see Alexander 1981). Niemi et al. (1987) attempted an objective multivariate statistical approach to this problem by using a data base of 287 compounds with available 5-day BOD values (BOD5). BOD5 was used as an approximate measure of the inherent ability of a chemical to microbially degrade in a modern sewage treatment facility. For each of these compounds 54 molecular connectivity indices and five physicochemical parameters including log P (Leo and Weininger 1984) were calculated and used as potential explanatory variables for assessing whether the BOD5 value was relatively high or low (i.e., biodegradable or persistent, respectively).

The first manipulation of these data was to separate the compounds into biodegradable and persistent groups based on a natural division in the BOD5 values. Discriminant function analysis (DFA), with the molecular connectivity indices and the five physicochemical factors as explanatory variables, was used in an attempt to separate these two groups. In general, DFA is a multivariate statistical technique that identifies whether differences in a set of explanatory variables exist between two or more groups. Although two previous papers reported some success with this technique (Geating 1981, Enslein et al. 1984), only 50 % of the 287 compounds could be correctly discriminated in this exercise. From the perspective of chemical structure, it is likely that many different factors contribute to the persistence or degradability of a chemical and, hence, the chemicals need to be assessed in smaller groups.
Because there was no a priori rationale for defining these groups, an objective multivariate technique, K-means clustering (Dixon 1981), was used. Prior to the use of clustering, a principal components analysis (PCA) was calculated on 45 of the molecular connectivity indices. PCA is a technique used to reduce the number of variables to be considered in a problem, and here it was used to reduce the molecular connectivity indices to fewer than 10 variables that still explained > 90 % of the variation in the original variables. PCA was a necessary step here because the K-means clustering software of the Biomedical Computer Program (BMDP, Dixon 1981) and the PDP-11/70 computer system used at the time were limited to a maximum of nine variables for eight clusters that could be defined for 287 cases.

Two additional problems were encountered in this analysis. First, there was no a priori rationale for defining how many clusters should be identified to improve the predictions. Second, compounds that were outliers in the principal components space were often identified as single-compound clusters. To avoid the latter problem, any compounds that were > 2 standard deviations from the mean for any of the principal components used were identified as belonging to an "outer" space and analyzed separately from those compounds within 2 s.d.'s for all principal components. The former problem was solved by iterating over a range of cluster numbers and identifying the number of clusters that produced the best discrimination of biodegradable from persistent chemicals. Hence, the statistical process consisted of the following: (1) PCA of 45 molecular connectivity indices that described the structure of the compounds, (2) identification of an "outer" and "inner" space, and (3) iterative clustering of the outer and inner space followed by DFA of biodegradable and persistent groups within each iterative cluster.
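
Step (2) of this process, the division into an "inner" and an "outer" space, can be sketched in a few lines of Python. This is a minimal illustration and not the original BMDP procedure: it assumes the principal-component scores have already been computed, the function name `split_inner_outer` and the demonstration scores are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev


def split_inner_outer(scores, k=2.0):
    """Return (inner, outer) lists of case indices.

    scores: one row of principal-component scores per compound.
    A compound is "outer" if it lies more than k standard deviations
    from the mean on ANY component; otherwise it is "inner".
    """
    n_components = len(scores[0])
    columns = [[row[j] for row in scores] for j in range(n_components)]
    centers = [mean(col) for col in columns]
    # Sample s.d. (n - 1 denominator): an assumption, not stated in the text.
    spreads = [stdev(col) for col in columns]

    inner, outer = [], []
    for i, row in enumerate(scores):
        is_outlier = any(
            abs(value - centers[j]) > k * spreads[j]
            for j, value in enumerate(row)
        )
        (outer if is_outlier else inner).append(i)
    return inner, outer


# Hypothetical scores for ten compounds on two components (illustrative only).
demo_scores = [
    [0.1, -0.2], [0.3, 0.1], [-0.2, 0.0], [0.0, 0.2], [-0.1, -0.1],
    [0.2, 0.3], [-0.3, -0.3], [0.1, 0.1], [0.0, -0.1], [6.0, 0.0],
]
inner_idx, outer_idx = split_inner_outer(demo_scores)
print("inner:", inner_idx)
print("outer:", outer_idx)
```

In this sketch the two index lists would then be clustered separately, with the number of clusters varied over a range as described in step (3).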

This iterative analysis process improved the correct prediction of biodegradability to an overall 88 % (85 % for biodegradable compounds and 94 % for persistent compounds). To identify the types of structural features associated with biodegradability or persistence, the discrimination within each of the clusters was examined and summarized into a set of heuristic rules. When possible, each of the heuristic rules was related to previously published knowledge on structural relationships associated with biodegradability. After some obvious misclassifications based on DFA were translated into the set of heuristic rules, the rules correctly classified 93 % of compounds into the appropriate biodegradability group (91 % for degradable chemicals and 96 % for persistent chemicals).

In summary, the iterative multivariate statistical procedures described above allowed for an eventual simplification of structural features associated with the complex process of biodegradability of chemicals into a set of heuristic rules. These rules can be viewed as tentative hypotheses to be tested in future experimentation and modified as a result of those subsequent experiments. Admittedly, the statistical procedures are complex, especially to those unfamiliar with these techniques, but the eventual results led to a simplification in understanding potential structural features associated with biodegradability.

QSAR in mode of toxic action research

Over the past five years scientists at U.S. EPA's Environmental Research Laboratory in Duluth have studied eight xenobiotic chemicals from the perspective of four different biological disciplines: two chemicals for each of four different known modes of toxic action. The major objective of this research was to identify effective, but cost-efficient, sets of toxic responses in fish that would correctly identify specific modes of action.
These response sets were termed fish acute toxicity syndromes or FATS (McKim et al. 1987a). The basic premise of this research was that if an appropriate FATS could be identified for a chemical, then a reasonable prediction of mode of action could be made for that chemical. A QSAR equation for that mode of action could then be used to make a prediction about the chemical's toxicity (McKim et al. 1987a). The four biological disciplines and number of parameters included in the analysis were the following: (1) 17 physiological variables measured on four individual rainbow trout (Salmo gairdneri) exposed to each of the test chemicals (primarily respiratory-cardiovascular) (McKim et al. 1987b, c); (2) 14 behavioral variables measured on fathead minnows (Pimephales promelas) exposed to each of the test chemicals in standard 96-h LC50 assays (Drummond et al. 1986); (3) 25 hematological variables measured on individual trout exposed to each of the test chemicals (Snarski and Stokes, pers. comm.); and (4) 14 biochemistry variables measured on individual trout exposed to each of the test chemicals (Christensen, pers. comm.).

These data represent a substantial multivariate problem, one in which substantial violations of statistical assumptions are possible and in which spurious results are to be expected. For example, the common denominator that links the observations for each variable in each discipline is the set of eight chemicals. Thus, the multivariate situation is that there are 70 variables for eight cases, a reversal of the ideal situation in which one would like 70 cases for each of eight variables. However, from a biological perspective, information of this detail is seldom available across disciplines, and we argue that despite the statistical problems these data are worthy of exploratory analysis.
These data are especially worthy from the perspective of using multivariate statistical analysis to simplify future analyses of FATS predictions.

The initial major question of interest here is which variables can best discriminate between the four FATS groups (each reflective of a mode of toxic action) that are represented by the eight chemicals. The four modes of toxic action and the associated chemicals studied were: (1) nonpolar narcosis (tricaine methanesulfonate and 1-octanol); (2) acetylcholinesterase inhibitors (malathion and carbaryl); (3) uncouplers of oxidative phosphorylation (pentachlorophenol and 2,4-dinitrophenol); and (4) mucous membrane irritants (acrolein and benzaldehyde).

Selection of the most useful variables for discriminating between FATS was based on the following steps: (1) identification of those variables that provided the best discrimination (lowest alpha values) of all four FATS, (2) identification of those variables that best discriminated between two FATS groups, and (3) elimination of one variable from each pair of variables that were highly correlated within a biological discipline (here defined as r > 0.85). In steps one and two above, univariate F values and associated alpha values from an analysis of variance were used to determine the best discriminating variables. In step three Kendall's rank correlation was used because of the relatively low sample size. The final step in the analysis was to use some of the variables identified in the first three steps in a DFA to identify the smallest number of variables that could discriminate the four FATS.

A total of 23 variables, including six physiological, five behavioral, five hematological, and seven biochemical variables, were highly significant (p < 0.001) discriminators of the four FATS groups.
In addition to these 23 variables, six physiological, three behavioral, three hematological, and one biochemical variable were significant (p < 0.01) discriminators of the four FATS groups (Table 1). Therefore, a total of 36 of the 70 variables (51 %) considered here were significant discriminators of the four groups.

Table 1. Summary of three steps in reducing the number of potential explanatory variables in discriminating four FATS among four different biological disciplines (see text for details of the reduction process).

Discipline       Original    Step 1   Step 2   Step 3   Total
                 variables
Physiological        17        12        3       -1       14
Behavioral           14         8        4       -1       11
Hematological        25         8        3        0       11
Biochemical          14         8        1       -1        8
Total                70        36       11       -3       44

In considering all six pairwise combinations of the four FATS groups, three additional physiological, four behavioral, three hematological, and one biochemical variable were significant (p < 0.01) discriminators of at least two FATS groups (Table 1). Hence, a cumulative total of 15 of 17 physiological variables (88 %), 12 of 14 behavioral variables (86 %), 11 of 25 hematological variables (44 %), and 9 of 14 biochemical variables (64 %), or 47 of 70 potential explanatory variables (67 %), were significant (p < 0.01) discriminators of at least two FATS groups.

Pairwise correlations between variables that were good discriminators within a discipline (Steps 1 and 2) showed that only six variables, taken pairwise (Table 1), had correlation values of r > 0.85 (r^2 > 0.72). Therefore, only three variables could be eliminated at Step 3, and 26 of the original 70 variables (37 %) could be eliminated using this reduction process (70 - 44 = 26).

The final step in the analysis was to conduct a stepwise DFA to identify the best variables that could discriminate the eight chemicals among the four FATS groups.
In this process, instead of including all 44 good discriminating variables of the four FATS groups, we selected the two best discriminating variables from each of the four biological disciplines. This process could still produce spurious relationships in the results because the number of variables is equal to the number of cases. However, this analysis is a better alternative than including all 44 good discriminating variables, and it is calculated only to explore the best combination of variables among disciplines for potentially discriminating all four FATS groups. The first variable selected was oxygen consumption, a physiological variable (McKim et al. 1987b), which discriminated the narcosis and uncoupler FATS groups from the inhibitor and irritant FATS groups. After this step, six of the eight chemicals were correctly classified. The second variable selected was a behavioral variable, scoliosis (a morphological abnormality, Drummond et al. 1986), which correctly discriminated the remaining two chemicals and all four FATS.

Figure 1. Plot of the first two discriminant functions from a DFA of eight chemicals in which two measures of fish response (oxygen consumption and scoliosis) were the best discriminators and correctly classified each chemical into one of four FATS groups.

[Figure not reproduced. Axes: DF1, oxygen consumption ("high" to "low"); DF2, scoliosis ("low" to "high"). Groups plotted: uncouplers, anesthetics, irritants, and AChE inhibitors.]

In summary, we established criteria for reducing the number of variables to be considered for correctly classifying chemicals into a respective FATS group based on biological responses of fish exposed to those chemicals. By retaining the best discriminating variables and removing one of each pair of highly intercorrelated variables, 26 of 70 potential variables (37 %) were eliminated.
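
The correlation screen of Step 3 and the bookkeeping behind Table 1 can be illustrated with a short Python sketch. It is offered under several assumptions: the text names Kendall's rank correlation but not the variant, so tau-a (no tie correction) is used here; the function names, the variable names `var_a` to `var_c`, and their values are hypothetical; and the rule of keeping the earlier-listed variable of a correlated pair is a simplification. Only the counts in `table1` come from the paper.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def prune_correlated(variables, threshold=0.85):
    """Keep a variable only if |tau| with every kept variable <= threshold.

    variables: dict of name -> values, listed best discriminator first,
    so the weaker member of a highly correlated pair is the one dropped.
    """
    kept = []
    for name, values in variables.items():
        if all(abs(kendall_tau(values, variables[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical responses for eight chemicals (illustrative values only).
demo = {
    "var_a": [1, 2, 3, 4, 5, 6, 7, 8],
    "var_b": [2, 1, 3, 4, 5, 6, 7, 8],  # nearly the same ranking as var_a
    "var_c": [8, 1, 6, 3, 5, 2, 7, 4],  # essentially unrelated to var_a
}
kept = prune_correlated(demo)
print("kept:", kept)

# Counts from Table 1: (original, added at Step 1, added at Step 2, removed at Step 3).
table1 = {
    "physiological": (17, 12, 3, 1),
    "behavioral": (14, 8, 4, 1),
    "hematological": (25, 8, 3, 0),
    "biochemical": (14, 8, 1, 1),
}
retained = {d: s1 + s2 - s3 for d, (_, s1, s2, s3) in table1.items()}
n_original = sum(row[0] for row in table1.values())
n_retained = sum(retained.values())
n_eliminated = n_original - n_retained
print(retained, n_retained, n_eliminated)
```

The tally reproduces the totals reported in the text: 44 variables retained and 26 of 70 eliminated.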
Our inability to reduce the dimensionality further is partly due to the good selection of discriminating variables by the scientists involved in the respective disciplines and partly due to our lack of knowledge regarding FATS. For example, one would not want to eliminate a variable that might prove to be a good discriminator of a FATS not yet tested with the model. In contrast, two variables were able to correctly discriminate eight chemicals into four FATS groups. This likely indicates that FATS can be discriminated with a relatively small set of response variables and that the response of fish to chemical intoxication is manifested by a number of variables, each of which is measurable at a variety of levels (physiologically, behaviorally, hematologically, and biochemically). Discovery of the best combination of variables to use for screening a large number of chemicals will best be accomplished by an examination of the cost-effectiveness and the precision and accuracy of measuring the respective variables.

Acknowledgments

We thank Mr. Robert Drummond, Ms. Virginia Snarski, Ms. Nan Stokes, and Mr. Glenn Christensen for access to their data on the biological responses of fish to the eight chemicals considered here. We also thank Drs. Steven Bradbury, Subhash Basak, and Gilman Veith for their comments on this manuscript. This paper has not been peer-reviewed by the U.S. Environmental Protection Agency and therefore the views expressed herein do not necessarily reflect the views of the Agency. This research was partly supported by Cooperative Agreement No. CR81198! to Dr. Ronald Regal of the University of Minnesota.

Literature Cited

Alexander, M. 1981. Biodegradation of chemicals of environmental concern. Science 211: 132-139.
Basak, S.C., V.R. Magnuson, G.J. Niemi, R.R. Regal, and G.D. Veith. 1987. Topological indices: their nature, mutual relatedness, and applications. Pages 300-305 in X.J.R. Avula, G. Leitmann, C.D.
Mote, and E.Y. Rodin, eds. Proceedings, Fifth International Conference on Mathematical Modelling, Berkeley, CA. Pergamon Press, New York, NY, USA.
Dixon, W.J., ed. 1981. BMDP Statistical Software, 1981. University of California Press, Berkeley, CA, USA.
Drummond, R.A., C.L. Russom, D.L. Geiger, and D.L. DeFoe. 1986. Behavioral and morphological changes in fathead minnows, Pimephales promelas, as diagnostic endpoints for screening chemicals according to mode of action. Pages 415-435 in Aquatic Toxicology. Ninth Aquatic Toxicology Symposium, American Society for Testing and Materials, Philadelphia, PA, USA.
Enslein, K., M.E. Tomb, and T.R. Lander. 1984. Structure-activity models of biological oxygen demand. Pages 89-109 in K.L.E. Kaiser, ed., QSAR in Environmental Toxicology. D. Reidel, New York, NY, USA.
Geating, J. 1981. Literature study of the biodegradability of chemicals in water. Vols. 1 and 2. EPA/600/2-81-175/176. U.S. Environmental Protection Agency, Office of Research and Development, Cincinnati, OH.
Leo, A. and D. Weininger. 1984. CLOGP version 3.2 user reference manual. Medicinal Chemistry Project, Pomona College, Claremont, CA, USA.
McKim, J.M., S.P. Bradbury, and G.J. Niemi. 1987a. Fish acute toxicity syndromes and their use in the QSAR approach to hazard assessment. Environmental Health Perspectives 71: 171-186.
McKim, J.M., P.K. Schmieder, R.W. Carlson, E.P. Hunt, and G.J. Niemi. 1987b. Use of respiratory-cardiovascular responses of rainbow trout (Salmo gairdneri) in identifying acute toxicity syndromes in fish: part 1. pentachlorophenol, 2,4-dinitrophenol, tricaine methanesulfonate, and 1-octanol. Environmental Toxicology and Chemistry 6: 295-312.
McKim, J.M., P.K. Schmieder, G.J. Niemi, R.W. Carlson, and T.R. Henry. 1987c. Use of respiratory-cardiovascular responses of rainbow trout (Salmo gairdneri) in identifying acute toxicity syndromes in fish: part 2. malathion, carbaryl, acrolein, and benzaldehyde.
Environmental Toxicology and Chemistry 6: 313-328.
Nie, N.H., C.H. Hull, J.G. Jenkins, K. Steinbrenner, and D.H. Bent. 1975. SPSS, statistical package for the social sciences. McGraw-Hill Book Company, New York, NY, USA.
Niemi, G.J., R.R. Regal, and G.D. Veith. 1985. Applications of molecular connectivity indexes and multivariate analysis in environmental chemistry. Pages 148-159 in J.J. Breen and P.E. Robinson, eds., Environmental applications of chemometrics. ACS symposium series No. 292. American Chemical Society, Washington, D.C., USA.
Niemi, G.J., G.D. Veith, R.R. Regal, and D.D. Vaishnav. 1987. Structural features associated with degradable and persistent chemicals. Environmental Toxicology and Chemistry 6: 515-527.
Veith, G.D., D.J. Call, and L.T. Brooke. Structure-toxicity relationships for the fathead minnow: narcotic industrial chemicals. Canadian Journal of Fisheries and Aquatic Sciences 40: 743-748.