Report No.  EPA - 560/1-76-006
                     ANALYSIS AND TRIAL APPLICATION OF CORRELATION

                        METHODOLOGIES FOR PREDICTING TOXICITY OF

                                   ORGANIC CHEMICALS
                                        May 1976
                                      Final  Report
                                      Prepared for

                               Office of Toxic Substances
                            Environmental Protection Agency
                                      401 M Street
                                Washington, D. C.  20460
                                Contract No. 68-01-2657

-------

-------
                      NOTICE
This report has been reviewed by the Office of Toxic
Substances, EPA, and approved for publication.  Ap-
proval does not signify that the contents necessarily
reflect the views and policies of the Environmental
Protection Agency, nor does mention of trade names for
commercial products constitute endorsement or recom-
mendation for use.

-------
                                                                F-C3947
                          EXECUTIVE SUMMARY

     This project was carried out in response to the needs of OTS for an
analysis of the potential role which may be played by applications of
correlation methodologies to a study of property-effect data in meeting
their early warning objectives.
     First, a literature survey was conducted to describe the state-of-
the-art for the major methods of structure-activity correlation and
their potential utility for OTS needs; second, available effects data
were studied to see whether their nature permits application of the
various existing methods for correlation; third, available
physical-chemical properties and chemical structure fragment codes were
studied; and fourth, a prototype toxicity data base was designed,
implemented and studied to test several methods side-by-side.
     The literature survey resulted in an index of the literature on
structure-activity correlation, which has been made available for dis-
tribution by the NTIS.
     The study of available effects data has resulted in a rational
approach to the organization and classification of the types of raw
toxicity data encountered in this field.  Such data do not often exist
in sufficient quantities to permit application of classical statistical
methods for analysis.
     The study of many possible physical-chemical properties identified
only one property which has been shown to provide almost universal
applicability to structure-activity correlation problems, regardless
of chemical structure — namely, the partition coefficient.  A limited
study of chemical structure-fragment codes showed that many could serve
this function.  The U.S. Army CIDS structure codes were chosen for
the prototype data base because of their ready availability to EPA; other
                                  iii

-------

systems, such as the Chemical Abstracts Service substructure fragments,
would have equal utility.
     A prototype toxicity data base was developed using toxicity data
for oral LD50 values in the rat for 687 compounds selected from the
NIOSH 1974 Toxic Substances List.  Partition coefficients, molecular
weights, and several hundred chemical structural fragment keys (obtained
from the U.S. Army CIDS file at Edgewood Arsenal) were studied by two
approaches to see whether meaningful predictions of toxicity could be
made.  The more successful was a modification of a method known as
Substructural Analysis, first reported in 1974.  Using the modified
approach, rat toxicities for 21 out of 23 compounds studied were pre-
dicted with a reasonable degree of accuracy.  Another group of 25 pairs
of compounds, each pair containing one structural feature in common and
representing the highest and lowest toxicity for that feature, was
successfully ranked by this method.
     The same data base was studied by readily available conventional
methods of cluster analysis, discriminant analysis and regression
analysis.  Of these methods, discriminant analysis was found to permit
separation of the small number of highly toxic substances from the bulk
of the compounds, with a few false positives.
     Recommendations are made for continuing the development and analy-
sis of data in the prototype toxicity data bank.  The simultaneous
application of the modified substructural analysis method and discrim-
inant analysis is recommended in extending this project to other types
of toxicity, with emphasis on carcinogenesis and aquatic organism
toxicity data.  As the number of sets of different toxic data in this
data bank increases, the prediction of the degree of several kinds of
toxicity for a novel substance will provide OTS with an early warning
capability of increasing usefulness.
                                  iv

-------
                               CONTENTS

Section                          Title                            Page
         EXECUTIVE SUMMARY	iii
  1      OBJECTIVES OF CONTRACT    	   1-1
         1.1  Background	1-1
  2      WORK PERFORMED	2-1
         2.1  Literature Search    	   2-1
         2.2  Study of Available Toxicity Data	2-1
              2.2.1  Toxicity Data in Registry of Toxic Effects
                     of Chemical Substances (RTECS)    .   .    .   2-2
              2.2.2  Fish Muscle Accumulation Studies  .   .    .   2-3
              2.2.3  Carcinogenesis Data	2-3
              2.2.4  Mutagenesis Data	2-4
              2.2.5  Water Pollution Data	2-4
              2.2.6  Air Pollution Data	2-4
              2.2.7  Teratogenesis Data	2-5
         2.3  Study of Available Physical-Chemical Data   .    .   2-5
              2.3.1  Partition Coefficients   	   2-5
              2.3.2  Hammett Sigma Constants  	   2-6
              2.3.3  Taft Steric Constants	2-7
              2.3.4  Molar Refractivity Values	2-7
              2.3.5  Other Types of Physical Constants.   .    .   2-7
         2.4  Methods for Representing Chemical Structures.    .   2-8
         2.5  Evaluation of Methods for Structure-Activity
              Correlation	2-9
              2.5.1  Multiple Parameter Regression Analysis
                     (Hansch Method)  	   2-9
                                   v

-------
                           CONTENTS (CONT.)
Section                         Title                            Page

              2.5.2  Discriminant Analysis Method.   .   .   .   2-11

              2.5.3  The Free-Wilson Method	2-11

              2.5.4  Substructure Analysis Method (SAM)  .   .   2-12
              2.5.5  Pattern Recognition Methods  ....   2-12
              2.5.6  Overview of Alternative Methods for Data
                     Analysis	   2-13

         2.6  Prototype Toxicity Data Bank	2-13

              2.6.1  Use of CIDS Fragment Keys for Structure
                     Representation  	   2-14
              2.6.2  Inclusion of Partition Coefficient Data
                     (Log P) .    .   .   .	2-14
              2.6.3  Implementation of the Toxicity Data Base.   2-15

         2.7  Analyses of Data in the Toxicity Data Bank (TDB)   2-15

              2.7.1  Empirical Approach  	   2-15

              2.7.2  Methodology Employed.   .   .   .   .   .   2-19

                     2.7.2.1  Discussion of this Method  .   .   2-21

                     2.7.2.2  Reliability of the Estimates
                              Made in Table 1	2-23

              2.7.3  Study of TDB by Conventional Statistical
                     Methods	2-24

                     2.7.3.1  Methods Applied and Results
                              Obtained	2-25
                     2.7.3.2  Discriminant Analysis  .   .   .   2-25

                     2.7.3.3  Clustering, Followed by Regression
                              Analysis   .   .    .   .   .   .   2-26
                     2.7.3.4  Separating the Data Into Three
                              Classes	2-27
                                  vi

-------
                           CONTENTS (CONT.)
Section                         Title                            Page

         2.8  Discussion of the Results Obtained from Studies
              of the Prototype Toxicity Data Bank.   .   .   .   2-28

              2.8.1  Discriminant Analysis   	   2-28
              2.8.2  Substructural Analysis Method   .   .   .   2-30

   3     RECOMMENDATIONS FOR FUTURE DEVELOPMENT OF THIS PROTO-
         TYPE DATA BASE	3-1

         3.1  Further Expansion of TDB Using Rat Oral LD50
              Data   .   .   .   .   .   .   .   .   .   .   .   3-1

         3.2  Expansion of the TDB to Include Other Types of
              Toxicity Data	3-2
         3.3  Improvement in Operational Characteristics By
              Increased Computerization of the TDB   .   .   .   3-3
         3.4  Refinement of the CIDS Keys by Experiences Gained
              From TDB Studies	3-4

         3.5  Proposed Use of the TDB by OTS for Early Warning   3-4
   APPENDIX A - PHYSICAL PROPERTIES

   APPENDIX B - SYSTEM DESIGNED FOR THE TOXICITY DATA BANK

   APPENDIX C - DELIVERABLES UNDER CONTRACT

   APPENDIX D - REGRESSION EQUATIONS FOR CLASS 1 AND CLASS 2
                                  vii

-------
                      1.  OBJECTIVES OF CONTRACT
1.1  BACKGROUND
     By the end of 1973, OTS staff had identified a significant body of
scientific literature concerning structure-activity correlations and
methodologies.  At the same time, no clear-cut decision could be easily
obtained as to the potential which these methods might have for helping
OTS in its early warning activities.  In June 1974, efforts were initiated
to review and evaluate this field of scientific endeavor and to develop
useful correlations and methodologies for application in the OTS early
warning programs.
     That this new area was ready for an intensive study by OTS was
shown by the scheduling of the first Gordon Research Conference on
Quantitative Structure-Activity Relationships in Biology in the summer
of 1975.
                                  1-1

-------
                          2.  WORK PERFORMED

2.1  LITERATURE SEARCH
     Immediately after start-up of the contract, the Office of Toxic
Substances furnished 275 full-text hard copies of literature articles
dealing with structure-activity correlation which had been located in
a preliminary literature search.
     The 275 papers served as a nucleus to which were added approximately
75 papers from the personal files of the Principal Investigator.  Some
50 additional papers were added over the first four months of the con-
tract as a result of a current-awareness screening operation.
     This enlarged bibliography was indexed in depth by SIS literature
chemists, and the index was organized by a computerized three-level in-
dexing program.  The resulting first version of the index was placed
into the NTIS for wider dissemination as NTIS report No. PB 240-658.

     Current-awareness screening of the scientific literature received
by the FI Library continued for an additional 10 months, during which
time about 100 pertinent articles were retrieved.  These were augmented
by some 50 articles located by SIS personnel assigned to this project.
These provided more complete coverage in the area of pattern recognition
and discriminant analysis, areas not well covered in the prior retro-
spective search.

2.2  STUDY OF AVAILABLE TOXICITY DATA
     The application of quantitative correlation methodologies requires
that the activity or effect be expressed by either continuous or graded
incremental quantitative numerical values.  Additionally these values
should be reproducible at least to ± 50% upon repeated measurements, which
                                 2-1

-------

 is an acceptable amount of variability encountered in making biological
 activity measurements.  Unfortunately, many biological data in the literature
 are reported without any confidence intervals or other indication of the
 confidence which one can give to these data.
     Data which have been most often successfully studied by the quanti-
 tative SA methods include pesticides and pharmacological test data.
 These test data are frequently reported for more than 10 or 20 closely
 related structural analogs, and hence represent closely knit sets of
 data.  Classical statistical methods such as regression analysis have
 been readily applied to these coherent sets of data.
     The kinds of biological data which must be considered by OTS for
 its early warning mission most often represent small units with widely
 divergent chemical types, and toxicity data is often more descriptive
 in nature than quantitative.
2.2.1  Toxicity Data in Registry of Toxic Effects of Chemical
       Substances (RTECS)
     The largest collection of toxicity data in the world is reported
in the annual Registry of Toxic Effects of Chemical Substances which
is prepared by NIOSH.  It contains toxicity data for more than 16,000
chemicals, selected from the scientific journal and report literature.
Previous editions were called the Toxic Substances List (TSL).
     A study of several hundred entries chosen at random from the 1974
TSL showed that more compounds had oral LD50 values in the rat than
any other type of toxicity test.  The oral mouse LD50 data are second
to the rat, and other types of toxicity are much less frequently re-
ported.  Although it is recognized that many variable factors enter
into an LD50 value, and that such data are not to be considered as
"firm" data, the abundance of data suggested the use of the rat oral
LD50 as having the best chance for showing structure-activity relation-
ships across a large cross-section of organic compounds.
                                  2-2

-------

     The 1975 version of TSL, renamed the Registry of Toxic Effects of
Chemical Substances, contains a compilation of aquatic toxicity data
which may prove useful as a mechanism to locate enough aquatic organism
toxicity data to permit study for correlations.  The original compila-
tions by Hann and coworkers must be reexamined, since the Registry of
Toxic Effects has combined toxicity of many aquatic species into one
Table.

2.2.2  Fish Muscle Accumulation Studies
     The bioconcentration of highly lipophilic substances in living
trout has been successfully correlated with the partition coefficient
by Dow scientists (Neely et al, 1975).    These workers report a linear
relationship between the octanol-water partition coefficient and the
bioconcentration factor for a series of halogenated organic compounds.
Indeed, a simple "rule of thumb" is suggested by their results, namely,
that any halogenated organic compound for which log P (octanol-water)
exceeds 5 log units may be considered to pose a potential bioconcentra-
tion threat to the environment.
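     The rule of thumb above can be sketched as a simple screening test.
The Python fragment below is illustrative only: the function name, the
compound entries and their log P values are hypothetical, and the
threshold of 5 log units is the one suggested by the results cited above.

```python
def poses_bioconcentration_concern(log_p, is_halogenated, threshold=5.0):
    """Apply the rule of thumb: a halogenated organic compound whose
    octanol-water log P exceeds the threshold is flagged for concern."""
    return is_halogenated and log_p > threshold


# Hypothetical entries: (name, octanol-water log P, halogenated?)
candidates = [
    ("compound A", 6.2, True),
    ("compound B", 3.1, True),
    ("compound C", 5.8, False),  # not halogenated, so outside the rule
]

flagged = [
    name
    for name, log_p, halogenated in candidates
    if poses_bioconcentration_concern(log_p, halogenated)
]
```

     Such a screen only singles out substances for closer study; it is
not a measurement of bioconcentration.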
     These data are excellent, but too few compounds have yet been
studied to permit their inclusion into the prototype toxicity data base.

2.2.3  Carcinogenesis Data
     Much of the world's carcinogenesis data is summarized in the PHS
Monograph 149 Series.  After careful study of these Monographs, and
discussions with Dr. Sidney Siegel (Information and Resources
Segment, Carcinogenesis Program, Division of Cancer Cause and Prevention,
NCI), it was concluded that, at present, these data are not easily ex-
pressed as one numerical value, since factors such as latency play an
important role.
     Additional complicating factors include unusual species and strain
variations, route of administration, frequency of dosage, frequent lack
of adequate control data and conflicting literature reports.  Thus the
use of established methodologies was not indicated.
                                  2-3

-------

2.2.4  Mutagenesis Data
     Data on mutagenic activity are becoming available as a result of
the expanded use of the "Ames" mutagenesis test as a screening tool for
carcinogenic potential of chemical substances.  This test makes use
of a mutated Salmonella typhimurium bacterial strain.  A new program
is now underway at NCI to establish the consistency of the relationship
between mutagenesis and carcinogenesis.  Because of the substantial NCI
effort, it was felt that manipulation trials with these data were not
warranted although it may be advisable to add these data in the near
future.

2.2.5  Water Pollution Data
     Here the problem was found to be the limitation on the number of
compounds for which data exist, although some useful data are given in
the November, 1975 Supplement, EPA 440/9-75-009, published by The
Hazardous Substances Branch, Office of Water Planning and Standards,
EPA.  The bulk of the data in STORET is mainly monitoring concentration
data, not toxicity data.
     The Oil and Hazardous Materials Technical Assistance Data System
(OHM-TADS) is an automated system which contains a wide variety of data
on more than 850 substances.  These data were not readily retrievable
when this project began, but the file is now being completed and en-
larged.  This System is a possible source of data for correlation studies.

2.2.6  Air Pollution Data
     An analysis by Hickey, et al (1970), showed some correlations between
health effects and current levels of air pollutants.   The report was
studied for possible reinterpretation of the data.   A better correlation
was found in our study between high levels of pollutants and respiratory
toxic effects one year subsequent to the pollutant levels.   These data
are epidemiological in nature, and do not provide the type of early
warning needed.  A visit to the EPA-RTP Air Pollution Center (February,
                                  2-4

-------

1975) uncovered only epidemiological and monitoring data in the SAROAD
system (SAROAD for "Storage and Retrieval of Aerometric Data").  Despite
the millions of data items in that system, only some 30 different chemicals
were included in these results, an insufficient number for applying
correlative techniques.

2.2.7  Teratogenesis Data
     The best data collection available at present in this field is
Shepard's "Catalog of Teratogenic Agents" (1973).   Although it lists 649
agents, the data do not lend themselves to easy analysis or to simple
quantitation; the results are mostly descriptive.   Several parameters are
crucial, including the dose level and route, frequency and timing of
administration,  types of deformations produced and control problems.  The
current development of a data bank (Teratogenic Information Center, ORNL)
may help provide sufficient data for analysis in the future.

2.3  STUDY OF AVAILABLE PHYSICAL-CHEMICAL DATA
     Many physical-chemical properties have been used in correlation
studies with varying degrees of success.  Several of these properties
are closely related, or co-linear, and this fact can lead to wrong or
unsubstantiated conclusions being drawn from apparently successful
correlations.  Care must be taken to select those parameters which are
substantially independent of each other.  Each of the major parameters
found to be most useful in establishing correlations is discussed separately
below.
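     One simple way to check whether candidate parameters are sub-
stantially independent is to compute their pairwise correlation coeffi-
cients before using them together.  A minimal sketch in Python follows;
the descriptor values are invented for illustration, and the cut-off of
0.9 is an assumption, not a value taken from this study.

```python
import math
from itertools import combinations


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def collinear_pairs(table, limit=0.9):
    """Return (name_a, name_b, r) for every descriptor pair with |r| > limit."""
    pairs = []
    for a, b in combinations(table, 2):
        r = pearson_r(table[a], table[b])
        if abs(r) > limit:
            pairs.append((a, b, r))
    return pairs


# Hypothetical descriptor values for five compounds of a homologous series.
descriptors = {
    "log_p": [0.5, 1.0, 1.5, 2.0, 2.5],
    "mol_weight": [60.0, 74.0, 88.0, 102.0, 116.0],
    "sigma": [0.2, -0.1, 0.3, 0.0, -0.2],
}

flagged_pairs = collinear_pairs(descriptors)
```

     In this invented series the molecular weight rises in step with
log P, so the two are flagged as co-linear and should not both be
entered into one correlation as if they were independent.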

2.3.1  Partition Coefficients
     The work of Hansch (1971) and Neely (1975) has firmly established
the preeminent status of partition coefficient data in this field.  These
papers substantiate the advantages of using the normal-octanol/water
system and record many hundreds of computer-stored experimentally observed
values.  They also discuss the rationale behind calculations of partition
data, although these methods are currently undergoing rather drastic im-

                                   2-5

-------

provements, based upon the recent work of Rekker and Nys (1974) (reported
at  the Gordon Research Conferences, summer, 1975).  Basically, the
structural fragments of a molecule may be assigned additive fractional
partition values and the log of the partition coefficient can be esti-
mated by summation of all fragment partition values.
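     The summation just described can be sketched as follows.  This is
an illustration of the additive principle only: the fragment names and
values in the table are placeholders chosen for the example, not the
published fragment constants, and the function name is hypothetical.

```python
from collections import Counter

# Placeholder fragment contributions to log P (illustrative values only,
# NOT the published fragment constants).
FRAGMENT_VALUES = {"CH3": 0.70, "CH2": 0.53, "OH": -1.44, "C6H5": 1.90}


def estimate_log_p(fragments):
    """Estimate log P as the sum of the values of all fragments present.

    `fragments` lists every fragment occurrence in the molecule, so a
    fragment that occurs three times is listed three times.
    """
    counts = Counter(fragments)
    unknown = sorted(f for f in counts if f not in FRAGMENT_VALUES)
    if unknown:
        raise KeyError(f"no fragment value available for: {unknown}")
    return sum(FRAGMENT_VALUES[f] * n for f, n in counts.items())


# 1-Butanol written as its fragments: CH3-CH2-CH2-CH2-OH
butanol = ["CH3", "CH2", "CH2", "CH2", "OH"]
estimate = estimate_log_p(butanol)  # 0.70 + 3 * 0.53 - 1.44
```

     A molecule containing a fragment absent from the table raises an
error rather than returning a silently incomplete sum.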
     Partition data alone can often give reasonable correlations with
biological data, although they are often used in various combinations
with other physical parameters.
     The computer printout of partition data from the Pomona College
Medicinal Chemistry Project was purchased in this contract and delivered
to OTS for use by EPA.  Subscriptions to the yearly updates of this
compilation are available at an annual charge of $500.  Dr. Albert Leo,
Director of the Pomona College Project, has served as a consultant on
this project. Use of these partition tables by OTS can provide the
"rule of thumb" estimates described for bioconcentration in Section 2.2.2.
Preliminary studies for computerized calculation of partition data are
being considered by NCI; this activity should be monitored by OTS.

2.3.2  Hammett Sigma Constants
     There are currently some 22 varieties of Hammett sigma constants
and recently these have been combined into F and R constants (F =  field
and R = resonance).   These constants are used to express the polar (or
electronic) effects of substituent groups, usually on a benzene ring.
Hence they are limited to use with compounds of this type, a limitation
which does not hold for partition coefficients.  There is ample pre-
cedent for selecting the particular sigma constant for use in a given
series of compounds, but the value of these constants when used alone in
correlation studies is much more restricted than the use of partition
coefficient.  Sigma constants usually are employed together with partition
data.  Tables of the various kinds of sigma constants are readily avail-
able in the Pomona College Medicinal Chemistry Project's compilation.
                                  2-6

-------

2.3.3  Taft Steric Constants
     These constants are derivatives of Hammett sigma constants, devised
to express the relative steric strains or crowding effects
of substituent groups.  Steric constant values are available in the
literature for most of the common chemical substituent groups.  They
have been most constructively applied by Hansch and coworkers to studies
of enzyme inhibition, and have little or no applicability in this field
outside of such studies.

2.3.4  Molar Refractivity Values
     These values can be calculated from chemical fragment refractivity
values, as reported by Hansch and coworkers (Journal of Medicinal
Chemistry 16, 1207-1216 (1973)), in a manner analogous to calculating
log P values.  The best use of these values in correlation studies
has proven to be in studies of enzyme inhibition, as for Taft steric
 constants.  Molar refractivity values are usually employed in conjunction
with partition data or other parameters.

 2.3.5  Other Types of Physical Constants
     Of the several types of physical constants so far discussed, only
the partition coefficient and molar refractivity provide values for
almost any organic compound, regardless of structure, and these can be
calculated for many molecules with a reasonable degree of assurance.
Although it is desirable to include constants such as the molecular
weight, refractive index, melting and boiling points, etc., for study
by pattern recognition methods, one should be aware of the careful
attempt reported by Russian workers to relate toxicity of 218 compounds
to 38 different physical properties studied one at a time.  No reasonable
correlations were found using LC50, MAC, narcotic data and any single
physical constant (Ljublina and Filov, 1975).  See Appendix A for a
list of the 38 physical properties studied.
                                   2-7

-------

     This background suggests strongly the need to include several pro-
perties simultaneously, by pattern recognition methods,  as well as the
desirability of getting down to a more basic approach by representing
the chemical structure instead of using physical constants.  It should
be pointed out that Hansch has often obtained significant correlations
using only the molecular weight, which often mirrors the partition
coefficient data for closely related structures.

2.4  METHODS FOR REPRESENTING CHEMICAL STRUCTURES
     Chemical structure fragment codes of many types are well-described
in the chemical literature.  Most of these codes would be applicable to
the purposes of this project, but they all are manually encoded, and
this is a laborious task which can give rise to occasional errors in
coding.  The use of augmented connectivity fragments, computer-generated
from CAS connectivity tables can overcome the inconsistencies in coding
(Lynch, et al, 1971).  The augmented connectivity fragment describes
every atom by its immediate environment; i.e., by each atom to which it
is connected.
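     The idea of describing each atom by its immediate environment can
be sketched from a connection table as follows.  This is a simplified
illustration, not the CAS or Lynch procedure itself: hydrogens are
omitted, bond types are ignored, and the fragment notation and function
name are assumptions made for the example.

```python
from collections import Counter


def augmented_atom_fragments(atoms, bonds):
    """Count fragments of the form 'X(n1,n2,...)' for every atom X,
    where n1, n2, ... are the elements of its directly bonded neighbors.

    atoms: list of element symbols, indexed by atom number
    bonds: list of (i, j) index pairs from the connection table
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(atoms[j])
        neighbors[j].append(atoms[i])

    fragments = Counter()
    for i, element in enumerate(atoms):
        environment = ",".join(sorted(neighbors[i]))
        fragments[f"{element}({environment})"] += 1
    return fragments


# Ethanol, heavy atoms only: C0 - C1 - O2
ethanol_atoms = ["C", "C", "O"]
ethanol_bonds = [(0, 1), (1, 2)]
ethanol_fragments = augmented_atom_fragments(ethanol_atoms, ethanol_bonds)
```

     Because every fragment is generated mechanically from the connec-
tion table, two coders cannot assign different codes to one structure.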
     However, the limitations of the augmented connectivity fragments
are severe, since they do not provide for interrelating structural
features more than three atoms apart.  One solution of this problem
was used by Chu, et al (1975) in which they employed a combination of
ring descriptors and chain fragments which connect heterocyclic atoms,
together with the augmented atom fragments.  However, the chain (or
"heteropath") fragments required a large amount of manual editing and
correcting, since the existing computer programs were not adequate to
make such exact assignments.
     For the developmental purposes of this project, the chemical com-
pounds and their related molecular structures were known; the need was
for computer-derived, unambiguous codes for the many substructure frag-
ments contained in each compound.
                                  2-8

-------

     A readily available source is the U.S. Army Chemical Information
Data System (CIDS) which is in operation at Edgewood Arsenal.  This
system has functioning computer programs to assign fragment code keys
based upon a connectivity table representation of each molecule repre-
sented in the system.  Many of the keys relate to groups of 4, 5 or more
atoms, rather than only 3 or 4.  Furthermore, the bulk of the chemicals
in the Toxic Substances List is available in the CIDS unclassified file.
A careful study of the CIDS structure keys was made by Dr. Craig, who
decided that these included the basic types of structural relationships
which should permit a valid test of the concepts of relating chemical
structures to toxicity.  An additional point recommending the use of CIDS
is that modifications or additions to the keys can be made with relative
ease, and the MIDS-EPA group employs the CIDS programs in its chemical
substructure search system.

2.5  EVALUATION OF METHODS FOR STRUCTURE-ACTIVITY CORRELATION
2.5.1  Multiple Parameter Regression Analysis (Hansch Method)
     Since much information is available on this method, we began our
study with a careful evaluation of the requirements for its application
to collections of chemical and biological data.
     In brief, the Hansch method employs a standard statistical method
(multiple regression analysis) to seek significant relationships between
a dependent variable (the biological activity) and from one to four or
more independent variables, which may be used alone or in combinations.
These independent variables include the physical properties discussed in
Section 2.3 above.  Experience has shown the partition coefficient to be
the most useful single property.  The statistical requirements include
a minimum number of five compounds for each independent variable studied
in an analysis; acceptable correlation coefficients which can range from
.7 to .95 or so;  and acceptable standard deviations, which rarely are
lower than .25 or .3 log units for biological test data.
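     The regression step and the statistical checks named above can be
sketched as follows.  This is a minimal illustration written with the
Python standard library only, not the Pomona College program; the
function names and the data are hypothetical, and in practice a tested
statistical package would normally be used.

```python
import math


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("independent variables are co-linear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def hansch_fit(descriptor_rows, activity, min_per_variable=5):
    """Least-squares fit of activity = c0 + c1*x1 + ... + ck*xk.

    Returns (coefficients, correlation coefficient r, standard deviation s).
    Refuses to fit with fewer than five compounds per independent variable.
    """
    n, k = len(descriptor_rows), len(descriptor_rows[0])
    if n < min_per_variable * k:
        raise ValueError(
            f"need at least {min_per_variable * k} compounds for {k} variable(s)"
        )
    design = [[1.0] + list(row) for row in descriptor_rows]
    p = k + 1
    # Normal equations: (X'X) c = X'y
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(design, activity)) for i in range(p)]
    coef = solve_linear_system(xtx, xty)

    fitted = [sum(c * v for c, v in zip(coef, r)) for r in design]
    ss_res = sum((y - f) ** 2 for y, f in zip(activity, fitted))
    mean_y = sum(activity) / n
    ss_tot = sum((y - mean_y) ** 2 for y in activity)
    r = math.sqrt(max(0.0, 1.0 - ss_res / ss_tot))
    s = math.sqrt(ss_res / (n - p))
    return coef, r, s


# Hypothetical series: one independent variable (log P), six compounds.
log_p_rows = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0]]
log_activity = [1.45, 1.75, 2.25, 2.55, 3.05, 3.35]
coef, r, s = hansch_fit(log_p_rows, log_activity)
```

     The invented data were constructed to lie close to a straight line,
so r is far higher and s far lower than one should expect from real
biological test data.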
                                  2-9

-------

     In summary, the Hansch method is not readily applicable to the wide
variety of structures for which toxicity data are available, but can
often be applied to smaller sub-sets of closely related chemicals after
these have been identified by other techniques.  Such uses will repre-
sent a "fine-tuning" of interest, but of little use to OTS for early
warning.
     The major limitation is that rarely does one have a sufficiently
large number of analogs and corresponding biological test data avail-
able to permit use of the Hansch method for a new chemical (or even
for an existing chemical) proposed for widespread use.
     There will occur some specific instances where the Hansch method
should prove to be useful to EPA.  For example, in the pesticides field
the submission of a close structure analog to a series of well-studied
products should allow the predictive application of a multiple
parameter regression equation derived by the Hansch method for the well-
known series of compounds, permitting an educated guess as to the
relative hazard posed by the new compound.  However, it is the current
accepted practice to require a battery of biological tests before a
new pesticide is permitted to be marketed.  The mere knowledge that a
new substance has a close structural relationship to known toxic pesti-
cides serves as the most readily applied "Early Warning" indicator,
whether or not a quantitative estimate of the toxicity potential has
been made.
     The Hansch computer program for multiple parameter regression
analysis was purchased from the Pomona College Medicinal Chemistry
Project for use by OTS and EPA; the computerized data bank of sigma
constants, octanol-water and other partition data was also obtained
and delivered to OTS.  Thus OTS now has the required computer programs
and partition and sigma constants listings for using the Hansch method,
when the proper data are available for its application.
                                  2-10

-------

2.5.2  Discriminant Analysis Method
     Since the more rigorous requirements of the Hansch method preclude
its use in most cases of interest to OTS, the less demanding statistical
method, discriminant analysis, was analyzed and evaluated for its po-
tential application to the needs of OTS.
     The basic requirements and limitations of this method were studied,
and the required computer programs were shown to be readily available
and easily used at any major computer facility.  Basically the several
methods for applying discriminant analysis employ statistical methods
to discriminate between two or more groups of substances by deriving
an optimal line, plane or multidimensional plane which will permit use-
ful discrimination between these groups and can be used to predict
whether a new substance has a high probability of belonging to one or
another group, e.g. carcinogenic, non-carcinogenic, etc.
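     The derivation of a discriminating line can be sketched for the
simplest case of two groups and two properties.  The fragment below uses
Fisher's linear discriminant, one common form of the method; it is not
necessarily the procedure of any particular program evaluated here, and
the property values and group labels are invented for illustration.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(r[i] for r in rows) / n for i in range(2)]


def scatter_matrix(rows, mean):
    """2 x 2 sum of squares and cross-products about the group mean."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mean[0], r[1] - mean[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fisher_discriminant(group_a, group_b):
    """Return (w, threshold): a point x is assigned to group A when
    w[0]*x[0] + w[1]*x[1] exceeds the threshold, otherwise to group B."""
    ma, mb = mean_vector(group_a), mean_vector(group_b)
    sa, sb = scatter_matrix(group_a, ma), scatter_matrix(group_b, mb)
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if abs(det) < 1e-12:
        raise ValueError("within-group scatter is singular")
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    # w = inverse(Sw) * (mean_a - mean_b), using the explicit 2 x 2 inverse
    w = [
        (sw[1][1] * diff[0] - sw[0][1] * diff[1]) / det,
        (-sw[1][0] * diff[0] + sw[0][0] * diff[1]) / det,
    ]
    midpoint = [(ma[i] + mb[i]) / 2 for i in range(2)]
    threshold = w[0] * midpoint[0] + w[1] * midpoint[1]
    return w, threshold


def classify(x, w, threshold):
    score = w[0] * x[0] + w[1] * x[1]
    return "high toxicity" if score > threshold else "low toxicity"


# Hypothetical training sets: (property 1, property 2) for each compound.
high_tox = [(1.0, 2.0), (1.5, 2.4), (2.0, 2.1), (1.2, 2.8)]
low_tox = [(4.0, 0.5), (4.5, 1.0), (5.0, 0.4), (4.2, 0.9)]

w, threshold = fisher_discriminant(high_tox, low_tox)
prediction = classify((1.3, 2.5), w, threshold)
```

     With more than two properties the line becomes a plane or a multi-
dimensional plane, and the 2 x 2 inverse is replaced by a general matrix
inversion.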
     The prototype data base, containing rat oral LD50 data, was studied
successfully by this method, as discussed in Section 2.7.3.  Discriminant
Analysis thus offers a readily applicable method for identifying poten-
tially toxic substances, and can often serve the desired goals of OTS
by quickly classifying a new substance as having a "high toxicity" or
"low toxicity".

2.5.3  The Free-Wilson Method
     This is a classical method which has been well-described and
contrasted with the Hansch method (Craig, 1973).
     The basic assumption of this method is the constancy (and
additivity) of the contributions made by substituent groups in a closely
related set of structural analogs.  This assumption is tested by the
solution of a series of multiple equations with multiple unknowns.  This
method of solution imposes certain rigorous mathematical requirements
such that a solution cannot be obtained without considerable data.
Unfortunately for the needs of OTS, one rarely will encounter a
sufficient number of closely related, systematically designed analogs to
permit use of the Free-Wilson method, save in the design and study of
drugs or pesticides.  No application of this method to a diverse group
of compounds is known.

2.5.4  Substructure Analysis Method (SAM)
     This method shares with the Free-Wilson method the basic assumption
that a structural unit of a molecule makes an additive and consistent con-
tribution towards the overall biological activity of each molecule
which contains that structural unit, but unlike the Free-Wilson method
it is not limited to use with close structural analogues.
     This method was first described in 1974 (Cramer et al., 1974).  It
was designed to overcome the limitations of the Free-Wilson method, and
thus permits the use of any organic chemical regardless of any particular
structural relationships.  To accomplish this a very large training set
of many hundreds of structures is required.  The method employed by
Cramer et al. involved the assignment of a substructure activity frequency
to each of the various fragments of a molecule (obtained from a manually
assigned chemical fragment coding system) and summing up the mean sub-
structure activity frequencies of all the substructures present in the
compound under consideration.  The resulting calculated "Mean Substructure
Activity Frequency" for a compound then was found to distinguish sig-
nificantly between active and inactive compounds in a crude biological
screening test.
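A minimal sketch of this calculation follows, under the assumption that the substructure activity frequency of a fragment is the fraction of training compounds containing that fragment which are active, and that the compound value is the mean over its fragments. The fragment keys K1 to K4 and the training set are hypothetical; the sketch is one reading of the description above, not the published procedure.

```python
from collections import defaultdict

def substructure_activity_frequencies(training_set):
    """training_set: list of (fragment_keys, is_active) pairs.

    Returns, for each fragment key, the fraction of compounds carrying
    that key which were active (assumed definition of the frequency).
    """
    seen = defaultdict(int)
    active = defaultdict(int)
    for fragments, is_active in training_set:
        for key in set(fragments):
            seen[key] += 1
            if is_active:
                active[key] += 1
    return {key: active[key] / seen[key] for key in seen}

def mean_saf(fragments, saf):
    """Mean substructure activity frequency of one compound.

    Fragments absent from the training set are skipped; None is
    returned when no fragment of the compound is known.
    """
    known = [saf[key] for key in set(fragments) if key in saf]
    if not known:
        return None
    return sum(known) / len(known)

# Hypothetical training set: (fragment keys, active in the screen?)
training = [
    ({"K1", "K2"}, True),
    ({"K1", "K3"}, True),
    ({"K2", "K3"}, False),
    ({"K3", "K4"}, False),
    ({"K1", "K4"}, True),
]

saf = substructure_activity_frequencies(training)
print(mean_saf({"K1", "K2"}, saf))  # compound built from frequently active keys
print(mean_saf({"K3", "K4"}, saf))  # compound built from mostly inactive keys
```

A fragment seen in only a handful of training compounds gives an unreliable frequency, which is why a training set of many hundreds of structures is needed.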

2.5.5  Pattern Recognition Methods
     Although there are many types of pattern recognition methods known,
two of the methods most often studied in this field will be discussed here.
     The "nearest neighbor" method involves computing Euclidean distances
between points in n-dimensional space, and by use of a training set of
compounds, a new substance is classified into the particular group for
which the sum of the squares of the Euclidean distances to all group
members is smallest for the new substance.
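The classification rule just described can be sketched as follows. The descriptor values and group names are hypothetical, and the function names are chosen here for illustration only.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points in n-dimensional space."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def classify_by_distance_sum(new_point, groups):
    """Assign new_point to the group with the smallest sum of squared
    Euclidean distances to all of its members, as described in the text.

    groups: dict mapping a group label to a list of training points.
    Returns (label, totals) where totals holds the sum for every group.
    """
    totals = {
        label: sum(squared_distance(new_point, member) for member in members)
        for label, members in groups.items()
    }
    return min(totals, key=totals.get), totals

# Hypothetical training set with two descriptors per compound.
groups = {
    "active": [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1)],
    "inactive": [(3.0, 3.0), (3.2, 2.9), (2.8, 3.1)],
}

label, totals = classify_by_distance_sum((1.1, 1.0), groups)
print(label, totals)
```

Because the distances are summed over all members, groups of very unequal size would bias this rule toward the smaller group; the sketch therefore uses groups of equal size.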

     The "learning machine" method is an iterative error-correcting
procedure which creates a hyperplane in n-dimensional space which can
discriminate between the known "active" and "inactive" compounds in a
training set.
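An iterative error-correcting procedure of this kind can be sketched with the classical perceptron update rule, which is assumed here as a representative example; the report does not specify the exact variant. The training points are hypothetical and linearly separable.

```python
def train_learning_machine(samples, max_passes=1000):
    """Error-correcting training of a separating hyperplane.

    samples: list of (features, label), label +1 ("active") or -1 ("inactive").
    Returns the weight vector; its last element is the constant (bias) term.
    """
    dim = len(samples[0][0])
    weights = [0.0] * (dim + 1)
    for _ in range(max_passes):
        errors = 0
        for features, label in samples:
            x = list(features) + [1.0]
            score = sum(w * xi for w, xi in zip(weights, x))
            if label * score <= 0:
                # Misclassified: move the hyperplane toward the correct side.
                weights = [w + label * xi for w, xi in zip(weights, x)]
                errors += 1
        if errors == 0:
            break  # every training compound is on the correct side
    return weights

def predict(weights, features):
    """Return +1 or -1 according to the side of the hyperplane."""
    x = list(features) + [1.0]
    score = sum(w * xi for w, xi in zip(weights, x))
    return 1 if score > 0 else -1

# Hypothetical, linearly separable training set.
samples = [
    ((1.0, 1.0), 1), ((1.2, 0.8), 1), ((0.9, 1.1), 1),
    ((3.0, 3.0), -1), ((3.2, 2.9), -1), ((2.8, 3.1), -1),
]

weights = train_learning_machine(samples)
print(weights, [predict(weights, f) for f, _ in samples])
```

If the two groups cannot be separated by a hyperplane, the loop simply stops after `max_passes` passes without reaching zero errors.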
     These two methods are well illustrated in a paper by Chu et al. (1975).
Limitations of these methods are also well discussed therein.
     In the preliminary studies of a prototype toxicity data base
(Section 2.7.3.3) the use of a pattern recognition method did not give
any better results than could be accomplished by the much simpler method
of discriminant analysis.  Although we recommend keeping in close
touch with developments in pattern recognition, the relative success
of the far more easily applied methods of substructure analysis and
discriminant analysis in the studies of the prototype data base makes
it unnecessary to go over to the more complex pattern recognition
methods at this time for the purposes of OTS.

2.5.6  Overview of Alternative Methods for Data Analysis
     A study was made of the many types of mathematical methods
available for organizing, analyzing and evaluating various kinds of data,
to see whether one or several existing methods could be readily adapted
to the needs of OTS.  The study included consideration of the properties
of the data elements (e.g., nominal, ordinal, interval or ratio) and of
the mathematical tools available for each type of data.  It soon became
evident that empirical methods should be considered when a number
of diverse compounds are being studied.

2.6  PROTOTYPE TOXICITY DATA BANK
     The analyses of methods and data types indicated that the best test
of utility would be obtained from establishing a prototype toxicity data
bank and testing various methods on the actual data on hand.  From the
reports of multiple regression methods (including the Hansch,
Free-Wilson and discriminant analysis approaches) it was apparent that
empirical methods for studying the data should be first considered, with
possible application of more refined statistical methods when suitable
sets of data are identified.
     In designing the data base it was concluded that at least 500
compounds would be required to achieve a critical size to permit valid
studies of the type envisaged.  Since the toxicity data were the critical
factor, it was decided to choose compounds on a roughly random
basis, from those for which an oral rat LD50 value was reported in the
1974 TSL.  [The 1975 successor (RTECS) to the 1974 TSL was not yet
available when this phase was initiated.]
     Some 800 compounds were selected from the sections A through H
of the TSL.  This set was considered adequate to serve for the prototype
study.

2.6.1  Use of CIDS Fragment Keys for Structure Representation
     Chemical structural representation was provided by the CIDS keys
which vary from 4 to 8 or more characters in length.  Although they
were computer-assigned to all compounds in the CIDS file,  the system was
not designed to display individual keys assigned to each compound.  There-
fore special programs were written to permit the extraction and print-out
of all keys assigned to each compound requested.

2.6.2  Inclusion of Partition Coefficient Data (Log P)
     Fewer than 20% of the more than 800 compounds selected for the TDB
were reported in the partition data banks of Professor Hansch.  Dr.
Albert J. Leo, Pomona College, a consultant to this project, calculated
log P values for compounds for which no experimental log P values were
known, with the exception of a few structures for which insufficient
knowledge exists to permit the calculation to be made.  These values
were desired in order to permit inclusion of partition data in discrim-
inant analysis, regression analysis and pattern recognition studies.

2.6.3  Implementation of the Toxicity Data Base (TDB)
     Of the more than 800 compounds selected for the TDB, 700 were found
to exist in the Edgewood Arsenal non-classified file.  CIDS keys for
687 compounds were extracted for the TDB.  The required programs and
sub-routines were developed under a subcontract.
     Under another subcontract the necessary computer programs were pre-
pared to merge the CIDS keys with the toxicity, molecular formula and
log P values into the master TDB record.  A brief description of the
system designed for the TDB is given in Appendix B.

2.7  ANALYSES OF DATA IN THE TOXICITY DATA BANK (TDB)
2.7.1  Empirical Approach
     The TDB master record was searched by computer to obtain the follow-
ing printouts:
     a) A listing ordered by CIDS keys.
     This list contains the Army Registry Number (ARN) (a compound
identifier for the CIDS system) and oral rat LD50 values, ordered in ascending
value of toxicity, for each compound to which that key was assigned.
Since the toxicities range in value from 1.11 to 6.88 (log 1/C values),
this permits a ready count of the distribution between six categories of
log 1/C for each key.  These categories are:

          log 1/C = 1.0 to 1.999 = Category 1
                    2.0 to 2.999 = Category 2
                    3.0 to 3.999 = Category 3
                    4.0 to 4.999 = Category 4
                    5.0 to 5.999 = Category 5
                    6.0 to 6.999 = Category 6
     b) A listing ordered by rat oral LD50 log 1/C values.
     This list gives for each compound the ARN, mouse oral LD50 1/C
value, each CIDS structure key, molecular weight and log P value
(n-octanol-water).  The distribution of compounds in each of the six
categories is obtained from this list.
     c) An "explosion" run of more than 6,000 pages listed, for every
key assigned to an individual compound, the complete records for all other
compounds in the data base to which that key was also assigned.  This
provides a valuable resource to study combinations of keys in search
of more discriminatory power for correlations.
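The six log 1/C categories used in listing a) amount to taking the integer part of the log 1/C value. A minimal sketch of the category assignment and of the per-key distribution count follows; the function names are illustrative and the sample values in the usage lines are taken from figures quoted in this section (1.11, 2.25, 2.93, 6.88).

```python
import math

def toxicity_category(log_inv_c):
    """Map a log 1/C value to Category 1 through 6.

    Category n covers n.0 to n.999, so the category is the integer part.
    """
    if not 1.0 <= log_inv_c < 7.0:
        raise ValueError("log 1/C outside the range covered by the six categories")
    return int(math.floor(log_inv_c))

def category_distribution(log_inv_c_values):
    """Count how many values fall in each of the six categories."""
    counts = [0] * 6
    for value in log_inv_c_values:
        counts[toxicity_category(value) - 1] += 1
    return counts

print(toxicity_category(1.11), toxicity_category(6.88))
print(category_distribution([1.11, 2.25, 2.93, 6.88]))
```

Applied to the log 1/C values of all compounds carrying one key, `category_distribution` gives the six counts that make up that key's substructure vector.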
     Our modifications of the method reported by Cramer et al. include
the use of quantal grades for toxicity and the use of frequency of
assignment for each key across quantal grades to estimate the activity
of the molecule as a composite described by combination of these keys.
This method has been applied manually to estimate toxicities for 31
compounds including 10 not in the original TDB.  Results are summarized
in Table I.
                                 Table I

      No.   Compound            Estimated    Observed    Deviation
                                 Log 1/C      Log 1/C
      1.-6.                         -            -           -
      7.                           2.52         2.95        -.43
      8.                           3.02         2.67         .35
      9.                           3.46         3.52        -.06
     10.                           2.49         1.50         .99
     11.                           3.26         3.24         .02
     12.                           3.97         1.50        2.47
     13.                           3.84         4.30        -.46
     14.                           2.96         1.47        1.49
     15.                           2.16         1.41         .75
     16.                           2.52         3.96       -1.44
     17.                           3.12         3.49        -.37
     18.                           2.59         1.58        1.01
     19.                           3.27         1.53        1.74
     20.                           2.10         1.83         .27
     21.                           3.03         5.65       -2.62
     22.                           2.52         1.54         .98
     23.                           2.09         1.44         .65
     24.                           2.79         3.70          -
     25.                           2.45         1.32        1.13
     26.                           2.97         4.34       -1.37
     27.                           3.00         1.51        1.49
     28.   Cl-CH2CH2-OH            2.48         3.05        -.57
     29.   n-C6H13-OH              2.36         1.34        1.02
     30.                           4.20          -          -.24
     31.                           2.93         2.25         .68

     Deviation = Estimated Log 1/C minus Observed Log 1/C.  A dash marks
     a value that is not legible in the source.  Structural formulas in
     the compound column are not reproduced.  Compound names legible in
     the source are Caffeine and Pentachlorophenol (entries 1 to 6) and
     Lindane, Alloxan, DDT, Cl3CCOOH and FCH2COONa (entries 7 to 21).
     Values for entries 22 to 27 are given in the order in which they
     appear in the source.

  *  Insufficient number of examples of one key in TDB to
     permit confidence in this value.
  ** Unusual toxicity, no close structural analog present
     in TDB.
                                   2-18

-------
                                                                F-C3947
2.7.2  Methodology Employed

     To make the estimations reported in Table I the following empirical
steps were manually employed to calculate the activity for compound 31,
Table I.

      a)  The chemical to be considered is  encoded by assignment  of
          the CIDS structure keys,  using the Handbook of CIDS Chemical
          Search Keys if the compound is not in the CIDS system.


     b)  Locate the frequency of encoding for each assigned key from
         the frequency listing.  Do not use keys assigned fewer than
         5 times if possible, since the use of keys assigned to fewer
         than 5 compounds introduces a considerable possibility for
         error.  One can use these rarely assigned keys when necessary
         but caution should be observed in interpreting the results.
         Keys assigned to more than 400 compounds have but little
         chance of being significantly related to toxicity in this
         TDB, which contains a very small number of highly toxic
         substances.

     c)  If no SCN key exists (Specific Cyclic Nucleus) for a cyclic
         moiety, use all GCN (Generic Cyclic Nucleus) keys.  If an
         SCN key is found, do not use the GCN keys, to avoid
         redundancy.


     d)  Obtain the distribution counts for the number of
         compounds in each of the 6 toxicity classes for each key,
         using the listing of rat log 1/C values ordered by keys
         (see

     e)  Prepare a substructure vector matrix for each key as
         follows:
                               Toxicity Range:

              Key:         1    2    3    4    5    6

              FG51R        0    4   11    3    0    0
          (this means that no compounds in the TDB, assigned the
          FG51R key, had toxicities ranging from log 1/C = 1.1 to
          1.99; 4 ranged from log 1/C = 2.0 to 2.99; 11 ranged
          from 3.0 to 3.99, etc.).  The last compound in Table I
          will serve as the example for the following steps.
  f)  Sum the substructure vectors of all assigned keys as follows:

                            Toxicity Ranges:

          Key:               1      2      3      4      5      6

          FG51R              0      4     11      3      0      0
          DACN=2            36     62      9      1      1      0
          HR11E              4     17      4      1      0      0
          FG268R            28    129     48     15      6      0
          SCN48             99    209     38      7      1      1

          Sum of Vectors   167    421    110     27      8      1
  g)  Since the total data base (687 compounds) was not evenly
      divided between classes 1, 2, 3, 4, 5 and 6, the weighting
      factors of 3, 2, 6, 12, 60 and 120 were used to correct
      for the unequal distributions in the data base.  This is
      necessary to prevent the great preponderance of high frequency
      data from drowning out the low frequency data, which
      carry the really important information concerning
      toxicity.  If the usual method for obtaining averages were
      applied without this weighting step (a "normalizing" operation),
      the skewed nature of the data would prevent successful
      calculations.  The weighting factors arise from the
      following distribution which was found for this TDB.  The total
      tallies for all 681 compounds by toxicity levels are:

         TOXIC LEVEL                  1     2     3     4     5     6   MISSING

         POPULATION OF COMPOUNDS    200   300   100    50    10     5      26
         (approximate)

      Ignoring the 26 missing data, the 665 compounds remaining
      have toxicity level tallies which are all fractions of
      600; the weighting factors of 3, 2, 6, 12, 60 and 120
      result from dividing 600 by each toxicity level tally.

       Sum of Vectors        =    167    421    110     27      8      1

       Correction Factor     =    x 3    x 2    x 6   x 12   x 60  x 120

       Normalized Vector Sum =    501    842    660    324    480    120

       Total sum of these six normalized values is 2,927.
     h)  To obtain weighted averages, the normalized vectors are
         multiplied by 1, 2, 3, 4, 5, and 6, and the sum of these
         products, divided by the normalized vector sum, is the
         estimated toxicity value:

              501     842     660     324     480     120
              x 1     x 2     x 3     x 4     x 5     x 6
              501    1684    1980    1296    2400     720

         Total sum of these entries is 8,581;

              8,581 / 2,927 = 2.93 = estimate of toxicity.


     i)  This estimated value (2.93) is compared with the value
         reported in the 1974 TSL (compound FD 80500) of 1200
         mg/kg.  Since the molecular weight is 213.7, 1.2 g/213.7 =
         .005615 moles/kg; to convert this to log 1/C it is
         necessary to take the reciprocal (1/.005615 = 178.1),
         the log value for which is 2.25.  This is then to be compared
         with the estimated value of 2.93; a deviation of 0.68 is noted.
         This deviation is in the "safe" direction of predicting a
         somewhat greater toxicity than that actually observed.
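Steps f) through i) above can be sketched in a few lines, which also shows how the manual calculation might be computerized. The key vectors, weighting factors, LD50 and molecular weight are those given in the text for compound 31; the function and variable names are illustrative assumptions.

```python
import math

# Substructure vectors for the five keys of compound 31 (step f).
KEY_VECTORS = {
    "FG51R":  [0, 4, 11, 3, 0, 0],
    "DACN=2": [36, 62, 9, 1, 1, 0],
    "HR11E":  [4, 17, 4, 1, 0, 0],
    "FG268R": [28, 129, 48, 15, 6, 0],
    "SCN48":  [99, 209, 38, 7, 1, 1],
}

# Weighting factors from step g (600 divided by each toxicity level tally).
WEIGHTS = [3, 2, 6, 12, 60, 120]

def estimate_log_inv_c(vectors, weights=WEIGHTS):
    """Steps f to h: sum the vectors, normalize, take the weighted average."""
    summed = [sum(column) for column in zip(*vectors)]
    normalized = [s * w for s, w in zip(summed, weights)]
    weighted = sum(level * n for level, n in enumerate(normalized, start=1))
    return weighted / sum(normalized)

def observed_log_inv_c(ld50_mg_per_kg, molecular_weight):
    """Step i: convert an LD50 in mg/kg to log 1/C (C in moles/kg)."""
    moles_per_kg = (ld50_mg_per_kg / 1000.0) / molecular_weight
    return math.log10(1.0 / moles_per_kg)

estimate = estimate_log_inv_c(KEY_VECTORS.values())
observed = observed_log_inv_c(1200, 213.7)
print(round(estimate, 2), round(observed, 2), round(estimate - observed, 2))
# prints: 2.93 2.25 0.68
```

The key selection rules of steps b) and c) (rarely assigned keys, very common keys, SCN versus GCN keys) are not applied in this sketch; the five keys are taken as already chosen.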
2.7.2.1  Discussion of this Method

     The rat log 1/C values are grouped into six "quantal" units.  Since
these are log units, .5 corresponds to a 3-fold difference in the actual
value for an LD50.
     When the entire range from 1.1 to 6.8 (a 5.7 log or 500,000-fold
range) is considered, the ability to estimate rat oral toxicity values
to even ± 1 unit should be useful.
     It must be emphasized that the technique presented is applicable to
any range desired, such as 3 or 4 quantal units instead of 6; also to
any type of activity for which data can be located or measured across
a large set of compounds, and for which the basic assumption of
additivity is found to hold.

     Among the interesting conclusions which can be arrived at by a
quick scan of the frequency counts by keys for each toxicity range is
the relative "toxicity" for a given key.
     This can be readily estimated as follows:
     Key SCN48 (the benzene ring):

        Toxicity range           1     2     3     4     5     6
        Count                   99   209    38     7     1     1
        Weighting factor        x3    x2    x6   x12   x60  x120
        Product (Sum = 1207)   297   418   228    84    60   120
        Toxicity range          x1    x2    x3    x4    x5    x6
        Product (Sum = 3173)   297   836   684   336   300   720

     3173/1207 = 2.63; this represents the "average" toxicity due to a
     benzene ring.
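The same per-key calculation can be applied to every key and the keys ranked by their "average" toxicity. The sketch below uses the five key vectors given for compound 31 in Section 2.7.2; only the SCN48 value (2.63) is stated in the text, so the values computed for the other four keys are illustrative results of the same arithmetic, and the names are assumptions.

```python
WEIGHTS = [3, 2, 6, 12, 60, 120]

# Distribution counts per toxicity range, as listed for compound 31.
KEY_COUNTS = {
    "FG51R":  [0, 4, 11, 3, 0, 0],
    "DACN=2": [36, 62, 9, 1, 1, 0],
    "HR11E":  [4, 17, 4, 1, 0, 0],
    "FG268R": [28, 129, 48, 15, 6, 0],
    "SCN48":  [99, 209, 38, 7, 1, 1],
}

def key_toxicity(counts, weights=WEIGHTS):
    """Weighted-average toxicity range for a single key."""
    normalized = [c * w for c, w in zip(counts, weights)]
    weighted = sum(level * n for level, n in enumerate(normalized, start=1))
    return weighted / sum(normalized)

# Rank keys from highest to lowest "average" toxicity.
ranking = sorted(KEY_COUNTS, key=lambda k: key_toxicity(KEY_COUNTS[k]), reverse=True)
for key in ranking:
    print(key, round(key_toxicity(KEY_COUNTS[key]), 2))
```

A ranking of this kind should be read with the caution given in step b): keys assigned to very few or to very many compounds carry little reliable information.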
     Thus, the "average" toxicity values for the structure keys are
determined by the sum of all the experiences included in the TDB.  Obviously
the larger the TDB, the more reliable will be the estimates of toxicity,
both for the fragment keys and for the entire molecule calculations.
With the total of 687 compounds employed the principle is well-illustrated
by the examples in Table I, but this TDB currently lacks an adequate
number of examples of several important functional groups to permit their
inclusion in toxicity estimates.
     The concept of a dynamic data base is important to the successful
implementation of the SAM.  The computerized nature of TDB permits ready
expansion and updating of the data.  Equally important is the flexibility
designed in TDB to permit facile addition of other types of toxicity data,
such as aquatic organism toxicity, mouse toxicity, carcinogenicity,
mutagenicity, or any other type of biological data which can be measured
and graded.  In each case, separate printouts would be prepared similar
to that described for the rat LD50 data.  In an expanded TDB system it
would be practical to maintain all of the files on-line or in a minicomputer;
the required distributions could then be retrieved as needed or printed out
from time to time to maintain updated listings.

2.7.2.2  Reliability of the Estimates Made in Table I
     The estimated rat log 1/C values for each compound listed in Table
I are plotted against the observed rat log 1/C values (taken from the
1974 Toxic Substances List) in Figure 2-1.  Included in this plot are
compounds (marked by asterisks) for which too few examples of related
compounds exist at present in TDB to permit confidence in the estimates
(fewer than 5 examples of important FG or SCN keys).  They are included
to show that estimates can be made even in these cases, but the user is
cautioned to be aware of the lack of confidence in that estimate.
     The FG and SCN keys which have fewer than 5 examples in the TDB,
and which are coded for those compounds in Fig. 2-1 marked with an asterisk,
represent the following structural features:

     hydrazine              -NHNH2
     dialkylphosphate       RO-P(=O)-OR
     guanidino              -NH-C(=NH)-NH2
     hexahydropyrimidine    (ring structure; drawing not reproduced)
     dithiophosphate        (structure drawing not legible)

The addition of more examples containing these structural features to
the TDB will improve the utility of the TDB.
     The remaining 23 compounds in Fig. 2-1 without asterisks all were
estimated to be at least one-tenth as toxic as the observed values, while
6 of the 23 were estimated to be more than 10 times as toxic as observed.
Seventeen of the 23 compounds were estimated to be within ± 1 log unit
(from one-tenth to ten times the actual toxicities).  When the full range
of more than 500,000-fold between the least toxic and most toxic compounds
in the TDB is considered, the results are quite reasonable.
     It must be emphasized at this point that the present TDB is too
small to permit its predictive use with confidence except for relatively
simple chemical structures which contain only those functional groups
and ring structures which are well-represented in the prototype TDB.
As is true of all systems which generate information based upon a
"training set" (in this case, 687 compounds), it is misleading to use
uniquely (or rarely) occurring groups to predict the activities of
the compound in which they exist.  However, 9/10 of the compounds in
Table I which do not exist in the TDB gave satisfactory estimates ranging
from +.02 to +1.06 log units higher than the observed values.  The one
compound poorly predicted, alloxan, was estimated to be almost 300 times
more toxic than the reported LD50.  Aside from the point already made
that too few close analogs of alloxan exist in the TDB at present, it is
interesting to observe that alloxan is known to be highly toxic to the
rat upon chronic administration, and it is used to produce a state in
rats akin to diabetes in man.  Although TDB is presently based on lethal
oral toxicity data in the rat, it is interesting to speculate that the
accumulated information in TDB already is suggestive of "toxicity" in
a broader sense than the input which was used.

2.7.3  Study of TDB by Conventional Statistical Methods
     Under a subcontract all of the data contained in the prototype TDB
were analyzed by multivariate techniques, including discriminant analysis,
regression analysis and preliminary clustering techniques.  The data
set consisted of the 549 compounds for which rat toxicity, log P (partition
data), molecular formulas and the CIDS fragment keys were all available.

     [Figure 2-1.  Plot of estimated vs. observed log 1/C values for the
     compounds of Table I; horizontal axis: Observed Log 1/C.  Plot not
     reproduced.]

2.7.3.1  Methods Applied and Results Obtained
     Initially only the 101 fragment keys which had been assigned 10 or
more times were studied.  A regression equation containing 30 variables
was obtained for 543 compounds by stepwise regression.  The standard
error of estimate was .54 and R² was .33 (R = .574).  All 101 keys had
an opportunity to be entered into the regression.  It is interesting to
note the standard error of .54, which supports our earlier decision to
assume a range of ± 1 log unit for the estimates discussed in Section 2.7.2
above.  This range is approximately twice the standard error found in this
preliminary regression study.

2.7.3.2  Discriminant Analysis
     A series of stepwise discriminant analyses was performed, based
upon the upper and lower quartiles of log 1/C values; a typical result
is shown below:
                                Classified as:
           Actual Toxicity      High       Low       Error Rate %
                High             91         33            27
                Low              12        132             8

                                      (overall average error rate = 17%)

     It should be noted that the skewed nature of the data results in a
range of 1.114 to 1.968 for the lower quartile, and a much broader range
of 2.727 to 6.914 for the higher quartile.  The nearest limits of the two
quartiles are only 0.759 log 1/C units apart.  Thus the relatively poor
discrimination found is not surprising.
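The error rates in the classification table above follow directly from the four counts. A short sketch of the arithmetic, with illustrative function and variable names:

```python
def error_rates(table):
    """Per-class and overall misclassification rates, in percent.

    table[actual][classified_as] holds the number of compounds.
    """
    rates = {}
    wrong_total = 0
    grand_total = 0
    for actual, row in table.items():
        n = sum(row.values())
        wrong = n - row[actual]
        rates[actual] = 100.0 * wrong / n
        wrong_total += wrong
        grand_total += n
    rates["overall"] = 100.0 * wrong_total / grand_total
    return rates

# Counts from the typical stepwise discriminant analysis result.
table = {
    "High": {"High": 91, "Low": 33},
    "Low":  {"High": 12, "Low": 132},
}

rates = error_rates(table)
print({name: round(value) for name, value in rates.items()})
# prints: {'High': 27, 'Low': 8, 'overall': 17}
```

The overall figure is computed here over all 268 compounds (45 misclassified), which reproduces the 17% quoted in the table.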
                                  2-25

-------
                                                                F-C3947
2.7.3.3  Clustering, Followed by Regression Analysis

     Next, clusters were sought, using the ISOGEN method, which is a
modified learning machine method.  A random set of 142 compounds was
clustered, using 137 variables, after scaling all variables to lie between
0 and 1.  Log P, (log P)², molecular weight and 134 CIDS keys were used.
One main cluster of 129 compounds, together with 3 clusters of 6, 2 and 4,
respectively, was obtained.
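The preparation of the variables described above (log P, its square, molecular weight and key indicators, each scaled to lie between 0 and 1) can be sketched as follows. Min-max scaling is assumed, since the report does not state which scaling was used; the three compounds and their values are hypothetical, and only the key names are taken from this report.

```python
def build_features(compounds, key_list):
    """One row per compound: log P, (log P)^2, molecular weight, key indicators."""
    rows = []
    for log_p, molecular_weight, keys in compounds:
        indicators = [1.0 if key in keys else 0.0 for key in key_list]
        rows.append([log_p, log_p ** 2, molecular_weight] + indicators)
    return rows

def scale_columns(rows):
    """Min-max scale every column to the interval 0 to 1 (assumed method)."""
    scaled_columns = []
    for column in zip(*rows):
        low, high = min(column), max(column)
        span = high - low
        scaled_columns.append(
            [(value - low) / span if span else 0.0 for value in column]
        )
    return [list(row) for row in zip(*scaled_columns)]

# Hypothetical compounds: (log P, molecular weight, assigned keys).
compounds = [
    (2.13, 78.1, {"SCN48"}),
    (-0.77, 32.0, {"FG80"}),
    (5.0, 354.5, {"SCN48", "FG112"}),
]
key_list = ["SCN48", "FG80", "FG112"]

scaled = scale_columns(build_features(compounds, key_list))
for row in scaled:
    print([round(value, 3) for value in row])
```

Scaling puts the continuous variables on the same footing as the 0/1 key indicators, so that no single variable dominates the distances used in clustering.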

     It was found that 15 CIDS keys did not discriminate among these four
clusters, so by eliminating these keys a larger group of 305 randomly
selected compounds could be used.  This resulted in 2 clusters, of 297
and 7 compounds.

     Limiting the regression study to this large cluster, they then
obtained the following results:

     Run #2:  41 variables, 297 compounds:

              R² = .63
              Standard Error of Estimate = .5396
              Standard Deviation of Residuals = .4999
          Major variables:  (log P)², molecular weight; FG120;
              FG51R; FG80; FG96; SCN48; SCN49; GCN3=C2,01;
              GCN3=C5; GCN4=C5,01; GCN4=C6; GCN6=1,2; GCN1=3;
              GCN5=1

     Run #3:  30 variables, 285 compounds:

              R² = .50
              S.E. of Est. = .4619
              S.D. of Res. = .4368
          Major variables:  molecular weight; NCN=3; FG120; FG51R;
              FG80; FG94; FG96; HR12R; GCN= C8

     Run #4:  38 variables, 267 compounds:

              R² = .56
              S.E. of Est. = .3625
              S.D. of Res. = .3356
          Major variables:  molecular weight; DACN=2; FG120; FG51R;
              FG80; FG96; HR12R; GCN5=0; GCN5=6.
                                  2-26

-------
                                                                F-C3947

     Class I contains those compounds for which all the structure keys
 are used at least seven times in the TDB.  Class II contains all com-
 pounds with at least one rare key (assigned less than 7 times).  Class
 III contains the outliers, mostly those compounds with high toxicity.
 The above table describes the Discriminant Equations and shows the
 different coefficients for the keys in Class III compounds and Classes
I and II compounds (combined).   The F test indicates the significance
of the differences between these coefficients; e.g., between 1.24 and
43.3 for FG205.  (An F value above 2 or 4 is usually significant at a
95% level, hence all of these are highly significant.)  It should be
realized that the discriminant approach merely differentiates between
classes, e.g., toxic and non-toxic, and should not be used to estimate
relative toxicities.
     Below are given the structural meanings of these keys:
      KEY             STRUCTURE
     FG205         phosphorodithioate
     GCN3=C5       cyclopentyl ring
     FG29R         imide group

     SCN107        isochroman
     SCN84         2,3-dihydrobenzofuran
     SCN45         piperidine
     G-CN4=C1J      cyclopropyl
     HRIR          ring-methyl
     FG118         acetylenic
     FG34          primary amide
     FG112         aliphatic halogen
     FG231R        trialkyl-phosphate
C=N — (group attached to
         ring)
                                 2-29

-------
                                                               F-C3947

    These structural groupings range from highly specific to very
generic.  However, they alone can serve as alerting tools to OTS for
purposes of early warning.  They are not sufficient by themselves to
pinpoint a highly toxic substance, but they can be legitimately used
to question a new structure, as far as potential rat toxicity is con-
cerned.
    Each new set of toxicity data entered into the TDB should be studied
by discriminant analysis to see if a similar set of discriminant equations
can be derived from that set.  The resulting keys can provide a quick
rule of thumb for questioning further the potential toxicity of a novel
substance which contains keys permitting discrimination between high
and low toxicity.
    In the particular example just studied (rat oral toxicity) the
Class I and Class II compounds were also studied by regression analyses,
so that in the event that a toxicity arises from a particular combination
of relatively non-toxic and common keys, the regression equations may
be applied to predict toxicity.  These equations are given in Appendix
D.  As more compounds are added, these equations can readily be derived anew.

2.8.2  Substructure Analysis Method
    This relatively simple method has been shown in this project to
lead to quite acceptable predictions of toxicity.  This approach can
be applied to all types of toxicity data which can be obtained (or
located) in graded or quantitative terms of activity.  Although these
initial results have been obtained by manual use of computer-generated
print-outs, it should be emphasized that estimates will be more readily
obtained from computerization of the calculations, since these are
often laborious when done manually.
    The same point should be made concerning the manual assignment of
CIDS keys; these are normally assigned by computer, and manual assignment
can lead to inaccurate encoding by an inexperienced chemist.  Of interest is the
type of feedback which these studies can provide for increasing the
fragment key specificity to represent toxicological significance.
                                 2-30

-------

     The difference between runs 2, 3 and 4 is that certain outliers
were not included in runs 3 and 4, thus sharpening the resulting equations.
These preliminary results identify certain keys as significant
contributors to oral rat toxicity.

2.7.3.4  Separating the Data Into Three Classes
     The compounds were then divided into three classes:
     a)  Class I:  those compounds in which no key occurs with
         a frequency of use of less than 7.
     b)  Class II:  those compounds containing at least one rare
         key.
     c)  Class III:  the outliers, mostly those compounds with
         high toxicity.

     The intent was to separate class III from classes I and II by dis-
criminant analysis and then derive separate regression equations for
classes I and II.  In actual practice this TDB contains most of the highly
toxic substances in class III.
     This approach was followed through, and led to the suggestion that
a three-step system be used for classification and prediction of the rat
toxicity of chemicals:
     First, the discriminant equations would be used to pinpoint the
class III compounds (highly toxic), thus achieving a highly desirable
early warning result;
     Second, the remaining compounds would be separated into classes I
and II, based upon the presence or absence of rare keys;
     Third, toxicity for members of these two groups would be predicted
using the appropriate regression equations.  It should be noted that
this approach can handle the actual log 1/C values as a continuous
function, whereas substructural analysis is more easily applied to step-
wise or quantal gradations in toxicity.
     The above approach was carried out to give regression equations for
class I and II compounds and a discriminant equation to separate class
III from classes I and II.   Details of the regression equations are given
in Appendix D.
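
     The three-step system can be sketched as follows.  This is a
minimal illustration and not the project's program: the function and
argument names are hypothetical, the discriminant test and the two
regression equations are passed in as callables, and a key is treated
as "rare" when it occurs fewer than 7 times, following the Class I
definition above.  Keys absent from the frequency table are counted
as occurring zero times.

```python
def three_step(keys, key_frequency, is_class_iii, regress_i, regress_ii,
               rare_below=7):
    """Classify a compound and, for Classes I and II, predict its toxicity.

    Returns (class_label, predicted_value); the value is None for Class III.
    """
    keys = list(keys)
    # Step 1: the discriminant equation singles out Class III.
    if is_class_iii(keys):
        return "III", None
    # Step 2: any rare key places the compound in Class II.
    has_rare_key = any(key_frequency.get(k, 0) < rare_below for k in keys)
    # Step 3: apply the regression equation for the assigned class.
    if has_rare_key:
        return "II", regress_ii(keys)
    return "I", regress_i(keys)


# Stand-in components, for illustration only.
frequency = {"A": 50, "B": 3, "T": 10}
flag = lambda ks: "T" in ks
print(three_step(["A", "B"], frequency, flag, lambda ks: 2.0, lambda ks: 3.0))
```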
     The discriminant analysis equation needs to be adjusted to assure
that no false negatives slip by.  This adjustment can be readily made;
it will result in an increase (not prohibitive) in the false positives.
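
     One way such an adjustment could be made is sketched below.  The
report does not describe the procedure, so this is an assumption: each
compound is taken to have a single discriminant score (higher meaning
more likely highly toxic), and the cutoff is lowered to the smallest
score found among the known highly toxic compounds.  The scores and
labels are invented for illustration.

```python
def cutoff_without_false_negatives(scores, is_toxic):
    """Largest cutoff that still flags every known highly toxic compound."""
    toxic_scores = [s for s, t in zip(scores, is_toxic) if t]
    if not toxic_scores:
        raise ValueError("need at least one highly toxic compound")
    return min(toxic_scores)


def error_counts(scores, is_toxic, cutoff):
    """Return (false_negatives, false_positives) when score >= cutoff is flagged."""
    false_neg = sum(1 for s, t in zip(scores, is_toxic) if t and s < cutoff)
    false_pos = sum(1 for s, t in zip(scores, is_toxic) if not t and s >= cutoff)
    return false_neg, false_pos


# Invented scores: three highly toxic compounds followed by three others.
scores = [2.0, 0.5, -0.3, -0.2, 0.1, -2.0]
toxic = [True, True, True, False, False, False]

print(error_counts(scores, toxic, 0.0))       # unadjusted cutoff
adjusted = cutoff_without_false_negatives(scores, toxic)
print(adjusted, error_counts(scores, toxic, adjusted))
```

In this invented example the adjustment removes the one false negative
at the cost of one additional false positive.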

     One of the major advantages of the discriminant analysis approach
over substructural analysis is that one does not need a training set (or
large data base) to apply discriminant analysis, whereas the substructural
analysis approach, in common with pattern recognition methods, requires
a large training set to establish baselines.

2.8  DISCUSSION OF THE RESULTS OBTAINED FROM STUDIES OF THE PROTOTYPE
     TOXICITY DATA BANK

     The studies of the prototype TDB by two quite different methods for
structure-activity correlation have shown that both methods have the
potential to alert OTS to possible toxicity of novel or untested
compounds.


2.8.1  Discriminant Analysis

     This method provides a model equation which discriminates between
substances highly toxic to the rat and the bulk of the non-toxic
substances, and which is based only upon the presence of one or more of a
group of 12 structural keys.  Although this group of structural keys
may require some additions as new data are added, it is worthwhile to
examine these at this point.


     Variable        Class (I+II)       Class III           F

     FG205               1.24               43.3          82.1
     GCN3=C5              .983              18.9          65.6
     FG29R               1.24               43.3          41.2
     SCN107               .095              38.4          33.6
     SCN84                .095              38.4          33.6
     SCN45                .727              13.1          30.8
     GCN4=C3             1.24               22.3          20.5
     HRIR                1.15                4.90         15.9
     FG118                .852              13.5          11.0
     FG34                 .852              13.5          11.0
     FG112               1.16                5.30         10.8
     FG231R              -.488              14.7          10.2
     Constants           -.140              -13.1
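
     The table above can be applied as sketched below.  The report does
not state how the two columns are used, so the sketch assumes the usual
reading of such output: each column holds the coefficients of a linear
classification function, a key contributes its coefficient when present
(a 0/1 indicator), and a compound is assigned to the class with the
larger score.  The function names are hypothetical; the numbers are
copied from the table.

```python
# key: (Class I+II coefficient, Class III coefficient), as tabulated.
COEFFICIENTS = {
    "FG205": (1.24, 43.3),
    "GCN3=C5": (0.983, 18.9),
    "FG29R": (1.24, 43.3),
    "SCN107": (0.095, 38.4),
    "SCN84": (0.095, 38.4),
    "SCN45": (0.727, 13.1),
    "GCN4=C3": (1.24, 22.3),
    "HRIR": (1.15, 4.90),
    "FG118": (0.852, 13.5),
    "FG34": (0.852, 13.5),
    "FG112": (1.16, 5.30),
    "FG231R": (-0.488, 14.7),
}
CONSTANTS = (-0.140, -13.1)


def class_scores(keys):
    """Return (Class I+II score, Class III score) for a set of CIDS keys."""
    present = set(keys) & set(COEFFICIENTS)
    low = CONSTANTS[0] + sum(COEFFICIENTS[k][0] for k in present)
    high = CONSTANTS[1] + sum(COEFFICIENTS[k][1] for k in present)
    return low, high


def is_class_iii(keys):
    """Assumed decision rule: Class III when its score is the larger one."""
    low, high = class_scores(keys)
    return high > low


print(class_scores(["FG205"]), is_class_iii(["FG205"]))
print(class_scores(["FG112"]), is_class_iii(["FG112"]))
```

Under this reading a compound with none of the 12 keys scores -.140
against -13.1 and falls in Class (I+II).  A key with a small Class III
coefficient, such as FG112, does not change that on its own.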
                                  2-28

-------

     Because the development of the CIDS system was for generic  retrieval
of chemical structures, it is understandable that the keys  which are
chosen for that purpose will not always be the best keys for relating
structures to biological activity.  Using this study of the TDB  to  guide
improvements in the CIDS keys is a logical step, which would be  required
regardless of what fragment encoding system were to be used.
     The recent interest in the CIDS systems shown both by  NCI and  EPA
scientists and computer personnel could serve to expedite any future
work in this area.  It is especially noteworthy that current programming
is already underway to provide for assignment of CIDS fragment keys by
computer programs designed to convert the Chemical Abstracts Services
connectivity tables into the CIDS keys.  This will have the result  of
making readily available the CIDS keys for over 3,500,000 compounds al-
ready on record at CAS.  The ability to access the CIDS keys by  computer
would make the use of this empirical method readily operable by  OTS.
     The results obtained to date from this study support the validity
of the basic assumption, namely, that structural fragments can be
assigned incremental toxicity values which may serve to estimate the
resulting toxicity of the entire molecule.  This result was by no means
a foregone conclusion, especially when the complex nature of lethality
and the problem of poor duplicability of LD50 results are considered.
The size of the TDB was crucial; a large enough number of compounds had
to be studied to override the variability factors.  It is unlikely  that
similar results could have been obtained with confidence using a much
smaller data base.
    Although the TDB should be increased in size before being widely
applied, it can even now be used to estimate toxicity (oral rat LD50)
for those chemicals which do not contain CIDS structural keys which
occur fewer than five times in the TDB.  Many hundreds of chemicals
of interest to OTS fall into this category, and their toxicities could
be easily calculated if the SAM were further computerized.
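
    The applicability test described above lends itself to a short
routine.  The sketch below is an assumption-laden stand-in, not the
calculation of Section 2.5.2: it refuses to estimate when any key of
the compound occurs fewer than five times in the TDB, and otherwise
simply averages the per-key mean log 1/C values.  All names and
numbers are hypothetical.

```python
def estimate_log_inv_c(keys, key_mean, key_count, min_count=5):
    """Estimate log 1/C from per-key means, or return None if not applicable.

    key_mean  -- mean log 1/C of TDB compounds containing each key
    key_count -- number of TDB compounds containing each key
    """
    keys = sorted(set(keys))
    if not keys:
        return None
    # Keys seen fewer than min_count times are too rare to rely on.
    if any(key_count.get(k, 0) < min_count for k in keys):
        return None
    # Stand-in combination rule: plain average of the key means.
    return sum(key_mean[k] for k in keys) / len(keys)


# Invented per-key statistics.
means = {"A": 2.0, "B": 3.0, "C": 5.0}
counts = {"A": 40, "B": 12, "C": 2}
print(estimate_log_inv_c(["A", "B"], means, counts))   # both keys are common
print(estimate_log_inv_c(["A", "C"], means, counts))   # key C is rare
```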

-------
              3.  RECOMMENDATIONS FOR FUTURE DEVELOPMENT
                  OF THIS PROTOTYPE DATA BASE
     Possible ways to further develop the dual approach to using SAR
methods for purposes of early warning include:
     a)  Further expansion of the TDB using rat oral LD50
         data.
     b)  Expansion of the TDB to include other types of
         toxicity data, such as marine toxicity data,
         carcinogenicity, etc.
     c)  Improvement in operational characteristics by in-
         creased computerization of TDB.
     d)  Refinement and additions to CIDS structure keys,
         by experience gained in studies of the TDB.
Each of these points is discussed below.

3.1  FURTHER EXPANSION OF TDB USING RAT ORAL LD50 DATA
     The best way to increase the size of the rat oral LD50 data in the
TDB is to extract all such data from the NIOSH computer tapes which are
now available.  By including the CAS registry numbers (also on the
NIOSH tapes) it should then be possible to obtain the CIDS structure
keys by computer from the connectivity tables.  Towards this end,
MDSD-EPA has already made arrangements with CAS to obtain connection
tables by supplying them with the CAS registry numbers.
     Should an unexpected delay be encountered in these arrangements,
it is recommended that the remainder of the 1974 TSL (I through Z)  be
scanned manually for inclusion of more compounds with oral rat toxicity
data, as we did in setting up this prototype TDB.
     One problem which is frequently encountered is the change in rat
oral LD50 values reported in the 1975 TSL as compared to the 1974
edition.  Some of these changes are so startling as to suggest that one
or the other must be reported in error (e.g., a change from the micro-
gram to milligram level).  These apparent discrepancies should be checked
by examination of the original literature references, to avoid entering
possible noise into the TDB.  It would be desirable, but time-consuming,
to check all reported values; NIOSH is supposed to be correcting such
errors and may be able to provide the desired confirmation.
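
     A screening step of this kind is easy to mechanize.  The sketch
below assumes the two editions have been read into tables keyed by a
compound identifier (the CAS registry number would serve) with LD50
values in the same units, and flags any compound whose value changed
by at least one log unit.  The threshold, the names and the sample
values are illustrative assumptions.

```python
import math


def flag_discrepancies(ld50_1974, ld50_1975, min_log_change=1.0):
    """List compounds whose LD50 differs between editions by >= min_log_change.

    Each entry is (compound, old_value, new_value, change_in_log_units).
    """
    flagged = []
    for compound in sorted(set(ld50_1974) & set(ld50_1975)):
        old, new = ld50_1974[compound], ld50_1975[compound]
        change = abs(math.log10(new / old))
        if change >= min_log_change:
            flagged.append((compound, old, new, round(change, 2)))
    return flagged


# Invented values in mg/kg; "X" mimics a microgram-to-milligram change.
tsl_1974 = {"X": 0.05, "Y": 300.0, "Z": 12.0}
tsl_1975 = {"X": 50.0, "Y": 320.0, "W": 1.0}
print(flag_discrepancies(tsl_1974, tsl_1975))
```

Only the flagged compounds would then need to be checked against the
original literature.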

3.2  EXPANSION OF THE TDB TO INCLUDE OTHER TYPES OF TOXICITY DATA
     It should be emphasized that inclusion of the chemical structure
keys for compounds without any accompanying toxicity data is not
encouraged for this stage of development of TDB.  At some time in the
future it may be desirable to do so to provide a "shopping list" for
compounds available for toxicity study to increase the number of examples
of rare structure groups (or keys).  However, at this stage it makes
much more sense to add other toxicity data, both for compounds already
in the TDB and for new structures.
     The carcinogenicity problem is of great concern to EPA.  Although
the problems of determining relative degree of carcinogenicity are
staggering, and NCI is struggling with this great problem, it would
appear that sufficient data now exist to permit tentative assignment of
several quantal grades of carcinogenicity, and the following are suggested
for consideration:
     a)  Highly active carcinogens (aflatoxins, 2-AAF, etc.).  This
         would be used for potent carcinogens shown to cause cancer
         consistently in several species.
     b)  Active carcinogens (vinyl chloride, hexamethylpyrophosphoramide,
         nitrosamines, etc.).  Some of these may belong in the first
         category.
     c)  Suspected carcinogens (conflicting results, only one species,
         currently under trial).
     d)  Not carcinogenic (reserved for compounds adequately tested).
     Entering a test set of approximately 200 to 300 compounds graded
into these categories into the TDB, and generating the CIDS frequency
counts for the chemical structure keys would be the first step.  Ultimately
an estimate of the carcinogenic potential would be obtained.
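
     As a rough illustration of this first step, the four quantal
grades can be coded as ordered integers and the key frequencies counted
within each grade.  The grade names follow the categories suggested
above; the numeric codes, the function name, the key names K1 and K2
and the records are hypothetical and carry no toxicological meaning.

```python
from collections import Counter, defaultdict
from enum import IntEnum


class Grade(IntEnum):
    """Quantal carcinogenicity grades, ordered from least to most active."""
    NOT_CARCINOGENIC = 0
    SUSPECTED = 1
    ACTIVE = 2
    HIGHLY_ACTIVE = 3


def key_counts_by_grade(records):
    """Count, for each key, how many compounds of each grade contain it.

    records -- iterable of (keys, grade) pairs, one per compound.
    """
    counts = defaultdict(Counter)
    for keys, grade in records:
        for key in set(keys):          # a key counts once per compound
            counts[key][grade] += 1
    return counts


# Invented records for three compounds.
records = [
    (["K1", "K2"], Grade.ACTIVE),
    (["K2"], Grade.NOT_CARCINOGENIC),
    (["K2", "K2"], Grade.ACTIVE),
]
for key, by_grade in sorted(key_counts_by_grade(records).items()):
    print(key, {g.name: n for g, n in sorted(by_grade.items())})
```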
     As more mutagenicity data become available, it should be possible
to treat it in a manner similar to the carcinogenicity data.  Indeed,
it should be easier to quantitate the results from the Ames test than
from a carcinogenesis bioassay.  Further impetus for this approach
may come from the careful study underway at present at NCI to confirm
or deny the proposed relationship between carcinogenicity and muta-
genicity.
     From a practical point of view, the present TDB contains much data
for the mouse oral LD50 test.  These data have not been studied in
the way reported for the rat, but could readily be examined.

3.3  IMPROVEMENT IN OPERATIONAL CHARACTERISTICS BY
     INCREASED COMPUTERIZATION OF THE TDB
     There are several ways to improve the accuracy and ease of opera-
tion of the prototype TDB.  With but minimal programming, one can readily
calculate the average toxicity associated with all of the keys, as
shown by a manual calculation for the benzene ring (Section 2.5.2.1).
This should be done to provide estimates of those keys most associated
with toxicity, to see whether any surprises are encountered.
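
     The per-key averaging is indeed a small programming task.  A
minimal sketch follows, assuming each TDB record supplies the
compound's CIDS keys and its log 1/C value; the record layout, the
function name and the sample data are hypothetical.

```python
from collections import defaultdict


def key_statistics(records):
    """Return {key: (count, mean log 1/C)} over all compounds containing the key.

    records -- iterable of (keys, log_inv_c) pairs, one per compound.
    """
    totals = defaultdict(lambda: [0, 0.0])
    for keys, value in records:
        for key in set(keys):          # a key counts once per compound
            totals[key][0] += 1
            totals[key][1] += value
    return {key: (n, total / n) for key, (n, total) in totals.items()}


# Invented records: (keys, log 1/C).
tdb = [(["K1", "K2"], 2.0), (["K1"], 4.0), (["K2", "K3"], 3.0)]
stats = key_statistics(tdb)

# List keys from highest to lowest average toxicity.
for key, (n, mean) in sorted(stats.items(), key=lambda kv: -kv[1][1]):
    print(f"{key}: n={n}, mean log 1/C={mean:.2f}")
```

Printing the count beside each mean makes it plain which averages rest
on too few compounds to be trusted.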
     With additional programming, the TDB calculations shown in Section
2.5.2 can be systematized so that the process will not only be easier and
faster, but it should be freed from possible arithmetic errors.   With
but little additional effort,  the calculation process can be placed into
an on-line computer mode.  The whole TDB operation can also be converted
to a minicomputer operation, which may offer several advantages.   During
these programming stages the system can still be studied in the manual
mode.

3.4  REFINEMENT OF THE CIDS KEYS BY EXPERIENCES GAINED
     FROM TDB STUDIES
     The current CIDS keys were assigned by organic chemists after care-
ful study of the problems in retrieval of chemical structures which are
chemically related to each other.  The requirements for this type of
"generic" retrieval system are not necessarily the same as the require-
ments for relating chemical structure to biological activity.  Even with
the TDB at its present level of 687 compounds, our preliminary ex-
periences have pinpointed several keys for which it is advisable to
have greater specificity of structural information.
     It will become increasingly obvious with more experience that
this type of feedback should result eventually in the best type of keys
for this purpose; namely, keys designed to optimize the biological
differences between chemical structure fragments.

3.5  USE OF THE TDB
     Even in this prototype state, the TDB can be used as described
in Sections 2.7.3 and 2.7.2 to predict toxicities for many compounds
in the rat.  When aquatic organism toxicity data, carcinogenesis data
and possibly mutagenesis data are added, predictions of several types
of toxicity may be possible for many types of chemicals,  and each new
set of toxicity data will strengthen the utility of this  data base.
We recommend that both the discriminant analysis and substructural
analysis methods be employed at present since we have not had sufficient
opportunity to compare these methods and to determine whether one or
the other, or both, should always be employed.

-------
                               REFERENCES
 1.  Adamson, G. W., Lynch, M. F., and Town, W. G., J. Chemical Society
     C, 1971, 3702-3706.

 2.  Chu, K. C., Feldmann, R. J., Shapiro, M. B., Hazard, Jr., G. F., and
     Geran, R. I., J. Medicinal Chemistry 18, No. 6, 539-545 (1975).
 3.  Craig, P. N., "Advances in Chemistry", Volume 114, Chapter 8, 1973,
     pp. 115-129, American Chemical Society.
 4.  Cramer III, R. D., Redel, G., and Berkoff, C. L., J. Medicinal
     Chemistry 17, 533 (1974).
 5.  Hickey, R. J., Boyce, D. E., Harner, E. B., and Clelland, R. C.,
     IEEE Trans. on Geoscience Electronics, October, 1970, pp. 186-201.
 6.  Leo, A. J., Hansch, C. H., and Elkins, D.,  Chemical Reviews 71,
     525-616 (1971).
 7.  Ljublina, E. I. and Filov, V. A., "Methods  Used in the USSR for
     Establishing Biologically Safe Levels  of Toxic  Substances", World
     Health Organization, Geneva, 1975, Chapter  2, pp.  19-44.
 8.  Neely, W. B., Branson, D. R., and Blau, G. E., Environmental
     Science and Technology 8, 1113-1115 (1974).
 9.  Nys, G. G. and Rekker, R. F., Chim. Ther. £,  521 (1974).

10.  Shepard, T. H., "Catalog of Teratogenic Agents",  The  Johns Hopkins
     University Press, 1973.

-------
                         APPENDIX A
                    PHYSICAL PROPERTIES
            studied by Ljublina and Filov, USSR
 1.  Molecular weight
 2.  Density
 3.  Molar volume
 4.  Refractive index
 5.  Molar refraction
 6.  Melting point
 7.  Boiling point
 8.  Saturated vapor pressure
 9.  Saturated vapor pressure
10.  Equilibrium temperature
11.  Rate of change of       with pressure
12.  Rate of change of  boil with pressure
13.  Critical density
14.  Critical temperature
15.  Critical pressure
16.  Latent heat of fusion
17.  Latent heat of vaporization
18.  Heat of combustion
19.  Heat of formation of gas
20.  Helmholtz energy of formation of gas
21.  Logarithm of distribution coefficient
     (olive oil/water)
22.  Logarithm of distribution
     coefficient (water/air)
23.  Surface tension
24.  Kinematic viscosity
25.  Dynamic viscosity
26.  Solubility
27.  Specific heat capacity
28.  Specific heat of vapor
29.  Thermal conductivity
30.  Atomic polarization
31.  Parachor
32.  Electric dipole moment
33.  Dielectric constant
34.  Specific dispersion
35.  Absolute dispersion
36.  Primary ionization potential
37.  Entropy of liquid
38.  Entropy of gas

-------
                              APPENDIX B

              SYSTEM DESIGNED FOR THE TOXICITY DATA BANK

     A flexible system was designed to permit the resulting TDB to expand
to many thousands of compounds, should this be desirable.
     Although the system is not limited by hardware considerations, a
brief description of the system designed for this particular application
is given.  Provision was made for a record length of 1316 characters,
unblocked, and the system was mounted on an IBM 360-40 with 384K bytes
of main-frame memory, 4 tape-drives and a 2314 disk unit.  Nine-track
tape at 1600 bpi and EBCDIC codes were employed.  The batch system can
readily be upgraded to an on-line system, should this become desirable.
The programs are written in ANSI COBOL under OS; an assembly language
call program was written to invoke a Fortran log function to permit
use of logarithms in calculations of log 1/C values.
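
     For orientation, the log 1/C calculation can be sketched in a few
lines of Python.  This appendix does not restate the definition of C,
so the sketch assumes that C is the LD50 expressed in moles per
kilogram of body weight, obtained from an LD50 in mg/kg and the
molecular weight; the function name is hypothetical.

```python
import math


def log_inv_c(ld50_mg_per_kg, mol_weight):
    """Return log10(1/C), with C the LD50 converted to mol/kg (assumed units)."""
    if ld50_mg_per_kg <= 0 or mol_weight <= 0:
        raise ValueError("LD50 and molecular weight must be positive")
    # mg/kg -> g/kg -> mol/kg
    c = ld50_mg_per_kg / 1000.0 / mol_weight
    return -math.log10(c)


# A compound of molecular weight 100 with an LD50 of 100 mg/kg.
print(round(log_inv_c(100.0, 100.0), 3))
```

On this scale a more toxic compound (smaller LD50) has a larger
log 1/C value.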
     The programs permit the accession of all compounds having any
desired combination of parameters in common.  The parameters include
the toxicity data,  structural keys, and physical constants such as log
P values, but there is provision for many other parameters if desired.

-------
                               APPENDIX C


                      DELIVERABLES UNDER CONTRACT
A.  Reports

    1.  Index to Subjects and Authors for Structure-Activity Correlation
        Bibliography: 68 pages.  Available from NTIS, PB 240-658.

    2.  The Use of the Hansch Multiple Parameter Method of Structure-
        Activity Correlation to Identify the Hazard Potential of Environ-
        mentally Significant Chemicals: Paul N. Craig & Jon E. Villaume,
        11 pages, 1 Nov. 1974.

    3.  The Use of Classification to Predict the Biological Activity of
        Environmentally Significant Chemicals:  Discriminant Analysis:
        Jon E. Villaume, 14 pages, 17 March 1975.

    4.  Review of paper by R. J. Hickey et al.; 'Ecological statistical
        studies concerning environmental pollution and chronic disease':
        J. H. Waite, 22 pages, August 1975.

    5.  Overview of Alternate Methods for Data Analysis, by John Waite,
        75 pp., April 1976.

    6.  'Models for Biochemical Toxicity', by Genessee Computer Center,
        Inc.: 30 pages, 19 February 1976.  (Performed under subcontract
        to FIRL.)

    7.  A Computer Program to Extract and Display CIDS Keys, by Fein-
        Marquart Associates, Inc.: 208 pages, December 1975.  (Performed
        under subcontract to FIRL.)
B.  Data Files

    1.  Partition coefficient Data Bank from the Pomona College Medicinal
        Chemistry Project.  (Approximately 500 pages of computer print-
        out.)  Includes Sigma Constant Data Bank.

    2.  Computer Program (Hansch-3) for Multiple Parameter Regression
        Analysis.  (About 1,000 punched cards.)

    3.  Computer printouts from Toxicity Data Bank:
        a) A listing ordered by CIDS keys which contains Army Registry
           numbers and oral rat LD50 data in ascending order of toxicity
           for each key (from Cryptanalytic).

        b) A listing ordered by rat oral LD50 values (as log 1/C)
           (from Cryptanalytic).

        c) A listing of compounds for which each key was assigned,
           ordered by key (from Fein-Marquart).
C.  Computer Programs Written for this Project

    1.  Frequency distributions of the Fein-Marquart structure key data
        (24 byte octal numbers) and print programs.

    2.  Extraction of applicable structure keys from the mass of Fein-
        Marquart data.

    3.  Conversion of the 24 byte octal keys into the conventional CIDS
        keys.

    4.  Frequency distributions of C-3 above, and associated print pro-
        grams.

    5.  Generation of the sample toxic substances data base, which
        included:

        a.  Merging (3 data sources) of Fein-Marquart data with FIRL-
            supplied data (CAS number, rat and mouse LD50, Log P,
            Molecular formulae).

        b.  Error detection and correction for above.

        c.  Sort sample data base in ARN order.

        d.  Print sample data base.

    6.  The 'explosion' program, including molecular weight and Log 1/C
        calculations and many associated computer runs, prints, etc.

    7.  A computer program to order Rat Log (Log 1/C) within substruc-
        ture keys and its associated sort, sum, and print programs.

-------
                           TECHNICAL REPORT DATA

 1. REPORT NO.
    EPA-560/1-76-006
 2.
 3. RECIPIENT'S ACCESSION NO.
 4. TITLE AND SUBTITLE
    Analysis and Trial Application of Correlation Methodologies for
    Predicting Toxicity of Organic Chemicals
 5. REPORT DATE
    May, 1976 - Approved date
 6. PERFORMING ORGANIZATION CODE
 7. AUTHOR(S)
    Paul N. Craig and John H. Waite
 8. PERFORMING ORGANIZATION REPORT NO.
    F-C3947
 9. PERFORMING ORGANIZATION NAME AND ADDRESS
    The Franklin Institute Research Laboratories
    Science Information Services Department
    20th & The Benjamin Franklin Parkway
    Philadelphia, Pennsylvania  19103
10. PROGRAM ELEMENT NO.
    2LA328
11. CONTRACT/GRANT NO.
    68-01-2657
12. SPONSORING AGENCY NAME AND ADDRESS
    Office of Toxic Substances
    Environmental Protection Agency
    401 M Street, S.W.   Washington, D. C.  20460
13. TYPE OF REPORT AND PERIOD COVERED
    Final
14. SPONSORING AGENCY CODE
15. SUPPLEMENTARY NOTES
16. ABSTRACT
    An index to the literature on structure-activity correlation methods
    was prepared and is available through NTIS (PB 240-658).  A study of
    each of the major methods was made to determine requirements for
    application to toxicity data.  Simultaneously a study was made of
    available toxicity data and of physical-chemical properties shown to
    be useful in correlation studies.  These evaluations suggested that
    the structural fragments contained in chemical structures should be
    considered in structure-activity relationship studies as well as the
    n-octanol partition coefficients.  The U.S. Army C.I.D.S. computer-
    assigned fragment codes were utilized for this purpose.  A prototype
    toxicity data base was selected from the 1974 Toxic Substances list
    for 687 compounds for which oral LD50 values were reported in the rat
    or mouse.  The use of discriminant and multiple regression analyses
    following preliminary clustering gave useful results, but a new
    extension of the method called "substructural analysis" was used to
    predict the LD50 values in the rat for 21 of 23 test compounds within
    ±1 log unit out of a range of 5 log units.  This method can readily
    be adapted to computer operation, and is recommended for extension to
    other sets of toxicity data.  Independent study of the same data by
    discriminant analysis is also recommended.
17. KEY WORDS AND DOCUMENT ANALYSIS
    a. DESCRIPTORS
       Regression Analysis
       Discriminant Analysis
       Clustering
       Correlations
       Toxicity
       Pattern Recognition
    b. IDENTIFIERS/OPEN ENDED TERMS
       Structure-Activity Relationships
       Substructural Analysis
       Chemical Structure Codes
       Partition Coefficients
    c. COSATI Field/Group
       06/20
       06/04
       12/01
18. DISTRIBUTION STATEMENT
    Release Unlimited
19. SECURITY CLASS (This Report)
20. SECURITY CLASS (This page)
21. NO. OF PAGES
22. PRICE
EPA Form 2220-1 (9-73)

-------
                      APPENDIX D
     REGRESSION EQUATIONS FOR CLASS I AND CLASS II COMPOUNDS
A. Regression Equation for Class I Compounds (See Section 2.7.3.3)

    "Best" Compromise Class I Regression Equation

    Variable       Coefficient     Std. Error       F
    FG112            .783+00          .116         45.9
    MW               .302-02          .000         45.4
    FG120            .318+00          .093         11.7
    HR1R             .336+00          .102         10.9
    NCN=0            .143+01          .483          8.73
    FG83             .212+00          .087          5.95
    GCN5=3          -.257+00          .107          5.80
    FG268R           .198+00          .084          5.53
    FG96            -.214+00          .094          5.16
    A-C=0            .109+01          .501          4.77
    GCN4=C2N        -.742+00          .455          2.66
    (Log P)2        -.878-02          .005          2.60
    FG96R           -.209+00          .149          1.96
    FG144            .162+00          .122          1.76
    Constant         .188+01

    R = .6277
    R2 = .394
    S.E. = .442
    S.D. = .432
    N.V. = 14

B. Regression Equation for Class II Compounds (See Section 2.7.3.3)

    "Best" Compromise Class II Regression Equation

    Variable       Coefficient     Std. Error       F
    MW               .217-02          .001         22.5
    FG80             .617+00          .151         16.8
    Log P           -.782-01          .019         16.5
    FG154R           .618+00          .185         11.1
    FG51R            .634+00          .190         11.1
    FG120            .430+00          .135         10.2
    GCN2=3           .524+00          .198          7.00
    GCN4=C5N1        .647+00          .247          6.85
    GCN5=6           .415+00          .173          5.77
    HR1E             .219+00          .092          5.68
    FG34R           -.545+00          .230          5.64
    DACN=6           .471+00          .202          5.46
    GCN5=5           .598+00          .289          4.27
    FG120R          -.341+00          .189          3.24
    GCN3=C5          .463+00          .258          3.22
    FG24R            .346+00          .226          2.35
    HR2ER           -.216+00          .156          1.92
    GCN2=5          -.236+00          .171          1.91
    GCN4=C10        -.401+00          .291          1.91
    GCN3=C501        .235+00          .208          1.89
    FG178R          -.209+00          .157          1.77
    Constant         .205+01

    R = .6922
    R2 = .479
    S.E. = .490
    S.D. = .466
    N.V. = 21

     Both of these regression equations were derived by stepwise
regression analysis, allowing new variables to be entered as long as
they made significant contributions to the reduction of the residual
variance.
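
     Applying the Class I equation is a matter of summing terms, as the
sketch below shows.  Several points are assumptions rather than
statements from the report: the coefficients are read as a mantissa
followed by a signed power of ten (so .302-02 is .00302), each
structural key enters as a 0/1 indicator of presence, and the
predicted quantity is taken to be log 1/C.  The function names and the
sample compound are hypothetical; the coefficient strings are copied
from the Class I listing.

```python
def parse_printout(text):
    """Read a printout value such as '.302-02' as 0.302 x 10**-2."""
    mantissa, exponent = text[:-3], text[-3:]
    return float(mantissa) * 10.0 ** int(exponent)


# Class I coefficients for the structural keys, as printed in Appendix D.
KEY_COEFFICIENTS = {
    "FG112": ".783+00", "FG120": ".318+00", "HR1R": ".336+00",
    "NCN=0": ".143+01", "FG83": ".212+00", "GCN5=3": "-.257+00",
    "FG268R": ".198+00", "FG96": "-.214+00", "A-C=0": ".109+01",
    "GCN4=C2N": "-.742+00", "FG96R": "-.209+00", "FG144": ".162+00",
}
MW_COEFFICIENT = ".302-02"
LOG_P_SQUARED_COEFFICIENT = "-.878-02"
CONSTANT = ".188+01"


def predict_class_i(keys, mol_weight, log_p):
    """Evaluate the Class I equation; keys not in the equation add nothing."""
    total = parse_printout(CONSTANT)
    total += parse_printout(MW_COEFFICIENT) * mol_weight
    total += parse_printout(LOG_P_SQUARED_COEFFICIENT) * log_p ** 2
    for key in set(keys):
        if key in KEY_COEFFICIENTS:
            total += parse_printout(KEY_COEFFICIENTS[key])
    return total


# Hypothetical compound: one aliphatic halogen key, MW 100, log P 2.0.
print(round(predict_class_i(["FG112"], 100.0, 2.0), 3))
```

Given the reported R2 of .394 for this equation, such a prediction is
best read as a rough estimate.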

-------