EPA-600/4-76-046
October 1976
Environmental Monitoring Series

COMPUTER INTERPRETATION OF POLLUTANT MASS SPECTRA

Environmental Research Laboratory
Office of Research and Development
U.S. Environmental Protection Agency
Athens, Georgia 30601

RESEARCH REPORTING SERIES

Research reports of the Office of Research and Development, U.S. Environmental Protection Agency, have been grouped into five series. These five broad categories were established to facilitate further development and application of environmental technology. Elimination of traditional grouping was consciously planned to foster technology transfer and a maximum interface in related fields. The five series are:

1. Environmental Health Effects Research
2. Environmental Protection Technology
3. Ecological Research
4. Environmental Monitoring
5. Socioeconomic Environmental Studies

This report has been assigned to the ENVIRONMENTAL MONITORING series. This series describes research conducted to develop new or improved methods and instrumentation for the identification and quantification of environmental pollutants at the lowest conceivably significant concentrations. It also includes studies to determine the ambient concentrations of pollutants in the environment and/or the variance of pollutants as a function of time or meteorological factors.

This document is available to the public through the National Technical Information Service, Springfield, Virginia 22161.

by
Fred W. McLafferty
Cornell University
Ithaca, New York 14853

Grant No. R-801106

Project Officer
John M. McGuire
Environmental Research Laboratory
Athens, Georgia 30601

DISCLAIMER

This report has been reviewed by the Environmental Research Laboratory, Athens, and approved for publication.
Approval does not signify that the contents necessarily reflect the views and policies of the U.S. Environmental Protection Agency, nor does mention of trade names or commercial products constitute endorsement or recommendation for use.

FOREWORD

Nearly every phase of environmental protection depends on a capability to identify and measure chemical pollutants in the environment. The Analytical Chemistry Branch of the Athens Environmental Research Laboratory develops techniques for identifying and measuring chemical pollutants in water and soil.

This report describes two computer programs that assist chemists in identifying organic compounds from their mass spectra. One program rapidly selects, from a computer file, spectra that have a high probability of matching the spectrum of an unidentified compound. The other program greatly assists the analyst in postulating the identity of a compound whose spectrum is not in the file. The programs will significantly enhance the assessment of health and ecological effects of organic chemicals and the development and implementation of control measures.

ABSTRACT

The objective of this research was to improve systems for computer examination of the mass spectra of unknown pollutants. For this we have developed a new probability based matching (PBM) system for the retrieval of mass spectra from a large data base, and have substantially improved the interpretation of unknown mass spectra using the self-training interpretive and retrieval system (STIRS). PBM was designed as a prefilter to STIRS; if an unknown mass spectrum can be identified with a sufficiently high confidence by PBM, interpretation of the spectrum using STIRS is not necessary.
The PBM system provides more efficient retrieval than presently accepted systems; it incorporates a “reverse search” algorithm, and through the use of weighted mass and abundance data provides a statistically valid prediction of the confidence of the matches found. STIRS has been improved to give a confidence-level prediction of the presence of approximately 200 particular substructural features in the unknown molecule. Extensive studies have been made to improve the data selection for most data classes used by STIRS, resulting in a much higher level of overall system performance. Operation efficiencies of both PBM and STIRS have been improved dramatically so that both require less than 1 minute on a laboratory PDP-11/45 computer. STIRS has been made available for outside use by long-distance phone connections to this PDP-11/45, and recently both PBM and STIRS have been made operational on the Cornell IBM 370/168 so that these are available internationally over the TYMNET computer network system.

TABLE OF CONTENTS

Foreword
Abstract
List of Figures
List of Tables
Acknowledgments

Sections
I Introduction
II Conclusions
III Recommendations
IV Equipment and Reference Material
V Experimental and Evaluation Phase
VI Discussion
VII References
VIII List of Publications

LIST OF FIGURES

Number
1 PBM Performance for Unknown Mass Spectra of Pure Compounds
2 Effect of the Number of Flagged Peaks on K Values for Unknown LMWS Spectra of Pure Compounds
3 Effect of Structural Matching Criteria on K Values for Unknown LMWS Spectra of Pure Compounds
4 Effect of Structural Matching Criteria and Molecular Ion Information (K+) on K Values for Unknown HMWS Spectra of Pure Compounds
5 Comparative Performance of Retrieval Systems for Unknown Spectra of Pure Compounds
6 Comparative Performance of Retrieval Systems for Unknown LMWS Spectra of Mixtures

LIST OF TABLES

Number
1 Tested Data Classes for Characteristic Ions
2 Arbitrary Categories for Degree of Structural Match
3 Compounds Retrieved by PBM for a Mixture of 90% Methyl n-Octadecanoate + 10% Methyl cis-9-Octadecenoate
4 Compounds Retrieved by the MIT and PBM Systems for a Mixture of 60% 3-Methoxyindazole, 30% Carbon Tetrachloride and 10% tert-Butyl-3-ketobutyrate
5 Compounds Retrieved by the MIT and PBM Systems for a Mixture of 60% 1-(2-Methylcyclohexyl)-3-phenylurea, 30% 1,2'-Binaphthyl, 10% O,O-Dimethyl-O-(4-nitro-m-tolyl)phosphorothioate
6 Recall Ability (%) of STIRS at the 98% Confidence Level for Nine Match Factors
7 Recall and Precision of STIRS MF11 Predictions at the 98% Confidence Level for 161 Other Substructures
8 Substructures of Zero Recall by MF11
9 Performance of Data Classes for 98% Confidence Level Predictions
10 Data Classes Giving Highest Recall Values for Individual Substructures
11 Recommended Data Classes for Characteristic Ions
12 Substructures for which MF11.1 has Substantially Superior Recall

ACKNOWLEDGMENTS

The author is pleased to acknowledge the very substantial research contributions of Drs. Rengachari Venkataraghavan, Gail M. Pesyna, and Henry E. Dayringer, which made possible the results described here, and the very stimulating suggestions and interactions with Dr. John M. McGuire.

SECTION I

INTRODUCTION

IDENTIFICATION OF POLLUTANT MASS SPECTRA

The overall objective of this project was to develop a system for the automatic identification of pollutant molecules utilizing a gas chromatograph/mass spectrometer (GC/MS). The original objective was to improve the Cornell self-training interpretive and retrieval system (STIRS) for automated computer identification and interpretation of mass spectra through extensive testing, statistical evaluation of results, modifications for optimized performance, and evaluation of the effect of data quality.
During the project period, STIRS was used extensively by other United States scientists through a phone-line connection to our laboratory PDP-11/45 computer developed under this grant. This provided dramatic evidence that a large proportion of users primarily needed a matching system; in a substantial proportion of cases, their unknown spectrum did not need to be interpreted because a reference spectrum of the same compound was already in the file. Thus we decided to develop a matching system as a prescreen to STIRS so that the latter would only be used for those unknown mass spectra for which a sufficiently good match could not be found in the reference file.

The two main sections of this report therefore concern the development of PBM and the improvement of STIRS. We have recently prepared an extensive review of this field entitled Computerized Structure Retrieval and Interpretation of Mass Spectra (ref 1), to which the reader is referred for more details on the problems involved and previous computer system proposals.

PBM

In designing PBM, we attempted to utilize principles employed by information retrieval scientists in systems developed for such problems as document retrieval from libraries (ref 2). We thus incorporated a reverse search concept (refs 3, 4), which is especially valuable for unknown spectra of mixtures, and “weighting” of the mass and abundance data used. Determination of these weighting factors was based on the statistical occurrence of the mass and abundance values found on examination of a large comprehensive data base of 18,806 spectra (ref 5).

The PBM concept was initially developed with R. H. Hertel and R. D. Villwock (ref 3) for use with a quadrupole mass spectrometer operated under feedback control by a dedicated microcomputer. However, this system utilized a data base of only 16 or 64 mass spectra, all of which had been determined on the same instrument.
Thus a major problem in the design of the more general purpose PBM system involved the fact that the reference mass spectra had been determined on a wide variety of instruments under diverse experimental conditions, and that the spectra contained many artifacts such as impurity peaks and incorrect mass and abundance data. The solution to this problem was to have the computer repeat the comparison calculation a number of times, each time dropping another selected peak of the reference spectrum (the so-called “flagged peaks”) and retaining only the highest confidence value achieved. Thus a few impurity or aberrant peaks would not eliminate the spectrum as a possible match.

Statistical Basis for PBM

Probability based matching is based on the general rule of multiplication of probability theory (ref 7), which states that if n independent events occur with probabilities $p_1, p_2, \ldots, p_n$, then the probability of all n of these events occurring is given by equation 1.

$\text{overall probability} = \prod_{i=1}^{n} p_i$    (1)

Thus if peaks with masses $m_1$ and $m_2$ having intensities $i_1$ and $i_2$ occur in mass spectra with probabilities $p_1$ and $p_2$, the probability that both occur at random in an unknown spectrum is $p_1 \times p_2$. If this product is small, it is much more likely that the presence of peaks $m_1$ and $m_2$ in intensities $i_1$ and $i_2$ is due to the identity of the unknown spectrum with that of the reference compound. The low value of this probability provides a confidence that an identification is correct, measured by a “confidence value” or “K value.” This measure, as well as all the individual probabilities, is expressed as the corresponding base two logarithm for convenience of calculation; inverse probabilities are also used to simplify the calculations and to produce a final result that is a direct measure of “confidence.”

In this reverse search, there is computed for each reference spectrum matched against the unknown a confidence value, K, equal to the sum of the individual $K_j$ values.
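As a rough illustration of why base-two logarithms of inverse probabilities are convenient, the sketch below (not the report's program; the function name and the probabilities are invented for illustration) shows that summing per-peak values gives the same number as taking the logarithm of the inverse of the product in equation 1.

```python
import math

def confidence_from_probabilities(probabilities):
    """Sum of log2(1/p) over independent events (hypothetical helper)."""
    return sum(math.log2(1.0 / p) for p in probabilities)

# Made-up occurrence probabilities for three matching peaks.
peak_probabilities = [0.05, 0.10, 0.02]

# Adding the per-peak terms ...
k_sum = confidence_from_probabilities(peak_probabilities)
# ... is equivalent to log2 of the inverse of the joint probability.
k_product = math.log2(1.0 / math.prod(peak_probabilities))

print(round(k_sum, 3), round(k_product, 3))
```

The additive form means each matching peak simply contributes its own term, so a rarer peak (smaller probability) raises the total more than a common one.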
$K_j$ is calculated for each peak in the unknown whose intensity agrees within a predetermined range with that of the corresponding peak in the reference spectrum. $K_j$ combines four terms,

$K_j = U_j + A_j + W_j - D$

where $U_j$ is the contribution to the probability of the “uniqueness” of the m/e value of peak j; $A_j$ is the contribution to the probability of the abundance value of the peak as it appears in the reference spectrum; $W_j$, the “window factor,” is a measure of the agreement required between the abundance of the peak in the reference and in the unknown; and D, the “dilution factor” for mixture spectra, is a measure of the overall reduction of peak intensities in the unknown due to the presence of other components (if the unknown spectrum is of a pure compound, D = 0).

A peak in the unknown that does not agree within the window tolerance is ignored in the cumulative calculation of confidence. If it is more intense than would be expected, it is termed “contaminated.” However, peaks of intensities less than the minimum allowed are treated differently than in the earlier PBM system. In that system, all the reference and the unknown spectra were recorded on the same instrument, and the background level was known for each unknown sample; therefore it was assumed that a peak in the unknown could not be of lower relative intensity if that reference compound was present in the unknown. In the present system, on the other hand, which uses a large data base of spectra from diverse sources, this could be true because of experimental variation or even impurities in the reference compound; thus a limited number of less intense peaks are “flagged” in the match to ignore this discrepancy.

The assumption that mass spectral peaks are independent events, which is essential to the rigorous application of the general rule of multiplication, is of course far from exact for many mass spectral peaks.
For example, it is much more common to find m/e 41 in a spectrum containing an abundant m/e 57 peak. It would also be expected that the molecular ion and other high mass peaks would show less cross-correlation, and so these are given extra preference.

STIRS

The first major improvement was also suggested by the phone-line operation. Information on the type of molecule giving the unknown mass spectrum is obtained by examining the reference compounds selected by STIRS as the most closely related to see if these contain common structural features; thus if 10 of the top 15 compounds contained an imidazole ring, this would indicate with relatively high probability that the unknown also contained imidazole. Obviously the significance of this observation would be much greater if only 0.1 percent of the reference compounds in the file contained imidazole than if, for example, 10 percent of them did. Thus for this improvement the computer examines the top selected compounds for the presence of a variety of selected substructures, and calculates the probability that the number found could be due to a random selection instead of to the actual occurrence of the substructure in the unknown molecule.

The successful development of such a quantitative evaluation procedure for STIRS performance with a wide variety of substructures then also provides a method to see if STIRS modifications actually improve performance. In the basic operation of STIRS, the mass spectral data of a particular type for the unknown are compared against the corresponding data of all the reference mass spectra to find those compounds of best match. The original selection of these data classes was based on mass spectrometry knowledge and intuition, plus relatively qualitative performance tests.
The substructure quantitative results thus allow the evaluation of changes in the selection of data class according to the number of peaks, whether they are odd- or even-electron ions, the extent of the mass range, overlapping of mass ranges, various combinations of data classes, and special series of mass spectral peaks. The ranges are incorporated to optimize the “recall”: the proportion of the examined compounds actually containing a particular substructure for which STIRS was able to identify that substructure with greater than 98 percent confidence.

Basic Principles of STIRS

A number of classes of mass spectral data known to have high structural significance, such as characteristic ions, series of ions, and masses of neutrals lost, are identified; for each class the computer matches the data of the unknown mass spectrum against the corresponding data of all reference spectra. In each data class the reference compounds whose spectra have the highest “match factor” (MF) values are examined by the chemist for any common structural features, with a high frequency of occurrence indicating a high probability that the structural feature is present in the unknown.

Substructure Identification

The 15 selected compounds of highest MF value are examined by the computer for the presence of specific substructural groups to provide a statistical evaluation of the probability of the presence of each group in the unknown compound. The principle used for this statistical evaluation is random event theory (refs 8, 9). In a manner similar to calculation of the odds in random drawing from a collection of colored balls, the probability of selecting a compound at random containing any given substructure can be calculated knowing only the percentage in the file of compounds that contain such a substructural unit.
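The random-draw calculation can be sketched as a binomial probability. The code below is only an illustration, not the STIRS code: the function names are hypothetical, and it assumes that the most probable number of hits (15 times the file fraction) is rounded to an integer before the ratio is taken. The report's own phenyl example uses the same inputs (file fraction 0.284, 10 hits) but evaluates the denominator at the non-integer value 4.2, so the ratio printed here need not agree exactly with the figure quoted in the text.

```python
from math import comb

TOP_N = 15  # number of best-matching compounds examined

def p_random(n_found, x, top_n=TOP_N):
    """Binomial probability that n_found of top_n randomly drawn compounds
    contain a substructure present in a fraction x of the file."""
    return comb(top_n, n_found) * x**n_found * (1 - x) ** (top_n - n_found)

def relative_frequency(n_found, x, top_n=TOP_N):
    """Ratio of P(n_found) to P(most probable number).
    Assumption: top_n * x is rounded to the nearest integer."""
    n_expected = round(top_n * x)
    return p_random(n_found, x, top_n) / p_random(n_expected, x, top_n)

# Substructure present in 28.4% of the file, found in 10 of the top 15.
ratio = relative_frequency(10, 0.284)
print(f"about 1/{1 / ratio:.0f}")
```

A small ratio means that finding so many substructure-containing compounds among the top 15 is unlikely to be a chance result.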
The probability P that any particular number N of compounds containing a given substructure out of 15 compounds has been drawn at random is

$P(N) = \dfrac{15!\, x^{N} (1 - x)^{15 - N}}{N!\,(15 - N)!}$

where x is the decimal fraction of the file compounds with the substructure. To evaluate the importance of this probability it is compared to that of the selection of the most probable number. (This is a more conservative estimate than comparing it to all more probable events; the difference is not large, especially for substructures of infrequent occurrence, and the precision of this estimate is also evaluated experimentally.) The most probable number of compounds containing the specific substructure to be drawn out of 15 is 15x, so the relative frequency of finding N compounds in the top 15 is P(N)/P(15x). A ratio of 1/200 predicts that on a random basis STIRS would retrieve the most probable number (15x) of compounds with the substructure 200 times (on average) before it retrieves N compounds with the substructural unit. Thus if STIRS does retrieve N compounds with the substructure, only 1 time in 200 will this be due to chance, so that this result gives a 99.5% confidence that the substructure is actually in the unknown.

Consider as an example an unknown compound analyzed by STIRS match factor 11 for the presence of the phenyl group. Based on the fact that 28.4% of the compounds in the reference file contain phenyl, if the unknown does not contain phenyl an average of 4.2 phenyl-containing compounds would still be expected in the top 15 compounds selected as matches in MF11. If 10 of the top 15 compounds actually contain phenyl, the probability that this occurred at random is P(10)/P(4.2) = 1/113; that is, this result indicates with >99% confidence that phenyl is in the unknown.

To evaluate the method, it has been applied first to 204 substructures commonly found in organic compounds. These were chosen to test STIRS’ ability to identify a broad range of structural features and would be expected to vary widely in their mass spectral behavior. For example, the carbonyl group, in contrast to its characteristic effect on the infrared spectrum, and despite its strong directing effect on mass spectral decompositions, does not produce peaks of masses unique to itself. Even in the case of functional groups which do give characteristic peaks, such as the terminal benzoyl group’s ions at m/e 105 and 77, the abundance of such peaks can be greatly reduced by the presence in the molecule of an additional functional group which directs the fragmentation more strongly, such as 2-NH2C6H4CO-. Thus even if the presence of a particular ion (or ions) in the mass spectrum is reliable evidence of the presence of a particular substructure, the absence of that ion does not necessarily show that the group is absent. Thus, although most mass spectral learning machine methods (ref 18) are designed to give “yes/no” answers on the presence of structural features, we have not attempted to obtain negative information from STIRS, restricting our consideration of STIRS substructure predictions to those of 98% confidence.

To evaluate these predictions, a large spectral collection (refs 6, 17) has been used to gather statistical data on the “information precision” and “recall” of the results. The definitions of these terms are patterned after those of terms used by information scientists to evaluate the efficiency of, for example, a document retrieval system. Because the term “precision” has a somewhat different meaning to chemists, we will use the modified term “information precision” to mean the percentage of substructure predictions which are actually correct. The “recall” value is the percentage of compounds actually containing the substructural group for which the group is identified by STIRS.
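The two measures just defined can be written out as a short sketch. The data layout (a list of predicted/actual pairs), the function names, and the sample outcomes are assumptions made for illustration; they are not taken from the STIRS evaluation programs.

```python
def information_precision(outcomes):
    """Percent of substructure predictions that are correct.
    outcomes: list of (predicted, actually_present) booleans."""
    made = [actual for predicted, actual in outcomes if predicted]
    return 100.0 * sum(made) / len(made) if made else None

def recall(outcomes):
    """Percent of compounds containing the substructure for which
    the substructure was predicted."""
    present = [predicted for predicted, actual in outcomes if actual]
    return 100.0 * sum(present) / len(present) if present else None

# Made-up results for one substructure over six test unknowns.
outcomes = [
    (True, True), (True, True), (True, False),
    (False, True), (False, True), (False, False),
]
print(information_precision(outcomes), recall(outcomes))
```

Because only positive predictions are counted, a substructure that is never predicted has undefined information precision (returned as None here) and zero recall.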
A more detailed discussion of the usefulness of such precision and recall values in evaluating mass spectral analysis systems will be presented separately.

Data Class Improvements

To improve the frequency of correct answers (recall) and reliability of answers (information precision) for the extraction of substructural features by STIRS, we have tested the following four approaches, embodied in the characteristic ion data classes shown in Table 1:

(1) Variation of the number of ions used in matching for each mass range;
(2) Forcing the use of both even- and odd-electron ions instead of using the largest peaks regardless of mass;
(3) Use of additional mass ranges overlapping the original set (for example, beside m/e 6 - 88, MF2A, and m/e 89 - 158, MF3B, including m/e 61 - 116, MF3A);
(4) Arithmetic combinations of the MF values found by several data classes (this approach was previously shown (ref 10) to be useful in the “overall match factor”, MF11, a combination of MF1 through MF6).

TABLE 1. TESTED DATA CLASSES FOR CHARACTERISTIC IONS

Data class    Maximum number of peaks      Mass range
2 (a,b)       3 even-mass, 3 odd-mass      6 - 89
3 (b)         5                            90 - 149
4 (b)         5                            150 - (M - 1)
2a (c)        5 even-mass, 5 odd-mass      6 - 89
2a' (c)       10                           6 - 89
2A            4 even-mass, 4 odd-mass      6 - 88
2A'           8                            6 - 88
3A            8                            61 - 116
3B            8                            89 - 158
4A            8                            117 - 200
4B            8 (4, 6) (d)                 159 - 270
4C            8 (5) (d)                    201 - (M - 1)
11.1          2A + 3A + 3B + 4A + 4B + 4C
11.2          2A' + 3A + 3B + 4A + 4B + 4C

(a) The degree to which the class 2 data of the unknown spectrum match those of the reference is given by STIRS as the “match factor 2 (MF2)” value.
(b) Used in the original STIRS program (ref 10).
(c) This data class was tested with only 20 of the most commonly occurring substructures.
(d) Tests were also made with alternative maximum number(s) of ions.

SECTION II

CONCLUSIONS

Even in their present stage of development, we believe that PBM and STIRS constitute the most powerful combination system available for the computer examination of unknown mass spectra.
PBM is the most rapid and efficient system for matching an unknown spectrum against a large and diverse reference file. This system should be especially useful for routine identification of unknown pollutants because its “reverse search” is uniquely applicable to mixtures, and because it provides a confidence level measurement of the probability that the match is correct, utilizing several “match classes” of the degree of structural similarity of the unknown and reference compound. For those unknown spectra that cannot be matched with a sufficiently high degree of confidence by PBM, STIRS can often provide partial or even complete information on the molecular structure with a direct evaluation of the confidence. Because STIRS is meant to be an aid to, not a replacement for, the trained mass spectrometrist, the best answer in difficult cases will be achieved through human interpretation of computer results. In many instances, STIRS provides structural information that was not discerned by the trained mass spectrometrist, and the speed with which STIRS information with quantitative probabilities can now be obtained should mean that its routine use would provide substantially increased confidence and efficiency in the results of human interpretation.

SECTION III

RECOMMENDATIONS

It is strongly recommended that the combined PBM/STIRS system be used and evaluated extensively on real problems by EPA mass spectrometrists. A few people should make a sincere effort to apply PBM and STIRS to every unknown spectrum encountered over, for example, a 1-month period. Their feedback to us would be invaluable for further improvements, and these key people could also train others in the future.

We feel strongly that there still is a very real education problem in mass spectrometry. The author still gives a basic course on interpretation of mass spectra to hundreds each year, including many from EPA laboratories.
Education in the basic principles and in new developments such as computer examination of mass spectra is a continuing problem; although this is not unique to mass spectrometry, it deserves the careful attention of all parties in the field.

Further improvements should be valuable for both the PBM and STIRS systems. “Real-time PBM” should be incorporated directly into the GC/MS/computer systems now used routinely for pollutant identification in major EPA labs. A possible system for this would involve computer collection and reduction of the GC eluent mass spectrum every 2 seconds; PBM matching of the most recently acquired spectrum against a data base would be run by the computer as a background operation during the next 2-second mass spectral collection. With proper GC/MS calibration, the computer could even calculate the quantity of identified components. Thus for complex pollutant samples, the GC/MS/computer system could give direct identification and quantitation of many components, greatly reducing the burden on the operator and mass spectrometrist.

SECTION IV

EQUIPMENT AND REFERENCE MATERIAL

Almost all of the experimental work was carried out on a Digital Equipment Corporation PDP-11/45 and -11/10 dual processor configuration with 28K and 24K of core memory, respectively, connected with a special bus window to make all peripherals addressable by either processor, removable disks of 1.2 M, 1.2 M and 29 M word storage capacity, DEC-GT/40 cathode ray tube display, printer, plotter, 9-track IBM-compatible magnetic tape drive, and telephone modem link for dial-up use by outside investigators. Most programs were in assembler language. Both PBM and STIRS performance evaluations were made by running approximately 400 unknown mass spectra for each item under investigation, which often required continuous runs of many days to achieve statistically meaningful data.
Evaluation of the STIRS system by outside users was done through a modem and interface for telephone processing of the submitted unknown spectrum. Users were contacted by both phone and letter.

DATA BASE

The Registry of Mass Spectral Data, representing 18,806 different compounds (ref 6), was used for STIRS, and these data plus 5,073 lesser-quality spectra of some of those compounds were used in the creation of the PBM library. The most recent work has been done with an expanded data base of >35,000 mass spectra. Although a large number of errors had been eliminated during the original preparation of the data base, checking of individual cases of poor results showed that a significant number remained.

PBM U AND A VALUES

The Registry file of mass spectra of 18,806 different compounds was used to determine the probabilities of occurrence of the mass values of peaks of 1% abundance (ref 5). Although the uniqueness of peaks fluctuates substantially for m/e 29 - 114, as expected, there is a linear decrease in occurrence probability above m/e 114, being reduced by half approximately every 58 mass units. A data base of much higher molecular weight should be reduced linearly by a substantially smaller factor, approaching a halving every 130 mass units; these probabilities were used for the PBM “U” values. The abundance values for masses above m/e 120 were found to show a surprisingly constant distribution which is log normal; for lower abundance values this distribution is dependent on and predictable from the molecular weight range. The resulting data were used for the PBM “A” values.
PBM CONDENSED REFERENCE FILE

The following metastable, multiply charged and impurity peaks are eliminated from each reference spectrum before it is condensed: all peaks having non-integral masses, peaks at m/e 18, 28 and 32 which may be due to water and air, and peaks found at masses higher than those in the molecular ion cluster, with the limit defined as the molecular weight + 3 + 2(# of Cl atoms + # of Br atoms) + 1/2(# of S atoms + # of Si atoms). If the compound does not contain elements other than the most common ones, and if its molecular weight is greater than 50 amu, peaks due to the illogical neutral losses of 4 - 12 amu and 21 - 23 amu are also excluded, plus the loss of 18 amu if the compound does not contain oxygen, the loss of 19 amu if no oxygen or fluorine, the loss of 20 if no fluorine, and the losses of 13, 14, 24 and 25 amu if no chlorine or bromine (in the latter case these may be isotopic ions of losses of 15, 16, 26 or 27).

The reference spectrum is renormalized if necessary and the U value of each peak is determined. All peaks below mass 29 are arbitrarily assigned U values of 1, a value which is low enough to actively discriminate against the selection of these peaks but still permits them to be used if the spectrum contains no peaks or very few peaks at higher m/e values.

The peak abundance percentages have been divided into standard ranges assigned to specific A values (ref 5). For the reference spectrum the A value of each peak is determined by the range into which its abundance falls. Thirty-two of the amu values have abundance probability distributions significantly different from the standard distribution, so that special abundance ranges must be used for these A value determinations.

All peaks in the spectrum are ordered by decreasing U + A values; within each set of peaks having the same value of U + A, the peaks are ordered on the basis of decreasing m/e values.
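The first cleanup rules and the ordering step can be sketched as follows. This is a simplified illustration under stated assumptions: the peak layout (dicts with keys "mz", "u" and "a"), the function names, and the sample U and A numbers are all hypothetical, and the neutral-loss exclusions described above are deliberately left out.

```python
def upper_mass_limit(mol_weight, n_cl=0, n_br=0, n_s=0, n_si=0):
    """Highest m/e retained: MW + 3 + 2*(Cl + Br) + (S + Si)/2."""
    return mol_weight + 3 + 2 * (n_cl + n_br) + 0.5 * (n_s + n_si)

def clean_and_order(peaks, mol_weight, **atom_counts):
    """Drop non-integral masses, m/e 18, 28 and 32, and masses above the
    limit; then order by decreasing U + A, ties by decreasing m/e."""
    limit = upper_mass_limit(mol_weight, **atom_counts)
    kept = [
        p for p in peaks
        if float(p["mz"]).is_integer()
        and p["mz"] not in (18, 28, 32)
        and p["mz"] <= limit
    ]
    return sorted(kept, key=lambda p: (-(p["u"] + p["a"]), -p["mz"]))

# Made-up peaks for a compound of nominal mass 152 with four Cl atoms.
peaks = [
    {"mz": 117, "u": 5, "a": 3},
    {"mz": 119, "u": 5, "a": 3},   # same U + A as m/e 117; higher mass first
    {"mz": 82, "u": 4, "a": 2},
    {"mz": 28, "u": 1, "a": 1},    # possible air peak, removed
    {"mz": 58.5, "u": 6, "a": 4},  # non-integral mass, removed
    {"mz": 170, "u": 7, "a": 5},   # above the molecular ion cluster, removed
]
ordered = clean_and_order(peaks, 152, n_cl=4)
print([p["mz"] for p in ordered])
```

Sorting on a tuple of negated values gives both descending orders in a single pass, which keeps the tie-breaking rule explicit.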
The 15 peaks at the top of this ordered list are checked for the presence of the base peak, for the most abundant isotopic peak in the molecular ion (M+) cluster, and for the peak (or two peaks if M+ is not present) corresponding to the neutral loss(es) of 18, 20, 27, 28, 30, 32, 34, 36, 42, 44, 46, 48, 56, 60 or 64 amu having the largest U + A value(s). If any of these three are not already included in the list, they are substituted for the peaks of lowest U + A value.

For each reference spectrum the serial number, lowest recorded mass in the spectrum, M+, and the values of m/e, abundance, and U + A for the 15 selected peaks are packed into 32 computer words and stored in a file which occupies 1494 blocks of 512 16-bit words each of disk storage. The disk file structure is optimized to reduce access time.

STIRS CONDENSED REFERENCE FILE

The STIRS file data were prepared as described earlier (ref 10), except for the information on substructures and new match factors. The WLN notation of each compound was used to assign a computer bit fragment code for each of the 204 substructures examined; the linear notation of WLN is particularly appropriate for such mass spectral relationships because its units often correspond to the pieces of the molecule giving rise to spectral peaks. Note that only abbreviated definitions are given in the Tables; for example, the class “U” includes most double and triple bonds, but not carbonyl (V) and phenyl (R) groups. These groups were taken mainly from the “Dictionary of Frequently Found Substructures” (ref 22); possible alternative WLN notations (for example, “2O” as well as “O2” indicates ethoxy) have been tabulated (refs 16, 17).

For each MF data class a count of the number of occurrences of each of the 204 substructures was made using the pre-generated fragment bit file to give the respective values of the decimal fraction x.
This was used to calculate the value of N, the number of compounds of the top 15 containing the particular substructure necessary to give P(N)/P(15x) ratios of 1/50, 1/200, 1/500, and 1/1000 (confidence levels of 98%, 99.5%, 99.8%, and 99.9%). For the new data classes MF2-4C, the mass ranges chosen were based on the frequency of occurrence or "uniqueness" of mass values (ref. 5), the limits of the ranges being selected to correspond to masses of high uniqueness. Because the occurrence frequency decreases at higher masses, the width of the ranges was increased for data classes of higher mass.

SECTION V

EXPERIMENTAL AND EVALUATION PHASE

PBM System Design

Unique features of PBM include "reverse search" (refs. 3, 4) and the "weighting" of the m/e and abundance values of the mass spectral peaks. The system was designed in particular to emphasize high information precision and retrieval, as those unknowns for which only low confidence matches can be achieved usually must also be examined by a mass spectrometrist or an interpretive algorithm such as STIRS.

PBM Search Algorithm

For each reference spectrum the search algorithm begins by examining the unknown for the presence of the reference peaks from highest to lowest m/e values. If a peak is not present it is flagged; if the number of missing peaks exceeds the number of allowed flagged peaks the program proceeds to the next reference spectrum. If reference peak j is found in the unknown (j = 1, 2, ..., 15), the ratio, p_j, of its abundance in the unknown to its abundance in the reference is calculated, and if p_j is less than the specified "minimum percent component" (which for these studies was 10% for pure compounds and 1% for mixtures unless otherwise specified), peak j is flagged; p values are calculated for all such reference peaks unless the maximum number of flagged peaks is exceeded. The smallest p not associated with a flagged peak (p_min) is determined, and the confidence value K is calculated for this peak.
p_min specifies the smallest percentage of this reference compound which could be present in the unknown sample and thus directly determines the dilution factor, D. The product of the abundance of each reference peak and p_min is the abundance expected for that peak in the unknown spectrum. The reference abundance also determines the window tolerance that is demanded of the match. For these studies a ±30% tolerance was permitted for peaks of abundances >9%, ±39% for 3.4-9% peaks, ±46% for 1-3.4% peaks, ±51% for 0.24-1% peaks and ±71% for peaks less than 0.24%; this gives eight "abundance bins," so W = 3. The expected abundance of the unknown peak is set at the bottom of the ±x% window and the top of the window is determined. If the actual abundance of the unknown peak falls within these limits, K_j is calculated for it from U_j + A_j (determined by the peak in the reference spectrum) - D + W, and added to the accumulated K value. If the abundance of the peak in the unknown spectrum is higher than the top of this window, it is termed "contaminated," and K_j = 0. (From the definition of p_min, the abundance of the peak will never fall below the window.) After the entire set of reference peaks has been examined, one factor of W is subtracted from the K value because the peak which gave rise to p_min is guaranteed to fall within the window.

The K value resulting from the match is compared with the threshold K value (an optional threshold, 25 being used in this study); if K is smaller than the threshold it is not stored as one of the results. Otherwise the "percent contamination" is calculated, and if it does not exceed a specified maximum, the K value is stored. The "percent contamination", which is only an estimate of the true value, is calculated using the ten peaks of highest U + A in the unknown and is based on the proportion of the peak abundances which fall above the predicted abundance windows. If a maximum percent contamination less than 100% (i.e., none of the 10 unknown peaks are contained in the reference spectrum) is specified, reference compounds of molecular weights less than the masses of any of the ten peaks are not examined. Unless otherwise specified, for this study the maximum percent contamination was set at 20% for pure compounds and 100% for mixtures.

If the number of flagged peaks allowed has not been reached, the peak which gave rise to p_min is flagged and dropped from consideration, the next lowest value of p_min substituted, and the matching algorithm is executed again. This determines a new K value for this reference spectrum which is stored in place of the previous value if it is higher. When no more flagged peaks are allowed, the next reference spectrum is examined. In the studies reported here a maximum number of three flagged peaks was allowed for pure compounds and two for mixtures unless stated otherwise.

With each K value reported, a ΔK value is also calculated and displayed. The ΔK value is the difference between the K value found and the maximum value that could have been achieved by a perfect match with the reference spectrum.

Methods for Evaluating PBM Performance

A "low molecular weight set" (LMWS, MW 144-160 amu) and a "high molecular weight set" (HMWS, MW 232-312 amu) of unknown spectra were created for testing PBM's performance on the spectra of both pure compounds and mixtures. The sets of "pure" spectra were composed of 433 and 415 spectra (LMWS and HMWS, respectively) which are other spectra of compounds represented in the 18,806 spectra of the data base. These test spectra represent all those available in the "duplicate" portion (serial numbers 18,807-23,879) in the Wiley magnetic tape (ref. 6) except that a small number of spectra of impure and isotopically labeled compounds were excluded.
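Returning briefly to the search algorithm described above, one pass of the reverse-search scoring for a single reference spectrum might be sketched as follows. This is a hedged illustration, not the program used in the study. The names `score_reference`, `tolerance` and the `dilution` argument are hypothetical; the text does not give the formula for D, so the default log2(1/p_min) is purely an assumption; the window top is taken as expected × (1 + t)/(1 − t), one possible reading of "the expected abundance is set at the bottom of the ±x% window"; and the repeat passes that drop the p_min peak are omitted.

```python
from math import log2

# (lower bound of reference abundance in %, fractional window tolerance)
TOLERANCES = [(9.0, 0.30), (3.4, 0.39), (1.0, 0.46), (0.24, 0.51), (0.0, 0.71)]
W = 3  # eight abundance bins, so W = 3


def tolerance(ref_abundance):
    """Window tolerance selected by the reference peak abundance (%)."""
    for lower, tol in TOLERANCES:
        if ref_abundance > lower:
            return tol
    return TOLERANCES[-1][1]


def score_reference(reference, unknown, min_component=0.10, max_flagged=3,
                    dilution=lambda p_min: log2(1.0 / p_min)):
    """Return the accumulated K for one reference, or None if rejected.

    reference: list of (mass, abundance_percent, u_plus_a)
    unknown:   dict mapping mass -> abundance_percent
    dilution:  ASSUMED form of D; the source only says p_min determines D.
    """
    ratios = {}
    flagged = 0
    # Examine reference peaks from highest to lowest m/e.
    for mass, abund, _ in sorted(reference, reverse=True):
        p = unknown.get(mass, 0.0) / abund
        if p < min_component:          # missing or too weak: flag the peak
            flagged += 1
            if flagged > max_flagged:
                return None            # go on to the next reference spectrum
        else:
            ratios[mass] = p
    if not ratios:
        return None
    p_min = min(ratios.values())
    d = dilution(p_min)
    k = 0.0
    for mass, abund, u_plus_a in reference:
        if mass not in ratios:
            continue
        expected = abund * p_min       # bottom of the window
        t = tolerance(abund)
        top = expected * (1 + t) / (1 - t)
        if unknown[mass] <= top:
            k += u_plus_a - d + W      # K_j = U_j + A_j - D + W
        # otherwise the peak is "contaminated" and contributes K_j = 0
    return k - W                       # the p_min peak always fits its window
```

With a hypothetical three-peak reference `[(100, 100.0, 10), (85, 50.0, 8), (57, 20.0, 6)]` and an unknown `{100: 50.0, 85: 25.0, 57: 40.0}`, the peak at m/e 57 lies above its window and is treated as contaminated, while the other two contribute to K.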
A spectrum in the "Registry" file which is of one of these compounds was combined with two others in the ratio of 60:30:10 to create sets of synthetic mixture spectra containing 102 and 80 spectra, respectively, in the LMWS and HMWS.

To analyze the results obtained by the retrieval system, two parameters taken from the field of document retrieval (ref. 2) are defined: the recall, or the proportion of all possible matches which are actually retrieved, and the information precision, or the proportion of the retrieved spectra which are matches. These recall-precision pairs are computed at various retrieval levels, e.g., at particular K value levels. The trade-off between recall and precision is evident when recall is plotted on the x-axis and precision on the y-axis of a plot such as that in Figure 1. An ideal retrieval system would maintain a precision of 1.0 at every level of recall; i.e., all matches and no mismatches are retrieved with high K values. A system whose recall-precision plot more closely approaches this ideal plot is the better system.

The precision achieved will depend on the degree to which the structure of the retrieved reference spectrum is required to match that of the unknown; for example, mass spectra are not sensitive to optical isomerism, and so such a requirement should not be imposed. These studies used the four somewhat arbitrary classifications shown in Table 2.

The molecular ion is uniquely characteristic, being especially valuable for differentiating the spectra of homologs. Thus a retrieved compound whose molecular ion is present and used in matching is given a special "+" notation. For these the degree of match is designated by "K+" and "ΔK+" values and the match classes by I+ through IV+. The K+ recall values shown are based only on the number of possible matches which contained a molecular ion; this represented 82.8% of the total possible matches for the HMWS.
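The two retrieval measures defined above can be computed at any chosen K threshold. The following is a minimal sketch; the function name `recall_precision` and the input layout (a list of K values paired with a match flag) are assumptions made for illustration only.

```python
def recall_precision(retrieved, total_possible_matches, threshold):
    """Recall and information precision at a given K threshold.

    retrieved: list of (k_value, is_match) pairs for retrieved spectra
    total_possible_matches: number of matches that exist in the file
    Returns (recall, precision); precision is None if nothing is retrieved.
    """
    kept = [is_match for k, is_match in retrieved if k >= threshold]
    hits = sum(kept)
    recall = hits / total_possible_matches
    precision = hits / len(kept) if kept else None
    return recall, precision


if __name__ == "__main__":
    # Hypothetical retrieval results: (K value, is it a true match?)
    results = [(90, True), (60, True), (40, False), (30, True), (20, False)]
    for k in (50, 25):
        print(k, recall_precision(results, total_possible_matches=4, threshold=k))
```

Raising the threshold in this toy data moves the result from (0.75, 0.75) to (0.5, 1.0), which is the recall-precision trade-off the text describes.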
To compare the performance of the PBM system with other retrieval systems, each spectrum in the file of 23,879 spectra was abbreviated using the two largest peaks in every 14 amu interval from m/e 6 to the highest recorded mass value, and the retrieval system designed by Biemann et al. (ref. 11) was implemented on the PDP-11/45 computer. The only difference between the system tested here and that described (ref. 11) is the elimination of the prefilter which specifies that the total abundance of homologous series of ions must be similar in the unknown and reference spectra; this should only result in a slower search and should not appreciably lower (it could improve) the matching capabilities. Exactly the same trial "unknown" mass spectra were tested on this and the PBM systems.

TABLE 2. ARBITRARY CATEGORIES FOR DEGREE OF STRUCTURAL MATCH

Class of match | Relationship of reference and "unknown" structures | Example of ref. compounds matching cis-1,4-dimethylcyclohexane as the unknown
I   | Identical compound or stereoisomer | trans-1,4-dimethylcyclohexane
II  | Class I or ring positional isomer | 1,3-dimethylcyclohexane
III | Class II or a homolog | diethylcyclohexane
IV  | Class III or an isomer of a class III compound formed by moving only one carbon atom | trimethylcyclopentane

Methods for Evaluating STIRS Performance

The search algorithm for STIRS has already been described in detail (ref. 10), and the computer program for STIRS is available through DECUS, Maynard, Massachusetts 01754. Testing of the validity of the random drawing model for STIRS operation was performed by running randomly selected sets of spectra and calculating the individual probabilities for the presence of all 204 substructures based on the 15 highest MF value compounds selected for each data class. The compounds used as unknowns for a particular substructure were selected at random by taking every fiftieth compound in the file starting at number 125, a total of 373 spectra. If this gave less than 30 compounds containing the substructure, additional spectra of such compounds were selected at random to give 30 (or the total number in the file if smaller). For 20 of the more abundant substructures STIRS runs were made on two other sets of 373 spectra starting at numbers 130 and 135; the results from the three data sets were the same within experimental error.

A tally of the number of correct and incorrect answers at the 98, 99.5, 99.8, and 99.9% confidence levels was obtained for each MF and substructural group; only those groups predicted by STIRS as being present with a confidence above the required level were examined to see if they were actually present in the unknown compound. The recall value is the percentage of compounds containing the substructure in which it was correctly identified. The precision should reflect the occurrence of correct answers relative to the total of those correct and wrong; because only positive answers are considered (vide supra), only compounds containing the substructure can give correct answers, and only compounds which do not contain the substructure can give wrong answers. To compensate for the difference in the number of possible correct and wrong answers in the spectra examined, the precision was calculated as the recall value divided by the sum of this value and the percentage of compounds not containing the substructure in which STIRS indicated its presence.
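The normalized precision just described can be written as a short function. This is a sketch only; the name `stirs_recall_precision` and its argument names are hypothetical, and the numbers in the usage lines are the 373-compound example given in this report.

```python
def stirs_recall_precision(n_right, n_possible_right, n_wrong, n_possible_wrong):
    """Recall and normalized precision for one substructure.

    recall     = right answers / compounds that contain the substructure
    false rate = wrong answers / compounds that do not contain it
    precision  = recall / (recall + false rate)
    Returns (recall, precision); precision is None if STIRS gave no answers.
    """
    recall = n_right / n_possible_right
    false_rate = n_wrong / n_possible_wrong
    total = recall + false_rate
    precision = recall / total if total > 0 else None
    return recall, precision


if __name__ == "__main__":
    # 70 of 373 compounds contain the substructure; STIRS reports it for
    # 35 of those 70 and for 3 of the remaining 303.
    recall, precision = stirs_recall_precision(35, 70, 3, 303)
    print(f"recall = {recall:.0%}, precision = {precision:.0%}")
```

Dividing each count by the number of possible answers of that kind keeps the precision from depending on how many of the tested compounds happen to contain the substructure.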
\[ \text{Precision} = \frac{\dfrac{\text{number of right answers}}{\text{possible right answers}}}{\dfrac{\text{number of right answers}}{\text{possible right answers}} + \dfrac{\text{number of wrong answers}}{\text{possible wrong answers}}} \times 100 \]

Thus for a case in which 70 of the 373 tested compounds contained the substructure, and STIRS indicated that 35 of these 70 and 3 of the remaining 303 compounds contained the substructure, the recall would be 35/70 or 50% and the precision 0.5/(0.5 + 3/303) or 98%.

SECTION VI

DISCUSSION

PERFORMANCE OF THE PBM SYSTEM

A discussion of the PBM system in much greater detail can be found in the Ph.D. thesis of Dr. G. M. Pesyna (ref. 12). The general behavior of the PBM system is shown by Figure 1; note that for these tests the spectrum of the "unknown" was not present in the data base, because the resulting "perfect match" would give an unrealistic bias to the evaluation. As found for other retrieval systems, increasing the matching criteria of K or ΔK values increases the precision of the results but decreases the proportion of unknown spectra for which matching compounds can be recalled. The results of the LMWS and HMWS are qualitatively similar; the poorer performance with increasing molecular weight is consistent with the increasing number of possible compounds of a given molecular weight which must be differentiated using the same number of peaks.

The precision achieved can be compared to the values of K on the basis of the original definition (ref. 3) of the "confidence value" as the base 2 logarithm of the number of spectra of other compounds which would have to be selected at random in order to find one which matches the unknown spectrum as well as does the reference spectrum in question. Because the reference file used (ref. 6) contains approximately 2^15 spectra, a matching criterion of K = 15 should select only one wrong answer on a random basis.
If there is one correct answer in the file which is also retrieved, this would correspond to a precision of 50%; in the same way a criterion of K = 20 should select a wrong answer only one time in 32, or about 97% precision. Obviously much higher K values are actually required to attain such precision values (ref. 3) (Figure 1). In substantial part this discrepancy results from the recognized inadequacy of the original assumption that the statistical uniqueness of a particular peak is independent of the presence of other peaks in the spectrum. It is not surprising that particular combinations of masses, such as the "ion series", occur much more frequently than predicted from equation 1 using the probabilities of the individual masses. This will be illustrated below in the discussion of classes of matches.

It should also be noted that the precision values would be substantially higher if only the reference spectrum matched with the highest K value had been considered as the answer (instead of all spectra of K greater than a particular value), which is the more probable way that PBM will be used in practice. Thus if there are actually several reference spectra of the same compound as the unknown, this would substantially increase the probability that one of these reference spectra would be found to have the highest K value. Of course much higher precision would be found using only reference spectra obtained under the same experimental conditions as used for the unknown, but limitation to such a file would be a serious handicap for a completely unknown spectrum.

[Figure 1. PBM performance for unknown mass spectra of pure compounds (precision plotted against recall). Curves: LMWS, ΔK values, maximum percent contamination (MPC) = 20 and MPC = 70; HMWS, ΔK values, MPC = 20 and MPC = 70; HMWS, K values, MPC = 20.]
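The idealized relation between the confidence value K and the precision discussed above (50% at K = 15, about 97% at K = 20) follows from simple arithmetic, sketched here in Python. The function name `predicted_precision` and its parameters are hypothetical, and the model is only the independent-peak idealization that the text itself notes is too optimistic.

```python
def predicted_precision(k, log2_file_size=15, correct_in_file=1):
    """Idealized precision for a matching criterion K.

    A file of about 2**log2_file_size spectra is expected to yield
    2**(log2_file_size - K) random wrong answers at criterion K.
    """
    expected_wrong = 2 ** (log2_file_size - k)
    return correct_in_file / (correct_in_file + expected_wrong)


if __name__ == "__main__":
    for k in (15, 20, 25):
        print(k, round(predicted_precision(k), 4))
```

Comparing these idealized values with the measured curves gives a rough sense of how far the real peak statistics depart from the independence assumption.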
Number of Flagged Peaks

Recalculation of the degree of match ignoring the peak of lowest abundance (unknown spectrum relative to the reference) was done to avoid mismatches due to impurities in the reference and other errors; the data for the LMWS (Figure 2) and HMWS (ref. 12) show that this is indeed beneficial. Increasing the number of flagged peaks from zero to two increases the maximum recall for LMWS from 40% to 61%, and for HMWS from 34% to 50%. The increase with the third flagged peak is much smaller, and the fourth has a nearly negligible effect. Because the flagged peak calculations increase the time requirement, three flagged peaks appear to be an optimum number for the present system.

Under the constraints used to produce the data of Figure 2, the recall increase which can be achieved by lowering the matching requirements reaches a limit. Incremental increases in the required ΔK value become less effective in increasing recall, approaching a maximum at a ΔK of 50. This is because most pairs of the unknown and reference spectra of the same compounds that could not be matched at that tolerance level contain a particular datum which eliminates the reference spectrum from further consideration. The increase in the maximum recall value (Figure 2) with increasing number of flagged peaks shows that this technique compensates for part of these incompatible data.

Increasing the tolerance to incompatible data by "flagging" that peak also increases the probability that the spectrum of an incorrect compound will be selected as a match; note (Figure 2) that the precision corresponding to a ΔK of 40 drops from 0.69 using zero flagged peaks to 0.51 using four flagged peaks. For the HMWS the precision values achieved at particular recall values are also somewhat reduced by increasing the number of flagged peaks, but, surprisingly, this is not true for the LMWS (Figure 2).
Apparently the peak flagging makes possible the retrieval of new correct answers to an extent which is proportionately large in comparison to the new wrong answers retrieved.

[Figure 2. Effect of the number of flagged peaks on ΔK values for unknown LMWS spectra of pure compounds (precision plotted against recall). Curves for 0, 1, 2, 3 and 4 flagged peaks.]

Maximum Percent Contamination

This restriction was imposed on PBM to speed the search and increase the precision by eliminating matches that require impurities in concentrations higher than the interpreter feels are possible. The tests on the spectra of pure compounds show that this only benefits search time; at higher precision values (Figure 1) the restrictions of 20% and 70% maximum percent contamination give nearly identical results (although a particular K value is indicative of higher precision for the 20% than for the 70% runs). Note that the maximum recall value, which had been extended to >60% by the use of flagged peaks, is >90% with the more lenient restriction of 70% maximum percent contamination. Thus use of such a higher value is recommended even for unknown spectra of pure compounds; this helps for "incompatible data" (vide supra) of unknown peaks whose abundance is greater than predicted in the same way that flagged peaks help for those whose abundance is too low.

Class of Match

Adjusting the criteria for a structural match (Table 2) is equivalent to changing the relevancy decisions in a document retrieval system. Figures 3 and 4 clearly show that there is a very high probability that if a spectrum retrieved with a low ΔK value is not that of the compound identical to the unknown, it is actually that of a ring positional isomer, a homolog, or a compound whose structure differs only by the position of one carbon atom.
In fact, for the higher K (or lower ΔK) values, most of the small proportion of remaining retrieved compounds that are not even in class IV are of related structure, such as a dimethylhexadecanoic acid matched with octadecanoic acid. This behavior, which is also found for other retrieval systems (refs. 1, 3, 4, 6), shows that there are substantial cross-correlations of peak uniqueness values, as postulated in the explanation above of the precision achieved versus that predicted for a particular K value.

The relative effects of changing data classes on the results for the LMWS and the HMWS are significantly different: the largest proportion of the class I mismatches for the LMWS are ring positional isomers (Figure 3) whereas in the HMWS the homologs of the unknowns are the most significant (Figure 4). For example, at a ΔK of 30 for the LMWS two-thirds of the class I mismatches are ring isomers, half of the remaining mismatches are homologs, and about one-third of the remainder belong to class IV; for the HMWS at ΔK = 30 well over half of the mismatches are homologs, while nearly half of the remainder are isomers which can be formed by moving only one carbon atom.

This differing importance of the structural classes for the LMWS and the HMWS appears to be mainly an artifact of the makeup of the reference file. For example, the LMWS contains numerous dimethylnaphthalenes and dimethylindoles, with the positions of the two methyl groups occurring in various permutations on the rings; the spectra of these isomers are very similar. The HMWS contains a number of homologous long-chain aliphatic hydrocarbons and their derivatives such as primary alcohols and esters.
Although the LMWS also contains many spectra of homologs, the peak abundances of these spectra are much more sensitive to the addition of a methylene group; the fragmentation patterns of methyl acetate and methyl propionate are easily distinguishable, while those of methyl heptadecanoate and methyl octadecanoate are nearly identical except in the molecular ion region. This can account also for the much larger effect of class IV data on the HMWS than on the LMWS, as a single "misplaced" carbon atom will tend to have a much smaller effect.

Molecular Ion Information (K+)

The molecular ion provides additional information which is especially valuable for distinguishing between homologs, as seen for the HMWS data in Figure 4. The increases in precision obtained by examining those class I matches retrieved with a K+ value are nearly commensurate with the increases obtained by using class III matching criteria with K values; obviously the molecular ion should be uniquely effective in distinguishing between homologs. The same effect is significant in the consideration of class IV data as well. Thus for a high molecular weight unknown (Figure 4) a K+ value ≥40 provides a 95% confidence of at least a class IV match, while for a low molecular weight unknown a K+ value ≥30, which is obtained for nearly 50% of all possible matches, provides >99% confidence of a class IV match. Note, however, that a high K+ value does not necessarily insure that the reference compound has the same molecular weight as the unknown.

[Figure 3. Effect of structural matching criteria on ΔK values for unknown LMWS spectra of pure compounds (precision plotted against recall). Curves for class of match I, II, III and IV.]

[Figure 4. Effect of structural matching criteria and molecular ion information (K+) on ΔK values for unknown HMWS spectra of pure compounds (precision plotted against recall). Curves for class of match I, I+, II, III, IV and IV+.]
The molecular ion of a lower molecular weight compound can occur as an odd-electron fragment ion in the spectrum of a higher molecular weight unknown; matching the molecular ion is a necessary but not sufficient condition to prove identical molecular weights. The LMWS data (ref. 12) indicate that the K+ values are of little benefit in distinguishing between ring positional isomers (class II), as would be expected.

K Versus ΔK Values

The recall/precision performances using K and ΔK values show appreciable differences. For the HMWS (Figure 1) the precision achieved using K values is superior for recalls of 50-80%, while the opposite is true for the LMWS (ref. 12) for this recall range. Because the best ΔK value (zero) is the same for all reference spectra, at this value 12-15% of the possible matches are already retrieved; higher precision can be achieved for the LMWS (ref. 12) using K values ≥100 (for K ≥ 100, no mismatches and 4% recall were found for the spectra studied). These precision results at low recall levels are based on samples that are small statistically; for the HMWS (Figure 1) the decrease in precision at the highest K values is probably an artifact of the small data set. Here the observed 50% precision is due to the fact that one of the two spectra retrieved at K ≥ 130 is a mismatch, the spectrum of hexachlorofulvene retrieved for hexachlorobenzene as an unknown (actually a match by class IV criteria); the close similarity of these two spectra has been pointed out (ref. 13).

Precision Value as the Criterion of Match

The K and ΔK values found for a particular selected reference compound can thus give substantially different levels of confidence based on the recall/precision performance. Also (vide supra) the precision found for a particular value of K or ΔK is substantially dependent on the molecular weight, number of flagged peaks, maximum percent contamination, class of match, and inclusion of the molecular ion.
Based on these recall/precision studies (ref. 12), we are at present modifying PBM to convert the various types of K and ΔK values found for each reference spectrum to a "predicted precision" value which can be used in place of the K value for ranking the matches found, and which should provide a more direct measure of the confidence which the interpreter can place in the result with respect to each class of match.

Comparison of PBM with Other Systems

Of the variety of retrieval systems proposed which do not require human decision, that of Biemann and coworkers (the MIT system) (ref. 11) appears to be accepted as the one of best overall performance; the number of peaks employed (two for every 14 mass units of the spectrum) also requires substantially more computer time and storage than do other methods in general use. Overall, the recall/precision performance for pure compounds of the PBM system is equivalent or superior to that of the MIT system (Figure 5) in the high precision (>50%) range, although inferior at low precisions; for mixtures (Figure 6) the PBM system is dramatically superior at all precision/recall levels. It should be noted that performance quality at high precision levels was a prime objective for PBM, as an unknown spectrum for which a match of high confidence could not be obtained should also be interpreted by a mass spectrometrist or a computer interpretive system such as STIRS (ref. 10).

For the spectra of pure compounds in the LMWS, PBM gives clearly superior precision using ΔK criteria which recall 15% to 60% of the possible matches; precisions obtained using K values approach those of the MIT system for recalls <15%. Note that the MIT system uses about 30% more peaks for matching in the LMWS range and employs a forward search comparison. At low recall levels the MIT system is in turn clearly superior to PBM.
This is in part due to the larger number of peaks and possibly the forward search mode of the MIT system; it is also in keeping with the tighter abundance criteria demanded by PBM which make spectra taken under substantially different experimental conditions irretrievable. It follows that relaxing the abundance criteria of PBM by increasing the window tolerances, an option available to the operator, should increase the recall at low precision values. Improved performance in this range should also be possible by "skewing" the unknown spectrum, either increasing or decreasing the observed peak abundances as a function of mass; this should compensate for instrumental mass discrimination or for changing sample concentration during spectral recording which has occurred for either the unknown or reference spectrum.

For the spectra of pure compounds in the HMWS (Figure 5) the MIT system performance again is substantially superior at low precision values. For precision values >50%, the PBM recall using ΔK values is closely equivalent to the MIT retrieval performance; using K values (Figure 1) the PBM recall performance is actually superior for 50% to 80% precision. However, the MIT system uses twice as many peaks for matching the HMWS range; because the PBM performance is degraded much more than that of the MIT system with increasing molecular weight, we have increased the number of peaks for the system presently used: for the molecular weight range beginning at 170 amu, 16 peaks; 180, 17; 195, 18; 215, 19; 240, 20; 270, 21; 305, 22; 350, 23; 420, 24; 500, 25; and 600, 26. Results using this modified PBM system with 35,828 reference spectra are shown in Table 3. For an unknown spectrum made by combining 90% methyl stearate and 10% methyl oleate, the 22 spectra retrieved with highest K values were either correct answers or closely related molecules; note that the correct compounds, but not homologs, have been retrieved with K+ values.
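The molecular-weight schedule for the number of reference peaks listed above amounts to a small lookup table. A minimal sketch follows; the name `peaks_for_molecular_weight` is hypothetical, and it assumes that each range includes its lower bound (the text says "beginning at") and that 15 peaks are used below 170 amu.

```python
from bisect import bisect_right

# Lower bounds (amu) of the ranges that use 16, 17, ..., 26 peaks.
RANGE_STARTS = [170, 180, 195, 215, 240, 270, 305, 350, 420, 500, 600]
BASE_PEAKS = 15  # number of peaks used below the first range


def peaks_for_molecular_weight(mol_weight):
    """Number of reference peaks used for a given molecular weight."""
    return BASE_PEAKS + bisect_right(RANGE_STARTS, mol_weight)


if __name__ == "__main__":
    for mw in (150, 170, 298, 650):
        print(mw, peaks_for_molecular_weight(mw))
```

Under these assumptions a compound near 298 amu, such as methyl stearate, falls in the range beginning at 270 amu and would be matched on 21 peaks.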
Relaxing the matching criteria (Table 2) affects the MIT system performance (ref. 12) for pure compounds in a manner that is closely similar to that found for the PBM performance (Figures 3 and 4); for example, using class IV criteria (Table 2) 27% of the possible LMWS matches can be recalled by the MIT system with 95% precision (ref. 12), which compares to 53% recall for PBM (Figure 3). This supports the proposal that the differences observed for the LMWS and the HMWS in changing these criteria are largely artifacts of the reference file composition.

[Figure 5. Comparative performance of retrieval systems for unknown spectra of pure compounds (precision plotted against recall). Curves: LMWS, MIT and PBM; HMWS, MIT and PBM.]

[Figure 6. Comparative performance of retrieval systems for unknown LMWS spectra of mixtures (precision plotted against recall). System and proportion of component present: MIT, 60%; PBM, 60%; PBM, 30%; PBM, 10%.]

TABLE 3. COMPOUNDS RETRIEVED BY PBM(a) FOR A MIXTURE OF 90% METHYL n-OCTADECANOATE + 10% METHYL cis-9-OCTADECENOATE

Compound | Confidence K value | ΔK | Percent contamination | Percent component
Methyl stearate(b), C19H38O2 | 134+, 95, 92+, 90+, 78+ | 0, 7, 30, 12, 24 | 2, 27, 37, 22, 24 | 90, 76, 67, 91, 53
Methyl behenate, C23H46O2 | 112*, 79, 61 | 0, 23, 41 | 27, 30, 66 | 63, 54, 23
Methyl 16-methylheptadecanoate | 103+ | 0 | 23 | 65
Methyl arachidate, C21H42O2 | 101, 77*, 73** | 1, 25, 29 | 30, 27, 46 | 54, 56, 27
Methyl heneicosanoate, C22H44O2 | 101 | 3 | 27 | 87
Methyl nonadecanoate, C20H40O2 | 95**, 65 | 7, 37 | 27, 50 | 78, 56
Methyl cis-9-octadecenoate(b) | 85*+, 62**+ | 17, 40 | 84, 91 | 14, 20
Methyl 13,14-dideuteriooctadecanoate | 81** | 21 | 24 | 73
Methyl myristate, C15H30O2 | 73, 71 | 29, 31 | 32, 34 | 100, 97
Methyl palmitate, C17H34O2 | 70** | 32 | 34 | 80
Methyl heptadecanoate, C18H36O2 | 66** | 36 | 36 | 61

(a) PBM specifications modified to include >15 peaks for reference compounds of molecular weight >170; 35,828 reference spectra searched.
(b) Correct answer.

Application to Spectra of Unknown Mixtures

The precision achievable by PBM is dramatically superior for the spectra of unknown mixtures, with the differential improvement over the performance for pure compounds being attributable to the use of reverse searching (refs. 3, 4). Both the LMWS (Figure 6) and HMWS (ref. 12) are similar in showing recall/precision performance by PBM for components present in 30% concentration that is substantially above that for the MIT system with 60% components, which performance is actually approached by PBM using 10% components. For the 30% components, the MIT system retrieved only 10% and 7% of the total possible matches for the LMWS and HMWS sets, respectively, and <2% of the possible matches for the 10% components. Although many potential matches are apparently rejected by the base peak prefilter of the MIT system, it is not expected that relaxing this criterion will be particularly helpful; for this system it is recommended (ref. 4) that "the mathematical resolution of component spectra of mixtures is the most satisfactory means of identifying minor components."

The PBM performance for 60% components appears to be superior to that shown earlier for the spectra of pure compounds (Figure 1). Although different data sets have been used, this is due mainly to the fact that the spectra used in making up the unknown mixture spectrum were not eliminated from the reference file (the same mixture spectra were of course used in the MIT system evaluation). Relaxing the matching criteria improves the precision for mixture spectral retrieval in the same fashion as observed for the spectra of pure compounds.

Mixture Examples

Tables 4 and 5 show the compounds retrieved using the MIT and PBM systems for spectra of "unknown" LMWS and HMWS mixtures.
The first spectrum was created by combining the spectra of 3-methoxyindazole, carbon tetrachlorlde, 2 and tert-butyl—3—ketobutyrate In a 60:3 0:10 proportion. In the top 15 matches 1 (the top 10 are shown) selected by the MIT system, only the major component, 3—methoxyindazole, is retrieved, ranking third and seventh In the output list. The PBM results show that this 60% component is identified with high confi- dence ( K+ values corresponding 12 to >95% precision). Although the confi- dence associated with the 30% and 10% components is much lower, molecular ion Information and no flagged peaks were utilized in retrieving the butyl-3-ketobutyrate spectrum, so that the confidence of that match is much greater than the confidence In any of the Incorrect retrievals, for all of which the use of flagged peaks was necessary for matching. Table 5 presents the results for a mixture of the herbicide Siduron, 1—(2- methylcyclohexyl)-3—phenylurea (60%), 1 ,2’-binaphthyl (30%), and the Ins ecticide Sumthlon, 0, _dlmethyl_O_(4_nltro_ _tO1Yl)PhO5Ph0r0thl0ate (10%). All of the similarity Indices obtained by the MIT system are extremely low, so that the low precision observed Is not surprising. On the other hand, 29 ------- TABLE 4. COMPOUNDS RETRIEVED BY THE MIT AND PBM SYSTEMS FOR A MIXTURE OF 60% 3-METHOXYINDAZOLE, 30% CARBON TETRA.CHLORIDE AND 10% tert-BUTYL-3-KETOBUTYRATE Similarity index System and compound or K value MIT: 1—methyl—3-indazolone 0.35, 0.34 3_methoxyindazolea 0.32, 0.27 2 —methyl—3-indazolone 0.28, 0.28 2—allylanisole 0 .27 1—methoxy—4—(1—propenyl)—benzene 0.27 2—methyl-3(2H)—benzofuranone 0.26 1-allyl-4—methoxybenzene 0.24 PBM: 3_methoxylndazolea (33% 33 %)b 92+, 83+ carbon tetrachloridea (66%, 66 %)b 55,41 tert-butyl— 3_ketobutyratea (9 6 %)b 42+ 1-methyl-3-indazolone ( 48 %)b 34* , 25 *+ chioropicrin ( 66 %)b 29** 4-amino—1--methyl—1,2,3—benzotrlazole ( 71 %)b 26*÷ 3-phenylelcosane ( 83 %)b 26** aCorrect answer. 
(b) Value of “percent contamination” (%C) found by PBM; note that (1 - %C) is only an approximation of the actual concentration of the component present.

TABLE 5. COMPOUNDS RETRIEVED BY THE MIT AND PBM SYSTEMS FOR A MIXTURE OF 60% 1-(2-METHYLCYCLOHEXYL)-3-PHENYLUREA, 30% 1,2'-BINAPHTHYL, 10% O,O-DIMETHYL-O-(4-NITRO-m-TOLYL)PHOSPHOROTHIOATE

System and compound                                                     Similarity index or K value

MIT:
  alpha-ionone                                                          0.08
  1-(2-methylcyclohexyl)-3-phenylurea(a)                                0.07, 0.06
  3-methoxy-4-hydroxymandelic acid                                      0.06
  1-ethoxymethyl-4-methylenecyclohexane                                 0.06
  N-phenyl-N'-methylurea                                                0.06
  bornylene                                                             0.06
  cyclofenchene                                                         0.06
  tricyclene                                                            0.06, 0.05
  gamma-terpinene                                                       0.06, 0.05
  bis(2-chloroethoxy)methane                                            0.05

PBM:
  1,2'-binaphthyl(a) (51%, 55%)(b)                                      89+, 59+
  1,1'-binaphthyl (51%, 69%)(b)                                         93*, 39*
  O,O-dimethyl-O-(4-nitro-m-tolyl)-phosphorothioate(a) (84%, 88%, 93%)(b)   83+, 38+, 35*
  1-(2-methylcyclohexyl)-3-phenylurea(a) (61%, 62%)(b)                  73+, 57**
  1,2'-binaphthyl (70%, 69%)(b)                                         44*, 43+
  alpha-phenyldibenzofulvene (53%)(b)                                   41+
  3,4-benzpyrene (89%, 89%)(b)                                          37**, 37**
  gamma-terpinene (65%)(b)                                              39**

(a) PBM specifications modified to include >15 peaks for reference compounds of molecular weight >170; 35,828 reference spectra searched.
(b) Correct answer.

all three components actually present are retrieved by PBM, and the other compounds selected are structurally similar.

The PBM results thus confirm the advantages of both the reverse search strategy (3,4) and the weighting of mass and abundance values of peaks (3) for matching unknown mass spectra. Reducing the number of peaks necessary to achieve relatively high information precision also yields a significant reduction in search time requirements; it should be possible to do such a PBM search in real time for GC/MS.
For example, matching against a reference file of 1,500 spectra during quadrupole MS data acquisition and reduction by a DEC PDP-8 computer (16 K words core, 1.6 M words disc storage) should require ~2 sec for an unknown mass spectrum.

PERFORMANCE OF THE STIRS SYSTEM IMPROVEMENTS

Further details and discussion are available in the Ph.D. thesis of Dr. H. E. Dayringer.(15) Results of the substructure identification system will be discussed first, and then this system will be used to evaluate the performance of STIRS with a variety of data class revisions.

Substructure Identification Capability

The recall abilities at the 98% confidence level exhibited by MF 1-7, 10, and 11 for 18 selected substructural groups and the averages for all groups are shown in Table 6. Tables 6 and 7 show the information precision and recall values for MF 11 for 179 substructures at the predicted 98% confidence level. Table 8 lists the 25 substructural groups of the 204 tested for which STIRS was unable to identify the group in a single compound (zero recall) at the 98% confidence level. A separate, but more complex, system for estimating confidence levels was also developed which incorporated the actual match factor values of the selected compounds. Extensive tests on 27 substructures showed results comparable, but not clearly superior, to those of the random drawing model.

Information Precision

The overall STIRS results for 179 individual substructures using MF 11 show an average information precision of 98.1%, surprisingly close to the predicted value of 98%. The values vary from 100% down to 89% (xylene), with only three showing values <93%, and 23 <96%. The fluctuation in these values appears to be primarily statistical, giving strong support to the basic validity of the random drawing model to predict the confidence level of the STIRS substructural assignment. (In a number of cases, incorrect assignments were found to be due to errors in the spectra used as unknowns.)
The infor- mation precision values found for the 98% confidence predictions of the other data classes (MF 1 - MF 10) varied over a much wider range. The MF 11 results at the 99.5, 99.8, and 99.9% levels show higher informa- tion precision (and lower recall) values, but the change was not as great as expected; even at the 99.9% confIdence level the average precision only increased to 99.0% • Thus In this study we will not attempt to distinguish between information precision levels above 99%. Note that in the previously 32 ------- TAIILI: 6. RECAI.L ASJ!.ITY (X ) or STIRS AT TUE 95% CONrIPLNCE LEVEL lOP NINE MATCh FACTORS alkyl C 3 20.3 V carbonyl 46.2 Q hydroxyl 20.3 VO estor,snhydrlda 19.4 R phenyl 28.4 Z amino, amido 4.3 M Imino, imido 11.9 N trisuhst , N 28.8 S sulfur 10.4 01 - Cn 2 O.., CH 3 O 19.0 L. ,J carbocyclic ring 24.1 F fluorIne 5.0 G chlorine 8.2 E bromine 3.4 I iodine 1.1 —SI—1&1&1 trimethylsilyl 4.5 L6TJ cyclohexane 3.1 QVR benzolc acid 0.8 24 43 1 13 15 25 5 21 30 54 6 31 12 7 22 27 16 33 3 25 30 46 0 39 17 29 6 11 10 23 43 90 47 73 7 43 28 11 36 19 10 11 6 27 12 5 15 4 36 5 10 16 11 49 18 9 40 16 27 25 7 13 0 19 6 13 14 9 14 2 2 16 11 24 11 14 14 4 22 10 22 18 12 35 15 6 43 14 20. 10 8 28 6 50 33 28 17 9 71 17 37 17 17 39 56 28 30 33 20 30 10 63 50 67 47 7 63 30 47 27 10 27 13 43 7 20 33 56 90.2 18 31 100.0 12 42 93.4 32 55 96.4 44 75 93.7 25 44 98.1 35 19 91.0 22 30 95.3 27 35 98.2 24 49 94.8 46 50 94.7 22 61 99.0 29 74 99.6 22 57 97.9 47 47 93.4 63 97 97.7 70 80 95.6 53 73 98.5 Number of functionalit ies giving non-zero recall values Average recall for these 112 167 158 136 159 127 125 169 179 15 32 28 22 26 18 14 40 49 5 1n many cases additional WLN permutations were used as descriptors of a particular substructure; most of these are listed in reierence 17; a complete list is available from the authors. bApproximale description of the group defined by the WLN symbol; see text for possible ambiguities. 
cComplete descriptions of these mass spectral data classes are contained in reference 2. Match factors 8 tid 9 dId not satisfactorily identify any of the fuectional groups. eAverage precision of MF11 results for 179 substructures giving non—zero recall. MF1 MF2 MEl ?vlI’4 MFS MF6 MF7 MF1O MF11 Information WLNA Substructure °h Ion series —CharactertttLc ee C— —Neutral lcs cs— 2 pezsks/ Overall, precision, file rn/a: 27—99 27—89 90—149 lS0 0—64 65 110 14 amu l2vlFl—6 MF11 98. l 33 ------- TABLE 7. RECALL AND PRECISION OF STIRS MPh PREDICTIONS AT THE 98% CONFIDENCE LEVEL FOR 161 OTHER SUBSTRUCTURES a b No./18806 WLN Substructure In file Information precision Recall 1 —CH 2 —, CH 3 8886 98 25 2 —CH 2 CH 2 —, C 2 H 5 3793 93 19 02 —CH 2 CH 2 O—, C 2 H 5 O 975 94 42 O linkIng oxygen 7709 99 39 U double bond 4495 95 38 K quaternary amine 138 96 50 Y single branch 4985 94 28 3O alkoxy, C 3 494 94 50 P phosphorus 424 98 70 T. .J heterocychic ring 6796 97 31 VS thioester 68 100 43 UU triple bond 216 96 29 T56 BN D0J benzoxazo le 20 100 55 X double branch 2039 93 44 MV1 acetamldo 202 97 60 —SI— silicon 950 98 90 U1V1 acetonylidene 53 98 37 OVR benzoate 387 99 53 lv acetyl, —CH 2 CO- 2159 95 47 VQ carboxy 597 98 43 OV 1 acetate 1100 96 47 L66BGAB-C1BITJ adamantyl 15 100 67 GV acid chloride 57 99 50 VYZ alanyl 24 99 25 V 1U1 acrylyl 167 96 50 1U2 ahlyl 312 99 23 V4V adipyl 21 100 19 ZR aniline 115 97 30 VII aldehyde 424 97 37 34 ------- Table 7. Continued a b No./ WLN Substructure in 18806 file Information precision Recall 1OR methoxyphenyl 829 97 43 VZ primary amlde 182 97 57 L 0666J anthracene 14 100 14 MR N—subst. aniline 397 96 43 NNN azide 95 99 53 1ORXV anisoyl 105 98 50 T3MTJ azlridine 46 99 73 L 0666 BV IVJ anthraquinone 16 99 19 NO&UN azoxy 7 100 57 TJNNU azino 44 97 30 VHR benzaldehyde 92 99 83 NUN azo 83 97 77 L 06 B666J benzphenanthrene—3,4— 9 100 89 L57J azulene 6 100 50 T56 BM DNJ benz lmidazo le 45 99 53 ZVR benzamide 21 98 48 T56 BOT benzofuran 2]. 
100    33

Q1R                 benzyl alcohol                   106    95    20
T56 BSJ             benzothiophene                    35    99    70
U1R                 benzylidene                      377    94    50
L66 A BTJ           bicyclo(2.2.2)octane              40    98    70
RYR&U               benzohydrylene                    35    98    53
L4TJ                cyclobutane                       25   100    36
T56 BN DSJ          benzothiazole                     36   100    37
L S5ATJAA           bornane                           19    99    47
RR                  biphenyl                          19   100    79
T5OVTJ              gamma-lactone                     36   100    67
O4                  butoxy                           184    98    53
V1U1R               cinnamoyl                         94    98    77
L7TJ                cycloheptane                      11    99    45
T66 BOVJ            coumarin                          45   100    27
T56 BOT&J           coumaran                          16    99    19
CN                  cyano                            307    98    37
V1U2                crotonyl                          29   100    17
L6U CUTJ            cyclohexadiene-1,3-               22   100    27
L5 AHJ              cyclopentadiene                    9   100    33
L6VTJ               cyclohexanone                     48    99    37
L5VTJ               cyclopentanone                    25   100    36
L5TJ                cyclopentane                     146    97    60
NW                  nitro                            508    99    77
L3TJ                cyclopropane                      89   100    57
N1&1                dimethylamino                    325    97    63
L66TJ               decalin                           37    98    73
SW                  sulphonyl                        254    97    53
UNMR BNW DNW        2,4-dinitrophenylhydrazone        32   100    87
L B656 HHJ          fluorene                          18   100    39
SS                  disulphide                        80   100    47
T5OJ                furan                            110    99    43
MVH                 formamido                         14   100    14
Q1V                 glycolyl                          26    99    23
T5NNOVJ             sydnone                           36   100    67
MZ                  hydrazino                         63    99    23
T5MVMV EHJ          hydantoin                         11    97    18
QM                  hydroxyamino                      54    99    30
V2R                 phenylpropionyl                   22   100    50
T56 BMNJ            indazole                          18   100    50
L56T&J              indan                             53    98    83
T56 BMT&J           indoline                          12   100    63
T56 BMJ             indole                           134    98    63
QNU                 isonitroso                        43   100    23
RMNU                phenylhydrazone                   53   100    73
T5NOJ               isoxazole                         64    99    43
SCN                 thiocyanate                       18   100    11
SH                  thiol                            125    99    57
V1V                 malonyl                           75    98    23
1S                  methylsulfide                    306    97    33
O1O                 methylenedioxy                    14   100    29
L E5 B666..J        steroid skeleton                 782    97    87
T6M DOTJ            morpholine                        38   100    27
NO                  nitroso                          508    99    70
L66J                naphthalene                      167    97    77
VV                  oxalyl                            97    96    20
L55 ATJ             norbornane                        65    99    70
T5N COJ             oxazole                           35    99    83
L E5 B666 LUTJ      steroid skeleton, 6,7-dehydro-    99    98    70
OO                  peroxy                           346    96    60
QNU                 oxime                             43   100    23
U1VR                phenacylidene                     29    98    52
1VR                 phenacyl, acetylphenyl           149    93    43
MVMR                phenylureido                      24    98    63
L B666J             phenanthrene                      34    99    70
QR                  phenol                           957    97    70
T C666 BN INJ       phenazine                         37   100    47
OR                  phenoxy                         1464    94    40
T C666 BM ISJ       phenothiazine                     65    99    60
NUNR                phenylazo                         72    97    90
QV1R                phenylacetic acid                 15    99    80
T6M DMTJ            piperazine                        30   100    60
T56 BVMVJ           phthalimide                       27    99    37
3U                  propylidene                      764    97    30
V2                  propionyl                        393    99    40
T6N DNJ             pyrazine                          50   100    67
T66 BN DN GN JNJ    pteridine                         26   100    73
T6NJ                pyridine                         278    99    63
L666 B6 2AB PTJ     pyrene                             6   100    50
T6N CNJ             pyrimidine                        81   100    63
T6NJ AO             pyridine-N-oxide                   9   100    78
T66 BNJ             quinoline                        111    99    77
T5MTJ               pyrrolidine                       79    96    20
L6V DVJ             quinone                           32   100    37
T66 BN DNJ          quinazoline                       38    99    80
QR BV               salicyloyl                        62    98    67
T66 BN ENJ          quinoxaline                       30   100    53
1U1R                styrene                          176    96    70
UNMVZ               semicarbazono                     41   100    83
V2V                 succinyl                          16    96     6
T5VNVTJ             succinimide                       12   100     8
T5OTJ               tetrahydrofuran                  124    97    60
MSW                 sulfonamido                       38    99    50
T5N CSJ             thiazole                          25   100    20
T6OTJ               tetrahydropyran                  745    98    77
US                  thiocarbonyl                     203    97    60
SUYZ                thiocarbamyl                      16   100    25
T5SJ                thiophene                        158    99    77
NCS                 thiocyanate                       18   100    11
R1V                 phenylacetyl                      71    97    70
MRX1                toluidino                         61    98    37
T6N CN ENJ          triazine                          59    99    80
1RX                 xylene                           440    89    30
ZVZ                 urea                             306    98    47
FXFF                trifluoromethyl                  343    99    63
1U1                 vinyl                            557    94    23
4V                  valeryl                          126    96    30
RV                  benzoyl                         1039    94    53

Table 7.
Continued

WLN(a)              Substructure(b)                  No. in 18806 file   Information precision   Recall
T C666 BO IVJ       xanthone                           9                 100                     56
1V1V                acetoacetyl                       36                  99                     27
T56 ANJ             indolizine                        24                 100                     63
T7MVTJ              caprolactam                        8                  99                     50
MNW                 nitramino                          7                  99                     29
T5MNDNJ             triazole, 1H-1,2,4-               26                 100                     23

(a) In many cases additional WLN permutations were used as descriptors of a particular substructure; most of these are listed in reference 17; a complete list is available from the authors.
(b) Approximate description of the group defined by the WLN symbol; see text for possible ambiguities.

TABLE 8. SUBSTRUCTURES OF ZERO RECALL BY MF11

WLN(a)              Substructure(b)                  No. in 18806 file
ZR BV               anthraniloyl                      12
MVR                 benzamido                         40
SHR                 benzenethiol                      12
UU1R                benzylidyne                       13
L35TJ               bicyclo(3.1.0)hexane               4
T B656 HMJ          carbazole                          5
T B656 EN HMJ       carboline, beta                    5
L7 AEJ              cycloheptatriene                   8
U2U1R               cinnamylidene                      4
T B656 HOJ          dibenzofuran                       5
T B656 HSJ          dibenzothiophene                   4
V3V                 glutaryl                           7
VHV                 glyoxylyl                          4
V1MVR               hippuryl                           5
L56 BHJ             indene                             4
OCN                 isocyanato                         9
T66 CNJ             isoquinoline                       9
T5VOVJ              maleic anhydride                   4
L46 ATJ             norpinane                          4
T4OTJ               oxetane                            9
T66 CNNJ            phthalazine                        8
1UU1V               propiolyl                          6
QR B1U              salicylidene                      12
T E5 B666..J        steroid, heterocyclic             19
SWQ                 sulfonic acid                      5

(a) In many cases additional WLN permutations were used as descriptors of a particular substructure; most of these are listed in reference 17; a complete list is available from the authors.
(b) Approximate description of the group defined by the WLN symbol; see text for possible ambiguities.

cited example of 373 spectra of which 70 contained the substructure, and STIRS identified its presence in 35 of the 70 (correctly) and 3 of the remaining 303 (incorrectly), a decrease of only one in the incorrect identifications increases the precision value from 98.0% to 98.7%.

Further, and more importantly, a sampling of the incorrect identifications by MF 11 shows that most actually contain substructures that are closely similar, at least on the basis of their mass spectral behavior, to that identified.
For example, one-third of the compounds incorrectly identified as containing a phenyl actually had a benzo group, and nearly half a pyridine ring; in many cases these alternative identifications were indicated by STIRS, either directly as that substructure or as a functionality of common occurrence in the compounds selected with highest MF values. For the VO classification, -CO-O-, a large proportion of the incorrect identifications were compounds whose WLN contained VQ, -CO-OH. Although this results from the necessarily arbitrary nature of some classifications of the substructures, in the case of a true unknown such “incorrect” identifications could be helpful, or at least not be seriously misleading.

The gratifying agreement between the average MF 11 information precision value of 98.1% and the predicted confidence value of 98% is also a strong indication that the validity of the STIRS results is not significantly dependent on the data base, unknowns, or substructures. The substructures were not chosen on the basis of their mass spectral behavior (although this should be beneficial, vide infra), so that a similar STIRS performance would be expected for new compounds, even of unusual mass spectral behavior, examined as unknowns or added as reference spectra to the file. However, because STIRS can only identify structural moieties which are already in the reference file, it should be remembered for a total unknown that the substructures indicated may be only closely related in mass spectral behavior to those actually present. For example, if an unknown had a steroid structure except that the “D” ring was six-membered, and there were no reference spectra of this type in the file, STIRS might well indicate the steroid substructure at a high confidence level (although the actual MF values could be lower than expected).
However, if this compound type was also well represented in the reference file, but was not in the substructure list, these reference compounds should then appear instead in the top 15 selected, and thus reduce the confidence of the steroid substructure prediction.

In summary, the information precision values found here indicate that the STIRS identifications from MF 11 are reliable with the predicted confidence at least to the 99% level; i.e., if STIRS predicts the presence of a pyridine ring in an unknown with 99% confidence, in only one case of 100 should this turn out not to be true.

Recall

To reemphasize, although it is important that STIRS be correct in a high proportion of the cases that it makes a substructure prediction (a high “information precision”), it is also important that it makes such a high confidence prediction of the particular substructure in a substantial proportion of the unknowns in which the substructure is actually present (a high “recall”). Considering the wide range of structural types, the average recall by MF 11 of 49% for 179 substructures appears to be a promising performance, and values such as 97% for trimethylsilyl (identifying the group in 29 of the 30 TMS compounds examined), 89% for benzphenanthrenes, 87% for steroids, and 87% for dinitrophenylhydrazones are quite impressive. For 12% of the substructures examined (Table 8), STIRS gave zero recall, but for all but one of these (MVR, benzamido) there were only thirteen or fewer compounds which contained that substructure in the file of 18,806 compounds.

The utility of the system depends, fundamentally, on the amount of reliable information which it can supply on the average unknown molecule. Multiplying the recall for each substructure by its proportion in the data base, and summing these values, gives a figure of 2.55; this means that for an average unknown spectrum STIRS should be able to identify two or three substructures by MF 11 with high confidence.
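The figure of 2.55 is a sum, over all substructures, of recall weighted by how often each substructure occurs in the reference file. A minimal sketch of that arithmetic is given here, assuming recall is taken as a fraction and the proportion as (number of compounds containing the substructure) / 18,806. The function name is hypothetical, and the five rows are only an illustrative subset of the Table 7 entries, so the result is far smaller than the full-list value.

```python
FILE_SIZE = 18806  # compounds in the reference file (Table 7 heading)

# (substructure, number of compounds in file, MF 11 recall in percent)
# Illustrative subset of Table 7 only; the report sums over the full list.
SAMPLE = [
    ("-CH2-, CH3", 8886, 25),
    ("linking oxygen", 7709, 39),
    ("double bond", 4495, 38),
    ("phenol", 957, 70),
    ("nitro", 508, 77),
]


def expected_identifications(rows, file_size=FILE_SIZE):
    """Sum over substructures of recall x (fraction of the file containing it)."""
    return sum((count / file_size) * (recall / 100.0) for _, count, recall in rows)


if __name__ == "__main__":
    # Contribution of the five sample substructures to the average number of
    # high-confidence identifications per unknown.
    print(f"{expected_identifications(SAMPLE):.3f}")
```

Common substructures with modest recall (such as -CH2-) contribute more to this total than rare substructures with very high recall, which is why the total depends so strongly on how comprehensive the substructure list is.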
Of course this number will increase if the list of substructures is made more comprehensive or if the STIRS performance is improved to increase the recall values.

In almost every case the overall match factor, MF 11, showed a higher recall value than any of the other data classes. This is not surprising, as MF 11 is a weighted average of MF 1-6 (1 x MF1, 1 x MF2, 2 x MF3, 2 x MF4, 4 x MF5, 2 x MF6), and in most cases MF 7-9 were of marginal utility.

In general the variation in recall values with data class (MF 1-11) and substructure corresponds to known spectra-structure correlations, as noted earlier for the qualitative information from STIRS.(10) Those substructures giving characteristic fragmentation behavior generally show higher recall; the high MF 2 recall for phenyl is consistent with the characteristic “aromatic ion series” in the m/e 37-79 mass range, while the high MF 5 recall for chlorine corresponds to the high tendency for the loss of this electronegative species as neutral Cl or HCl (note that none of the data classes should be sensitive to the characteristic isotopic abundances of chlorine). Trimethylsilyl derivatives exhibit abundant ions at unusual masses such as 73, 75, 89, and 147, making the identification of this substructure possible in a high proportion of spectra.

Despite the substantially superior performance of MF 11 in general, for a particular unknown spectrum it is possible that one or more substructures can be identified at a higher confidence level by another data class (MF 1 - MF 10). It is important, however, that predictions be made only for those substructure/match factor combinations of high information precision and recall.
If each of the 204 substructures were to be predicted by each of eleven data classes, even if there were only a 1% chance that a particular match factor indicates a particular substructure because it is present by chance in the required number of the top 15 compounds, this would mean that approximately 22 substructures (1% x 11 x 204) would be indicated incorrectly for the average unknown. Thus in practice we use the MF 11 results as the primary structural information. Although it is helpful to examine the MF 1 - MF 10 results for other structural clues, the only substructure-MF combinations used are restricted by the computer to those for which the statistical study showed information precision values ≥94%, and recall values >30% of the value for MF 11.

Application to Unknown Spectra

The applicability and limitations of this technique are illustrated by STIRS substructure predictions (98% confidence level) for some “unknown” mass spectra. As discussed above, primary reliance is placed on the MF 11 results, but those from MF 1 - MF 10 are reported to show how these can supply confirmatory evidence and indicate additional possible substructures.

Results with simple molecules are generally excellent. From the spectrum of 5-tridecanone (WLN, 8V4) STIRS identified an alkyl group of three or more carbons (3) by MF 3 and MF 11, with no incorrect identifications. For methyl trichloroacetate (WLN, GXGGVO1) the substructures methoxy (O1, by MF 11), tetrasubstituted carbon (X, by MF 2, 3, 10), chlorine (G, by MF 5, 7, 10, 11), and “GV” (by MF 5, 11) were identified. Despite the fact that “GV” can indicate an acid chloride, ClCO-, this as well as all of the other STIRS identifications are actually correct; G is a terminator in WLN, and for all of the compounds selected by STIRS the chlorine was substituted on a carbon attached to the carbonyl group, not on the carbonyl itself.
This emphasizes the importance of examining in addition the actual structures of the selected compounds, as well as careful definition of the WLN subcodes used (this subcode has now been narrowed to restrict this group to only acid chlorides by requiring that there be no structure symbol immediately preceding the GV).

From the spectrum of ethyl 2-isopropyl-3-oxobutyrate (WLN, 2OVYV1&Y) STIRS correctly identified six substructures from the MF 11 results with >99% confidence and gave no incorrect MF 11 assignments. Although some of the simpler ones (ethyl, linking oxygen) are redundant, the identification of groups such as VO (ester), O2 (ethoxy), Y (single branch), and 1V (acetyl) would be useful in elucidating the structure of this as an unknown. Of the secondary predictions by other MF data classes, two substructures are correct, and 1V1, V1V, and 1V1V (CH3COCH2CO-) predicted by MF 2 are similar to the correct substructure CH3COCH(CO-)-. Three of these secondary predictions are misleading: carbocyclic (L..J, by MF 3), sydnone (T5NNOVJ, by MF 6), and naphthalene (L66J, by MF 3). However, if C2H5O-CO-C- and C2H5O-CO-CH(COCH3)- had been included in the substructure list, the STIRS results would have been much more dramatic; 11 of the top 15 compounds selected by MF 11 were ethyl esters, while five were ethyl esters of 2-alkyl-3-ketobutyrates.

From the spectrum of a γ-lactone, STIRS identified the γ-lactone functionality (T5OVTJ, by MF 2, 11) as well as the alkyl chain (3, by MF 3, 11), carbonyl (V, by MF 10, 11), and ester (VO, by MF 11). The related group -CH2CH2CO- (V2, by MF 10, 11) was the only other substructure found. The spectrum of δ-laurolactone also produced STIRS predictions of T5OVTJ, 3, V, and VO; the δ-lactone substructure is not included in the substructure list.
The top 15 compounds selected by MF 11 actually contained two δ-lactones as well as the four γ-lactones on which the substructure prediction was based, while the STIRS results for the γ-lactone showed one δ-lactone as well as the four γ-lactones selected for MF 11. Thus STIRS does not differentiate well between the two substructures, making it preferable to combine these into a single substructure indicating either of these functionalities.

Utility of STIRS Substructure Identification

It should be emphasized that this system for substructure identification, as STIRS itself, is intended as an aid to, not as a replacement for, the human interpreter. The capabilities of STIRS are subject to the basic limitations of mass spectrometry. STIRS can give no more information from a particular unknown mass spectrum than a human interpreter having unlimited time, insight, and experience, and for a molecule of moderate complexity neither the interpreter nor STIRS can make a substructure prediction with absolute certainty (100% information precision). Thus it did not seem particularly useful, or feasible, to conduct a statistically valid comparison of the abilities of human interpreters versus STIRS. In our applications of the method to date there have been many instances of STIRS substructure identifications by MF 11 at a high confidence level that were missed by manual interpretation. However, if the human interpreter did not recognize, for example, a sydnone substructure, it might be argued that someone with sufficient experience in the spectra of such compounds would not have missed this. Even for cases in which this is true, the increase in speed and confidence of the interpretation process provided by the STIRS substructure information would appear to justify the relatively small effort required to obtain this information.

Further comment may also be appropriate for the information precision and recall values observed here.
The interpreter can consider substructures indicated by STIRS at the 80% confidence level, but must keep in mind that such a postulation will be incorrect in one of five cases. Of course it is much better to consider first substructures predicted by STIRS with higher precision values, although there will be fewer of these (lower recall). We contend that the average recall of 49% at the 98% confidence level for this variety of substructures indicates that this method has substantial utility; however, this is not based on a comparison with the performance of a trained mass spectrometrist, but rather on substantial experience in the degree to which STIRS substructure identifications can help any mass spectrometrist, including a highly experienced one. For example, an interpreter might be able to surpass STIRS in the identification of chlorine and bromine in unknown spectra, as none of the STIRS data classes utilize directly the isotopic abundance information characteristic of these elements. However, in cases in which the isotopic information is ambiguous because of interference or sensitivity problems, the interpreter might still be helped by a STIRS identification of these elements.

Finally, the quantitative evaluation provided by the information precision/recall values should be valuable for measuring the improvement achieved by further modifications to STIRS, and for performance comparisons with other systems (including those utilizing human as well as computer interpretation). For example, a very recent study by Kent and Gäumann for substructure identification from mass spectra using learning machine methods (16) gives performance data for twenty simple functionalities; useful identifications were not possible for compounds containing more than one type of substructure.
Complete elucidation of structure, the ultimate goal in interpreting an unknown mass spectrum, utilizes a variety of spectral information such as molecular weight and ion elemental compositions from isotopic abundances in addition to substructure information. For complex molecules for which the complete structure cannot be determined at a high confidence level, it would appear to be helpful to know at least what parts of the structural assignment can be given with high confidence.

System Improvements

STIRS using MF 11 can identify carbonyl (V) with 31% recall, a performance which is encouraging in view of the scarcity of specific mass spectral peaks characteristic of this group. However, the recall is substantially higher for terminal carbonyl-containing functionalities such as acetyl (1V, 47%) and benzoyl (RV, 53%). Thus, as might be expected, functionalities which are known to have a strong directing effect on mass spectral fragmentations, such as saturated nitrogen, can exhibit a low recall if defined in too general terms; the performance for the secondary amine (M) and benzamido (MVR, Table 8) substructures should be improved by substituting a number of more specific classifications such as -CH2NHCH3 (-1M1) and -CH2NHCOC6H5 (-1MVR). Because the search for the present 204 substructures only increases the time for a STIRS run by a few percent, an obvious possible improvement is to subclassify each of the more general substructures of the 204 into a number of new substructures expected to give the most characteristic specific effects on the spectrum. In particular cases, such as for the isopropyl group, substructure definition for the reference spectra by WLN is difficult, and definition from a connection table (or manually) will be preferable. The connection table is a computer representation of the atom connectivities of the molecule, and can be generated from the WLN.
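To illustrate why a connection table makes a group such as isopropyl easy to define, the following is a minimal sketch, not the representation used in STIRS. It assumes a heavy-atom table with implicit hydrogens, built here by hand rather than generated from WLN; all function and variable names are hypothetical. An isopropyl group is taken to be a carbon with three single-bonded heavy-atom neighbours, at least two of which are terminal (methyl) carbons.

```python
from collections import defaultdict


def build_connection_table(atoms, bonds):
    """atoms: list of element symbols (hydrogens implicit).
    bonds: list of (i, j, order) tuples using 0-based atom indices.
    Returns {atom index: [(neighbour index, bond order), ...]}."""
    table = defaultdict(list)
    for i, j, order in bonds:
        table[i].append((j, order))
        table[j].append((i, order))
    return {i: table[i] for i in range(len(atoms))}


def is_methyl(atoms, table, i):
    """A carbon with exactly one heavy-atom neighbour, joined by a single bond."""
    return atoms[i] == "C" and len(table[i]) == 1 and table[i][0][1] == 1


def has_isopropyl(atoms, table):
    """True if some carbon has three single-bonded neighbours, at least two methyl."""
    for i, symbol in enumerate(atoms):
        neighbours = table[i]
        if symbol != "C" or len(neighbours) != 3:
            continue
        if any(order != 1 for _, order in neighbours):
            continue
        if sum(is_methyl(atoms, table, j) for j, _ in neighbours) >= 2:
            return True
    return False


# Ethyl 2-isopropyl-3-oxobutyrate, CH3-CO-CH(CH(CH3)2)-CO-O-CH2-CH3
ATOMS = ["C", "C", "O", "C", "C", "C", "C", "C", "O", "O", "C", "C"]
BONDS = [(0, 1, 1), (1, 2, 2), (1, 3, 1), (3, 4, 1), (4, 5, 1), (4, 6, 1),
         (3, 7, 1), (7, 8, 2), (7, 9, 1), (9, 10, 1), (10, 11, 1)]
TABLE = build_connection_table(ATOMS, BONDS)

# Ethyl acetate, CH3-CO-O-CH2-CH3, as a counter-example
EA_ATOMS = ["C", "C", "O", "O", "C", "C"]
EA_BONDS = [(0, 1, 1), (1, 2, 2), (1, 3, 1), (3, 4, 1), (4, 5, 1)]
EA_TABLE = build_connection_table(EA_ATOMS, EA_BONDS)

if __name__ == "__main__":
    print(has_isopropyl(ATOMS, TABLE), has_isopropyl(EA_ATOMS, EA_TABLE))
```

The test is a local graph pattern that does not depend on how the rest of the molecule is written, which is the property that a linear notation lacks for branched groups of this kind.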
With a few exceptions, the substructures tested here are relatively simple. An obvious extension of the method would be to include more complex substructures, especially those which are well represented in the reference file. For example, although a substructure for the steroid skeleton (WLN, L E5 B666..J) is included, the subset of estrogens (steroids with an aromatic “A” ring) is not. When the spectrum of 17-vinylestradiol 3-methyl ether was examined by STIRS, all of the top ten reference spectra selected by MF 11 were of estrogens;(17) this would predict that this relatively complex substructure should show a high recall value. More complex substructures might also be identified by having the computer search for combinations of substructures in the selected compounds.

Improvements in the information precision/recall performance should also be possible by modifications to STIRS, such as new match factors incorporating different data classes, and new “overall” match factors involving different combinations of the individual match factors. We would then plan to use for the prediction of a particular substructure only the few match factors which show the highest precision/recall performance.

STIRS Data Class Improvements

The recall and precision values found for STIRS predictions of substructure identity at the 98% confidence level are shown in Table 9. The statistical validity of these results is supported by the close correspondence to those of the original data classes which employ similar criteria. For example,

TABLE 9. PERFORMANCE OF DATA CLASSES FOR 98% CONFIDENCE LEVEL PREDICTIONS

Data class   Recall, %   Precision, %   Number of substructures with non-zero recall
2(a)         31.6        96.1           167
3(a)         27.8        95.7           158
4(a)         21.5        93.0           136
2A           32.7        95.5           174
2A'          33.4        97.3           171
3A           34.6        96.1           173
3B           31.2        96.0           169
4A           26.4        93.7           163
4B           18.6        91.7           139
4C           16.0        87.5            99
11(a)        49.1        98.1           179
11.1         47.4        97.9           183
11.2         47.3        98.1           182

(a) Data from reference 1.
the MF2A values are derived from data very similar to that used for the original MF 2 values, and these give comparable results in precision, recall, and number of identifiable substructures. The relatively smooth change in average recall found through each series of data classes, such as 2A-4C, indicates that there are no serious inconsistencies in the limits or extent of the mass ranges used. Because the contribution of a particular data class to the overall STIRS performance is also dependent on the number of substructures for which it can give the best recall, such data have also been determined (Table 10). These data give a very different impression of the usefulness of the data classes employing high mass peaks; although MF4C values give an average recall of 16% on 99 substructures, only a very small percentage of these show a recall ability that is superior to that of other data classes. Thus the additional structural information supplied by MF4C values does not appear to justify the computer time and storage required for its use.

Number of Ions

Changing the number of ions used in matching has a relatively small effect on the performance of a STIRS data class; using more peaks than the number previously recommended (Table 1) slightly increases the recall for substructures without substantially affecting the precision values. The mass ranges of data classes 3 and 3B correspond closely; thus the increase in average recall from 27.8% to 31.2% for predictions by MF3 versus MF3B results mainly from increasing the number of peaks employed from five to eight. In a study of 20 commonly occurring substructures it was found that in using the six (MF 2), eight (MF2A), and ten (MF2a) largest even-mass and odd-mass ions in the low mass range (Table 5) the average recall went from 26.0 to 28.2 to 29.7%, respectively, while the precision stayed the same (89-90%) versus the predicted 98% confidence level.
However, for half of the substructures the use of ten peaks (MF2a) instead of eight (MF2A) gave the same or a lower recall value, indicating that in some cases the use of more information for matching can even confuse the substructural identification. Because additional computer time and space are required for each additional peak matched, it was decided for further testing to increase only to eight the number of ions used in the lower mass ranges (MF2A, 3A, 3B).

Varying the number of ions used in the higher mass ranges gives similar trends in the recall values, but indicates that fewer peaks are necessary per mass increment. For the mass range 117-200 (MF4A) the average recall increases from 21.7% to 25.2% to 26.4% (based on 155, 156, and 163 substructures, respectively) in changing the number of peaks employed from four to six to eight. In the mass range 159-270 (MF4B) increasing the number of ions used for matching from five to eight increased the recall from 17.9% to 18.6% (based on 127 and 139 substructures, respectively). These trends are not surprising in light of the increased uniqueness and scarcity of peaks in the higher mass ranges.(5)

Utility of High Mass Characteristic Ions

The use of eight peaks by MF4B (and also by MF4C) gave average recalls below that of MF4, which uses only five peaks between m/e 150 and (M - 1)+. This can be due only in part to “confusion” caused by too much data used in

TABLE 10. DATA CLASSES GIVING HIGHEST RECALL VALUES FOR INDIVIDUAL SUBSTRUCTURES

Data class      Number of substructures with best recall(a)   Percentage of substructures with best recall
2A (2A')(b)     58.3 (62.7)                                   33.0 (35.2)
3A              57.3                                          32.4
3B              36.5                                          20.6
4A              17.3                                           9.8
4B               4.3                                           2.4
4C               3.3                                           1.9

(a) The averaged STIRS results for each substructure were examined to find which data class gave the highest recall value; fractional credit was given for ties.
(b) The STIRS data for data class 2A' was ignored in determining the other results of the Table.
The 2A’ results shown were then determined in the same way, ignoring the 2A data. 48 ------- matching, as ( vide supra ) increasing the number of peaks used by MF4B from five to eight gave a small increase in average recall. Thus we will extend (Table 11) the mass range of MF4B to (M - 1Y’, making the specifications for data class 4B little changed from those of the original class 4. The number of possible molecular fragments which can produce a particular mass increases rapidly with increasing mass, apparently offsetting the higher uniqueness of these peaks, so that a much broader mass range is advantageous at higher rn/e values. Requirement of Even-Mass and Odd-Mass Ions For the lowest mass range (MF 2) equal numbers of the largest even- and odd- mass ions were originally chosen’° to insure that both odd-electron and even- electron ions were included, as these are produced by different mechanisms. On average, however, there is little difference in the results from the use of the largest four even- and four odd—mass ions in the range rn/e 6—88 (MF2A) and the use of the eight largest peaks in the same range (MF2A’). MF2A gave an average of 95.5% precision and 32.7% recall for 174 of the 204 substruc- tures, while MF2A’ gave an average of 97.3% precision and 33.4% recall for 171 substructures. Some individual substructures show substantial differences in recall between MF2A and MF2A’, but this was not felt to be a sufficient justification for the use of both data classes in view of the additional match- ing time and storage required to retain both. Because the overlapping data class 3A uses only the most abundant ions, it was decided to use 2A instead of 2A’, thereby having low mass data classes which both require, and do not require, even and odd masses. A study of 20 commonly occurring substruc- tures using MF2a and MF2a’ produced results using 10 ions similar to the above eight ion results. 
The recall averages of 28.2 and 31.1%, respectively, for MF2a and MF2a' with 90% precision follow the trend above. The difference in recall can, in part, be accounted for by the fact that often there may not be five odd-electron ions of importance in the mass range.

Overlapping Mass Ranges

Evaluation of the effects of overlapping mass ranges is possible by comparison of the results for data classes 3A, 4A, and 4C to the results for the adjacent ranges 2A, 3B, and 4B. Enumeration for each data class of the number of substructures for which a maximum recall was achieved (Table 10) shows that such overlapping is of substantial value; predictions by MF3A and MF4A have 32 and 10%, respectively, of such substructures. The recall average (Table 2) of 34.6% for MF3A is the highest average of all the individual characteristic ion data classes, and for 64 substructures the recall performance of MF3A was greater than or equal to that for MF2A, MF2A', or MF3B. In contrast to the use of ten instead of eight peaks, the additional information obtained from characteristic ions by the use of overlapping mass ranges appears to be more than sufficient to justify the extra storage and time required for their use. The fact that nearly two-thirds of the substructures are found with maximum recall by MF2A and MF3A suggests that an additional data class in this mass range would be helpful. We propose to add one covering m/e 47-102 (data class 2B), in part compensating for the extra computer time and storage requirements through reductions in the number of peaks used in other data classes (Table 11).

TABLE 11. RECOMMENDED DATA CLASSES FOR CHARACTERISTIC IONS

Data class (a)    Maximum number of peaks        Mass range
2A                4 even-mass, 4 odd-mass          6 - 88
2B                8                               47 - 102
3A                7                               61 - 116
3B                7                               89 - 158
4A                6                              117 - 200
4B                6                              159 - (M - 1)
11.1 (b)          2A + 2B + 3A + 3B + 4A + 4B

(a) Note that classes 2A, 3B, and 4B correspond closely to the original (reference 3) data classes 2, 3, and 4, respectively, except for the increased number of peaks.
(b) To replace the original data class 10.

Combination Match Factors

It has been shown (vide supra) that an arithmetic combination of the MF1 - MF6 match factor values (the "overall match factor," MF11) gives significantly higher average values of both recall and precision for substructure identification than those found for the individual data classes. Two combinations of characteristic ion match factors were used in this study, one employing MF2A plus MF3A through MF4C (MF11.1), the other MF2A' plus MF3A through MF4C (MF11.2). As expected from the MF2A and MF2A' results discussed above, there is no significant difference between the results for these two different combinations (Table 9), and only the MF11.1 results will be discussed. As found for MF11, combining individual MF values substantially improves the precision of the results, bringing the performance up to the predicted 98% confidence level. The recall value of 47% for MF11.1 is far superior to that of any of the individual characteristic ion data classes, and this should increase with any improvements resulting from modifications of the data classes recommended in Table 11. The average recall value of 47% for MF11.1 is also substantially higher than the 40% value for MF10, which uses two peaks every 14 mass units from 90 to M+. Therefore it is recommended that MF11.1 be used in place of MF10 in STIRS; this will effect a substantial saving in the bulk data storage requirement, as the MF10 data must be stored in variable length records.
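The peak selection implied by Table 11 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the spectrum layout (a mapping of m/e to relative abundance), the function names `largest_peaks` and `select_peaks`, the tie-breaking rule, and the reading of the 4B upper limit as m/e M - 1 are all hypothetical choices, not the STIRS implementation. Only the mass ranges and peak counts come from Table 11.

```python
from typing import Dict, List, Optional, Tuple

# A spectrum is modelled as {m/e: relative abundance}; this layout is an
# assumption for illustration, not the STIRS file format.
Spectrum = Dict[int, float]

# (data class, low m/e, high m/e or None for M - 1, max peaks, even/odd split)
# Ranges and peak counts transcribed from Table 11.
DATA_CLASSES: List[Tuple[str, int, Optional[int], int, bool]] = [
    ("2A", 6, 88, 8, True),       # 4 even-mass + 4 odd-mass peaks
    ("2B", 47, 102, 8, False),
    ("3A", 61, 116, 7, False),
    ("3B", 89, 158, 7, False),
    ("4A", 117, 200, 6, False),
    ("4B", 159, None, 6, False),  # upper limit taken as M - 1
]


def largest_peaks(spectrum: Spectrum, low: int, high: int, n: int) -> List[int]:
    """Return the m/e values of the n most abundant peaks with low <= m/e <= high."""
    in_range = [(mass, ab) for mass, ab in spectrum.items() if low <= mass <= high]
    # Most abundant first; ties broken by lower mass (an arbitrary choice).
    in_range.sort(key=lambda peak: (-peak[1], peak[0]))
    return sorted(mass for mass, _ in in_range[:n])


def select_peaks(spectrum: Spectrum, mol_weight: int) -> Dict[str, List[int]]:
    """Pick the characteristic-ion peaks for each Table 11 data class."""
    selected: Dict[str, List[int]] = {}
    for name, low, high, max_peaks, split in DATA_CLASSES:
        upper = mol_weight - 1 if high is None else high
        if split:
            # Class 2A: equal numbers of the largest even- and odd-mass peaks.
            half = max_peaks // 2
            even = {m: a for m, a in spectrum.items() if m % 2 == 0}
            odd = {m: a for m, a in spectrum.items() if m % 2 == 1}
            peaks = largest_peaks(even, low, upper, half)
            peaks += largest_peaks(odd, low, upper, half)
            selected[name] = sorted(peaks)
        else:
            selected[name] = largest_peaks(spectrum, low, upper, max_peaks)
    return selected


if __name__ == "__main__":
    # Made-up abundances, loosely shaped like a long-chain aliphatic spectrum.
    toy = {41: 60, 43: 100, 55: 50, 56: 45, 57: 40, 69: 30, 70: 35, 83: 25, 97: 15}
    for data_class, masses in select_peaks(toy, mol_weight=172).items():
        print(data_class, masses)
```

Because the ranges overlap, one peak can be selected by several data classes, which is the property the overlapping-range results above rely on.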
Although the original version of STIRS did not have a match factor combining MF2, 3, and 4 to which MF11.1 can be directly compared, the average recall using MF11.1 actually approaches that for MF11, and the 183 substructures found with non-zero recall is a slight increase over the number for MF11. The substructures for which the MF11.1 recall values were substantially improved over those of MF11 are shown in Table 12. One-third of those substructures for which MF11 had zero recall can be identified in some cases by MF11.1. The substructures benzphenanthrene-3,4- and indazole were identified correctly by MF11.1 for every compound in the file (9 and 18 compounds, respectively) containing the substructure. Because the MF11 calculation includes neutral loss and ion series contributions, its performance should be substantially improved when the new data classes (MF2A - 4B) are included. The use of arithmetic combinations of data classes, such as MF11 and 11.1, for matching takes no disk storage and little additional calculation time, yet appears to provide the most important and reliable structural information of all the STIRS results.

TABLE 12. SUBSTRUCTURES FOR WHICH MF11.1 HAS SUBSTANTIALLY SUPERIOR RECALL

                                                  Recall, %
WLN               Substructure                 MF11    MF11.1
ZR BV             anthraniloyl                    0       25
MVR               benzamido                       0       10
SHR               benzenethiol                    0        8
T B656 EN HMJ     carboline, beta                 0       60
T B656 HOJ        dibenzofuran                    0       60
T66 CNJ           isoquinoline                    0       11
QR B1U            salicylidene                    0        8
SWQ               sulfonic acid                   0       14
T56 BN DOJ        benzoxazole                    55       75
L C666J           anthracene                     14       43
L 06 B666J        benzphenanthrene-3,4-          88      100
T56 BN DSJ        benzothiazole                  36       70
V1U2              crotonyl                       17       48
T56 BMNJ          indazole                       50      100
T56 BMT&J         indoline                       33       66
T56 ANJ           indolizine                     63       83
T5NQJ             isoxazole                      43       66
V1V               malonyl                        23       43
T 0666 BN INJ     phenazine                      47       67
T C666 BM ISJ     phenothiazine                  60       90
T56 BVMVJ         phthalimide                    37       63
T5MTJ             pyrrolidine                    20       40
1U1               vinyl                          23       46

Examples

A statistically valid comparison of STIRS and manual interpretive methods was not attempted, as it would hardly be feasible for a human interpreter to examine several hundred spectra to predict the presence of each of 204 substructures. STIRS has been designed to be an aid to, not a replacement for, the interpreter. For STIRS predictions of substructures, it should be kept in mind that a 99% probability of presence is also a 1% probability that the substructure is absent; because there are several hundred acceptable (>94% precision, >30% of MF11 or MF11.1 recall) combinations of the 13 match factors and 204 substructures for which STIRS makes predictions, this probability will, on average, produce several predictions which are incorrect, at least on a strict chemical basis. Thus the interpreter must evaluate the STIRS substructure predictions in light of the probable significance of the other STIRS results, other mass spectral data, and other information available on the unknown. The incorrect prediction of a particular substructure often occurs because the mass spectral data typical of the substructure resemble those of the correct answer; thus a mass spectrometrist would not be surprised if the STIRS results for the low mass characteristic ions (MF2A) indicated phenyl for a pyridine compound, even though a chemist would call this substructure an incorrect answer. Perhaps the aid to interpretation supplied by the improved characteristic ion data classes can best be illustrated by a few "unknown" examples.

The compound 1-undecanol (WLN, Q11), when analyzed by STIRS, gives confidence values of greater than 99% for the presence of two substructures: alkyl chain 3 carbons (by MF3B, MF11.1 and MF11) and hydroxyl (Q, by MF11). Incorrect indications of S and SH by MF2A and MF11.1, respectively, are not entirely misleading, as for these data classes such compounds show behavior similar to that of alcohols.
The other incorrect predictions are cyclopentyl by MF4A and cyclohexanone by MF2 4 4. The top 15 compounds retrieved by MF11.1, which uses only characteristic ions, are five alkan-1-ols, three of the corresponding thiols, and five n-alk-1-enes. The mass spectra of alkan-1-ols are characterized by an initial loss of water, making their characteristic ion data similar to those from the spectra of alk-1-enes. However, the top 15 compounds of the MF11 results, which in addition use information on neutral losses, are all alcohols, eight of which are n-alkan-1-ols. Thus the STIRS results should substantially help the interpreter in obtaining the structure of the unknown.

For the compound o-hydroxyphenyl tert-butyl sulfone (WLN, QR BSWX) STIRS indicated at the >99% confidence level the following substructures: hydroxyl (Q, by MF3A, MF11.1), phenyl (R, by MF3, MF11.1 and MF11), hydroxyphenyl (QR, by MF11.1 and MF11), sulfonyl (SW, by MF3A, MF3B, MF4A, MF11.1 and MF11), sulfur (S, by MF11.1 and MF11), and double branch (X, by MF2A). Incorrect indications of ester (VO, by MF8), linking oxygen (O, by MF7 and MF11), benzoate (OVR, by MF11), and salicyloyl (QR BV, by MF1 and MF11) result from the similar mass spectral behavior of a molecular fragment such as o-HOC6H4-CO2-, for which these substructures would be indicative, and the fragment o-HOC6H4-SO2- of the correct structure. The indication of cyclopropyl (L3TJ, by MF4B) is incorrect and unrelated to the actual structure. The top 8 compounds retrieved by MF11.1 and the top 4 retrieved by MF11 are 2-hydroxyphenylsulfones. Thus in this example the characteristic ion data classes, which are combined in MF11.1, actually give the best indication of structure. It was previously found that the overall match factor, MF11, gives the most reliable substructure predictions; it appears that MF11.1 is also a much better indicator than the individual match factors.
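The stray high-confidence indications in these examples are in line with the earlier remark that a 99% probability of presence is also a 1% probability of absence. A minimal arithmetic sketch follows; it assumes, as a simplification not stated in the report, that each acceptable match factor/substructure combination behaves as one independent prediction at the same confidence level. The function names and the counts of 100, 300, and 500 (standing in for "several hundred") are illustrative only.

```python
def expected_incorrect(n_predictions: int, confidence: float) -> float:
    """Expected number of wrong calls among n predictions at one confidence level."""
    return n_predictions * (1.0 - confidence)


def prob_any_incorrect(n_predictions: int, confidence: float) -> float:
    """Chance that at least one of n independent predictions is wrong."""
    return 1.0 - confidence ** n_predictions


if __name__ == "__main__":
    # "Several hundred" combinations; the specific counts are assumptions.
    for n in (100, 300, 500):
        print(n, expected_incorrect(n, 0.99), round(prob_any_incorrect(n, 0.99), 3))
```

Under these assumptions the expected number of incorrect predictions grows linearly with the number of combinations, which is why individual indications must be weighed against the other STIRS results rather than accepted singly.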
For the compound methyl 8-phenylnonanoate (WLN, 1YR&6VO1) STIRS gives substructure assignments at the >99% confidence level for: alkyl chain 3 carbons (by MF3B, MF11.1 and MF11), phenyl (R, by MF3A, MF11.1 and MF11), and single branch (Y, by MF3A and MF11.1). Indications of acrylyl (V1U1, by MF4B) and the 6,7-dehydro steroid skeleton (L E5 B666 LUTJ, by MF4A) are errors from 99% confidence level predictions. Surprisingly, the only high confidence prediction of the ester function came from MF4C, which is being dropped (vide supra) because of generally poor recalls. In the MF11.1 and MF11 results, nine and five, respectively, of the top 15 compounds are 2-phenylalkanes, and two and four, respectively, are methyl esters of phenylnonanoic acid. Thus the results from the characteristic ions (MF11.1) nicely complement those from the overall match factor (MF11); by using the STIRS results and deducing the molecular weight the interpreter should easily be able to obtain at least a close approximation to the molecular structure.

Improvement of Other STIRS Data Classes

The substantial increases in recall for substructures and the valuable information obtained from the use of the new overall match factor MF11.1 suggest that such modifications could also be used to improve the performance of other STIRS data classes, again applying the statistical methods to evaluate the success of such modifications. At present we are studying the effect of adding overlapping mass ranges and a special overall match factor to the neutral losses data class.

Implementation of STIRS Improvements

The substructure and data class improvements have been implemented and are operational on our laboratory PDP-11/45 computer, and thus could of course be made available over the long-distance phone link to outside users. This is made possible in direct data transmission by a program written by Dr. Walter M. Shackelford of ERL, Athens.
For obvious reasons we would rather have outside users use either of the generally available STIRS systems on the Cornell IBM-370/168 over TYMNET, or on the NIH PDP-10 computer; for the latter contact Dr. G. W. A. Milne of NIH or Dr. S. R. Heller of EPA, Washington, DC. We will take the responsibility of implementing the substructure and data class improvements on the Cornell IBM-370/TYMNET system, however, early in 1976. Note that in particular EPA laboratories the PDP-8 computers used for data acquisition and reduction on the GC/MS systems now have the capability of direct transmission of unknown mass spectra over the TYMNET phone link to the Cornell PBM/STIRS system. This has the added advantage that for at least the near future the Cornell PBM/STIRS system will have reference spectra of a few thousand more compounds than any other available system.

Testing of PBM/STIRS by EPA

As emphasized in the recommendations, we feel that it is highly important that the combined PBM/STIRS system be tested on real unknowns by qualified personnel over an extensive period. Initial testing of the STIRS system on the NIH PDP-10 has been extremely disappointing for a variety of reasons. Poor communications make it difficult to resolve problems of user education. There appear to be misunderstandings concerning the "interpretive" nature of STIRS; all of the top compounds selected in each data class should be examined for structural consistencies, and one (or more) trial molecular weight(s) should be entered, even if it is just an educated guess. Inordinately long computer times (~15 minutes per spectrum) are required; these presumably give costs of $200 or more per spectrum, which severely lower the probability that STIRS will prove useful in a test on this computer.
It is strongly recommended that the Cornell PBM/STIRS system on TYMNET be used instead; PBM has been designed specifically as a prefilter to STIRS, so that the two together are a much more effective system than either separately. Certainly the improved version of STIRS described above should be used in this extensive test; for example, note the nearly "perfect" information precision and recall values for the carcinogenic substructure benzphenanthrene-3,4- (Table 12). Further, there are full-time personnel in Cornell's Office of Computer Services who are interested in ensuring that these systems are working properly, and that the user receives the information he needs. The wide acceptance which the Cornell system has already received from experienced mass spectrometrists is a strong indication of the value of PBM/STIRS. As an alternative, if a matching system such as PBM could be implemented in real time on the GC/MS computers in the EPA laboratories, this important prefiltering would then be done, and only spectra unidentified by PBM could then be submitted to STIRS, as has been outlined in Section II.

SECTION VII

REFERENCES

1. Pesyna, G. M., and F. W. McLafferty. Computerized Structure Retrieval and Interpretation of Mass Spectra. In: Determination of Organic Structures by Physical Methods, Nachod, F. C., J. J. Zuckerman, and E. W. Randall (Eds.). New York City, Academic Press, 1975. Volume 6.
2. Salton, G. Automatic Information Organization and Retrieval. New York, McGraw-Hill, 1968.
3. McLafferty, F. W., R. H. Hertel, and R. D. Villwock. Probability Based Matching of Mass Spectra. Rapid Identification of Specific Compounds in Mixtures. Org. Mass Spectrom. 9:690, 1974.
4. Abramson, F. P. Anal. Chem. 47:45, 1975. See also: Abramson, F. P. Proc. of the 21st Ann. Conf. on Mass Spec. and Allied Topics. San Francisco, ASMS, 1973. p. 76; Abramson, F. P., and M. F. Schulman. Proc. of the 22nd Ann. Conf. on Mass Spectrom. and Allied Topics. Philadelphia, ASMS, 1974. p. 453.
5. Pesyna, G. M., F. W. McLafferty, R. Venkataraghavan, and H. E. Dayringer. Statistical Occurrence of Mass and Abundance Values in Mass Spectra. Anal. Chem. 47:1161, 1975.
6. Stenhagen, E., S. Abrahamsson, and F. W. McLafferty. Registry of Mass Spectral Data. New York City, Wiley-Interscience, 1974.
7. Freund, J. E. Mathematical Statistics. Englewood Cliffs, Prentice-Hall, 1962. p. 46.
8. Winer, B. J. Statistical Principles in Experimental Design. Second Edition. New York City, McGraw-Hill, 1972.
9. Siegel, S. Non-Parametric Statistics. New York City, McGraw-Hill, 1956. Chapter 2.
10. Kwok, K.-S., R. Venkataraghavan, and F. W. McLafferty. Computer-Aided Interpretation of Mass Spectra. III. A Self-Training Interpretive and Retrieval System. J. Amer. Chem. Soc. 95:4185, 1973.
11. Hertz, H. S., R. A. Hites, and K. Biemann. Anal. Chem. 40:681, 1971.
12. Pesyna, G. M. Computerized Structure Retrieval and Interpretation of Mass Spectra: The Design and Evaluation of a Probability Based Matching System Using a Large Data Base. Ph.D. Thesis. Ithaca, Cornell University, 1975. 269 p.
13. Meyerson, S., and E. K. Field. J. Chem. Soc. (B). 1001, 1966.
14. Costello, C. E., H. S. Hertz, T. Sakai, and K. Biemann. Clin. Chem. 20:255, 1974.
15. Dayringer, H. E. Computer-Aided Interpretation of Mass Spectra: An Improved STIRS Program Giving Information on Substructure Probabilities. Ph.D. Thesis. Ithaca, Cornell University, 1976. 162 p.
16. Kent, P., and T. Gäumann. Helv. Chim. Acta. 58:787, 1975.
17. McLafferty, F. W., M. A. Busch, K.-S. Kwok, B. A. Meyer, G. Pesyna, R. C. Platt, I. Sakai, J. W. Serum, A. Tatematsu, R. Venkataraghavan, and R. G. Werth. A Self-Training Interpretive and Retrieval System for Mass Spectra. The Data Base. In: Mass Spectrometry and NMR Spectroscopy in Pesticide Chemistry, Biros, F. J., and R. Haque (Eds.). New York, Plenum Press, 1974. p. 49.
18. Isenhour, T. L., B. R. Kowalski, and P. C. Jurs. Critical Review Anal. Chem. 4:1, 1974.
19. Justice, J. B., and T. L. Isenhour. Anal. Chem. 46:223, 1974.
20. Tunnicliff, D. D., and P. A. Wadsworth. Anal. Chem. :12, 1973.
21. Franzen, J., and H. Hillig. Adv. Mass Spectrom. 6:991, 1974.
22. Chemical Substructure Dictionary, Institute for Scientific Information, Philadelphia, 1974.

SECTION VIII

LIST OF PUBLICATIONS

References 1, 3, 5, 12, 15, and 17 are publications which have resulted from this research grant. In addition, the following articles have either been published or submitted for publication:

McLafferty, F. W., R. Venkataraghavan, K.-S. Kwok, and G. Pesyna. A Self-Training Interpretive and Retrieval System for Mass Spectra. Adv. Mass Spectrom. 6:999, 1974.

Dayringer, H. E., G. M. Pesyna, R. Venkataraghavan, and F. W. McLafferty. Computer-Aided Interpretation of Mass Spectra. Information on Substructural Probabilities from STIRS. Org. Mass Spectrom. (accepted).

Venkataraghavan, R., G. M. Pesyna, and F. W. McLafferty. Computer Identification and Interpretation of Unknown Mass Spectra Utilizing a Computer Network System. In: Computer Networks and Chemistry, Lykos, P. (Ed.). Washington, American Chemical Society, 1975. p. 183.

Pesyna, G. M., R. Venkataraghavan, H. E. Dayringer, and F. W. McLafferty. A Probability Based Matching System Using A Large Collection of Reference Mass Spectra. Anal. Chem. (submitted).

Dayringer, H. E., and F. W. McLafferty. Computer-Aided Interpretation of Mass Spectra. Increased Information From Characteristic Ions. Org. Mass Spectrom. (submitted).

TECHNICAL REPORT DATA
(Please read Instructions on the reverse before completing)

1. REPORT NO.: EPA-600/4-76-046
2.
3. RECIPIENT'S ACCESSION NO.:
4. TITLE AND SUBTITLE: Computer Interpretation of Pollutant Mass Spectra
5. REPORT DATE: October 1976 (Issuing date)
6. PERFORMING ORGANIZATION CODE:
7. AUTHOR(S): Fred W. McLafferty
8. PERFORMING ORGANIZATION REPORT NO.:
9. PERFORMING ORGANIZATION NAME AND ADDRESS: Cornell University, Department of Chemistry, Ithaca, NY 14853
10. PROGRAM ELEMENT NO.: 1BA027
11. CONTRACT/GRANT NO.: R-801106
12. SPONSORING AGENCY NAME AND ADDRESS: Environmental Research Laboratory, Office of Research and Development, U.S. Environmental Protection Agency, Athens, Georgia 30601
13. TYPE OF REPORT AND PERIOD COVERED: Final, 10/1/72 - 9/30/75
14. SPONSORING AGENCY CODE: EPA-ORD
15. SUPPLEMENTARY NOTES:
16. ABSTRACT:

The objective of this research was to improve systems for computer examination of the mass spectra of unknown pollutants. For this we have developed a new probability based matching (PBM) system for the retrieval of mass spectra from a large data base, and have substantially improved the interpretation of unknown mass spectra using the self-training interpretive and retrieval system (STIRS). PBM was designed as a prefilter to STIRS; if an unknown mass spectrum can be identified with a sufficiently high confidence by PBM, interpretation of the spectrum using STIRS is not necessary. The PBM system provides more efficient retrieval than presently accepted systems; it incorporates a "reverse search" algorithm, and through the use of weighted mass and abundance data provides a statistically valid prediction of the confidence of the matches found. STIRS has been improved to give a confidence-level prediction of the presence of ~200 particular substructural features in the unknown molecule. Extensive studies have been made to improve the data selection for most data classes used by STIRS, resulting in a much higher level of overall system performance. Operation efficiencies of both PBM and STIRS have been improved dramatically so that both require less than 1 minute on a laboratory PDP-11/45 computer. STIRS has been made available for outside use by long-distance phone connections to this PDP-11/45, and recently both PBM and STIRS have been made operational on the Cornell IBM-370/168 so that these are available internationally over the TYMNET computer network system.

17. KEY WORDS AND DOCUMENT ANALYSIS
    a. DESCRIPTORS: Computer programming; Mass spectra; Algorithms; Information retrieval; Chemical analysis
    b. IDENTIFIERS/OPEN ENDED TERMS:
    c. COSATI Field/Group: 07B
18. DISTRIBUTION STATEMENT: Release Unlimited
19. SECURITY CLASS (This Report): Unclassified
20. SECURITY CLASS (This page): Unclassified
21. NO. OF PAGES: 67
22. PRICE:

EPA Form 2220-1 (9-73)
USGPO: 1976 - 757-056/5421 Region 5-II