United States Environmental Protection Agency
Environmental Research Laboratory
Athens GA 30613

Research and Development
EPA-600/S4-83-029 Sept. 1983

Project Summary

A Computer Survey of GC/MS Data Acquired in EPA's Priority Pollutant Screening Analysis: System and Results

W. M. Shackelford, D. M. Cline, F. O. Burchfield, L. Faas, G. Kurth, and A. D. Sauter

The screening analysis phase of the best available treatment (BAT) review of wastewater treatment techniques by EPA was initiated to assess 21 industrial categories for the 129 "priority pollutants." Implicit in the purpose of the screening analysis for these pollutants was the notion that the raw gas chromatography/mass spectrometry data would be saved for later evaluation for compounds not on the priority pollutant list. To this end, a system of computer programs was built that automatically extracted the pure spectra of components in a GC/MS run, matched the spectra against a reference library, and dealt appropriately with matched and unmatched spectra. Matched components were entered into a database for statistical studies to determine their priority for further study. Unmatched spectra were compared to each other to find recurring unknowns so that priorities for ab initio identifications could be set. Component software was obtained from Stanford University (CLEANUP) and Cornell University (PBM); some software was written at the Athens Environmental Research Laboratory.

The automated survey techniques appeared to work well on most of the GC/MS data. The system was efficient and cost effective for tentative identification of the major components in the samples.

This Project Summary was developed by EPA's Environmental Research Laboratory, Athens, GA, to announce key findings of the research project that is fully documented in a separate report of the same title (see Project Report ordering information at back).

Introduction

In June 1976, the U.S.
Environmental Protection Agency (EPA), as a result of court action by several environmental groups, was directed by a Consent Decree from the U.S. District Court in the District of Columbia to assess the wastewater of 21 industrial categories for 65 chemical substances and to prescribe the best available treatment (BAT) for the effluent. To begin the task, a scheme for analysis of the wastewaters for the 65 substances had to be designed. Although some of the 65 substances were unique chemical compounds, many included whole classes of compounds (e.g., polynuclear aromatic hydrocarbons). Realizing that such classes of compounds could contain literally hundreds of individual members, EPA included for analysis only those members that had been previously identified a significant number of times, were produced in quantity by industry, and were available as analytical standards. The now familiar 129-compound priority pollutant list was the result of this work.

Even though the list of 129 specific substances made the analysis task manageable, the plaintiffs in the court action were concerned that some members of the chemical classes not on the 129-compound list would be missed in the analysis procedure. Because it was generally agreed that computerized gas chromatography/mass spectrometry (GC/MS) would be the analysis tool of choice, the advantage of saving all raw GC/MS data for later processing to look for compounds other than the priority pollutants became obvious. The state-of-the-art GC/MS instrumentation includes a computer system; thus, the data would be saved in computer-readable format for later study. Magnetic tape, the cheapest mass storage medium, was chosen for recording all GC/MS data from sample analysis.

Initial analysis of each sample at the laboratories operating under EPA contract was to be directed only toward compounds among the 129 priority pollutants.
Although EPA might have contracted for a general survey of all compounds in each sample, a number of limiting factors precluded this approach:

• Cost of a general survey was estimated at $2000 per sample versus $700 per sample for the limited analysis.

• Time was extremely important. Although the data acquisition times for general survey and specific analysis are the same, data-evaluation times could be 5 to 10 times longer for survey analysis (if only computer matching were required for identification). Decreasing the number of samples per unit time by a factor of 5 to 10 would have played havoc with the court-ordered deadlines.

• Management of the large volumes of unconfirmed data would have required a massive secondary effort to confirm and collate the results of the survey analysis.

By requiring that all data from each GC/MS acquisition be sent to a central location for survey processing, EPA assured proper management of non-priority pollutant data and at the same time obtained timely response on the priority pollutants directly from contractor laboratories at a reasonable cost. Because all parties involved in the Consent Decree had agreed that the non-priority pollutant data were of less immediate need, no part of the spirit of the Consent Decree was sacrificed; yet provision was made for assessing all the data for compounds other than the 129.

The analysis laboratories were required to supply each sample extract along with the GC/MS data as a second provision for possible later analysis of the sample. Thus, should some compound be tentatively identified in the GC/MS data, it could be confirmed by reanalysis of the corresponding extract. Also, recurring components not identifiable from their mass spectra possibly could be identified using another analysis technique on the saved extract.

The screening analysis phase of BAT review was expected to require the qualitative/semi-quantitative analysis of about 4000 samples.
Each sample analysis involved GC/MS data acquisition for at least five fractions: a volatile organics analysis (VOA), a VOA blank, an extractable base/neutral (B/N), an extractable acid (ACI), and a direct aqueous injection (DAI). Other blank, standard, and pesticide confirmation runs also were needed. All calculations for the task were based on 20,000 GC/MS runs (4000 samples x 5 fractions). Considering that each GC/MS run was expected to contain some 500 to 1000 individual spectra, the magnitude of the task of evaluating these data is evident.

Implicit in this data evaluation task was the development of a computer system that might evaluate the data in a manner comparable to a human using computer-aided spectrum extraction and spectra matching to tentatively identify all sample components. An additional goal was the identification of those spectra which did not match any spectrum in the reference library yet were seen in multiple GC/MS runs. Thus, a library of compounds tentatively identified in each industrial category, as well as a library of recurring but unidentified spectra, were to be generated for use in effluent regulation. Also, the data in both libraries were to be studied in a subsequent project, which will reanalyze the saved extracts. Tentative identifications made in that project could be confirmed by comparison with standards, and recurring but unidentified spectra could be examined for ab initio determination of compound identity.

System Description

The PDP 11/70-based GC/MS Data Survey System consisted of computer hardware and software programmed to accomplish the following functions:

1. Inventory all incoming magnetic tapes and sample extracts.
2. Copy the data on each magnetic tape to a second tape in an internal use format and plot the reconstructed gas chromatogram.
3. Retrieve data as necessary from tapes in batch mode.
4. Extract the spectra of components in each GC/MS run from the background spectra in the run.
5. Match the extracted spectra with a library of reference spectra.
6. Check if matched spectra have been seen before under the same circumstances.
7. Check spectra that are not matched against their fellow unmatched spectra.
8. Generate reports on the numbers of matched spectra by industry, fraction type, analytical laboratory, GC/MS run conditions, etc.
9. Provide graphics capability necessary to view the data from any run.
10. Search any run for specific compounds.

Inventory System

To inventory and track the 20,000 GC/MS data runs and the estimated 12,000 extracts (a B/N, ACI, and pesticide fraction for each of 4000 samples), a database management system was implemented. This system was the INFORM management program, a well-known tool for database management. In INFORM were kept the GC/MS data run descriptors that allowed physical location of each run and corresponding extract and all available information about the sample.

As each magnetic tape or extract was received at the Athens Laboratory, it was manually entered into the INFORM system. Important parameters entered for each data run were the tape on which it was found, the EPA sample number, an Athens Laboratory run number, the fraction type, and various GC/MS parameters. The corresponding data for the extract included all of this information and the precise location of the extract in a freezer.

During the inventory process, data received from contractor laboratory tapes were copied onto an Athens Laboratory tape in a format that was both more space-efficient and damage resistant. Thus, the original tape and a backup copy were saved. The backup copy, which had only the Athens Laboratory number for identification of each run, was used for all data processing needs. Confidentiality of the data was maintained through the use of the backup copy so that descriptive data were not associated with the GC/MS data.
Software that had access to both descriptive and GC/MS data was password protected.

At the time of tape conversion from the contractor's format to the Athens Laboratory format, the data of each run were scanned and a reconstructed gas chromatogram (RGC) plotted. The RGCs were then bound in volumes to serve as references at the time runs were submitted for analysis. Inspection of an RGC by a chemist might result in the discarding of the corresponding run because of obvious flaws such as absence of peaks or premature end of data.

When data were to be processed, a chemist identified the runs that passed visual inspection for processing. Software then retrieved the designated runs from the magnetic tape and prepared each run in turn for processing by the analytical system. The inventory system was reapplied when the run had been processed and descriptors contained in INFORM were necessary for reporting.

Analytical System

The analytical system consisted of four main parts: the internal standard locator, PEAK; the peak or spectrum extractor, CLEANUP; the spectrum matching system, PBM; and the result collator. Chemists had opportunities at various points during the process to make decisions that could end processing or affect further processing of any given component spectrum. Ideally, data analysis proceeded with minimal operator intervention. Only when the analytical system was presented with decisions that it was unqualified to make did the chemist intervene.

The program PEAK was developed to assure identification of internal standard location in each run. Because all subsequent processing of the data required knowledge of the internal standard in the run, it was imperative that software be available that would unambiguously define the location and area of the internal standard peak in each data run.

CLEANUP is a system of programs developed at Stanford University that finds and extracts the spectra of components in a GC/MS data run.
Successive 16-scan windows are searched for ion peaks that have 2 ascending points, a maximum, and 2 descending points. When an ion peak is found, successive ions from mass 40 to 400 are checked to see whether any maximize within a distance of ±1 scan number of the first found peak. When 8 or more such masses maximize simultaneously, a component peak is said to be detected. In this case, all the masses maximizing at this point are collected, their areas are normalized to the largest mass of the group, and they are passed along to the next phase of the analysis as a mass spectrum.

CLEANUP involves a number of checks to ensure that such artifacts as column bleed, noise spikes, and background are not chosen as sample components. Criteria are input at the start of processing to ensure that only ion peaks of a defined sharpness will be considered. This procedure will normally eliminate peaks caused by column bleed, which usually shows up in the form of broad peaks. Noise spikes, which are generally of only one- or two-scan duration, are guarded against by requiring a minimum of four scans in the ion peaks. Instrumental background noise caused by pump oil or other contaminants normally does not peak during a run; therefore, it does not interfere with the CLEANUP process.

The spectra extracted by CLEANUP were passed to PBM, a library matching program developed at Cornell University under an EPA grant. PBM, or probability based matching, employs a reverse search technique to compare a reference library of condensed spectra to a similarly condensed unknown spectrum.

Reporting

Reporting is accomplished in two ways. The first system is a series of hard copy outputs that describe the flow of data through the total system and the results generated from the data. The contents of the historical library can be printed out either in totality or as a listing of unique entries.
The data can be sorted by parameters such as CAS number, relative retention time (RRT), GC column, analysis laboratory, industrial category, relative concentration, etc.

A second method for reporting was a graphics system that allowed the chemist to recall data and plot it in various ways. For instance, the raw data for a spectrum, the cleaned up spectrum, and the reference spectrum can all be plotted on the same screen simultaneously. The extracted ion current profile (EICP) for any ion can be plotted between any scan limits. Multiple EICP plots can be displayed on the screen. The graphics system is used by chemists to evaluate ambiguous results from the computer analysis.

Extraction Results

The extraction of information-containing spectra from the mass of spectra in a GC/MS run is the key to a successful automated system for GC/MS data analysis. Figure 1 shows an RGC of a group of 11 phenols. Scans 213 and 215 can be seen to be on opposing sides of an apparent single component peak. Manual subtraction of a baseline spectrum (e.g., 208 or 220) from spectrum 214 results in a spectrum that is not recognizable as any of the components injected. The use of CLEANUP to find spectra, however, reveals that the peak is actually the sum of two components. Figure 2 shows the resultant spectra of 213 and 215. Also depicted are spectra from the reference library that establish the identity of the two components.

Figure 1. RGC of 11 component phenol standards. Arrow indicates apparent single component peak that is actually the sum of two components. (Plot not reproduced; axes: scan number, 50 to 500, and time, 0.05 to 25.00 min.)

Although this example represents an ideal case in which standards were used with no interferences, it does serve to illustrate the ability of CLEANUP to separate components eluting within two scans of each other.
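How a co-maximization test can pull apart two components that elute two scans apart, as in the scan 213/215 example, can be sketched in a few lines of Python. This is only an illustration of the rule summarized under Analytical System (an ion peak needs 2 ascending points, a maximum, and 2 descending points; a component is declared when 8 or more masses maximize within ±1 scan); it is not the Stanford CLEANUP code. The function names, the data layout (a dict mapping m/z to a list of intensities per scan), the greedy grouping order, and the synthetic triangular peaks are all assumptions made for the example. The sketch also omits CLEANUP's sharpness and background checks and normalizes peak heights rather than areas.

```python
from collections import Counter

MIN_MASSES = 8      # masses that must maximize together (value from the text)
SCAN_TOLERANCE = 1  # +/- scans around the peak scan (value from the text)


def ion_maxima(trace):
    """Scan indices where one ion's trace shows 2 ascending points,
    a maximum, and 2 descending points."""
    return [
        i for i in range(2, len(trace) - 2)
        if trace[i - 2] < trace[i - 1] < trace[i] > trace[i + 1] > trace[i + 2]
    ]


def detect_components(ion_traces):
    """ion_traces maps m/z -> list of intensities per scan (assumed layout).
    Returns [(scan, {m/z: height relative to the largest mass = 100})]."""
    maxima = {
        (mz, scan): trace[scan]
        for mz, trace in ion_traces.items()
        for scan in ion_maxima(trace)
    }
    votes = Counter(scan for _, scan in maxima)
    claimed = set()
    components = []
    # Assumption: try the scans where the most ions maximize first.
    for center, _ in votes.most_common():
        group = {
            key: height for key, height in maxima.items()
            if abs(key[1] - center) <= SCAN_TOLERANCE and key not in claimed
        }
        if len(group) < MIN_MASSES:
            continue  # too few co-maximizing masses to call a component
        claimed.update(group)
        largest = max(group.values())
        spectrum = {mz: round(100 * h / largest, 1) for (mz, _), h in group.items()}
        components.append((center, spectrum))
    return sorted(components, key=lambda c: c[0])


def triangle(apex, height, n_scans=30, half_width=3):
    """Synthetic single-ion peak, used only for this demonstration."""
    return [height * max(0.0, 1 - abs(s - apex) / half_width) for s in range(n_scans)]


# Two synthetic components, 8 ions each, eluting two scans apart.
traces = {mz: triangle(13, 100 + mz) for mz in range(50, 58)}
traces.update({mz: triangle(15, 300 - mz) for mz in range(100, 108)})

found = detect_components(traces)
for scan, spectrum in found:
    print(scan, sorted(spectrum))
```

On this synthetic input the sketch reports two components, at scans 13 and 15, each with its own set of eight masses, even though the two peaks overlap in the summed chromatogram.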
The data presented thus far indicate that for the systems studied, automated techniques are at least the equal of manual techniques for pointing out components in the run and identifying them by spectrum matching with a reference library. Spectrum extraction and identification are not always so clear cut. As shown in Table 1, despite the fact that more peaks are found with the automated method, the ratio of identifications to peaks has decreased. In fact, as the number of components in a run increases, identification becomes more and more difficult even though the automated system apparently is able to deliver a spectrum for each component.

Spectrum-Matching Results

The spectrum-matching portion of the data analysis system has undergone the least modification. PBM has been evaluated in the literature and has been in use at the EPA Athens Laboratory for several years.

Selection of a database of reference spectra for use with PBM involved no small problem. Three databases were available: the Wiley collection, the National Bureau of Standards collection, and the EPA master collection. The Wiley library contains 30,476 spectra of 30,476 compounds; the NBS library 25,025 spectra of 25,025 compounds; and the EPA master list ~40,000 spectra of ~32,000 compounds. The EPA list is the master list of spectra from which the NBS library was taken. Because the GC/MS data used in this study came from a great variety of sources, it was thought that "duplicate" spectra, i.e., multiple spectra of the same compound that differ slightly due to run conditions, in the database would be of some help in the matching process.

Table 2 compares the matching ability of two databases on the same spectra. As can be seen, Database II (the EPA master database) enjoys a distinct advantage over Database I (the NBS library) for the cases mentioned in the table. Comparison of the matches suggested by the two databases with the manual identification shows the superior ability of Database II.
Data generated using the Wiley library showed similar shortcomings to the NBS library. In cases where the identical matched spectrum occurred in all the libraries (as was generally the case), no problems were observed. In some cases, when spectrum extraction or run conditions slightly affected the spectrum, a match occurred only if there were duplicate spectra. Thus, the EPA master library was chosen for use in this work.

Figure 2. Resultant spectra from scans 213 and 215 of Figure 1 compared to matching library spectra. (Plots not reproduced. File 1795.CLN, spectrum # 213, base peak 139, matched to CAS # 88755, ortho-nitrophenol, base peak 139; file 1795.CLN, spectrum # 215, base peak 122, matched to CAS # 105679, 2,4-dimethyl phenol, base peak 122. Axes: RI, 0 to 100, versus M/Z, 50 to 150.)

Table 1. Comparison of Automated vs. Manual Peak Extraction (Values in Parentheses Are Additional Components Identified by the Automated Method)

            Manual                        CLEANUP-PBM
  Peaks     ID's      % ID        Peaks     ID's      % ID
  30        18        0.60        46        18(6)     0.52
  18        7         0.39        44        7(4)      0.25
  31        11        0.35        41        11(2)     0.32
  26        10        0.38        43        10(2)     0.28

(% ID is tabulated as the fraction of peaks identified.)

Table 2. Comparison of PBM Matching Ability for the NBS Library (Database I) and the EPA Master Database (Database II)

  Database I                           Database II
  (K value, missing masses)            (K value, missing masses)       Manual ID
  Toluene (75+)                        Toluene (75+)                   Toluene
  7-Oxabicyclo[2.2.1]heptane (49+)     2-cyclohexene-1-ol (76+)        2-cyclohexene-1-ol
  Phthalide (56, -2)                   Methyl benzoate (69+)           Methyl benzoate
  Hexacosanoic acid (102, -3)          Octadecanoic acid (105+)        Octadecanoic acid

(+ means that the molecular ion was matched within the proper intensity tolerance.)

The reverse search approach of PBM is also useful for analyzing environmental samples with an automated system.
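What a reverse search offers for mixed spectra can be shown with a deliberately simplified sketch. In a reverse search the question is whether each reference spectrum is contained in the unknown, so extra peaks in the unknown (for example, from a co-eluting compound) do not count against a match. The scoring below, the fraction of reference peaks found at or above an intensity floor, is an assumption chosen for illustration and is not PBM's probability-based K value; the function names, the tolerance, and the three "library" spectra are invented, not taken from any real library.

```python
def reverse_match(reference, unknown, tolerance=0.5):
    """Fraction of reference peaks that appear in the unknown.

    reference, unknown: {m/z: relative intensity}. Intensities are scaled
    using the reference base peak; a reference peak counts as found when the
    unknown holds at least (1 - tolerance) of the expected intensity.
    Peaks present only in the unknown are ignored (the "reverse" part).
    """
    base_mz = max(reference, key=reference.get)
    if unknown.get(base_mz, 0) == 0:
        return 0.0
    scale = unknown[base_mz] / reference[base_mz]
    found = sum(
        1 for mz, inten in reference.items()
        if unknown.get(mz, 0) >= (1 - tolerance) * inten * scale
    )
    return found / len(reference)


def forward_match(reference, unknown):
    """For contrast: fraction of unknown peaks that the reference explains."""
    return sum(1 for mz in unknown if mz in reference) / len(unknown)


# Hypothetical reference spectra (invented values, not library data).
library = {
    "compound X": {75: 100, 77: 30, 110: 25, 39: 40},
    "compound Y": {97: 100, 99: 65, 83: 60, 61: 55},
    "compound Z": {91: 100, 92: 60, 65: 15},
}

# Unresolved mixture: X at full strength plus Y at 60 % strength.
mixed = dict(library["compound X"])
for mz, inten in library["compound Y"].items():
    mixed[mz] = mixed.get(mz, 0) + 0.6 * inten

scores = {name: reverse_match(ref, mixed) for name, ref in library.items()}
hits = [name for name, score in scores.items() if score == 1.0]
print(hits)
print(forward_match(library["compound X"], mixed))
```

In this toy case both X and Y are fully contained in the mixed spectrum and Z is rejected, whereas the forward comparison finds that X explains only half of the unknown's peaks.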
Figure 3 is an example of a mixed spectrum obtained by CLEANUP from a VOA standard run. Although the two compounds (cis-1,3-dichloropropene and 1,1,2-trichloroethane) are not resolved, PBM was able to match both compounds in this spectrum. This ability of a reverse search to recognize components of mixed spectra is clearly an advantage.

Figure 3. A. Mixed spectrum extracted by CLEANUP from a VOA standard run. B and C. Library spectra are indicated as matched by PBM. (Plots not reproduced. Mixed spectrum: base peak 75. Library spectra: CAS # 79005, 1,1,2-trichloroethane (vinyl trichloride), base peak 97; CAS # 10061015, cis-1,3-dichloropropene, base peak 75. Axes: RI versus M/Z, 40 to 140.)

Collator Results

The combination of MS data and retention indices provides a powerful tool in the automated extraction and identification of components in GC/MS data. Several other automated systems described in the literature rely on retention data as well as MS data for computer-aided identification of a known set of compounds. Although the match quality parameters vary from 45 to 100, the RRT variation is only 0.01 to 0.03 (Table 3). RRTs are particularly important in the identification of compounds such as alkanes or alcohols, which have little or no molecular ion and exhibit highly similar spectra.

Table 3. Comparison of RRT vs. K Variation for Selected Matched Compounds

  Compound            Range of RRT    Range of K
  Dioctylphthalate    0.03            45-100
  Phthalide           0.01            57-77
  Toluic acid         0.03            48-85

Summary

Automated survey techniques for processing GC/MS data appear to work well on most of the data encountered in this study. The sensitivity of the CLEANUP-PBM package is not as great as that of a reverse search for specific ions, but it is adequate for the tentative identification of the major components in a sample. CLEANUP-PBM is cost effective when compared to the procedure in which an operator finds peaks, subtracts background, then matches the spectrum manually or by using computer search.

Use of a historical library for cataloguing data collected over long periods of time aids identification immensely by adding the dimension of GC retention data. Confidence in a tentative identification is heightened if corroborating GC retention data are available. This combination of spectral and retention data can effectively catalogue and highlight recurring unidentified substances for future study.

Study continues in two areas of the CLEANUP-PBM package. First, the proper compensation of background by CLEANUP is of concern because errors in intensity calculations reduce the chance that PBM will find a match for the spectrum. Studies underway indicate that background compensation similar to that used in PEAK would be effective in CLEANUP. Implementing such a routine is under study.

Second, various parameters must be set for CLEANUP, and these are an area of concern. Parameters that pertain to peak shape and minimum area are inflexible during a given data run. Thus, changing chromatographic conditions that affect peak size and shape may cause CLEANUP to miss pertinent data. Automatic setting of these parameters during data processing is under study.

The EPA authors, W. M. Shackelford and D. M. Cline, are with Environmental Research Laboratory, Athens, GA 30613. F. O. Burchfield, L. Faas, G. Kurth, and A. D. Sauter are with Computer Science Corporation, Falls Church, VA 22046.

W. M. Shackelford is the EPA Project Officer (see below).

The complete report, entitled "A Computer Survey of GC/MS Data Acquired in EPA's Priority Pollutant Screening Analysis: System and Results," (Order No.
PB 83-220 111; Cost: $17.50, subject to change) will be available only from:

National Technical Information Service
5285 Port Royal Road
Springfield, VA 22161
Telephone: 703-487-4650

The EPA Project Officer can be contacted at:

Environmental Research Laboratory
U.S. Environmental Protection Agency
Athens, GA 30613