United States
Environmental Protection
Agency
Environmental Research
Laboratory
Athens GA 30613
Research and Development
EPA-600/S4-83-029 Sept. 1983
Project Summary
A Computer Survey of GC/MS
Data Acquired in EPA's Priority
Pollutant Screening Analysis:
System and Results
W. M. Shackelford, D. M. Cline, F. O. Burchfield, L. Faas, G. Kurth, and
A. D. Sauter
The screening analysis phase of the
best available treatment (BAT) review
of wastewater treatment techniques
by EPA was initiated to assess 21
industrial categories for the 129 "pri-
ority pollutants." Implicit in the purpose
of the screening analysis for these
pollutants was the notion that the raw
gas chromatography/mass spectrome-
try data would be saved for later evalua-
tion for compounds not on the priority
pollutant list. To this end, a system of
computer programs was built that
automatically extracted the pure spectra
of components in a GC/MS run, matched
the spectra against a reference library,
and dealt appropriately with matched
and unmatched spectra. Matched com-
ponents were entered into a database
for statistical studies to determine their
priority for further study. Unmatched
spectra were compared to each other
to find recurring unknowns so that
priorities for ab initio identifications
could be set. Component software was
obtained from Stanford University
(CLEANUP) and Cornell University
(PBM); some software was written at
the Athens Environmental Research
Laboratory.
The automated survey techniques
appeared to work well on most of the
GC/MS data. The system was efficient
and cost effective for tentative identification
of the major components in
the samples.
This Project Summary was developed
by EPA's Environmental Research Laboratory,
Athens, GA, to announce key
findings of the research project that is
fully documented in a separate report
of the same title (see Project Report
ordering information at back).
Introduction
In June 1976, the U.S. Environmental
Protection Agency (EPA), as a result of
court action by several environmental
groups, was directed by a Consent Decree
from the U.S. District Court in the District
of Columbia to assess the wastewater of
21 industrial categories for 65 chemical
substances and to prescribe the best available
treatment (BAT) for the effluent. To
begin the task, a scheme for analysis of the
wastewaters for the 65 substances had to
be designed.
Although some of the 65 substances
were unique chemical compounds, many
included whole classes of compounds (e.g.
polynuclear aromatic hydrocarbons). Real-
izing that such classes of compounds
could contain literally hundreds of individual
members, EPA included for analysis only
those members that had been previously
identified a significant number of times,
were produced in quantity by industry,
and were available as analytical standards.
The now familiar 129-compound priority
pollutant list was the result of this work.
Even though the list of 129 specific
substances made the analysis task man-
ageable, the plaintiffs in the court action
were concerned that some members of
the chemical classes not on the 129-
compound list would be missed in the
analysis procedure. Because it was generally
agreed that computerized gas chromatography/mass
spectrometry (GC/MS) would
be the analysis tool of choice, the advantage
of saving all raw GC/MS data for later
processing to look for compounds other
than the priority pollutants became obvious.
The state-of-the-art GC/MS instrumenta-
tion includes a computer system; thus, the
data would be saved in computer-readable
format for later study. Magnetic tape, the
cheapest mass storage medium, was
chosen for recording all GC/MS data from
sample analysis.
Initial analysis of each sample at the
laboratories operating under EPA contract
was to be directed only toward compounds
among the 129 priority pollutants. Although
EPA might have contracted for a general
survey of all compounds in each sample, a
number of limiting factors precluded this
approach:
• Cost of a general survey was esti-
mated at $2000 per sample versus
$700 per sample for the limited
analysis.
• Time was extremely important. Although
the data acquisition times for
general survey and specific analysis
are the same, data-evaluation times
could be 5 to 10 times longer for
survey analysis (if only computer
matching were required for identifi-
cation). Decreasing the number of
samples per unit time by a factor of 5
to 10 would have played havoc with
the court-ordered deadlines.
• Management of the large volumes of
unconfirmed data would have re-
quired a massive secondary effort to
confirm and collate the results of the
survey analysis.
By requiring that all data from each
GC/MS acquisition be sent to a central
location for survey processing, EPA assured
proper management of non-priority pol-
lutant data and at the same time obtained
timely response on the priority pollutants
directly from contractor laboratories at a
reasonable cost. Because all parties involved
in the Consent Decree had agreed
that the non-priority pollutant data were of
less immediate need, no part of the spirit
of the Consent Decree was sacrificed; yet
provision was made for assessing all the
data for compounds other than the 129.
The analysis laboratories were required
to supply each sample extract along with
the GC/MS data as a second provision for
possible later analysis of the sample. Thus,
should some compound be tentatively
identified in the GC/MS data, it could be
confirmed by reanalysis of the corresponding
extract. Also, recurring components,
not identifiable from their mass spectra,
possibly could be identified using another
analysis technique on the saved extract.
The screening analysis phase of BAT
review was expected to require the quali-
tative/semi-quantitative analysis of about
4000 samples. Each sample analysis in-
volved GC/MS data acquisition for at least
five fractions: a volatile organics analysis
(VOA), a VOA blank, an extractable base/
neutral (B/N), an extractable acid (ACI),
and a direct aqueous injection (DAI). Other
blank, standard, and pesticide confirmation
runs also were needed. All calculations for
the task were based on 20,000 GC/MS
runs (4000 samples x 5 fractions). Con-
sidering that each GC/MS run was expected
to contain some 500 to 1000 individual
spectra, the magnitude of the task of
evaluating these data is evident.
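The planning figures quoted above can be checked with a few lines of arithmetic. This is only an illustrative sketch; the function names are hypothetical, and the inputs are the sample count, fraction count, spectra-per-run range, and per-sample costs stated in the text.

```python
# Planning figures quoted in the text.
SAMPLES = 4000
FRACTIONS_PER_SAMPLE = 5            # VOA, VOA blank, B/N, ACI, DAI
SPECTRA_PER_RUN = (500, 1000)       # expected range per GC/MS run
COST_GENERAL_SURVEY = 2000          # dollars per sample
COST_LIMITED_ANALYSIS = 700         # dollars per sample


def workload(samples=SAMPLES, fractions=FRACTIONS_PER_SAMPLE,
             spectra_per_run=SPECTRA_PER_RUN):
    """Return (runs, minimum spectra, maximum spectra)."""
    runs = samples * fractions
    low, high = spectra_per_run
    return runs, runs * low, runs * high


def survey_cost_difference(samples=SAMPLES):
    """Extra contract cost of a general survey over the limited analysis."""
    return samples * (COST_GENERAL_SURVEY - COST_LIMITED_ANALYSIS)


runs, low, high = workload()
print(f"{runs:,} runs, {low:,} to {high:,} spectra")
print(f"${survey_cost_difference():,} saved by the limited analysis")
```

On these figures the task amounts to 20,000 runs holding between 10 and 20 million spectra, which is the scale that motivated an automated survey.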
Implicit in this data evaluation task was
the development of a computer system
that might evaluate the data in a manner
comparable to a human using computer-
aided spectrum extraction and spectra
matching to tentatively identify all sample
components. An additional goal was the
identification of those spectra which did
not match any spectrum in the reference
library yet were seen in multiple GC/MS
runs. Thus, a library of compounds tenta-
tively identified in each industrial category,
as well as a library of recurring but uniden-
tified spectra, were to be generated for use
in effluent regulation. Also, the data in
both libraries were to be studied in a
subsequent project, which will reanalyze
the saved extracts. Tentative identifica-
tions made in that project could be con-
firmed by comparison with standards, and
recurring but unidentified spectra could
be examined for ab initio determination of
compound identity.
System Description
The PDP 11/70-based GC/MS Data
Survey System consisted of computer
hardware and software programmed to
accomplish the following functions:
1. Inventory all incoming magnetic
tapes and sample extracts.
2. Copy the data on each magnetic
tape to a second tape in an internal
use format and plot the recon-
structed gas chromatogram.
3. Retrieve data as necessary from
tapes in batch mode.
4. Extract the spectra of components
in each GC/MS run from the back-
ground spectra in the run.
5. Match the extracted spectra with a
library of reference spectra.
6. Check if matched spectra have been
seen before under the same circumstances.
7. Check spectra that are not matched
against their fellow unmatched
spectra.
8. Generate reports on the numbers
of matched spectra by industry,
fraction type, analytical laboratory,
GC/MS run conditions, etc.
9. Provide graphics capability neces-
sary to view the data from any run.
10. Search any run for specific com-
pounds.
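Function 7 above, the comparison of unmatched spectra with one another, might be sketched as follows. The summary does not say how the comparison was made, so the cosine similarity measure, the 0.9 threshold, and the greedy grouping are all assumptions chosen for illustration, and the example spectra are invented.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two spectra given as {mass: intensity}."""
    dot = sum(inten * b.get(mass, 0.0) for mass, inten in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_recurring(unmatched, threshold=0.9):
    """Greedily group unmatched spectra; return run ids of recurring ones.

    unmatched: list of (run_id, spectrum) pairs.
    """
    groups = []  # list of (representative spectrum, [run ids])
    for run_id, spectrum in unmatched:
        for representative, members in groups:
            if cosine_similarity(representative, spectrum) >= threshold:
                members.append(run_id)
                break
        else:
            groups.append((spectrum, [run_id]))
    # Only spectra seen in more than one run count as recurring unknowns.
    return [members for _, members in groups if len(members) > 1]


# Invented spectra: runs 1 and 2 share an unknown, run 3 does not.
unmatched = [
    ("run1", {57: 100, 71: 60, 85: 40}),
    ("run2", {57: 100, 71: 58, 85: 42}),
    ("run3", {91: 100, 65: 20}),
]
print(group_recurring(unmatched))
```

Groups that recur across many runs would be the natural first candidates for ab initio identification.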
Inventory System
To inventory and track the 20,000 GC/
MS data runs and the estimated 12,000
extracts (a B/N, ACI, and pesticide fraction
for each of 4000 samples), a database
management system was implemented.
This system was the INFORM management
program, a well-known tool for database
management. In INFORM were kept
the GC/MS data run descriptors that allowed
physical location of each run and corres-
ponding extract and all available informa-
tion about the sample.
As each magnetic tape or extract was
received at the Athens Laboratory, it was
manually entered into the INFORM system.
Important parameters entered for each
data run were the tape on which it was
found, the EPA sample number, an Athens
Laboratory run number, the fraction type,
and various GC/MS parameters. The cor-
responding data for the extract included all
of this information and the precise location
of the extract in a freezer.
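The descriptors listed above suggest a record layout along the following lines. This is a hypothetical sketch in Python, not the INFORM schema; the field names are assumptions, and only the listed parameters come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class RunDescriptor:
    """Descriptors entered for each GC/MS data run (field names assumed)."""
    tape_id: str
    epa_sample_number: str
    athens_run_number: int
    fraction_type: str                       # e.g. "VOA", "B/N", "ACI", "DAI"
    gcms_parameters: dict = field(default_factory=dict)


@dataclass
class ExtractDescriptor(RunDescriptor):
    """Same information as the run, plus where the extract is stored."""
    freezer_location: str = ""


run = RunDescriptor("T-001", "S-1234", 1795, "ACI", {"column": "example"})
extract = ExtractDescriptor("T-001", "S-1234", 1795, "ACI",
                            freezer_location="freezer 2, shelf 3")
print(run.athens_run_number, extract.freezer_location)
```

Keying every record on the Athens Laboratory run number matches the text's point that processing copies carried only that number.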
During the inventory process, data re-
ceived from contractor laboratory tapes
were copied onto an Athens Laboratory
tape in a format that was both more space-
efficient and damage resistant. Thus, the
original tape and a backup copy were
saved. The backup copy, which had only
the Athens Laboratory number for identifi-
cation of each run, was used for all data
processing needs. Confidentiality of the
data was maintained through the use of
the backup copy so that descriptive data
were not associated with the GC/MS data.
Software that had access to both descrip-
tive and GC/MS data was password pro-
tected.
At the time of tape conversion from the
contractor's format to the Athens Labora-
tory format, the data of each run were
scanned and a reconstructed gas chroma-
togram (RGC) plotted. The RGCs were
then bound in volumes to serve as refer-
ences at the time runs were submitted for
analysis. Inspection of an RGC by a chemist
might result in the discarding of the corresponding
run because of obvious flaws
such as absence of peaks or premature
end of data.
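A reconstructed gas chromatogram can be thought of as the summed ion intensity of each scan plotted against scan number. The sketch below assumes each scan is stored as a {mass: intensity} mapping. In the system described, a chemist judged the plots by eye; the automated flaw check shown here, with its median baseline and ratio threshold, is a hypothetical stand-in for that judgment.

```python
def reconstructed_chromatogram(scans):
    """Sum all ion intensities in each scan (one point per scan)."""
    return [sum(scan.values()) for scan in scans]


def looks_flawed(rgc, expected_scans, min_peak_ratio=3.0):
    """Flag a run that ended early or shows no peak above the baseline."""
    if not rgc or len(rgc) < expected_scans:
        return True                           # premature end of data
    baseline = sorted(rgc)[len(rgc) // 2]     # median as a crude baseline
    if baseline <= 0:
        return max(rgc) <= 0
    return max(rgc) / baseline < min_peak_ratio


scans = [{40: 10, 41: 5}, {40: 12, 41: 6}, {40: 200, 41: 90}, {40: 11, 41: 5}]
rgc = reconstructed_chromatogram(scans)
print(rgc, looks_flawed(rgc, expected_scans=4))
```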
When data were to be processed, a
chemist identified the runs that passed
visual inspection for processing. Software
then retrieved the designated runs from
the magnetic tape and prepared each run
in turn for processing by the analytical
system. The inventory system was reap-
plied when the run had been processed
and descriptors contained in INFORM
were necessary for reporting.
Analytical System
The analytical system consisted of four
main parts: the internal standard locator,
PEAK; the peak or spectrum extractor,
CLEANUP; the spectrum matching system,
PBM; and the result collator. Chemists
had opportunities at various points during
the process to make decisions that could
end processing or affect further processing
of any given component spectrum. Ideally,
data analysis proceeded with minimal
operator intervention. Only when the an-
alytical system was presented with deci-
sions that it was unqualified to make did
the chemist intervene.
The program PEAK was developed to
assure identification of internal standard
location in each run. Because all subsequent
processing of the data required knowledge
of the internal standard in the run, it was
imperative that software be available that
would unambiguously define the location
and area of the internal standard peak in
each data run.
CLEANUP is a system of programs
developed at Stanford University that finds
and extracts the spectra of components in
a GC/MS data run. Successive 16-scan
windows are searched for ion peaks that
have 2 ascending points, a maximum, and
2 descending points. When an ion peak is
found, successive ions from mass 40 to
400 are checked to see whether any
maximize within a distance of ± 1 scan
number of the first found peak. When 8 or
more such masses maximize simultane-
ously, a component peak is said to be
detected. In this case, all the masses
maximizing at this point are collected,
their areas are normalized to the largest
mass of the group, and they are passed
along to the next phase of the analysis as a
mass spectrum.
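The detection rule just described can be sketched in a few lines. This is not the Stanford CLEANUP code: the data layout ({mass: intensities per scan}), the five-point sum used as a crude peak area, and the function names are assumptions, and the scan windowing is omitted. Only the peak shape (two ascending points, a maximum, two descending points), the mass range 40 to 400, the one-scan tolerance, the eight-mass minimum, and the normalization to the largest mass come from the text.

```python
def ion_peak_scans(profile):
    """Scans where one ion shows 2 rising points, a maximum, 2 falling points."""
    return [i for i in range(2, len(profile) - 2)
            if profile[i - 2] < profile[i - 1] < profile[i]
            > profile[i + 1] > profile[i + 2]]


def detect_components(profiles, min_masses=8, tolerance=1, mass_range=(40, 400)):
    """profiles: {mass: [intensity per scan]}. Returns [(scan, {mass: rel. area})]."""
    low, high = mass_range
    maxima = {m: ion_peak_scans(p) for m, p in profiles.items() if low <= m <= high}
    candidates = sorted({s for scans in maxima.values() for s in scans})
    components, last_scan = [], None
    for scan in candidates:
        if last_scan is not None and scan - last_scan <= tolerance:
            continue  # already reported with the previous component
        areas = {}
        for mass, scans in maxima.items():
            near = [s for s in scans if abs(s - scan) <= tolerance]
            if near:
                apex = near[0]
                # Crude area: sum of the five points that define the peak.
                areas[mass] = sum(profiles[mass][apex - 2:apex + 3])
        if len(areas) >= min_masses:
            largest = max(areas.values())
            spectrum = {m: 100.0 * a / largest for m, a in areas.items()}
            components.append((scan, spectrum))
            last_scan = scan
    return components


# Synthetic run: eight masses (40 to 47) that all maximize at scan 5.
shape = [0, 0, 0, 1, 5, 10, 5, 1, 0, 0, 0]
profiles = {40 + k: [x * (k + 1) for x in shape] for k in range(8)}
components = detect_components(profiles)
print(components[0][0], components[0][1][47])
```

With only seven co-maximizing masses the same call returns an empty list, which is how the eight-mass rule screens out weak or spurious peaks.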
CLEANUP involves a number of checks
to insure that such artifacts as column
bleed, noise spikes, and background are
not chosen as sample components. Criteria
are input at the start of processing to
insure that only ion peaks of a defined
sharpness will be considered. This pro-
cedure will normally eliminate peaks
caused by column bleed, which usually
shows up in the form of broad peaks.
Noise spikes, which are generally of only
one- or two-scan duration, are guarded
against by requiring a minimum of four
scans in the ion peaks. Instrumental back-
ground noise caused by pump oil or other
contaminants normally does not peak
during a run; therefore, it does not interfere
with the CLEANUP process.
The spectra extracted by CLEANUP were
passed to PBM, a library matching program
developed at Cornell University under an
EPA grant. PBM, or probability based
matching, employs a reverse search tech-
nique to compare a reference library of
condensed spectra to a similarly condensed
unknown spectrum.
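The idea of a reverse search can be illustrated with a toy example: each reference spectrum is tested against the unknown, and peaks present only in the unknown are ignored. This sketch is not PBM and does not compute its probability-based K values; the scoring rule, the 0.5 intensity tolerance, and the spectra for the placeholder compounds A, B, and C are all invented for illustration.

```python
def reverse_match(reference, unknown, tolerance=0.5):
    """Fraction of reference peaks found in the unknown at adequate intensity.

    Peaks that occur only in the unknown are ignored, so one component
    can still be recognized inside a mixed spectrum.
    """
    found = 0
    for mass, ref_intensity in reference.items():
        if unknown.get(mass, 0.0) >= tolerance * ref_intensity:
            found += 1
    return found / len(reference)


def search_library(library, unknown, min_score=1.0):
    """Names of reference compounds whose peaks are all present in the unknown."""
    return sorted(name for name, reference in library.items()
                  if reverse_match(reference, unknown) >= min_score)


# Invented condensed spectra for three placeholder compounds.
library = {
    "A": {75: 100, 39: 40, 110: 30},
    "B": {97: 100, 83: 90, 61: 50},
    "C": {91: 100, 92: 60},
}
# An unresolved mixture of A and B.
mixed = {75: 100, 39: 45, 110: 28, 97: 80, 83: 75, 61: 45}
print(search_library(library, mixed))
```

A forward search, which penalizes every unknown peak missing from the reference, would score both A and B poorly against this mixture.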
Reporting
Reporting is accomplished in two ways.
The first system is a series of hard copy
outputs that describe the flow of data
through the total system and the results
generated from the data. The contents of
the historical library can be printed out
either in totality or as a listing of unique
entries. The data can be sorted by param-
eters such as CAS number, RRT, GC
column, analysis laboratory, industrial
category, relative concentration, etc.
A second method for reporting was a
graphics system that allowed the chemist
to recall data and plot it in various ways.
For instance, the raw data for a spectrum,
the cleaned up spectrum, and the reference
spectrum can all be plotted on the same
screen simultaneously. The extracted ion
current profile (EICP) for any ion can be
plotted between any scan limits. Multiple
EICP plots can be displayed on the screen.
The graphics system is used by chemists
to evaluate ambiguous results from the
computer analysis.
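An extracted ion current profile is simply the intensity of one ion followed across a range of scans. A minimal sketch, assuming each scan is stored as a {mass: intensity} mapping; the function name and the example intensities are hypothetical.

```python
def eicp(scans, mass, first_scan=0, last_scan=None):
    """Intensity of one ion in each scan between the scan limits (inclusive)."""
    if last_scan is None:
        last_scan = len(scans) - 1
    return [scans[i].get(mass, 0.0) for i in range(first_scan, last_scan + 1)]


# Invented intensities for two ions that peak in neighboring scans.
scans = [{139: 5, 122: 1}, {139: 80}, {139: 20, 122: 60}, {122: 90}]
print(eicp(scans, 139))
print(eicp(scans, 122, first_scan=1, last_scan=2))
```

Plotting several such profiles together is what lets a chemist see that two ions maximize in different scans.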
Extraction Results
The extraction of information-containing
spectra from the mass of spectra in a
GC/MS run is the key to a successful
automated system for GC/MS data analysis.
Figure 1 shows an RGC of a group of 11
phenols. Scans 213 and 215 can be
seen to be on opposing sides of an apparent
single component peak. Manual subtraction
of a baseline spectrum (e.g., 208 or
220) from spectrum 214 results in a
spectrum that is not recognizable as any of
the components injected. The use of
CLEANUP to find spectra, however, reveals
that the peak is actually the sum of two
components. Figure 2 shows the resultant
spectra of 213 and 215. Also depicted are
spectra from the reference library that
establish the identity of the two components.
Although this example represents an ideal
case in which standards were used with
no interferences, it does serve to illustrate
the ability of CLEANUP to separate components
eluting within two scans of each
other.
Figure 1. RGC of 11 component phenol standards. Arrow indicates apparent single
component peak that is actually the sum of two components. (Axes: scans 50 to 500; time 0.05 to 25.00 min.)
The data presented thus far indicate that
for the systems studied, automated tech-
niques are at least the equal of manual
techniques for pointing out components
in the run and identifying them by spectrum
matching with a reference library. Spectrum
extraction and identification are not always
so clear-cut. As shown in Table 1, despite
the fact that more peaks are found with the
automated method, the ratio of identifications
to peaks has decreased. In fact, as the
number of components in a run increases,
identification becomes more and more
difficult even though the automated system
apparently is able to deliver a spectrum for
each component.
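The identification ratios in Table 1 can be reproduced from the peak and identification counts. The sketch below assumes that the automated ratio counts the parenthesized additional identifications along with the shared ones. The summary does not state this, but the arithmetic agrees with the tabulated values.

```python
# Counts from Table 1: (peaks, IDs) for manual extraction and
# (peaks, IDs, additional IDs) for CLEANUP-PBM, one tuple per run.
manual = [(30, 18), (18, 7), (31, 11), (26, 10)]
automated = [(46, 18, 6), (44, 7, 4), (41, 11, 2), (43, 10, 2)]

manual_ratio = [round(ids / peaks, 2) for peaks, ids in manual]
automated_ratio = [round((ids + extra) / peaks, 2)
                   for peaks, ids, extra in automated]

print(manual_ratio)
print(automated_ratio)
```

In every run the automated method finds more peaks and at least as many identifications, yet its ratio is lower because the extra peaks outnumber the extra identifications.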
Spectrum-Matching Results
The spectrum-matching portion of the
data analysis system has undergone the
least modification. PBM has been evaluated
in the literature and has been in use at the
EPA Athens Laboratory for several years.
Selection of a database of reference
spectra for use with PBM involved no
small problem. Three databases were available:
the Wiley collection, the National
Bureau of Standards (NBS) collection, and the
EPA master collection. The Wiley library
contains 30,476 spectra of 30,476 compounds;
the NBS library, 25,025 spectra
of 25,025 compounds; and the EPA
master list, approximately 40,000 spectra of
approximately 32,000
compounds. The EPA list is the master list
of spectra from which the NBS library was
taken.
Because the GC/MS data used in this
study came from a great variety of sources,
it was thought that "duplicate" spectra, i.e.,
multiple spectra of the same compound
that differ slightly due to run conditions, in
the database would be of some help in the
matching process.
Table 2 compares the matching ability
of two databases on the same spectra. As
can be seen, Database II (the EPA master
database) enjoys a distinct advantage over
Database I (the NBS library) for the cases
mentioned in the table. Comparison of the
matches suggested by the two databases
with the manual identification shows the
superior ability of Database II. Data gen-
erated using the Wiley library showed
similar shortcomings to the NBS library.
In cases where the identical matched
spectrum occurred in all the libraries (as
was generally the case), no problems were
observed. In some cases, when spectrum
extraction or run conditions slightly affected
the spectrum, a match occurred only if
there were duplicate spectra. Thus, the
EPA master library was chosen for use in
this work.

Figure 2. Resultant spectra from scans 213 and 215 of Figure 1 compared to matching library
spectra. (Panels, file 1795.CLN: spectrum 213, base peak 139, with the library spectrum of
ortho-nitrophenol, CAS # 88755, base peak 139; spectrum 215, base peak 122, with the library
spectrum of 2,4-dimethyl phenol, CAS # 105679, base peak 122. Axes: RI against M/Z, 50 to 150.)

Table 1. Comparison of Automated vs. Manual Peak Extraction (Values in Parentheses are
Additional Components Identified by the Automated Method)

         Manual                        CLEANUP-PBM
Peaks    ID's    ID/Peaks      Peaks    ID's     ID/Peaks
30       18      0.60          46       18(6)    0.52
18        7      0.39          44        7(4)    0.25
31       11      0.35          41       11(2)    0.32
26       10      0.38          43       10(2)    0.28

Table 2. Comparison of PBM Matching Ability for the NBS Library (Database I) and the EPA Master Database (Database II)

Database I                          Database II                  Manual ID
(K value, missing masses)           (K value, missing masses)
Toluene (75+)                       Toluene (75+)                Toluene
7-oxabicyclo[2.2.1]heptane (49+)    2-cyclohexene-1-ol (76+)     2-cyclohexene-1-ol
Phthalide (56, -2)                  Methyl benzoate (69+)        Methyl benzoate
Hexacosanoic acid (102, -3)         Octadecanoic acid (105+)     Octadecanoic acid

(+ means that the molecular ion was matched within the proper intensity tolerance).

The reverse search approach of PBM is
also useful for analyzing environmental
samples with an automated system. Figure
3 is an example of a mixed spectrum
obtained by CLEANUP from a VOA standard
run. Although the two compounds
(cis-1,3-dichloropropene and 1,1,2-trichloroethane)
are not resolved, PBM was
able to match both compounds in this
spectrum. This ability of a reverse search
to recognize components of mixed spectra
is clearly an advantage.
Collator Results
The combination of MS data and retention
indices provides a powerful tool in the
automated extraction and identification of
components in GC/MS data. Several other
automated systems described in the literature
rely on retention data as well as MS
data for computer-aided identification of a
known set of compounds. As Table 3 shows, although the
match quality parameters (K) vary from 45 to
100, the RRT variation is only 0.01 to
0.03. RRTs are particularly important in
the identification of compounds such as
alkanes or alcohols, which have little or no
molecular ion and exhibit highly similar
spectra.
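The way retention data can corroborate a spectral match might be sketched as follows. The summary does not define RRT; the sketch assumes it is the component retention time divided by that of the internal standard. The 0.03 tolerance echoes the largest RRT range reported in Table 3, while the record layout, function names, and RRT values are hypothetical.

```python
def relative_retention_time(component_rt, internal_standard_rt):
    """Retention time relative to the internal standard (assumed definition)."""
    return component_rt / internal_standard_rt


def corroborated(candidate, history, rrt_tolerance=0.03):
    """True if the historical library holds the same compound at a similar RRT."""
    return any(entry["cas"] == candidate["cas"]
               and abs(entry["rrt"] - candidate["rrt"]) <= rrt_tolerance
               for entry in history)


# One earlier sighting of ortho-nitrophenol (CAS 88755); the RRT is invented.
history = [{"cas": "88755", "rrt": 0.50}]

print(corroborated({"cas": "88755", "rrt": 0.52}, history))   # close RRT
print(corroborated({"cas": "88755", "rrt": 0.60}, history))   # RRT too far off
```

A check of this kind is most useful for compounds with near-identical spectra, where the K value alone cannot separate the candidates.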
Summary
Automated survey techniques for pro-
cessing GC/MS data appear to work well
on most of the data encountered in this
study. The sensitivity of the CLEANUP-
PBM package is not as great as that of a
reverse search for specific ions, but it is
adequate for the tentative identification of
the major components in a sample. CLEANUP-PBM
is cost effective when compared
to the procedure in which an operator
finds peaks, subtracts background, then
matches the spectrum manually or by
using computer search.

Figure 3. A. Mixed spectrum extracted by CLEANUP from a VOA standard run.
B and C. Library spectra are indicated as matched by PBM. (Mixed spectrum: file 5943.CLN,
spectrum # 339, base peak 75. Library spectra: 1,1,2-trichloroethane (vinyl trichloride),
CAS # 79005, base peak 97; cis-1,3-dichloropropene, CAS # 10061015, base peak 75.
Axes: RI against M/Z, 40 to 140.)

Table 3. Comparison of RRT vs. K Variation for Selected Matched Compounds

Compound            Range of RRT    Range of K
Dioctylphthalate    0.03            45-100
Phthalide           0.01            57-77
Toluic acid         0.03            48-85

Use of a historical library for cataloguing
data collected over long periods of time
aids identification immensely by adding
the dimension of GC retention data. Confidence
in a tentative identification is
heightened if corroborating GC retention
data are available. This combination of
spectral and retention data can effectively
catalogue and highlight recurring unidenti-
fied substances for future study.
Study continues in two areas of the
CLEANUP-PBM package. First, the proper
compensation of background by CLEANUP
is of concern because errors in intensity
calculations reduce the chance that PBM
will find a match for the spectrum. Studies
underway indicate that background com-
pensation similar to that used in PEAK
would be effective in CLEANUP. Imple-
menting such a routine is under study.
Second, various parameters must be set
for CLEANUP, and these are an area of
concern. Parameters that pertain to peak
shape and minimum area are inflexible
during a given data run. Thus, changing
chromatographic conditions that affect
peak size and shape may cause CLEANUP
to miss pertinent data. Automatic setting
of these parameters during data processing
is under study.
The EPA authors, W. M. Shackelford and D. M. Cline, are with Environmental
Research Laboratory, Athens, GA 30613. F. O. Burchfield, L. Faas, G. Kurth, and
A. D. Sauter are with Computer Science Corporation, Falls Church, VA 22046.
W. M. Shackelford is the EPA Project Officer (see below).
The complete report, entitled "A Computer Survey of GC/MS Data Acquired in
EPA's Priority Pollutant Screening Analysis: System and Results," (Order No.
PB 83-220 111; Cost: $17.50, subject to change) will be available only from:
National Technical Information Service
5285 Port Royal Road
Springfield, VA 22161
Telephone: 703-487-4650
The EPA Project Officer can be contacted at:
Environmental Research Laboratory
U.S. Environmental Protection Agency
Athens, GA 30613