EPA
Environmental Monitoring
Systems Laboratory
P.O. Box 15027
Las Vegas NV 89114-5027
March 1984
Research and Development
Feasibility of Using Infrared
Spectroscopy and Pattern
Classification for Screening
Organic Pollutants in
Waste Samples
-------
600R84110
FEASIBILITY OF USING INFRARED SPECTROSCOPY AND PATTERN
CLASSIFICATION FOR SCREENING ORGANIC POLLUTANTS
IN WASTE SAMPLES
by
Donald E. Leyden
Department of Chemistry
Colorado State University
Fort Collins, CO 80523
Interagency Agreement No. DW 930078-01-1
Project Officer
Chas Fitzsimmons
Advanced Monitoring Systems Division
Environmental Monitoring Systems Laboratory
Environmental Protection Agency
Las Vegas, NV 89114
This study was conducted in cooperation with
National Aeronautics and Space Administration
Langley Research Center
Hampton, VA 23665
ENVIRONMENTAL MONITORING SYSTEMS LABORATORY
OFFICE OF RESEARCH AND DEVELOPMENT
U.S. ENVIRONMENTAL PROTECTION AGENCY
LAS VEGAS, NV 89114
-------
NOTICE
The information in this document has been funded wholly or in part by the
United States Environmental Protection Agency under Interagency Agreement
Number DW 930078-01-1 to the National Aeronautics and Space Administration,
Langley Research Center. It has not been subjected to the Agency's peer and
administrative review, and therefore does not necessarily reflect the views of
the Agency. Mention of trade names or commercial products does not constitute
endorsement or recommendation for use.
-------
ABSTRACT
This project was undertaken to determine the feasibility of using pattern
classification techniques and infrared spectroscopy to screen hazardous waste
samples in the field. The technique would require a portable IR spectrometer
and a microcomputer to perform a binary pattern classification of the spectra.
The classification scheme requires "training" on a mainframe computer to
produce weighting vectors from infrared library spectra. The weighting vectors,
when applied to pattern vectors obtained from sample spectra, could classify
samples in the field as being likely or not likely to contain hazardous
substances as defined by the spectral library.
Preliminary tests of the scheme using 50 compounds from the U.S. Environ-
mental Protection Agency Priority Pollutant List are encouraging. The ability
of the simple, linear, binary pattern classification scheme to predict whether
a compound is in the class known as hazardous pollutants appears feasible.
This report was submitted in partial fulfillment of Interagency Agreement
No. DW89930548-01-1 by Colorado State University (CSU) under the sponsorship of
the EPA. CSU was a subcontractor to Martin-Marietta, the prime contractor for
this project. The contract was administered by the National Aeronautics and
Space Administration under the Agreement with EPA. This report covers a period
from March 7, 1983 to December 1, 1983 and work was completed as of December 1,
1983.
-------
CONTENTS
Abstract
Acknowledgment
1. Introduction
2. Conclusions
3. Infrared Spectroscopy
4. Experimental
5. Results
6. Suggested Further Research
References
Appendices
Appendix A - Binary Pattern Recognition Code
Appendix B - Output run one
Appendix C - Output run two
Appendix D - Output run three
-------
ACKNOWLEDGMENT
The assistance of Mr. Jeff Cornell, a student at Colorado State University,
and the assistance of Mr. John Coates and Mr. Dennis Schaff, Perkin-Elmer
Corporation, are acknowledged.
-------
SECTION 1
INTRODUCTION
This feasibility study was part of a larger project jointly funded by EPA
and NASA under an interagency agreement entitled Electronic Methods for In-Situ
Monitoring of Hazardous Wastes. Two approaches were under investigation, x-ray
fluorescence spectroscopy and infrared spectroscopy. Martin-Marietta, Denver
Division, was the prime contractor (to NASA) and was responsible for both
efforts. The infrared feasibility study was subcontracted to Colorado State
University and accounted for only 5% of the total project budget, the major effort
being the development of x-ray fluorescence spectrometry as a viable field
screening technique for hazardous wastes.
The goal of this project was to perform a feasibility study to determine
whether it is possible to screen environmental samples, especially industrial
wastes and sludges in the field, and thus to determine if hazardous pollutants
are likely present. The proposed instrumental technique is infrared spectroscopy,
most likely some form of Fourier transform infrared spectroscopy. The proposed
decision making technique is pattern recognition or pattern classification.
-------
SECTION 2
CONCLUSIONS
By using a limited data set of infrared spectra and limited time, it has
been determined that the ability of a simple, linear, binary, pattern clas-
sification scheme to predict whether a compound is in the class known as
hazardous pollutants appears feasible.
This study also has shown that preliminary investigations using infrared
spectra and pattern classification schemes can be conducted on a microcomputer.
-------
SECTION 3
INFRARED SPECTROSCOPY
The coupling of infrared spectroscopy and pattern classification has precedents
in the literature (1,2). A brief introduction will be given for each.
INFRARED SPECTROSCOPY
Organic molecules contain a variety of forms of energy. One of these
is that manifested as vibration of the chemical bonds. The absorption of
electromagnetic radiation in the region known as infrared (2.5-15 micrometer
wavelengths) can cause transitions in the level or state of these vibrations.
Scanning through this wavelength range results in a plot of absorption versus
wavelength, or an infrared spectrum, which is characteristic of the compound or
mixture of compounds in a sample. Fourier transform infrared spectroscopy is
an instrumental and mathematical method of collecting many such scans in a
short period of time, thus improving the signal-to-noise ratio. The signal-to-
noise ratio increases proportionally to the square root of the number of repet-
itive scans. Thus, for example, by scanning 100 times, improvement by a factor
of 10 is usually realized experimentally. As a result of Fourier transform
techniques, it is reasonable to expect to obtain a spectrum from less than
microgram quantities of many types of organic molecules. Thus, infrared spec-
troscopy has found use in environmental analyses (3). It is expected that in
many types of matrices, a few hundred parts per billion of several molecular
types can be detected, but not quantitatively determined. The detection limit
will depend upon the type of infrared chromophore (color producing group) in
the molecular structure.
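
The square-root relationship between the number of co-added scans and the signal-to-noise ratio can be checked with a few lines of arithmetic. The sketch below is illustrative only; the function names are hypothetical and the calculation assumes ideal, uncorrelated noise.

```python
import math


def snr_gain(n_scans):
    """Ideal signal-to-noise improvement from co-adding n_scans scans."""
    if n_scans < 1:
        raise ValueError("n_scans must be at least 1")
    return math.sqrt(n_scans)


def scans_needed(target_gain):
    """Smallest number of scans whose square root reaches target_gain."""
    return math.ceil(target_gain ** 2)


print(snr_gain(100))     # 100 scans give a tenfold improvement
print(scans_needed(10))  # a tenfold improvement needs 100 scans
```

Because the gain grows only as the square root, doubling the improvement requires four times as many scans.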
PATTERN CLASSIFICATION
The availability of high-speed computers for processing large amounts of
data has led to the consideration of volumes of data that were previously
impractical to treat. One outcome of this ability has been the use of pattern
recognition or pattern classification techniques in chemistry. According to
Jurs and Isenhour (4), pattern recognition "includes the detection, perception,
and recognition of regularities (invariant properties) among sets of measure-
ments describing objects or events." Pattern recognition is normally used by
chemists and others to classify a set of experimental data as a member of a
class. This technique has been applied to many types of problems.
A basic pattern recognition system usually contains the units shown in
Figure 1 (4, p.3).
Figure 1. Block diagram of a basic pattern recognition system.
-------
The transducer converts information from the laboratory format into the pattern
space of the pattern recognition system. Often, this entails no more than
converting the raw data into a suitable computer format. The preprocessor
accepts the data and converts it into a form which is dealt with more easily by
the classifier. The classifier treats the data by some algorithm to produce a
classification decision. The classifier may be based on various branches of
applied mathematics, statistical decision theory, information theory, or geom-
etric theory. There exists a variety of pattern classification systems includ-
ing those for multicategory classification. However, in this report only a
binary classification system is considered and discussed.
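
The three units described above can be pictured as three functions applied in sequence. The following is only a sketch under stated assumptions: the function names, the string input format, and the scaling step chosen for the preprocessor are all hypothetical and are not taken from the program used in this study.

```python
def transducer(raw_readings):
    """Convert laboratory-format readings (here, strings) into numbers."""
    return [float(r) for r in raw_readings]


def preprocessor(values):
    """Scale by the largest absolute value (an assumed preprocessing step)."""
    peak = max(abs(v) for v in values) or 1.0
    return [v / peak for v in values]


def classifier(pattern, weights):
    """Linear binary decision: +1 or -1 from the sign of the dot product."""
    s = sum(w * x for w, x in zip(weights, pattern))
    return 1 if s > 0 else -1


def recognize(raw_readings, weights):
    """Run the transducer, preprocessor, and classifier in order."""
    return classifier(preprocessor(transducer(raw_readings)), weights)


# Hypothetical readings and weights, for illustration only.
print(recognize(["0.2", "0.8", "0.4"], [1.0, -1.0, 0.5]))
```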
The object of this feasibility study is to determine whether the presence
of hazardous organic pollutants such as, but not limited to, those on the
Environmental Protection Agency Priority Pollutant List can be predicted from
an infrared spectrum of industrial waste samples. Thus, only a binary classi-
fier is required to determine whether or not the samples contain such compounds.
The hazardous pollutants often contain such organofunctional groups as C-Cl
bonds, phenolic groups, polyaromatic hydrocarbons (PAHs) and other structural
units represented in infrared spectra. Usually, determining even the likely
presence of such compounds requires extensive preanalytical separation for
successful detection by IR spectroscopy. A fast inexpensive method of sample
classification could be an effective cost-saving aid.
Chemical data such as infrared spectral information may be represented as
a d-dimensional pattern vector:
-------
$X = (x_1, x_2, \ldots, x_d)$    (1)
The components $x_j$ are observable quantities such as the wavelength of a peak in
an infrared spectrum of a compound. Alternatively, the spectral region may be
divided into subregions, and the $x_j$ values would then represent the intensity
of the absorption in each subregion. If there were 100 such subregions, each
spectrum would be a vector in 100-space, with one component for each of the
subregions of the infrared spectrum. If thousands of compounds are
considered, clearly a vast amount of data could result.
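
As a sketch of the subregion approach, the function below builds a pattern vector from a list of (wavenumber, absorbance) points. The equal-width subregions, the use of the maximum absorbance as the intensity of a subregion, and the sample values are assumptions made for illustration; the report does not specify them.

```python
def pattern_vector(spectrum, low, high, d):
    """Return d components, one per equal-width subregion of [low, high].

    spectrum is an iterable of (wavenumber, absorbance) pairs. Each component
    is the largest absorbance found in its subregion (an assumed definition).
    """
    width = (high - low) / d
    x = [0.0] * d
    for wavenumber, absorbance in spectrum:
        if not (low <= wavenumber <= high):
            continue  # ignore points outside the scanned range
        j = min(int((wavenumber - low) / width), d - 1)
        x[j] = max(x[j], absorbance)
    return x


# Hypothetical spectrum, for illustration only.
spectrum = [(700, 0.9), (1250, 0.4), (1300, 0.6), (2950, 0.3)]
print(pattern_vector(spectrum, 0, 4000, 8))
```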
For a binary classifier, the two classes of data should fall on either
side of a decision surface. For a simple two-space case, this amounts to
tracing a line (not necessarily a straight one) that runs between the two
classes of data. In hyperspace, the analog is a hypersurface that may or may
not be linear and separates the two classes of data. The case is simpler if a
linear hyperplane can be used, as it can be represented by its normal vector
from the origin. In such a case, the sign of the dot product of the normal vector W and
a pattern vector X defines on which side of the hyperplane a given pattern
point lies (4, p.11):
$W \cdot X = |W|\,|X| \cos\theta$    (2)
where $\theta$ is the angle between the two vectors. Since the normal vector is
perpendicular to the hyperplane, all patterns having dot products that are
positive lie on the same side of the plane as the normal vector, and all those
with negative dot products lie on the opposite side. Although decision
surfaces need not be linear, their simplicity is appealing.
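
A minimal sketch of the sign test in equation (2) follows. The function names and the two-dimensional vectors are hypothetical and serve only to show that the sign of the dot product identifies the side of the plane.

```python
import math


def dot(w, x):
    """Dot product of two equal-length vectors."""
    return sum(wi * xi for wi, xi in zip(w, x))


def side_of_hyperplane(w, x):
    """Return +1, -1, or 0 according to the sign of W . X."""
    s = dot(w, x)
    return (s > 0) - (s < 0)


def angle_between(w, x):
    """Angle theta in radians from W . X = |W| |X| cos(theta)."""
    norm = math.sqrt(dot(w, w)) * math.sqrt(dot(x, x))
    cos_theta = max(-1.0, min(1.0, dot(w, x) / norm))
    return math.acos(cos_theta)


w = [1.0, 1.0]  # hypothetical normal vector
print(side_of_hyperplane(w, [2.0, 0.0]))   # same side as the normal vector
print(side_of_hyperplane(w, [-1.0, 0.0]))  # opposite side
print(math.degrees(angle_between(w, [2.0, 0.0])))
```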
Often a concept called Threshold Logic Units (TLU) is used for linear,
binary classification. This method uses some function which generates one of
two results based on the input data. A decision is based upon whether the
result is greater or less than the threshold value. The result may be computed
by weighted components, wj, of the normal vector, W, applied to each term in
the data set
$W \cdot X = |W|\,|X| \cos\theta = w_1 x_1 + w_2 x_2 + \cdots + w_d x_d + w_{d+1}$    (3)
where $w_{d+1}$ is added to project the vector from the origin. The weight
components are determined by "training" the classifier with a set of data of known
classification. These data are known as the "training set" which is considered
by the classifier one set at a time. The weight vector components w
-------
be used with a single Read Only Memory (ROM) for the program and data.
Figure 2. Example of a two-space, linear, binary classification.
Two classes of data represented by x and o, respectively,
fall on either side of the decision plane represented by
the dashed line. An upper and lower threshold (TLU) are
represented by dotted lines. Data which fall between
the threshold limits are not classified.
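
The dead zone described in the caption of Figure 2 can be sketched as follows. This is an illustration under stated assumptions, not the routine in Appendix A: the function name, the use of None for an unclassified pattern, and the example weights are hypothetical.

```python
def tlu_classify(weights, pattern, threshold):
    """Classify one pattern with a symmetric dead zone about the plane.

    weights holds d+1 values; the last is the constant w_{d+1} term of
    equation (3). Returns +1, -1, or None when the dot product falls
    between the two threshold limits.
    """
    *w, w_extra = weights
    s = sum(wj * xj for wj, xj in zip(w, pattern)) + w_extra
    if s > threshold:
        return 1
    if s < -threshold:
        return -1
    return None  # inside the dead zone: not classified


weights = [1.0, -2.0, 0.5]  # hypothetical values for a two-component pattern
print(tlu_classify(weights, [1, 0], 0.75))
print(tlu_classify(weights, [0, 1], 0.75))
print(tlu_classify(weights, [1, 0.5], 0.75))
```

Setting the threshold to zero removes the dead zone, so that every pattern with a nonzero dot product is assigned to a class.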
-------
SECTION 4
EXPERIMENTAL
This feasibility study was performed with limited resources. However, the
success that was obtained illustrates the possibility of using small computers
to apply pattern classification and infrared spectrometry to the screening of
hazardous waste samples in the field. Appendix A shows the listing
of a computer program for linear, binary pattern classification written in
Applesoft BASIC. This program was translated from the FORTRAN
program given in the appendix of the book by Jurs and Isenhour (4). The program
was executed on an Apple II+ (Apple Computer, Cupertino, CA) computer.
The original plan for this study was to use a computer data station from a
vendor of infrared instrumentation along with infrared data on diskette.
Several unfortunate events occurred. The liaison from the vendor failed for
several months to arrange the loan of a data station. Once obtained, no soft-
ware support or manuals were available. The form of the data on the diskettes
was found to be unsatisfactory for use in a classification program. Therefore,
as described below, an alternative was found. Although not considered to be
completely satisfactory, a meager amount of data was used, which shed some
light on the question at hand.
The spectra for 100 compounds were encoded for use in this study. Fifty
compounds were selected from the Environmental Protection Agency Priority
Pollutant List. An additional 50 compounds were selected which are not on
either the EPA Priority Pollutant List or in Appendix VIII, 40CFR261 (RCRA).
Data for the Priority Pollutants were derived from spectra published by Sadtler
(5) and data for the other compounds were derived from spectra published by the
Aldrich Chemical Co. (6). The infrared spectra of these compounds were divided
into eight regions (Table 1). The limits of each region are given in wave
numbers (cm-1), and the data entry is a one (1) if a peak is present in the
region and zero (0) if no peak is present in the region.
TABLE 1.

Spectral    Range of Wave       Spectral    Range of Wave
Region      Numbers (cm-1)      Region      Numbers (cm-1)
   1          200-500              5          2001-2500
   2          501-1000             6          2501-3000
   3          1001-1500            7          3001-3500
   4          1501-2000            8          3501-4000
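
The encoding defined by Table 1 can be sketched as below. The region limits are those of the table; the function name and the peak positions in the example are hypothetical and do not come from the Sadtler or Aldrich data.

```python
# Region limits in wave numbers (cm-1), copied from Table 1.
REGIONS = [(200, 500), (501, 1000), (1001, 1500), (1501, 2000),
           (2001, 2500), (2501, 3000), (3001, 3500), (3501, 4000)]


def encode_peaks(peak_positions):
    """Return eight 0/1 entries: 1 if any peak falls in the region."""
    bits = [0] * len(REGIONS)
    for peak in peak_positions:
        for k, (low, high) in enumerate(REGIONS):
            if low <= round(peak) <= high:
                bits[k] = 1
                break
    return bits


# Hypothetical peak list, for illustration only.
print(encode_peaks([750, 1090, 1475, 3050]))
```

Two peaks in the same region produce a single 1, so this encoding records only the presence of absorption in a region, not the number or intensity of the peaks.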
Each data set was assigned a class value (the desired sign of the "dot product")
of one (1) if the compound was a hazardous pollutant, or negative one (-1) if it
was not. A training
set was made up from 80 of the 100 compounds, 40 from the Priority Pollutant
List (hazardous), and 40 from the Aldrich library (nonhazardous). This left
the spectra of 20 compounds (10 from each classification) to be used as test
data. Although this is a meager and greatly simplified data set, the results
are encouraging. These data were analyzed using the program shown in Appendix
A, requiring approximately 45 seconds to execute on the Apple II+ computer.
-------
SECTION 5
RESULTS
The results of the use of the data described above executed in the
Applesoft program are shown in Appendix B. The first line indicates that 60
data sets are to be used in training, that there are eight data in each set,
and that the TLU has been set to 0.75 on each side of a linear surface. The
nine values listed under WEIGHT VECTOR are the weight vector components, one
for each of the eight data plus the $w_{d+1}$ component. There were 26 feedback
iterations to determine the weight vector components;
each was set initially at 0.1 in line 130 of the program. With a deadzone
(TLU) about the decision surface of 0.75, 13 of the 20 data sets were predicted
and 7 were not. Of the 13 predicted, 1 was predicted incorrectly. With a
deadzone (TLU) about the decision surface of 0, 20 of the 20 data sets were
predicted and 5 predicted incorrectly. Thus, with this simple set of data, 75
percent of the test set were correctly predicted with a training of 26 feed-
backs.
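
The percentages quoted above follow from the tallies printed in Appendix B. The sketch below repeats the arithmetic of lines 2270 to 2290 of the program, truncating to a whole number as the INT function does; the Python function name is hypothetical.

```python
def percent_correct(n_incorrect, n_predicted):
    """Percent of predicted data sets that were predicted correctly,
    truncated to a whole number as in the program output."""
    return int(100 - 100 * n_incorrect / n_predicted)


# Dead zone of 0.75: 13 of 20 predicted, 1 incorrect.
print(percent_correct(1, 13))
# Dead zone of 0: 20 of 20 predicted, 5 incorrect.
print(percent_correct(5, 20))
```

Note that the first percentage is taken over the 13 data sets that were predicted, not over all 20 members of the test set.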
Appendix C shows a modified run of the program in which 70 data sets were
used as the training set and a TLU of 1 was specified. The increased TLU
increased the magnitude of the weight vector components which has the effect of
spreading the vectors in hyperspace. All 10 compounds of the test set were
predicted when 500 feedbacks were allowed, but when the TLU was reduced to
zero, 3 were incorrectly predicted.
-------
Appendix D shows the results of a run in which 80 data sets were used to
train for the prediction of 20 data sets, all of which were known to fall into
one of two classes. All 20 compounds of the test set were predicted correctly
with 100 iterations and the TLU set at both 0.75 and zero.
Although this feasibility study was not as extensive as desired because of
a variety of problems including limited funding and delays in the loan of equip-
ment, some encouraging results were obtained. If an appropriate number of
training data were to be used, the execution time on a microcomputer would be
prohibitively long, but this study shows that preliminary work can be conducted
on such a computer. The majority of the computer time is spent in the training
session. Once the weight vectors are obtained, the prediction takes only a few
seconds to determine, as this is a direct, not an iterative computation.
Clearly, a small microcomputer such as those associated with modern spectrom-
eters can perform this computation. Most importantly, although the data set
used was small, the ability of the simple, linear, binary pattern classifica-
tion scheme to predict whether a compound is in the class known as hazardous
pollutants appears feasible.
-------
SECTION 6
SUGGESTED FURTHER RESEARCH
The results and conclusions of this feasibility study suggest the probable
success of further research. Provided that an infrared spectrometer containing
even the most basic microcomputer can be designed with sufficient sensitivity
and portability, a research plan to develop a system for the rapid, inexpensive,
and reliable screening of hazardous waste samples for as little as a few micro-
grams of organic pollutant is recommended. First, a large data file of infrared
spectra suitable for use in a pattern recognition scheme would be obtained on a
lease basis. The most obvious of these data bases is that from Sadtler. The
general pattern classification program "ARTHUR" would be obtained for
execution on a large mainframe computer. This program permits the use of a wide
variety of pattern classification techniques. Therefore, one would not be
restricted to the linear, binary classification used here. However, linear,
binary classification would be explored in detail first because of its mathe-
matical simplicity. The judicious use of asymmetric TLU's would be explored to
"bias" the decision to predict the presence of probable pollutants even when
they might not be present, if that were a desired result.
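
One way to picture an asymmetric TLU is to give the two sides of the decision surface different threshold widths. The sketch below is an assumption about how such a bias might be implemented, not a method tested in this study; the function name and the threshold values are hypothetical.

```python
def asymmetric_tlu(s, upper, lower):
    """Classify a dot product s with unequal dead-zone widths.

    upper is the width on the pollutant (+1) side and lower is the width
    on the other (-1) side. A small upper and a large lower width make it
    easier to flag a sample than to clear it.
    """
    if s > upper:
        return 1
    if s < -lower:
        return -1
    return None  # not classified


# Hypothetical widths: easy to flag, hard to clear.
print(asymmetric_tlu(0.5, upper=0.25, lower=1.0))
print(asymmetric_tlu(-0.5, upper=0.25, lower=1.0))
```

With a symmetric width of 0.75 neither of these two values would be classified; with the asymmetric widths the positive one is flagged and the negative one is still withheld.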
ARTHUR is a generalized pattern classification program available from
Infometrix, Seattle, Washington. It is planned to be made available in
a microcomputer version.
-------
These studies would be required on pure compound spectra first. Then,
computer-generated spectra of mixtures simulated by linear addition of the
spectra of pure compounds would be investigated. For example, the reliability
of the prediction when a trace of pollutant was mixed with a large amount of
some other compound would be tested. This would be a severe and critical test.
Fortunately, it can be performed using computer-generated data. Preparation of
laboratory mixtures would only be necessary to test the instrumentation.
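
The linear addition of pure-compound spectra can be sketched in a few lines. The function name, the four-point spectra, and the 1 percent fraction are hypothetical values chosen for illustration; the sketch assumes that absorbances add in proportion to the fraction of each component.

```python
def mix_spectra(spectra, fractions):
    """Simulate a mixture spectrum as a weighted sum of pure spectra."""
    if len(spectra) != len(fractions):
        raise ValueError("one fraction is needed for each spectrum")
    n_points = len(spectra[0])
    if any(len(s) != n_points for s in spectra):
        raise ValueError("all spectra must have the same number of points")
    return [sum(f * s[i] for s, f in zip(spectra, fractions))
            for i in range(n_points)]


# Hypothetical absorbance values at four wave numbers.
pollutant = [0.0, 0.8, 0.1, 0.0]
matrix = [0.5, 0.0, 0.4, 0.2]

# A trace (1 percent) of pollutant in a large amount of another compound.
mixture = mix_spectra([pollutant, matrix], [0.01, 0.99])
print(mixture)
```

In this example the strongest pollutant band contributes an absorbance of only 0.008 to the mixture, which illustrates why the trace case is a severe test of any classifier.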
It is estimated that this research could be conducted during one year at a
cost of approximately $70,000.
-------
REFERENCES
1. B. R. Kowalski, P. C. Jurs, T. L. Isenhour and C. N. Reilley, Anal. Chem.,
41, 1945 (1969).
2. R. W. Liddell, III, and P. C. Jurs, Appl. Spectrosc., 27, 371 (1973).
3. A. L. Smith, APPLIED INFRARED SPECTROSCOPY, FUNDAMENTALS, TECHNIQUES AND
ANALYTICAL PROBLEM-SOLVING, Wiley-Interscience, New York, NY, 1979.
4. P. C. Jurs and T. L. Isenhour, CHEMICAL APPLICATIONS OF PATTERN
RECOGNITION, Wiley-Interscience, New York, NY, 1975.
5. INFRARED SPECTRA HANDBOOK OF PRIORITY POLLUTANTS AND TOXIC CHEMICALS,
Sadtler Research Laboratories, Philadelphia, PA, 1982.
6. C. J. Pouchert, THE ALDRICH LIBRARY OF INFRARED SPECTRA, 2ND EDITION,
Aldrich Chemical Co., Inc., Milwaukee, WI, 1978.
-------
APPENDIX A
10 REM ********** BINARY PATTERN RECOGNITION **********
30 REM
40 REM
49 REM ************************************************
50 REM THIS PROGRAM WAS TRANSLATED FROM
51 REM A FORTRAN VERSION IN THE BOOK
52 REM "CHEMICAL APPLICATIONS OF PATTERN RECOGNITION"
53 REM P.C. JURS AND T.L. ISENHOUR
54 REM WILEY INTERSCIENCE, NY. 1975
55 REM TRANSLATION BY D.E. LEYDEN
56 REM DEPT. OF CHEMISTRY
57 REM COLORADO STATE UNIVERSITY
58 REM FORT COLLINS CO 80523
59 REM ************************************************
80 RESTORE
90 HOME
95 PRINT CHR$ (4);"PR#1"
98 PRINT CHR$ (9)"80N"
100 DIM D(5,100),W(6),L(100),ID(100),IC(100),NS(100),KP(20)
110 NT = 80
120 NP = 20
130 WI = .1
140 TS = .75
150 NO = NT + NP
160 NA = 1000
170 NU = 5
180 REM READ DATA SET
190 FOR I = 1 TO NO
200 READ L(I)
210 FOR J = 1 TO NU
220 READ D
-------
1000 REM SUBROUTINE TRAIN
1010 NC = 0
1020 PRINT "TRAINING ";NT,NU,TS
1030 NV = NU + 1
1040 NF = 0
1050 KK = 0
1060 KV = 0
1070 REM STARTING POINT OF MAIN LOOP & RETURN FROM LINE NUMBER 1520
1080 KZ = 0
1090 IF KV < = 0 GOTO 1120
1100 ND = KV
1110 GOTO 1170
1120 ND = NT
1130 FOR I = 1 TO NT
1140 NS(I) = ID(I)
1150 NEXT I
1160 REM THE NEXT LOOP CLASSIFIES THE ND MEMBERS OF THE CURRENT SUBSET
1170 FOR IR = 1 TO ND
1180 I = NS(IR)
1185 REM THE NEXT LOOP CALCULATES THE DOT PRODUCT
1190 S = W(NV)
1200 FOR J = 1 TO NU
1210 S = S + D(J,I) * W(J)
1220 NEXT J
1230 REM THE NEXT THREE STATEMENTS TEST FOR CORRECT ANSWER
1240 IF L(I) > 0 GOTO 1260
1250 IF (S + TS) < = 0 GOTO 1420
1255 GOTO 1290
1260 IF (S - TS) > 0 GOTO 1420
1265 REM 1270 OR 1290 CALCULATES THE CORRECTION INCREMENT
1270 C = 2 * (TS - S)
1280 GOTO 1300
1290 C = 2 * ( - TS - S)
1300 XX = 1.0
1310 FOR J = 1 TO NU
1320 XX = XX + D(J,I) ^ 2
1330 NEXT J
1340 C = C / XX
1350 REM THE NEXT LOOP PERFORMS THE FEEDBACK
1360 FOR J = 1 TO NU
1370 W(J) = W(J) + C * D(J,I)
1375 NEXT J
1330 W(NV) = W O GOTO 1550
1510 REM TEST WHETHER CURRENT SUBSET IS ENTIRE TRAINING SET
1520 IF (ND - NT) < > 0 GOTO 1080
1530 REM TEST FOR ZERO ERROR
1540 IF KV < > 0 GOTO 1080
1550 NC = 1
-------
1560 REM SUMMARY OUTPUT OF TRAINING ROUTINE
1570 FOR K = 1 TO KK
1580 PRINT INT (KP(K));
1590 NEXT K
1595 PRINT
1600 PRINT "WEIGHT VECTOR"
1610 FOR J = 1 TO NV
1620 PRINT W(J)
1630 NEXT J
1640 PRINT "FEEDBACKS ";NF
1650 RETURN
2000 REM SUBROUTINE PREDICTION
2010 L1 = 0
2020 L2 = 0
2030 KW = 0
2040 N1 = 0
2050 N2 = 0
2060 FOR II = 1 TO NP
2070 I = IC(II)
2090 S = W 0 GOTO 2150
2130 KW = KW + 1
2140 GOTO 2230
2150 IF L(I) > 0 GOTO 2200
2160 N2 = N2 + 1
2170 IF ( - S - TS) > 0 GOTO 2230
2180 L1 = L1 + 1
2190 GOTO 2230
2200 N1 = N1 + 1
2210 IF (S - TS) > 0 GOTO 2230
2220 L2 = L2 + 1
2230 NEXT II
2240 PRINT "PREDICTION WITH DEADZONE = ";TS
2250 LT = L1 + L2
2260 JW = N1 + N2
2270 PW = 100 - (100 * LT / JW)
2280 P1 = 100 - (100 * L1 / N2)
2290 P2 = 100 - (100 * L2 / N1)
2300 PRINT "NUMBER PREDICTED = ";JW
2310 PRINT "NUMBER NOT PREDICTED = ";KW
2320 PRINT "NUMBER PREDICTED INCORRECTLY = ";LT
2330 PRINT
2340 PRINT LT;"/";JW;" "; INT (PW)
2350 PRINT L1;"/";N2;" "; INT (P1)
2360 PRINT L2;"/";N1;" "; INT (P2)
2365 PRINT
2370 RETURN
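
For readers without access to an Applesoft interpreter, the feedback step of the training subroutine (lines 1190 through 1375) can be sketched in Python. This is a simplified illustration, not a translation of the full listing: it cycles through the whole training set on every pass instead of using the subset bookkeeping (KV, KZ, NS) of the original, and the update of the last weight by the increment C is an assumption inferred from the starting value of 1.0 for XX. The function name and the four-pattern data set are hypothetical.

```python
def train(patterns, labels, threshold=0.75, w_init=0.1, max_feedbacks=1000):
    """Error-correction training of a linear binary classifier.

    Returns (weights, feedbacks, converged). weights has d+1 entries; the
    last is the constant term added to the dot product.
    """
    d = len(patterns[0])
    w = [w_init] * (d + 1)
    feedbacks = 0
    while feedbacks < max_feedbacks:
        errors = 0
        for x, label in zip(patterns, labels):
            s = w[d] + sum(wj * xj for wj, xj in zip(w, x))
            if label > 0 and s - threshold > 0:
                continue  # correct, outside the dead zone
            if label <= 0 and s + threshold <= 0:
                continue  # correct, outside the dead zone
            target = threshold if label > 0 else -threshold
            # Correction increment, as in lines 1270 to 1340.
            c = 2 * (target - s) / (1.0 + sum(xj * xj for xj in x))
            for j in range(d):
                w[j] += c * x[j]
            w[d] += c  # assumed update of the constant term
            feedbacks += 1
            errors += 1
            if feedbacks >= max_feedbacks:
                break
        if errors == 0:
            return w, feedbacks, True
    return w, feedbacks, False


# Hypothetical, linearly separable training set.
patterns = [[1, 0], [1, 1], [0, 1], [0, 0]]
labels = [1, 1, -1, -1]
weights, n_feedbacks, converged = train(patterns, labels)
print(weights, n_feedbacks, converged)
```

Each feedback reflects the dot product of the misclassified pattern about the nearer threshold limit, so a pattern that was just inside the dead zone is moved just outside it.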
-------
APPENDIX B
TRAINING 60 8 .75
WEIGHT VECTOR
-2.32670578
1.93840788
-2.80600923
-2.90653334
1.72147704
-.949600764
1.82722324
1.7296907
.512025906
FEEDBACKS 26
PREDICTION WITH DEADZONE = .75
NUMBER PREDICTED = 13
NUMBER NOT PREDICTED = 7
NUMBER PREDICTED INCORRECTLY = 1
1/13 92
1/4 75
0/9 100
PREDICTION WITH DEADZONE = 0
NUMBER PREDICTED = 20
NUMBER NOT PREDICTED = 0
NUMBER PREDICTED INCORRECTLY = 5
5/20 75
5/8 37
0/12 100
-------
APPENDIX C
70 8
8 8
WEIGHT VECTOR
-14.1103004
50.9810323
-46.4651464
-48.769018
2.36107728
-7.01234953
5.69704767
12.8294714
-4.55528285
FEEDBACKS 500
PREDICTION WITH DEADZONE = 1
NUMBER PREDICTED = 10
NUMBER NOT PREDICTED = 0
NUMBER PREDICTED INCORRECTLY = 3
3/10 70
1/5 80
2/5 60
PREDICTION WITH DEADZONE = 0
NUMBER PREDICTED = 10
NUMBER NOT PREDICTED = 0
NUMBER PREDICTED INCORRECTLY = 3
3/10 70
1/5 80
2/5 60
-------
APPENDIX D
TRAINING 80 5 .75
8
3
1
O~>22104333332222100
WEIGHT VECTOR
.605293118
.959874479
.299292939
-.713884741
-.409&5779&
.0221140133
FEEDBACKS
PREDICTION WITH DEADZONE = .75
NUMBER PREDICTED = 20
NUMBER NOT PREDICTED = 0
NUMBER PREDICTED INCORRECTLY = 0
0/20 100
0/9 100
0/11 100
PREDICTION WITH DEADZONE = 0
NUMBER PREDICTED = 20
NUMBER NOT PREDICTED = 0
NUMBER PREDICTED INCORRECTLY = 0
0/20 100
0/9 100
0/11 100
-------