United States
Environmental Protection
Agency
Environmental Monitoring Syste
Laboratory
Research Triangle Park NC 2771
Research and Development
EPA-600/S4-80-048 August 1982
Project Summary
Application of Cluster
Analysis to Aerometric Data
Harold L. Crutcher, Raymond C. Rhodes, Maurice E. Graves, Beth Fairbairn, A.
Carl Nelson, and Michael Symons
The NORMIX data analysis program,
which incorporates cluster analysis
and multivariate statistical analysis
routines, was modified and revised for
use in a UNIVAC 1110 computer. The
revised program was tested on three
sample data sets and produced results
in agreement with those from the
original program. The NORMIX pro-
gram was then used to evaluate and
analyze eight sets of aerometric data
from various sources. Comparison of
the performance of NORMIX with two
other cluster analysis algorithms,
MIKCA and SAS CLUSTER, revealed
that all three programs produce simi-
lar results in terms of hierarchical clus-
tering, but NORMIX produces consid-
erably more statistical evalu ation and
information to the user. Thus NOR-
MIX is recommended as the most use-
ful cluster analysis program of these
three.
This Project Summary was developed
by EPA's Environmental Monitoring
Support Laboratory. Research Triangle
Park, NC. to announce key findings of
the research project that is fully
documented in a separate report of the
same title (see Project Report ordering
information at back).
Introduction
Pollutants in the environment pose an
enduring threat to all living organisms
and inanimate structures. It is continu-
ally necessary to assess this threat and
to take remedial action. For such
assessment, it is essential to monitor
environmental conditions in order to
provide information bases about the
possible threats and their variation over
time. Such comparisons permit reas-
sessment of the average conditions,
their expected variabilities, and any
significant changes in those average
conditions and variabilities.
To produce credible models and
assessments of atmospheric pollution
requires extensive, reliable aerometric
data. Production of valid data bases
requires adequate instrumentation and
maintenance, representative exposure,
competent personnel, excellent com-
munications, and sufficient quality
assurance and control systems in an
ongoing updated observational program.
To aid in the development of valid data
bases and to extend methods of data
analysis, five specific goals of this study
on aerometric data clustering were
defined:
1. Develop and document a validated
and calibrated digital computer
program for cluster analysis.
2. Extend the theory for clustering
data.
3. Validate the data obtained.
4. Classify the data.
5. Demonstrate the application of the
computer program to various
types of data.
As a result of this project, an exten-
sively developed computer program for
cluster analysis is available to all users
of the UNIVAC 1110 computer at the
National Computer Center at the Environ-
mental Protection Agency, Research
Triangle Park (NCC-EPA/RTP), North
Carolina.
-------
Several clustering techniques that
separate heterogeneous aerometric
data sets into homogeneous groups
were reviewed. It is sometimes difficult
to state categorically that a datum
belongs to a specified group or cluster,
hence assignment or classification to a
group of related data is made in a
probabilistic sense. This report series
illustrates the use of these clustering
algorithms to grouped data from a large
data base of observed aerometric and
meteorological information. These
grouped data can be interpreted by
researchers and presented to decision-
makers. For example, conditions accom-
panying pollutant episodes can be
identified, and oftentimes specific
pollutant sources can be identified.
Data from the Community Health Air
Monitoring Program (CHAMP) were
analyzed by the NORMIX clustering
algorithm1, as described in the three-
volume Project Report. Volume I presents
a detailed examination of the historical
development of the NORMIX algorithm
and its application to the CHAMP data
set, plus descriptions of cluster analyses
of data from the Los Angeles Catalyst
Study (LACS) by the SAS2 and MIKCA3
programs. Volume II contains the
modified NORMIX program with com-
plete documentation, and Volume III
contains more detailed examples and
discussion of the application of the
NORMIX algorithm to the CHAMP data.
Until the advent of clustering tech-
niques and their algorithms, analyses of
multivariate data were generally of the
multiple regression type, factor analysis,
or principal component analysis. Clus-
tering analysis involves a hierarchical
grouping of data. Some cluster analysis
programs (notably NORMIX) also provide
statistical estimates of the multimodal,
multivariate distribution. Cluster analy-
sis has been used on air pollution data to
reveal cyclic patterns (over days, weeks,
or seasons) and to identify local source
effects. All of these relations are
reflected in different combinations of
values of the variables that cluster
together. Because of the clustering in
aerometric data, multivariate cluster
analyses are useful as preliminary
analytical and data validation techniques.
The results of the report support the
well-known inverse concentration
relationship between ozone and the
oxides of nitrogen, presumably due to
their production timing and to their
chemical interaction. These and many
other relationships are indicated in
tabular form and illustrated in some
examples by diagrams in which data are
plotted with overlying distribution
ellipses in bivariate form, or by profile
models of means andthree-sigma limits
for given measurements.
Data from monitoring activities cannot
be considered as random samples from
a single universe, but rather, such result
from sampling mixtures of distributions,
usually internally correlated within
each group. As expected, these studies
show that pollutant data depend on, or
are highly correlated with, meteorolog-
ical conditions. At a given site and with a
given set of pollutant sources, the
pollutant concentration at the site is
heavily dependent upon the meteoro-
logical regime. Thus, the pollutant
distributions will be a mixture of the
distributions that result from the
mixture of the meteorological regimes
and the interaction of the pollutants
over the time of the monitoring. Solar
radiation is an effective agent in the
meteorological regime, but these data
were not available for inclusion in this
study. This discussion suggests that the
pollutant data distributions first be
separated into meteorological regimes
by cluster analysis and that these
subsets then be evaluated individually
by other approporiate techniques.
Moreover, analysis involving prediction
of future pollutant concentration distri-
butions for each meteorological regime
should also consider the probabilities of
occurrence of each of the meteorological
regimes. This will enhance the clustering
and the classification of data.
Normality of distribution is not
required for simple hierarchical clus-
tering, but if statistical significance
statements are to be made, or if
statistical characteristics of the clusters
are to be used, the normal distribution is
quite useful. It is not necessary that
exact normality be achieved, as the
techniques used are sufficiently robust
to accommodate considerable departure
from normality.
The results are more reliable if
individual element (variate) distributions
are assumed to be normal or near
normal during the application of clus-
tering techniques. If the distributions
are distinctly non-normal in appearance,
various mathematical operations are
available to transform the individual
data so that the distributions of the
transformed data may be described by
the normal distribution prior to their
entry into the heirarchical clustering
scheme. If no prior information is
available, the heirarchical clustering
may be the only product of the operation,
or it may be a prelude to further
information extraction.
Because the NORMIX program has a
substantial statistical basis with corre-
sponding statistical assumptions and
tests of significance, it was selected for
further development in this study. The
program, originally written in IBM
FORTRAN IV language, was converted
to the ASCII FORTRAN language require-
ment of the UNIVAC 1110 at NCC-
EPA/RTP.
Figure 1 shows the general config-
uration of the expanded NORMIX
program, detailing the NORMIX pre-
processing options, the central NORMIX
core algorithm, and the post-processing
flow: Figure 2 presents the NORMIX
flow chart. Documentation is available
in supplementary and complementary
reports. Volumes II and III, which are
discussed later.
Calibration and Validation of
the Expanded Normix
Program
The ability of the present revised
version of the NORMIX program to
produce results equivalent to those of
Wolfe1 (for the same data in the older
program version and with a different
computer) indicated that the revised
version has been adequately calibrated.
Program validation consisted of applica-
tion over several types of data sets, not
necessarily all aerometric. Data valid-
ation consisted of the isolation of
outliers, if any, for examination and
treatment. Both single and clustered
outliers are identifiable in the hierarch-
ical clustering as well as in the NORMIX
processing. Since the NORMIX program
uses the same hierarchical clustering
algorithms as several of the other
programs, it is not necessarily more
useful for this purpose.
In order to demonstrate that the
present version of the NORMIX program
is available and works properly, 11 sets
of data were used. These were:
1. A classical data set composed of
measurements of petal and sepal
lengths and widths (four variates)
made by Anderson4 and used by
Fisher5 to illustrate clustering and
classification.
2. A synthetic bivariate data set from
Wolfe1.
3. A set of synthetic data consisting
of three predetermined three-
element data sets that could be
expanded both in variances and
distances between the centroids.
-------
Normix-Prep
Normix
Calcomp
Input
Data
Displays
Trans-
formed
Data
Displays
Chi-square tests
Histograms
Probability chart^
1*
Size
Grouping
^ Preliminary
Estimates
^
Clustering
— >• Mapping
of subsample Chi-square tests
Grouping pattern Minimum number of points in clustei
(3rd iteration)
Merge details
Number of hypothesized types
(in seguence)
Probability
level for chi-square tests
Override chi-square test abort
Random
Scaling
numbers
Eigenanalysis printout Printing at
Time limit
Wind components
Maximum
Covariance
each iteration
on computations
number of iterations
matrices: same or different
Time printouts
Input of preliminary estimates
Ellipse-plotting and mapping
Profiles of cluster variables
Figure 1. NORMIX flow and options.
4. A bivariate set of maximum and
minimum temperatures for June
and July at the Raleigh-Durham,
North Carolina, airport, supplied
by Professor Jerry Davis of the
North Carolina State University at
Raleigh.
5. A five-variate set of health-related
data for trace metals.
6. A univariate set of river discharge
data supplied by the U.S. Geolog-
ical Survey.
7. A 12-variate set of data on new
filters, 12 impurities, and trace
metals.
8. A six-variate set of data on the
physical and chemical character-
istics of new filters.
9. A univariate set of acid precipitation
data for each of several locations
within the United States.
10. The LACS data set.
11. The CHAMP data set.
The first three sets of data mentioned
above were processed by the NORMIX
algorithm to calibrate the program
( PROFILE j ^ ELLIPSE \
2/7 INFORM. Lj.4 NORMIX-
/J INITLE r~l PREP
Data \ _J l_
/
Paramete
estimates
i i
MOMENT > ( SYMINV
Figure 2. NORMIX flow chart.
-------
conversion. The output of Sets 1 and 2
agreed with those of Wolfe. The output
of Set 3 returned the original stipulated
clusters prior to their being mixed.
The NORMIX program produces
hierarchical mapping of the data, as do
most of the other programs discussed in
this report, although the metrics used
may differ. Tree (branching or dendritic)
diagrams may be prepared from the
maps, which show the coalescence of
data into clusters. The report contains
such tree diagrams, which are not
reproduced here. From such tree
diagrams, outliers (either singletons or
small groups) can be identified easily, as
they are the last to enter a larger cluster
or the last to join the total group. A
lengthy discussion on the reading and
interpretation of the diagrams is also
available.
The presence of extreme outliers, as
singletons or as minimum sized clusters,
creates near-singularities in the data
matrices, which halt a running computer
program, produce slow convergence, or
do not allow the program to converge to
a solution. This phenomenon is charac-
teristic of any program that uses
convergence routines involving matrix
calculations.
The extensive environmental data
bases, LACS and CHAMP, were pro-
cessed by means of cluster algorithms;
SAS CLUSTER and MIKCA were used
for the LACS data and NORMIX was
used for the CHAMP data.
In order to study the effect of the use
of automobile catalytic converters on
aerometric parameters, the period of
the LACS data necessarily had to
include the periods before and after the
1975 introduction of these devices; the
period selected was 1974 through
1978. The pollutant elements (variates)
observed were suspended particulates
(SP), ozone (Oa), nitrogen oxide (NO),
nitrogen dioxide (N02>, sulfur dioxide
(SOa), carbon monoxide (CO), and lead
(Pb). The meteorological variates were
temperature, wind speed, wind direc-
tion and traffic count. For this analysis,
measurements of the variates SP, CO,
Pb, wind speed, wind direction, and
traffic count from two selected sites
were used. These sites were on either
side of the San Diego Freeway between
the intersection of the freeway with the
two boulevards, Sunset and Wilshire.
When more than one element is being
observed and recorded, one of the
elements may not be obtained for a
particular observation time for various
reasons. Effectively, in the multivariate
sense, the omission of a single element
requires the rejection of the entire
observation from the data set under
consideration. In some cases, it may be
reasonable to merge one incomplete
multivariate vector (observation) with
another complementary incomplete
vector from a nearby locale to obtain a
usable complete vector. If this is done,
this factor must be considered when
interpreting the results
Here is an example of the two effects
mentioned above: using only two sites
from the LACS data bank, the number of
available and useful hourly observations
for the 1977 through 1978 period is
about 3031, as compared to about
25,000 observations originally available
from all eight sites of the LACS. The
reader should consult Part 2 of Volume I
of the Project Report for further details.
The periods of record at the CHAMP
stations were relatively short, i.e. from
September 1975 through November
1976 for Angwin, California, and Loma
Linda, California, and from August 1974
through September 1976 for Magna,
Utah. The data selected for use are for
certain hours of the day, days of the
week, weeks, and seasons.
The pollutant and meteorological data
consisted of oxides of nitrogen (NO*),
calculated nitrogen oxide (NO), nitrogen
dioxide (NOz), sample nitrogen oxide
(SNO), ozone (O3), sulfur dioxide (SO2),
total hydrocarbons (THC), non-methane
hydrocarbons (NMHC), temperature (T),
dew point, winds, and atmospheric
pressure (P). For this study, all winds
were transformed to east-west and
north-south components, positive from
the west and south. An option in the
program permits transformation of
polar coordinates (wind direction and
speed) to rectangular coordinates, along
and at right angles to any preselected
direction. The default option is the east-
west and north-south configuration.
Comparison of Three
Clustering Programs.
Table 1 compares three clustering
algorithms, SAS CLUSTER, MIKCA, and
NORMIX. All three algorithms select
clusters by an agglomerative rather
than a divisive procedure, and the
number of clusters to be examined must
be stipulated. For each program, the
recommended maximum number of
cluster configurations is seven.
The reader and user of the Project
Report will find a rather extensive
discussion of clustering and data
validation problems and uses for
computer programs of clustering tech-
niques. No one program satisfies all
users. Some of the limitations of each
program are discussed. The NORMIX
program, being the most complex,
produces much more informational
output than do the SAS CLUSTER and
MIKCA programs.
Examples of Processing Output
Table 2 shows an intercomparison of
selected pollutant datl throughout the
year for NO, NOX, and Oa at Angwin,
California, Loma Linda, California, and
Magna, Utah.
The information in Table 2 reveals
that Angwin, California probably has
the lowest levels of oxides of nitrogen
and highest level of ozone, and the
lowest variability of these three variates
at the three stations. This is, of course,
one reason why the Angwin, California,
site was selected for monitoring and for
this study. The large standard deviation
and negative mean value for the Loma
Linda, California NOX data reflect that
the number added to low observed
values (to ensure a minimum value of
two before logarithms are taken was
insufficiently large.
Figure 3, developed from NORMIX
output information from Magna, Utah
data, illustrates the 0.50 probability
ellipses for wind speed and direction
and the associated pollutant means,
standard deviation, proportions, and the
number of observations. Cluster num-
bers for each point are included to help
assess the clustering efficacy. It must be
remembered that the pollutant variables
have been logarithmically transformed
and numbers refer to such transformed
data. The 0.50 probability ellipses are
centered on the centroids of the plots of
east-west wind components versus
north-south wind components. The
ellipses are for the wind components,
but the cluster classifications are in
terms of the eight variates. As previously
noted, it is in the overlapping regions
that errors of misclassification may
occur for an individual datum. However,
the statistical estimates are generally
expected to provide the best estimates
of the cluster configurations. This type
of presentation is a projection of the
total multidimensional ellipsoid onto
the plane of the two selected variates.
Any two variates can be selected by
options provided in the program.
The program option that developed
Figure 4 arranged the variate output in
terms of the largest cluster with variate
-------
Table 1.
Property
Comparison of Capabilities of the SAS CLUSTER. MIKCA and NORMIX
Algorithms
SAS
MIKCA
NORMIX
Complexity
Output Quantity
Number of
Input Data
Limit to Number
of Variables
Maximum Number
of Clusters
Distance Options
Criteria Options
Hierarchical
Clustering
Low
Minimum
250
10
250
1
1
Yes with Maps
Medium
Moderate
500
20
15
3
9
No
High
Extensive
2000*
20*
150
1
1
Yes with Maps
*May be increased if computer memory space permits. Computation time increases
exponentially with the numbers of variates and observations'
Table 2. Intercomparison of Selected Pollutant Data Throughout the Year for NO,
NO* and Os at Three Locations
Transformed variates*
NO
NO,
Locations Mean
Std
dev.
Std
Mean dev.
Std
Mean dev.
Num-
ber
of
obser-
va-
tions
Angwin. 0.6981 0.0017 0.7002 0.0045 0.7127 0.0087 288
California
LomaLinda. 0.7036 0.0130 -3.0442 0.9576 0.7081 0.0198 122
California
Magna, 0.7055 0.0164 0.7123 0.0216 0.7081 0.0090 324
Utah
*Values are in logarithmically (base e) transformed data originally in ppm.
means of increasing in sequence.
Again, as in all presentations such as
this, the values of the other variates
follow the sequence established by the
largest cluster. The numbers below the
minimum three-sigma limit, ranging
from one through eight, identify the
variates in order of their entry into each
observational vector. The other num-
bers identify the three-sigma levels.
In Figures 3 and 4, it may be noted
that weak southeast winds with a mean
speed of approximately 4 km/h are
associated with ozone readings lower
than average and oxides of nitrogen and
sulfur readings higher than average.
Strong winds from the northwest,
approximately 9 km/h, and from the
south-southeast, approximately 12
km/h, again show the inverse relation-
ship of ozone with oxides of nitrogen
and sulfur. Clusters 1 and 3, with
southeast and south-southeast winds,
respectively, show the greatest tem-
perature difference between the means,
namely, 17.54°C. Further investigation
might yield the reason for this tempera-
ture difference. Speculatively, this
feature might be associated with
seasonal characteristics or synoptic
episodes.
Conclusions
The applications of three clustering
algorithms to aerometric data bases
were compared in order of investigation:
SAS CLUSTER, MIKCA, and NORMIX.
The three routines produce similar
results through the processing steps of
hierarchical clustering and output of
cluster means. Beyond that point,
MIKCA appears to provide slightly more
information than SAS CLUSTER.
NORMIX, as modified, produces consid-
erably more information and guidance
than either MIKCA or SAS CLUSTER. -
NORMIX is the recommended clustering
program; a calibrated and tested program
with full documentation, available as
Volume II of this three-volume report
series.
Many other clustering programs may
be used, but only the aforementioned
three have been examined in this study,
and only NORMIX has been examined in
detail. Of the three, only NORMIX
provided complete statistical estimates
of the multimodal, mulitvariate distri-
butions. SAS CLUSTER is strictly
hierarchical in grouping and mapping
and uses this information as initial
statistical estimates for further iter-
ations to achieve maximum likelihood
estimates.
Los Angeles Catalyst Study (LACS)
data were analyzed by use of the two
algorithms, SAS CLUSTER and MIKCA.
The results were similar. Community
Health Air Monitoring Program (CHAMP)
data also were analyzed by means of the
NORMIX program.
References
1. Wolfe, John J. (1971 (NORMIX 360
Computer Program. Naval Personnel
and Training Research Laboratory,
San Diego, California. Research
Memorandum SRM 72-4. 125 pp.
2. Barr, A.J., J.H. Goodnight, J.P. Sale,
and J.T. Helwig (1976) A User's
Guide to SAS. Spanks Press.
3. McRae, D.J. (1973) MIKCA. A FOR-
TRAN IV Iterative K-Means Cluster
Analysis Program. CTB/McGraw
Hill, Del Monte Research Park,
Monterey, California. Revised by
M.J. Symons, October 1973.
4. Anderson, Edgar (1953) The Irises of
the Gaspe Peninsula. Bull. Amer. Iris
Soc. 59:2-5.
5. Fisher, R.A. (1936) The Use of Multi-
ple Measurements in Taxonomic
Problems. Ann. Eugen. Vll:11:179-
188.
-------
-20.00
"
-70.00
I -5.00
o
"to
-S 0.00
1 5.00
c3\
' 70.00
to
75.00
20.00
25.00
30.00
Cluster
Variable
1 -NO
Mean
Standard deviation
2 -NOX
Mean
Standard deviation
3 - Ozone
Mean
Standard deviation
4- TS
Mean
Standard deviation
5 - West wind comp
Mean
Standard deviation
6 - South wind comp
Mean
Standard deviation
7 - Temperature
Mean
Standard deviation
8 - Dew point
Mean
Standard deviation
2 2
2
2/^~
2 f 2
( 2
V 2
21 Y^
2 7
74.00 \ 10.00 I 6.00
12.00 8.00
7
P = .354
0.72
0.03
0.74
0.03
0.70
0.01
0.78
0.07
-1.70
3.41
3.40
4.12
0.47
3.49
-5.60
3.36
2~~~^
22
2 21
2 21 2
—
3 V
' ;V
3
3/
}
3 \
\ 2.00
4.00
2
P = .333
0.70
0.00
0.70
0.00
0.71
0.00
0.72
0.03
3.13
4.42
-8.32
5.56
14.85
5.59
-1.31
2.95
^2
2\ 2
2 2}2
2 y ;
==lr~-^ 7
/! ^^
^3'3'-22
^/^--J '
3
3
3 3
\3
3~— 3
3
3
\ -2.00
3
P = .313
0.70
0.00
0.70
0.01
0.71
0.00
0.71
0.07
-2.86
3.86
12.12
6.00
18.01
6.22
-3.83
3.02
2
^
ri^ ' 3 3
— 1"^ ^ X
31\ 3
3 3
33
— ~
\ -6.00 -10.0O\ -14.OO
0.00 -4.00 -8.00 -12.00
5 - West wind comp
West vs South wind - probability level = O.50
Figure 3. Magna, Utah, Day 3, 0.50 probability ellipses of the west-east and
south-north wind components for three cluster types.
-------
38.6240
X+3S
7,378
31.5780
0.1471
jfor
0.313
0.333'r-'
0.354^
X—3S *-
-4-
-0.0248 -14.4900 -0.2841 -0.1339
-17.1200 -14.4340 -27.1330 -0.1076
73856412
Profile plot
Figure 4. The means and three-sigma limits for each variate of the three data
clusters of Figure 3 are represented by triangles (cluster 1), diamonds
(cluster 2} and circles (cluster 3), respectively. The variate numbers along
the abscissa refer respectively to: 1. NO; 2. NOx; 3, Ox 4, TS;
5, W component of wind; 6, S component of wind; t. temperature; and
8, dew point. The sequence of variates is determined by the value of their
logarithms for cluster 1 (lowest to highest). The other numbers refer to the
three sigma limit values for each variate.
6USGPO: 1982 — 559-092/0494
-------
Harold L Crutcher is a private consultant at 35 Westell Avenue, Asheville, NC
28804; the EPA author Raymond C. Rhodes (also the EPA Project Officer, see
below) is with the Environmental Monitoring Systems Laboratory, Research
Triangle Park, NC 27711; Maurice E. Graves is with Northrop Services, Inc.,
Research Triangle Park, NC 27709; Beth Fairbairn and A. Carl Nelson are with
PEDCo Environmental, Inc., Durham, NC 27701; Michael J. Symons is with
the University of North Carolina, Chapel Hill. NC27514.
The complete report consists of three volumes, entitled "Application of Cluster
Analysis to Aerometric Data:"
"Volume I. Part 1—Clustering, Validation, and Classification of Data;
Part 2—Investigation and Report of Cluster Analysis." (Order No. PB 82-226
432; Cost: $ 13.50, subject to change)
"Volume II. Part 3—Modifications and Options Applied to Wolfe's NORMIX
360 Cluster Analysis Program," (Order No. PB 82-226 440; Cost: $16.50,
subject to change)
"Volume III. Part 4—Separation of Environmental Data Into Clusters by the
NORMIX Program." (Order No. PB 82-226 457; Cost: $10.50, subject to
change)
The above reports will be available only from:
National Technical Information Service
5285 Port Royal Road
Springfield, VA 22161
Telephone: 703-487-4650
The EPA Project Officer can be contacted at:
Environmental Monitoring Systems Laboratory
U.S. Environmental Protection Agency
Research Triangle Park. NC 27711
United States
Environmental Protection
Agency
Center for Environmental Research
Information
Cincinnati OH 45268
Postage and
Fees Paid
Environmental
Protection
Agency
EPA 335
Official Business
Penalty for Private Use $300
Pj> 0000329
U S ENVIR PROTECTION AiiEHCY
HtGlON b LItiftAKY
230 i> DE.AKBORN STREET
IL 606U4
------- |