United States Environmental Protection Agency
Office of Air Quality Planning and Standards
Research Triangle Park NC 27711
EPA-450/4-81-031e
September 1981
Air
Evaluating Simple Oxidant Prediction Methods Using Complex Photochemical Models
Cluster Analysis Applied To Urban Ozone Characteristics
-------
This report was furnished to the U.S. Environmental Protection
Agency by Systems Applications, Incorporated in fulfillment of
Contract 68-02-2870. The contents of this report are reproduced
as received from Systems Applications, Incorporated. The opinions,
findings and conclusions expressed are those of the author and not
necessarily those of the Environmental Protection Agency. Mention
of company or product names is not to be considered as an endorsement
by the Environmental Protection Agency.
-------
EPA-450/4-81-031e
Evaluating Simple Oxidant Prediction
Methods Using Complex
Photochemical Models
Cluster Analysis Applied To Urban
Ozone Characteristics
EPA Project Officer: Edwin L. Meyer, Jr.
Prepared by
U.S. Environmental Protection Agency
Office of Air, Noise and Radiation
Office of Air Quality Planning and Standards
Research Triangle Park, North Carolina 27711
September 1981
-------
EXECUTIVE SUMMARY
Control of urban ozone pollution poses a unique problem because ozone
is not directly emitted into the atmosphere by anthropogenic sources.
Rather, it results from atmospheric photochemical reactions involving
hydrocarbons and nitrogen oxides. Because the reactions involved in ozone
formation take several hours to produce maximum ozone levels, the
dispersion during such periods results in ozone's being a regional, rather
than a local, problem. Thus, the processes involved in determining the
location and severity of ozone concentrations include the temporal and
spatial characteristics of the emission rates of the two precursor
species, transport and dispersion by meteorological and topological
effects in the region, and photochemical reactions dependent on solar
intensity, temperature, and the like.
Currently, the only satisfactory way to characterize the temporal and
spatial nature of ozone pollution in an urban area is through the use of
complex mathematical models. Because of the complexity of these models,
the costs associated with data gathering and their exercise on a computer
can be substantial. Moreover, considerable expertise is needed to
successfully mount a full-scale study of an urban area. It is thus
fruitful to search for ways in which knowledge gained in one application
can be transferred to application in another urban area. Specifically, if
two urban areas can be shown to have sufficiently similar characteristics
with respect to their ozone problems, a control strategy developed for one
through application of a complex model could also be applied to the other,
thus potentially obviating the need for a second costly study.
This report covers two exploratory studies that apply multivariate
clustering techniques to the identification of similarities between urban
areas. The results showed promise in assigning urban areas to distinct,
relatively homogeneous classes; however, no clear-cut classification was
achieved. The more qualitative of the two techniques showed a greater
ability to classify areas into well-defined groups. The results indicate
that further work is needed to refine the choice of classificatory
variables, and that other techniques might be applied with greater
success.
-------
CONTENTS
EXECUTIVE SUMMARY ii
LIST OF ILLUSTRATIONS iv
LIST OF TABLES v
I INTRODUCTION 1
II CLASSIFICATION OF URBAN AREAS BY PROFILE ANALYSIS 4
III CLASSIFICATION OF URBAN AREAS BY HIERARCHICAL CLUSTERING 14
A. Stepwise Discriminant Analysis 17
B. Cluster Analysis 21
IV SUMMARY AND RECOMMENDATIONS 33
REFERENCES R-1
-------
ILLUSTRATIONS
1 Profile of Cluster 1 7
2 Profile of Cluster 2 8
3 Profile of Cluster 3 9
4 Profile of Cluster 4 10
5 Profile of Cluster 5 11
6 Profiles of Denver, Phoenix, and Salt Lake City 12
7 Urban Areas Included in Hierarchical Cluster
Analysis 16
8 Dendrogram Based on All Variables 24
9 Dendrogram Based on Meteorological and Emissions
Variables 25
10 Dendrogram Based on Meteorological Variables Excluding
Temperature Variables 27
11 Dendrogram Based on Ozone Level Variables 29
12 Dendrogram Based on Meteorological Variables 30
13 Dendrogram Based on Emissions Variables 31
-------
TABLES
1 Urban Areas Included in the Profile Analysis 5
2 Urban Areas Included in Cluster Analysis 15
3 Ozone Monitors Corresponding to Certain Urban Areas 18
4 Urban Area Classifications 18
5 Identification of Urban Area Classifications 19
6 Variables Entered and Percent of Cases Classified
Correctly at Each Step of Discriminant Analysis 22
7 Summary of Clusters Based on Ozone, Meteorological,
and Emission Variables 32
-------
I INTRODUCTION
Ozone is unique among regulated air pollutants in that it is not
directly emitted into the atmosphere from anthropogenic sources. Rather
it results from atmospheric photochemical reactions involving hydrocarbon
(HC) and nitrogen oxide (NOX) precursors, which are emitted in varying
amounts by industrial, utility, and automotive sources. Levels of atmos-
pheric ozone thus depend not only on the usual factors of atmospheric
transport and dispersion and on the amount of pollutant emitted, but also
on the relative amounts and spatial distribution of emissions of two pre-
cursors and on the level of solar radiation necessary to initiate the
ozone-producing reactions. The speed as well as the extent of the reac-
tions depend on the ratio of HC to NOX and on the level of solar radia-
tion.
Because the ozone-producing reactions take several hours to produce
the maximum amounts of ozone, by the time this maximum has been reached
much atmospheric dispersion has taken place. Thus, ozone tends to be a
regional, rather than a local, problem. Furthermore, since a period of
maximum ozone concentration depends on the amount of solar radiation
available to sustain the photochemical reactions leading to its produc-
tion, the highest concentrations of ozone are reached during summer and
early fall when higher insolation is observed.
The regional nature of the problem and the lack of a direct source-
receptor relationship make the direct control of ozone concentrations,
using emissions control strategies, more difficult. Currently, the
concentration level at which the NAAQS is set is 0.12 ppm, and many urban
areas exceed this level more than the prescribed one time per year.
However, control of HC or NOX emissions, or both, does not necessarily
produce corresponding reductions in ozone (EPA, 1977a). In fact, the
benefit to be derived from controlling a given set of sources in an urban
area depends on current levels of HC and NOX as well as on the location of
those sources relative to the location of observed ozone maxima. To
assess potential benefits of different strategies, many methods have been
developed and applied in recent years.
The simplest of these methods is that of proportional rollback, which
assumes that a reduction in HC emissions will result in a proportional
reduction in ozone. As pointed out above, this approach does not work,
-------
not only because of the influence of NOX levels on ozone-producing reac-
tions, but also because the relationship between HC and ozone is non-
linear. Other rollback methods that take nonlinearity into account (e.g.,
the Appendix J method, 40 CFR) also fail because of their lack of con-
sideration of NOX.
More complex methods that account for the dependence of ozone
concentrations on both hydrocarbons and nitrogen oxides (EPA, 1977a) are
more successful in describing or predicting the results of potential ozone
control strategies. However, these models concentrate on the chemistry of
the problem and do not account for the spatial relationships between emis-
sions sources and transport and dispersion phenomena in the region of
interest. To include all aspects of the problem, simulation models have
been developed that account for emissions and their spatial and temporal
relationships, atmospheric photochemical reactions, and atmospheric trans-
port and dispersion. Such models are required to solve a large and com-
plex set of equations describing all of the atmospheric phenomena listed
above. Moreover, these equations are solved many times through a series
of time steps to yield temporally- and spatially-averaged ozone
concentrations.
Such photochemical simulation models are computationally extremely
complex, and for use in simulating a major urban area they require access
to a large computer. In addition, because of the complexity of the com-
puter programs, a knowledge of atmospheric pollution processes and com-
puter programming is necessary. An additional characteristic of these
models that discourages their use for many potential ozone problems is the
cost of setting up and running the program. Before applying such a model
to an urban area, an extensive data base containing emissions and meteoro-
logical data on a temporally- and spatially-resolved basis must be
developed.
Because of the cost and the difficulties involved in applying complex
photochemical models to a large number of urban areas, it seems fruitful
to investigate ways in which knowledge and experience gained by model
application to one city could have transfer value when a second city is
being considered. To that end, we have applied multivariate clustering
techniques to pertinent emissions, meteorological, and ozone-level data
from several cities in the United States. The objective of the study was
to determine whether urban areas could be objectively classified by
characteristics relevant to photochemical pollution. Identification of
city groups with similar characteristics would permit a small number of
prototypical cities to be subjected to detailed analysis using complex
photochemical models. Results obtained for one city within a group could
be used in evaluating possible control strategies for the rest of the
group. In addition, the performance of a model when applied to a proto-
typical city could be used as an indicator of its likely performance when
applied to other cities in the same group.
-------
The work described in this report covered two essentially exploratory
studies using different clustering techniques: first, a study employing
profile analysis to identify similarities between 29 urban areas (chapter
II), and second, a further analysis that applies a hierarchical clustering
technique to data from 45 urban areas (chapter III).
Cluster analysis comprises a set of mathematical techniques that are
used to examine and develop underlying structure in multivariate data.
The techniques range from fairly mathematical to almost purely descrip-
tive, and have been applied to an extremely wide range of data sets
(Hartigan, 1975). Clustering techniques can be thought of as a qualita-
tive analog of regression analysis in terms of describing structural con-
tent of multivariate data. In carrying out regression analysis a well-
defined mathematical model of the data structure is used; however, in
cluster analysis one frequently has no preconceived notions about an
appropriate model. Thus, whereas the goal of regression analysis is to
estimate the parameters of the model, the purpose of cluster analysis is
often merely to see whether or not the data fall into any reasonably well-
defined groups.
-------
II CLASSIFICATION OF URBAN AREAS BY PROFILE ANALYSIS
In this study the urban areas chosen were those where the ozone NAAQS
was exceeded by more than 100 percent during the period 1974 to 1976 (EPA,
1977b). The areas included are shown in table 1. For classification
variables, five meteorological and two emissions-related variables were
chosen. These were:
> Summer morning mixing height. This quantity was chosen as
a measure of the volume into which morning emissions are
injected. These emissions are mainly responsible for
ozone formation in the afternoon.
> The difference between summer afternoon and summer morning
mixing height. This variable represents the degree to
which morning emissions are diluted by the increase in the
depth of the mixed layer.
> Summer afternoon wind speed. This measure represents the
dilution of the pollutant cloud in the afternoon, when
ozone reaches its peak in some areas.
> Normal daily July maximum temperature. Ozone formation is
a strong function of temperature, and areas having high
maximum temperatures are expected to have a greater
potential for ozone formation.
> Mean daily July solar radiation. Ozone is formed photochemically,
and the amount of solar radiation is an
important measure of its formation potential.
> Ratio of hydrocarbon to nitrogen oxides emissions. This
ratio affects the rates of atmospheric photochemical reac-
tions and the amount of ozone that can be formed.
> Percentage of nitrogen oxides from transportation
sources. This variable was chosen as a surrogate for the
mix of point and mobile sources in the area. This factor
can have important effects on ozone impacts.
-------
TABLE 1. URBAN AREAS INCLUDED IN THE PROFILE ANALYSIS
Hartford-New Haven, Connecticut
Philadelphia, Pennsylvania
Chicago, Illinois
Milwaukee, Wisconsin
Houston, Texas
Denver, Colorado
San Francisco Bay Area, California
Fresno, California
Boston, Massachusetts
Northern New Jersey
District of Columbia
Erie, Pennsylvania
Richmond, Virginia
Newport News, Virginia
Huntsville, Alabama
Tampa, Florida
Louisville, Kentucky
Nashville, Tennessee
Kingsport, Tennessee
Detroit, Michigan
Minneapolis-St. Paul, Minnesota
Cincinnati, Ohio
Cleveland, Ohio
Baton Rouge, Louisiana
Dallas, Texas
Wichita, Kansas
St. Louis, Missouri
Salt Lake City, Utah
Phoenix, Arizona
-------
The values of the variables for the different urban areas were taken
from Holzworth (1972) (mixing heights and wind speeds), the Climatic
Atlas of the United States, 1968, and the 1973 Emissions Trends report
(EPA, 1974). The meteorological data were interpolated from the maps
given in the first two compilations.
Profile analysis, as a cluster analysis technique, falls into a more
qualitative class and is very simple in both concept and execution. First
a set of axes, one for each variable, is laid out. Scales on these axes
are chosen so that the ranges of the variables over the complete set of
urban areas (or "cases") cover approximately equal lengths. Each case is
then plotted on each axis according to its value for that variable. A
"profile" for each case can then be constructed by joining all of the
plotted points for that case. The similarities between the profiles can
then be used to judge similarities between urban areas. In addition,
points of difference can be identified by discrepancies between
profiles. Using this technique, we identified five clusters of cities,
which are illustrated in figures 1 through 5.
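
The profile construction described above can be sketched in a few lines. This is a minimal illustration, not the procedure used in the study: the case names and values are hypothetical, min-max scaling is one assumed way of making the axis ranges cover equal lengths, and the numeric gap between profiles is an added convenience (in the study the profiles were compared by eye).

```python
from typing import Dict, List


def scale_profiles(cases: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Rescale every variable to [0, 1] so that each axis spans an equal length."""
    names = list(cases)
    n_vars = len(cases[names[0]])
    lows = [min(cases[c][j] for c in names) for j in range(n_vars)]
    highs = [max(cases[c][j] for c in names) for j in range(n_vars)]
    scaled = {}
    for c in names:
        scaled[c] = [
            (cases[c][j] - lows[j]) / (highs[j] - lows[j]) if highs[j] > lows[j] else 0.5
            for j in range(n_vars)
        ]
    return scaled


def profile_gap(a: List[float], b: List[float]) -> float:
    """Largest gap between two profiles on any single axis."""
    return max(abs(x - y) for x, y in zip(a, b))


# Hypothetical values for illustration only (not taken from the report):
# [morning mixing height, afternoon wind speed, July maximum temperature]
cases = {
    "A": [300.0, 4.0, 85.0],
    "B": [320.0, 4.5, 86.0],
    "C": [700.0, 8.0, 105.0],
}
profiles = scale_profiles(cases)
gap_ab = profile_gap(profiles["A"], profiles["B"])  # small gap: similar profiles
gap_ac = profile_gap(profiles["A"], profiles["C"])  # large gap: dissimilar profiles
```

Each scaled list is one "profile"; joining its points across the axes gives the line plotted for that case.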
In figure 1, Los Angeles and San Francisco are identified as forming
a fairly good cluster of coastal California urban areas. Some discrepancy
is noted in July maximum temperatures, but the value obtained from the
climatic atlas has more marine influence than do the values obtained in
the area having the highest ozone concentration. In figure 2, a cluster
of midwestern urban areas consisting of Chicago, Milwaukee, Minneapolis-
St. Paul, Detroit, and Cleveland is identified. Although their climates
are predictably similar, the emissions-related variables also reinforce
the clustering. Figure 3 shows a group of urban areas, all of which are
on the eastern seaboard of the United States, stretching from Connecticut
to Virginia. In figure 4 we have a cluster of eastern United States
cities consisting of Washington, D.C., Louisville, Nashville, Kingsport,
Cincinnati, and St. Louis. These cities can be differentiated from those
in the previous cluster on the basis of lower wind speeds, which are
presumably related to greater distance from the ocean. This cluster seems
to be closer in terms of meteorology than it is in terms of emissions-
related variables. Figure 5 shows profiles of three urban areas located
near the Gulf of Mexico: Houston, Tampa, and Baton Rouge. While these
cities do not form a particularly tight cluster, the similarities between
them are evident.
In figure 6 we show an example of three profiles that, though they
might be expected to show similarities, in fact show large discrepan-
cies. These profiles are for Denver, Phoenix, and Salt Lake City. The
largest discrepancies are for July maximum temperature, July insolation,
and the ratio of HC to NOX emissions.
-------
FIGURE 1. PROFILE OF CLUSTER 1
(Axes, left to right: summer morning mixing height; change in mixing height; summer afternoon wind speed; July maximum temperature; July insolation; HC/NOx; percent NOx from transportation.)
-------
FIGURE 2. PROFILE OF CLUSTER 2
(Axes, left to right: summer morning mixing height; change in mixing height; summer afternoon wind speed; July maximum temperature; July insolation; HC/NOx; percent NOx from transportation.)
-------
FIGURE 3. PROFILE OF CLUSTER 3
(Axes, left to right: summer morning mixing height; change in mixing height; summer afternoon wind speed; July maximum temperature; July insolation; HC/NOx; percent NOx from transportation. Legible case labels: NJ, NNEWS.)
-------
FIGURE 4. PROFILE OF CLUSTER 4
(Axes, left to right: summer morning mixing height; change in mixing height; summer afternoon wind speed; July maximum temperature; July insolation; HC/NOx; percent NOx from transportation. Legible case labels: DC, KINGSP, CINCI, NASHV, LOUISV.)
-------
FIGURE 5. PROFILE OF CLUSTER 5
(Axes, left to right: summer morning mixing height; change in mixing height; summer afternoon wind speed; July maximum temperature; July insolation; HC/NOx; percent NOx from transportation. Legible case labels: TAMP, HOUS, B.ROUG.)
-------
FIGURE 6. PROFILES OF DENVER, PHOENIX, AND SALT LAKE CITY
(Axes, left to right: summer morning mixing height; change in mixing height; summer afternoon wind speed; July maximum temperature; July insolation; HC/NOx; percent NOx from transportation.)
-------
This preliminary analysis demonstrated the feasibility of grouping
urban areas according to the factors contributing to oxidant problems.
However, profile analysis gives no information of a quantitative nature
about the degree to which cases in a given cluster resemble each other and
differ from cases in other clusters. Moreover, since the clusters are
identified by visual inspection of profiles, there is an arbitrary element
in the selection of cases. We therefore carried out a further study in an
attempt to achieve a more quantitative clustering and to consider more
variables in the clustering process.
-------
III CLASSIFICATION OF URBAN AREAS BY HIERARCHICAL CLUSTERING
For this further analysis, we examined data for 45 urban areas. They
were selected as follows: First, we took those major urban areas that
requested an extension to 1987 of their attainment date for the ozone
NAAQS (Federal Register, 44, 65667). Of these, we eliminated Wilmington,
Delaware, because some of the required data were not available. To this
list of urban areas we added six more, to have more comprehensive geo-
graphical coverage of the United States. The urban areas included are
shown in table 2 and figure 7.
Three types of data were compiled for each urban area: emissions,
climatological, and ozone levels. Emissions data for each area, obtained
from the National Emissions Data System, were taken for each county within
that area. Three emissions variables were included: total HC emissions,
the ratio of HC to NOX emissions, and CO emissions from transportation
sources. The HC and NOX emissions influence the ozone-producing chemical
reactions as detailed above and, thus, should be important in classifying
urban ozone problems. We included CO emissions from transportation
sources as a surrogate for vehicle miles traveled in an area. The amount
of transportation-related emissions is an important facet of urban ozone
problems.
Climatological data were again obtained from Holzworth (1972) and the
Climatic Atlas of the United States, 1968. Data were interpolated from
maps or taken from tabular compilations in these documents. Since ozone
is a regional problem, regional climatology is likely to be more apposite;
it therefore seems appropriate to use data interpreted on a large-scale,
rather than local-scale, climatology. The climatological temperature data
used for the analysis were June, July, August, and September maximum
temperatures, the annual maximum temperature, and the average maximum for
June through September. This late-summer period generally produces the
highest ozone concentrations. Three variables related to the amount of
sunlight at each location were obtained: the total hours of insolation in
the summer months, the average daily summer insolation, and the average
percent of cloud cover.
Since the ozone-producing reactions are initiated and sustained by
sunlight, ozone formation should be sensitive to the amount of sunlight
incident at a specific location. Average summer morning and afternoon
-------
TABLE 2. URBAN AREAS INCLUDED IN CLUSTER ANALYSIS
1 Allentown, PA
2 Baltimore, MD
3 Boston, MA
4 Bridgeport, CT
5 Butte, MT
6 Chicago, IL
7 Cincinnati, OH
8 Cleveland, OH
9 Dallas, TX
10 Dayton, OH
11 Denver, CO
12 Detroit, MI
13 Fresno, CA
14 Hartford, CT
15 Houston, TX
16 Indianapolis, IN
17 Kansas City, KN
18 Los Angeles, CA
19 Louisville, KY
20 Miami, FL
21 Milwaukee, WI
22 Minneapolis, MN
23 New Haven, CT
24 New Orleans, LA
25 New York, NY
26 Philadelphia, PA
27 Phoenix, AZ
28 Pittsburgh, PA
29 Portland, OR
30 Providence, RI
31 Richmond, VA
32 Sacramento, CA
33 Salt Lake City, UT
34 San Bernardino, CA
35 San Diego, CA
36 San Francisco, CA
37 Scranton, PA
38 Seattle, WA
39 Springfield, MA
40 St. Louis, MO
41 Trenton, NJ
42 Ventura - Oxnard, CA
43 Washington, DC
44 Worcester, MA
45 Youngstown, OH
-------
FIGURE 7. URBAN AREAS INCLUDED IN HIERARCHICAL CLUSTER ANALYSIS
-------
wind speeds were obtained because a higher average wind speed should favor
dispersion of precursor emissions and thus limit the concentrations of
ozone that can be formed. Average summer morning and afternoon mixing
heights were also recorded, as well as the change in average mixing height
from morning to afternoon. The height of the mixing layer gives a measure
of the effective volume into which emissions are discharged, and the con-
centrations reached are to a first-order approximation inversely propor-
tional to this volume. Moreover, the change in mixing height is a measure
of the dilution of morning precursor emissions. In some hot, interior
locations a low morning inversion is largely dissipated by afternoon,
whereas in a coastal location an inversion can persist into the afternoon,
trapping pollutants into a concentrated layer near the ground.
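
The first-order relationship between mixing height and concentration can be written out as a simple dilution calculation. The sketch below is an illustration only: it assumes a fixed horizontal area, no fresh emissions, and no chemistry between morning and afternoon, and the function name and numbers are hypothetical rather than taken from the study.

```python
def diluted_concentration(c_morning: float, h_morning: float, h_afternoon: float) -> float:
    """Concentration after the mixed layer changes depth.

    Assumes concentration is inversely proportional to the mixed volume,
    with a fixed horizontal area, so only the mixing heights matter.
    """
    if h_morning <= 0 or h_afternoon <= 0:
        raise ValueError("mixing heights must be positive")
    return c_morning * h_morning / h_afternoon


# Hypothetical numbers for illustration only.
# Interior site: a low morning inversion largely dissipates by afternoon.
interior = diluted_concentration(c_morning=1.0, h_morning=300.0, h_afternoon=3000.0)
# Coastal site: the inversion persists, so the layer deepens much less.
coastal = diluted_concentration(c_morning=1.0, h_morning=300.0, h_afternoon=600.0)
```

Under these assumptions the interior site retains one tenth of its morning concentration and the coastal site one half, which is the contrast the change-in-mixing-height variable is meant to capture.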
The ozone data used in this study were obtained from the Monitoring
and Data Analysis Division of the Office of Air Quality Planning and
Standards. They consisted of the maximum and second highest ozone level,
the average ozone level, and the number of exceedances of the ozone
standard for 1978. In cases where data from more than one station were
available for an urban area, the stations with the readings most represen-
tative of the area's ozone problems were chosen. The areas for which a
differently located monitor was used are shown in table 3.
A. STEPWISE DISCRIMINANT ANALYSIS
We first attempted to reduce the number of variables to be considered
by ascertaining which of the total number were most effective in discrimi-
nating between levels of severity of ozone problems. To do this we
applied stepwise discriminant analysis, using the variables related to
ozone concentration level to classify the cases. The cases were classi-
fied into five groups of approximately equal size using the variable
values shown in table 4; the urban areas in each group are identified in
table 5. It may be seen that the groups do vary according to the variable
used for classification.
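
Assigning a case to one of the five severity groups amounts to a threshold lookup. The sketch below uses the group boundaries for second highest ozone from table 4; the function name and the example values are hypothetical, and the handling of a value that falls exactly on a boundary is an assumption, since the table does not say how such cases were assigned.

```python
from bisect import bisect_right
from typing import Sequence

# Upper edges of groups 1 through 4 for second highest ozone (pphm), from table 4.
SECOND_HIGHEST_EDGES = (12, 16, 18, 20)


def severity_group(value: float, edges: Sequence[float] = SECOND_HIGHEST_EDGES) -> int:
    """Return the 1-based group number for a value.

    Assumption: a value equal to an edge is placed in the higher group.
    """
    return bisect_right(edges, value) + 1


# Hypothetical second-highest ozone readings (pphm), one per group.
groups = [severity_group(v) for v in (9, 14, 17, 19, 25)]
```

Passing a different tuple of edges (for example, those for average ozone or number of exceedances) gives the other two classifications.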
Since we carried out the discriminant analysis in a stepwise manner,
those variables entered early in the analysis should be the most influen-
tial in discriminating between the groups shown in table 4 (an analogy can
be drawn using stepwise regression). Ideally, the results of the three
classifications would show the same variables to be important, but the
results obtained allowed only general conclusions to be drawn.
Table 6 shows the order of entry of variables for the three cases run
and the percentage of cases correctly classified, for the first 14
steps. Entry of variables was halted when an entering variable had a
squared multiple correlation coefficient (R²), with the other variables,
of more than 0.99. At this stage, more than 60 percent of the cases were
-------
TABLE 3. OZONE MONITORS CORRESPONDING TO CERTAIN URBAN AREAS
Area             Monitor Used
New York         Richmond County
Philadelphia     Morristown
Springfield      Amherst
Cleveland        Painesville
San Diego        Escondido
Ventura          Simi Valley
New Haven        Derby
Bridgeport       Greenwich
San Francisco    San Jose
Dallas           Arlington
TABLE 4. URBAN AREA CLASSIFICATIONS
        Second Highest Ozone     Average Ozone Concentration    Number of Exceedances
Group   Values         No. of    Values         No. of          Values         No. of
No.     (pphm)         Cases     (pphm)         Cases                          Cases
1       Less than 12   6         Less than 6    9               Less than 5    13
2       12-16          11        6-7            8               5-10           8
3       16-18          10        7-8            10              10-15          9
4       18-20          9         8-9            12              15-20          6
5       More than 20   9         More than 9    6               More than 20   9
-------
TABLE 5
IDENTIFICATION OF URBAN AREA CLASSIFICATIONS
(a) Based on Second Highest Ozone Concentration
Group No.
1
Portland
Miami
New Orleans
Dallas
Minneapolis
Butte
2
Boston
Springfield
Worcester
Trenton
Youngstown
Dayton
Indianapolis
Denver
Phoenix
Fresno
Seattle
3
Washington
Pittsburgh
Detroit
San Diego
Providence
Hartford
Allentown
Scranton
San Francisco
Kansas City
4
New York
Philadelphia
Baltimore
Chicago
St. Louis
Cincinnati
Milwaukee
Sacramento
Louisville
5
Houston
Los Angeles
Cleveland
Ventura
New Haven
Bridgeport
Richmond
Salt Lake City
San Bernardino
(b) Based on Average Ozone Concentration
Group No.
1
Boston
Worcester
Trenton
Seattle
Portland
Miami
New Orleans
Minneapolis
Butte
2
Springfield
Providence
Hartford
Denver
Phoenix
San Francisco
Fresno
Dallas
3
New York
Washington
Cincinnati
Detroit
Milwaukee
Sacramento
New Haven
Youngstown
Dayton
Kansas City
4
Philadelphia
Baltimore
Chicago
Pittsburgh
Cleveland
San Diego
Bridgeport
Allentown
Scranton
Richmond
Indianapolis
Louisville
5
Houston
St. Louis
Los Angeles
Ventura
Salt Lake City
San Bernardino
-------
TABLE 5 (Concluded)
(c) Based on number of exceedances
Group No.
1
Springfield
Worcester
Trenton
Youngstown
Denver
Phoenix
Seattle
Portland
Miami
New Orleans
Dallas
Minneapolis
Butte
2
Boston
Baltimore
Cincinnati
Providence
Dayton
Indianapolis
San Francisco
Fresno
3
New York
Detroit
Milwaukee
Sacramento
San Diego
Hartford
New Haven
Scranton
Kansas City
4
Washington
Chicago
Pittsburgh
Bridgeport
Allentown
Louisville
5
Philadelphia
Houston
St. Louis
Los Angeles
Cleveland
Ventura
Richmond
Salt Lake City
San Bernardino
-------
correctly classified. However, with 5 to 6 variables, over 50 percent
could be correctly classified. The data in table 6 show that somewhat
different variables are important in discriminating between groups based
on the three criteria, as would be expected given the different composi-
tion of the groups for different classification variables.
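
The idea of entering variables one at a time in order of discriminating power can be illustrated with a simplified forward selection. This is a sketch under stated assumptions, not the procedure used in the study: it ranks variables by Wilks' lambda (within-group scatter divided by total scatter, smaller meaning better separation), it never removes a variable once entered, it omits the R² halting rule described above, and the data are hypothetical.

```python
from typing import List, Sequence


def det(m: List[List[float]]) -> float:
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n = len(a)
    d = 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d


def sscp(rows: List[List[float]], cols: Sequence[int]) -> List[List[float]]:
    """Sums of squares and cross-products about the mean for the chosen columns."""
    n = len(rows)
    means = [sum(r[c] for r in rows) / n for c in cols]
    return [[sum((r[ci] - mi) * (r[cj] - mj) for r in rows)
             for cj, mj in zip(cols, means)]
            for ci, mi in zip(cols, means)]


def wilks_lambda(x: List[List[float]], groups: List[int], cols: Sequence[int]) -> float:
    """Ratio det(within-group SSCP) / det(total SSCP) for the chosen columns."""
    total = sscp(x, cols)
    within = [[0.0] * len(cols) for _ in cols]
    for g in set(groups):
        part = sscp([r for r, lab in zip(x, groups) if lab == g], cols)
        for i in range(len(cols)):
            for j in range(len(cols)):
                within[i][j] += part[i][j]
    return det(within) / det(total)


def forward_order(x: List[List[float]], groups: List[int], n_steps: int) -> List[int]:
    """Enter, at each step, the variable that gives the smallest Wilks' lambda."""
    chosen: List[int] = []
    remaining = list(range(len(x[0])))
    for _ in range(n_steps):
        best = min(remaining, key=lambda c: wilks_lambda(x, groups, chosen + [c]))
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Hypothetical data: six cases, three variables, two severity groups.
x = [
    [1.0, 5.0, 2.0], [1.2, 3.0, 2.5], [0.9, 4.0, 1.8],
    [3.0, 4.5, 2.6], [3.1, 3.5, 3.0], [2.9, 4.2, 2.4],
]
groups = [0, 0, 0, 1, 1, 1]
order = forward_order(x, groups, n_steps=2)
```

In this toy data set the first column separates the two groups almost completely, so it is entered first; the order of entry is the quantity that the discriminant analysis reports for each of the three classifications.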
Some general conclusions can be drawn from these discriminant
analyses, however: First, the effect that appears to be most important
overall is the insolation; cloud cover and total and average insolation
are among the first variables to be entered in each case. Next most
important appear to be precursor emissions, since all three of these vari-
ables are brought in among the first eight or so. After these two
factors, it appears that some measure of ventilation (that is, a wind
speed or a mixing height, or both) is brought in. Summer temperatures do
not appear to have great importance in the classifications; they are only
used after many other variables have been brought in.
B. CLUSTER ANALYSIS
We had hoped that the discriminant analysis would give us a clear
picture of the most influential variables to include in a cluster analy-
sis. Because this did not happen, we tried clustering the cases on the
basis of several different sets of variables. According to Hartigan
(1975), this method can be used to test the stability of the clustering
process; that is, clusters that persist for different combinations of
variables have a greater probability of representing a real effect.
Accordingly, we carried out clustering using the program BMDP2M (Dixon and
Brown, 1979), with the following sets of variables:
1) All variables (ozone levels, meteorological variables,
emissions).
2) Meteorological and emissions variables.
3) Meteorological variables excluding temperature variables.
4) Ozone level variables.
5) Meteorological variables.
6) Emissions variables.
The clustering based on all variables should give an indication of
overall effects. Analyses 2 and 3 give a clustering based on ozone forma-
tion potential, analysis 3 being restricted by eliminating temperatures,
which were shown to be relatively unimportant by the discriminant analy-
-------
TABLE 6. VARIABLES ENTERED AND PERCENT OF CASES CLASSIFIED CORRECTLY AT EACH STEP OF DISCRIMINANT ANALYSIS
(Table contents illegible in this copy.)
-------
sis. Analyses 4, 5, and 6 should show groups of cities that have similar
overall ozone problems, similar meteorology, and similar emissions,
respectively.
Output from BMDP2M includes a dendrogram (Everitt, 1977). The
dendrogram based on all variables is given in figure 8. Identification of
clusters is still, to a degree, a matter of judgment, but the dendrogram
gives quantitative information on the similarities between cases. The
distance measure used is the Euclidean distance between cases j and k:
$$d_{jk} = \left[ \sum_{i=1}^{p} \left( x_{ij} - x_{ik} \right)^{2} \right]^{1/2}$$
where x_{ik} is the value of the i-th variable in the k-th case and p is
the number of variables. The program
standardizes the variables to z-scores (subtract mean and divide by
standard deviation) so that distances are comparable for all variables. The
dendrogram produced by the program is based on the single-linkage algorithm
(Everitt, 1977). In this method, cases are joined according to the
distance between them, with the closest being joined first. (The separation
between cases is read from the distance scale on the figure.) Clusters
are identified visually, and several can be seen in figure 8, though the
appearance of this dendrogram is indicative of little group structure
(Everitt, 1977). Five clusters can be tentatively identified:
1) Boston, Hartford, Bridgeport, New Haven, Providence,
Worcester, Springfield, Scranton.
2) St. Louis, Kansas City.
3) Cleveland, Detroit, Milwaukee.
4) Baltimore, Washington, Allentown, Richmond, Youngstown,
Dayton, Indianapolis, Louisville, Cincinnati,
Philadelphia.
5) Fresno, Sacramento.
The first cluster would represent urban areas in New England, and the
second, the midcontinent. The third cluster has cities in the Great Lakes
area, and the fourth includes the Ohio River valley and the East Coast.
Cluster 5 has warm, dry, interior California areas. Thus, these clusters
can be interpreted mainly on a geographical basis.
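The procedure described above (z-score standardization, Euclidean distance
between cases, and single-linkage joining) can be sketched in a few lines.
This is a minimal illustration and not the BMDP2M implementation: the case
labels and data values are hypothetical, the function names are invented for
the sketch, and the use of the sample standard deviation is an assumption.

```python
import math
from statistics import mean, stdev


def zscores(rows):
    """Standardize each column: subtract its mean, divide by its standard deviation."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [stdev(c) for c in cols]  # sample standard deviation (an assumption)
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]


def euclidean(a, b):
    """Euclidean distance between two cases given as equal-length lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_linkage(points):
    """Return the merge history as (distance, members_a, members_b) tuples."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: cluster distance is the closest pair of members.
                d = min(euclidean(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((d, sorted(clusters[a]), sorted(clusters[b])))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges


# Hypothetical cases (rows) by variables (columns); not data from the report.
cases = ["A", "B", "C", "D"]
data = [[10.0, 1.0], [11.0, 1.2], [20.0, 3.0], [21.0, 3.1]]
merges = single_linkage(zscores(data))
for d, left, right in merges:
    print(round(d, 3), [cases[i] for i in left], [cases[i] for i in right])
```

Each tuple in the merge history corresponds to one join in a dendrogram, with
the join height read from the first element.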
Figure 9 shows the dendrogram based on meteorological and emissions
variables. Again, there is a lack of obvious clusters, though more group
structure is apparent than in figure 8. However, it is hard to identify
many clear clusters. Possible clusters are:
1) Boston, Scranton, Springfield, Worcester, Milwaukee,
Minneapolis, Cleveland, Bridgeport, New Haven, Hartford,
Trenton, Philadelphia, Providence.
2) St. Louis, Indianapolis, Louisville, Cincinnati.
3) Dayton, Allentown, Washington, Youngstown, Richmond,
Baltimore.
After these three clusters are identified, the remainder show several
pairs or small groups of similar cases:
4) Detroit, Chicago.
5) Ventura, San Francisco.
6) Butte, Salt Lake City, Denver.
7) Miami, New Orleans, Dallas, Houston.
8) Fresno, Sacramento.
Again, there are the obvious geographical connotations to the clusters,
except for cluster 1, which consists mostly of the New England area but
also includes Milwaukee, Minneapolis, and Cleveland.
When temperatures are left out of the analysis, we obtain the dendrogram
in figure 10. There is a little more structure in this diagram, and
we identify these clusters:
1) Springfield, Worcester, Bridgeport, New Haven, Hartford,
Providence.
2) Scranton, Trenton.
3) St. Louis, Indianapolis, Louisville, Dayton, Cincinnati,
Milwaukee, Minneapolis, Cleveland.
4) Baltimore, Richmond, Youngstown, Washington, Allentown.
5) Seattle, Philadelphia.
6) Fresno, Sacramento.
On the basis of the analyses performed to this point, no obvious pat-
tern emerges. As with the discriminant analysis, the results obtained
appear to depend more on the details of the analysis than on any underlying
structure in the data. The problem may lie with the single-linkage algorithm
used in BMDP2M, which does not deal effectively with noisy data even when
there is clear structure (Everitt, 1977). Possibly a different algorithm, or
use of a divisive rather than an agglomerative technique, would be more
successful.
Dendrograms based on ozone levels, meteorological variables, and emissions
variables alone are given in figures 11, 12, and 13. In these cases
the algorithm has been more successful in identifying clusters, and these
clusters are listed in table 7. As would be expected, clustering based on
meteorology alone produces geographically close groups. The other two
types of variables, however, produce clusters that do not have any geo-
graphical component to them at all. For instance, it appears that Boston
and Seattle resemble each other in terms of their ozone levels. The
values of the variables for these two cities are, respectively, maximum
ozone, 16.9 and 16.0 pphm; second highest ozone, 13.8 and 14.0 pphm;
average ozone, 5.7 and 4.3 pphm; and numbers of exceedances, six and
four. Similarly, based on emissions, St. Louis and San Diego are in the
same cluster. Their emissions are, respectively: HC, 127,000 and
137,000; HC/NOX, 1.42 and 1.42; and CO, 495,000 and 1,000.
-------
TABLE 7. SUMMARY OF CLUSTERS BASED ON OZONE, METEOROLOGICAL, AND
EMISSIONS VARIABLES
(a) Clusters Based on Ozone Variables (figure 11)
1) Boston, Trenton, Worcester, Seattle, Phoenix, Fresno,
   Denver, Dallas, Youngstown.
2) Philadelphia, Louisville, Chicago, San Diego, Scranton,
   Allentown, Pittsburgh, Washington, Providence, San
   Francisco, Detroit, Hartford, Sacramento, Kansas City,
   Baltimore, Cincinnati, Milwaukee, New York.
3) Cleveland, St. Louis, Bridgeport, New Haven, St. Louis,
   Houston.
4) Miami, Minneapolis, Butte, Portland.
(b) Clusters Based on Meteorological Variables (figure 12)
1) Boston, Worcester, Providence, New Haven, Bridgeport,
   Hartford.
2) Chicago, Milwaukee, Detroit, Minneapolis, Cleveland.
3) Washington, Richmond, Louisville, Cincinnati,
   Indianapolis, Dayton, St. Louis.
4) Pittsburgh, Youngstown.
5) Baltimore, Allentown, Trenton, Philadelphia.
6) Ventura, Los Angeles.
(c) Clusters Based on Emissions Variables (figure 13)
1) St. Louis, San Diego, San Francisco, Milwaukee.
2) Ventura, New Haven, Indianapolis, Hartford, Denver,
   Louisville, Scranton, New Orleans, Kansas City,
   Trenton.
3) St. Louis, San Bernardino, Pittsburgh.
4) Baltimore, Allentown, Richmond, Youngstown, Providence,
   Washington.
5) Springfield, Portland, Fresno, Dayton, Worcester,
   Sacramento, Bridgeport, Cincinnati.
6) Phoenix, Minneapolis, Seattle, Miami, Dallas,
   Philadelphia.
-------
IV SUMMARY AND RECOMMENDATIONS
The analyses carried out for this study do not lead to a definite
conclusion about the possibility of classifying cities by using combina-
tions of characteristics such as we have used here. On the one hand, the
profile analysis appears qualitatively to show that there are definite
resemblances and differences, and the discriminant analysis was reasonably
successful in classifying the ozone problems of the cities on the basis of
a set of variables that included both meteorology and emissions. On the
other hand, the agglomerative hierarchical clustering algorithm with which
we attempted some quantitative clustering failed to achieve a clear-cut
classification. This technique is known to be susceptible to failure in
the presence of noisy data, and it is possible that a different agglomera-
tive algorithm (e.g., Ward, 1963) or a divisive technique such as the
Automatic Interaction Detector (A.I.D.) method (Sonquist and Morgan, 1963,
1964) could give better results. These methods are more robust in the
presence of noisy data.
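How much the choice of agglomerative rule matters can be seen in a small
sketch that compares single linkage with Ward's (1963) criterion. The sketch
is written with the Lance-Williams distance-update formulas applied to squared
Euclidean distances, which is a choice made here for compactness and not
something specified in this report. The function name and the one-variable
data set (two loose groups plus one outlying case) are hypothetical.

```python
from itertools import combinations


def agglomerate(points, method, n_clusters):
    """Merge clusters with a Lance-Williams update until n_clusters remain."""
    members = {i: [i] for i in range(len(points))}
    # Start from squared Euclidean distances between individual cases.
    dist = {
        frozenset((i, j)): sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
        for i, j in combinations(members, 2)
    }
    next_id = len(points)
    while len(members) > n_clusters:
        pair = min(dist, key=dist.get)  # closest pair of current clusters
        i, j = tuple(pair)
        d_ij = dist.pop(pair)
        ni, nj = len(members[i]), len(members[j])
        new_dist = {}
        for k in members:
            if k in (i, j):
                continue
            d_ki = dist.pop(frozenset((k, i)))
            d_kj = dist.pop(frozenset((k, j)))
            nk = len(members[k])
            if method == "single":
                d = min(d_ki, d_kj)
            elif method == "ward":
                d = ((ni + nk) * d_ki + (nj + nk) * d_kj - nk * d_ij) / (ni + nj + nk)
            else:
                raise ValueError(method)
            new_dist[frozenset((k, next_id))] = d
        members[next_id] = members.pop(i) + members.pop(j)
        dist.update(new_dist)
        next_id += 1
    return sorted(sorted(m) for m in members.values())


# Hypothetical one-variable data: two loose groups and one outlying case.
points = [[0.0], [1.0], [2.1], [4.5], [5.7], [8.7]]
single_result = agglomerate(points, "single", 2)
ward_result = agglomerate(points, "ward", 2)
print("single:", single_result)
print("ward:  ", ward_result)
```

On this toy data, single linkage isolates the outlying case and lumps the
remaining cases together, whereas Ward's criterion splits the cases into two
groups of three. The example only illustrates that the linkage rule can change
the grouping; it says nothing about how the cities in this study would cluster.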
We believe that the results presented here indicate that classifica-
tion techniques can be used to identify urban areas with similar ozone
problems. However, more work is necessary to determine the best group-
ings. One possible approach would be to apply principal components or
factor analysis to identify groups of variables that best account for
variations in the data. An alternative to this approach would be to apply
insights into the physical nature of the problem. Once an appropriate set
of variables has been identified, clustering algorithms could be applied
to the data; many of these algorithms can be found in the work of Hartigan
(1975).
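As a rough illustration of the principal components step suggested above, the
following sketch computes the correlation matrix of a small data set and
extracts its first principal component by power iteration. The data values and
function names are hypothetical, and power iteration is used only to keep the
sketch free of external libraries; it assumes that the largest eigenvalue is
unique and that the starting vector is not orthogonal to its eigenvector.

```python
import math
from statistics import mean, stdev


def correlation_matrix(rows):
    """Correlation matrix of the columns of rows (cases by variables)."""
    cols = list(zip(*rows))
    n = len(rows)
    z = []
    for c in cols:
        m, s = mean(c), stdev(c)
        z.append([(v - m) / s for v in c])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z] for zi in z]


def first_component(matrix, iterations=200):
    """Power iteration: dominant eigenvalue and unit eigenvector of a symmetric matrix."""
    v = [1.0] * len(matrix)
    for _ in range(iterations):
        w = [sum(m * x for m, x in zip(row, v)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient of the converged unit vector gives the eigenvalue.
    eigenvalue = sum(vi * sum(m * x for m, x in zip(row, v))
                     for vi, row in zip(v, matrix))
    return eigenvalue, v


# Hypothetical data: variable 2 duplicates variable 1 (scaled), and
# variable 3 is uncorrelated with both.
rows = [[1.0, 2.0, 1.0], [2.0, 4.0, -1.0], [3.0, 6.0, -1.0], [4.0, 8.0, 1.0]]
corr = correlation_matrix(rows)
eigenvalue, loadings = first_component(corr)
share = eigenvalue / len(corr)  # fraction of total standardized variance
print(round(eigenvalue, 3), [round(x, 3) for x in loadings], round(share, 3))
```

Variables that load heavily on the same component carry largely redundant
information, so a reduced set of variables (or the component scores
themselves) could then be passed to a clustering algorithm.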
-------
REFERENCES
Dixon, W. J., and M. B. Brown, eds. (1979), Biomedical Computer Programs
P-Series, Systems, Program and Statistical Development (University of
California Press, Berkeley, California).
EPA (1977a), "Uses, Limitations and Technical Basis of Procedures for
Quantifying Relationships Between Photochemical Oxidants and
Precursors," EPA-450/2-77-021a, U.S. Environmental Protection Agency,
Research Triangle Park, North Carolina.
EPA (1977b), "National Air Quality and Emissions Trends Report, 1976,"
EPA-450/1-77-002, U.S. Environmental Protection Agency, Research
Triangle Park, North Carolina.
EPA (1974), "Monitoring and Air Quality Trends Report, 1973," EPA-450/1-
74-007, U.S. Environmental Protection Agency, Research Triangle Park,
North Carolina.
Federal Register (1979), Vol. 44, No. 221, Nov. 14, 1979.
Everitt, B. (1977), Cluster Analysis (Heinemann Educational Books, London,
England).
Hartigan, J. A. (1975), Clustering Algorithms (John Wiley & Sons, New
York, New York).
Holzworth, G. C. (1972), "Mixing Heights, Wind Speeds, and Potential for
Urban Air Pollution Throughout the Contiguous United States," AP-101,
U.S. Environmental Protection Agency, Research Triangle Park, North
Carolina.
The Climatic Atlas of the United States, 1968 (U.S. Department of
Commerce, Washington, D.C.).
Sonquist, J. A., and J. N. Morgan (1964), "The Determination of
Interaction Effects," Survey Research Centre, Institute of Social
Research, University of Michigan.
Sonquist, J. A., and J. N. Morgan (1963), "Problems in the Analysis of
Survey Data and a Proposal," J. Am. Stat. Assoc., Vol. 58,
pp. 415-435.
Ward, J. H. (1963), "Hierarchical Grouping to Optimize an Objective
Function," J. Am. Stat. Assoc., Vol. 58, pp. 236-244.
-------
TECHNICAL REPORT DATA
(Please read Instructions on the reverse before completing)
1. REPORT NO.
EPA-450/4-81-031e
2.
3. RECIPIENT'S ACCESSION NO.
4. TITLE AND SUBTITLE
Evaluating Simple Oxidant Prediction Methods Using
Complex Photochemical Models: Cluster Analysis Applied
to Urban Ozone Characteristics
5. REPORT DATE
August 1981
6. PERFORMING ORGANIZATION CODE
7. AUTHOR(S)
Martin J. Hillyer
8. PERFORMING ORGANIZATION REPORT NO.
SAI No. 81176
9. PERFORMING ORGANIZATION NAME AND ADDRESS
Systems Applications, Incorporated
101 Lucas Valley Road
San Rafael, California 94903
10. PROGRAM ELEMENT NO.
11. CONTRACT/GRANT NO.
68-02-2870
12. SPONSORING AGENCY NAME AND ADDRESS
U.S. Environmental Protection Agency
Office of Air Quality Planning and Standards
Research Triangle Park, North Carolina 27711
13. TYPE OF REPORT AND PERIOD COVERED
14. SPONSORING AGENCY CODE
15. SUPPLEMENTARY NOTES
16. ABSTRACT
This report describes efforts to classify cities observing ozone levels greater than
0.12 ppm into distinct subgroups. Cluster analysis, using such factors as mixing height,
wind speed, temperature, NMOC/NOX ratio and type of precursor sources, is used to
identify subgroups of cities. Identification of a limited number of such subgroups
could provide a means for more convincingly demonstrating the general applicability
of complex photochemical models by conducting validation exercises in cities repre-
sentative of each subgroup. The report indicates that the technique shows promise
but, nevertheless, requires some further refinement before it can be used to identify
the most appropriate subgroups.
17. KEY WORDS AND DOCUMENT ANALYSIS
a. DESCRIPTORS
Photochemical models
Ozone
Cluster analysis
b. IDENTIFIERS/OPEN ENDED TERMS
c. COSATI Field/Group
18. DISTRIBUTION STATEMENT
Unlimited
19. SECURITY CLASS (This Report)
20. SECURITY CLASS (This page)
21. NO. OF PAGES
42
22. PRICE
EPA Form 2270-1 (Rev. 4-77) PREVIOUS EDITION IS OBSOLETE
-------