EPA/600/A-97/081
6.4
AGGREGATION USING CLUSTER ANALYSES
FOR MODELS-3 CMAQ APPLICATIONS
Sharon LeDuc*, Brian Eder*, and Robin Dennis*
Atmospheric Sciences Modeling Division, Air Resources Laboratory,
National Oceanic and Atmospheric Administration, Research Triangle Park, NC 27711
Richard Cohn
Analytical Sciences, Inc.
Durham, NC 27713
1,
INTRODUCTION
Models-3, a framework for air quality modeling,
is scheduled for public release In June 1998. The
framework will support statistical analysis and presently
uses four SAS® modules (base, graph, AF and FSP)
in the emissions processing. Statistical capability in
Models-3 is also found in PAVE (Thorpe, 1996) which
displays time series plots. Statistics such as mean,
median, and percentiles can be plotted from Models-3
with IBM's Visualization Data Explorer.
The statistical tool described in this paper is
for policy planning for air quality issues related to
annual or multi-year measures, rather than the most
extreme events. Scientifically credible and reliable
estimates of air quality for large regions rely on air
quality models, such as the Community Multiscale Air
Quality (CMAQ) model in Models-3. Application of such
models requires massive resources, both human and
computer, for each policy and/or meteorological
scenario. Analysis of benefits proposed for the Clean
Air Act Amendments of 1990 requires annual
timescales. Unfortunately, CMAQ model, like most
Eulerian models, challenges the practical limits of
current computer resources as well as our ability to
collect the pertinent input data on annual scales. As a
result, applications to determine the long-term
relationship between changing emissions patterns and
ambient air concentrations are resource intensive.
To circumvent this problem, a statistical
aggregation method, initially developed for RADM acid-
deposition applications (Brook et al., 1995), will be
modified to provide estimates of long-term (seasonal or
annual) ambient air concentrations, wet and dry
deposition amounts, and measures related to visibility.
An important feature of the aggregation method is to
represent source attribution. The method uses cluster
analysis based on the premise that, at any given
Corresponding author address: Sharon LeDuc (MD-80).
On assignment to National Exposure Research
Laboratory, U.S. Environmental Protection Agency,
Research Triangle Park, NC 27711; email:
leduc@hpcc.epa.gov
location, ambient air concentrations (also deposition
amounts) can be represented by a finite number of
different, though recurring, meteorological regimes.
The meteorological regimes identified for RADM
considered only the eastern U.S. and Canada where
acid deposition was the issue. Now air quality issues,
such as regional haze, are more geographically
extensive. The sample selection and aggregation
weightings used with RADM need to be reexamined for
use with these issues.
2.
2.1
DATA
Meteorological
To accommodate the continental domain and
to achieve sufficient spatial resolution, the cluster
analysis uses data at 336 grid nodes with 2.5° spatial
resolution from the NCEP/NCAR 40-year reanalysis
project (Kalnay et al,, 1996). The RADM clusters were
defined using 850mb winds, but because of the
mountains in the western U.S. the 700 mb wind
components for 1800 UTC were used here. The
domain is designed to prevent excessive influence from
ocean-based meteorology. Since a model (RADM or
CMAQ) is usually run for a 5-day period (the first two
days establish initial conditions with model predictions
from days 3-5 saved as a "3-day event"), 5-day periods
were clustered instead of 3-day periods used for
RADM.
2.2
Air Quality
Assignment of a 5-day period to a cluster, will
determine how air quality data, model estimates or
monitoring measures, for that period will be used to
estimate annual statistics. Evaluating how well the
meteorologically derived clusters relate to air quality
requires air quality monitoring data for the same period
of record as the meteorological data. Air quality data
are more limited than the meteorological data.
Surrogate air quality data, derived from human
observations of visible range at airports, will be used
first. The near noon observation, converted to an
extinction coefficient (Husar and Wilson, 1993) has an
-------
inverse relationship with fine particles in the air. Later,
other sources of air quality data will be used for shorter
time periods: National Atmospheric Deposition
Program (NADP); Clean Air Status and Trends Network
(CASTNet); Interagency Monitoring of Protected Visual
Environments (IMPROVE); and Aerometric Information
Retrieval System (AIRS).
3. METHODOLOGY
3.1 Clustering
The purpose of objectively defining
meteorological categories is to identify recurring
atmospheric transport patterns associated with varying
concentration and deposition patterns of air pollutants.
Identification of these patterns facilitates selection of
time periods, i.e. the sample, for simulation by CMAQ.
Model output from the sample will be weighted in the
aggregation based on population weights of the dusters
or strata. Representative meteorological categories
have been explored by others (Fernau and Samson,
1990; Davis and Kalkstein, 1990). The approach used
here is based on a variation of the methods previously
used by Brook ef a/, (op cif) in selecting a RADM 30-
episode sample for aggregation.
The common thread is the cluster analysis of
zonal u and meridional v wind components. A 10-year
period (1980-1990) was used. To make the analysis
computationally feasible, the first, third, and fifth days of
each 5-day episode were considered. Clusters were
initially defined based upon "consecutive" rather than
"running" or overlapping 5-day periods from 1980-1985.
Then, each remaining episode ("running" 5-day periods
from 1980 through 1990) was classified into the cluster
that minimizes the sum (over the 336 grid nodes and
three days) of the squared deviations of each u and v,
Cluster analyses were carried out using SAS®.
However, due to the extreme computational burden of
these analyses, it was necessary to calculate the
distance matrix externally from the clustering procedure
itself.
Winds can be defined in polar coordinates as
well as in the Euclidean u and v coordinate system.
Preliminary cluster analyses investigated polar
coordinate systems, clustering with the angle (direction)
defined by the wind vector. Results suggested that
clusters defined with polar coordinates didn't
discriminate seasonal differences in wind vector
patterns as well as clusters defined using Euclidean
coordinates. Later analyses with u and v coordinate
investigated four clustering variations:
• 30 clusters, defined using annual data
(consecutive 5-day periods from 1980-1985,
as described above)
• 30 clusters, defined seasonally (15 clusters
defined from the warm season Apr.-Sept.
period, and 15 clusters defined from the cold
season Oct.-Mar. period)
• 60 clusters, defined using annual data
• 60 clusters, defined seasonally (30 clusters
defined from the warm season Apr.-Sept.
period, and 30 clusters defined from the cold
season Oct.-Mar. period)
For the annual defined clusters, remaining 5-
day events (running 5-day periods from 1980-1990)
were then classified based on minimizing the squared
distance from the cluster average. For the seasonally
defined clusters, the remaining 5-day events were
classified into clusters defined for the same season as
the event. For the 30-cluster analyses, cluster
averages were displayed as dots on maps illustrating
the intracluster variability in the wind fields. Star chart
histograms were used to illustrate the frequency of
occurrence of events from each cluster, for each month
of the year.
3.2
Aggregation
The aggregation approach is based upon
weights determined for meteorological categories that
account for a significant proportion of the variability.
These categories need to account for variation in the
air quality measures as well. Within and between
cluster variability of air quality measures will be
examined. The extinction coefficient will be used first
as was done for RADM (Eder et a/., 1996). Other air
quality characterizations wili be evaluated with available
air quality data sets. Air quality is feature of interest,
but what is considered when weights are based on
strata of wind, are transport mechanisms involved in the
associated atmospheric processes, and in particular
that source-attribution analyses be facilitated. This
requires that clusters reflect wind flow parameters.
Other parameters are important, but may not be
necessary in defining strata or clusters since wind field
patterns in essence describe frontal passages, along
with their meteorological properties. Evaluation of
aggregation results may require additional parameters
in the defining of clusters.
4. RESULTS AND FUTURE WORK
Maps of mean wind vectors were done for
each of the 30 clusters defined using annual data. The
clusters are ordered according to overall frequency of
occurrence, with Cluster 1 being most prevalent and
Cluster 30 least prevalent. Mean vectors for day 5 of
the 5-day events of Cluster 11 (Fig.la) illustrate
average behavior associated for a cluster, but do not
indicate the variability inherent in the cluster. For
example (Figure 1 b) illustrates the wind field for the fifth
day of an individual event (Dec. 19-23,1989) that was
assigned to Cluster 11 (Fig.la). Compared to the
mean wind field for day 5 of Cluster 11 reveals fairly
close resemblance between the two. By contrast,
(Fig.lc) depicts the fifth day from another event (Dec.
3-7,1990) belonging to Cluster 11. This pattern does
-------
not resemble the mean wind field nearly as well.
Simultaneously viewing all of the wind fields
assigned to the fifth day of Cluster 11 (Fig.ld) shows
the mean wind vectors for day 5 of Cluster 11 on a
thinned-out grid that only includes alternating grid
nodes. Each group of dots depicts the location of the
wind vector arrowheads for individual events assigned
to this duster and collectively illustrate the distribution
of arrowheads for all events belonging to Cluster 11.
The star chart histograms of the 30 clusters
defined using annual data (Figure 2) illustrate the
frequency of occurrence of 5-day events belonging to
several clusters. Events from Cluster 1 accounted for
13.88% of all 5-day events between 1980 and 1990,
those from Cluster 8 accounted for 3.76% and Cluster
11, 3.29%. The numbers arranged radially on each
chart depict the number of events belonging to the
duster from each month of the year. The length of the
line pointing to each month is proportional to this
frequency of occurrence, and the ends of the lines are
connected to facilitate the visualization of patterns.
Several observations may be made:
• Although defined using annual data, the
cluster frequencies reveal definite seasonal
tendencies. Clusters do not occur randomly
throughout the year, but rather exhibit a
tendency to occur more frequently within
specific seasons. The clustering procedure
successfully identifies and discriminates wind
field patterns that are associated with
seasonally distinct meteorological classes.
• While dusters containing summer events tend
to be quite distinct from those containing
winter events (and vice versa), many clusters
contain events from a combined "transitional"
season that includes spring and fall months.
• The three most prevalent clusters heavily
emphasize the summer months; however,
these months are rarely represented by the
remaining 27 clusters (not shown).
The disportionate representation of non-summer events
in the set of 30 clusters is not surprising, since the wind
fields are expected to be less variable in the summer.
Seasonal differences in meteorology and atmospheric
chemistry are important in explaining the variability
exhibited by the air quality parameters of interest.
Evaluating the relationship of these clusters to air
quality is still in progress.
5.
REFERENCES
Brook, J.R, Samson, P.J., and Sillman, S., 1995:
Aggregation of selected three-day periods to
estimate annual and seasonal wet
deposition total for sulfate, nitrate and
acidity. Part I: A synoptic and chemical
climatology for Eastern North America. J.
Applied. Meteor. 34, 297-325.
Brook, J.R.; Samson, P.J.; and Sillman, S., 1995:
Aggregation of selected three-day periods to
estimate annual and seasonal wet
deposition total for sulfate, nitrate and acidity
Part II: Selection of events, deposition totals
and source-receptor relationships. J. Appl d.
Meteor. 34, 326-339.
Davis, R. E. and Kalkstein, L. S., 1990: Development
of an automated spatial synoptic
climatological classification. International
Journal of Climatology 10, 769-794.
Eder, B.K. and LeDuc, S.K., 1996: Can selected
RADM simulations be aggregated to
estimate annual concentrations of fine
particulate matter. Preprints of the 11th
Annual International Symposium on the
Measurement of Toxics and Related Air
Pollutants, RTP, NC, pp. 732-739.
Eder, B.K. and LeDuc, S.K. and F.D.Vestal., 1996:
Aggregation of selected RADM simulations
to estimate annual ambient air
concentrations of fine particulate matter.
Preprints of AMS 9th Joint Conference on
the Application of Air Pollution Meteorology
withAWMA, Jan. 28-Feb. 2, 1996, Atlanta,
GA, pp. 390-392.
Fernau, M.E. and Samson, P.J., 1990: Use of cluster
analysis to define periods of similar
meteorology and precipitation chemistry in
eastern North America. Part I: Transport
patterns. J. of Applied Meteor. 29, 735-750.
Husar, R.B. and W.E. Wilson, 1993. Haze and sulfur
emission trends in the eastern U.S..
Environ. Sci. Technol.. 27, 13-16.
Kalnay, E., M.Kanamitsu, R.Kistler, W.Collins, D.
Deaven, LGandin, M.lredell, S. Saha,
G.White.J.Woollen. Y.Zhu,M.Chelliah,
W.Ebisuzaki.W. Higgins.J.Janowiak,
K.C.Mo.C. Ropelewski, J.Wang, A
.Leetmaa,.R.Reynolds,R.Jenne,and
D.Joseph, 1996: The NCEP/NCAR 40-year
reanalysis project. Bulletin of the American
Meteorological Society 77, 437-471.
Thorpe, S., D.Hwang.W.T.Smith.T.LTurner, 1996.
The Package for Analysis and Visualization
of Environmental Data. Proc. of Computing
in Environmental Resource Management,, 2-
4 Dec., RTP, NC, AWMA 241-249.
This paper has been reviewed in accordance with the U. S.
Environmental Protection Agency's peer and administrative review
policies and approved for presentation and publication. Mention of
trade names or commercial products does not constitute endorsement
or recommendation for use.
-------
a)
c)
b)
d)
Figure 1 700 mb Wind Vector for Cluster 11, Day 5: a) mean; b)
Dec. 23, 1989; c) Dec, 7, 1990; d) variability
Figure 2 Star chart for Clusters 1, 8, 11
-------
*
1. REPORT NO.
PA/600/A-97/081
TECHNICAL REPORT DATA
2.
4. TITLE AND SUBTITLE
Aggregation Using Cluster Analyses for Models-3 CMAQ Applications
7. AUTHOR(S)
Sharon LeDuc, et al, (See title page)
9. PERFORMING ORGANIZATION NAME AND ADDRESS
Atmospheric Modeling Division
National Exposure Research Laboratory
U.S. Environmental Protection Agency
Research Triangle Park, NC 2771 1
12. SPONSORING AGENCY NAME AND ADDF
NATIONAL EXPOSURE RESEARCH
OFFICE OF RESEARCH AND DEVEL
U.S. ENVIRONMENTAL PROTECT1O
RESEARCH TRIANGLE PARK, NC 27
£SS
LABORATORY
OPMENT
N AGENCY
711
3.
5.REPORTDATE
6.PERFORMING ORGANIZATION CODE
8.PERFORMFNG ORGANIZATION REPORT NO.
10.PROGRAM ELEMENT NO.
1 1 . CONTRACT/GRANT NO,
I3.TYPE OF REPORT AND PERIOD COVERED
14. SPONSORING AGENCY CODE
15. SUPPLEMENTARY NOTES
16. ABSTRACT
Models-3, a framework for air quality modeling, is scheduled for public release in June 1998. The framework will support
statistical analysis, and presently uses four SAS® modules (base, graph. AF and FSP) in the emissions processing. Statistical capability in
Models-3 is also found in PAVE (Thorpe , 1996) which displays time series plots. Statistics such as mean, median, and percentiles can be
plotted from Models-3 with IBM's Visualization Data Explorer .
The statistical tool described in this paper is for policy planning for air quality issues related to annual or multi-year measures.
rather than the most extreme events. Scientifically credible and reliable estimates of air quality for large regions rely on air quality
models, such as the Community Multiscale Air Quality (CMAQ) model in Models-3. Application of such models requires massive
resources, both human and computer, for each policy and/or meteorological scenario. Analysis of benefits proposed for the Clean Air Act
Amendments of 1990 requires annual timescales. Unfortunately, CMAQ model, like most Eulerian models, challenges the practical limits
of current computer resources as well as our ability to collect the pertinent input data on annual scales. As a result, applications to
determine the long-term relationship between changing emissions patterns and ambient air concentrations are resource intensive.
To circumvent this problem, a statistical aggregation method, initially developed for RADM acid-deposition applications (Brook
et al., 1 995), will be modified to provide estimates of long-term (seasonal or annual) ambient air concentrations, wet and dry deposition
amounts, and measures related to visibility. The aggregation methods, which use cluster analysis, are based on the premise that, at any
given location, ambient air concentrations (also deposition amounts)
are governed by a finite number of different, though recurring, meteorological regimes. The meteorological regimes identified for RADM
considered only the eastern U.S. and Canada where acid deposition was the issue. Now air quality issues, such as regional haze, are more
geographically extensive. The sample selection and aggregation weightings used with RADM need to be reexamined for use with these
issues.
17.
a DESCRIPTORS
KEY WORDS AND DOCUMENT ANALYSIS
b.IDENTIFIERS/ OPEN ENDED TERMS c.COSATl
1 8. DISTRIBUTION STATEMENT
Release to Public
19. SECURITY CLASS (This Report) 21. NO. OF PAGES
Unclassified
20. SECURITY CLASS (This Page) 22. PRICE
Unclassified
------- |