EPA/600/A-97/046
OZONE: MODELING AND MONITORING AN ATMOSPHERIC POLLUTANT
This article reports research on design and evaluation of ambient air quality
monitoring networks, focused on ozone, performed under cooperative research agreement
CR819638-01 between U.S. EPA and the National Institute of Statistical Sciences. Douglas
Nychka was lead investigator. Space limitations allow only a summary here; principal results
and references can be found in Nychka, Yang and Royle (1997). The information in this
document has been funded wholly or in part by the U.S. Environmental Protection Agency.
It has been subjected to Agency review and approved for publication. Mention of trade
names or commercial products does not constitute endorsement or recommendation for use.
1. Environmental Issues
An important problem in monitoring pollution is selection of monitoring locations.
This paper is concerned with monitoring ambient (tropospheric) ozone. High ozone levels affect
human health and also agricultural crops. Despite the focus on ozone, many of our statistical
principles and computational algorithms can be extended to other pollutants and media.
Tropospheric ozone exhibits variability over space due to heterogeneity in formation
and transport. Monitoring networks comprise a small number of locations, so one concern is
a network's ability to estimate pollutant levels at unobserved locations. This is important to
quantify exposure and to validate numerical atmospheric pollution models. Key objectives are
to evaluate the performance of the current set of monitoring locations and to quantify the
effect of modifications to this network. Key problems are how to add sites to improve
regional or trend estimates and how to reduce the number of sites while retaining as much
accuracy as possible. An unmonitored region presents the design problem of specifying a
new network, and solving the same problem also provides a benchmark for assessing the
improvement resulting from redesign of an existing network. For ozone there is significant
temporal variability at each location, and a more comprehensive approach to network design
would reflect mixed spatial-temporal structure. Here measurements are averaged over time so
the temporal component is ignored. Design problems aimed at producing good spatial
summaries can be complementary to those aimed at prediction at unobserved locations.
Lawrence H. Cox
U.S. Environmental Protection Agency
NERL (MD-75)
Research Triangle Park, NC 27711
USA
Douglas Nychka
North Carolina State University
Department of Statistics
Raleigh, NC 27695
USA

2. Statistical Considerations and Data
Statistical models for the spatial distribution of a pollutant are based on random fields.
Pollutant levels are assumed to be randomly distributed but spatially correlated. Spatial
prediction and evaluation of a monitoring network hinges on the assumed spatial covariance
function between pollutant concentrations at different locations. Extrapolation beyond network
locations is based on a spatial statistical model for the pollutant. The typical assumption of
spatial isotropy, viz., correlation depends on distance but not location, is not reasonable for
ozone, however. Nonstationary models can be estimated from observational data or inferred
from numerical models. Even with perfect knowledge of covariance structure, it is difficult
to compute statistically optimal designs for large numbers of locations ( > 30) and
complicated covariance functions. Thus, the emphasis here is on constructing suboptimal solutions to
design and redesign problems that exhibit reasonable properties. In evaluating a network it is
natural to plot locations and judge whether the distribution of points is reasonable. This can
be quantified by geometric design criteria that measure how well the design covers a region.
The use of geometric criteria is motivated also by theoretical relationships between optimal
designs and geometric criteria. Furthermore, geometric designs can be readily computed and
can be tailored to fit the irregular boundaries of geographical regions.
In order to construct credible spatial designs one needs statistical models for the
pollutant field. For ozone, two forms of data were used: observational data from the U.S.
NAMS/SLAMS network, and output from the Regional Oxidant Model (ROM). For
comparability, observational data were taken from summer 1987. Data from the Chicago
urban area for 1987-1991 were also used. Ozone measurements are in units of parts per
billion (ppb), aggregated hourly. The daily summary is the 8-hour average over hours 9-17.
Ozone levels peak in early afternoon, so typically this captures the maximum 8-hour average.
Due to gaps in observational data, ROM output is used to model spatial covariances.
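As an illustration of the daily summary described above, the following is a minimal sketch only. It assumes hourly values are stored one row per day with column h holding the hour beginning at h:00, so that "hours 9-17" is read as the eight hourly values 9 through 16; the function name, array layout, and NaN convention for missing hours are assumptions for illustration and are not taken from the original study.

```python
import numpy as np

def daily_daytime_average(hourly_ppb, start_hour=9, end_hour=17):
    """Average hourly ozone (ppb) over [start_hour, end_hour) for each day.

    hourly_ppb: array of shape (n_days, 24); column h is ASSUMED to hold the
    value for the hour beginning at h:00. NaN marks a missing hour.
    """
    hourly_ppb = np.asarray(hourly_ppb, dtype=float)
    window = hourly_ppb[:, start_hour:end_hour]      # eight columns: hours 9..16
    valid = ~np.isnan(window)
    counts = valid.sum(axis=1)
    sums = np.where(valid, window, 0.0).sum(axis=1)
    # Days with no valid hour in the window get NaN rather than 0.
    return np.where(counts > 0, sums / np.maximum(counts, 1), np.nan)
```

Averaging only the available hours is one possible way to handle gaps; a stricter rule (for example, requiring a minimum number of valid hours) may be preferable in practice.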
3. Design Criteria and Design Evaluation
Assume that the concentrations, or a transformation, are a realization of a Gaussian
random surface. Z(x) is measured ozone at location x. Assume Z(x) is normally distributed
with E(Z(x)) =0 and Cov(Z(x), Z(x')) = k(x, x'). Xj for 1 
-------
Another evaluation criterion is based on thinning a network: use the thinned network
to predict the full network average, and compare thin to full PSE. A third family of design
criteria is based on geometric measures of how well a design covers the region. The region
is reduced to a large finite set $C$ of candidate points; $D \subset C$ denotes the set of $N$ design points.
A distance metric between any point and a design is
$d_p(x, D) = \left( \sum_{u \in D} \|x - u\|^p \right)^{1/p}$. For $p < 0$,
$d_p(x, D) \to 0$ as $x$ converges to a member of $D$. For $q > 0$, an overall coverage criterion is
$C_{p,q}(D) = \left( \sum_{x \in C} d_p(x, D)^q \right)^{1/q}$. In the limit ($p \to -\infty$, $q \to \infty$), $C_{p,q}$ defines minimax space-filling designs:
the maximum of nearest neighbor distances of candidate points to design points is minimized.
A modification that comes closer to prediction variance and reflects spatial correlation scales
is: take $p = -1$ and replace the Euclidean norm in $d_p$ with $k(x, x) - k(x, u) = \sigma(x)^2 (1 - C(x, u))$,
where $C(\cdot, \cdot)$ is the correlation of the field, viz., a covariance filling criterion.
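The geometric coverage criterion can be evaluated directly from point coordinates. The sketch below assumes the point-to-design distance is the p-th power mean (sum over design points u of ||x - u||^p)^(1/p) and that the overall criterion is (sum over candidate points x of d_p(x, D)^q)^(1/q); the function names and the array layout (one row per point) are illustrative assumptions, not the authors' software.

```python
import numpy as np

def point_to_design_distance(candidates, design, p=-5.0):
    """d_p(x, D) = (sum_{u in D} ||x - u||^p)^(1/p) for each candidate x.

    candidates: (n_cand, dim) array; design: (n_design, dim) array; p < 0.
    """
    diff = candidates[:, None, :] - design[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))           # shape (n_cand, n_design)
    with np.errstate(divide="ignore"):
        powered = dist ** p                           # inf where dist == 0 (p < 0)
        return powered.sum(axis=1) ** (1.0 / p)       # inf ** (1/p) gives 0 for p < 0

def coverage_criterion(candidates, design, p=-5.0, q=5.0):
    """C_{p,q}(D) = (sum_{x in C} d_p(x, D)^q)^(1/q); smaller means better coverage."""
    d = point_to_design_distance(candidates, design, p)
    return (d ** q).sum() ** (1.0 / q)
```

With a strongly negative p the distance approaches the nearest-neighbor distance, and with a large q the criterion approaches the maximum of those distances, which is the minimax limit mentioned in the text.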
Designs can be computed using regression subset selection. $\bar{Z} = (1/N) \sum_i Z_i$ is the
full network average and $\hat{Z}$ an estimate based on a subset. A good design estimates $\bar{Z}$ with
small variance. Recall the regression formula
$(Y - \hat{Y})'(Y - \hat{Y}) = Y'(I - X(X'X)^{-1}X')Y = Y'Y - Y'X(X'X)^{-1}X'Y$. If $X'X = K$,
$X'Y = \gamma$, and $Y'Y = \mathrm{Var}(\bar{Z})$, the residual sum of squares (RSS) from regressing $Y$ on $X$
equals the mean squared error (MSE) of the kriging estimate for $\bar{Z}$: $X = K^{1/2}$ and $Y =
K^{-1/2}\gamma$. The regression subset selection problem is to find a subset of regression variables
that minimizes some criterion, typically RSS. A procedure known as Leaps searches for an
RSS minimizing subset of fixed size. An alternative for subset selection is the Lasso
procedure, which finds an optimal subset for which the sum of absolute-value regression
coefficients does not exceed a predetermined value. Leaps exploits the monotonicity of RSS
using a branch and bound strategy; Lasso is based on iterative minimization of weighted least
squares problems.
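The equivalence between regression RSS and kriging MSE can be checked numerically. The sketch below makes two assumptions that are not spelled out in the surviving text: K is taken to be the covariance matrix of the full network, and the cross-covariance vector is taken to be the covariance of each station with the full network average (K times a vector of ones, divided by N). An exhaustive search stands in for Leaps here; it is not the branch and bound procedure and is feasible only for small networks. All function names are hypothetical.

```python
import itertools
import numpy as np

def subset_mse(K, subset):
    """Kriging MSE for predicting the full-network average from `subset`."""
    n = K.shape[0]
    gamma = K.sum(axis=1) / n            # ASSUMED: Cov(Z_i, network average)
    var_avg = K.sum() / n ** 2           # variance of the network average
    idx = list(subset)
    K_ss = K[np.ix_(idx, idx)]
    g_s = gamma[idx]
    return var_avg - g_s @ np.linalg.solve(K_ss, g_s)

def subset_rss(K, subset):
    """RSS from regressing Y = K^{-1/2} gamma on chosen columns of X = K^{1/2}."""
    n = K.shape[0]
    w, V = np.linalg.eigh(K)             # K is symmetric positive definite
    X = (V * np.sqrt(w)) @ V.T           # symmetric square root of K
    Y = (V / np.sqrt(w)) @ V.T @ (K.sum(axis=1) / n)
    Xs = X[:, list(subset)]
    beta, *_ = np.linalg.lstsq(Xs, Y, rcond=None)
    resid = Y - Xs @ beta
    return resid @ resid

def best_subset(K, size):
    """Exhaustive stand-in for Leaps: the size-`size` subset with smallest MSE."""
    n = K.shape[0]
    return min(itertools.combinations(range(n), size),
               key=lambda s: subset_mse(K, s))
```

For any subset the two functions should agree up to rounding, which is the identity that lets standard subset selection software be reused for network design.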
Coverage designs are generated using a simple "swapping" algorithm. Given an
initial design one evaluates the effect on the coverage criterion of swapping the first point in
the design with all of the candidate points. The candidate (if one exists) that reduces the
coverage criterion by the greatest amount is swapped for the first design point.
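A minimal sketch of a swapping search follows. The text describes the swap step for the first design point; this sketch assumes the same step is then applied to each remaining design point in turn and that sweeps repeat until no swap improves the criterion. That looping rule, the function names, and the default p and q are assumptions for illustration.

```python
import numpy as np

def coverage(candidates, design_idx, p=-5.0, q=5.0):
    """Coverage criterion for the design given by row indices into `candidates`."""
    design = candidates[design_idx]
    dist = np.linalg.norm(candidates[:, None, :] - design[None, :, :], axis=2)
    with np.errstate(divide="ignore"):
        d = (dist ** p).sum(axis=1) ** (1.0 / p)   # 0 at the design points themselves
    return (d ** q).sum() ** (1.0 / q)

def swap_design(candidates, init_idx, p=-5.0, q=5.0, max_sweeps=50):
    """Improve a design by swapping one design point at a time (ASSUMED looping rule)."""
    design = list(init_idx)
    best = coverage(candidates, design, p, q)
    for _ in range(max_sweeps):
        improved = False
        for pos in range(len(design)):
            best_cand, best_val = None, best
            for c in range(len(candidates)):
                if c in design:
                    continue                        # keep design points distinct
                trial = design.copy()
                trial[pos] = c
                val = coverage(candidates, trial, p, q)
                if val < best_val:                  # track the largest reduction
                    best_cand, best_val = c, val
            if best_cand is not None:
                design[pos] = best_cand
                best = best_val
                improved = True
        if not improved:
            break                                   # no single swap helps any more
    return design, best
```

A search of this kind reaches a design that no single swap can improve, which need not be the global optimum; restarting from several initial designs is a common safeguard.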
4. Results of Statistical Analysis
The results for ozone network design based on the statistical tools described above
divide naturally into three phases. The first effort investigates the properties of designs
generated by subset selection or a space-filling criterion for two small regular grids and a
range of scale parameters (theta) in the exponential covariance function. The second applies
these methods to thin the Chicago urban network. The final phase investigates the ozone
network for a small rural area and a large region comprising several Midwestern states.
It is important to discuss the differences between the three methods for constructing
designs because Leaps, Lasso, and space-filling methods are not guaranteed to yield optimal
designs in terms of PSE. A simulation study was performed. Space-filling (p=-5 and q=5)
and Leaps designs exhibited comparable average prediction variance and maximum prediction
variance over simulated square and hexagonal test regions. The different types of designs had
similar performance over a range of correlations from extreme dependence to near
independence, bracketing the levels of correlation encountered in the ozone monitoring data.
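The average and maximum prediction variance of a design over a test region can be computed as sketched below. This assumes a zero-mean, unit-variance field with exponential covariance exp(-distance/theta) and simple kriging; the exact covariance parametrization used in the simulation study is not stated here, so that form and the function names are assumptions.

```python
import numpy as np

def exp_cov(a, b, theta):
    """ASSUMED exponential covariance exp(-||a_i - b_j|| / theta), unit variance."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return np.exp(-d / theta)

def prediction_variance(grid, design, theta):
    """Simple-kriging prediction variance at each grid point."""
    K = exp_cov(design, design, theta)      # covariance among design points
    k = exp_cov(grid, design, theta)        # shape (n_grid, n_design)
    w = np.linalg.solve(K, k.T)             # kriging weights, (n_design, n_grid)
    return 1.0 - np.einsum("ij,ji->i", k, w)

# Usage: summarize a five-point design on a 5 x 5 square test region.
gx, gy = np.meshgrid(np.arange(5.0), np.arange(5.0))
grid = np.column_stack([gx.ravel(), gy.ravel()])
design = grid[[0, 4, 12, 20, 24]]           # four corners and the center
v = prediction_variance(grid, design, theta=2.0)
summary = (v.mean(), v.max())               # average and maximum prediction variance
```

Varying theta from very small to very large values moves the field from near independence to extreme dependence, which is the range of correlation the comparison spans.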

For the period 1987-1991 there were twenty stations in the Chicago urban area. The
observations recorded for summer 1987 were used to estimate covariance for ozone at the
stations and to generate subset designs. The remaining three years were used to calculate
prediction error. A common feature was the rapid increase in the design efficiency as the
number of sites increases, e.g., a five station design did well in estimating the full network
average, exhibiting PSE 2.5 ppb compared with standard deviation 16.1 ppb for the full
network average over the three year validation period. Additional stations produced some
decrease in variance, but improvement was marginal relative to the large decrease up to five
stations. To generate Lasso designs, a nonstationary covariance function was estimated for
the Great Lakes region using ROM. Pairwise correlations were calculated for pairs of ROM
cells based on the 8-hour average. It was assumed that the daily ROM ozone fields are
independent realizations from a temporally stationary process, so that the 89 days covered
were treated as replicates, making it possible to estimate both the marginal variance of ozone
over the region and covariance between any two ROM grid cells.
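Treating the daily fields as independent replicates leads to the usual sample covariance across days. The sketch below assumes the model output is arranged as one row per day and one column per grid cell; the function name and the layout are illustrative assumptions.

```python
import numpy as np

def empirical_covariance(fields):
    """Sample covariance and correlation between grid cells.

    fields: (n_days, n_cells) array of daily 8-hour averages, with the days
    treated as independent replicates of the same spatial process.
    """
    fields = np.asarray(fields, dtype=float)
    centered = fields - fields.mean(axis=0)           # remove each cell's mean
    cov = centered.T @ centered / (fields.shape[0] - 1)
    sd = np.sqrt(np.diag(cov))                        # marginal standard deviations
    corr = cov / np.outer(sd, sd)
    return cov, corr
```

The diagonal of the covariance matrix gives the marginal variance at each cell. When the number of cells exceeds the number of days, the sample covariance matrix is singular, so some smoothing or modeling step is needed before it can be inverted for kriging.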
A section of Northern Illinois was used for testing the methodology of estimating
nonstationary covariance functions and generating designs. To adjust for the tendency of
algorithms to select stations along the boundary, existing stations outside the boundary were
added as fixed stations. Points were added using the coverage criterion (p=-5 and q=5).
Adding five points decreased median PSE from 3.9 ppb to 3.0 ppb. Adding ten points
resulted in greater decrease, with median PSE 2.8 ppb.
The last design study quantified the results of thinning, augmenting and creating a new
ozone network for the Great Lakes region. Stations outside this region were not included, so
the resulting designs could be sensitive to boundary effects. The thinking was that larger
designs might reduce the edge effects caused by ignoring sites outside the region.
Augmentation was accomplished using existing stations as fixed members of the design, and
minimizing the coverage criterion for locations drawn from a candidate set. However, the
new points tended toward the edges of the region rather than towards the interior, so that
augmentation had little effect on overall performance. An important exception was in northern
Michigan, suggesting the need for additional sites. Smaller or thinned networks were
constructed from subsets of existing stations (constrained) or drawn from a larger uniform
grid (unconstrained). For small numbers of stations, performance of the constrained and
unconstrained networks was similar. Differences in candidate sets became more apparent as
the number of design points approached the total number of existing stations.
BIBLIOGRAPHY
Nychka, D., Yang, Q. and Royle, J. A. (1997). Constructing spatial designs using
regression subset selection. Statistics for the Environment 3: Sampling and the
Environment, V. Barnett and K.F. Turkman (eds.). Wiley, New York (to appear).
SUMMARY
This paper is concerned with statistical methods for design and evaluation of air
quality monitoring networks, focusing on ambient ozone monitoring.

TECHNICAL REPORT DATA
1. REPORT NO.
EPA/600/A-97/046
2.
3.
4. TITLE AND SUBTITLE
Ozone: Modeling and Monitoring an Atmospheric Pollutant
5. REPORT DATE
6. PERFORMING ORGANIZATION CODE
7. AUTHOR(S)
Lawrence H. Cox, Douglas Nychka
8. PERFORMING ORGANIZATION REPORT NO.
9. PERFORMING ORGANIZATION NAME AND ADDRESS
National Exposure Research Laboratory, Research Triangle Park, NC 27711
10. PROGRAM ELEMENT NO.
11. CONTRACT/GRANT NO.
12. SPONSORING AGENCY NAME AND ADDRESS
NATIONAL EXPOSURE RESEARCH LABORATORY
OFFICE OF RESEARCH AND DEVELOPMENT
U.S. ENVIRONMENTAL PROTECTION AGENCY
RESEARCH TRIANGLE PARK, NC 27711
13. TYPE OF REPORT AND PERIOD COVERED
Proceedings Paper
14. SPONSORING AGENCY CODE
USEPA
15. SUPPLEMENTARY NOTES
16. ABSTRACT
This paper is concerned with statistical methods for design and evaluation of air quality monitoring networks, focused
on ambient ozone monitoring.
17. KEY WORDS AND DOCUMENT ANALYSIS
a. DESCRIPTORS
b. IDENTIFIERS/OPEN ENDED TERMS
c. COSATI
18. DISTRIBUTION STATEMENT
RELEASE TO PUBLIC
19. SECURITY CLASS (This Report)
UNCLASSIFIED
20. SECURITY CLASS (This Page)
UNCLASSIFIED
21. NO. OF PAGES
22. PRICE