ional Institute of Statistical Sciences
Accounting for Meteorological Effects in
Measuring Urban Ozone Levels and Trends
Peter Bloomfield Andy Royle
Qing Yang
-------
Accounting for Meteorological Effects in
Measuring Urban Ozone Levels and Trends
Peter Bloomfield Andy Royle
Qing Yang
Technical Report # 1
June, 1993
National Institute of Statistical Sciences
P.O. Box 14162
Research Triangle Park, N. C. 27709
-------
Accounting for Meteorological Effects in
Measuring Urban Ozone Levels and Trends*
Peter Bloomfield Andy Royle
Qing Yang
National Institute of Statistical Sciences
and
Department of Statistics
North Carolina State University
Preliminary Report: 9 June, 1993
Abstract
Surface ozone levels are determined by the strengths of sources and
precursor emissions, and by the meteorological conditions. Observed ozone
concentrations are valuable indicators of possible health and environmental
impacts. However, they are also used to monitor changes and trends in the
sources of ozone and of its precursors, and for this purpose the influence
of meteorological variables is a confounding factor. This report describes
a study of ozone concentrations and meteorology in the Chicago area. The
data are described using a variety of exploratory methods, including median
polish and principal components analysis. The key relationships observed in
these analyses are then used to construct a model relating ozone to meteorol-
ogy. The model can be used to estimate that part of the trend in ozone levels
that cannot be accounted for by trends in meteorology, and to "adjust" ob-
served ozone concentrations for anomalous weather conditions. The model
'Research supported by the U.S. Environmental Protection Agency under Cooperative Agree-
ment #CR819638-01-0.
-------
Preliminary Report 9 June, 1993 2
parameters are estimated by nonlinear least squares. Its goodness of fit is
assessed by comparison with nonparaxnetric regression results Gowess).
Keywords: Ozone concentration, meteorological adjustment, nonlinear
regression, nonparametric regression.
Contents
1 Introduction 4
2 Description of Data 5
2.1 Ozone data 5
2.2 Meteorological data 9
3 Preliminary Analyses 11
3.1 Diurnal cycles 11
32 Imputation of missing values 13
4 Summarizing the Network 15
4.1 Principal components analysis 15
4.2 Network average 23
5 Modeling Ozone Concentrations 23
5.1 Surface meteorology 23
5.2 Upper air meteorology 29
5.3 Lagged meteorological variables 30
5.4 Seasonal structure and trend 30
5.5 Fitted coefficients and standard errors 33
6 Discussion of the Model 35
6.1 Quality of the fitted model 35
62 Predicted and adjusted perccntiles 38
6.3 Interpretation of model components 41
6.4 Cross validation 44
6.5 Jackknifing 46
6.6 Network maximum 49
7 Conclusions 54
-------
Preliminary Report 9 June, 1993 3
List of Tables
1 The ozone monitoring stations 6
2 Meteorological variables and station locations 9
3 Results of principal components analysis for 16 stations 17
4 Station loadings for the first five principal components 18
5 Coefficients in the fitted model 34
6 Root mean square residual and one day lagged correlation coeffi-
cient, by month 37
7 Predicted residual mean and root mean square by year. 44
8 Jackknifed estimates and standard errors 48
9 Coefficients in the fitted model for the network maximum 50
List of Figures
1 Locations of the ozone monitoring stations 8
2 Locations of the surface and upper air weather stations 10
3 Daily typical values for station 170317002 , 12
4 Diurnal cycle for station 170317002 12
5 Imputation of 2p.m. observations for station 170317002 13
6 Quantile-quantile plot of imputed 2 p.m. observations for station
170317002 against valid 2 p.m. observations 14
7 Locations of the 16 stations used in principal components analysis. 16
8 Station loadings for component 1 19
9 Station loadings for component 2 20
10 Station loadings for component 3 21
11 Station loadings for component 4 22
12 Scatter plots of ozone against temperature, wind speed, relative
humidity, and cloud cover. 25
13 Scatter plots of ozone against wind direction and barometric pressure. 26
14 Nonparametric regression surface for ozone against temperature
(maximum from 09:00 to 18:00) and noon relative humidity. ... 26
15 Nonparametric regression surface for ozone against temperature
(maximum from 09:00 to 18:00) and noon wind speed. 27
16 Residuals from meteorological model against day of year. 31
-------
Preliminary Report 9 June, 1993 4
17 Nonparametric regression surface for ozone against season and
the fit from the nonseasonal model. . 32
18 Quantile-quantile plot of the residuals from the fitted model against
the Gaussian distribution 36
19 Nonparametric regression surface for magnitude of residuals against
season and fitted value 37
20 Actual and predicted 95th percentiles of ozone by year. 39
21 Actual and adjusted 95th percentiles of ozone by year. 40
22 Fitted polynomial effect of temperature at zero wind speed. .... 41
23 Fitted seasonal effect 42
24 Quantile-quantile plot of predicted root mean square residual by
year, against a bootstrap reference distribution 46
25 Quantile-quantile plot of ratio of jackknifed standard errors to
adjusted standard errors. 49
26 Fitted polynomial effect of temperature at zero wind speed for
network maximum 51
27 Fitted seasonal effect for network maximum 52
28 Actual and predicted 95th percentiles of network maximum ozone
by year. 53
29 Actual and adjusted 95th percentiles of network maximum ozone
by year. 53
1 Introduction
Current knowledge about trends in surface ozone concentrations has been reviewed
recently by the National Research Council (1991, Chapter 2), who divided efforts
to quantify the impact of meteorology on ozone into those based on classificatory
methods, and those based on regression methods. The former have the advantage
of nonparametric flexibility, but are essentially uninterpretable. The latter give
interpretable models, but are heavily tied to linear models. Cox and Chu (1992)
used a generalized linear model approach, introducing interaction between two key
meteorological variables (temperature and wind speed) by including their product
The modeling effort described here was based on ozone and meteorological
data from the Chicago area (Cox, 1992). Graphical displays and nonparametric
modeling, described in detail in Section 5, were used to confirm the identities of
the key meteorological variables and their interactions, and to choose functional
-------
Preliminary Report 9 June, 1993 5
forms for modeling the dependence of ozone. Nonlinear least squares methods
were then used to fit a model incorporating these variables and interactions.
When the model is extended to include a dependence on year, the associated
coefficient may be interpreted as an estimate of proportional trend in ozone,
allowing for changes in the meteorological variables. The model may also be used
to construct "adjusted" ozone levels, that is, estimates of what the concentration of
ozone would have been on a given day or in a given year, had the meteorological
conditions been different.
The development of the model is driven by empirical study of the associa-
tion of ozone levels with the various meteorological variables. This approach
complements that used by Comrie and Yamal (1992), who constructed a synoptic
climatology of atmospheric circulation and ozone concentrations in Pittsburgh,
Pennsylvania. Kelly, Ferman and Wolff (1986) similarly studied the meteorologi-
cal conditions associated with high and low ozone days in Detroit, Michigan. The
conclusions of such studies provide considerable guidance in the construction of
models such as that developed here.
Feister and Balzer (1991) have also studied the dependence of ozone concen-
trations on meteorology, but with a view towards short-term forecasting. Their
model includes the previous day's ozone concentration, and they find that this
is the most important variable. Previous values of ozone have not been used in
the present study, as their incorporation in a model makes its use for adjustment
problematic.
2 Description of Data
2.1 Ozone data
The ozone data consisted of hourly averages at the 45 stations described in Table 1.
The locations of the ozone monitoring stations are shown in Figure 1.
Most of the 45 stations recorded data only during the summer months, al-
though some were operated essentially year-round. In all the analyses described
subsequently, data were limited to the ozone "season" of 1 April to 31 October.
-------
Preliminary Report
9 June, 1993
Table 1: The ozone monitoring stations. "AIRS" is the EPA air quality data
base. "MSA" is the Metropolitan Statistical Area identifier for the station location.
Dates of first and last observations are given in "yymmdd" form. Codes are:
R-residential, I-industrial, C-commercial, A-agricultural, M-mobile; S-suburban,
U-urban, R-ruraL
AIRS Site ID Latitude Longitude MSA State First and Last Dates Code
170310001
170310003
170310009
170310032
170310037
170310038
170310044
170310045
170310050
170310053
170310062
170310063
170310064
170311002
170311003
170311601
170312301
170313005
170314002
41.669
42.061
41.747
41.757
41.979
41.705
41.790
41.923
41.708
41.739
41.917
41.877
41.790
41.616
41.984
41.668
41.874
42.060
41.855
87.733
88.003
87.732
87.545
87.669
87.632
87.583
87.634
87.568
87.726
87.573
87.634
87.602
87.558
87.792
87.990
87.849
87.753
87.753
1600
1600
1600
1600
1600
1600
1600
1600
1600
1600
1600
1600
1600
1600
1600
1600
1600
1600
1600
EL
EL
IL
IL
EL
EL
EL
EL
EL
EL
EL
EL
EL
EL
EL
EL
EL
EL
IL
880612
810206
810331
870331
850723
810331
811001
810101
820101
821130
890516
910729
890701
810101
810331
830407
810101
810101
830217
911030
821216
820526
911031
911231
810912
861112
860105
911231
870914
890727
911231
911031
901031
911030
911030
820831
820831
911031
RS
RU
IU
RU
RS
RS
CU
IS
RS
AS
MU
RS
RS
MU
IS
cs
cs
RS
continued on next page
-------
Preliminary Report
9 June, 1993
Table 1 (continued)
AIRS Site ID
170314003
170315001
170316002
170317002
170318003
170431002
170436001
170890005
170970001
170971002
170971003
170973001
171110001
171971007
171971008
180890011
180891016
180892001
180892002
180892008
181270020
181270021
181270024
181270903
181271004
181271005
Latitude
42.039
41.514
NA
42.062
41.631
41.856
41.813
42.050
42.177
42.391
42.446
42.290
42.221
41.278
41.592
41.627
41.600
41.600
41.631
41.639
41.631
41.563
41.617
NA
41.463
41.467
Longitude
87.898
87.645
NA
87.676
87.568
88.077
88.073
88.273
87.865
87.847
88.103
87.982
88.241
88.220
88.049
87.481
87.335
87.383
87.521
87.501
87.087
87.076
87.199
NA
87.044
87.061
MSA
1600
1600
1600
1600
1600
1600
1600
620
3965
3965
3965
3965
1600
3690
3690
2960
2960
2960
2960
2960
2960
2960
2960
2960
2960
2960
State
EL
EL
DL
IL
IL
EL
DL
EL
IL
DL
IL
IL
IL
IL
IL
IN
DM
IN
IN
IN
IN
IN
IN
IN
IN
IN
First and Last Dates
850321
810101
830701
810101
910401
810226
840208
810101
810101
810101
900606
810101
810101
810227
810101
810131
820901
810101
811002
810101
830930
810512
831115
810101
830430
810430
911031
820201
841016
911231
911031
831231
911231
911231
911231
911231
911031
911231
911231
891108
911231
820818
910930
820815
820831
910930
911031
851031
910930
810131
851018
821031
Code
RS
CS
RS
RS
RS
AS
RS
RS
RS
AR
RS
RS
AR
RS
CU
RU
IU
RU
CS
R
AR
RS
IS
RS
CU
-------
Preliminary Report
9 June, 1993
3
2" °
s ^*
88.4 88.2 88.0 87.8 87.6 87.4 872 87.0 86.8
J I I I I I | I I
88.4 88.2 88.0 87.8 87.6 87.4 872 87.0 86.8
CM
-------
Preliminary Report
9 June, 1993
2.2 Meteorological data
Surface and upper air meteorological variables were collected at the two stations
shown in Figure 2. The meteorological variables are described in Table 2. The
surface observations were made each hour, while the upper air soundings were
made largely at OOZ and 12Z (00:00 and 12:00 UTC, respectively, occasional
soundings at 23Z and 11Z were taken as being at OOZ and 12Z, respectively; other
soundings were ignored). The upper air measurements were made at many levels;
these always included 950mb, 850mb, 700mb, and 500mb, which were the only
levels at which the data were used.
Table 2: Meteorological variables and station locations.
Variable
Units Surface Upper Symbol
Total cloud cover
Opaque cloud cover
Ceiling height
Barometric pressure
Temperature
Dewpoint temperature
Relative humidity
Specific humidity
Wind direction
Wind speed
Visibility
Height of pressure layer
Latitude
Longitude
%
%
m
mb
oF
°F
%
g/kg
°fromN
m/s
km
m
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
41.98°
87.90°
Yes
Yes
Yes
Yes
Yes
Yes
Yes
40.67°
89.68°
totcov
opcov
cht
pr
t
td
rh
q
wdir
wspd
vis
ht
-------
Preliminary Report
9 June, 1993
10
p
-------
Preliminary Report 9 June, 1993 11
3 Preliminary Analyses
3.1 Diurnal cycles
The recorded ozone concentrations for a given station may be written as a two-way
array:
y. (1)
The decomposition was made using median polish (Tukey, 1977), as implemented
in S (Becker, Chambers andWilks, 1988, see twoway). The advantage of median
polish, which is closely related to least absolute deviations fitting of equation (1)
(Bloomfield and Steiger, 1983), is that it is influenced more by the body of a
distribution and less by its tails than methods based on averages. Thus the "typical"
values for an hour or for a day are less affected by deviations from normal levels
that last for less than half a day. Median polish also works well in the presence of
missing data (Tukey, 1977, Chapter 10).
The decomposition was made on the logarithmic scale, to correspond to a mul-
tiplicative decomposition of the actual ozone concentrations. This is appropriate
when effects are expected to be proportional; for instance, when the typical diurnal
profile is expected to be scaled by the daily effect, rather than offset by it The
diagnostic plot (Tukey, 1977, Section 10F) indicated that the decomposition was
more satisfactory on the logarithmic scale.
Figure 3 shows the fitted daily typical values, exp(/x + ay), for a station with
one of the longer records. The seasonal cycle is visible, as are the unusually high
values in 1988. Figure 4 shows the corresponding diurnal cycle, exp(/z + fa).
This shows the characteristic behavior of urban ozone: a mid-afternoon maximum,
followed by a decay to an early morning minimum, which is sharpened by NO
scavenging during the morning rush hour, and is followed by a rise through the
late morning and the middle part of the day. This rise is caused initially by mixing
of the surface layer with the relatively ozone-rich residual layer above it, and later
by photochemical production of ozone. For this station, the root mean square
residual in the median polish decomposition (on the logarithmic scale) was 0.62,
indicating quite large proportional variations of the observed data around the fitted
values.
-------
Preliminary Report
9 June, 1993
12
8
S
o
1*1 ? -
£ >
1 &. &
I fe
3 ^q
1982
1984
1986
1988
1990
1992
Figure 3: Daily typical values for station 170317002.
12 ,
Hour of the day
18
24
Figure 4: Diurnal cycle for station 170317002.
-------
Preliminary Report
9 June, 1993
13
3.2 Imputation of missing values
The decomposition (1) was used to impute values for missing hourly ozone con*
centrations. If y^ is missing, but the first three terms on the right hand side of (1)
are available, then their sum provides a "fitted value" for the logarithm of the
ozone concentration at that hour. These three terms are available if there are any
data for the relevant station on the same day but at different hours, and for that
hour but on different days. At every station there was enough data to construct a
reliable estimate of the diurnal cycle terms &,&,..., £24, so this procedure gave
imputed values for all missing data other than where entire days were missing.
However, it was found that the imputed values were unreliable when there were
fewer than 2 valid hourly averages between 9:00 a.ra. and 6:00 p.m., and they
were not used in this situation.
The root mean square residual in equation (1) of 0.62 means that the imputed
value typically differed from the missed value by a factor of 0.5 to 2. However,
their distribution was quite similar to the valid data. Figure 5 shows the histograms
of the valid and imputed 2 p.m. observations for the station shown in Figures 3
and 4. Figure 6 is the corresponding quantile-quantile plot (Becker et al., 1988,
see qqplot). Here and later, the order statistics of the smaller sample are graphed
against estimated quantiles from the larger sample, obtained by interpolation and
0 50 100 150
2230 valid 2 pjn. observations (ppb)
0 50 100 150
71 imputed 2 pjn. observations (ppb)
Figure 5: Imputation of 2 p.m. observations for station 170317002.
-------
Preliminary Report
9 June, 1993
14
0 40 80 120
Valid observation:; (ppb)
Figure 6: Quantile-quantile plot of imputed 2 p.m. observations for station
170317002 against valid 2 pjn. observations.
by assuming that the ith order statistic in a sample of size n estimates the (i \}jn
quantile (Becker et al., 1988, see quant lie). The distribution of the imputed
data agrees closely with that of the valid data up to about 60 ppb, but the largest
imputed values appear to be too small. However, this affects only a handful of
days, and should lead to negligible bias in further analyses. For instance, 26 of
the 2230 valid 2 p.m. values exceed 120 ppb, suggesting that 1 or so of the missed
values might have been in this range, while the largest imputed value was below
90 ppb. However, the percentages of values above 120 ppb are 1.166% in the
valid data and 1.130% in the data set including the imputed values.
A further level of imputation was used for the analyses described in follow-
ing sections, including principal components analysis (Section 4.1) and nonlinear
modeling (Section 5). Where an entire day of data was missing, but at least some
data were measured on both the preceding and succeeding days, the logarithmic-
scale daily effect a* was imputed as the average of the two neighboring values.
This is equivalent to using the geometric mean on the original scale of measure-
ment Days with fewer than 2 valid observations between 9:00 a.m. and 6:00 p.m.
were treated in the same way: although a* was available, it was flagged as missing,
and replaced by the average of the neighboring; values, if available. For most sta-
tions there were 10 or fewer days meeting these conditions, so this had relatively
-------
Preliminary Report 9 June, 1993 15
little effect.
4 Summarizing the Network
4.1 Principal components analysis
The joint behavior of ozone concentrations at the various stations was explored
using principal components analysis of the daily maximum concentration. This
was carried out computationally through the singular value decomposition of a
data matrix whose rows corresponded to days of data, and whose columns corre-
sponded to the stations. Since no missing values were allowable, an appropriate
subset of stations and days had to be found. To minimize missing data, both levels
of imputation described in Section 3.1 were used. The matrix was constructed
sequentially, beginning with the station with the most complete data, and succes-
sively adding stations in descending order of amount of data. The first 16 stations
yielded 662 days of data, while adding the 17th reduced this to 274. The locations
of the 16 stations chosen for the analysis are shown in Figure 7. The singular
values and some related quantities are shown in Table 3. The station loadings for
the first 5 components are shown in Table 4. Figures-8 to 11 show the results of
spatial interpolation (Becker et al., 1988, see int erp) of the station loadings for
the first four components.
The first component, accounting for 79% of the variance of the 16 station
sub-network, is a weighted average of the station values, with weights ranging
from 0.20 to 0.32. Stations close to Lake Michigan carry the higher weights,
while those further from the lake have lower weights. The second and third
components account for a further 5% and 4% of variance, respectively. They
represent gradients across the network, from East to West and from North to South,
respectively. The remaining singular values are not well separated, and most are
presumably noise. However, the fourth component, while not well distinguished
from the rest, nevertheless has some spatial interpretation as a contrast between
the interior and boundary stations. Components 2 to 4 may reflect meteorological
conditions; this will be explored elsewhere.
-------
Preliminary Report
9 June, 1993
16
88.4 88.2 88.0 87.8 87.6 87.4 87.2 87.0 86.8
I I I | I I I | I
88.4 88.2 88.0 87.8 87.6 87.4 872 87.0 86.8
Longitude (degrees W)
Figure 7: Locations of the 16 stations used in principal components analysis.
Small dots indicate stations not used.
-------
Preliminary Report
9 June, 1993
17
Table 3: Results of principal components analysis for 16 stations.
Singular value
1828.81
451.36
415.50
284.81
268.73
240.70
230.02
221.47
210.90
197.13
16928
161.87
156.91
144.68
118.49
97.98
Percent of variance
78.8523
4.8031
4.0703
1.9124
1.7026
1.3659
1.2474
1.1564
1.0486
0.9162
0.6756
0.6178
0.5805
0.4935
0.3310
0.2263
Cumulative percent
78.85
83.66
87.73
89.64
91.34
92.71
93.95
95.11
96.16
97.08
97.75
98.37
98.95
99.44
99.77
100.00
-------
Preliminary Report
9 June, 1993
18
Table 4: Station loadings for the first five principal components.
Station
181270024
170436001
170311601
170311003
170314002
180891016
170311002
170310050
180892008
171971008
170970001
170973001
170971002
171110001
170890005
170317002
1
0.2786
0.2065
0.2359
0.2525
0.2512
0.2399
0.2334
0.2384
0.2574
0.2306
0.2593
0.2492
0.2959
0.2043
0.2185
0.3202
Component
234
0.3956
-0.3170
-0.3976
-0.0622
-0.0159
0.3188
0.1039
0.1691
0.2667
-0.3262
-0.0300
-0.1252
0.0772
-0.2771
-0.3230
0.2423
-0.3512
-0.1700
-02663
0.1928
0.0759
-02732
-0.1675
-0.0237
-0.2379
-0.2942
0.3581
0.3090
0.4390
0.0052
-0.0716
0.2567
-0.1620
-0.0380
0.0826
-0.5339
-0.4037
02106
-0.3765
-0.0391
0.2017
0.1316
-0.1165
0.0005
0.3147
0.1937
0.1242
0.3261
5
0.5381
0.0692
-0.0573
0.1447
-0.4021
-0.1318
-0.0154
-02313
-0.5233
0.0619
0.0287
-0.0682
-0.0893
0.0834
0.1124
0.3715
-------
Preliminary Report
9 June, 1993
19
88.2
3 3
o
88.2 88.0 87.8 87.6
Longitude (degrees W)
r-i
5
O
-------
Preliminary Report
9 June, 1993
20
88.2
87.3
87.6
87.4 87.2
00
88.2
88.0 87.8 87.6
Longitude (degrees W)
87.4
Figure 9: Station loadings for component 2.
00
-------
Preliminary Report
9 June, 1993
21
87.2
(N
g> <=>.
3 <^"
-
88.2
87.8 87.6
Longitude (degrees W)
87.4
87.2
Rgure 10: Station loadings for component 3.
-------
Preliminary Report
9 June, 1993
22
8S.2
\
87.8 87.6
Longitude (degrees W)
87.4
87.2
f-l
00
Figure 11: Station loadings for component 4.
-------
Preliminary Report 9 June, 1993 23
4.2 Network average
The principal components analysis of Section 4.1 showed that most of the variance
was associated with a single component It was based on a subset of 662 days
of data, because of the data requirements of the method, and consequently the
principal component time series could be calculated only for the same subset of
days. However, it was essentially a weighted average of the 16 stations, which
suggested that a similar network average, calculated for a more extensive set of
days, would provide a bener basis for the subsequent modeling.
Median polish was used a second time to provide an outlier-resistant summary.
Both levels of imputation (Section 3.1) were used to construct daily maximum
ozone concentrations by station. The daily maxima were written as a two-way
array:
y
-------
Preliminary Report 9 June, 1993 24
Other variables mentioned were wind direction, dew point temperature, sea level
pressure, and precipitation. All of these except precipitation were included in the
suite of meteorological data available for the present study (surface barometric
pressure was used in place of sea level pressure).
Figure 12 shows scatter plots of the network average daily maximum ozone
concentration, described in Section 4.2, against four surface meteorological vari-
ables. Temperature is measured by the maximum from 9:00 aan. to 6:00 p.m.,
while the other variables are for noon. The curve overlayed on each graph is i
cubic least squares smoothing spline with 6 degrees of freedom. Rgure 13 shows
corresponding plots for surface wind direction and barometric pressure; dew point
temperature was not considered, as it is a function, albeit nonlinear, of temperature
and relative humidity, and highly correlated with temperature.
It is evident that temperature has the strongest effect, that the effect would
be well approximated by a low order polynomial, and that relative humidity
and wind speed appear to be the next most important variables. To explore the
dependence of ozone level on these variables two at a time, the lowess method
of nonparametric regression (Qeveland, 1979; Chambers and Hastie, 1992, see
loess) was used to find regression surfaces. Rgure 14 shows the surface for
ozone against temperature and relative humidity. The surface indicates that the
polynomial effect of temperature is maintained at all levels of relative humidity, and
that the effect of relative humidity is reasonably linear at all levels of temperature,
but with a larger slope where the temperature effect is larger. This suggests that a
model of the form
ozone = (polynomial in temperature) x (linear function of relative humidity)
(2)
might express the joint effect of these two variables. Polynomial regression of
ozone against temperature suggests that the polynomial needs to be of order at
least 3; raising the order from 3 to 4 increases R* from 0.6052 only to 0.6065,
suggesting that order 3 is adequate. Fitting equation (2) by nonlinear least squares
leads to/? = 0.6509. The dimensionless function of relative humidity is estimated
to be 1 - 0.00498%-' x (relative humidity - 50*), which decreases from 1.249 to
0.751 as relative humidity increases from 0% to 100%. The B? for this fit may be
compared with that for the lowess fit of Figure 14, which was 0.66; the equivalent
number of parameters was 10.2. This indicates that the parametric model, with
only 5 parameters, performs very nearly as well as the nonparametric one.
The corresponding regression surface for ozone against temperature and wind
-------
Preliminary Report
9 June, 1993
25
s s \
1 s
s
40 60 80 100
Maximum temperature, 09:00 to 18:00 (F)
0 5 10 15
Wind speed, 1100 (mfc)
1 2
20 40 60 80 100
Relative humidity, 12,-QO (%)
*
41
0 20 40 60 80 100
Opaque cloud cover, 1100 (%)
Figxire 12: Scatter plots of ozone against temperature, wind speed, relative hu-
midity, and cloud cover.
-------
Preliminary Report
9 June, 1993
26
s §<
I 8
a
: ,:: :.. ":,;:!:; .:
vi:,:'.:i,:!iilll
0 100 200 300
Wind direction, 1100 (deg)
I i
I S
s
990 1010 1030
Pressure, 11-00 (mbar)
Figure 13: Scatter plots of ozone against wind direction and barometric pressure.
Figure 14: Nonpararoetric regression surface for ozone against temperature (max-
imum from 09:00 to 18:00) and noon relative humidity.
-------
Preliminary Report
9 June, 1993
27
speed is shown in Figure 15. This surface shows similarly that the polynomial
effect of temperature is maintained at all levels of wind speed, but that the effect
of wind speed is neither linear nor the same (nor even proportional) at different
wind speeds. Rather, increasing wind speed is associated with a drop in ozone at
high temperatures, leveling off above around 6 m/s, but with a slight rise in ozone
at low temperatures. This behavior could be captured in a model of the form
ozone = constant + (polynomial in temperature) x (function of wind speed) (3)
and the surface suggests that the function of wind speed might be of the form
1
1 +
wind speed"'
where v is a critical speed at which the effect of this dimensionless factor drops
from 1 to 0.5. The nonlinear least squares fitted value of v is 7.72 m/s, and
fl2 = 0.6280. That the effect of wind speed differs from that of relative humidity
is shown in the fitted value of the constant term in equation (3), 40.8 ppb. This
terra would have to be set to 0 ppb for the form of the equation to reduce to
the simpler multiplicative form. Forcing this change reduces /? to 0.6143 (and
Figure 15: Nonparametric regression surface for ozone against temperature (max-
imum from 09:00 to 18:00) and noon wind speed.
-------
Preliminary Report 9 June, 1993 28
increases c to 51 m/s). By contrast, if such a constant is included in equation (2),
it is estimated to be -1.2 ppb, and /? is unchanged to 5 decimal places.
Again, the /? for the model (3), 0.6280, may be compared with that for the
lowess fit, which was 0.63 with the equivalent of 10.2 parameters. The parametric
model performs well by comparison.
Equations (2) and (3) may be combined as
ozone = (constant
+ (polynomial in temperature) x (function of wind speed)}
x (linear function of relative humidity), (4)
with wind speed entering in the same form as before. The .ft2 for the combined
model is 0.6752, and the coefficients change only slightly: the multiplier of
(relative humidity - 50%) becomes -0.00505%"*, and u becomes 6.85 m/s. The
corresponding lowess fit has /? = 0.68 for the equivalent of 16.7 parameters,
again indicating good performance of the parametric model, which here has 7
parameters.
Study of the residuals from combined models such as equation (4) showed
that ozone also depends on visibility and opaque cloud cover. On the other hand,
these residuals showed essentially no dependence on barometric pressure, despite
the association shown in Figure 13. High temperatures tend to be associated with
pressures close to 1015 mbar, and when the effect of temperature is removed
from the ozone concentrations, the association with pressure is removed as welL
Much but not all of the association of ozone concentration with wind direction is
similarly removed by taking out the effect of temperature.
Noon values of visibility and opaque cloud cover were used, and these were
included in the model by multiplying the right hand side of equation (4) by
(linear function of visibility) x (linear function of opaque cloud cover).
i
Including these terms in order raised & to 0.7016 and 0.7035, respectively. The
dependence of ozone levels on wind direction suggested including a similar term,
of the form
,. . F fir x wind direction \
1 -|- linear combination of cos I J
2jr x wind direction\
360 )' (5)
-------
Preliminary Report 9 June, 1993 29
Including this term raised A2 to 0.7064. Multiplying the cosine and sine by wind
speed gave a better fit and has greater physical meaning, since the terra can then
be interpreted as the inner product of the wind vector with a vector of coefficients.
This change increased R? to 0.7074. Finally, it was found that averaging the wind
vector over the hours 06:00 to 18:00 gave a still better fit, with If = 0.7108.
Equation 5 was adequate for these data because the (remaining) dependence on
wind direction was simple. In many cases the dependence of ozone concentration
on wind direction is multimodal, in which case a longer Fourier series would be
needed. This requires including the sines and cosines of small multiples of wind
direction.
5.2 Upper air meteorology
The residuals from the model described in the preceding section were graphed
against and correlated with upper air meteorological variables from the 12Z sound-
ing (06:00 CST). The only substantial association was with wind speed, and this
was stronger at 700 mbar than at 950 mbar, 850 mbar or 500 mbar.
First the effect of including upper air wind speeds in the wind speed interaction
with temperature was explored. The factor
1
1 a. wind speed*
1 "*" v
was extended by adding the four upper air wind speeds, with arbitrary weights, in
the denominator. The weights were then estimated by nonlinear least squares. The
coefficient of 950 mbar wind speed was very small and negative. The coefficients
of 850 mbar and 500 mbar wind speed were both smaller than that of 700 mbar
wind speed, consistently with the ordering of the correlation coefficients. The
700 mbar term was therefore retained, and the model was refined with the wind
factor extended to
1
t i surface wind speed 700 mbar wind speed'
1 "*" t>4 T V700
This gave an R2 of 0.7204.
Next the effect of including the 700 mbar wind vector in the wind vector part
of the model was studied. The term
1,1- u- ei j j\ /2r x wind direction's
1 + linear combination of (wind speed) x cos I ^r I
-------
Preliminary Report 9 June, 1993 30
'2r x wind direction\
.... ,. . /2r x wind direction \
and (wind speed) x sin ( )
was extended by adding a corresponding linear combination of the components of
the 700 mbar wind vector. However, this increased If by only 0.0003, indicating
that there is negligible further information in the upper air wind vector beyond that
in the surface wind vector. A similar experiment with the 950 mbar wind vector
also gave negative results. This extension was not used further.
53 Lagged meteorological variables
The residuals from the preceding analysis were examined for association with
lagged surface meteorological variables by using multiple linear regression meth-
ods. The lagged variables were all 24 hour averages. This examination showed
that there were moderately strong effects of temperature at lags 1 and 2, and rel-
ative humidity and wind speed at lag 1. The model was therefore extended by
including these lagged temperature variables linearly in the temperature factor,
and by adding lagged relative humidity and wind speed into the corresponding
factors.
These additions raised & to 0.7499. The coefficients of lagged temperatures
were both negative, with that of the lag 2 temperature three times the size of that
of kg 1 temperature, though still only one thirtieth of the coefficient of the same
day's temperature. The coefficient of lagged relative humidity was the same sign
and size as that of the same day's relative humidity. Similarly, the critical wind
speed for lagged wind speed was similar to that for the same day's wind speed.
The multiple regression had indicated that neither variable needed to be lagged
more than one day.
The residuals were also fitted linearly to lagged upper air variables, in the form
of the average of the OOZ sounding for the given day (which is made at 6:00 p.m.
on the previous day) and the 12Z sounding for the previous day (made at 6:00 a.m.
on the previous day). The t-statistics for these variables, while larger in magnitude
than their nominal 95% points, were all smaller than those for the lagged surface
variables. The lagged upper air variables were not included in the model
5.4 Seasonal structure and trend
The residuals from all models were also found to have seasonal dependence. For
instance, the residuals from the model described above are plotted against day of
-------
Preliminary Report
9 June, 1993
31
year in Figure 16. The graph shows an essentially linear decline over the ozone
season, and to model this effect a seasonal term was included in the model. To give
a reasonable fit both in the present context and in a model with no meteorological
factors, the term was taken to be a short Fourier series, with the annual and semi-
annual frequencies represented, and with the mean removed When this term was
included as another multiplicative factor, it raised R2 considerably, to 0.7982. The
alternative of an additive seasonal term gave a somewhat higher R2 of 0.8037.
To explore further the choice between an additive seasonal term and a multi-
plicative one, lowess was again used to give a nonparametric view. Here ozone
was fitted as a function of season and of the fit from the previous (nonseasonal)
model. Figure 17 shows the lowess surface, which is not easy to interpret. The
seasonal change is larger at fitted values of 90-100 ppb than at low fitted values
such as 20-30 ppb, which is consistent with a multiplicative combination of ef-
fects. However, there are relatively few data points with high fitted values near
the ends of the season. Up to 60-80 ppb, the surface shows a more nearly constant
seasonal change, suggestive of an additive combination. The R2 for the lowess
fit was 0.80, essentially the same as that obtained with both the multiplicative
seasonal term and the additive version. The residual root mean square for lowess,
8.225 ppb, falls between those for the additive fit (8.203 ppb) and the multiplica-
Apr
May Jun
Jul
Aug Sep
Oct
ft
100
150
200
Day of year
250
300
Figure 16: Residuals from meteorological model against day of year.
-------
Preliminary Report
9 June, 1993
32
Figure 17: Nonparametric regression surface for ozone against season and the fit
from the nonseasonal model
live fit (8.317 ppb). Although the choice is not clear cut, the additive form was
chosen for the later analyses.
The model is easily extended to estimate a trend in ozone concentrations, of
any chosen form. The simplest extension is the incorporation of a factor that is
linear in time into the model, as an extra multiplicative factor in the main part
of the model. Since possible trends in ozone concentration that are traceable to
trends in causative factors such as temperature are accounted for in the model, the
fitted coefficient in the trend term represents an estimate of that part of the trend
that is not explained by meteorology, or in other words an adjusted trend. An
unadjusted trend may also be calculated by omitting all meteorological variables.
The adjusted trend was found to be -2.7%/decade, while the unadjusted trend was
-4-5.3%/decade. Adding the trend term to the model with meteorological variables
resulted in a relatively small rise in /? to 0.8(342. The & for the model with no
meteorology, only trend and seasonality, was 0.3084, showing that meteorological
effects as formulated in the model account for around 50% of the variance in ozone
concentrations.
The 8%/decade difference between the adjusted and unadjusted trend estimates
is large relative to all of the trend standard eirors calculated below. It is caused
by bias in the misspecified model that omits meteorology, the removal of which is
-------
Preliminary Report 9 June, 1993 33
one of the motives for constructing these models.
5.5 Fitted coefficients and standard errors
The trend coefficients discussed in the previous section are of quantitative interest,
as are others of the fitted coefficients in the model. It is therefore desirable to asso-
ciate a standard error with each of them. This is possible in nonlinear least squares
fits (Gallant, 1987, for instance), but typically requires the usual assumptions of
constancy of variance and lack of correlation in the residuals. These assumptions
are easily seen to be false for the residuals from the present model (see Section 6. 1 ).
If these requirements are ignored, the standard errors reported by nonlinear least
squares programs (Chambers and Hastie, 1992, see nls, for instance) are invalid.
However, Gallant (1987, Sections 2.1, 2.2) describes methods for correcting the
variance estimates for heteroscedasticity and serial correlation in the errors. When
combined in a way that allows for serial correlation with a 2-3 day span, these
corrections gave the standard errors shown in Table 5. The final form of the model
was
ozone ~ {constant
+ (polynomial in temperature) x (function of wind speed)}
x (linear function of same day and lag 1 relative humidity)
x (linear function of visibility)
x (linear function of opaque cloud cover)
x (linear function of mean surface wind vector
x (linear function of time in years)
-f (seasonal model)
where the polynomial in temperature is cubic in same-day temperature plus linear
in lags 1 and 2 temperature, the function of wind speed is
_ _1 _
i . surface wind speed 700 mbar wind speed lag 1 wind speed *
"*" "*" +
and the seasonal model is a linear combination of the cosines and sines of the
annual and semiannual frequencies. It may be written specifically as
o3 ~ (muO + (tO-f tl*(maxt-60)
-------
Preliminary Report
9 June, 1993
34
Table 5: Coefficients in the fitted model, with standard errors computed conven-
tionally and adjusted for heteroscedasticity arid serial correlation.
Coefft. Fitted Value
Conventional! Adjusted
Standard error i Value Standard error t Value
muO
to
tl
t2
t3
til
t!2
vh
vh700
vhl
r
rl
op
V
m.u
m.v
y
al
bl
a2
b2
3.977e-K)l
-l.Olle+01
3.015e+00
9.480e-02
-1.541e-03
-2.531e-02
-8.198e-01
6.055e+00
1.358e+01
4.686e400
-3.384e-03
2.0446-03
-6.976e-04
-7.658e-03
5.245e-03
9.877e-04
-2.685e-03
-8.028e+00
4.085e+00
-2.6786+00
-1.1246+00
1.6149160
6.8563271
0.4637083
0.0141952
0.0002857
0.1758384
0.1851373
1.2663745
2.8454335
1.1571911
0.0003513
0.0003430
0.0001289
0.0006391
0.0014633
0.0015238
0.0011339
1.4207495
0.4924658
0.6501565
0.4750715
24.6262
-1.4747
6.5029
6.6780
-5.3931
-0.1439
-4.4280
4.7815
4.7728
4.0498
-9.6329
-5.9603
-5.4120
-11.9820
3.5841
0.6482
-2.3680
-5.6504
8.2946
-4.1195
-2.3656
1.5570528
6.6209256
0.5347362
0.0165989
0.0003189
0.1729924
0.2053805
1.5009998
3.7519675
1.4042068
0.0003664
0.0003769
0.0001417
0.0008258
0.0015789
0.0016596
0.0014404
1.3814916
0.4682407
0.7043474
0.5213316
25.5414
-1.5271
5.6391
5.7109
-4.8325
-0.1463
-3.9916
4.0341
3.6196
3.3374
-92342
-5.4244
-4.9215
-9.2733
3.3218
0.5952
-1.8640
-5.8110
8.7237
-3.8026
-2.1557
-------
Preliminary Report 9 June, 1993 35
+t2 * (maxt - 60)2 +13 * (maxt - 60)3
+tll * (tlagl - 60) +1!2 * (tlag2) - 60)
*!/(! -I- wspd/vh + wspd700/vh700 -|- wlag/vhl))
*(1 + r * (rh - 50) + rl * (rhlag - 50))
*(1 -f op * (opcov - 50))
*(l+v*(vis-12))
*(1 + m. u * mean. u -f m. v * mean. v)
*(l+y* (year- 1985))
+ al * cos(2 * pi * year) + bl * sin(2 * pi * year)
+ a2 * cos(4 * pi * year) + b2 * sin(4 * pi * year). (6)
The seasonal cosine and sine terms were in fact deviations from their respective
means.
Most of the /-ratios for variables in the model are at least 2 in absolute value.
Aside from the trend coefficient y, the only parameters that have ^-ratios lower than
3 are associated with other parameters with higher ^-ratios. Thus all meaningful
groupings of parameters have a high level of statistical significance.
6 Discussion of the Model
6.1 Quality of the fitted model
The root mean square residual from the fitted model is 8.195 ppb. The lower and
upper 2.5% points of the residuals are -14.6 ppb and 17.6 ppb, respectively, while
the quartiles are 5.0 ppb and 4.3 ppb. Thus model predictions differ from the
actual values by up to ±5 ppb about half the time, and by up to ±16 ppb about
95% or the time. Figure 18 is a quantile-quantile plot of the residuals against the
Gaussian distribution. The points would fall on a straight line if the distribution of
the residuals were exactly Gaussian in shape. In the figure, the behavior is roughly
linear up to a Gaussian quantile of around 2, or in other words for the lower 97.5%
of the distribution. Above this level, the points fall above the extrapolation of the
straight line, meaning that the quantiles of the residuals are progressively higher
than the corresponding quantiles of the Gaussian distribution that would fit the
lower part of their distribution. The highest residual, 65 ppb, is at around the So-
level.
-------
Preliminary Report
9 June, 1993
36
8
1
3
20 2
Quantises of Standard Normal
Figure 18: Quantile-quantile plot of the residuals from the fitted model against the
Gaussian distribution.
Table 6 shows the root mean square residual and the first lagged correlation
coefficient, by month. The standard error of a month's root mean square is around
6% of the observed value, and therefore ranges from 0.4 to 0.6. The seasonal
variation in the root mean square is therefore highly significant. The standard
error of each correlation coefficient is around 0.06; thus the correlations in June
and July are not significantly different from zero, while the higher correlations are
clearly different from these mid-summer values.
The seasonal rise in root mean square residual parallels the rise in mean levels
associated with summer temperatures. If the extra variability is solely caused by
the higher mean value, it might be eliminated by reexpressing ozone concentrations
on a "variance-stabilizing" scale. However, it may be caused partly by differing
meteorological variability in the different .seasons, and not simply by the change
in mean levels. This issue may be addressed by exploring the dependence of
the magnitude of the residuals on the fitted values and the season. Figure 19
shows the regression surface that results from using lowess to fit such a model
nonparametrically. It appears that there is a strong relationship between the
magnitude of the residuals and the fined values early in the season, but that it
becomes weaker as the season progresses. The decrease in residual magnitude at
high fitted values late in the season is presumably spurious, there being relatively
-------
Preliminary Report
9 June, 1993
37
Table 6: Root mean square residual and one day lagged correlation coefficient, by
month
Month Rms Number in rms Correlation Number in correlation
4
5
6
7
8
9
10
6.492
7.203
9.205
10.404
8.975
7.467
6.515
317
336
318
331
331
318
332
0.28475
0.25152
0.06903
0.06727
0.22361
0.42167
0.28397
295
320
298
312
315
298
313
Figure 19: Nonparametric regression surface for magnitude of residuals against
season and fitted value.
-------
Preliminary Report 9 June, 1993 38
few high fitted values at these dates.
It should be noted that the adjusted standard errors tabulated in Section 5.5
(page 34) are valid in the presence of heteroscedasticity and serial correlation of
the form displayed in this section. In particular, they remain valid when the serial
correlations are not stationary, provided they are short term in nature. Standard er-
rors that are valid for longer term correlation are discussed in Section 6.5 (page 46).
However, ordinary least squares parameter estimates may be inefficient in these
cases. Efficiency can be partially restored by reexpressing the ozone concentra-
tions on a variance-stabilizing scale. In the present case, Figure 19 suggests that
the magnitude of the residuals is roughly proportional to the fitted values, mean-
ing that the logarithms of ozone concentrations would have more nearly constant
variance.
It must be recognized, however, that the model is only an empirical approxima-
tion to the actual physical/chemical mechanism whereby meteorological variables
influence ozone concentrations. When a fitting procedure such as least squares is
used to fit the model, it produces estimates of the values of the parameters that
make the empirical model as close as possible to the actual mechanism, in the least
squares sense. If the model were fitted differently, for instance by least squares
on the logarithmic scale, the fitted parameters would be estimates of different
parameter values, namely those that make the model best approximate the actual
mechanism on the new scale. In the case of a logarithmic reexpression, the effect is
to estimate a model that fits the data better at low concentrations, but worse at high
concentrations. In other words, rexpression may produce more efficient parameter
estimates, but they may be estimates of less appropriate parameter values. It is
for this reason that all the fitting described in this report has been carried out on
the original scale, with standard errors computed in a way that makes them valid
in the face of heteroscedasticity and serial correlation, rather than on reexpressed
data.
6.2 Predicted and adjusted percentiles
One aspect of model performance is how well it predicts the highest levels of
ozone. To address this question the model predictions of 95% points by season
were calculated. The model prediction for a given day consists of a probability
distribution, whose mean is the predicted value for that day. The distribution was
taken to be the empirical distribution of the residuals from the model (6), centered
at the predicted value. These prediction distributions were averaged within years
-------
Preliminary Report
9 June, 1993
39
(actually within ozone seasons, April-October of each year). Figure 20 shows the
actual 95th percentiles and those of the yearly averaged prediction distributions.
The model percentiles track the actual percentiles well, with the largest deviation
occurring in 1987. Cox and Chu (1992) carried out a similar exercise for the
network maximum values rather the network typical value, as here. Their Figure 3
shows similar model ability to track the observed 95th percentile, but with some-
what poorer agreement, presumably reflecting the noisier character of the network
maximum versus the network typical value.
To explore the effect of the residual distribution on the prediction of per-
centiles, the calculation of predicted percentiles was repeated using a Gaussian
shape for the prediction distribution with a standard deviation of 8.195 ppb. This
gave essentially the same predicted quantiles, presumably because of the close
agreement between the percent points of the distribution of the residuals and those
of a Gaussian distribution, up to the 97.5% level.
One of the major uses of a model such as that discussed above is to allow for
the effect of year-to-year variations in the explanatory variables. Adjustment of
overall trends in ozone levels was described in Section 5.4. An individual day's
ozone value may also be adjusted, by adding the predicted value for an adjusted
set of meteorological variables to the residual for the actual day. Figure 21
Actual
Predicted
8
* 8 I
1981
1983
1985
1987
1989
1991
Figure 20: Actual and predicted 95th percentiles of ozone by year.
-------
Preliminary Report
9 June, 1993
40
shows the 95th percentiles by year of such adjusted ozone values, as well as the
actual percentiles. As in Cox and Chu (1992), the meteorological variables in the
model (6) were adjusted linearly season by season, so that the mean and variance
from June to September matched those for all 11 years combined.
The adjusted percentiles show somewhat less year-to-year variation than the
actual percentiles, indicating that much of the variation was associated with me-
teorological variability. Again, one year is exceptional: in 1987 the adjusted
percentile is very close to the actual value. However, since the 95th percentile is
determined by the values of the largest 10 or so residuals, the exceptional behavior
in 1987 may be due simply to sampling variability.
Cox and Chu (1992) also constructed adjusted ozone percentiles, for the net-
work maximum value. Their Figure 6 shows much less variability around a
downward linear trend than does our Figure 21, despite the fact that Cox and
Chu's calculation is for the 99th percentile of the network maximum, rather than
the 95th percentile of the network typical value. However, their construction does
not involve the observed residuals, as was done here, but is closer to the calculation
of predicted percentiles for the adjusted meteorology.
Actual
Adjusted
o
CO
p
1981
1983
1985
1987
1989
1991
Figure 21: Actual and adjusted 95th percentiles of ozone by year.
-------
Preliminary Report
9 June, 1993
41
6.3 Interpretation of model components
The parameters of the model (6) have various interpretations. As always, these
interpretations are only suggestive of actual physical or chemical causes of ozone
variations.
rauO : Predicted ozone level when temperature x wind speed interaction term is
zero (that is, high wind speed or temperature = zero of polynomial, around
60°F), and all other variables are at their centering values. 40 ppb.
tO, tl, t2, t3, til, t!2 : Coefficients of a cubic polynomial in max-
imum temperature and lagged temperature, which added to muO gives the
predicted ozone level for a given temperature at zero wind speed, and all
other variables at their centering values. The polynomial plus muO is shown
in Figure 22 (horizontal line indicates muO). In constructing the figure,
lagged temperatures were taken as equal to same day temperature.
vh, vh700, vhl : Critical speeds for surf ace wind, 700mb wind, and lagged
surface wind, respectively, at which wind speed factor drops to one half,
with the other wind speeds at zero. 6 m/s, 14 m/s, 5 m/s.
o
10
40 50 60 70 80 90 100
degF
Figure 22: Fitted polynomial effect of temperature at zero wind speed.
-------
Preliminary Report
9 June, 1993
42
r, rl : Effects of relative humidity and lagged relative humidity. -0.003%~l,
-Q.002%-1. The factor drops from 1.27 to 0.73 as both humidities rise from
0% to 100%.
op : Effect of opaque cloud cover. -.0007%~I. The factor drops from 1.03 to
0.97 as opaque cloud cover rises from 0% to 100%.
v : Effect of visibility. -.008km"1. The factor drops from 1.09 to 0.91 as
visibility rises from Okm to 24km.
m. u , m. v : The relative effect of the 24hr mean wind vector is its vector (inner)
product with the vector (m.u,m.v) = (0.0052,0.0010)(m/s)"!. Maximum
when wind is from the south. The effect of a 5 m/s wind varies from 1.03 to
0.97 as the wind direction changes from south to north.
y : Trend parameter, -O.OC^yr"1 = -2.7%/decade.
al, bl, a2, b2 : Coefficients of the annual and semiannual cosines and
sines. The seasonal term is shown in Figure 23. The location of the
maximum at May 1 rather than at April 1 may reflect the small number
of terms in the Fourier series rather than a true increase from April 1 to
i
Apr
100
May Jun
Jul
Aug Sep
150
200
day of year
250
Oct
300
Figure 23: Fitted seasonal effect
-------
Preliminary Report 9 June, 1993 43
April 30. The fitted values at the start of the ozone season are influenced
by the observed values at the end of the season, because of the periodicity
of this component However, a similar rise through the month of April is
perceptible in Figure 16 (page 31).
The need for a seasonal term in the model indicates that there are other de-
terminants of surface ozone concentrations than those incorporated in the model,
with systematic seasonal variations. The observation that the seasonal term fits
better in an additive form rather than as a further multiplicative factor suggests
that it reflects variations in a source of ozone rather than in precursors of ozone.
However, the difference in quality of fit was not clear cut (R* of 0.8037 rather
than 0.7982), and does not amount to more than a suggestion.
Altshuller (1987) reviewed information on background tropospheric ozone
concentrations, and stated that during the warmer months it appears to be in the
range 10-20 ppb. Although Altshuller did not quantify the seasonal variability
in background tropospheric ozone, springtime maxima in oxidants and ozone in
particular have been observed by Wakamatsu, Uno, Ueda and Uehara (1989),
for example. Thus the effect shown in Figure 23 is qualitatively consistent with
seasonal variations in background tropospheric ozone concentrations.
The model (6) contains no term that may be interpreted as an estimate of
the average background ozone level. The coefficient muO is independent of
temperature and wind speed, but still multiplies the factors associated with the
remaining meteorological variables. An intercept may be added to the model by
extending the "seasonal" part to
mul -f al * cos(2 * pi * year) + bl * sin(2 * pi * year)
+ a2 * cos(4 * pi * year) + b2 * sin(4 * pi * year)
(recall that in (6) the cosine and sine terms had means of zero, by construction).
The fined value of mul is 10.5 ppb, and the seasonal coefficients are essentially
unchanged. Thus the background level of ozone is estimated as the curve in
Figure 23 (page 42), offset by+10.5 ppb, in reasonable agreement with Altshuller's
statement. However, the standard error of this estimate is 5.6 ppb, and it is therefore
poorly quantified. The estimates of the coefficients muO and mul are highly
(and negatively) correlated; their sum is well estimated, but neither coefficient
is individually. In particular the estimated value of mul is barely significantly
different from zero, and therefore this extended seasonal model has not been used
elsewhere.
-------
Preliminary Report 9 June, 1993 44
6.4 Cross validation
As with any model containing many parameters, the possibility exists that (6) is
over-parameterized. This may be explored by refitting the model to subsets of the
data, and studying
the quality of predictions from the refit model to the remainder of the data
(cross validation), and
the variability of the parameter estimates in the refit models (jackknifing).
In the present case, where the focus is on the possible effects of interannual
variations in conditions, it is appropriate to construct the subsets based on years
(or "ozone seasons"). This was done by leaving out one year at a time.
Table 7 gives some results. The line for a given year gives the number of
days of data used in the model for that year, and the mean and root mean square
error when the model refit to the remainder of the data is used to predict that
year. This statistic is a grouped version of the PRESS statistic discussed by Cook
Table 7: Predicted residual mean and root mean square by year.
Year Number of days Mean Root mean square
1981
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
204
208
210
208
209
206
209
212
208
204
205
-0.1860
-1.2221
-0.6555
-0.1020
0.7356
-0.4128
1.4206
-0.7896
2.3823
0.9323
-2.9085
8.878
8.371
9.776
8.490
7.629
8.371
8.628
7.162
8.240
7.372
8.088
-------
Preliminary Report 9 June, 1993 45
and Weisberg (1982, Section 2.2.3). The overall root mean square of the cross-
validated prediction errors is 8.302 ppb, which compares very favorably with the
root mean square residual of the fit to all the data, 8.195 ppb.
There is some variability from year to year in Table 7. However, it is not
clear how it relates to the characteristics of either ozone levels or meteorology.
For instance, both 1983 and 1988 experienced high ozone levels (see Figure 20,
page 39, for example), and yet 1983 shows the highest root mean square error and
1988 the lowest Note also that 1987, while showing the poorest fit in the predicted
95th percentile (see the same figure), has a mean squared error that is exceeded
in two other years. The possibility that these variations were caused solely by
sampling error was explored by comparing the values in the second column of
Table 7 with quantiles of a reference distribution. Some care has to be taken in
constructing the reference distribution, because of the following features of the
context
There is variability in sample size from year to year.
The residuals from the fitted model are noticeably nonGaussian (see Fig-
ure 18, page 36).
Each line in Table 7 summarizes an entire season, and the residuals do not
have the same distribution across the season (see Figure 19, page 37).
To allow for these features, the reference distribution was constructed by
bootstrap methods (Efron, 1979; Wu, 1986), in the following steps.
1. Select a season length randomly from the first column of Table 7.
2. Construct a random season as a sample of the dates from April 1 to Octo-
ber 31, without replacement, with size equal to the season length selected in
step 1.
3. For each date in the random season, choose a residual at random from the
population of residuals labeled by that date. Note that there are at most 11
such residuals.
4. Calculate the standard deviation of the resulting random season of residuals.
This procedure was carried out 250 times to produce simulation estimates of
the quantiles of the reference distribution. It should be noted that the reference
-------
Preliminary Report
9 June, 1993
46
distribution was constructed from the residues from the model when fitted to all
of the data, rather than from "leave one out" predicted residuals. However, these
were very similar in overall magnitude, and the substitution should not cause a
noticeable bias.
Figure 24 shows the ordered values from the second column of Table 7,
graphed against the bootstrap quantiles. The graph suggests that at most the
highest observed value, that for 1983, is more extreme than could be explained by
sampling variability, and by only around 0.5 ppb.
6.5 Jackknifing
The "leave out one year" parameter estimates discussed in Section 6.4 may be
used to construct "pseudo-values" (Tukey, 1958) from
pseudo-value = Y(estimate from all data) - (Y l)(leave one out estimate),
where Y is the number of years, 11. The pseudo-values may then be treated
as a sample of values estimating the given parameter. Their mean is used as
a less biased "jackknifed" estimate of the parameter, and the standard error of
7.5 ' 8 J 9.5
Bootstrap quantile
Figure 24: Quantile-quantile plot of predicted root mean square residual by year,
against a bootstrap reference distribution.
-------
Preliminary Report 9 June, 1993 47
the mean provides a standard error for either the original parameter estimate or
the jackknifed estimate. This standard error is valid in the presence of arbitrary
variances and covariances of the model residuals, provided only that they are
uncorrelated across years. Miller (1974) discussed the use of the jackknife for
linear regression models, and proposed its use also for nonlinear models. Hinkley
(1977) showed that in general a weighted jackknife performs better for unbalanced
situations such as regression models, and other modifications were suggested by
Wu (1986) for both linear and nonlinear models. However, in the present case
each year should be reasonably similar to the others, which suggests that the extra
complication of the weighted jackknife outweighs any performance advantage.
The jacknifed estimates and standard errors are shown in Table 8. The jack-
knifed parameter estimates differ very little from the overall values shown in
Table 5 (page 34). The jackknifed standard errors are also generally similar to
those in Table 5, the most notable difference being the increase in the standard
error of the trend coefficient from 1.4 %/decade to 3.4 %/decade. The associated
change in the \t(-statistic from nearly 2 (allowing for heteroscedasticity and short-
term correlation) to less than 1 suggests that the information about trend in the
data is overstated in the standard errors in Table 5. However, the standard errors
in Table 8 are based on only 10 degrees of freedom, and the effects of sampling
variations must not be neglected
Figure 25 shows the ratios of the two sets of standard errors, ordered and
graphed against the quantiles of the x distribution with 10 degrees of freedom. If
there were no long-term or interannual effects, and if the dispersion matrix of the
parameter estimates were diagonal, the points would be expected to lie close to
the solid diagonal line. Correlations among the parameter estimates would tend
to reduce the slope of the points, but this effect is expected to be small. The
highest point in the figure, corresponding to the trend coefficient, falls far enough
above the line to preclude the possibility that it is the result of sampling variability.
The jackknife standard error is therefore preferred for inferences about the trend
coefficient
The dotted line in Figure 25 has a slope of 1.2, and provides a good compromise
match to the remaining points. This suggests that the jackknife standard errors are
estimating values up to 20% larger than the adjusted standard errors of Table 5. The
adjusted standard errors multiplied by 1.2 should therefore give rise to conservative
inferences about all parameters other than the trend, at the same time being less
sensitive to sampling variability than the jackknife standard errors.
-------
Preliminary Report
9 June, 1993
48
Table 8: Jackknifed estimates and standard errors.
Coefft.
muO
to
tl
t2
t3
til
t!2
vh
vh700
vhl
r
rl
op
V
m.u
m.v
y
al
bl
a2
b2
Fitted Value Standard error
3.980e-K)l
-1.042e+01
3.050e+00
9.606e-02
-1.557e-03
-2.382e-02
-8.322e-01
6.018e-KX)
1.353e-f01
4.679e+00
-3.385e-03
-2.046e-03
-6.951e-04
-7.640e-03
5.284e-03
9.919e-04
-2.646e-03
-8.024e+00
4.084e+00
-2.684e-fOO
-1.129e-KX)
1.6932387
8.0915148
0.7471219
0.0267512
0.0003707
0.2129552
0'.3544919
. 2.0013063
4.0252847
1.7690150
0.0002265
0.0003206
0.0001320
0.0009879
0.0011824
0.0015540
0.0034115
1.2199726
0.5528235
0.4108871
0.3694495
t Value
23.5035
-1.2881
4.0830
3.5908
-4.2002
-0.1119
-2.3477
3.0071
3.3612
2.6447
-14.9437
-6.3810
-5.2658
-7.7332
4.4688
0.6383
-0.7757
-6.5774
7.3876
-6.5331
-3.0560
-------
Preliminary Report
9 June, 1993
49
00
8 2
V)
d
0.5 1.5
Chiquamile
Figure 25: Quantile-quantile plot of ratio of jackknifed standard errors to adjusted
standard errors, against the x distribution with 10 degrees of freedom, with lines
of slope 1 and 1.2.
6.6 Network maximum
The modeling effort described above dealt with only the daily network typical
value. The median polish analysis that gave rise to these network typical values
was also used to construct a network maximum. Fitting the same form of model
to the network maximum gave R2 = 0.6731 and a root mean square residual of
15.611 ppb. The fitted coefficients are shown in Table 9. The major changes from
the model for the network typical value are as follows.
muO : Increased from 40 ppb to 59 ppb.
tOf tl, t2, t3, til, t!2 : The same day temperature coefficients are
generally larger. This is to be expected, as they are absolute effects. How-
ever, the lagged temperature effects are smaller, and both are now non-
significant. The polynomial plus muO is shown in Figure 26 (horizontal line
indicates muO). As in constructing Figure 22, lagged temperatures were
taken as equal to same day temperature.
vh, vh700, vhl : All critical wind speeds are lower, meaning that the effects
of high temperature are moderated more rapidly by increasing wind speed
-------
Preliminary Report
9 June, 1993
50
Table 9: Coefficients in the fitted model for the network maximum, with standard
errors computed conventionally and adjusted for heteroscedastitity and serial
correlation.
Coefft. Fitted Value Conventional Adjusted
Standard error t Value Standard error t Value
muO
to
tl
t2
t3
til
t!2
vh
vh700
vhl
r
rl
op
V
m.u
m.v
y
al
bl
a2
b2
5.8516401
-8.222e+00
5.540e+00
1.626e-01
-1.615e-03
-1.019e-01
-7.087e-01
2.908e400
1.167e-K)l
2.965e+00
-1.796e-03
-2.048e-03
-6.176e-04
-7.002e-03
2.574e-03
-1.328e-02
-9.492e-03
-4.2356400
6.380e+00
-1.441e-fOO
2,531e-01
2.7046400
1.526e401
1.370e400
3.900e-02
6.316e-04
4.806e-01
4.170e-01
8.012e-01
3.791C-KX)
1. 013e-KX)
4.579e-04
4.456e-04
1.674e-04
8.283e-04
1.863e-03
1.916e-03
1.441e-03
2.6886400
9.344e-01
1.240e-K)0
9.072e-01
21.6400
-0.5389
4.0449
4.1699
-2.5578
-0.2120
-1.6994
3.6292
3.0774
2.9261
-3.9214
-4.5967
3.6893
-8.4542
1.3814
6.9274
-6.5882
1.5758
6.8283
1.1623
0.2790
2.817e400
1.510e401
1.499e400
4.318e-02
6.969e-04
4.854e-01
4.568e-01
9.122e-01
4.343e400
1.249e400
5.258e-04
5.315e-04
1.783e-04
9.134e-04
1.915e-03
2.089e-03
1.828e-03
2.912e400
1.0916400
1.4016400
1.0756400
20.7682
-0.5444
3.6964
3.7664
-2.3179
-0.2099
-1.5514
3.1873
2.6863
2.3746
-3.4150
-3.8537
-3.4647
-7.6662
1.3438
-6.3554
-5.1939
-1.4543
5.8466
-1.0289
0.2354
-------
Preliminary Report
9 June, 1993
51
f
8
en
8
40
50
60
70
degF
SO
90
100
Figure 26: Fitted polynomial effect of temperature at zero wind speed for network
maximum. Dotted line shows values for network typical value, shown in Figure 22.
for the network maximum than for the network typical value.
T, rl, op, v : The proportional effects of relative humidity, opaque cloud
cover, and visibility are all lower for the network maximum than for the
network typical value.
m. u, m. v : The proportional impact of the wind vector is now greatest when
winds are from the north, and is nearly three times as large as for the network
typical value.
y : The estimated trend parameter for the network maximum is -O.OOPSyr"1 =
9.5%/decade, notably larger than for the network typical value. It is ap-
parently statistically significant, with respect to the standard errors adjusted
for heteroscedasticity and serial correlation. Jackknife standard errors have
not been computed, but if the same ratio were found as for the trend in the
network typical value, the jackknife standard error for the trend would be
4.8%/decade, and the trend would be barely significant
al, bl, a2, b2 : The seasonal term is shown in Figure 27, together with the
corresponding curve for the network typical value. The curves are essentially
-------
Preliminary Report
9 June, 1993
52
Apr May Jun Jul
Ang Sep Oct
v»
f
100
150
200
day of year
250
300
Figure 27: Fitted seasonal effect for network maximum. Dotted line shows values
for network typical value, shown in Rgure 23.
I
identical, the largest difference being the somewhat earlier maximum. The
close agreement between the seasonal curves contrasts sharply with the
difference between the temperature curves (Figure 26), and tends to support
the interpretation of this part of the model as the effect of seasonal variations
in background ozone concentrations.
Actual, predicted, and adjusted 95th percentiles for the network maximum,
corresponding to those in Figures 20 and 21 (pages 39 and 40), are shown in
Figures 28 and 29. These figures show similar performance to those for the
network typical value. The relatively good performance in 1987 suggests that the
problems in the prediction and adjustment of the 95th percentile for the network
typical value were due to chance effects.
The results of this section may be compared with those of Cox and Chu
(1992), who fitted a Weibull probability model to the same network maximum
time series. In the model the logarithm of the Weibull scale parameter was
represented as a linear function of selected meteorological variables and linear
trend. Parameters were fitted by maximum likelihood. The trend was estimated as
0.0054/yr = -5.4%/decade with a standard error (computed by bootstrapping
3 day sequences) of 3.7%/decade. The difference between Cox and Chu's estimate
-------
Preliminary Report 9 June, 1993
53
Actual
Predicted
s
1981 1983 1985 1987 1989 1991
Figure 28: Actual and predicted 95th percentiles of network maximum ozone by
year.
Actual
Adjusted
1981 1983 198S 1987 1989 1991
Figure 29: Actual and adjusted 95th percentiles of network maximum ozone by
year.
-------
Preliminary Report 9 June, 1993 54
and the one obtained from the present model is around one standard error. They
are therefore reasonably consistent with each other, given the differences between
the models and the fitting criteria.
7 Conclusions
The daily maximum one-hour average surface ozone concentrations from the 45
stations were highly correlated. A 16-station subnetwork with essentially the same
spatial coverage as the whole network showed a dominant principal component that
accounted for 78% of the variance, and the corresponding time series was nearly
perfectly correlated (p = 0.99) with a simple "typical" value for the network,
obtained by median polish.
The network typical value time series shows nonlinear and nonadditive depen-
dence on various meteorological quantities, including individual measurements
and constructs (averages). There is also sarong seasonal dependence, even after
allowing for the effects of the meteorological variables. The dependence can
be approximated by a nonlinear parametric model, and when the parameters are
fitted by least squares, the model accounts for 80% of the variance of the ozone
concentration data. The root mean square residual is 8.2 ppb. The model may be
extended to include a trend parameter, which is estimated to be -2.7%/decade,
with a (jackknife) standard error of 3.7%/decade. This represents an estimate of
trend adjusted for meteorological variability. If the meteorological variables are
omitted, the model contains just seasonaliry and trend, and the trend estimate is
found to be +5.3%/decade, representing the unadjusted trend in the surface ozone
concentrations.
References
Altshuller, A. P. (1987). 'Estimation of the natural background of ozone present
at sruface rural locations', Journal of the Air Pollution Control Association
37,1409-1417.
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988). The New S Language,
Advanced Books and Software. Pacific Grove, California: Wadsworth.
-------
Preliminary Report 9 June, 1993 55
Bloomfield, P. and Steiger, W. L. (1983). Least Absolute Deviations: Theory,
Applications, and Algorithms, Progress in Probability and Statistics. Boston:
BirkhSuser.
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Pacific Grove,
CA: Wadsworth.
Cleveland, W. S. (1979). 'Robust locally weighted regression and smoothing
scatterplots', Journal of the American Statistical Association 74(368), 829-
836.
Comrie, A. C. and Yarnal, B. (1992). 'Relationships between synoptic-scale
atmospheric circulation and ozone concentrations in metropolitan Pittsburgh,
Pennsylvania', Atmospheric Environment 26B, 301-312.
Cook, R. D. and Weisberg, S. (1982). Residuals and Influence in Regression,
Monographs on Statistics and Applied Probability. New York: Chapman
and Hall.
Cox, W. M. (1992). Personal communication.
Cox, W. M. and Chu, S.-H. (1992). Meteorologically adjusted ozone trends in
urban areas: A probability approach, U.S. Environmental Protection Agency,
Technical Support Division MD-14, Research Triangle Park, NC 27711.
Efron, B. (1979). 'Bootstrap methods: Another look at the jackknife', Annals of
Statistics 7,1-26.
Feister, U. and Balzer, K. (1991). 'Surface ozone and meteorological predictors
on a subregional scale', Atmospheric Environment 25A, 1781-1790.
Gallant, A. R. (1987). Nonlinear Statistical Models. New York: Wiley.
Hinkley, D. V. (1977). 'Jackknifing in unbalanced situations', Technometrics
19,285-292.
Kelly, N. A., Ferman, M. A. and Wolff, G. T. (1986). 'The chemical and me-
teorological conditions associated with high and low ozone concentrations
in southeastern michigan and nearby areas of Ontario', Journal of the Air
Pollution Control Association 36,150-158.
-------
Preliminary Report 9 June, 1993 56
Miller, R. G. (1974). 'The jackknifea review', Biometrika 61,1-15.
National Research Council (1991). Rethinking the Ozone Problem in Urban and
Regional Air Pollution. Washington, DC: National Academy Press.
TWcey, J. W. (1958). 'Bias and confidence in not quite large samples (abstract)',
Annals of Mathematical Statistics 29,614.
Tukey, J. W. (1977). Exploratory Data Analysis. Reading, Massachusetts:
Addison-Wesley.
Wakamatsu, S., Uno, I., Ueda, H. and Uehara, K. (1989). 'Observational study
of stratospheric ozone intrusions into the lower troposphere', Atmospheric
Environment 23,1815-1826.
Wu, C. F. J. (1986). 'Jackknife, bootstrap and other resampling methods in
regression analysis'', Annals of Statistics 14,1261-1295.
------- |