ional Institute of Statistical  Sciences
      Accounting for Meteorological Effects in
     Measuring Urban Ozone Levels and Trends
            Peter Bloomfield  Andy Royle
                  Qing Yang

-------
 Accounting for Meteorological Effects in
Measuring Urban Ozone Levels and Trends
         Peter Bloomfield  Andy Royle
                 Qing Yang
             Technical Report # 1
                 June, 1993
      National Institute of Statistical Sciences
               P.O. Box 14162
       Research Triangle Park, N. C. 27709

-------
     Accounting for Meteorological Effects in

   Measuring Urban Ozone Levels and Trends*


                Peter Bloomfield       Andy Royle
                             Qing Yang

              National Institute of Statistical Sciences

                                and

                      Department of Statistics
                  North Carolina State University

                  Preliminary Report: 9 June, 1993


                               Abstract

        Surface ozone levels are determined by the strengths of sources and
     precursor emissions, and by the meteorological conditions. Observed ozone
     concentrations are valuable indicators of possible health and environmental
     impacts. However, they are also used to monitor changes and trends in the
     sources of ozone and of its precursors, and for this purpose the influence
     of meteorological variables is a confounding factor. This report describes
     a study of ozone concentrations and meteorology in the Chicago area. The
     data are described using a variety of exploratory methods, including median
     polish and principal components analysis. The key relationships observed in
     these analyses are then used to construct a model relating ozone to meteorol-
     ogy. The model can be used to estimate that part of the trend in ozone levels
     that cannot be accounted for by trends in meteorology, and to "adjust" ob-
     served ozone concentrations for anomalous weather conditions. The model
   'Research supported by the U.S. Environmental Protection Agency under Cooperative Agree-
ment #CR819638-01-0.

-------
Preliminary Report             9 June, 1993                             2

     parameters are estimated by nonlinear least squares. Its goodness of fit is
     assessed by comparison with nonparaxnetric regression results Gowess).
        Keywords: Ozone concentration, meteorological adjustment, nonlinear
     regression, nonparametric regression.


Contents

1   Introduction                                                      4

2   Description of Data                                                5
   2.1   Ozone data	   5
   2.2   Meteorological data	   9

3   Preliminary Analyses                                             11
   3.1   Diurnal cycles   	11
   32   Imputation of missing values	13

4   Summarizing the Network                                        15
   4.1   Principal components analysis   	15
   4.2   Network average	23

5   Modeling Ozone Concentrations                                   23
   5.1   Surface meteorology   	23
   5.2   Upper air meteorology  	29
   5.3   Lagged meteorological variables	30
   5.4   Seasonal structure and trend	30
   5.5   Fitted coefficients and standard errors	33

6   Discussion of the Model                                           35
   6.1   Quality of the fitted model	35
   62   Predicted and adjusted perccntiles	38
   6.3   Interpretation of model components	41
   6.4   Cross validation	44
   6.5   Jackknifing	46
   6.6   Network maximum	49

7   Conclusions                                                     54

-------
Preliminary Report             9 June, 1993                            3


List of Tables

   1    The ozone monitoring stations	   6
   2    Meteorological variables and station locations	   9
   3    Results of principal components analysis for 16 stations	17
   4    Station loadings for the first five principal components	18
   5    Coefficients in the fitted model	34
   6    Root mean square residual and one day lagged correlation coeffi-
        cient, by month  	37
   7    Predicted residual mean and root mean square by year.	44
   8    Jackknifed estimates and standard errors	48
   9    Coefficients in the fitted model for the network maximum	50


List of Figures

   1    Locations of the ozone monitoring stations	   8
   2    Locations of the surface and upper air weather stations	10
   3    Daily typical values for station 170317002	,	12
   4    Diurnal cycle for  station 170317002	  12
   5    Imputation of 2p.m. observations for station 170317002	13
   6    Quantile-quantile plot of imputed 2 p.m. observations for station
        170317002 against valid 2 p.m. observations	  14
   7    Locations of the 16 stations used in principal components analysis.  16
   8    Station loadings for component 1	  19
   9    Station loadings for component 2	20
   10  Station loadings for component 3	21
   11  Station loadings for component 4	22
   12  Scatter plots of ozone against temperature, wind speed, relative
        humidity, and cloud cover.	25
   13  Scatter plots of ozone against wind direction and barometric pressure. 26
   14  Nonparametric regression  surface for ozone against temperature
        (maximum from 09:00 to 18:00) and noon relative humidity.  ...  26
   15  Nonparametric regression  surface for ozone against temperature
        (maximum from 09:00 to 18:00) and noon wind speed.	27
   16  Residuals from meteorological model against day of year.	31

-------
Preliminary Report             9 June, 1993                             4

    17   Nonparametric regression surface for ozone against season and
         the fit from the nonseasonal model.  .	32
    18   Quantile-quantile plot of the residuals from the fitted model against
         the Gaussian distribution	36
    19   Nonparametric regression surface for magnitude of residuals against
         season and fitted value	37
    20   Actual and predicted 95th percentiles of ozone by year.	39
    21   Actual and adjusted 95th percentiles of ozone by year.  	40
    22   Fitted polynomial effect of temperature at zero wind speed. ....  41
    23   Fitted seasonal effect	42
    24   Quantile-quantile plot of predicted root mean square residual by
         year, against a bootstrap reference distribution	46
    25   Quantile-quantile plot of ratio of jackknifed standard errors to
         adjusted standard errors.	49
    26   Fitted polynomial effect of temperature at zero wind speed for
         network maximum	51
    27   Fitted seasonal effect for network maximum	52
    28   Actual and predicted 95th percentiles of network maximum ozone
         by year.  	53
    29   Actual and adjusted 95th percentiles of network maximum ozone
         by year.  	53


 1    Introduction

 Current knowledge about trends in surface ozone concentrations has been reviewed
 recently by the National Research Council (1991, Chapter 2), who divided efforts
 to quantify the impact of meteorology on ozone into those based on classificatory
 methods, and those based on regression methods. The former have the advantage
 of nonparametric flexibility, but are essentially uninterpretable.  The latter give
 interpretable models, but are heavily tied to linear models. Cox and Chu (1992)
 used a generalized linear model approach, introducing interaction between two key
 meteorological variables (temperature and wind speed) by including their product
    The modeling effort described here was based on ozone  and meteorological
 data from the Chicago area (Cox,  1992). Graphical displays and nonparametric
 modeling, described in detail in Section 5, were used to confirm the identities of
 the key meteorological variables and their interactions, and to choose functional

-------
Preliminary Report             9 June, 1993                             5


forms for modeling the dependence of ozone. Nonlinear least squares methods
were then used to fit a model incorporating these variables and interactions.
   When the model is extended to include a dependence on year, the associated
coefficient may be interpreted as an estimate of proportional trend in ozone,
allowing for changes in the meteorological variables. The model may also be used
to construct "adjusted" ozone levels, that is, estimates of what the concentration of
ozone would have been on a given day or in a given year, had the meteorological
conditions been different.
   The development of the model is driven  by empirical study of the  associa-
tion of ozone levels with the various meteorological variables.  This approach
complements that used by Comrie and Yamal  (1992), who constructed a synoptic
climatology of atmospheric circulation and ozone concentrations in Pittsburgh,
Pennsylvania. Kelly, Ferman and Wolff (1986) similarly studied the meteorologi-
cal conditions associated with high and low ozone days in Detroit, Michigan. The
conclusions of such studies provide considerable guidance in the construction of
models such as that developed here.
   Feister and Balzer (1991) have also studied the dependence of ozone concen-
trations on meteorology, but with a view  towards short-term forecasting. Their
model includes the previous day's ozone concentration, and they find  that this
is the most important variable. Previous values of ozone have not been used in
the present study, as their incorporation in a model makes its use for adjustment
problematic.


2    Description of Data

2.1    Ozone data

The ozone data consisted of hourly averages at the 45 stations described in Table 1.
The locations of the ozone monitoring stations are shown in Figure 1.
   Most of the 45 stations recorded data only during the summer months, al-
though some were operated essentially year-round.  In all the analyses described
subsequently, data were limited to the ozone "season" of 1  April to 31 October.

-------
Preliminary Report
9 June, 1993
Table 1:  The ozone monitoring stations.  "AIRS" is the EPA air quality data
base. "MSA" is the Metropolitan Statistical Area identifier for the station location.
Dates of first and last observations are given in "yymmdd" form.  Codes are:
R-residential, I-industrial, C-commercial, A-agricultural, M-mobile; S-suburban,
U-urban, R-ruraL
 AIRS Site ID  Latitude  Longitude  MSA  State   First and Last Dates  Code
170310001
170310003
170310009
170310032
170310037
170310038
170310044
170310045
170310050
170310053
170310062
170310063
170310064
170311002
170311003
170311601
170312301
170313005
170314002
41.669
42.061
41.747
41.757
41.979
41.705
41.790
41.923
41.708
41.739
41.917
41.877
41.790
41.616
41.984
41.668
41.874
42.060
41.855
87.733
88.003
87.732
87.545
87.669
87.632
87.583
87.634
87.568
87.726
87.573
87.634
87.602
87.558
87.792
87.990
87.849
87.753
87.753
1600
1600
1600
1600
1600
1600
1600
1600
1600
1600
1600
1600
1600
1600
1600
1600
1600
1600
1600
EL
EL
IL
IL
EL
EL
EL
EL
EL
EL
EL
EL
EL
EL
EL
EL
EL
EL
IL
880612
810206
810331
870331
850723
810331
811001
810101
820101
821130
890516
910729
890701
810101
810331
830407
810101
810101
830217
911030
821216
820526
911031
911231
810912
861112
860105
911231
870914
890727
911231
911031
901031
911030
911030
820831
820831
911031
RS
—
RU
IU
RU
RS
RS
CU
IS
RS
AS
MU
RS
RS
MU
IS
cs
cs
RS
                                                     continued on next page

-------
Preliminary Report
9 June, 1993
                           Table 1 (continued)
AIRS Site ID
170314003
170315001
170316002
170317002
170318003
170431002
170436001
170890005
170970001
170971002
170971003
170973001
171110001
171971007
171971008
180890011
180891016
180892001
180892002
180892008
181270020
181270021
181270024
181270903
181271004
181271005
Latitude
42.039
41.514
NA
42.062
41.631
41.856
41.813
42.050
42.177
42.391
42.446
42.290
42.221
41.278
41.592
41.627
41.600
41.600
41.631
41.639
41.631
41.563
41.617
NA
41.463
41.467
Longitude
87.898
87.645
NA
87.676
87.568
88.077
88.073
88.273
87.865
87.847
88.103
87.982
88.241
88.220
88.049
87.481
87.335
87.383
87.521
87.501
87.087
87.076
87.199
NA
87.044
87.061
MSA
1600
1600
1600
1600
1600
1600
1600
620
3965
3965
3965
3965
1600
3690
3690
2960
2960
2960
2960
2960
2960
2960
2960
2960
2960
2960
State
EL
EL
DL
IL
IL
EL
DL
EL
IL
DL
IL
IL
IL
IL
IL
IN
DM
IN
IN
IN
IN
IN
IN
IN
IN
IN
First and Last Dates
850321
810101
830701
810101
910401
810226
840208
810101
810101
810101
900606
810101
810101
810227
810101
810131
820901
810101
811002
810101
830930
810512
831115
810101
830430
810430
911031
820201
841016
911231
911031
831231
911231
911231
911231
911231
911031
911231
911231
891108
911231
820818
910930
820815
820831
910930
911031
851031
910930
810131
851018
821031
Code
RS
—
CS
RS
RS
RS
AS
RS
RS
RS
AR
RS
RS
AR
RS
CU
RU
IU
RU
CS
R
AR
RS
IS
RS
CU

-------
Preliminary Report
9 June, 1993
     3
 2"  °
 s  ^*
               88.4   88.2   88.0    87.8    87.6   87.4   872   87.0   86.8

                J	I	I	I	I	I	|	I	I
                88.4    88.2   88.0   87.8    87.6    87.4    872   87.0   86.8
                                                                              CM
                                                                              
-------
Preliminary Report
9 June, 1993
2.2   Meteorological data

Surface and upper air meteorological variables were collected at the two stations
shown in Figure 2.  The meteorological variables are described in Table 2. The
surface observations were made each hour, while the upper air soundings were
made largely at OOZ and 12Z (00:00 and 12:00 UTC, respectively, occasional
soundings at 23Z and 11Z were taken as being at OOZ and 12Z, respectively; other
soundings were ignored). The upper air measurements were made at many levels;
these always included 950mb, 850mb, 700mb, and 500mb, which were the only
levels at which the data were used.
           Table 2: Meteorological variables and station locations.
       Variable
Units     Surface   Upper  Symbol
Total cloud cover
Opaque cloud cover
Ceiling height
Barometric pressure
Temperature
Dewpoint temperature
Relative humidity
Specific humidity
Wind direction
Wind speed
Visibility
Height of pressure layer
Latitude
Longitude
%
%
m
mb
oF
°F
%
g/kg
°fromN
m/s
km
m


Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes

41.98°
87.90°




Yes
Yes
Yes
Yes
Yes
Yes

Yes
40.67°
89.68°
totcov
opcov
cht
pr
t
td
rh
q
wdir
wspd
vis
ht



-------
Preliminary Report
      9 June, 1993
10
     p
     
-------
Preliminary Report            9 June, 1993                            11

3    Preliminary Analyses

3.1    Diurnal cycles
The recorded ozone concentrations for a given station may be written as a two-way
array:
                  y.                    (1)
The decomposition was made using median polish (Tukey, 1977), as implemented
in S (Becker, Chambers andWilks, 1988, see twoway). The advantage of median
polish, which is closely related to least absolute deviations fitting of equation (1)
(Bloomfield and Steiger, 1983), is that  it is influenced more by the body of a
distribution and less by its tails than methods based on averages. Thus the "typical"
values for an hour or for a day are less affected by deviations from normal levels
that last for less than half a day.  Median polish also works well in the presence of
missing data (Tukey, 1977, Chapter 10).
    The decomposition was made on the logarithmic scale, to correspond to a mul-
tiplicative decomposition of the actual ozone concentrations. This is appropriate
when effects are expected to be proportional; for instance, when the typical diurnal
profile is expected to be scaled by the daily effect, rather than offset by it The
diagnostic plot (Tukey, 1977, Section 10F) indicated that the decomposition was
more satisfactory on the logarithmic scale.
    Figure 3 shows the fitted daily typical values, exp(/x + ay), for a station with
one of the longer records. The seasonal cycle is visible, as are the unusually high
values in  1988.  Figure 4 shows the corresponding diurnal cycle, exp(/z + fa).
This shows the characteristic behavior of urban ozone: a mid-afternoon maximum,
followed by a decay to an early morning minimum, which is sharpened by NO
scavenging during the morning rush hour, and is followed by a rise through the
late morning and the middle part of the day. This rise is caused initially by mixing
of the surface layer with the relatively ozone-rich residual layer above it, and later
by photochemical production of ozone.  For this station, the root mean square
residual in the median polish decomposition (on the logarithmic scale) was 0.62,
indicating quite large proportional variations of the observed data around the fitted
values.

-------
Preliminary Report
      9 June, 1993
                                    12
     8

     S
     o
     1*1  ••?  -
                                  •£  >
                              1  &.  &
                  I  fe
                  •3  ^•q
             1982
1984
1986
1988
1990
1992
             Figure 3: Daily typical values for station 170317002.
                                     12  ,

                                 Hour of the day
                           18
                               24
                Figure 4: Diurnal cycle for station 170317002.

-------
Preliminary Report
9 June, 1993
13
3.2   Imputation of missing values

The decomposition (1) was used to impute values for missing hourly ozone con*
centrations. If y^ is missing, but the first three terms on the right hand side of (1)
are available, then their sum provides a "fitted value" for the logarithm of the
ozone concentration at that hour. These three terms are available if there are any
data for the relevant station on the same day but at different hours, and for that
hour but on different days. At every station there was enough data to construct a
reliable estimate of the diurnal cycle terms &,&,..., £24, so this procedure gave
imputed values for all missing data other than where entire days were missing.
However, it was found that the imputed values were unreliable when there were
fewer than 2  valid hourly averages between 9:00 a.ra. and 6:00 p.m.,  and they
were not used in this situation.
   The root mean square residual in equation (1) of 0.62 means that the imputed
value typically differed from the missed value by a factor of 0.5 to 2. However,
their distribution was quite similar to the valid data. Figure 5 shows the histograms
of the valid and imputed 2 p.m. observations for the station shown in Figures 3
and 4.  Figure 6 is the corresponding quantile-quantile plot (Becker et al., 1988,
see qqplot). Here and later, the order statistics of the smaller sample are graphed
against estimated quantiles from the larger sample, obtained by interpolation and
          0     50    100    150

        2230 valid 2 pjn. observations (ppb)
                0     50    100   150

              71 imputed 2 pjn. observations (ppb)
       Figure 5: Imputation of 2 p.m. observations for station 170317002.

-------
Preliminary Report
9 June, 1993
14
                             0    40   80  120

                             Valid observation:; (ppb)
Figure 6:  Quantile-quantile plot of imputed 2 p.m. observations for station
170317002 against valid 2 pjn. observations.

by assuming that the ith order statistic in a sample of size n estimates the (i — \}jn
quantile (Becker et al., 1988, see quant lie). The distribution of the imputed
data agrees closely with  that of the valid data up to about 60 ppb, but the largest
imputed values appear to be too small. However, this affects only a handful of
days, and should lead to negligible bias in further analyses.  For instance, 26 of
the 2230 valid 2 p.m. values exceed 120 ppb, suggesting that 1 or so of the missed
values might have been in this range, while the largest imputed value was below
90  ppb.  However, the percentages of values above 120 ppb are 1.166% in the
valid data and 1.130% in the data set including the imputed values.
    A further level of imputation was used for the analyses described in follow-
ing sections, including principal components analysis (Section 4.1) and nonlinear
modeling (Section 5). Where an entire day of data was missing, but at least some
data were measured on both the preceding and succeeding days, the logarithmic-
scale daily effect a* was imputed as the average of the two neighboring values.
This is equivalent to using the geometric mean on the original scale of measure-
ment  Days with fewer than 2 valid observations between 9:00 a.m. and 6:00 p.m.
were treated in the same way: although a* was available, it was flagged as missing,
and replaced by the average of the neighboring; values, if available.  For most sta-
tions there were 10 or fewer days meeting these conditions, so this had relatively

-------
Preliminary Report            9 June, 1993                             15

little effect.


4    Summarizing the Network

4.1   Principal components analysis

The joint behavior of ozone concentrations at the various stations was explored
using principal components analysis of the daily maximum concentration.  This
was carried out computationally through the singular value decomposition of a
data matrix whose rows corresponded to days of data, and whose columns corre-
sponded to the stations. Since no missing values were allowable, an appropriate
subset of stations and days had to be found. To minimize missing data, both levels
of imputation described in Section 3.1 were used.  The matrix was constructed
sequentially, beginning with the station with the most complete data, and succes-
sively adding stations in descending order of amount of data. The first 16 stations
yielded 662 days of data, while adding the 17th reduced this to 274. The locations
of the 16 stations chosen for the analysis are shown in Figure 7. The singular
values and some related quantities are shown in Table 3. The station loadings for
the first 5 components are shown in Table 4. Figures-8 to 11 show the results of
spatial interpolation (Becker et al., 1988, see int erp) of the station loadings for
the first four components.
   The first component, accounting for 79% of the variance of  the 16 station
sub-network, is a weighted average of the station values, with weights ranging
from 0.20 to 0.32.   Stations close to Lake Michigan carry the higher weights,
while those  further  from the lake have  lower weights.   The second and  third
components account for a further 5% and 4% of variance, respectively. They
represent gradients across the network, from East to West and from North to South,
respectively. The remaining singular values are not well  separated, and most are
presumably noise. However, the fourth component, while not well distinguished
from the rest, nevertheless has some spatial interpretation as a contrast between
the interior and boundary stations. Components 2 to 4 may reflect meteorological
conditions; this will be explored elsewhere.

-------
Preliminary Report
9 June, 1993
16
               88.4   88.2   88.0   87.8    87.6   87.4   87.2    87.0   86.8
                 I	I	I	|	I	I	I	|	I
                88.4   88.2    88.0   87.8   87.6    87.4   872    87.0   86.8

                                 Longitude (degrees W)
 Figure 7: Locations of the 16 stations used in principal components analysis.
 Small dots indicate stations not used.

-------
Preliminary Report
9 June, 1993
17
       Table 3: Results of principal components analysis for 16 stations.
Singular value
1828.81
451.36
415.50
284.81
268.73
240.70
230.02
221.47
210.90
197.13
16928
161.87
156.91
144.68
118.49
97.98
Percent of variance
78.8523
4.8031
4.0703
1.9124
1.7026
1.3659
1.2474
1.1564
1.0486
0.9162
0.6756
0.6178
0.5805
0.4935
0.3310
0.2263
Cumulative percent
78.85
83.66
87.73
89.64
91.34
92.71
93.95
95.11
96.16
97.08
97.75
98.37
98.95
99.44
99.77
100.00

-------
Preliminary Report
9 June, 1993
18
        Table 4: Station loadings for the first five principal components.
Station
181270024
170436001
170311601
170311003
170314002
180891016
170311002
170310050
180892008
171971008
170970001
170973001
170971002
171110001
170890005
170317002
1
0.2786
0.2065
0.2359
0.2525
0.2512
0.2399
0.2334
0.2384
0.2574
0.2306
0.2593
0.2492
0.2959
0.2043
0.2185
0.3202
Component
234
0.3956
-0.3170
-0.3976
-0.0622
-0.0159
0.3188
0.1039
0.1691
0.2667
-0.3262
-0.0300
-0.1252
0.0772
-0.2771
-0.3230
0.2423
-0.3512
-0.1700
-02663
0.1928
0.0759
-02732
-0.1675
-0.0237
-0.2379
-0.2942
0.3581
0.3090
0.4390
0.0052
-0.0716
0.2567
-0.1620
-0.0380
0.0826
-0.5339
-0.4037
02106
-0.3765
-0.0391
0.2017
0.1316
-0.1165
0.0005
0.3147
0.1937
0.1242
0.3261
5
0.5381
0.0692
-0.0573
0.1447
-0.4021
-0.1318
-0.0154
-02313
-0.5233
0.0619
0.0287
-0.0682
-0.0893
0.0834
0.1124
0.3715

-------
Preliminary Report
9 June, 1993
19
                 88.2
 3   3
 o
                  88.2      88.0     87.8      87.6



                                 Longitude (degrees W)
                                                                           r-i


                                                                           5
                                           O

                                           
-------
Preliminary Report
    9 June, 1993
                             20
                 88.2
         87.3
87.6
87.4      87.2
      00
                  88.2
88.0      87.8       87.6




     Longitude (degrees W)
          87.4
                   Figure 9: Station loadings for component 2.
                                                                           00

-------
Preliminary Report
9 June, 1993
                   21
                                                                87.2
      (N
  g>   <=>.
 •3   <^"
      -
                  88.2
     87.8      87.6


 Longitude (degrees W)
87.4
87.2
                  Rgure 10: Station loadings for component 3.

-------
Preliminary Report
9 June, 1993
         22
                 8S.2
                                               \
                                    87.8      87.6

                                Longitude (degrees W)
                        87.4
87.2
                                                                         f-l
                                                                          00
                  Figure 11: Station loadings for component 4.

-------
Preliminary Report            9 June, 1993                            23

4.2   Network average

The principal components analysis of Section 4.1 showed that most of the variance
was associated with a single component It was based on a subset of 662 days
of data, because of the data  requirements of the method, and consequently the
principal component time series could be calculated only for the same subset of
days.  However, it was essentially a weighted average of the 16 stations, which
suggested that a similar network average, calculated for a more extensive set of
days, would provide a bener basis for the subsequent modeling.
   Median polish was used a second time to provide an outlier-resistant summary.
Both levels of imputation (Section  3.1) were used to construct daily maximum
ozone concentrations by station.  The daily maxima were written as a two-way
array:
             y
-------
Preliminary Report            9 June, 1993                             24


Other variables mentioned were wind direction, dew point temperature, sea level
pressure, and precipitation. All of these except precipitation were included in the
suite of meteorological data available for the present study (surface barometric
pressure was used in place of sea level pressure).
   Figure 12 shows scatter plots of the network average daily maximum ozone
concentration, described in Section 4.2, against four surface meteorological vari-
ables. Temperature is measured by the maximum from 9:00 aan. to 6:00 p.m.,
while the other variables are for noon. The curve overlayed on each graph is i
cubic least squares smoothing spline with 6 degrees of freedom. Rgure 13 shows
corresponding plots for surface wind direction and barometric pressure; dew point
temperature was not considered, as it is a function, albeit nonlinear, of temperature
and relative humidity, and highly correlated with temperature.
   It is evident that temperature has the strongest effect, that the effect would
be well approximated by a low order polynomial, and that relative humidity
and wind speed appear to be the next most important variables.  To explore the
dependence of ozone level on these variables two at a time, the lowess method
of nonparametric regression (Qeveland, 1979; Chambers and Hastie,  1992, see
loess) was used to find regression surfaces.  Rgure 14 shows the surface for
ozone against temperature and relative humidity. The surface indicates that the
polynomial effect of temperature is maintained at all levels of relative humidity, and
that the effect of relative humidity is reasonably linear at all levels of temperature,
but with a larger slope where the temperature effect is larger. This suggests that a
model of the form

  ozone = (polynomial in temperature) x (linear function of relative humidity)
                                                                     (2)
might express the joint effect of these two variables. Polynomial regression of
ozone against temperature suggests that the polynomial needs to be of order at
least 3;  raising the order from 3 to 4 increases R* from 0.6052 only to 0.6065,
suggesting that order 3 is adequate. Fitting equation (2) by nonlinear least squares
leads to/? = 0.6509. The dimensionless function of relative humidity is estimated
to be 1 - 0.00498%-' x (relative humidity - 50*), which decreases from 1.249 to
0.751 as relative humidity increases from 0% to 100%. The B? for this fit may be
compared with that for the lowess fit of Figure 14, which was 0.66; the equivalent
number of parameters was 10.2. This indicates that the parametric  model, with
only 5 parameters, performs very nearly as well as the nonparametric one.
   The corresponding regression surface for ozone against temperature and wind

-------
Preliminary Report
                               9 June, 1993
25
s   s   \
1   s
                                           s
            40     60    80   100


      Maximum temperature, 09:00 to 18:00 (F)
                                               0       5       10      15


                                                  Wind speed, 1100 (mfc)

1   2

           20   40   60    80    100


           Relative humidity, 12,-QO (%)
                                           *
                                                                  41
                                               0   20   40  60  80  100


                                               Opaque cloud cover, 1100 (%)
Figxire 12: Scatter plots of ozone against temperature, wind speed, relative hu-

midity, and cloud cover.

-------
Preliminary Report
9 June, 1993
                                                            26
 s  §<

 I  8
     a
:  ,:•: •:.. "•:,;:!:; .•: •
vi:,:'.:i,:!iilll
         0    100   200   300




          Wind direction, 1100 (deg)
       I  i


       I  S

           s
                990    1010    1030



                  Pressure, 11-00 (mbar)
Figure 13: Scatter plots of ozone against wind direction and barometric pressure.
 Figure 14: Nonpararoetric regression surface for ozone against temperature (max-

 imum from 09:00 to 18:00) and noon relative humidity.

-------
Preliminary Report
9 June, 1993
27
speed is shown in Figure 15.  This surface shows similarly that the polynomial
effect of temperature is maintained at all levels of wind speed, but that the effect
of wind speed is neither linear nor the same (nor even proportional) at different
wind speeds.  Rather, increasing wind speed is associated with a drop in ozone at
high temperatures, leveling off above around 6 m/s, but with a slight rise in ozone
at low temperatures. This behavior could be captured in a model of the form

 ozone = constant + (polynomial in temperature) x (function of wind speed)  (3)

and the surface suggests that the function of wind speed might be of the form

                                    1
                             1 +
                                 wind speed"'
where v is a critical speed at which the effect of this dimensionless factor drops
from 1 to 0.5.  The nonlinear least squares fitted value of v is 7.72 m/s, and
fl2 = 0.6280. That the effect of wind speed differs from that of relative humidity
is shown in the fitted value of the constant term in equation (3), 40.8 ppb. This
terra would have to be set to 0 ppb for the form of the equation to reduce to
the simpler multiplicative form. Forcing this change reduces /? to 0.6143 (and
 Figure 15: Nonparametric regression surface for ozone against temperature (max-
 imum from 09:00 to 18:00) and noon wind speed.

-------
Preliminary Report            9 June, 1993                            28

increases c to 51 m/s). By contrast, if such a constant is included in equation (2),
it is estimated to be -1.2 ppb, and /? is unchanged to 5 decimal places.
   Again, the /? for the model (3),  0.6280, may be compared with that for the
lowess fit, which was 0.63 with the equivalent of 10.2 parameters. The parametric
model performs well by comparison.
   Equations (2) and (3)  may be combined as

    ozone  =  (constant
               + (polynomial in temperature) x (function of wind speed)}
               x (linear function of relative humidity),                  (4)

with wind speed entering in the same form as before. The .ft2 for the combined
model is 0.6752, and the coefficients  change only slightly:  the  multiplier of
(relative humidity - 50%) becomes -0.00505%"*, and u becomes 6.85 m/s. The
corresponding lowess fit  has /?  = 0.68 for the equivalent of 16.7 parameters,
again indicating good performance of  the parametric model, which here has 7
parameters.
   Study of the residuals from combined models such as equation (4) showed
that ozone also depends on visibility  and opaque cloud cover. On the other hand,
these residuals showed essentially no dependence on barometric pressure, despite
the association shown in Figure 13. High temperatures tend to be associated with
pressures close  to 1015  mbar, and when  the effect of temperature is removed
from the ozone concentrations, the association with pressure is removed as welL
Much but not all of the association of ozone concentration with wind direction is
similarly removed by taking out the effect of temperature.
   Noon values of visibility and  opaque cloud cover were used, and these were
included in the model by  multiplying the right hand side of equation (4) by

     (linear function of visibility)  x (linear function of opaque cloud cover).
                                  i
Including these terms in order raised & to 0.7016 and 0.7035, respectively. The
dependence of ozone levels on wind direction suggested including a similar term,
of the form
                           ,.   .     F   fir x wind direction \
           1  -|-  linear combination of cos I	—	J
                          2jr x wind direction\
                          	360	)'                      (5)

-------
Preliminary Report             9 June, 1993                             29

Including this term raised A2 to 0.7064. Multiplying the cosine and sine by wind
speed gave a better fit and has greater physical meaning, since the terra can then
be interpreted as the inner product of the wind vector with a vector of coefficients.
This change increased R? to 0.7074. Finally, it was found that averaging the wind
vector over the hours 06:00 to 18:00 gave a still better fit, with If = 0.7108.
    Equation 5 was adequate for these data because the (remaining) dependence on
wind direction was simple. In many cases the dependence of ozone concentration
on wind direction is multimodal, in which case a longer Fourier series would be
needed. This requires including the sines and cosines of small multiples of wind
direction.

5.2   Upper  air meteorology

The residuals from the model described in the preceding section were graphed
against and correlated with upper air meteorological variables from the 12Z sound-
ing (06:00 CST). The only substantial association was with wind speed, and this
was stronger at  700 mbar than at 950 mbar, 850 mbar or 500 mbar.
    First the effect of including upper air wind speeds in the wind speed interaction
with temperature was explored. The factor
                                    1
                              1 a. wind speed*
                              1 "*"     v
was extended by adding the four upper air wind speeds, with arbitrary weights, in
the denominator. The weights were then estimated by nonlinear least squares. The
coefficient of 950 mbar wind speed was very small and negative. The coefficients
of 850 mbar and 500 mbar wind speed were both smaller than that of 700 mbar
wind speed, consistently with the ordering of the correlation coefficients. The
700 mbar term  was therefore retained, and the model was refined with the wind
factor extended to
               	1	
               t  i  surface wind speed   700 mbar wind speed'
               1  "*"         t>4         T         V700
This gave an R2 of 0.7204.
    Next the effect of including the 700 mbar wind vector in the wind vector part
of the model was studied. The term
     1,1-         u-   •    ei • j     j\      /2r x wind direction's
     1   +  linear combination of (wind speed) x cos I	^r	I

-------
Preliminary Report            9 June, 1993                            30

                               '2r x wind direction\
   ....    ,.   .  /2r x wind direction \
and (wind speed) x sin (	—	)
was extended by adding a corresponding linear combination of the components of
the 700 mbar wind vector. However, this increased If by only 0.0003, indicating
that there is negligible further information in the upper air wind vector beyond that
in the surface wind vector. A similar experiment with the 950 mbar wind vector
also gave negative results. This extension was not used further.

53   Lagged meteorological variables

The residuals from the preceding analysis were examined for association with
lagged surface meteorological variables by using multiple linear regression meth-
ods. The lagged variables were all 24 hour averages. This examination showed
that there were moderately strong effects of temperature at lags 1 and 2, and rel-
ative humidity and wind speed at lag  1. The model was therefore extended by
including these  lagged temperature variables linearly in the temperature factor,
and by adding lagged relative humidity and wind speed into the corresponding
factors.
    These additions raised & to 0.7499. The coefficients of lagged temperatures
were both negative, with that of the lag 2 temperature three times the size of that
of kg  1 temperature, though still only one thirtieth of the coefficient of the same
day's temperature. The coefficient of lagged relative humidity was the same sign
and size as that of the  same day's relative humidity.  Similarly, the critical wind
speed for lagged wind speed was similar to that for the same day's wind speed.
The multiple regression had indicated that neither variable needed to be lagged
more than one day.
    The residuals were also fitted linearly to lagged upper air variables, in the form
of the average of the OOZ sounding for the given day (which is made at 6:00 p.m.
on the previous day) and the 12Z sounding for the previous day (made at 6:00 a.m.
on the previous day). The t-statistics for these variables, while larger in magnitude
than their nominal 95% points, were all smaller than those for the lagged surface
variables. The lagged upper air variables were not included in the model

5.4   Seasonal structure and trend

The residuals from all  models were also found to have seasonal dependence. For
instance, the residuals  from the model described above are plotted against day of

-------
Preliminary Report
     9 June, 1993
                                                31
year in Figure 16. The graph shows an essentially linear decline over the ozone
season, and to model this effect a seasonal term was included in the model. To give
a reasonable fit both in the present context and in a model with no meteorological
factors, the term was taken to be a short Fourier series, with the annual and semi-
annual frequencies represented, and with the mean removed When this term was
included as another multiplicative factor, it raised R2 considerably, to 0.7982. The
alternative of an additive seasonal term gave a somewhat higher R2 of 0.8037.
   To explore further the choice between an additive seasonal term and a multi-
plicative one, lowess was again used to give a nonparametric view.  Here ozone
was fitted as a function of season and of the fit from the previous (nonseasonal)
model. Figure 17 shows the lowess surface, which is not easy to interpret. The
seasonal  change is larger at fitted values of 90-100 ppb than at low fitted values
such as 20-30 ppb, which is consistent with a multiplicative combination of ef-
fects. However, there are relatively few data points with high fitted values near
the ends of the season. Up to 60-80 ppb, the surface shows a more nearly constant
seasonal  change, suggestive of an additive combination. The R2 for the lowess
fit was 0.80, essentially the same as that obtained with both the multiplicative
seasonal  term and the additive version.  The residual root mean square for lowess,
8.225 ppb, falls between those for the additive fit (8.203 ppb) and the multiplica-
              Apr
May    Jun
           Jul
Aug     Sep
         Oct
 ft
             100
150
                200

            Day of year
250
                  300
      Figure 16: Residuals from meteorological model against day of year.

-------
Preliminary Report
9 June, 1993
32
Figure 17: Nonparametric regression surface for ozone against season and the fit
from the nonseasonal model

live fit (8.317 ppb). Although the choice is not clear cut, the additive form was
chosen for the later analyses.
    The model is easily extended to estimate a trend in ozone concentrations, of
any chosen form. The simplest extension is the incorporation of a factor that is
linear in time into the model, as an extra multiplicative factor in the main part
of the model. Since possible trends in ozone concentration that are traceable to
trends in causative factors such as temperature are accounted for in the model, the
fitted coefficient in the trend term represents an estimate of that part of the trend
that is not explained by meteorology, or in other words an adjusted trend.  An
unadjusted trend may also be calculated by omitting all meteorological variables.
The adjusted trend was found to be -2.7%/decade, while the unadjusted trend was
-4-5.3%/decade. Adding the trend term to the model with meteorological variables
resulted in a relatively small rise in /? to 0.8(342. The & for the model with no
meteorology, only trend and seasonality, was 0.3084, showing that meteorological
effects as formulated in the model account for around 50% of the variance in ozone
concentrations.
    The 8%/decade difference between the adjusted and unadjusted trend estimates
is large relative to all of the trend standard eirors calculated below. It is caused
by bias in the misspecified model that omits meteorology, the removal of which is

-------
Preliminary Report            9 June, 1993                            33

one of the motives for constructing these models.

5.5   Fitted coefficients and standard errors

The trend coefficients discussed in the previous section are of quantitative interest,
as are others of the fitted coefficients in the model. It is therefore desirable to asso-
ciate a standard error with each of them.  This is possible in nonlinear least squares
fits (Gallant, 1987, for instance), but typically requires the usual assumptions of
constancy of variance and lack of correlation in the residuals. These assumptions
are easily seen to be false for the residuals from the present model (see Section 6. 1 ).
If these requirements are ignored, the standard errors reported by nonlinear least
squares programs (Chambers and Hastie, 1992, see nls, for instance) are invalid.
However, Gallant (1987, Sections 2.1, 2.2) describes methods for correcting the
variance estimates for heteroscedasticity and serial correlation in the errors. When
combined in a way that allows for serial correlation with a 2-3  day  span, these
corrections gave the standard errors shown in Table 5. The final form of the model
was

  ozone  ~  {constant
                 + (polynomial in temperature) x (function of wind speed)}
             x (linear function of same day and lag 1 relative humidity)
             x (linear function of visibility)
             x (linear function of opaque cloud cover)
             x (linear function of mean surface wind vector
             x (linear function of time in years)
             -f (seasonal model)

where the polynomial in temperature is cubic in same-day temperature plus linear
in lags 1 and 2 temperature, the function of wind speed is
     _ _1 _
      i .  surface wind speed   700 mbar wind speed   lag 1 wind speed *
       "*"                 "*"                     +
and the seasonal model is a linear combination of the cosines and sines of the
annual and semiannual frequencies. It may be written specifically as

    o3  ~  (muO + (tO-f tl*(maxt-60)

-------
Preliminary Report
       9 June, 1993
34
Table 5: Coefficients in the fitted model, with standard errors computed conven-
tionally and adjusted for heteroscedasticity arid serial correlation.
  Coefft.   Fitted Value
      Conventional!               Adjusted
Standard error    i Value   Standard error   t Value
muO
to
tl
t2
t3
til
t!2
vh
vh700
vhl
r
rl
op
V
m.u
m.v
y
al
bl
a2
b2
3.977e-K)l
-l.Olle+01
3.015e+00
9.480e-02
-1.541e-03
-2.531e-02
-8.198e-01
6.055e+00
1.358e+01
4.686e400
-3.384e-03
•2.0446-03
-6.976e-04
-7.658e-03
5.245e-03
9.877e-04
-2.685e-03
-8.028e+00
4.085e+00
-2.6786+00
-1.1246+00
1.6149160
6.8563271
0.4637083
0.0141952
0.0002857
0.1758384
0.1851373
1.2663745
2.8454335
1.1571911
0.0003513
0.0003430
0.0001289
0.0006391
0.0014633
0.0015238
0.0011339
1.4207495
0.4924658
0.6501565
0.4750715
24.6262
-1.4747
6.5029
6.6780
-5.3931
-0.1439
-4.4280
4.7815
4.7728
4.0498
-9.6329
-5.9603
-5.4120
-11.9820
3.5841
0.6482
-2.3680
-5.6504
8.2946
-4.1195
-2.3656
1.5570528
6.6209256
0.5347362
0.0165989
0.0003189
0.1729924
0.2053805
1.5009998
3.7519675
1.4042068
0.0003664
0.0003769
0.0001417
0.0008258
0.0015789
0.0016596
0.0014404
1.3814916
0.4682407
0.7043474
0.5213316
25.5414
-1.5271
5.6391
5.7109
-4.8325
-0.1463
-3.9916
4.0341
3.6196
3.3374
-92342
-5.4244
-4.9215
-9.2733
3.3218
0.5952
-1.8640
-5.8110
8.7237
-3.8026
-2.1557

-------
Preliminary Report            9 June, 1993                           35


                    +t2 * (maxt - 60)2 +13 * (maxt - 60)3
                    +tll * (tlagl - 60) +1!2 * (tlag2) - 60)
               *!/(! -I- wspd/vh + wspd700/vh700 -|- wlag/vhl))
           *(1 + r * (rh - 50) + rl * (rhlag - 50))
           *(1 -f op * (opcov - 50))
           *(l+v*(vis-12))
           *(1 + m. u * mean. u -f m. v * mean. v)
           *(l+y* (year- 1985))
           + al * cos(2 * pi * year) + bl * sin(2 * pi * year)
                + a2 * cos(4 * pi * year) + b2 * sin(4 * pi * year). (6)

The seasonal cosine and sine terms were in fact deviations from their respective
means.
   Most of the /-ratios for variables in the model are at least 2 in absolute value.
Aside from the trend coefficient y, the only parameters that have ^-ratios lower than
3 are associated with other parameters with higher ^-ratios. Thus all meaningful
groupings of parameters have a high level of statistical significance.


6   Discussion of the Model

6.1   Quality of the fitted model

The root mean square residual from the fitted model is 8.195 ppb. The lower and
upper 2.5% points of the residuals are -14.6 ppb and 17.6 ppb, respectively, while
the quartiles are —5.0 ppb and 4.3 ppb. Thus model predictions differ from the
actual values by up to ±5 ppb about half the time, and by up to  ±16 ppb about
95% or the time. Figure 18 is a quantile-quantile plot of the residuals against the
Gaussian distribution. The points would fall on a straight line if the distribution of
the residuals were exactly Gaussian in shape. In the figure, the behavior is roughly
linear up to a Gaussian quantile of around 2, or in other words for the lower 97.5%
of the distribution. Above this level, the points fall above the extrapolation of the
straight line, meaning that the quantiles of the residuals are progressively higher
than the corresponding quantiles of the Gaussian distribution that would fit the
lower part of their distribution.  The highest residual, 65 ppb, is at around the So-
level.

-------
Preliminary Report
9 June, 1993
36
                          8
                      1
                      •3

                                 •20   2

                           Quantises of Standard Normal
Figure 18: Quantile-quantile plot of the residuals from the fitted model against the
Gaussian distribution.

   Table 6 shows the root mean square residual and the first lagged correlation
coefficient, by month. The standard error of a month's root mean square is around
6% of the observed value, and therefore ranges from 0.4 to 0.6.  The seasonal
variation in the root mean square is therefore highly significant.  The standard
error of each correlation coefficient is around 0.06; thus the correlations in June
and July are not significantly different from zero, while the higher correlations are
clearly different from these mid-summer values.
   The  seasonal rise in root mean square residual parallels the rise in mean levels
associated with summer temperatures. If the extra variability is solely caused by
the higher mean value, it might be eliminated by reexpressing ozone concentrations
on a "variance-stabilizing" scale. However, it may be caused partly by differing
meteorological variability in the different .seasons, and not simply by the change
in mean levels.  This  issue may be addressed by exploring the dependence of
the magnitude of the residuals on the fitted values and the season.  Figure 19
shows the regression surface that results from using lowess to fit such a model
nonparametrically.  It appears that there is a strong relationship between the
magnitude of the residuals and the fined values early in the season,  but that it
becomes weaker as the season progresses. The decrease in residual magnitude at
high fitted values late in the season is presumably spurious, there being relatively

-------
Preliminary Report
9 June, 1993
37
Table 6: Root mean square residual and one day lagged correlation coefficient, by
month

    Month   Rms    Number in rms  Correlation   Number in correlation
4
5
6
7
8
9
10
6.492
7.203
9.205
10.404
8.975
7.467
6.515
317
336
318
331
331
318
332
0.28475
0.25152
0.06903
0.06727
0.22361
0.42167
0.28397
295
320
298
312
315
298
313
 Figure 19: Nonparametric regression surface for magnitude of residuals against
 season and fitted value.

-------
Preliminary Report             9 June, 1993                            38

few high fitted values at these dates.
   It should be noted that the adjusted standard errors tabulated in Section 5.5
(page 34) are valid in the presence of heteroscedasticity and serial correlation of
the form displayed in this section. In particular, they remain valid when the serial
correlations are not stationary, provided they are short term in nature. Standard er-
rors that are valid for longer term correlation are discussed in Section 6.5 (page 46).
However, ordinary least  squares parameter estimates may be inefficient in these
cases. Efficiency can be partially restored by reexpressing the ozone concentra-
tions on a variance-stabilizing scale. In the present case, Figure 19 suggests that
the magnitude of the residuals is roughly proportional to the fitted values, mean-
ing that  the logarithms of ozone concentrations would have more nearly constant
variance.
   It must be recognized, however, that the model is only an empirical approxima-
tion to the actual physical/chemical mechanism whereby meteorological variables
influence ozone concentrations.  When a fitting procedure such as least squares is
used to  fit the model, it produces estimates of the values of the parameters that
make the empirical model as close as possible to the actual mechanism, in the least
squares  sense. If the model were fitted differently, for instance by least squares
on the logarithmic scale, the fitted parameters would be estimates of different
parameter values, namely those that make the model best approximate the actual
mechanism on the new scale. In the case of a logarithmic reexpression, the effect is
to estimate a model that fits  the data better at low concentrations, but worse at high
concentrations. In other words, rexpression may produce more efficient parameter
estimates, but they may be estimates of less appropriate parameter values.  It is
for this reason that all the fitting described  in this report has been carried out on
the original scale, with standard errors computed in a way that makes them valid
in the face of heteroscedasticity and serial correlation, rather than on reexpressed
data.

6.2   Predicted and  adjusted percentiles

One aspect of model performance is how  well it predicts the highest levels of
ozone. To address this question the model predictions of 95% points by season
were calculated.  The model prediction for a given day consists of a  probability
distribution, whose mean is the predicted value for that day. The distribution was
taken to be the empirical  distribution of the residuals from the model (6), centered
at the predicted value. These prediction distributions were averaged within years

-------
Preliminary Report
         9 June, 1993
                                       39
(actually within ozone seasons, April-October of each year). Figure 20 shows the
actual 95th percentiles and those of the yearly averaged prediction distributions.
The model percentiles track the actual percentiles well, with the largest deviation
occurring in  1987.  Cox and  Chu (1992) carried out a similar exercise for the
network maximum values rather the network typical value, as here. Their Figure 3
shows similar model ability to track the observed 95th percentile, but with some-
what poorer agreement, presumably reflecting the noisier character of the network
maximum versus the network typical value.
   To explore the effect of the residual distribution on the prediction of  per-
centiles, the calculation of predicted percentiles was repeated using a Gaussian
shape for the prediction distribution with a standard deviation of 8.195 ppb. This
gave essentially  the same predicted quantiles, presumably because of the close
agreement between the percent points of the distribution of the residuals and those
of a Gaussian distribution, up to the 97.5% level.
   One of the major uses of a model such as that discussed above is to allow for
the effect of year-to-year variations in the explanatory variables.  Adjustment of
overall trends in ozone levels was described in Section 5.4.  An individual day's
ozone value may also  be adjusted, by adding the predicted value for an adjusted
set of meteorological variables to  the residual for the actual day.   Figure 21
                                  Actual
                                  Predicted
     8


 *  8  I
          1981
1983
1985
1987
1989
1991
       Figure 20: Actual and predicted 95th percentiles of ozone by year.

-------
Preliminary Report
         9 June, 1993
                                       40
shows the 95th percentiles by year of such adjusted ozone values, as well as the
actual percentiles. As in Cox and Chu (1992), the meteorological variables in the
model (6) were adjusted linearly season by season, so that the mean and variance
from June to September matched those for all 11 years combined.
   The adjusted percentiles show somewhat less year-to-year variation than the
actual percentiles, indicating that much of the variation was associated with me-
teorological variability.  Again, one year is exceptional:  in 1987 the adjusted
percentile is very close to the actual value. However, since the 95th percentile is
determined by the values of the largest 10 or so residuals, the exceptional behavior
in 1987 may be due simply to sampling variability.
   Cox and Chu (1992) also constructed adjusted ozone percentiles, for the net-
work maximum  value.  Their Figure 6 shows much less variability around a
downward linear trend than  does our Figure 21, despite the fact that Cox and
Chu's calculation is for the 99th percentile of the network maximum, rather than
the 95th percentile of the network typical value. However, their construction does
not involve the observed residuals, as was done here, but is closer to the calculation
of predicted percentiles for the adjusted meteorology.
                                  Actual
                                  Adjusted
     o
     CO
     p
          1981
1983
1985
1987
1989
1991
       Figure 21: Actual and adjusted 95th percentiles of ozone by year.

-------
Preliminary Report
9 June, 1993
41
6.3   Interpretation of model components

The parameters of the model (6) have various interpretations.  As always, these
interpretations are only suggestive of actual physical or chemical causes of ozone
variations.

rauO : Predicted ozone level when temperature x wind speed interaction term is
     zero (that is, high wind speed or temperature = zero of polynomial, around
     60°F), and all other variables are at their centering values. 40 ppb.

tO,  tl,  t2,   t3,  til,  t!2 : Coefficients of a cubic polynomial in max-
     imum temperature and lagged temperature, which added to muO gives the
     predicted ozone level for a given temperature at zero wind speed, and all
     other variables at their centering values. The polynomial plus muO is shown
     in Figure 22 (horizontal line indicates muO). In constructing the figure,
     lagged temperatures were taken as equal to same day temperature.

vh,  vh700,  vhl : Critical speeds for surf ace wind, 700mb wind, and lagged
     surface wind, respectively, at which wind speed factor drops to one half,
     with the other wind speeds at zero. 6 m/s, 14 m/s, 5 m/s.


     o
     10
          40        50       60       70       80       90       100

                                   degF
     Figure 22: Fitted polynomial effect of temperature at zero wind speed.

-------
Preliminary Report
        9 June, 1993
                                           42
r,  rl  : Effects of relative humidity and lagged relative humidity. -0.003%~l,
      -Q.002%-1. The factor drops from 1.27 to 0.73 as both humidities rise from
      0% to 100%.

op  :  Effect of opaque cloud cover. -.0007%~I. The factor drops from 1.03 to
      0.97 as opaque cloud cover rises from 0% to 100%.

v :  Effect of visibility.  -.008km"1.  The factor drops from 1.09 to 0.91 as
      visibility rises from Okm to 24km.

m. u ,  m. v : The relative effect of the 24hr mean wind vector is its vector (inner)
      product with the vector (m.u,m.v) = (0.0052,0.0010)(m/s)"!. Maximum
      when wind is from the south. The effect of a 5 m/s wind varies from 1.03 to
      0.97 as the wind direction changes from south to north.

y :  Trend parameter, -O.OC^yr"1 = -2.7%/decade.

al,  bl, a2,   b2 : Coefficients of the annual and semiannual cosines and
      sines.  The seasonal term is shown in Figure 23.  The location of the
      maximum at May 1 rather than at April 1 may reflect the small number
      of terms in the Fourier series rather than a true increase from April 1 to
 i
             Apr
            100
May    Jun
           Jul
Aug    Sep
150
                200

            day of year
250
         Oct
                 300
                     Figure 23: Fitted seasonal effect

-------
Preliminary Report             9 June, 1993                            43

     April 30. The fitted values at the start of the ozone season are influenced
     by the observed values at the end of the season, because of the periodicity
     of this component  However, a similar rise through the month of April is
     perceptible in Figure 16 (page 31).

   The need for a seasonal term in the model indicates that there are other de-
terminants of surface ozone concentrations than those incorporated in the model,
with systematic seasonal variations.  The observation that the seasonal term fits
better in an additive form rather than as a further multiplicative factor suggests
that it reflects variations in a source of ozone rather than in precursors of ozone.
However, the difference in quality of fit was not clear cut (R* of 0.8037 rather
than 0.7982), and does not amount to more than a suggestion.
   Altshuller (1987) reviewed information on background tropospheric ozone
concentrations, and stated that during the warmer months it appears to be in the
range 10-20 ppb. Although Altshuller did not quantify the seasonal variability
in background tropospheric ozone, springtime maxima in oxidants and ozone in
particular have been observed by Wakamatsu, Uno, Ueda  and Uehara (1989),
for example. Thus the effect shown in Figure 23 is qualitatively consistent with
seasonal variations in background tropospheric ozone concentrations.
   The model (6) contains no term that may be interpreted as an estimate of
the average background ozone level.  The coefficient muO is independent of
temperature and wind speed, but still multiplies the factors associated with the
remaining meteorological variables. An intercept may be added to the model by
extending the "seasonal" part to

          mul -f al * cos(2 * pi * year) + bl * sin(2 * pi * year)
          + a2 * cos(4 * pi * year) + b2 * sin(4 * pi * year)

(recall that in (6) the cosine and sine terms had means of zero, by construction).
The fined value of mul is 10.5 ppb, and the seasonal coefficients  are essentially
unchanged.  Thus the background level of ozone is estimated as the curve in
Figure 23 (page 42), offset by+10.5 ppb, in reasonable agreement with Altshuller's
statement. However, the standard error of this estimate is 5.6 ppb, and it is therefore
poorly quantified.  The estimates of the coefficients muO and mul are highly
(and negatively) correlated; their sum is well estimated, but neither coefficient
is individually. In particular the estimated value of mul is barely significantly
different from zero, and therefore this extended seasonal model has not been used
elsewhere.

-------
Preliminary Report            9 June, 1993                            44
6.4    Cross validation

As with any model containing many parameters, the possibility exists that (6) is
over-parameterized. This may be explored by refitting the model to subsets of the
data, and studying

   •  the quality of predictions from the refit model to the remainder of the data
      (cross validation), and

   •  the variability of the parameter estimates in the refit models (jackknifing).

In the present case,  where the focus is  on the possible effects of interannual
variations in conditions, it is appropriate to construct the subsets based on years
(or "ozone seasons"). This was done by leaving out one year at a time.
   Table 7 gives some results.  The line for a given year gives the number of
days of data used in the model for that year, and the mean and root mean square
error  when the model refit to the remainder of the data is used to predict that
year. This statistic is a grouped version of the PRESS statistic discussed by Cook
        Table 7: Predicted residual mean and root mean square by year.
            Year  Number of days   Mean   Root mean square
1981
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
204
208
210
208
209
206
209
212
208
204
205
-0.1860
-1.2221
-0.6555
-0.1020
0.7356
-0.4128
1.4206
-0.7896
2.3823
0.9323
-2.9085
8.878
8.371
9.776
8.490
7.629
8.371
8.628
7.162
8.240
7.372
8.088

-------
Preliminary Report            9 June, 1993                            45

and Weisberg (1982, Section 2.2.3). The overall root mean square of the cross-
validated prediction errors is 8.302 ppb, which compares very favorably with the
root mean square residual of the fit to all the data, 8.195 ppb.
   There  is some variability from year to year in Table 7. However, it is not
clear how it relates to the characteristics of either ozone levels or meteorology.
For instance, both 1983 and 1988 experienced high ozone levels (see Figure 20,
page 39, for example), and yet 1983 shows the highest root mean square error and
1988 the lowest Note also that 1987, while showing the poorest fit in the predicted
95th percentile (see the same figure), has a mean  squared error that is exceeded
in two other years. The possibility that these variations were caused solely by
sampling error was explored by comparing the values in the second column of
Table 7 with quantiles of a reference distribution. Some care has to be taken in
constructing the reference distribution, because of the following features of the
context

   •  There is variability in sample size from year to year.

   •  The residuals from the fitted model are noticeably nonGaussian (see Fig-
      ure 18, page 36).

   •  Each line in Table 7 summarizes an entire season, and the residuals do not
      have the same distribution across the season (see Figure 19, page 37).

   To allow for these features, the reference distribution was constructed by
bootstrap methods (Efron, 1979; Wu, 1986), in the following steps.

   1.  Select a season length randomly from the first column of Table 7.

   2.  Construct a random season as a sample of the dates from April 1 to Octo-
      ber 31, without replacement, with size equal to the season length selected in
      step 1.

   3.  For each date in the random season, choose a residual at random from the
      population of residuals labeled by that date. Note that there are at most 11
      such residuals.

   4. Calculate the standard deviation of the resulting random season of residuals.

This procedure was carried out 250 times to produce simulation estimates of
the quantiles of the reference distribution.  It should be noted that the reference

-------
Preliminary Report
9 June, 1993
46
distribution was constructed from the residues from the model when fitted to all
of the data, rather than from "leave one out" predicted residuals. However, these
were very similar in overall magnitude, and the substitution should not cause a
noticeable bias.
   Figure 24 shows  the ordered values from the second column of Table 7,
graphed against the bootstrap quantiles.  The graph suggests that at most the
highest observed value, that for 1983, is more extreme than could be explained by
sampling variability, and by only around 0.5 ppb.

6.5   Jackknifing

The "leave out one year" parameter estimates discussed in Section 6.4 may be
used to construct "pseudo-values" (Tukey,  1958) from

 pseudo-value = Y(estimate from all data) - (Y — l)(leave one out estimate),

where Y is the  number of years, 11.  The pseudo-values may then  be treated
as a sample of values estimating the given parameter. Their mean  is used as
a less biased "jackknifed" estimate of the parameter, and the standard error of
                               7.5   ' 8 J   9.5

                               Bootstrap quantile
Figure 24: Quantile-quantile plot of predicted root mean square residual by year,
against a bootstrap reference distribution.

-------
Preliminary Report            9 June, 1993                             47

the mean provides a standard error for either the original parameter estimate or
the jackknifed estimate. This standard error is valid in the presence of arbitrary
variances and covariances of the model residuals, provided only that they are
uncorrelated across years.  Miller (1974) discussed the use of the jackknife for
linear regression models, and proposed its use also for nonlinear models. Hinkley
(1977) showed that in general a weighted jackknife performs better for unbalanced
situations such as regression models, and other modifications were suggested by
Wu (1986) for both linear and nonlinear models.  However, in the present case
each year should be reasonably similar to the others, which suggests that the extra
complication of the weighted jackknife outweighs any performance advantage.
    The jacknifed estimates and standard errors are shown in Table 8. The jack-
knifed parameter estimates differ very little from the overall values shown in
Table 5  (page 34).  The jackknifed standard errors are also generally similar to
those in Table 5, the most notable difference being the increase in the  standard
error of the trend coefficient from 1.4 %/decade to 3.4 %/decade. The associated
change in the \t(-statistic from nearly 2 (allowing for heteroscedasticity and short-
term correlation) to less than 1 suggests that the information about trend in the
data is overstated in the standard errors in Table 5. However, the standard errors
in Table 8 are based on only  10 degrees of freedom, and the effects of sampling
variations must not be neglected
    Figure 25 shows the ratios of the two sets of standard errors, ordered and
graphed against the quantiles  of the x distribution with 10 degrees of freedom. If
there were no long-term or interannual effects, and if the dispersion matrix of the
parameter estimates were diagonal, the points would be expected  to lie close to
the solid diagonal line.  Correlations among the parameter estimates would tend
to reduce the slope of  the points, but this effect is expected to be small.  The
highest point in the figure, corresponding to the trend coefficient, falls far enough
above the line to preclude the possibility that it is the result of sampling variability.
The jackknife standard  error is therefore preferred for inferences about the trend
coefficient
    The dotted line in Figure 25 has a slope of 1.2, and provides a good compromise
match to the remaining points. This suggests that the jackknife standard errors are
estimating values up to 20% larger than the adjusted standard errors of Table 5.  The
adjusted standard errors multiplied by 1.2 should therefore give rise to conservative
inferences about all parameters other than the trend, at the same time being less
sensitive to sampling variability than the jackknife standard errors.

-------
Preliminary Report
9 June, 1993
48
              Table 8: Jackknifed estimates and standard errors.
Coefft.
muO
to
tl
t2
t3
til
t!2
vh
vh700
vhl
r
rl
op
V
m.u
m.v
y
al
bl
a2
b2
Fitted Value Standard error
3.980e-K)l
-1.042e+01
3.050e+00
9.606e-02
-1.557e-03
-2.382e-02
-8.322e-01
6.018e-KX)
1.353e-f01
4.679e+00
-3.385e-03
-2.046e-03
-6.951e-04
-7.640e-03
5.284e-03
9.919e-04
-2.646e-03
-8.024e+00
4.084e+00
-2.684e-fOO
-1.129e-KX)
1.6932387
8.0915148
0.7471219
0.0267512
0.0003707
0.2129552
0'.3544919
. 2.0013063
4.0252847
1.7690150
0.0002265
0.0003206
0.0001320
0.0009879
0.0011824
0.0015540
0.0034115
1.2199726
0.5528235
0.4108871
0.3694495
t Value
23.5035
-1.2881
4.0830
3.5908
-4.2002
-0.1119
-2.3477
3.0071
3.3612
2.6447
-14.9437
-6.3810
-5.2658
-7.7332
4.4688
0.6383
-0.7757
-6.5774
7.3876
-6.5331
-3.0560

-------
Preliminary Report
9 June, 1993
49
                     00
                     •8  2
                         V)
                         d
                            0.5       1.5

                                 Chiquamile
Figure 25: Quantile-quantile plot of ratio of jackknifed standard errors to adjusted
standard errors, against the x distribution with 10 degrees of freedom, with lines
of slope 1 and 1.2.

6.6   Network maximum

The modeling effort described above dealt with only the daily network typical
value. The median polish analysis that gave rise to these network typical values
was also used to construct a network maximum. Fitting the same form of model
to the network maximum gave R2 = 0.6731 and a root mean square residual of
15.611 ppb. The fitted coefficients are shown in Table 9. The major changes from
the model for the network typical value are as follows.

muO : Increased from 40 ppb to 59 ppb.

tOf  tl, t2,  t3,  til,  t!2 : The same day temperature coefficients are
     generally larger. This is to be expected, as they are absolute effects.  How-
     ever, the lagged temperature effects are smaller, and both are now non-
     significant. The polynomial plus muO is shown in Figure 26 (horizontal line
     indicates muO).  As in constructing Figure 22, lagged temperatures were
      taken as equal to same day temperature.

vh,  vh700, vhl : All critical wind speeds are lower, meaning that the effects
      of high temperature are moderated more rapidly by increasing wind speed

-------
Preliminary Report
9 June, 1993
50
Table 9: Coefficients in the fitted model for the network maximum, with standard
errors computed conventionally and adjusted for heteroscedastitity and serial
correlation.
  Coefft.   Fitted Value       Conventional              Adjusted
                        Standard error   t Value  Standard error   t Value
muO
to
tl
t2
t3
til
t!2
vh
vh700
vhl
r
rl
op
V
m.u
m.v
y
al
bl
a2
b2
5.8516401
-8.222e+00
5.540e+00
1.626e-01
-1.615e-03
-1.019e-01
-7.087e-01
2.908e400
1.167e-K)l
2.965e+00
-1.796e-03
-2.048e-03
-6.176e-04
-7.002e-03
2.574e-03
-1.328e-02
-9.492e-03
-4.2356400
6.380e+00
-1.441e-fOO
2,531e-01
2.7046400
1.526e401
1.370e400
3.900e-02
6.316e-04
4.806e-01
4.170e-01
8.012e-01
3.791C-KX)
1. 013e-KX)
4.579e-04
4.456e-04
1.674e-04
8.283e-04
1.863e-03
1.916e-03
1.441e-03
2.6886400
9.344e-01
1.240e-K)0
9.072e-01
21.6400
-0.5389
4.0449
4.1699
-2.5578
-0.2120
-1.6994
3.6292
3.0774
2.9261
-3.9214
-4.5967
•3.6893
-8.4542
1.3814
••6.9274
-6.5882
••1.5758
6.8283
•1.1623
0.2790
2.817e400
1.510e401
1.499e400
4.318e-02
6.969e-04
4.854e-01
4.568e-01
9.122e-01
4.343e400
1.249e400
5.258e-04
5.315e-04
1.783e-04
9.134e-04
1.915e-03
2.089e-03
1.828e-03
2.912e400
1.0916400
1.4016400
1.0756400
20.7682
-0.5444
3.6964
3.7664
-2.3179
-0.2099
-1.5514
3.1873
2.6863
2.3746
-3.4150
-3.8537
-3.4647
-7.6662
1.3438
-6.3554
-5.1939
-1.4543
5.8466
-1.0289
0.2354

-------
Preliminary Report
           9 June, 1993
                                         51
 f
     8
     en
     8
          40
50
60
 70

degF
SO
90
100
Figure 26: Fitted polynomial effect of temperature at zero wind speed for network
maximum. Dotted line shows values for network typical value, shown in Figure 22.

      for the network maximum than for the network typical value.

T,  rl,  op,  v : The proportional effects of relative humidity, opaque cloud
      cover, and visibility are all lower for the network maximum than for the
      network typical value.

m. u,  m. v : The proportional impact  of the wind vector is now greatest when
      winds are from the north, and is nearly three times as large as for the network
      typical value.

y : The estimated trend parameter for the network maximum is -O.OOPSyr"1 =
      —9.5%/decade, notably larger than for the network typical value. It is ap-
      parently statistically significant, with respect to the standard errors adjusted
      for heteroscedasticity and serial correlation. Jackknife standard errors have
      not been computed, but if the same ratio were found as for the trend in the
      network typical value, the jackknife standard error for the trend would be
      4.8%/decade, and the trend would be barely significant
 al,  bl,  a2,  b2 : The seasonal term is shown in Figure 27, together with the
      corresponding curve for the network typical value. The curves are essentially

-------
Preliminary Report
    9 June, 1993
                                   52
              Apr      May   Jun     Jul
                    Ang    Sep     Oct
     v»
 f
            100
150
    200

day of year
250
300
Figure 27: Fitted seasonal effect for network maximum. Dotted line shows values
for network typical value, shown in Rgure 23.
                                     I
      identical, the largest difference being the somewhat earlier maximum. The
      close agreement between the seasonal curves contrasts sharply with the
      difference between the temperature curves (Figure 26), and tends to support
      the interpretation of this part of the model as the effect of seasonal variations
      in background ozone concentrations.

   Actual, predicted, and adjusted 95th percentiles for the network maximum,
corresponding to those in Figures 20 and  21  (pages 39 and 40), are shown in
Figures 28 and 29.  These  figures show similar performance to those for the
network typical value. The relatively good performance in 1987 suggests that the
problems in the prediction and adjustment of the 95th percentile for the network
typical value were due to chance effects.
   The results of this  section may be compared with those of Cox and Chu
(1992), who fitted a Weibull probability model to the same network maximum
time  series.  In  the model  the logarithm  of  the Weibull scale parameter was
represented as a linear  function of selected meteorological variables and linear
trend. Parameters were fitted by maximum likelihood. The trend was estimated as
—0.0054/yr = -5.4%/decade with a standard error (computed by bootstrapping
3 day sequences) of 3.7%/decade. The difference between Cox and Chu's estimate

-------
Preliminary Report             9 June, 1993
53
                                   Actual
                                   Predicted
     s
          1981        1983       1985        1987        1989        1991
Figure 28:  Actual and predicted 95th percentiles of network maximum ozone by
year.
                                   Actual
                                   Adjusted
          1981        1983        198S        1987        1989        1991
Figure 29: Actual and adjusted 95th percentiles of network maximum ozone by
year.

-------
Preliminary Report            9 June, 1993                            54


and the one obtained from the present model is around one standard error. They
are therefore reasonably consistent with each other, given the differences between
the models and the fitting criteria.


7   Conclusions

The daily maximum one-hour average surface ozone concentrations from the 45
stations were highly correlated. A 16-station subnetwork with essentially the same
spatial coverage as the whole network showed a dominant principal component that
accounted for 78% of the variance, and the corresponding time series was nearly
perfectly correlated (p = 0.99) with a simple "typical" value for the network,
obtained by median polish.
   The network typical value time series shows nonlinear and nonadditive depen-
dence on various meteorological quantities, including individual measurements
and constructs (averages).  There is also sarong seasonal dependence, even after
allowing for the effects of the meteorological variables. The dependence can
be approximated by a nonlinear parametric  model, and when the parameters are
fitted by least squares, the  model accounts for 80% of the variance of the ozone
concentration data.  The root mean square residual is 8.2 ppb. The model may be
extended to include a trend parameter, which is estimated to be -2.7%/decade,
with a (jackknife) standard error of 3.7%/decade. This represents an estimate of
trend adjusted for meteorological variability. If the meteorological variables are
omitted, the model  contains just seasonaliry and trend, and the trend estimate is
found to be +5.3%/decade, representing the  unadjusted trend in the surface ozone
concentrations.
References

Altshuller, A. P. (1987). 'Estimation of the natural background of ozone present
     at sruface rural locations', Journal of the Air Pollution Control Association
     37,1409-1417.

Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988). The New S Language,
     Advanced Books and Software. Pacific Grove, California: Wadsworth.

-------
Preliminary Report            9 June, 1993                            55

Bloomfield, P. and Steiger, W. L. (1983).  Least Absolute Deviations:  Theory,
     Applications, and Algorithms, Progress in Probability and Statistics. Boston:
     BirkhSuser.

Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Pacific Grove,
     CA: Wadsworth.

Cleveland,  W. S. (1979). 'Robust locally weighted regression and smoothing
     scatterplots', Journal of the American Statistical Association 74(368), 829-
     836.

Comrie, A. C.  and Yarnal, B. (1992).  'Relationships between synoptic-scale
     atmospheric circulation and ozone concentrations in metropolitan Pittsburgh,
     Pennsylvania', Atmospheric Environment 26B, 301-312.

Cook, R. D. and Weisberg, S. (1982).  Residuals and Influence in Regression,
     Monographs on Statistics and Applied Probability. New York:  Chapman
     and Hall.

Cox, W. M. (1992). Personal communication.

Cox, W. M. and Chu, S.-H. (1992). Meteorologically adjusted ozone trends in
     urban areas: A probability approach, U.S. Environmental Protection Agency,
     Technical Support Division MD-14, Research Triangle Park, NC 27711.

Efron, B. (1979). 'Bootstrap methods: Another look at the jackknife', Annals of
     Statistics 7,1-26.

Feister, U. and Balzer, K. (1991).  'Surface ozone and meteorological predictors
     on a subregional scale', Atmospheric Environment 25A, 1781-1790.

Gallant, A.  R. (1987). Nonlinear Statistical Models. New York: Wiley.

Hinkley, D. V.  (1977).   'Jackknifing in unbalanced situations', Technometrics
     19,285-292.

Kelly, N. A.,  Ferman, M. A. and Wolff, G. T. (1986). 'The  chemical and me-
     teorological conditions associated with high and low ozone concentrations
     in  southeastern michigan and nearby areas of Ontario', Journal of the Air
     Pollution Control Association 36,150-158.

-------
Preliminary Report            9 June, 1993                             56
Miller, R. G. (1974). 'The jackknife—a review', Biometrika 61,1-15.

National Research Council (1991). Rethinking the Ozone Problem in Urban and
     Regional Air Pollution. Washington, DC: National Academy Press.

TWcey, J. W. (1958). 'Bias and confidence in not quite large samples (abstract)',
     Annals of Mathematical Statistics 29,614.

Tukey, J. W. (1977).  Exploratory Data  Analysis.   Reading, Massachusetts:
     Addison-Wesley.

Wakamatsu, S., Uno, I., Ueda, H. and Uehara, K. (1989).  'Observational study
     of stratospheric ozone intrusions into the lower troposphere', Atmospheric
     Environment 23,1815-1826.

Wu, C.  F. J. (1986).  'Jackknife, bootstrap and other resampling methods in
     regression analysis'', Annals of Statistics 14,1261-1295.

-------