Adjustment of Ozone Trends for Meteorological Variation: Final Report


              Final Report

 ADJUSTMENT OF OZONE TRENDS FOR
     METEOROLOGICAL VARIATION

            SYSAPP-90/008

            30 January 1990
              Prepared for

               Neil Frank

      Data Analysis Section, MD-14
Office of Air Quality Planning and Standards
   U.S. Environmental Protection Agency
    Research Triangle Park,  NC 27711

   EPA Contract No. 68-02-4391 WA 2-1
              Prepared by

            T. E. Stoeckenius
         A. Belle Hudischewskyj

        Systems Applications, Inc.
          101 Lucas Valley Road
          San Rafael, CA  94903

-------
                                Contents


List of Figures
List of Tables
Executive Summary
1    INTRODUCTION
2    DATA
          Selection of Study Period
          Ozone Data
          Calculation of Summary Statistics
          Meteorological Data
3    METHODOLOGY
          Overview of CART
          Types of Trees That Can Be Constructed with CART
          Using Regression Trees in Trends Adjustment
          Trends Adjustment for Monitoring Networks
4    CONCLUSIONS AND RECOMMENDATIONS
References
89053r2

-------
                                   Figures
1-1   Estimated number of daily exceedances of the ozone NAAQS
      in the ozone season (April 1 to October 31) at sites in
      southwestern Connecticut

3-1   Schematic diagram of a CART tree showing root node, intermediate
      nodes, terminal nodes, and branches

3-2   True (R*) and resubstitution (R) error as a function of
      tree size

3-3   Regression tree for Bridgeport for 1979-1987

3-4   Observed and met-adjusted annual ozone exceedances derived from
      the regression tree shown in Figure 3-3

3-5   Regression tree for Philadelphia monitoring network
      (1979-1988)

3-6   Regression tree for southwestern CT ozone monitoring network
      (1979-1988)

3-7   Annual number of days on which the average daily maximum ozone
      concentration in the Philadelphia Ozone Network exceeds
      0.105 ppm (based on three years of data)

3-8   Annual number of days on which the average daily maximum ozone
      concentration in the Philadelphia Ozone Network exceeds
      0.085 ppm (based on three years of data)

3-9   Annual number of days on which the average daily maximum ozone
      concentration in the Philadelphia Ozone Network exceeds
      0.105 ppm (based on yearly data)

-------
3-11  Annual mean daily maximum ozone concentrations for the
      Philadelphia Ozone Network

3-12  Annual mean daily maximum ozone concentrations for the
      Philadelphia Ozone Network

3-13  Annual mean daily maximum ozone concentrations for the
      southwestern Connecticut Ozone Network (based on individual
      yearly data run down the CART tree grown on 1979-1988 data)

3-14  Annual number of days on which the average daily maximum ozone
      concentration in the Philadelphia Ozone Network exceeds
      0.105 ppm (based on the number of days with temperatures
      above 90°F in each year)

-------
                                    Tables
2-1   Ozone monitoring sites included in the Philadelphia and
      southwestern Connecticut networks

2-2   Number of days with missing valid daily maximum ozone
      concentrations for monitors in the Philadelphia monitoring
      network

2-3   Number of days with missing valid daily maximum ozone
      concentrations for monitors in the southwestern Connecticut
      monitoring network

2-4   Meteorological variables available for northeastern monitoring
      sites

2-5   Key meteorological variables selected for inclusion in the
      preliminary calculations

3-1   Distribution of number of high ozone days by terminal node
      for regression tree depicted in Figure 3-3

3-2   Observed and met-adjusted exceedances for Bridgeport, CT

3-3   Additional meteorological variables

3-4   Results of regression tree analysis for ozone monitoring
      networks

3-5   Distribution of Philadelphia network average daily maximum
      ozone concentrations for the period 1979-1988 across terminal
      nodes of the regression tree depicted in Figure 3-5

3-6   Distribution of days across terminal nodes of regression tree
      depicted in Figure 3-5 by year

3-7   Distribution of days across terminal nodes of regression tree
      depicted in Figure 3-6 by year

-------
3-8   Average Dmax (Philadelphia network) by year for selected
      groups of days

3-9   Number of days on which daily maximum temperature in Philadelphia
      (TMAX2) exceeds 90°F and number of days on which the Philadelphia
      network average daily maximum ozone concentration (Dmax) exceeds
      10.5 pphm

-------
                            EXECUTIVE SUMMARY
Year-to-year changes in annual ozone summary statistics such as the number of
exceedances per year of the ozone NAAQS or the annual average daily maximum
concentration result both from changes in precursor pollutant emissions and from
fluctuations in meteorological conditions. Ozone concentrations in recent years have
undergone large variations associated with a series of unusual weather patterns. This
variability makes it difficult to estimate the extent to which emission control
strategies and economic growth have affected ozone air quality.

Statistical models developed from field data that quantify the relationship between
meteorological  conditions and ozone can be used to calculate the values of annual
ozone summary statistics that would  be expected to occur under a fixed (reference)
set of meteorological conditions.  If the models are sufficiently accurate and precise
and are designed and applied so that the model parameters properly reflect year-to-
year changes in precursor emissions, these expected annual ozone values will vary
from one year to the next in response only to changes in emissions; fluctuations due
to meteorological variations will have been effectively removed. Such meteorologi-
cally adjusted concentration trends are useful for identifying "underlying" air quality
trends that may otherwise be difficult to  distinguish due to the interfering effects of
unusual weather patterns.

This report describes several test applications of a trends adjustment procedure that
is based on a statistical model known as a regression tree. Regression trees are con-
structed using the Classification and  Regression  Tree (CART) methodology developed
by Breiman et al. (1984). Such trees are designed to classify days into groups based
on a set of key  meteorological criteria such that days falling within any one group
have similar meteorological conditions that in turn produce daily maximum ozone
concentrations  of nearly equal magnitudes.  For example, CART can be used to
develop a regression tree that identifies a series of meteorological characteristics
that effectively sort days into low, medium, and high ozone groups. Thus, the
regression tree  is a statistical model  that relates meteorological conditions to ozone
concentrations. This model can be used to estimate ozone concentrations that would
have been observed in a given year if meteorological conditions in that year had been
equivalent to those observed in a typical or average year. By doing this for each of a
series of years, ozone concentration trends  that are not affected by year-to-year
fluctuations in meteorological conditions  can be calculated.
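
As a rough illustration of the adjustment idea described above, the following Python sketch reweights year-specific ozone averages by a fixed set of meteorological class frequencies. It is a minimal sketch and not necessarily the procedure applied in this report: the input layout (tuples of year, terminal node label, and daily maximum ozone), the function name `met_adjusted_annual_mean`, the use of the full record as the reference meteorology, and the handling of classes with no days in a given year are all assumptions made for illustration.

```python
from collections import defaultdict


def met_adjusted_annual_mean(days, target_year):
    """Estimate the annual mean ozone expected under reference meteorology.

    days: list of (year, node, dmax) tuples, where node labels the
    meteorological class (tree terminal node) a day falls in.
    This layout is hypothetical, chosen only for the sketch.
    """
    # Reference meteorology: how often each class occurs over the whole record.
    node_counts = defaultdict(int)
    for _, node, _ in days:
        node_counts[node] += 1

    # Year-specific ozone response: mean ozone within each class in target_year.
    sums, counts = defaultdict(float), defaultdict(int)
    for year, node, dmax in days:
        if year == target_year:
            sums[node] += dmax
            counts[node] += 1

    # Weight the class means by the reference frequencies. Classes with no
    # days in the target year are skipped and the weights renormalized
    # (an assumption of this sketch).
    used = [n for n in node_counts if counts[n] > 0]
    weight_total = sum(node_counts[n] for n in used)
    return sum(node_counts[n] / weight_total * sums[n] / counts[n] for n in used)


# Toy record: same ozone response in both years, different mix of weather.
toy_days = (
    [(1, "cool", 4.0)] * 3 + [(1, "hot", 12.0)] * 1
    + [(2, "cool", 4.0)] * 1 + [(2, "hot", 12.0)] * 3
)
adjusted_year_1 = met_adjusted_annual_mean(toy_days, 1)
adjusted_year_2 = met_adjusted_annual_mean(toy_days, 2)
```

In this toy record the unadjusted annual means differ (6 versus 10 pphm) only because the second year has more hot days; the adjusted means are equal because the ozone response within each class is unchanged.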

-------
Preliminary calculations were performed based on the number of days during the
ozone season (April through October) during which concentrations at a monitor in
Bridgeport, CT exceeded 10 or 12 pphm. Five key meteorological parameters known
to influence ozone formation were used to develop a regression tree model.  The
model was designed to classify days on the basis of these parameters into groups con-
sisting of days with similar daily maximum ozone concentrations. This model was
then used to calculate the number of exceedances estimated to occur in each of a
series of overlapping three-year base periods (1979-1981, 1980-1982,..., 1985-1987)
assuming weather conditions during each base period corresponded to climatological
norms.  These  meteorologically adjusted exceedances were then compared to the
number  of exceedances actually observed during each base period.
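
The overlapping base periods used in the preliminary calculations can be enumerated with a short Python sketch. The function names are hypothetical, and summing annual counts over each period is an assumption of this sketch rather than a statement of how the period statistics were formed.

```python
def base_periods(first_year, last_year, length=3):
    """Return (start, end) year pairs for overlapping base periods."""
    return [(start, start + length - 1)
            for start in range(first_year, last_year - length + 2)]


def period_totals(exceedances_by_year, periods):
    """Sum annual exceedance counts over each base period."""
    return {
        (start, end): sum(exceedances_by_year[y] for y in range(start, end + 1))
        for start, end in periods
    }


periods = base_periods(1979, 1987)
print(periods[0], periods[-1], len(periods))  # (1979, 1981) (1985, 1987) 7
```

Because adjacent periods share two of their three years, a single unusual year influences three consecutive period totals, which is one reason period-based statistics vary less than annual ones.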

Results  obtained via the above procedure  (see Figure 3-4) reveal that the met-
adjusted exceedances are only slightly different from the unadjusted values  in most
of the base periods. This is partly due to the fact that the  use of three-year running
average base periods averages out most of the meteorological fluctuations the pro-
cedure is trying to account for. In addition, the regression tree model used in the
procedure provides for only a rough approximation of the relationship between daily
weather conditions and ozone concentrations. Apparently, fluctuations in the daily
maximum ozone concentration at a single monitoring site are  difficult to describe
solely on the basis of available meteorological data.

On the basis of the above findings, we attempted to define a set of ozone summary
statistics and meteorological variables that would result in a better fit of the
regression tree model to the data. We also modified the trend adjustment calculation
to allow for adjustment of concentrations in individual years rather than just over-
lapping  three-year periods, thus providing the model with a better opportunity to
account for the large changes in weather  patterns observed in some years.

Because daily  maximum concentrations at individual monitors are subject to a large
number  of microscale influences that are difficult to account for, we developed a
more robust daily ozone summary statistic by averaging the daily maximum concen-
trations over a network of several monitoring sites. We also performed some
analyses using the largest of the daily maximum concentrations over the network.
Two networks  were identified—one in the Philadelphia area and one in southwestern
Connecticut.  Several additional meteorological variables were also included in the
analysis. These were primarily designed to account for the fact that ozone  concen-
trations are often observed to build gradually from one day to the next, thus suggest-
ing that concentrations on a given day may be partially influenced by meteorological
conditions on previous days. Data for 1988 (during which record high ozone exceed-
ance rates were observed at a number of locations in the northeastern United States)
were also included in the analysis at this point.

Regression trees for the network average daily maximum concentration (Dmax)
developed by using the extended set of meteorological variables produced slightly
lower prediction errors than those for the network maximum concentration

-------
[Max(Dmax)], which in turn were lower than those for the Bridgeport monitor daily
maximum concentration (see Table 3-4). Meteorologically adjusted values of the
number of days per year on which Dmax exceeded 8 and 10 pphm and of the annual
average Dmax were calculated on the basis of these trees. As was the case in the
preliminary calculations, some residual correlation of the met-adjusted values with
the unadjusted values was found, suggesting that not all of the meteorological influ-
ences had been successfully removed. In spite of this, the results clearly show the
influence of meteorological conditions on ozone concentrations in some years, and
these influences were found to be generally  the same for both the Philadelphia and
Connecticut networks.  In particular, the years 1980,  1983 and 1988  were shown to
have had an unusually large number of days on which weather conditions were of the
type identified by the regression trees for both networks to be associated with high
ozone concentrations. This increased frequency of ozone-conducive days resulted in
unusually high observed ozone concentrations in these years, which were adjusted
downward by the adjustment procedure.

A time series plot of the annual average Dmax for the Philadelphia network (Figure
3-12) revealed no apparent upward or downward trend in either the met-adjusted or
unadjusted values. Similar plots for the number of days per year on which Dmax
exceeded 8 and 10 pphm (Figures 3-9 and 3-10, respectively) also fail to show any
trend. A plot of the annual average Dmax for the southwestern Connecticut network
(Figure 3-13) seems to show a slight decrease in both the met-adjusted and unadjus-
ted values between  1980 and 1988.  However, this decrease was found not to be  sta-
tistically significant.

To put our results in perspective, we compared them to results obtained through a
much simpler alternative trends adjustment procedure suggested by  Jones and Dash
(1985). In this procedure, the ratio of the number of days during an  ozone season on
which temperatures exceeded 90°F to the climatological average number of such
days is used as a meteorological index of the ozone-formation potential of each
year. Met-adjusted values of the number of days per year on which the Philadelphia
network Dmax exceeded 10 pphm were obtained by multiplying the inverse of this
meteorological index by the number of observed exceedances. This  adjustment  pro-
cedure is based on the implicit assumption that the ratio of exceedances to hot  days
is a function only of precursor emissions:  Years with twice the average number of
hot days are assumed to have twice the average number of exceedances.  Our results
(Figure 3-14) suggest that, at least for the situation we examined, this assumption is
not valid: Met-adjusted exceedance rates obtained by this method exhibited large
fluctuations from one year to the next that  are unlikely to be related to fluctuations
in precursor emissions.
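
The temperature-index adjustment just described reduces to a single ratio. The sketch below follows the wording above; the function and parameter names are hypothetical, and the behavior for a year with no hot days is an assumption, since the adjustment is undefined in that case.

```python
def temperature_index_adjustment(observed_exceedances, hot_days,
                                 climatological_hot_days):
    """Adjust an annual exceedance count by a hot-day index.

    index    = hot_days / climatological_hot_days
    adjusted = observed_exceedances * (1 / index)
    """
    if hot_days == 0:
        # The index is zero and its inverse is undefined (assumed handling).
        raise ValueError("adjustment undefined for a year with no hot days")
    index = hot_days / climatological_hot_days
    return observed_exceedances / index


# A year with twice the average number of hot days has its count halved.
adjusted = temperature_index_adjustment(12, hot_days=30, climatological_hot_days=15)
```

The example makes the implicit proportionality assumption visible: any year-to-year change in the ratio of exceedances to hot days is attributed entirely to emissions.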

From the results described above, we conclude that regression trees developed by the
CART methodology appear to be capable of describing the relationship between daily
meteorological conditions and ozone concentrations to a sufficient degree of
accuracy and precision to produce useful estimates of meteorologically adjusted
ozone trends.  Our comparison of met-adjusted and unadjusted annual ozone summary

-------
statistics reveals that weather conditions have led to unusually high ozone values
during at least three of the past ten years (1980, 1983 and 1988).  Confirmation of
the statistical significance of differences in individual years between met-adjusted
and unadjusted annual ozone summary statistics must await the development of
methods for quantifying the uncertainty in the met-adjusted values.

-------
                              1   INTRODUCTION
Ozone concentrations in metropolitan regions typically follow an episodic pattern in
which comparatively lengthy stretches of low to moderate concentrations are
occasionally interrupted by multiday episodes during which concentrations increase
markedly over a wide area. In most cities, exceedances of the National Ambient Air
Quality Standard (NAAQS) for ozone (12 pphm) occur only during such episodes. In
many areas, exceedances of 10 pphm also occur only during such episodes.  Studies of
the annual number of ozone NAAQS exceedances in major U.S. cities have shown
that the number of exceedances varies widely  from one year to the next (EPA,
1985). A typical example is shown in Figure 1-1.

A substantial portion of the variability evident in Figure 1-1 may be attributable to
fluctuations in meteorological conditions. Ozone formation in the troposphere is the
result of a complex sequence of photochemical reactions involving precursor
pollutants (nitrogen oxides and volatile organic compounds). The concentrations of these
pollutants and rates of  the chemical reactions leading to ozone formation depend to
a large extent on the weather. Given similar precursor emissions, the basic differ-
ences between days when ozone concentrations are average or below average (non-
episode days) and days when exceedances occur (episode days) are observed in the
prevailing meteorological conditions. In general, warm, clear days with light winds
that bring, keep, or return high precursor and ozone concentrations into an area will
coincide with the formation of the highest ozone concentrations.  Since such
meteorological conditions occur more frequently in some years than in others, ozone
episodes occur more frequently in some years than in others.

Unfortunately, the meteorologically induced "noise" evident in Figure 1-1 makes it
difficult to identify any underlying  long-term trends in exceedances that may come
about as a result of gradual changes in the pattern of precursor emissions. To  better
identify such underlying trends, we need to develop a method of filtering out the
meteorological noise in the ozone exceedance signal. In other words, we are seeking
a method by which the  number of exceedances actually observed in each year can be
adjusted to a common meteorological basis.

Numerous investigators have developed and applied methods for obtaining ozone
trend estimates that account  for year-to-year variations in meteorological condi-
tions. Most of these methods rely on the development of an empirical model that
relates ozone concentrations to meteorological conditions. The model is then used to

-------
[Figure 1-1: plot not reproduced. The key lists the monitors
070060123F01 BRIDGEPORT, 070175123F01 DANBURY, 070330017F03 GREENWICH,
070700123F01 NEW HAVEN, and 071110007F01 STRATFORD.]

FIGURE 1-1. Estimated number of daily exceedances of the ozone NAAQS
in the ozone season (April 1 to October 31) at sites in southwestern
Connecticut.

-------
predict met-adjusted ozone concentrations for a reference set of conditions. The
reference set of conditions may correspond to the meteorology of a particular year
or to climatological norms. In either case, if the empirical model is sufficiently
accurate and precise, the resulting meteorologically adjusted ozone values give a
better picture of the actual long-term ozone trends without the interfering effects
of meteorological fluctuations.

Zeldin and Meisel (1978) describe a number of the variations on the basic procedure
for trends adjustment discussed above. They also summarize the results of previous
analyses of the correlation between various meteorological variables and ozone con-
centrations. This summary shows that the particular meteorological conditions most
closely associated with high ozone events vary somewhat from one location to the
next.  In general, however, high ozone has been found to be associated with high
temperatures, clear skies, and light winds. Surprisingly, most investigations of the
effect of mixing height on ozone concentrations have been inconclusive, in spite of
the fact that mixing height is generally believed to play a major role in the degree of
atmospheric mixing occurring during the day. Chock and co-workers (1982) point out
that this may be a result of the fact that accurate measurements of mixing height
are difficult to make and that ozone may be carried over from one day to the next in
layers of stable air above the inversion top, thus confounding the relationship
between mixing height and surface ozone concentrations.

Most previous trends adjustment studies have made use of  linear regression tech-
niques for relating ozone concentrations to meteorological conditions. Recent
examples include Wackter and Bayly (1987) and Pollack et  al. (1988) in the Northeast;
and Chock et al. (1982), Kumar and Chock (1984), and Davidson et al. (1985) in Los
Angeles. Sweitzer and Swinford (1986) developed a meteorological index based on
the number of days  in each year on which meteorological conditions were found to be
conducive to ozone  formation. To sort through the many possible meteorological
variables that may be affecting ozone concentrations, some of these investigators
used step-wise regressions while others relied on the results of previous studies or
theoretical considerations to arrive at a set of key meteorological variables. Some
investigators used the annual number of exceedances as the dependent variable in the
regression (Davidson et al., 1985; Wackter and Bayly, 1987) instead of the daily
maximum ozone concentration. Wackter and Bayly also developed a regression equa-
tion for the annual mean daily maximum ozone concentration.

As pointed out by Pollack et al. (1988), the use of a linear  regression model to pre-
dict daily maximum ozone values has a serious drawback:  High ozone concentrations
are consistently underpredicted by such a model because the least squares fitting
procedure used in linear regression is designed to  limit the overall mean square
error. Since high ozone values (say, 10 pphm or more) occur only rarely, they do not
play a major role in determining the regression coefficients. Pollack and co-workers
found this to be the case even though the data on which the regression equations
were  developed were first passed through a screening procedure designed to remove

-------
the majority of the low-ozone days (defined as days on which daily maximum ozone
concentrations are less than 8 pphm) on the basis of unfavorable meteorological
conditions.
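
The tendency of a least squares fit to underpredict rare high values can be reproduced with synthetic data. The sketch below is illustrative only: the temperature and ozone series are generated from an assumed relationship with made-up coefficients, not taken from any monitoring data, and a single predictor stands in for the larger sets of meteorological variables used in the studies cited above.

```python
import random


def ols_fit(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic temperature (deg F) and ozone (pphm): ozone rises faster than
# linearly with temperature, plus noise. All numbers are assumptions.
temps = [rng.uniform(60.0, 100.0) for _ in range(2000)]
ozone = [2.0 + 0.004 * (t - 60.0) ** 2 + rng.gauss(0.0, 1.5) for t in temps]

intercept, slope = ols_fit(temps, ozone)
predicted = [intercept + slope * t for t in temps]

# Compare observed and fitted values on the 5 percent of days with the
# highest observed ozone.
ranked = sorted(range(len(ozone)), key=lambda i: ozone[i], reverse=True)
high_days = ranked[: len(ranked) // 20]
mean_observed_high = sum(ozone[i] for i in high_days) / len(high_days)
mean_predicted_high = sum(predicted[i] for i in high_days) / len(high_days)
print(mean_observed_high, mean_predicted_high)
```

On the high-ozone days the fitted values average well below the observed values, because the fit is dominated by the much more numerous low and moderate days.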

Langstaff and Pollack (1985) examined the use of a different regression model,
known as a regression tree, for quantifying the relationship between meteorological
conditions and daily maximum ozone concentrations in St. Louis. Regression trees
have the potential to overcome some of the limitations of using linear regression to
calculate met-adjusted ozone trends pointed out by Pollack and co-workers (1988). A
methodology for developing regression trees has been developed by Breiman et al.
(1984) and is encoded in the CART (Classification and Regression Trees) computer
program.

Systems Applications, Inc. has been asked by the U.S. Environmental Protection
Agency to develop and test methods for calculating meteorologically  adjusted ozone
trends.  In particular,  we have been asked to investigate the potential for applying
the CART methodology to the trends adjustment problem. This report summarizes
the results of our  investigation. A description of the  data used in our analysis is pre-
sented in Section  2. An overview of CART and our methodology is presented in the
first part of Section 3, followed by a description of the results we  obtained. Con-
clusions and recommendations for additional analyses are presented in
Section 4.

-------
2 DATA
In this section we describe the methods used to select the ozone and meteorological
data used in our analysis.

SELECTION OF STUDY PERIOD

Most ozone measurements collected prior to 1979 were made by instruments cali-
brated by the neutral buffered potassium iodide (NBKI) wet chemical method. Dur-
ing 1979, all monitors were switched over to ultraviolet photometry (UV method), a
physical method that does not involve the use of liquid chemical reagents. Compari-
sons of UV and NBKI measurements have shown the latter to be less reliable and
have failed to establish a uniform correction factor that could be applied to all NBKI
results (43 FR 21, 22 June 1978). Because of this, we decided to use only UV data in
our analysis, thus effectively limiting our study period to the years 1979-1988.

Ozone concentrations during the warm summer months when the sun is more directly
overhead are generally higher than in other seasons. However, southern U.S. cities
with mild climates may experience high ozone concentrations year round. EPA has
designated ozone seasons for each region of the country during which certain moni-
toring requirements apply. For the northeastern United States, this season runs
1 April through 31 October. Because high ozone concentrations are very unlikely to
occur during other times of the year in this region, we have restricted our attention
in this study to data collected during the northeastern ozone season.
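
A date filter for the northeastern ozone season is simple to express. The helper names below are hypothetical; the sketch only encodes the 1 April through 31 October window stated above.

```python
from datetime import date


def in_northeast_ozone_season(day):
    """True if the date falls within 1 April through 31 October of its year."""
    return date(day.year, 4, 1) <= day <= date(day.year, 10, 31)


def season_length(year):
    """Number of calendar days in the 1 April to 31 October season."""
    return (date(year, 10, 31) - date(year, 4, 1)).days + 1


print(season_length(1988))  # 214
```

The season contains 214 days in every year, leap or not, since February lies outside it.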

OZONE DATA

Monitoring sites examined in our study were selected from among those used in a
previous analysis of the effects of meteorological conditions on ozone concentrations
conducted in the northeastern United States (Pollack et al., 1988). In this way,
results from the earlier study could be used to guide our work, and resources other-
wise spent on data collection and processing could be redirected towards exploratory
analyses. Two monitoring sites examined in detail by Pollack and co-workers were
chosen for our study: Bridgeport, Connecticut (SAROAD Site ID 070060123F01) and
a site near the Philadelphia, Pennsylvania airport (SAROAD Site ID 397140024H01).
Data from the Philadelphia monitor were used only for preliminary analyses designed
to determine the framework of our study. Because of limited available resources,
these analyses are not reported here.

-------
   In addition to examining data from individual sites, we also pooled data from two
   monitoring networks incorporating these sites—one in the Philadelphia area and one
   in southwestern Connecticut. Monitors included in the networks were selected from
   among those examined by Pollack and co-workers (1988).  Only monitors with reason-
   ably complete records of valid daily maximum values were selected.*  Monitors
   meeting these criteria were chosen for inclusion in the networks on the basis of their
   locations and the extent to which daily maximum ozone values at each monitor are
   correlated with those observed at the Philadelphia airport and Bridgeport monitors.
   Monitors located close to or within the Philadelphia metropolitan or southwestern
   Connecticut areas with correlation coefficients above about 0.75 were selected.
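
The correlation screen described above can be sketched as follows. This is an illustration under stated assumptions: the function names, the use of `None` for days without a valid daily maximum, and the restriction to days on which both monitors report are choices made for the sketch, and the 0.75 cutoff is treated as a sharp threshold although the text describes it as approximate. The completeness and location criteria are not represented.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation over days on which both series have valid values."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two paired valid days")
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    return sxy / sqrt(sxx * syy)  # fails if either series is constant


def select_monitors(reference, candidates, threshold=0.75):
    """Names of candidate monitors correlated with the reference above threshold."""
    return [name for name, series in candidates.items()
            if pearson(reference, series) > threshold]


# Toy daily maxima (pphm); None marks a day without a valid value.
reference = [5.0, 8.0, 12.0, None, 9.0]
candidates = {"a": [6.0, 9.0, 13.0, 7.0, 10.0], "b": [9.0, 5.0, 7.0, 8.0, 6.0]}
selected = select_monitors(reference, candidates)
```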

   The above selection process resulted in identification of a group of monitors capable
   of representing the air quality in a region over which concentrations respond to a
   single set of local meteorological influences.  By pooling data from all of the moni-
   tors in each network, concentration data indicative of the local influences can be
   obtained without interference from the less interesting microscale factors that can
   affect concentration levels at individual monitors.

   Monitors selected for inclusion in the Philadelphia and southwestern Connecticut
   networks are shown in Table 2-1.  Also shown in this table are correlations between
   daily maximum concentrations at each monitor and the corresponding concentration
   at the Philadelphia airport and Bridgeport, Connecticut, monitors.  Frequencies of
   missing valid daily maximum concentrations for each monitor by year are shown in
   Tables 2-2 and 2-3 for monitors in the Philadelphia monitoring network and the
   southwestern Connecticut monitoring network, respectively. When analyzing net-
   work summary statistics for 1979 and 1980, it must be kept in mind that only two or
   three monitors in the Philadelphia network were reporting during these years.
   Generally complete data are  available for the Philadelphia network beginning in 1982
   and for all years except 1979 in the southwestern Connecticut network.

    CALCULATION OF SUMMARY STATISTICS

    Daily and annual summary statistics were calculated for the Bridgeport monitor as
    well as for the two monitoring networks described above. For the Bridgeport moni-
    tor, the daily maximum hourly average concentration was used as the daily summary
    statistic. Two daily summary statistics were used for the monitoring networks: (1)
    the network average of the daily maximum concentrations, Dmax, and (2) the largest
    of the daily maximum concentrations, Max(Dmax).
* In accordance with EPA guidelines (EPA, 1979), a daily maximum ozone concentration
  was considered valid only if fewer than four missing hourly values occurred between
  0900 and 2100 LST or if any single hour exceeded the NAAQS (12 pphm).
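
The validity rule and the two network statistics can be sketched in Python. Several details are assumptions of this sketch rather than statements from the guideline: hourly values are taken to be indexed by the hour in which they begin, so that 0900 to 2100 LST covers indices 9 through 20; an exceedance of 12 pphm is taken to mean a value at or above 12.5 pphm; and the function names are hypothetical.

```python
def valid_daily_max(hourly, exceed_pphm=12.5, max_missing=3):
    """Return the daily maximum (pphm) if the day is valid, else None.

    hourly: 24 hourly averages in pphm, None where missing.
    """
    observed = [v for v in hourly if v is not None]
    if not observed:
        return None
    daily_max = max(observed)
    # Count missing hours in the assumed 0900-2100 LST window.
    daytime_missing = sum(1 for v in hourly[9:21] if v is None)
    # Valid with fewer than four missing daytime hours, or if any hour
    # exceeds the standard regardless of completeness.
    if daytime_missing <= max_missing or daily_max >= exceed_pphm:
        return daily_max
    return None


def network_stats(daily_maxima):
    """Return (Dmax, Max(Dmax)) over the monitors with valid daily maxima."""
    valid = [v for v in daily_maxima if v is not None]
    if not valid:
        return None, None
    return sum(valid) / len(valid), max(valid)


# One day at three monitors, the second without a valid daily maximum.
dmax, max_dmax = network_stats([8.0, None, 10.0])
```

Averaging only over monitors with valid values means the network average for a given day may rest on different subsets of monitors, which matters in years when few monitors were reporting.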



-------
TABLE 2-1.  Ozone monitoring sites included in the Philadelphia and
southwestern Connecticut networks.

State  City                SAROAD ID      AIRS ID       Correlation*

Philadelphia, Pennsylvania
NJ     Camden              310720003F01   34-007-0003   0.85
NJ     Camden County       310740001F01   34-007-1001   0.84
NJ     Gloucester          311760002F01   34-015-0002   0.91
NJ     McGuire AFB         312750001F01   34-005-3001   0.90
NJ     Mercer County       312980005F01   34-021-0005   0.77
PA     Bristol             391080012F01   42-017-0012   0.93
PA     Chester             391620012F01   42-045-0002   0.85
PA     Norristown          396540013F01   42-091-0013   0.85
PA     Philadelphia        397140023H01   42-101-0023   0.84
PA     Philadelphia        397140024H01   42-101-0024   1.00
DE     New Castle County   080180001F01   10-003-3001   0.85
DE     New Castle County   080180003F01   10-003-1003   0.79
DE     New Castle County   080180018F01   10-003-0018   0.78

Southwestern Connecticut
CT     Bridgeport          070060123F01   09-001-0123   1.00
CT     Danbury             070175123F01   09-001-1123   0.81
CT     Greenwich           070330017F03   09-001-0017   0.91
CT     New Haven           070700123F01   09-009-1123   0.92
CT     Stratford           071110007F01   09-001-3007   0.93
CT     East Hartford       070220003F01   09-003-1003   0.81
CT     Middletown          070570007F01   09-007-0007   0.89

* Correlation of valid daily maximum ozone concentrations (1979-1988)
  during April-October ozone season with Philadelphia airport monitor
  (SAROAD ID 397140024H01 for Philadelphia network) and Bridgeport monitor
  (SAROAD ID 070060123F01 for southwestern Connecticut network).

-------
TABLE 2-2.  Number of days with missing valid daily maximum ozone concentrations for monitors in the
Philadelphia monitoring network (there are 214 days in the monitoring season).

Monitor Name        AIRS ID Code   1979  1980  1981  1982  1983  1984  1985  1986  1987  1988
New Castle County   10-003-0018     214   214   164    98    20    32    17    44    38    77
New Castle County   10-003-1003     214   214   162    33    13    20    14    10    14    44
New Castle County   10-003-3001     214   214   159    23    17    11    15    14    22    35
McGuire AFB         34-005-3001     214    28    33    19    12    15    17     8    32     5
Camden              34-007-0003      19    34     8     9     7     4    13    10     4     8
Camden County       34-007-1001      19    28     5    15    10     5    18    23    19     2
Gloucester          34-015-0002     214   214   214    11    13     3     6     5     9     9
Mercer County       34-021-0005     214   214    79     8    18    11     4    11     8     5
Bristol             42-017-0012     214   214    28    27    15    17    14    10    14     8
Chester             42-045-0002     214   214    15    25    11    25    23    14    14     3
Norristown          42-091-0013     214   214    26    24    16    17     6    14    13    13
Philadelphia        42-101-0023     214   214   214    36    22    45    85    42    63   136
Philadelphia        42-101-0024     214   214    22    18    16    26    13    23     4   132

-------
TABLE 2-3.  Number of days with missing valid daily maximum ozone concentrations for monitors in the
southwestern Connecticut monitoring network (there are 214 days in the monitoring season).

Monitor Name    AIRS ID Code   1979  1980  1981  1982  1983  1984  1985  1986  1987  1988
Greenwich       09-001-0017     214   214   214    45    39    31    43    14    13    10
Bridgeport      09-001-0123     214     2    32    15     3    31     5     5     7   214
Danbury         09-001-1123     214    44    45    58    22    41    14     9    18    13
Stratford       09-001-3007     214    43    39    12     0    38    12     2    19   214
East Hartford   09-003-1003     214   214    18    11     7    39    14     4     3     1
Middletown      09-007-0007     214    48    29    23    10    31     6     4     5    11
New Haven       09-009-1123     214    39    29    32     2    48    10     4    17     0

-------
   Annual summary statistics were computed as follows: For the Bridgeport monitor,
   the number of times each year that the daily maximum hourly average concentration
   exceeded 10 and 12 pphm (i.e., was at or above 10.5 or 12.5 pphm, respectively) was
   calculated. For the monitoring networks, two annual summary statistics were calcu-
   lated:

        The annual average of the daily summary statistics

        The number of times during the year that the daily summary statistics
        exceeded 8 or 10 pphm.
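
   The exceedance counts described above can be sketched in a few lines of Python.
   This is a minimal illustration rather than the program used in the study: the
   function name, the (year, Dmax) input layout, and the sample values are
   assumptions made for the example. The only rule taken from the text is that a
   level of L pphm is exceeded when the daily maximum is at or above L + 0.5 pphm.

```python
def annual_exceedance_counts(daily_max, level_pphm):
    """Count, for each year, the days whose daily maximum exceeds level_pphm.

    daily_max: iterable of (year, dmax_pphm); dmax_pphm is None when missing.
    A level L is treated as exceeded when dmax >= L + 0.5 (rounding rule
    stated in the text). Missing days are simply skipped here.
    """
    threshold = level_pphm + 0.5
    counts = {}
    for year, dmax in daily_max:
        counts.setdefault(year, 0)  # keep years that have no exceedances
        if dmax is not None and dmax >= threshold:
            counts[year] += 1
    return counts


# Hypothetical daily maxima (pphm), for illustration only.
sample = [(1983, 12.5), (1983, 12.4), (1983, None), (1984, 10.5), (1984, 13.1)]
exceed_10 = annual_exceedance_counts(sample, 10)
exceed_12 = annual_exceedance_counts(sample, 12)
```

   With these sample values, 12.4 pphm counts as an exceedance of 10 pphm but not
   of 12 pphm, which is the distinction the rounding rule is meant to capture.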
   METEOROLOGICAL DATA

   Meteorological data were obtained for our study from the National Climatic Data
   Center surface and upper-air data archives. These data were originally used by
   Pollack and co-workers (1988). We supplemented the data base from that study by
   obtaining additional surface data for the latter part of 1987 and all of 1988.
   Meteorological observations from Bridgeport, Connecticut were used to analyze
   ozone concentrations at the southwestern Connecticut sites, and data from Phila-
   delphia were used for the Philadelphia ozone monitors.

   Pollack and co-workers considered a large number of variables that are potentially
   related to ozone concentrations (Table 2-4). These variables are grouped according
   to the primary processes by which they are believed to affect ozone concentrations.
   Since there is  a strong interrelationship between many of the variables (e.g., daily
   maximum temperature and average wet bulb temperature), the variables could have
   been grouped in many ways; the classifications in Table 2-4 are merely one
   example. A description  of each group adapted  from the report by Pollack and co-
   workers follows:
   Insolation

   Certain variables are associated with the amount of solar radiation that is available
   for NO2 photolysis and subsequent ozone formation. NO2 photolysis rates were
   obtained from Demerjian et al. (1980) for each daylight hour. These values were
   added (integrated) over the daily surface weather observations. FAC1 represents the
   photolysis rate assuming clear skies, while  FAC2 reflects an adjustment for cloud
   cover according to the method of Maul (1980) as adapted by Scire and co-workers
   (1983).  Ceiling height (CAVE) is loosely related to the amount and type of clouds and
   therefore the amount of solar radiation reaching the lowest layers of the atmo-
   sphere. Total opaque sky cover (OPQAVE) serves a similar purpose. The number of
   daylight hours (DAYHR) provides a measure of the extent of solar insolation in the
   day and also serves as an indicator of time of year.

-------
TABLE 2-4.  Meteorological variables available for northeastern monitoring sites.
(Source:  Pollack et al., 1988)

Variable                  Description                                              Unit

Insolation

OPQAVE                    Average total opaque sky cover                           tenths
FAC1                      Integrated clear sky photolysis rate                     pphm min^-1
FAC2                      Integrated cloudy sky photolysis rate                    pphm min^-1
CAVE                      Average ceiling height                                   ft
DAYHR                     Number of daylight hours

Ventilation

VWSAVE                    Vector average wind speed                                kt
SWAVE                     Scalar average wind speed                                kt
WRATIO                    Wind fluctuation (ratio of scalar to vector
                            average wind speed)
MIXHT1                    Morning (0700 LST) mixing height                         m
MIXHT2                    Afternoon (1900 LST) mixing height                       m
MIXHT3                    Average mixing height (average of 0700 and               m
                            1900 LST observations)
MIXHT4                    Afternoon minus morning mixing height                    m
MIXHT5                    Ratio of afternoon to morning mixing height
MIXHT6                    Ventilation coefficient (average mixing height           m*kt
                            times average wind speed)

Transport

VWDAVE                    Resultant wind direction (from vector average)           degrees from north

Indirect Measures

OLDMAX                    Previous day's daily maximum ozone concentration         pphm
VISAVE                    Average visibility                                       miles
TMAX                      Daily maximum temperature
TDIF                      Daily temperature range (max. minus min.)
TDAVE                     Average dew point
TWAVE                     Average wet bulb temperature
RHAVE                     Average relative humidity
PAVE                      Average surface pressure (corrected to sea level)        mb
PDIF                      Pressure range (max. minus min. surface pressure)        mb
IWEA                      Occurrence of thunderstorm (ITH=1), rain (IRA=1),
                            or drizzle (IDR=1)
TEMP1, TEMP2, TEMP3       850 mb temperature (0700, 1900 LST and average)          °C
TEMP4, TEMP5, TEMP6       1000 mb - 850 mb temperature difference                  °C
                            (0700, 1900 LST and average)
TEMP7, TEMP8, TEMP9       700 mb temperature (0700, 1900 LST and average)          °C
TEMP10, TEMP11, TEMP12    1000 mb - 700 mb temperature difference                  °C
                            (0700, 1900 LST and average)
TEMP13, TEMP14, TEMP15    500 mb temperature (0700, 1900 LST and average)          °C
HEIGHT1, HEIGHT2,         850 mb height (0700, 1900 LST and average)               °C
  HEIGHT3
HEIGHT4, HEIGHT5,         500 mb height (0700, 1900 LST and average)               °C
  HEIGHT6
DEW1, DEW2, DEW3          850 mb dew point (0700, 1900 LST and average)            °C

-------
Ventilation

Certain variables are associated with the advection and dispersion of ozone and pre-
cursors. Previous studies have shown an inverse relationship between wind speed and
ozone concentrations. To account for differences in wind direction, a vector average
wind speed (VWSAVE) defined as the magnitude of the resultant vector was calcula-
ted in addition to the scalar average wind speed (SWAVE). A measure of the persis-
tence of wind direction during the day is provided by the wind fluctuation ratio
(WRATIO), which is defined as the ratio of the scalar average wind speed to the vec-
tor average wind speed. A ratio near one indicates a highly persistent wind direction
while a larger ratio indicates variability in wind direction. Mixing height describes
the vertical dimension of the volume through which dispersion of ozone and precur-
sors can occur. The degree of difference between morning and afternoon mixing
heights may provide an indication of the type of diurnal weather pattern to which a
particular day belongs. A ventilation coefficient (MIXHT6) was calculated by multi-
plying the average of the 12Z and 00Z mixing heights by the average wind speed. A
small coefficient is indicative of conditions under which little dispersion of pollu-
tants takes place.
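
The ventilation variables defined above can be illustrated with a short Python
sketch. It is not the code used by Pollack and co-workers: the function name, the
argument layout, and the sample observations are assumptions for illustration, and
the scalar average is assumed to be the "average wind speed" used in the
ventilation coefficient, which the text does not specify.

```python
import math


def wind_summary(speeds_kt, directions_deg, morning_mix_m, afternoon_mix_m):
    """Return SWAVE, VWSAVE, WRATIO, and MIXHT6 for one day of observations."""
    n = len(speeds_kt)
    scalar_avg = sum(speeds_kt) / n
    # Resolve each observation into east (u) and north (v) components and
    # average them; the magnitude of the resultant is the vector average speed.
    u = sum(s * math.sin(math.radians(d)) for s, d in zip(speeds_kt, directions_deg)) / n
    v = sum(s * math.cos(math.radians(d)) for s, d in zip(speeds_kt, directions_deg)) / n
    vector_avg = math.hypot(u, v)
    # WRATIO is scalar over vector average; it is undefined for a zero resultant.
    wratio = scalar_avg / vector_avg if vector_avg > 0 else math.inf
    avg_mix = (morning_mix_m + afternoon_mix_m) / 2
    return {
        "SWAVE": scalar_avg,
        "VWSAVE": vector_avg,
        "WRATIO": wratio,
        "MIXHT6": avg_mix * scalar_avg,  # m*kt; scalar average is an assumption
    }


# Hypothetical days: a steady westerly wind and a wind that shifts by 90 degrees.
steady = wind_summary([10, 10, 10], [270, 270, 270], 400, 1600)
veering = wind_summary([10, 10], [0, 90], 400, 1600)
```

The steady day yields a ratio near one, while the veering day yields a ratio near
1.41, in line with the interpretation of WRATIO given above.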
Transport

Trajectories of polluted air masses are a function of wind speed (already included in
the ventilation category discussed above) and direction (VWDAVE). Ozone and pre-
cursors may be transported both into and out of a particular region. High ozone con-
centrations at sites not located within an urban area can occur only when trajec-
tories are such that a polluted air mass is advected toward the site.
Indirect Measures

Many meteorological variables are associated with combinations of factors that
either promote or hinder the formation of high ozone concentrations, although they
are not a direct measure of any of the factors themselves. For example, the daily
range of surface pressure (PDIF) does not in itself affect ozone concentrations.
However, large pressure ranges are most likely indicative of windy days with possible
frontal passage and associated precipitation. Such days will have generally low
ozone concentrations due to the dispersing action of the winds, lack of solar insola-
tion, scavenging of pollutants by precipitation, and low temperatures. Conversely,
high ozone days may be characterized by small pressure ranges.

One of the most significant meteorological variables in this group is the daily maxi-
mum temperature (TMAX). While some photochemical reactions do proceed more
rapidly as temperatures increase, resulting in a net increase in ozone formation
(Whitten and Gery, 1986), temperature is also (and perhaps most significantly) rela-
ted to a number of other factors that influence ozone formation. For example, very
warm days will occur only in conjunction with clear skies and usually light winds. As
noted above, both of these conditions are conducive to high ozone concentrations. In
addition, high temperature may promote changes in the amount and chemical nature
of precursor emissions (e.g., an increase in evaporative VOC emissions), which may
also promote ozone formation.

A number of other variables have been included in the "indirect measures" group.
The inclusion of the previous day's ozone concentration (OLDMAX) is based on the
fact that weather patterns tend to persist from one day to the next and the possi-
bility that ozone is carried over from one day to the next. Humidity indicators (dew
point, wet bulb temperature, and relative humidity) may be related to observed
ozone concentrations because they are indicative of overall weather patterns that
either promote or hinder ozone formation. Humidity also plays a role in the photo-
chemical reactions that result in ozone formation (Whitten and Gery, 1986). Tem-
perature lapse rates (1000 - 850 mb, 1000 - 700 mb temperature differences) are
indicative of overall weather patterns and atmospheric stability. The presence of
precipitation (occurrence of thunderstorm, rain or drizzle) drastically reduces ozone
concentrations due to increased cloudiness, decreased solar insolation, washout by
rain, higher ventilation by increased winds, and lower temperatures. Atmospheric
thickness (the depth between the 850 and 500 mb pressure surfaces) is a direct func-
tion of the mean temperature in the layer and is thus related to ozone concentrations
for the reasons discussed above. Finally, visibility is an indicator of aerosol loading
and may thus be correlated with ozone concentrations.

Pollack and co-workers (1988) found that not all of the variables described above
have a significant relationship with ozone. In addition, many of the variables were
found to be highly correlated with one another and therefore redundant. Eliminating
most of the obviously redundant variables along with those found to be unrelated to
ozone concentration leads to a reduced list of variables (Table 2-5). Our preliminary
analyses (described in the next section) made use of just the variables listed in Table
2-5. Later analyses included some additional variables, which are described in the
discussion of those analyses in Section 3 (see Table 3-3).

Since we examined daily maximum ozone concentrations, daily averages of the selec-
ted meteorological variables were used in our analysis. A previous EPA study (EPA,
1985) suggested that conditions during the morning and early afternoon hours are
particularly important in determining a day's ozone formation potential. Therefore,
as indicated in Tables 2-5 and 3-3, two averaging periods were used in defining the
daily meteorological variables: an average over the daylight hours (0700 - 1900,
Local Standard Time) and an average over the "ozone formation period" (0700 - 1300
LST). In most cases, variables averaged over all daylight hours are noted by the
appearance of a "2" at the end of the variable name. Averaging times used for each
variable are indicated in the tables.

-------
TABLE 2-5.  Key meteorological variables selected for inclusion in the
preliminary calculations.

Variable    Unit           Description

TMAX2       °F             Daily maximum temperature (0700 - 1900 LST)
RHAVE       %              Average relative humidity (0700 - 1300 LST)
OLDTMAX2    °F             Previous day's daily maximum temperature (TMAX2)
FAC2        pphm min^-1    Integrated NO2 photolysis rate adjusted for
                           cloud cover (0700 - 1300 LST)
SWAVE       kt             Scalar average wind speed (0700 - 1300 LST)
WD                         Vector average wind direction (1 = N,
                           2 = NE, ..., 8 = NW) (0700 - 1300 LST)

-------
                              3   METHODOLOGY
As described in the introduction, most methods for developing meteorologically
adjusted ozone trends involve the development of a statistical model that relates
ozone concentrations to meteorological conditions. Such a model can be built by
applying standard statistical techniques such as linear regression to a "learning" data
base consisting of monitored ozone concentrations and concomitant meteorological
observations. For purposes of illustration, assume that a linear regression model is
used to relate the daily maximum ozone concentration, Dmax, to a set of two
meteorological variables, m1 and m2 (e.g., daily maximum temperature and average
wind speed). Mathematically, such a model looks like this:

    $$D_{max} = b_0 + b_1 m_1 + b_2 m_2$$

where Dmax is the ozone concentration estimated or "predicted" by the model, and
b0, b1, and b2 represent the regression coefficients (i.e., model parameters). Assume
for the moment that separate sets of values of b0, b1, and b2 are determined through
a least-squares regression analysis for each year over a 10-year period.  In this case,
different values of the parameters will be calculated for different years. At least
part of this difference will be due to the fact that precursor emissions vary from one
year to the next, thus altering the relationship between ozone concentrations and
meteorological conditions.  In other words, the same set of meteorological condi-
tions will lead to different ozone concentrations in different years due to changes in
emissions.*

By applying the above model with the ten sets of parameters developed  for a series
of ten years to a standard or reference set of meteorological conditions (i.e., fixed
sets of values of m1 and m2), it is possible to calculate the effects of emissions
changes on ozone concentrations without the interfering effects of meteorological
* In practice, the parameter values will differ from year to year even if there is no
  change in precursor emissions because the simple model described above cannot
  completely describe all of the different ways that meteorology can affect ozone
  concentrations. We temporarily ignore this complication here for the sake of
  simplicity.



-------
variability. The standard sets of meteorological conditions may be those observed
during a particular year arbitrarily chosen to serve as the reference year. Alterna-
tively, they may be chosen to correspond to conditions during an average or typical
year.  In either case, the resulting met-adjusted ozone concentrations calculated for
each year will be those that would have been observed had the meteorological condi-
tions during the year been identical to the reference conditions. Any remaining
year-to-year variations in the met-adjusted ozone concentrations are due to varia-
tions in the model parameters arising from changes in precursor emission levels.
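
The regression-based adjustment described above can be sketched as follows. This
is an illustration under stated assumptions, not the procedure applied by Pollack
and co-workers: the helper names, the synthetic data, and the coefficient values
are all hypothetical. Each year's coefficients are fitted by least squares and
then applied to one fixed set of reference meteorological conditions.

```python
def solve(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_linear(rows):
    """Least-squares fit of Dmax = b0 + b1*m1 + b2*m2; rows are (m1, m2, dmax)."""
    xtx = [[0.0] * 3 for _ in range(3)]
    xty = [0.0] * 3
    for m1, m2, dmax in rows:
        x = (1.0, m1, m2)
        for i in range(3):
            xty[i] += x[i] * dmax
            for j in range(3):
                xtx[i][j] += x[i] * x[j]
    return solve(xtx, xty)  # normal equations (X'X) b = X'y


def met_adjusted_mean(coeffs, reference_met):
    """Average predicted Dmax over the fixed reference meteorology."""
    b0, b1, b2 = coeffs
    return sum(b0 + b1 * m1 + b2 * m2 for m1, m2 in reference_met) / len(reference_met)


def synth(b0, met):
    """Build synthetic (m1, m2, dmax) rows; the slopes 0.1 and -0.2 are made up."""
    return [(m1, m2, b0 + 0.1 * m1 - 0.2 * m2) for m1, m2 in met]


# Two hypothetical years: different weather, and a lower intercept in 1981
# standing in for reduced precursor emissions.
learning = {
    1980: synth(1.0, [(70, 5), (80, 10), (90, 5), (85, 8)]),
    1981: synth(0.5, [(60, 12), (65, 9), (72, 11), (68, 14)]),
}
reference_met = [(75, 6), (88, 4)]  # (temperature, wind speed), hypothetical
adjusted = {
    year: met_adjusted_mean(fit_linear(rows), reference_met)
    for year, rows in learning.items()
}
```

Because both years are evaluated on the same reference conditions, the difference
between the two adjusted values reflects only the change in fitted coefficients,
which is the point of the adjustment.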

As noted in the Introduction, Pollack and co-workers (1988) discovered that the use
of a linear regression model similar to the one described above to adjust  ozone trends
suffered from a major drawback; the model consistently underpredicted concentra-
tions on high ozone days.  This defect makes the regression model unsuitable for cal-
culating met-adjusted values of certain annual ozone summary statistics such as the
number of days exceeding the ozone NAAQS that are strongly dependent on the fre-
quency and magnitude of  high daily concentrations.  For this purpose, a statistical
model capable of providing unbiased predictions of high ozone concentrations is
needed.

In the following subsections we describe several ways in which CART, an alternative
statistical procedure that has the potential for reducing the problem with regression
models noted by Pollack and co-workers, can be used to calculate met-adjusted
exceedance trends. We then present some results produced by this method and com-
pare them to results obtained by a much simpler adjustment method suggested by
Jones and Dash (1985).
OVERVIEW OF CART

The Classification and Regression Tree (CART) methodology developed by Breiman
and co-workers (1984) is an exploratory data analysis tool that can be used to
identify groups of days on the basis of daily meteorological characteristics such that
ozone concentrations on days falling within a particular meteorological group are
nearly identical (according to certain statistical criteria described below). Thus,
CART might be used to identify a series of meteorological characteristics that can
be used to sort days into low, medium, and high ozone groups. Once the appropriate
characteristics are identified, days can be classified as belonging to the low,
medium, or high group without knowing the actual ozone concentrations on those
days. In other words, CART allows one to predict the approximate ozone concentra-
tion based on meteorological observations.  The advantage of the CART approach
over simple  linear regression is that if there are specific meteorological conditions
associated with high ozone days, such days will be placed in a separate group, thus
eliminating the compromise involved in fitting a single linear equation to  all of the
data. It  is this compromise that led to the underprediction of high ozone  days noted
by Pollack and co-workers (1988).

-------
CART consists of a set of algorithms for growing binary decision trees.  Such a tree
consists of a series of decision points (represented by nodes) and data paths (repre-
sented by branches). At the beginning of the tree, data are split off into one of two
branches based on certain criteria.  For the application described here, these criteria
are based on the observed meteorological conditions. Data in each of these branches
are then split again on the basis of different criteria. This process is repeated until
the data are sufficiently subdivided (i.e., until the ozone concentrations on days
within each final group are sufficiently uniform).*

An example of a tree grown by CART is shown in Figure 3-1.  The beginning of the
tree is known as the root node, and the endpoints of the last branches on the tree are
known as terminal nodes. At each node, data are sent down one branch or the other
on the basis of a decision of the type: Is m < c?  Here, m is one of the meteorologi-
cal predictor variables, and c is a specific cutoff value.  At each node, CART deter-
mines the particular meteorological variable, m  (known as the primary splitting
variable) and the cutoff, c, that produces the "best" possible split of the data at that
point. A good split is one that results in a division of days such that the variance
(variability) of ozone concentrations within each newly formed group is  greatly
reduced from that of the original group. The best split is the split that results in the
largest possible difference between the sum over the newly formed groups of the
within-group variances and the variance in the data before  the split is made.  CART
also identifies a list of "competitor" splits for each node. This is a rank-ordered list
of variables and associated cutoff values that result in the next best possible splits
(i.e., the splits that produce the next best possible reduction in variance) at the
node. In some cases, the top-ranked competitor split uses a variable that  produces
almost as good a reduction in variance as the primary splitting variable.
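
The split search described above can be sketched in Python. This is a simplified
illustration, not the CART program itself: the function names, the record layout,
and the sample days are assumptions, and within-group variability is measured by
the sum of squared deviations about the group mean (the usual regression-tree
criterion), which is assumed here to be what "variance" stands for.

```python
def sum_sq_dev(values):
    """Sum of squared deviations about the mean (0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def rank_splits(days, predictors, target="dmax"):
    """Return the best split per predictor, ranked by reduction in variability.

    Each split asks "Is m < c?". The first entry is the primary split and the
    remaining entries play the role of competitor splits.
    """
    parent = sum_sq_dev([d[target] for d in days])
    ranked = []
    for name in predictors:
        levels = sorted({d[name] for d in days})
        best = None
        # Candidate cutoffs are midpoints between adjacent observed values.
        for lo, hi in zip(levels, levels[1:]):
            cutoff = (lo + hi) / 2
            left = [d[target] for d in days if d[name] < cutoff]
            right = [d[target] for d in days if d[name] >= cutoff]
            reduction = parent - (sum_sq_dev(left) + sum_sq_dev(right))
            if best is None or reduction > best[0]:
                best = (reduction, name, cutoff)
        if best is not None:
            ranked.append(best)
    ranked.sort(reverse=True)
    return ranked


# Hypothetical learning sample: temperature (deg F), wind speed (kt), Dmax (pphm).
days = [
    {"tmax": 70, "swave": 5, "dmax": 5},
    {"tmax": 72, "swave": 12, "dmax": 6},
    {"tmax": 75, "swave": 6, "dmax": 5},
    {"tmax": 85, "swave": 11, "dmax": 11},
    {"tmax": 88, "swave": 4, "dmax": 13},
    {"tmax": 90, "swave": 9, "dmax": 12},
]
ranked = rank_splits(days, ["tmax", "swave"])
primary = ranked[0]
```

For this small sample the primary split is on temperature, and the best wind
speed split appears second in the list as a competitor.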

A tree is grown initially on a "learning" sample.  This data set is used to determine
the choices of m and c at each node. In addition, the predicted value of the daily
maximum ozone concentration (Dmax) associated with each terminal node is calcu-
lated by averaging the Dmax values over all days in the learning sample that  fall into
the node.

A CART decision tree can be used to "predict" the daily maximum ozone concentra-
tion for a particular day not included in the learning sample on the basis of the
meteorological characteristics of that day just as the linear regression model
described in the previous section can be used to  make such a prediction. This is done
as follows: A day is "dropped" down the tree, starting at the  root node, with the
direction of travel at each node determined on the basis of the value of one of the
* Criteria for determining when this point has been reached are included in the
  CART algorithm.



-------
FIGURE 3-1. Schematic diagram of a CART tree showing root node, intermediate
nodes, terminal nodes, and branches.
-------
meteorological variables on that day. At the bottom of the tree, the day ends up in a
terminal node consisting of similar days with similar daily maximum ozone concen-
trations. The average daily maximum concentration calculated for that terminal
node from the learning sample can then be used as the predicted concentration for
the day in question. The prediction error for that day is then the difference between
this predicted Dmax and the value actually observed. This error is used to determine
how well the decision tree model fits the data and to determine the optimum-sized
tree, as explained below.
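
Dropping a day down a tree can be sketched as below. The tree shown is
hypothetical (its variables, cutoffs, and node means are made up for illustration
and are not taken from Figure 3-3), and the sign convention for the prediction
error (observed minus predicted) is an assumption.

```python
# A hypothetical two-split tree. Internal nodes ask "Is var < cutoff?";
# terminal nodes carry the learning-sample mean of Dmax (pphm).
tree = {
    "var": "tmax", "cutoff": 80.0,
    "yes": {"mean_dmax": 5.0},
    "no": {
        "var": "swave", "cutoff": 8.0,
        "yes": {"mean_dmax": 12.0},  # hot day with light winds
        "no": {"mean_dmax": 8.0},    # hot day with stronger winds
    },
}


def predict(node, day):
    """Follow the splits from the root until a terminal node is reached."""
    while "var" in node:
        branch = "yes" if day[node["var"]] < node["cutoff"] else "no"
        node = node[branch]
    return node["mean_dmax"]


# A day that was not part of the learning sample (hypothetical values).
day = {"tmax": 91.0, "swave": 6.0}
observed_dmax = 13.5
predicted_dmax = predict(tree, day)
prediction_error = observed_dmax - predicted_dmax
```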
True and Apparent Error: Growing the Right-Sized Tree

If the learning sample used to grow a decision tree is dropped back down (resubstitu-
ted into) the tree and the squared values of the prediction errors are averaged over
all days in the sample, one obtains the apparent or resubstitution error, R. An
important characteristic of R is that it decreases as the size of the tree (i.e., the
number of terminal nodes in the tree) increases, reaching a value of zero when the
tree has reached a size where at most one day remains in each terminal node. Such
an oversized tree is not very useful, however, since it is too closely tailored to the
specific characteristics of the learning sample and will produce large errors when
applied to an independent data set.

Another way to state the problem associated with oversized trees is that the "true"
error, R*, of the oversized tree will be quite high. R* is the average prediction
error over the set of all days to which the tree is to be applied (not just the days
included in the learning sample). Thus R* represents the average actual error that
would be encountered if the decision tree were used to make predictions on days not
included in the particular set of days on which tree development was based. CART
provides two procedures for estimating R*: one uses an independent test sample, and
the other uses a statistical technique known as cross-validation. In the test sample
method, an independently derived test data set distinct from the learning set is run
down a previously constructed tree, and the value of Dmax on each day in the test
data set is compared to the predicted value assigned to the terminal node into which
the day falls, thus generating a test sample estimate, Rts, of R*. In the cross-vali-
dation method, cross-validation is performed by using subsets of the learning sample
to calculate an estimate, Rcv, of R*. Details of the cross-validation method can be
found in Breiman et al. (1984, p. 234).
Because a large tree too closely reflects the peculiarities of the learning sample, R*
will be undesirably high for such a tree. On the other hand, R* will also be too high
for a tree that is not big enough. An important aspect of CART is the procedure
used to choose the right-sized tree, i.e., the tree that minimizes R*. First, the
learning sample is used to grow a tree large enough so that only a few days at most
(typically no more than five) fall in each terminal node. The tree is then sequentially
pruned upward, with a single branch removed at each step. Pruning is carried out by
using an algorithm that identifies the cut that will produce a pruned tree with a
minimum value of R at each step. In this way, a sequence of optimally pruned trees
is developed, starting from the largest tree and ending with the tree consisting of
just the root node. The true error rate, R*, of each tree in this sequence is then
estimated through either an independent test sample or cross-validation. Confidence
intervals for the estimates (Rts or Rcv) of R* are also calculated for each tree. The
smallest tree that has an estimated true error rate falling within the confidence
interval of the tree with the minimum estimated true error rate is then selected as
the optimum tree (see Figure 3-2).
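
The final selection step can be sketched as follows. The sequence of pruned trees
and its error estimates are hypothetical, and the half-width of the confidence
interval is assumed to be one standard error of the estimate; the report's own
interval definition should be substituted where it differs.

```python
def select_tree(candidates):
    """Pick the smallest tree whose estimated R* lies within the interval
    around the minimum estimate.

    candidates: list of (n_terminal_nodes, estimated_error, standard_error),
    one entry per tree in the pruning sequence.
    """
    best = min(candidates, key=lambda c: c[1])
    upper = best[1] + best[2]  # assumed interval: minimum + one standard error
    eligible = [c for c in candidates if c[1] <= upper]
    return min(eligible, key=lambda c: c[0])


# Hypothetical pruning sequence: (terminal nodes, Rcv, standard error of Rcv).
sequence = [
    (1, 11.0, 0.5),
    (3, 5.0, 0.4),
    (5, 2.6, 0.3),
    (9, 2.3, 0.2),
    (15, 2.2, 0.2),
    (30, 2.9, 0.3),
]
chosen = select_tree(sequence)
```

Here the 15-node tree has the lowest estimate, but the 9-node tree falls within
the interval and is therefore preferred as the smaller of the two.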
TYPES OF TREES THAT CAN BE CONSTRUCTED WITH CART

Three types of trees can be constructed with CART: regression, classification, and
class probability. We have used regression trees to illustrate our points in the discus-
sion above. The key features of each of the other types of trees are briefly
described below. Additional details may be found in the book by Breiman and co-
workers (1984).
Classification Trees: In these trees the dependent variable is categorical
(e.g., a day is either an exceedance or not). The goal of growing a classifi-
cation tree is to develop a decision rule that can be used to accurately pre-
dict the category in which a case (e.g., day) belongs. Thus, the prediction
associated with each terminal node of a classification tree is the class
designation (either exceedance or no exceedance) determined by the class
to which the majority of days from the learning sample assigned to the
node belong.

Class Probability Trees: These trees are a variation of the classification
tree concept in which the goal is to accurately predict the probability of a
case belonging to a particular category (e.g., the probability of a day being
in exceedance).

To calculate a meteorologically adjusted trend in the number of exceed-
ances per year of the ozone standard using a decision tree, we must esti-
mate the probability that a day falling into a particular terminal node will
be an exceedance day. For class probability trees, this probability is
calculated by CART. For regression trees, the probability of exceedance
can be estimated by noting the fraction of days from the learning sample
that fall in the node that are in exceedance. A similar procedure can be
used for classification trees, or, alternatively, the class designation of each
node can be used to determine if the probability of exceedance should be
set equal to one or zero.
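
The two ways of assigning an exceedance probability to a terminal node can be
sketched as below. The node numbers and Dmax values are hypothetical, and the
exceedance rule (Dmax at or above the level plus 0.5 pphm) follows the convention
used for the summary statistics in Section 2.

```python
def node_exceedance_probabilities(node_days, level_pphm):
    """Fraction of learning-sample days in each node that are exceedances.

    node_days: dict mapping node id -> list of Dmax values (None if missing).
    Nodes with no valid Dmax values get None.
    """
    threshold = level_pphm + 0.5
    probs = {}
    for node, values in node_days.items():
        valid = [v for v in values if v is not None]
        if valid:
            probs[node] = sum(1 for v in valid if v >= threshold) / len(valid)
        else:
            probs[node] = None
    return probs


def class_designation(prob):
    """Classification-tree alternative: one if the majority of days in the
    node are exceedances, zero otherwise (ties are treated as zero here)."""
    return 1 if prob is not None and prob > 0.5 else 0


# Hypothetical learning-sample Dmax values (pphm) for two terminal nodes.
node_days = {
    8: [13.0, 12.6, 9.0, 14.1],
    1: [4.0, 5.5, 6.1, 12.7, 3.0],
}
probs = node_exceedance_probabilities(node_days, 12)
designations = {node: class_designation(p) for node, p in probs.items()}
```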
-------
FIGURE 3-2. True (R*) and resubstitution (R) error as a function
of tree size (number of terminal nodes). The figure marks the tree with
minimum R* and the confidence interval around that minimum.
-------
We conducted some preliminary analyses in which we compared results
obtained by applying regression, classification and class probability trees to
the Bridgeport site (see Stoeckenius and Hudischewskyj, 1989). Our results
indicate that classification trees are not well suited to trends adjustment
because they are not able to identify the particular meteorological condi-
tions that are associated with high ozone events and which contribute most
significantly to the annual exceedance total. We encountered a similar dif-
ficulty with the class probability tree we grew using data from the Bridge-
port monitor: None of the terminal nodes in the tree corresponded to a set
of meteorological conditions that are nearly always associated with
exceedances. This appears to be a result of the criteria used in CART to
grow class probability trees. These criteria are designed to result in a tree
that most accurately predicts the probability of exceedance for a given
day, rather than to identify the meteorological conditions that are
extremely likely (or unlikely) to result in exceedances. Further details of
these analyses can be found in Stoeckenius and Hudischewskyj (1989).

On the basis of the results summarized above, we decided to focus our
attention on the development of met-adjusted exceedance trends using
regression trees rather than classification or class probability trees.
Details of the manner in which regression trees can be used to calculate
met-adjusted trends are presented in the following section.
USING REGRESSION TREES IN TRENDS ADJUSTMENT

Regression trees developed with CART represent potentially useful tools for calcula-
ting meteorologically adjusted ozone trends. To illustrate how such trees can be
used in this way, we describe a set of adjustment calculations we carried out by
means of a regression tree developed for the Bridgeport, Connecticut, ozone moni-
tor. For these preliminary calculations, we used the set of key meteorological
variables identified by Pollack and co-workers (1988) and listed in Table 2-5.

We first grew a regression tree (see Figure 3-3) using data from all available years of
data (1979-1987 in this case since 1988 data were unavailable at the time these pre-
liminary runs were made). This tree is identified by the CART computer program as
the smallest tree consistent with minimizing the true error as estimated by tenfold
cross-validation. Table 3-1 shows the number of days in each terminal node (iden-
tified by boldface numbers in Figure 3-3) with Dmax greater than 8, 10, and 12
pphm. Most of the days with Dmax less than 8 pphm are contained in nodes 1-3. As
indicated in Figure 3-3, these nodes are characterized by low maximum temperatures
and wind directions from NW to NE and contain virtually no high ozone days. Thus,
the CART methodology appears to have done a good job of automatically developing
a set of criteria that can be used to identify conditions under which ozone values
above 8 pphm are very unlikely to occur. Again turning to Table 3-1, we see that
node 9 consists exclusively of days on which ozone values are above 10 pphm (and 81
-------
FIGURE 3-3. Regression tree for Bridgeport for 1979-1987. Splits are made on
daily maximum temperature (TMAX2, with cutoffs at 71.5, 76.5, 80.5, and 88.5°F)
and on wind direction category (WD).
-------
TABLE 3-1. Distribution of number of high ozone days by
terminal node for regression tree depicted in Figure 3-3
(Bridgeport, CT, 1979-1987).

Terminal    Total No.                 Number of Days Above‡
Node        of Days*     RSD (%)†     8 pphm    10 pphm    12 pphm

1             757          37            3          1          0
2             186          23            3          0          0
3             125          28            1          0          0
4             214          35           29         14          4
5             161          31           11          6          1
6             150          35           52         26          9
7              96          34           44         21         12
8             151          31          104         67         46
9              25          24           21         21         17

* Does not include days with missing meteorological data.

† Relative standard deviation of daily maximum ozone
concentrations in each terminal node (equal to standard
deviation divided by the mean).

‡ Does not include corrections to account for days with
missing daily maximum ozone concentrations and meteoro-
logical data.
-------
percent of these are above 12 pphm). These days are characterized by maximum
temperatures in excess of 88.5°F and winds from the SW, conditions similar to those
identified by Wackter and Bayly (1987) and Pollack and co-workers (1988) as being
associated with high ozone concentrations in southern Connecticut. Nodes 4 to 8
seem to represent days for which conditions are marginally sufficient for the forma-
tion of high ozone concentrations, but which, for one reason or another, do not
always exhibit such high concentrations.

If the mean Dmax in each terminal node is taken as the predicted value for all days
falling in that node, the tenfold cross-validation estimate of the mean square error
of these predictions is 2.2 pphm² (the variance of Dmax in the 10-year data set is 11
pphm²). The resubstitution estimate of the mean square error is 2.1 pphm². Resubsti-
tution estimates of the relative standard deviations (RSDs) for each terminal node
are shown in Table 3-1. Values for the high ozone, low population nodes (e.g., node 9)
are not significantly larger than those for any other nodes, thus suggesting that the
prediction error on high ozone days is not significantly different from that on the
more abundant low ozone days. Caution must be exercised in interpreting this result,
however, since within-node resubstitution error estimates are likely to be quite dif-
ferent from (and in general smaller than) the true within-node errors.
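
The quantities discussed in this paragraph can be sketched in a few lines. The following is only an illustration under stated assumptions: the input format (a list of (node, Dmax) pairs), the function name, and the use of the population standard deviation within each node are our own choices and are not taken from the CART program output.

```python
from math import sqrt
from statistics import mean, pstdev


def node_summary(days):
    """Summarize a fitted tree on its own learning sample.

    days: list of (node_id, dmax) pairs (hypothetical input format).
    Returns (summary, mse) where summary[node] = (n_days, node_mean, rsd)
    and mse is the resubstitution mean square error obtained when the
    node mean is used as the prediction for every day in that node.
    """
    by_node = {}
    for node, dmax in days:
        by_node.setdefault(node, []).append(dmax)

    summary = {}
    squared_error = 0.0
    for node, values in by_node.items():
        node_mean = mean(values)
        # Assumption: population SD; the report does not say which form was used.
        rsd = pstdev(values) / node_mean
        summary[node] = (len(values), node_mean, rsd)
        squared_error += sum((v - node_mean) ** 2 for v in values)

    return summary, squared_error / len(days)


# Made-up Dmax values (pphm) for two nodes, for illustration only.
example = [(1, 4.0), (1, 6.0), (9, 12.0), (9, 14.0), (9, 13.0)]
summary, mse = node_summary(example)
rmse = sqrt(mse)  # same units as Dmax
```

Because the node means are computed from the same days they are used to predict, this error estimate is optimistic, which is the reason for the caution expressed above.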

One way to use the regression tree in Figure 3-3 to develop a meteorologically adjus-
ted ozone exceedance trend is to drop the days in each of a sequence of overlapping
three-year "base periods" (i.e., 1979-1981, 1980-1982, etc.) down the tree.* For each
base period, the fraction of days falling in each terminal node that are in exceedance
of the concentration level of interest (either 10 or 12 pphm) is calculated, making
suitable corrections to account for missing data.† These fractions represent
estimates of the probability of exceedance for days with the meteorological
characteristics described by the sequence of splits that leads to each terminal node. They can
be viewed as probabilities conditioned on the particular set of precursor emissions
prevailing during each base period. The probability of exceedance for a given node
varies from one base period to the next as a result of changes in precursor emission
levels.
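
A minimal sketch of this step follows. The record layout (dictionaries with `year`, `node`, and `dmax` keys), the function names, and the use of a strict "greater than" comparison are assumptions made for illustration. Days with missing Dmax are simply skipped, which is equivalent to assigning them the exceedance probability of their node.

```python
def base_periods(first_year, last_year, length=3):
    """Overlapping base periods, e.g. (1979, 1980, 1981), (1980, 1981, 1982), ..."""
    return [tuple(range(y, y + length))
            for y in range(first_year, last_year - length + 2)]


def exceedance_probabilities(days, years, threshold):
    """Fraction of days in each terminal node with Dmax above the threshold.

    days: list of dicts with keys 'year', 'node', 'dmax' (hypothetical layout);
          'dmax' is None when the ozone value is missing.
    years: the years making up one base period.
    threshold: concentration level of interest, in the units of 'dmax'.
    """
    counts = {}  # node -> (valid days, days in exceedance)
    for d in days:
        if d["year"] not in years or d["dmax"] is None:
            continue
        n, k = counts.get(d["node"], (0, 0))
        counts[d["node"]] = (n + 1, k + (d["dmax"] > threshold))
    return {node: k / n for node, (n, k) in counts.items()}


periods = base_periods(1979, 1987)  # seven periods, 1979-1981 through 1985-1987
```

A node that receives no valid days in a given base period has no entry in the result; how such cases should be handled is not specified here.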

To obtain an estimate of the number of days that will fall into each terminal node in
an "average" or typical year, meteorological data for all available years are dropped
down the tree, and the average number of days per year in each node is calculated by
dividing the total number of days in the node by the number of years (in this case, 9),
making suitable corrections to account for missing data.§ The long-term average
number of days in each node is then multiplied by the corresponding probability of
exceedance for a particular three-year base period and the results summed over all
nodes to obtain a meteorologically adjusted value of the exceedance rate for that
base period. This value represents an estimate of the number of exceedances one
would expect to have observed during the base period if meteorological conditions
during that period had been typical or "average."

* A length of three years was chosen for the base periods to maximize the amount
of data available for estimating the probabilities of exceedance while minimizing
the chance that precursor emissions will have changed significantly over the
period.

† These corrections are based on the assumption that days with missing Dmax
values (but valid meteorological data) have a probability of being in exceedance
equal to the probability calculated using days with valid Dmax values for the
terminal node into which the day falls.
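
The adjustment just described amounts to a weighted sum over terminal nodes. The sketch below illustrates it under simplifying assumptions: the function names and dictionary inputs are hypothetical, the numbers in the example are made up, and the corrections for missing data are omitted.

```python
def average_days_per_node(node_counts, n_years):
    """Long-term average number of days per year falling in each node."""
    return {node: count / n_years for node, count in node_counts.items()}


def met_adjusted_exceedances(avg_days, exceedance_prob):
    """Sum over nodes of (average days per year) x (probability of exceedance).

    avg_days: node -> long-term average days per year in the node.
    exceedance_prob: node -> probability of exceedance estimated for one
        base period; nodes absent from this mapping contribute nothing.
    """
    return sum(avg_days[node] * exceedance_prob.get(node, 0.0)
               for node in avg_days)


# Made-up two-node example: a populous low ozone node and a sparse high ozone node.
avg_days = {1: 100.0, 9: 5.0}
prob = {1: 0.01, 9: 0.80}
adjusted = met_adjusted_exceedances(avg_days, prob)  # 100*0.01 + 5*0.80
```

Only the probabilities change from one base period to the next; the average day counts are fixed, which is what removes the year-to-year meteorological variation from the result.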

Results of the calculations described above are shown in Table 3-2 and Figure 3-4.
Figure 3-4a shows the observed number of exceedances of 12 pphm in each year (cor-
rected to account for missing data by standard EPA procedures; EPA, 1979) together
with the met-adjusted values plotted at the middle year of each three-year base
period. Also shown are three-year running averages of the observed exceedances.
The series of met-adjusted exceedances is clearly less variable than the series of
observed exceedances, and a gradual decrease in exceedances over time is evident.
Since the met-adjusted exceedances in a sense represent running three-year
averages, it is perhaps more appropriate to compare the met-adjusted exceedances
to three-year running averages of the observed exceedances. Figure 3-4a shows that
the met-adjusted and running averaged observed exceedances are nearly identical.
Similar results were found for exceedances of a 10 pphm threshold concentration
(Figure 3-4b). Thus, at least in this example, the advantage to be gained by calcu-
lating meteorologically adjusted exceedances is not clear.

From the results just described, we can see several difficulties involved with our
CART-based trend adjustment procedure. For one thing, the calculation of model
parameters based on running three-year averages makes it impossible to perform a
meaningful comparison of the observed and met-adjusted exceedance rates in any
individual year. However, it is the varying effect of meteorological conditions on
ozone concentrations in individual years that is of primary interest; taking three-year
averages smooths over most of the variability we are trying to explain. In the next
section we describe some results obtained by using a modified adjustment procedure
in which the model parameters are estimated for individual years, thus allowing a
year-by-year comparison of actual and met-adjusted exceedances.

Another difficulty evident from the above results is that the regression tree model
cannot account for a sizable fraction of the variation in ozone concentrations obser-
ved at an individual monitor (only about 56 percent of the variance in Dmax is
accounted for by the regression tree). It seems reasonable to suppose that a myriad
§ These corrections are based on the assumption that days with missing
meteorological data are distributed across terminal nodes in the same proportion
as are days with non-missing data.

-------
TABLE 3-2. Observed and met-adjusted exceedances for Bridgeport, CT.
Met-adjusted values were obtained by the method described in the text using
the regression tree depicted in Figure 3-3.

                         Number of Exceedances per Year
                    Met-Adjusted                    Observed*
               ----------------------       --------------------------
               10 pphm      12 pphm         10 pphm          12 pphm
Base Period    Threshold    Threshold       Threshold        Threshold
-----------    ---------    ---------       ---------        ---------
1979-1981        22.6         15.6            23.9             16.4
1980-1982        25.4         15.1            22.[illegible]   14.1
1981-1983        22.2         13.8            24.8             14.8
1982-1984        25.6         14.2            28.7             16.3
1983-1985        22.1         12.5            25.6             14.5
1984-1986        16.2          8.6            16.9              8.4
1985-1987        13.7          6.0            16.0              4.1

* Three-year running averages of observed exceedances corrected to account
for missing data.
-------
[Figure 3-4a graphic not reproduced. Plot title: Bridgeport Ozone Exceedances,
Observed and Adjusted by CART Regression Tree (Exceedance: O3 >= 12.5 pphm).
Series: Observed; Observed 3-Yr Avg; Adjusted 3-Yr Avg. Horizontal axis: Base Year.]

FIGURE 3-4a. Observed and adjusted annual ozone exceedances derived from the regression tree shown in Figure 3-3.
Observed exceedances are for year indicated (no data available in 1979). Adjusted exceedances and running
3-year averages of the observed exceedances are shown for the middle year of each base period.
Exceedance = 12 pphm.
-------
[Figure 3-4b graphic not reproduced. Plot title: Bridgeport Ozone Exceedances,
Observed and Adjusted by CART Regression Tree (Exceedance: O3 >= 10.0 pphm).
Series: Observed; Observed 3-Yr Avg; Adjusted 3-Yr Avg. Horizontal axis: Base Year.]

FIGURE 3-4b. Exceedance = 10 pphm.
-------
of factors can affect the daily maximum concentration at a single site. To the
extent that the regression tree (or any other statistical model) is unable to account
for these factors, the met-adjusted exceedance trend will vary from year to year in
response to them. One way of avoiding this problem is to calculate meteorological
adjustments for a more robust ozone summary statistic obtained from a network of
several monitoring sites. In this way, the difficult-to-model microscale factors
affecting any single site become less important, leaving only the larger-scale factors
that are (presumably) easier to model. In the following section, we test this idea by
using data collected from two ozone monitoring networks.
TRENDS ADJUSTMENT FOR MONITORING NETWORKS

As noted in the previous section, a large number of local factors potentially affect
the daily maximum ozone concentration at an individual monitoring site. It is diffi-
cult to capture a sufficient number of these factors in the regression tree model to
produce an accurate prediction. One solution to this problem is to focus on daily
maximum concentrations observed over a network of monitors rather than at just a
single location. Daily network monitoring statistics such as the average or the
maximum over the network of the daily maximum concentrations observed at each
site may be easier to relate to meteorological conditions than is the daily maximum
concentration at a single location. To test this idea, we identified two networks of
ozone monitors as described in Section 2: One network is located in southwestern
Connecticut and includes the Bridgeport site, and the other one is in the Philadelphia
area and includes the Philadelphia airport monitor. Additional details of these
monitor networks are provided in Section 2. For each day at both networks, we
calculated both the average across sites of the daily maximum concentrations
(the network average Dmax) and the maximum across sites of the daily maximum
concentrations, Max(Dmax). These daily statistics were then used as dependent
variables in separate CART regression tree analyses. These analyses were carried
out in the same manner as those described above with the exception that the test
sample procedure was used to obtain an estimate, Rts, of the true error, R*. The
test sample was constructed by randomly
selecting 30 percent of days from the complete data set. The remaining 70 percent
of days were then used as the learning sample on the basis of which the regression
tree was constructed.
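
The test sample construction can be illustrated as follows. This is a sketch only: the function name and the fixed random seed are our own assumptions, and the report does not describe how the random selection was actually implemented.

```python
import random


def split_learning_test(days, test_fraction=0.30, seed=0):
    """Randomly set aside a test sample; the remainder is the learning sample.

    days: any sequence of daily records.
    Returns (learning_sample, test_sample) as two lists.
    """
    rng = random.Random(seed)      # seeded so the split is reproducible
    shuffled = list(days)          # copy; the caller's sequence is left untouched
    rng.shuffle(shuffled)
    n_test = round(test_fraction * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]


# Example with day indices standing in for daily records.
learning, test = split_learning_test(range(1000))
```

The tree is grown on the learning sample only, and the test sample estimate of the true error is obtained by running the held-out days down the finished tree.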

In an attempt to improve the regression tree model further, we included a number of
meteorological parameters as independent variables in addition to those listed in
Table 2-5. The additional parameters are listed in Table 3-3. Most of the new
parameters were designed to provide some information about the meteorological
conditions during the day preceding the day for which concentration estimates are to
be determined. Ozone concentrations in the northeastern United States are often
observed to increase gradually over a period of two or three days, suggesting that
concentrations are carried over from one day to the next. Thus, meteorological con-
ditions conducive to ozone formation could be associated with higher-than-expected
concentrations if conditions on the preceding day were also conducive to ozone
formation. Average temperature and wind speed over the current and preceding day
-------
TABLE 3-3. Additional meteorological variables.

Variable    Description
--------    -----------
TMAX        Daily maximum temperature (deg. F) (0700-1300 LST)

TEMPAVG     Average of previous and current day's daily maximum temperature
            (TMAX2) (deg. F) (0700-1900 LST)

DAYHR       Number of daylight hours

SWAVG       Average of previous and current day's scalar average wind speed
            (SWAVE) (kt)

TDIF        Daily temperature range (max. minus min., 0700-1900 LST) (deg. F)

CAVE        Average ceiling height (ft) (0700-1300 LST)

PAVE        Average surface pressure (corrected to sea level) (mb)
            (0700-1300 LST)

IWEA        Occurrence of thunderstorm, rain, or drizzle (sum of all such
            observations, 0700-1900 LST)

DELTAP      Difference between previous and current day's average surface
            pressure (PAVE) (mb)

PDIF        Pressure range (max. minus min. surface pressure) (mb)

DELTAWD     Difference between previous and current day's WD (0-4)
-------
(TEMPAVG and SWAVG) were included to indicate the ozone-formation potential of
the two-day period. By a similar line of reasoning, we included two variables that
measure the degree of change in conditions from one day to the next—the change in
surface pressure (DELTAP) and the change in wind direction (DELTAWD). Large
changes in these variables indicate unsettled and changeable weather, which is not
conducive to the buildup of high ozone concentrations. Other meteorological
variables added to the analysis include

TMAX: The maximum temperature during the morning and early afternoon
ozone-formation period (0700-1300 LST). Although closely related to the daily
maximum temperature (TMAX2), this parameter may be a better measure of
the influence of temperature during the critical ozone-formation period.

TDIF: The difference between the minimum and maximum temperatures dur-
ing the day (0700-1900 LST). A large value of TDIF indicates a combination of
strong heating during the daytime and rapid cooling overnight. This requires
clear skies and light winds throughout the day, conditions that are favorable to
ozone formation.

DAYHR: The number of daylight hours. This is a seasonal (time of year) indi-
cator. High ozone values normally occur most frequently during the time of
year when days are longest.

PAVE: Average surface pressure. High values indicate fair weather generally
conducive to ozone formation.

IWEA: Occurrence of precipitation (total number of observations during the
ozone-formation period, 0700-1300 LST). Such occurrences generally result in
low ozone concentrations.

PDIF: Daily pressure range. Large pressure changes during the day may signal
unsettled conditions not conducive to ozone formation.

CAVE: The average ceiling height. Ceiling height is loosely related to the
amount and type of cloud cover (if any) and therefore to the amount of solar
insolation available near the surface for NO2 photolysis. High ceiling heights
(above about 20,000 feet) are usually associated with thin cirrus clouds that do
little to attenuate UV radiation. Low ceiling heights are typically associated
with more opaque clouds that retard NO2 photolysis and thus ozone forma-
tion. To simplify the relationship between CAVE and ozone, clear sky (unlimi-
ted ceiling) observations were coded as CAVE = 20,000 ft.
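
Several of the derived variables above are simple transformations of the daily values. The sketch below shows one way they might be computed. All function names are hypothetical, and two points are assumptions on our part: that WD is coded as eight direction classes numbered 1 to 8 around the compass (so that the change in direction is measured the short way around, giving values from 0 to 4), and that DELTAP is taken as the current day's value minus the previous day's.

```python
def two_day_average(today, yesterday):
    """TEMPAVG or SWAVG: average of the current and previous day's values."""
    return (today + yesterday) / 2.0


def pressure_change(pave_today, pave_yesterday):
    """DELTAP (mb). Assumption: current day minus previous day."""
    return pave_today - pave_yesterday


def wd_change(wd_today, wd_yesterday, n_classes=8):
    """DELTAWD: number of direction classes between two days, 0 to n_classes/2.

    Assumes WD classes are numbered 1..n_classes around the compass, so
    classes 1 and 8 are adjacent.
    """
    diff = abs(wd_today - wd_yesterday) % n_classes
    return min(diff, n_classes - diff)


def code_ceiling(ceiling_ft, unlimited_value=20000.0):
    """Code an unlimited-ceiling observation (given here as None) as 20,000 ft."""
    return unlimited_value if ceiling_ft is None else ceiling_ft
```

For example, `wd_change(1, 8)` returns 1 rather than 7, reflecting that the two classes are neighbors under the assumed coding.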
Regression Tree Results

Results of the regression tree development described above are summarized in Table
3-4. Relative errors were lower for the trees in which the network average Dmax
-------
was used as the dependent variable.* This variable appears to be less sensitive to
various extraneous factors not included in the regression tree model, thus making it
easier to predict. The cross-validation estimate of the true relative error for the
Bridgeport monitor Dmax regression tree described previously (see Figure 3-3) was
0.44. This is larger than the test-sample estimates for the two southwestern Con-
necticut network regression trees shown in Table 3-4. Thus, we appear to have
succeeded in our goal of developing regression trees that produce more precise pre-
dictions of daily ozone concentrations.
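
The relative error used in these comparisons is the mean square prediction error divided by the variance of the observations in the root node, so one minus the relative error is the fraction of variance accounted for by the tree. The short sketch below illustrates the arithmetic. The function names are our own, and the Bridgeport example assumes that the 2.2 pphm cross-validation error quoted earlier is a root-mean-square value, which is an interpretation rather than something stated explicitly.

```python
def relative_error(mean_square_error, root_variance):
    """Mean square prediction error divided by the root node variance."""
    return mean_square_error / root_variance


def variance_explained(rel_err):
    """Fraction of the variance in the dependent variable accounted for."""
    return 1.0 - rel_err


# Bridgeport single-monitor tree: error of 2.2 pphm (assumed to be an RMS
# value) and a Dmax variance of 11 pphm squared.
bridgeport = relative_error(2.2 ** 2, 11.0)
explained = variance_explained(bridgeport)
```

Under that reading the relative error rounds to 0.44 and the fraction of variance explained to 0.56, in agreement with the values quoted in the text for the Bridgeport tree.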

The regression trees for the network average Dmax are depicted in Figures 3-5 and
3-6. As indicated in these figures, temperature is the most important meteorological
variable in explaining day-to-day ozone variations. In both the Philadelphia and
southwestern Connecticut trees, the three temperature variables (TMAX2, TMAX
and TEMPAVG) are listed in the CART computer program output as the three most
important variables overall. A variable is considered important by CART if it is used
as the basis for splits at several nodes (especially nodes close to the root node) or if
it is one of the top-ranked competitor splitting variables at a large number of
nodes. The next most important variables identified by CART are the solar insola-
tion indicators FAC2 and DAYHR and the wind speed variables SWAVE and SWAVG.
In Connecticut, however, these variables were not as important as the wind direction
(WD). This confirms the relatively large influence of upwind emissions sources on
southwestern Connecticut ozone levels noted by Pollack and co-workers (1988).

A brief "tour" of the regression tree in Figure 3-5 illustrates the advantages of
CART analysis over simple linear regression in the ways in which the results can be
interpreted. An important feature of this tree is that it shows that all groups of days
averaging over 8 pphm of ozone have daily maximum temperatures above 80°F.
Most days with temperatures below 80°F have low ozone concentrations, with the
exception of days with temperatures in the 70s with generally clear or cirrus-cloud-
covered skies (CAVE greater than 12,200 ft) and light winds from a direction other
than north or northwest (terminal node 6). Such days have moderately high ozone
concentrations averaging 7.9 pphm. The lowest ozone days are those at the begin-
ning or end of the ozone season (DAYHR less than or equal to 12.4 hours) during
which temperatures do not exceed 70°F during the morning or early afternoon
(TMAX less than or equal to 69°F). These days are represented by terminal node 1
and average only 3.1 pphm of ozone. Of the days above 80°F, not all are high ozone
days. In particular, days between 80 and 88°F with moderate or stronger winds
(SWAVE greater than 7.15 kts) and weak insolation (FAC2 less than or equal to
1.03 pphm" min ) average only 6.2 pphm (terminal node 11). Days with the highest
ozone concentrations have maximum temperatures exceeding 88°F, strong insolation
(FAC2 greater than 1.24 pphm" min ), and large diurnal temperature ranges (TDIF
* The relative error is equal to the mean square prediction error divided by the
variance of the observations in the root node.
-------
TABLE 3-4. Results of regression tree analysis for ozone monitoring networks.

                                   Philadelphia Network           Southwestern Connecticut Network
                                -----------------------------     --------------------------------
                                Network          Network          Network          Network
                                Average Dmax     Maximum Dmax     Average Dmax     Maximum Dmax
                                (17 Met. Vars.)  (17 Met. Vars.)  (17 Met. Vars.)  (17 Met. Vars.)
                                ---------------  ---------------  ---------------  ---------------
No. of terminal nodes                 15                6                7                8
Total No. of cases                  2112             2112             1877             1877
Variance in root node (ppb²)         795             1370             1100             1960
Resubstitution relative error       0.29             0.39             0.40             0.38
Test sample relative error      0.36 ± 0.026     0.43 ± 0.029     0.37 ± 0.026     0.41 ± 0.032
-------
[Figure 3-5 graphic not reproduced. The surviving split labels involve TDIF,
FAC2, DAYHR, SWAVE, and WD.]

FIGURE 3-5. Regression tree for Philadelphia monitoring network (1979-1988). Dependent variable
is network average daily maximum ozone concentration (Dmax); top number in each node (boxes) is
number of days in learning sample with valid Dmax values, bottom number is average Dmax over all
such days (ppb). Meteorological variables used to define each split are listed in Tables 2-5 and 3-3.
-------
[Figure 3-6 graphic not reproduced. The surviving split labels involve TMAX,
TMAX2, WD, and DAYHR.]

FIGURE 3-6. Regression tree for southwestern CT ozone monitoring network (1979-1988).
Dependent variable is network average daily maximum ozone concentration (Dmax). Top
number in each node (box) is the number of days in learning sample with valid Dmax values.
Bottom number is average Dmax over all days in node (ppb). Meteorological variables used
to split data are defined in Tables 2-5 and 3-3.
-------
greater than 13.5°F), indicating calm, clear conditions (terminal node 15). Such days
average 11.6 pphm of ozone.

Table 3-5 provides an indication of the distribution of high ozone days across termi-
nal nodes of the Philadelphia network regression tree shown in Figure 3-5. Node 15
has by far the largest fraction of days above 10 pphm, followed by nodes 9, 10, 13
and 14.

Of particular interest from a trends adjustment viewpoint are the variations from
one year to the next in the distribution of days across terminal nodes in the regres-
sion trees described above. These variations indicate the particular meteorological
features that differentiate one year from the next and provide the quantitative
information that is the basis for the meteorological adjustment calculations. Tables
3-6a and 3-7a list the number of days falling in each terminal node in each year for
the Philadelphia and southwestern CT trees, respectively. Tables 3-6b and 3-7b show
the deviations in these day counts from their 10-year averages. The deviations have
been normalized by their standard deviations to make the results comparable from
one node to the next. To make the results for Philadelphia easier to interpret, nodes
1-7 (which contain generally low ozone concentrations) have been combined in Table
3-6 as have nodes 11 and 12.
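
The normalization used in Tables 3-6b and 3-7b can be reproduced directly from the day counts. The sketch below does so for node 15 of the Philadelphia tree, using the counts listed in Table 3-6a. The function name is our own, and the use of the sample (n - 1) standard deviation is inferred from the tabulated value of 9.37 rather than stated in the text.

```python
from statistics import mean, stdev


def standard_deviates(counts):
    """Deviation of each yearly count from the multi-year mean, divided by the SD."""
    m = mean(counts)
    sd = stdev(counts)  # sample standard deviation (n - 1 in the denominator)
    return [(c - m) / sd for c in counts]


# Days per year in node 15 of the Philadelphia tree, 1979-1988 (Table 3-6a).
node15 = [3, 20, 6, 4, 28, 7, 6, 12, 15, 27]
deviates = standard_deviates(node15)
```

Rounded to two decimals, the first and last entries are -1.05 (1979) and 1.52 (1988), matching the node 15 column of Table 3-6b.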

Recalling from the above discussion of the Philadelphia regression tree that node 15
contains the highest ozone concentrations, it is of interest to note that an unusually
large number of days fall into this node in the warm weather years 1983 and 1988 and
to a lesser extent in 1980. These deviations from the norm will result in a downward
adjustment of ozone concentrations during these years when meteorology is taken
into account as described in the next section. It is also interesting to note that the
year with the fewest days in node 15 is 1979, suggesting that this cooler than normal
year will require an upward adjustment of ozone concentrations. Similarly, unusually
high numbers of days fall into the high ozone nodes of the southwestern CT
regression tree (nodes 6 and 7) during 1980, 1983, and 1988, and this will result in a
downward adjustment of ozone concentrations in these warm-weather years.
Met-Adjusted Trends for Network Ozone Exceedance Statistics

Met-adjusted trends at the Philadelphia monitoring network were calculated for the
number of days per year in which the network average daily maximum ozone concen-
tration exceeded 8 and 10 pphm. Procedures used are identical to those previously
described. Results are presented in Figure 3-7 (for exceedances of 10 pphm) and 3-8
(for exceedances of 8 pphm). Exceedances of 12 pphm did not occur frequently
enough to allow a meaningful calculation of a met-adjusted trend. Although the
met-adjusted trends shown in Figures 3-7 and 3-8 seem to show slightly less varia-
bility than the unadjusted trends, the difference is not dramatic. In this sense, the
results are similar to those shown in Figure 3-4 for the Bridgeport monitor. No
significant downward trend is evident in either the met-adjusted or unadjusted
exceedances.
-------
TABLE 3-5. Distribution of Philadelphia network average daily maximum ozone
concentrations for the period 1979-1988 across terminal nodes of the
regression tree depicted in Figure 3-5.

                          Percent      Percent      Average     Standard
Terminal    Total No.     of Days      of Days      Dmax        Deviation
Node        of Days*      > 8 pphm     > 10 pphm    (ppb)       (ppb)
--------    ---------     --------     ---------    -------     ---------
   1           274           0.0          0.0         31.5         10
   2           380           0.3          0.0         45.8         11
   3           100           0.0          0.0         45.3         12
   4           157           3.8          0.0         57.0         15
   5           150           2.7          0.7         56.6         13
   6           106          28.3          6.6         78.7         16
   7           150          11.3          0.0         66.8         15
   8            83          36.1          8.4         74.1         19
   9           112          60.7         19.6         90.3         17
  10           142          72.5         36.6         99.2         19
  11            83           9.6          2.4         61.6         17
  12           185          36.8          7.6         79.3         16
  13            32          59.4         21.9         87.7         20
  14            30          63.3         30.0         96.0         22
  15           128          89.1         64.1        116.0         23

* Includes only days with valid network average daily maximum ozone values.
-------
TABLE 3-6. Distribution of days across terminal nodes of regression tree
depicted in Figure 3-5 by year.

(a) Number of Days

                                  NODE(S)
Year      1-7       8       9      10    11-12     13      14      15
----    -----    ----    ----    ----    -----   ----    ----    ----
1979      145      20       3      11      26       3       3       3
1980      117       6      14      21      29       6       1      20
1981      139       9      16      20      18       2       4       6
1982      140       4      19      17      25       3       2       4
1983      126       6      16      13      20       3       2      28
1984      148      12       8      13      26       0       0       7
1985      135      10      15      14      32       1       1       6
1986      126       8      10      13      36       6       3      12
1987      130       5       6      15      35       3       5      15
1988      128       4       8       6      26       5      10      27

Mean    133.4     8.4    11.5    14.3    27.3     3.2     3.1    12.8
SD*      9.69    4.86    5.21    4.35    5.87    1.99    2.85    9.37
COV      0.07    0.58    0.45    0.30    0.22    0.62    0.92    0.73

(b) Standard Deviates (difference from 10-year mean divided by standard deviation)

                                  NODE(S)
Year      1-7       8       9      10    11-12     13      14      15
----    -----   -----   -----   -----    -----  -----   -----   -----
1979     1.20    2.39   -1.63   -0.76    -0.22  -0.10   -0.04   -1.05
1980    -1.69   -0.49    0.48    1.54     0.29   1.41   -0.74    0.77
1981     0.58    0.12    0.86    1.31    -1.58  -0.60    0.32   -0.73
1982     0.68   -0.91    1.44    0.62    -0.39  -0.10   -0.39   -0.94
1983    -0.76   -0.49    0.86   -0.30    -1.24  -0.10   -0.39    1.62
1984     1.51    0.74   -0.67   -0.30    -0.22  -1.61   -1.09   -0.62
1985     0.17    0.33    0.67   -0.07     0.80  -1.11   -0.74   -0.73
1986    -0.76   -0.08   -0.29   -0.30     1.48   1.41   -0.04   -0.09
1987    -0.35   -0.70   -1.06    0.16     1.31  -0.10    0.67    0.23
1988    -0.56   -0.91   -0.67   -1.91    -0.22   0.91    2.42    1.52

* SD = standard deviation.
-------
TABLE 3-7. Distribution of days across terminal nodes of regression
tree depicted in Figure 3-6 by year.

(a) Number of Days

                              NODE(S)
Year        1       2       3       4       5       6       7
----     ----    ----    ----    ----    ----    ----    ----
1979       42      73      19      37       6      25      12
1980       38      61      20      29      16      27      23
1981       46      68      22      34      15      24       5
1982       42      75      20      48       8      12       9
1983       39      71      13      31      16      25      19
1984       41      73      14      31      13      27      15
1985       37      69      21      39      13      24      11
1986       39      68      20      40      15      24       8
1987       43      66      22      28      21      16      18
1988       39      67      19      27      17      19      26

Mean     40.6    69.1    19.0    34.4    14.0    22.3    14.6
SD*      2.72    4.09    3.09    6.60    4.35    4.99    6.82
COV      0.07    0.06    0.16    0.19    0.31    0.22    0.47

(b) Standard Deviates (difference from 10-year mean divided by standard deviation)

                              NODE(S)
Year        1       2       3       4       5       6       7
----    -----   -----   -----   -----   -----   -----   -----
1979     0.52    0.95    0.00    0.39   -1.84    0.54   -0.38
1980    -0.96   -1.98    0.32   -0.82    0.46    0.94    1.23
1981     1.99   -0.27    0.97   -0.06    0.23    0.34   -1.41
1982     0.52    1.44    0.32    2.06   -1.38   -2.06   -0.82
1983    -0.59    0.46   -1.94   -0.51    0.46    0.54    0.65
1984     0.15    0.95   -1.62   -0.51   -0.23    0.94    0.06
1985    -1.33   -0.02    0.65    0.70   -0.23    0.34   -0.53
1986    -0.59   -0.27    0.32    0.85    0.23    0.34   -0.97
1987     0.88   -0.76    0.97   -0.97    1.61   -1.26    0.50
1988    -0.59   -0.51    0.00   -1.12    0.69   -0.66    1.67

* SD = standard deviation.
-------
[Figure 3-7 graphic not reproduced. Series: 3-Year Annual Average; 3-Year Annual
Average Adjusted for Meteorology. Horizontal axis: Base Year (79 to 88).]

FIGURE 3-7. Annual number of days on which the average daily maximum ozone concentration in the
Philadelphia Ozone Network exceeds 0.105 ppm (based on three years of data run down the CART tree
grown on 1979-1988 data).
-------
[Figure 3-8 graphic not reproduced. Series: 3-Year Annual Average; 3-Year Annual
Average Adjusted for Meteorology. Horizontal axis: Base Year (79 to 88).]

FIGURE 3-8. Annual number of days on which the average daily maximum ozone concentration in the
Philadelphia Ozone Network exceeds 0.085 ppm (based on three years of data run down the CART tree
grown on 1979-1988 data).
-------
A principal drawback of the procedure used to develop the met-adjusted trends
shown in Figures 3-7 and 3-8 is that the use of overlapping three-year averages to
estimate the model parameters (i.e., the fraction of days in exceedance in each
terminal node) results in an averaging out of many of the year-to-year variations
that we are trying to explain in the first place. To circumvent this difficulty, we
reran the trends adjustment, this time estimating the fraction of days in exceedance
in each terminal node using data from individual years instead of three-year
periods. Results are shown in Figures 3-9 and 3-10. Since, for a single year,
relatively few days fall into each terminal node, the estimated model parameters are
quite variable from one year to the next. This results in a met-adjusted trend that
retains a fair degree of year-to-year variability. Nevertheless, Figures 3-9 and 3-10
show that the met-adjustment process does damp out some of the year-to-year varia-
tions in the unadjusted exceedance counts. In particular, individual years with
unusual meteorological conditions are clearly evident. For example, 1980, 1983, and
1988 are shown to have had weather conditions unusually favorable to ozone forma-
tion. The met-adjusted exceedance rates in these years are significantly lower (at
least in a nonstatistical sense) than the unadjusted values. 1979, on the other hand,
appears to have been a year in which weather conditions were unusually unfavorable
to ozone formation. Met-adjusted exceedances in this year were much higher than
the number of exceedances actually observed. These results are consistent with our
expectations based on the previously described analysis of the number of days falling
in the high ozone node (node 15) in each year (Table 3-6). By the same token, 1986
(and to a lesser extent 1981, 1982, and 1985) appears to have had less extreme
weather conditions, resulting in average ozone formation (little difference between
the met-adjusted and unadjusted exceedances). As pointed out in Section 2, results
for 1979 and 1980 must be viewed with some caution due to the large amount of
missing ozone data during this period.

On the basis of the above results, it appears that, although model parameters (in this
case probabilities of exceedance by node) estimated for individual years are more
variable than those estimated for overlapping three-year periods, the met-adjust-
ment procedure is capable of providing at least a qualitative indication of the effects
of meteorological conditions on ozone formation in individual years. Confirmation of
the statistical significance of the difference between observed and met-adjusted
exceedances must await the development of methods for estimating uncertainties in
the met-adjusted values.
Met-Adjusted Trends for Network Average Daily Maximum Concentrations

As the results described above show, individual years differ greatly in the frequency
of extreme ozone events (e.g., exceedances of 10 pphm) even when daily maximum
concentrations are averaged over a network of monitors. Our method for calculating
meteorologically adjusted trends is capable of accounting for some but not all of this
variability. Other annual summary statistics, such as the annual average of the daily
network average concentrations, are less variable. To test the ability of our trends
-------
[Figure 3-9 graphic not reproduced. Series: Annual Average; Annual Average
Adjusted for Meteorology. Horizontal axis: Base Year.]

FIGURE 3-9. Annual number of days on which the average daily maximum ozone concentration in the
Philadelphia Ozone Network exceeds 0.105 ppm (based on yearly data run down the CART tree grown
on 1979-1988 data).
-------
[Figure 3-10 graphic not reproduced. Series: Annual Average; Annual Average
Adjusted for Meteorology. Horizontal axis: Base Year (79 to 88).]

FIGURE 3-10. Annual number of days on which the average daily maximum ozone concentration in the
Philadelphia Ozone Network exceeds 0.085 ppm (based on yearly data run down the CART tree grown
on 1979-1988 data).
-------
adjustment procedure to deal with this type of statistic, we calculated meteorologi-
cally adjusted values of the annual average concentration at the Philadelphia net-
work, again using the 10-year regression tree depicted in Figure 3-5. The met-
adjustment procedure for annual average concentrations is nearly identical to that
previously described for exceedance statistics. The only significant difference is
that the model parameters estimated for each year are the average ozone concentra-
tion in each terminal node instead of the fraction of days in each node that are in
exceedance. Results based on running three-year averages are shown in Figure
3-11. As expected, the variations in the unadjusted annual averages from one three-
year period to the next are very small and barely distinguishable from the met-
adjusted averages. As with the exceedance statistics described above, we circum-
vented this difficulty by calculating met-adjusted values for individual years.
Results (presented in Figure 3-12) are similar to those displayed in Figure 3-9 (for
exceedances of 10 pphm) with the exception that year-to-year variations in the
annual averages are less pronounced. Comparison of the met-adjusted and unadjus-
ted values clearly shows that meteorological conditions in 1980, 1983, and 1988 were
unusually favorable to ozone formation, while 1979 appears to have been unusually
unfavorable. Conditions in 1982 and 1986 appear to have resulted in near-average ozone
formation.
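
For annual average concentrations, the adjustment replaces the per-node probability of exceedance with the per-node mean concentration. A minimal sketch follows. The function name and inputs are hypothetical, the example values are made up, and dividing the weighted sum by the total number of days to obtain an average is our reading of the procedure rather than a formula given in the text.

```python
def met_adjusted_mean(avg_days, node_means):
    """Meteorologically adjusted annual mean concentration.

    avg_days: node -> long-term average number of days per year in the node.
    node_means: node -> mean concentration in the node for the year (or base
        period) being adjusted; must contain every node in avg_days.
    """
    total_days = sum(avg_days.values())
    weighted_sum = sum(avg_days[node] * node_means[node] for node in avg_days)
    return weighted_sum / total_days


# Made-up example: 150 typical days in a low ozone node, 50 in a high ozone node.
adjusted_mean = met_adjusted_mean({1: 150.0, 15: 50.0}, {1: 40.0, 15: 120.0})
```

When parameters are estimated for individual years, a node may contain no days in a given year, and so has no mean for that year. The sketch raises a `KeyError` in that case; a rule for such nodes would have to be chosen separately.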

In addition to the met-adjusted trend calculations described above for the
Philadelphia monitoring network, we also calculated met-adjusted trends in the
network average Dmax for the southwestern Connecticut network. Met-adjusted values
were obtained using model
parameters estimated for individual years. Results are shown in Figure 3-13. No
values are shown for 1979 since ozone data were unavailable for that year. These
results are similar to those for the Philadelphia network (Figure 3-12) in that concen-
trations for 1980, 1983, and 1988 are adjusted downward to account for the greater
prevalence of ozone-conducive meteorological conditions during these years. Almost
no adjustment is calculated for 1986 and, in contrast to the Philadelphia results, for
1987.

We applied a Kendall's tau test to the met-adjusted and unadjusted trends in Figure
3-13. This test is designed to determine whether or not there is any statistically
significant trend over the 9-year period. It is a test for a monotonic but not neces-
sarily linear trend. The test results reveal that no significant trend exists in either
the met-adjusted or unadjusted values at the 95 percent confidence level.
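
A trend test of this kind can be sketched as follows. The values plotted in Figure 3-13 are not tabulated in this section, so the example is run on the Philadelphia "Mean over All Days" column of Table 3-8 purely for illustration. The function name `mann_kendall` is our own, the sketch assumes no tied values, and it uses the large-sample normal approximation, which is only rough for a series of 9 or 10 years (exact tables would normally be preferred).

```python
import math
from itertools import combinations


def mann_kendall(values):
    """Kendall's tau trend test against time order.

    Returns (S, z, two-sided p). Assumes no tied values and uses the
    normal approximation with continuity correction.
    """
    n = len(values)
    # S counts later-minus-earlier sign over all pairs in time order.
    s = sum((later > earlier) - (later < earlier)
            for earlier, later in combinations(values, 2))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided p-value
    return s, z, p


# Philadelphia annual averages over all days, 1979-1988 (Table 3-8).
philadelphia_all_days = [62.21, 73.82, 59.53, 61.85, 68.08,
                         60.15, 66.32, 60.61, 64.63, 65.57]
s_stat, z_stat, p_value = mann_kendall(philadelphia_all_days)
print(f"S = {s_stat}, z = {z_stat:.2f}, p = {p_value:.2f}")
```

A p-value above 0.05 means that no monotonic trend is detected at the 95 percent confidence level.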

Within-Node Trends

As noted previously, our trends adjustment procedure relies on the assumption that
model parameters (in this case the average across days in each node of the Phila-
delphia network average Dmax) vary slowly from one year to the next in response
only to gradual changes in precursor emissions. Meteorological effects are assumed
to be accounted for entirely by year-to-year changes in the number of days falling
into each node. To examine the extent to which this assumption is valid for the
-------
[Figure 3-11: plot of concentration versus base year. Series: 3-Year Annual Average;
3-Year Annual Average Adjusted for Meteorology.]
FIGURE 3-11. Annual mean daily maximum ozone concentrations for the Philadelphia Ozone Network
(based on three years of data run down the CART tree grown on 1979-1988 data).
-------
[Figure 3-12: plot versus base year, 1979-1988. Series: Annual Average;
Annual Average Adjusted for Meteorology.]
FIGURE 3-12. Annual mean daily maximum ozone concentrations for the Philadelphia Ozone Network
(based on individual yearly data run down the CART tree grown on 1979-1988 data).
-------
[Figure 3-13: plot versus base year, 1979-1988. Series: Annual Average;
Annual Average Adjusted for Meteorology.]
FIGURE 3-13. Annual mean daily maximum ozone concentrations for the southwestern Connecticut
Ozone Network (based on individual yearly data run down the CART tree grown on 1979-1988 data).
-------
regression tree model, we calculated the average Dmax across days in each node by
year for the Philadelphia network regression tree depicted in Figure 3-5. Table 3-8
lists these values for the combined nodes 1-7, which contain most of the low ozone
days, and for node 15 (the high ozone node). Also shown for comparison are the
annual average Dmax values actually observed in each year. Of particular interest
are the coefficients of variation (COVs) shown at the bottom of each column.* The
COV of the average Dmax for days with meteorological conditions characteristic of
node 15 is larger than the COV of the observed annual average Dmax. Thus, no
decrease in variability is gained by averaging only over days with the specific
meteorological conditions known to produce high ozone concentrations. This lack of
variance reduction is no doubt at least partly due to the small number of days that
fall into node 15 in any given year (especially 1979, 1981-1982, and 1984-1985).
Averages obtained over such small samples tend to be quite variable.
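
The COV comparison can be reproduced directly from the node 15 and all-days columns of Table 3-8. The short sketch below assumes the sample (n - 1) standard deviation, which matches the SD row of the table; the variable names are our own.

```python
from statistics import mean, stdev

# Yearly averages for 1979-1988, transcribed from Table 3-8.
node_15 = [134.50, 131.64, 108.78, 99.48, 104.61,
           115.17, 93.70, 101.61, 113.20, 125.36]
all_days = [62.21, 73.82, 59.53, 61.85, 68.08,
            60.15, 66.32, 60.61, 64.63, 65.57]


def cov(values):
    """Coefficient of variation: sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


cov_node_15 = cov(node_15)
cov_all_days = cov(all_days)
print(f"COV node 15:  {cov_node_15:.2f}")
print(f"COV all days: {cov_all_days:.2f}")
```

The node 15 value is nearly twice the all-days value, which is the lack of variance reduction noted above.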

Another noteworthy feature of Table 3-8 is that the average Dmax in node 15 is
highest in 1979 and 1980. Although this might indicate that emissions were highest
in these years, such a conclusion must be viewed with caution due to the large
amount of missing data in the Philadelphia network at that time (see Table 2-2).

Also shown in Table 3-8 are annual averages of Dmax for days falling in just the high
ozone nodes (nodes 9-10 and 13-15) and all remaining nodes (nodes 1-8 and 11-12).
This aggregation of nodes produces a larger number of days over which to take
averages in each year. Year-to-year variability in average concentrations in the high
ozone nodes (as measured by the COV) is on par with the variability in the observed
annual average Dmax. Variability in the low ozone nodes is also on par with that of
the averages over all days. Again, the averages for 1979 and 1980 have to be viewed
with caution due to the large amount of missing data during these years. It is
interesting to note that averages over the high ozone nodes are above the 10-year
average in the warm weather years 1983 and 1988. This may be the result of
meteorological factors that are not accounted for by the regression tree model or it
may be due to an actual increase in precursor emissions during these years.

An Alternative Trend Adjustment Technique

To put the results described above into perspective, we applied an alternative (and
much simpler) trend adjustment procedure developed by Jones and Dash (1985) to the
Philadelphia monitoring network data. This procedure uses the number of days in
each year for which the maximum temperature exceeds 90°F as a measure of the
tendency of the meteorological conditions in the year to produce high ozone concen-
trations. For each year, the ratio of the number of days on which the network
average daily maximum ozone concentration exceeds 10 pphm to the number of days
* The COV is simply the standard deviation divided by the mean.

-------
TABLE 3-8. Average Dmax (Philadelphia network) by year for selected groups of days.
(Node numbers refer to the regression tree depicted in Figure 3-5.)

            Mean over Days          Mean over Days in Nodes
            in Nodes (pphm)                 (pphm)               Mean over
Year        1-7         15          1-8, 11-12    9-10, 13-15    All Days
1979       52.40      134.50          53.41         102.75         62.21
1980       54.26      131.64          48.55         108.36         73.82
1981       46.62      108.78          40.63          94.43         59.53
1982       50.60       99.48          44.95          89.35         61.85
1983       50.13      104.61          45.91         100.94         68.08
1984       49.31      115.17          49.25          95.57         60.15
1985       54.70       93.70          50.98          99.76         66.32
1986       47.48      101.61          47.01          88.60         60.61
1987       47.12      113.20          49.41          99.87         64.63
1988       45.45      125.36          46.59         107.48         65.57
Mean       49.81      112.81          47.67          98.71         64.28
SD*         3.22       13.90           3.52           6.77          4.41
COV†        0.06        0.12           0.07           0.07          0.07

* SD = standard deviation.
† COV = SD/Mean.
-------
on which temperatures exceed 90°F is calculated. This ratio is assumed to be pri-
marily a function of precursor emission levels. When multiplied by the long-term
average number of days per year exceeding 90°F (we calculated this average for the
10-year period 1979-1988), a meteorologically adjusted exceedance trend is
obtained. Results are listed in Table 3-9 and displayed in Figure 3-14. Some key
aspects of these results are similar to those obtained by using the CART-based
trends adjustment procedure (see Figure 3-9). The years 1980, 1983, and 1988 are
shown to have been abnormally conducive to ozone formation; 1979 was unusually
unfavorable to ozone formation, and 1986 was an average year. In contrast to the
CART-based results, however, the met-adjusted exceedance trend in Figure 3-14 is
even more variable than the unadjusted trend. This variability results from large
year-to-year changes in the ratio of high ozone days to high temperature days used
to calculate the met-adjusted trend. In some years, the number of days above 90°F
is quite small. With such small numbers in the denominator of the ratio, a difference
of just a day or two can cause large swings in the met-adjusted exceedance rate. In
addition, it is quite likely that the relationship between high temperature days and
high ozone days is more complicated than the simple proportionality assumed here.
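
The calculation just described can be sketched with the day counts from Table 3-9. The variable names are our own, and the adjusted values computed here are not tabulated in this report, so the sketch should be read as an illustration of the procedure rather than a reproduction of Figure 3-14.

```python
from statistics import mean, stdev

years = list(range(1979, 1989))
# Days with maximum temperature above 90°F (Table 3-9).
hot_days = [9, 27, 11, 6, 34, 7, 6, 20, 24, 42]
# Days with network average daily maximum ozone above the threshold (Table 3-9).
ozone_days = [20.0, 35.4, 16.0, 10.0, 28.0, 11.0, 20.0, 9.0, 24.0, 32.0]

# Long-term (1979-1988) average number of hot days per year.
long_term_hot_days = mean(hot_days)

# Ratio of ozone days to hot days, scaled by the long-term hot-day average.
adjusted = {
    year: ozone / hot * long_term_hot_days
    for year, hot, ozone in zip(years, hot_days, ozone_days)
}

for year in years:
    print(f"{year}: unadjusted {ozone_days[year - 1979]:5.1f}, "
          f"adjusted {adjusted[year]:5.1f}")

spread_unadjusted = stdev(ozone_days)
spread_adjusted = stdev(list(adjusted.values()))
```

Years with only six or seven hot days (1982, 1984, 1985) have a small denominator, so a change of one or two days in either count moves the adjusted value substantially; this is why the adjusted series has the larger spread.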
-------
TABLE 3-9. Number of days on which daily maximum temperature in Philadelphia
(TMAX2) exceeds 90°F and number of days on which the Philadelphia network average
daily maximum ozone concentration (Dmax) exceeds 0.105 ppm.

          No. Days          No. Days
Year      TMAX2 > 90°F      Dmax > 0.105 ppm*
1979           9                 20.0
1980          27                 35.4
1981          11                 16.0
1982           6                 10.0
1983          34                 28.0
1984           7                 11.0
1985           6                 20.0
1986          20                  9.0
1987          24                 24.0
1988          42                 32.0

* These values have been corrected to account for missing data by assuming
that exceedances on missing days are distributed identically to exceedances
on non-missing days (affects 1979 and 1980 values only).
-------
[Figure 3-14: plot versus base year, 1979-1988. Series: Annual Average;
Annual Average Based On Days With T >= 90 deg. F.]
FIGURE 3-14. Annual number of days on which the average daily maximum ozone concentration in the
Philadelphia Ozone Network exceeds 0.105 ppm (based on the number of days with temperatures above
90°F in each year).
-------
SUMMARY AND CONCLUSIONS
We have described several test applications of a trends adjustment procedure that
makes use of regression tree models developed by the CART methodology of Breiman
et al. (1984). The methodology was applied to two ozone monitoring networks—one
in southwestern Connecticut and one in the Philadelphia area. Regression trees were
developed for Dmax, the network average daily maximum ozone concentration, and
for Max(Dmax), the network maximum of the daily maximum concentrations.
Slightly lower mean square prediction errors were obtained for Dmax than for
Max(Dmax). Both network summary statistics produced better results than were
obtained when the daily maximum concentration at a single monitoring site was used
as the dependent variable.

Our calculations of meteorologically adjusted annual ozone summary statistics,
which use model parameters estimated for individual years, demonstrate that the
CART methodology is well suited to such applications. The met-adjusted trends
clearly show the influence of meteorological conditions on ozone concentrations in
certain key years, and these influences were found to be generally the same in
corresponding years for both the Philadelphia and Connecticut monitoring networks.
In particular, 1980, 1983, and 1988 were shown to have had an unusually large number
of days on which weather conditions were of the type determined by the regression
trees for both networks to be associated with high ozone concentrations. This
increased frequency of ozone-conducive days resulted in unusually high observed
ozone concentrations for which downward adjustments were calculated.

Time series plots of met-adjusted and unadjusted values of the annual average Dmax
for the Philadelphia network show no apparent upward or downward trend. Similar
plots for the number of days per year on which Dmax exceeds 10 and 8 pphm also fail
to show any trend. A plot of the annual average Dmax for the southwestern
Connecticut network seems to show a slight decrease in both the met-adjusted and
unadjusted values between 1980 and 1988. However, this decrease was found not to
be statistically significant (using a Kendall's tau test at the 95 percent level).

The time series plots described above showed some residual correlation of met-
adjusted annual ozone summary statistics with the unadjusted statistics, suggesting
that not all of the meteorological influences had been successfully removed by the
trends adjustment procedure. To examine this issue further, we calculated the year-
to-year trend in Dmax averaged over the days in each year meeting a specific set of
meteorological conditions identified by the regression tree analysis as being
-------
associated with high ozone concentrations (i.e., within-node trends). These trends
exhibited relatively large fluctuations unlikely to be associated with actual changes
in precursor emissions. These fluctuations are no doubt partially due to the fact that
only a few days in each year meet the prescribed meteorological criteria, and the
resulting averages are therefore heavily influenced by small random fluctuations in
the concentrations observed on these days. Somewhat more robust averages were
obtained by broadening the meteorological criteria to include more days in each
year. Unfortunately, this results in averages that are more heavily affected by year-
to-year fluctuations within the broader set of meteorological conditions allowed by
the relaxed criteria.

To put our met-adjusted trends results in perspective, we compared them to results
obtained using a much simpler trends adjustment procedure suggested by Jones and
Dash (1985) in which the ratio of the number of days during an ozone season on which
temperatures exceeded 90°F to the long-term average number of such days is used as
a meteorological index of ozone-formation potential. Our results suggest that, at
least for the particular calculation we made, this meteorological index produced
met-adjusted annual summary statistics with large year-to-year variations that are
unlikely to be the result of changes in precursor emissions. On the basis of our
results, we conclude that regression trees developed by the CART methodology
appear to be capable of describing the relationship between daily meteorological
conditions and ozone concentrations to a sufficient degree of accuracy and precision
to produce useful estimates of meteorologically adjusted ozone trends. Our
comparison of met-adjusted and unadjusted annual ozone summary statistics reveals
that weather conditions have led to unusually high ozone values during at least three
of the past ten years (1980, 1983, and 1988).

Confirmation of the statistical significance of differences in individual years
between met-adjusted and unadjusted annual ozone summary statistics must await
the development of methods for quantifying the uncertainty in the met-adjusted
values. Calculation of met-adjusted trends that are less influenced by
meteorological variability and more accurately reflect changes in precursor
emissions will require the development of more precise and accurate regression tree
models.
-------
References
Breiman, L., J. H. Friedman, R. A. Olshen, and C. J. Stone. 1984. Classification and
Regression Trees. Wadsworth, Belmont, California.

Chock, D. P., S. Kumar, and R. W. Herrmann. 1982. An analysis of trends in oxidant
air quality in the South Coast Air Basin of California. Atmos. Environ.,
16(10):2615-2624.

Davidson, A., M. Hoggan, and P. Wong. 1985. "Air Quality Trends in the South Coast
Air Basin 1975-1984." South Coast Air Quality Management District, El Monte,
California.

Demerjian, K. L., K. L. Schere, and J. T. Peterson. 1980. Theoretical estimates of
actinic (spherically integrated) flux and photolytic rate constants of atmo-
spheric species in the lower troposphere. Adv. Environ. Sci. Technol., 10:369-
460.

EPA. 1987. National Air Quality and Emissions Trends Report, 1985. U.S. Environ-
mental Protection Agency (EPA-450/4-87-001).

EPA. 1979. Guideline for the Interpretation of Ozone Air Quality Standards. U.S.
Environmental Protection Agency, Research Triangle Park, North Carolina
(EPA-450/4-79-003).

Jones, K. H., and A. Dash. 1985. "Correcting O3 Data for Annual Variations in
Temperature." Roy F. Weston, Inc., West Chester, Pennsylvania.

Kumar, S., and D. P. Chock. 1984. An update on oxidant trends in the South Coast
Air Basin of California. Atmos. Environ., 18(10):2131-2134.

Langstaff, J. E., and A. K. Pollack. 1985. "Meteorological Characterization of High
Ozone Levels: A Pilot Study of St. Louis, Missouri." Systems Applications,
Inc., San Rafael, California.

Maul, P. R. 1980. "Atmospheric Transport of Sulfur Compound Pollutants." Central
Electricity Generating Bureau MID/SSD/80/0026/R. Nottingham, England.
-------
Pollack, A. K., T. E. Stoeckenius, J. L. Haney, T. S. Stocking, J. L. Fieber, and
M. Moezzi. 1988. "Analysis of Historical Ozone Concentrations in the North-
east." Systems Applications, Inc., San Rafael, California (SYSAPP-88/192).

Scire, J. S. 1983. "User's Guide to the MESOPUFF-II Model and Related Processor
Programs." Environmental Research
-------