EPA-600/4-76-029a
July 1976 Environmental Monitoring Series
EMPIRICAL TECHNIQUES FOR ANALYZING
AIR QUALITY AND METEOROLOGICAL DATA
Part I. The Role of Empirical Methods in
Air Quality and Meteorological Analyses
Environmental Sciences Research Laboratory
Office of Research and Development
U.S. Environmental Protection Agency
Research Triangle Park, North Carolina 27711
-------
RESEARCH REPORTING SERIES
Research reports of the Office of Research and Development, U.S. Environmental
Protection Agency, have been grouped into five series. These five broad
categories were established to facilitate further development and application of
environmental technology. Elimination of traditional grouping was consciously
planned to foster technology transfer and a maximum interface in related fields.
The five series are:
1. Environmental Health Effects Research
2. Environmental Protection Technology
3. Ecological Research
4. Environmental Monitoring
5. Socioeconomic Environmental Studies
This report has been assigned to the ENVIRONMENTAL MONITORING series.
This series describes research conducted to develop new or improved methods
and instrumentation for the identification and quantification of environmental
pollutants at the lowest conceivably significant concentrations. It also includes
studies to determine the ambient concentrations of pollutants in the environment
and/or the variance of pollutants as a function of time or meteorological factors.
This document is available to the public through the National Technical
Information Service, Springfield, Virginia 22161.
-------
EPA-600/4-76-029a
July 1976
EMPIRICAL TECHNIQUES FOR ANALYZING AIR QUALITY AND METEOROLOGICAL DATA
Part I. The Role of Empirical Methods in
Air Quality and Meteorological Analyses
by
W. S. Meisel
Technology Service Corporation
2811 Wilshire Boulevard
Santa Monica, California 90403
Contract No. 68-02-1704
Project Officer
Kenneth L. Calder
Meteorology and Assessment Division
Environmental Sciences Research Laboratory
Research Triangle Park, North Carolina 27711
U.S. ENVIRONMENTAL PROTECTION AGENCY
OFFICE OF RESEARCH AND DEVELOPMENT
ENVIRONMENTAL SCIENCES RESEARCH LABORATORY
RESEARCH TRIANGLE PARK, NORTH CAROLINA 27711
-------
DISCLAIMER
This report has been reviewed by the Environmental Sciences
Research Laboratory, U.S. Environmental Protection Agency, and
approved for publication. Approval does not signify that the
contents necessarily reflect the views and policies of the
U.S. Environmental Protection Agency, nor does mention of trade
names or commercial products constitute endorsement or
recommendation for use.
-------
ABSTRACT
Empirical methods have found limited application in air quality and
meteorological analyses, largely because of a lack of good data and the
large number of variables in most applications. More and better data,
along with advances in methodology, have broadened the applicability of
empirical approaches. This report illustrates the types of problems for
which creative empirical approaches have the potential for significant
contributions. The results of two pilot projects are reported in some
detail.
-------
PREFACE
This is the first of three reports of work performed under EPA
Contract No. 68-02-1704, examining the potential role of state-of-the-
art empirical techniques in analyzing air quality and meteorological
data. The reports are entitled as follows:
I. The Role of Empirical Methods in Air Quality and Meteorological
Analyses
II. Feasibility Study of a Source-Oriented Empirical Air Quality
Simulation Model
III. Short-Term Changes in Ground-Level Ozone Concentrations:
An Empirical Analysis
The present report is a revision of an interim report prepared in
December 1974.
-------
CONTENTS
ABSTRACT
PREFACE
ACKNOWLEDGMENTS
1. INTRODUCTION
1.1 EXAMPLE APPLICATIONS
1.2 METHODOLOGY
2. A SOURCE-ORIENTED EMPIRICAL AIR QUALITY MODEL
2.1 MOTIVATION
2.2 MATHEMATICAL FORMULATION
2.3 TEST DATA
2.4 OPTIMIZING PARAMETERS FOR THE GAUSSIAN FORM OF THE SOURCE-RECEPTOR FUNCTION
2.5 MORE GENERAL SOURCE-RECEPTOR FUNCTIONS
3. EMPIRICAL MODELING OF THE OXIDANT FORMATION PROCESS
3.1 MOTIVATION
3.2 FORM OF MODEL
3.3 THE DATA
3.4 THE ANALYSIS
3.4.1 Variable Selection
3.4.2 Specific Functional Relationship
3.5 INTERPRETATION OF MODEL IMPLICATIONS
3.6 CONCLUSIONS
4. EXTRACTION OF EMISSION TRENDS FROM AIR QUALITY TRENDS
4.1 MOTIVATION
4.2 REPORT OF A COMPARISON OF EMISSION LEVELS OVER TWO TIME PERIODS
4.3 GENERALIZATION AND MATHEMATICAL FORMULATION
5. DETECTION OF INCONSISTENCIES IN AIR QUALITY/METEOROLOGICAL DATA BASES
5.1 MOTIVATION
5.2 FORMULATION OF CONSISTENCY MODELS
5.3 TYPES OF INCONSISTENCIES
5.4 DIFFICULTIES
6. REPRO-MODELING: EMPIRICAL APPROACHES TO THE UNDERSTANDING AND EFFICIENT USE OF COMPLEX AIR QUALITY MODELS
7. OTHER APPLICATION AREAS
7.1 HEALTH EFFECTS OF AIR POLLUTION
7.2 SHORT-TERM FORECASTING OF POLLUTANT LEVELS
REFERENCES
-------
ACKNOWLEDGMENTS
Discussions with EPA personnel led to inclusion of many of the
subjects dealt with in this report. The project monitor, Ken Calder,
took a particularly active and constructive role in formulating and
aiding in the reporting of the studies outlined in Sections 2 and 3.
Advice from Leo Breiman and others at Technology Service Corporation
further improved the report.
-------
1. INTRODUCTION
The increased availability of appropriate data bases and improvements
in methodology have led to increasing use of empirical and statistical
approaches for the analysis of air quality and meteorological data [15].
The full application of empirical approaches has, however, been hampered
by misconceptions about the nature and intent of such techniques.
One common misconception is that empirical techniques are "black box"
techniques, that is, that data is simply processed with little need for
adapting the methods to the specific application. The employment of
empirical techniques in such a manner simply illustrates that any tool
can be abused. As in any other sort of technical work, a great deal of
thought and creativity may be required to obtain maximum benefit from an
empirical analysis.
A second misconception regarding empirical approaches is a strong
distinction between empirical modeling and modeling based upon fundamental
physical principles. A good empirical model may involve incorporating
a great deal of physical insight. Some empirical analyses do indeed
suffer from a lack of physical insight; similarly, some analyses based
upon physical insight ignore the presence of measurement data contradicting
the major assumptions of those models. In reality, the two approaches are
strongly complementary. Physical insight can provide a guide to the appro-
priate forms of empirical models and to interpretations of the implications
of those models; empirical analyses can provide hints as to the key physical
mechanisms involved in a process for which measurements are available.
-------
1.1 EXAMPLE APPLICATIONS
It is the intent of this report to illustrate how empirical models
might be employed in a number of applications of interest to the Environmental
Protection Agency, particularly those with a meteorological aspect. Two of
the applications discussed are examined in some detail through limited
feasibility studies.
The first feasibility study is on the derivation of a source-oriented
empirical air quality model and is discussed in detail in a companion report [7].
The study is outlined in Section 2. It is often assumed that it is impractical
to derive an empirical model which relates the emission source
distribution and meteorology to the resulting pollutant concentration
distribution. We discuss an approach by which an empirical meteorological
dispersion function can be derived by indirect means.
The second project was a study of the feasibility of empirically
deriving change equations relating to the formation of ozone. While only
the difference equation for changes in one-hour average oxidant, ignoring
emissions, is derived and interpreted, the technique can be extended to
include emissions and a full set of difference equations for the major
species involved in oxidant formation.
A number of potential areas for the application of empirical methods
are outlined in more abstract and less complete terms. Some of the sub-
jects touched upon in more or less detail include the following:
Extraction of emission trends from air quality trends: The
estimation of air quality trends from air quality measurements is
-------
complicated by the effect of meteorology. We discuss the determination
of a "meteorologically adjusted" trend, i.e., a trend more nearly related
to the emissions trend.
Detection of inconsistencies in air quality/meteorological data bases:
In any data collection or data analysis effort, a major concern is the
integrity of the data. It is important to detect problems with monitoring
equipment or monitoring methods and to note any important changes in the
system monitored so that such measurement errors or system changes do not
distort the analysis of the data or invalidate a portion of the data col-
lected. We discuss automatic procedures for detecting inconsistencies.
Empirical approaches to the understanding and efficient use of complex
air quality models: Computer-based models derived from physical principles
are tools which often should be analyzed themselves for the sake of extrac-
ting their implications, for modeling aspects of their behavior to reduce
input data requirements and running time, for validation, to compare them
to other models, or to suggest further areas for model improvement.
Model-generated input/output data can be so analyzed by empirical techniques.
Health effects of air pollution: We comment on this area in which
empirical approaches are at present heavily employed.
Short-term pollutant level forecasting: Short-term forecasting for
health warning systems or to invoke temporary controls can be approached
empirically. Several pitfalls are highlighted.
1.2 METHODOLOGY
It would be impossible and inappropriate to undertake a comprehensive
discussion of the broad spectrum of data-analytic techniques in this report.
-------
In practice the methodology utilized is often directed by the problem
itself and by the characteristics uncovered in the data as the analysis
proceeds. In fact, it is typically detrimental to the quality of the
results of the data-analytic study if a preconception as to the best
methodology to be utilized is formed and stubbornly adhered to. A
common difficulty in empirical analyses is the tendency to fit the problem
to the method rather than the method to the problem.
It is intended that the examples in the body of the report yield a
feeling for the variety of approaches in typical applications.
-------
2. A SOURCE-ORIENTED EMPIRICAL AIR QUALITY MODEL
2.1 MOTIVATION
In commenting on the lack of acceptance of empirical/statistical models
in air quality modeling in 1973, Kenneth L. Calder called attention to "the
historical belief that air quality models based on statistical regression
type of analysis are not source-oriented and, therefore, are largely useless
for control strategy in terms of the contribution of individual sources to
the degradation of air quality" [6]. He went on to ask "whether, with an
appropriate analysis, a source-oriented statistical-type of air quality model
could be developed which did not involve prior specification of meteorological
dispersion functions per se and incorporation of these as in present air
quality models. My thought here is that for given 'meteorological conditions'
these dispersion functions play the role of transfer functions between the
air quality distribution and the distribution of pollutant emissions, and
if one were smart enough might, therefore, conceivably be obtained empiri-
cally by a mathematical inversion technique (as, for example, by numerical
solution of sets of integral equations) utilizing accumulated data on the dis-
tributions of air quality and emissions. If this could be accomplished then
maybe a major shortcoming of the current statistical models could be removed
and we should then in effect have an alternative to the customary meteoro-
logical-dispersion type of modeling." These comments suggest the motivation
for the study outlined here and reported more fully elsewhere [7].
The difficulties in developing a source-oriented empirical model can
be stated from a statistical point of view. The spatial distribution
-------
of pollutant concentrations over a region is determined by emissions and
meteorological conditions. The number of variables determining the con-
centration at a given point is very large, particularly since emissions
arise from a large number of point sources and area sources. Consequently
the number of emission variables alone can easily be in the hundreds. If
an empirical model were to be developed in the most obvious manner, there
should be an attempt to relate the pollutant concentration at a given point
to all the possible emission variables and meteorological variables affecting
the concentration at that point. Since the determination of the relation-
ship between emission/meteorological variables and concentration requires
examples of that relationship over a very wide range of emission and
meteorological variables, a tremendous amount of data would be required to
adequately determine this relationship.
If we could, however, isolate a given emissions source and we had a
number of receptor locations scattered about the source, the variation in
wind speed and direction would cause a wide variation in measured concen-
tration at the receptor locations. With enough examples of the source-
receptor relationship, the dispersion function could be determined
empirically.
In the urban environment, of course, individual sources cannot be
isolated. Measurements are the result of contributions from a number of
sources. However, because of the wide diversity of meteorological con-
ditions, the concentration will vary widely at a given point, and the
sources which contribute to the concentration at that point will similarly
-------
vary. One may then ask for a consistent source-receptor relationship which,
when summed (or integrated) over all the sources, would explain best on the
average the observed concentrations. More specifically, one could choose
the source-receptor function which minimized the average squared error in
prediction of the measured values. This concept is the core of the ideas
tested. When the parameters optimized are those of a Gaussian-form source-
receptor function, this methodology can be regarded as a means of calibrating
some commonly used models to particular urban environments.
The data used to test these ideas was model-created data. Model data
was chosen for three major reasons:
1. With model data, the source-receptor function is known and can be
compared with the function extracted from the data. With measurement data,
"truth" is unknown.
2. Area sources and point sources can be isolated and studied separa-
tely as well as jointly.
3. The cost of verifying and organizing measurement data would have
been beyond the scope of the present study.
2.2 MATHEMATICAL FORMULATION
We worked with a rectangular coordinate system with the x-axis along the
mean horizontal wind direction, the y-axis crosswind, and the z-axis
vertical. In urban air quality models it is customary to consider the
pollutant emissions in terms of a limited number (say J) of elevated point
sources together with horizontal area sources, the latter possibly being
located at a few distinct heights $\zeta_s$ (say, for example, for s = 1, 2, 3). The
total concentration $\chi(x,y,0)$ at ground level at the receptor location (x,y,0)
will be the sum of the concentration contribution from the point-source
distribution, say $\chi_P(x,y,0)$, and that from the area-source distribution, say
$\chi_A(x,y,0)$, i.e.,

$$\chi(x,y,0) = \chi_P(x,y,0) + \chi_A(x,y,0) \tag{2-1}$$

where

$$\chi_P(x,y,0) = \sum_{\ell=1}^{J} Q_P(\ell)\, K(x-\xi_\ell,\; y-\eta_\ell;\; 0, \zeta_\ell) \tag{2-2}$$

$$\chi_A(x,y,0) = \sum_{s=1}^{3} \iint Q_A(\xi,\eta,\zeta_s)\, K(x-\xi,\; y-\eta;\; 0, \zeta_s)\, d\xi\, d\eta \tag{2-3}$$

and

$Q_P(\ell)$ = emission rate of the $\ell$-th point source, located at $(\xi_\ell, \eta_\ell, \zeta_\ell)$;
$Q_A(\xi,\eta,\zeta_s)$ = emission rate per unit area of the area source at $(\xi,\eta)$ and height $\zeta_s$;
$K(x-\xi,\, y-\eta;\, 0, \zeta)$ = source-receptor function; it gives the ground-level
concentration at the receptor location (x,y,0) resulting
from a point source of unit strength at $(\xi,\eta,\zeta)$.
-------
Note that this formulation includes the assumption of horizontal
homogeneity, namely, that the impact of a given source upon a given
receptor depends only upon their relative and not absolute coordinates.
This assumption is true for an urban environment only in an average sense.
A single wind direction is similarly valid only in an average sense. Fi-
nally, it should be noted that the above formulation assumes steady-state
conditions and is thus only applicable for relatively short time periods
(of the order of one hour), when this may be an adequate approximation pro-
viding the emissions and meteorological conditions are not rapidly changing.
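As a rough illustration of the point-source sum in Eq. (2-2) and of the horizontal-homogeneity assumption, the following sketch evaluates the sum for an arbitrary kernel. The function `toy_kernel`, the source list, and all numerical values are hypothetical placeholders introduced here for illustration; they are not taken from the report.

```python
import math

def toy_kernel(dx, dy, height):
    """Hypothetical stand-in for K(x - xi, y - eta; 0, zeta); not from the report."""
    if dx <= 0.0:
        return 0.0  # assumption: a source has no impact upwind of itself
    spread = 1.0 + 0.1 * dx  # arbitrary plume growth with downwind distance
    return math.exp(-(dy * dy + height * height) / (2.0 * spread * spread)) / (spread * spread)

def point_source_concentration(x, y, sources, kernel):
    """Eq. (2-2): sum over sources of Q_P * K(x - xi, y - eta; 0, zeta)."""
    return sum(q * kernel(x - xi, y - eta, zeta) for q, xi, eta, zeta in sources)

# Each source is (Q_P, xi, eta, zeta); the numbers are illustrative only.
sources = [(100.0, 0.0, 0.0, 20.0), (50.0, 300.0, -40.0, 35.0)]
chi = point_source_concentration(1000.0, 10.0, sources, toy_kernel)

# Horizontal homogeneity: shifting sources and receptor together changes nothing,
# because the kernel sees only the relative coordinates.
shifted = [(q, xi + 250.0, eta - 75.0, zeta) for q, xi, eta, zeta in sources]
chi_shifted = point_source_concentration(1250.0, -65.0, shifted, toy_kernel)
```

Because the kernel receives only coordinate differences, `chi` and `chi_shifted` agree, which is exactly the property the formulation assumes.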
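In Equations (2-2) and (2-3) above it is convenient to use "source-
oriented" position coordinates, and to consider a typical ground-level
receptor location as $(x_i, y_i)$, i = 1, 2, ...
Let

$$x' = x_i - \xi, \quad dx' = -d\xi, \quad x'_\ell = x_i - \xi_\ell$$
$$y' = y_i - \eta, \quad dy' = -d\eta, \quad y'_\ell = y_i - \eta_\ell \tag{2-4}$$

Then

$$\chi_P(x_i,y_i,0) = \sum_{\ell=1}^{J} Q_P(\ell)\, K(x_i-\xi_\ell,\; y_i-\eta_\ell;\; 0, \zeta_\ell) \tag{2-5}$$

$$\chi_A(x_i,y_i,0) = \sum_{s=1}^{3} \iint Q_A(x_i-x',\; y_i-y',\; \zeta_s)\, K(x',y';\; 0, \zeta_s)\, dx'\, dy' \tag{2-6}$$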
-------
In the following several different source-receptor functions
[$K(x',y';\,0,\zeta)$] will be considered, including the classical Gaussian form that is the
basis for the RAM model [14]. For the latter, and with the meteorological
condition of infinite mixing depth

$$K(x',y';\,0,\zeta) = \frac{1}{\pi U \sigma_y(x')\,\sigma_z(x')} \exp\left[-\frac{y'^2}{2\sigma_y^2(x')}\right] \exp\left[-\frac{\zeta^2}{2\sigma_z^2(x')}\right] \tag{2-7a}$$

where U denotes the mean wind speed, and we assume simple power-law
dependencies for the standard deviation functions, say

$$\sigma_y(x') = a_y\,(x')^{b_y} \tag{2-7b}$$

$$\sigma_z(x') = a_z\,(x')^{b_z} \tag{2-7c}$$

Also, as in the RAM model, we will assume that the narrow-plume hypothesis may
be employed in order to reduce the double integral of equation (2-6) to a
one-dimensional integral. Thus, under this hypothesis, if

$$\int_{-\infty}^{\infty} K(x',y';\,0,\zeta_s)\, dy' = G(x',\zeta_s) \tag{2-8}$$

then in place of equation (2-6) we have

$$\chi_A(x_i,y_i,0) = \sum_{s=1}^{3} \int Q_A(x_i-x',\; y_i,\; \zeta_s)\, G(x',\zeta_s)\, dx' \tag{2-9}$$
s-i J
which only involves values of the area-source emission rates in the vertical
plane through the wind direction and the receptor location.
For the special case of a Gaussian plume

$$G(x',\zeta) = \sqrt{\frac{2}{\pi}}\; \frac{1}{U \sigma_z(x')} \exp\left[-\frac{\zeta^2}{2\sigma_z^2(x')}\right] \tag{2-10}$$

The basic equations (2-5) and (2-6) (or (2-5) and (2-9)), with the Gaussian
forms for $K(x',y';\,0,\zeta)$ and $G(x',\zeta)$, involve four unspecified parameters through
the equations (2-7b) and (2-7c), namely $a_y$, $b_y$, $a_z$ and $b_z$. More generally, any
functional form chosen for K (and therefore G) may have unspecified parameters,
denoted by the vector $\alpha$. Thus for the special Gaussian form

$$\alpha = (a_y, b_y, a_z, b_z). \tag{2-11}$$

The explicit dependence of the calculated concentration values on these
parameters could be indicated by the notation $\chi(x_i,y_i,0;\alpha)$.
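As a consistency check on the Gaussian forms, the sketch below evaluates K of Eq. (2-7a) and G of Eq. (2-10) and confirms numerically that integrating K over the crosswind coordinate, as in Eq. (2-8), reproduces G. The parameter values are those quoted in Eq. (2-13) and the wind speed is the 5 m/s used for the test data; the downwind distance, the source height, the use of metres, and the trapezoidal integration are assumptions made only for this illustration.

```python
import math

ALPHA = (0.072, 0.90, 0.038, 0.76)  # (a_y, b_y, a_z, b_z) from Eq. (2-13)
U = 5.0                             # wind speed used for the test data, m/s

def sigma(x, a, b):
    """Power-law standard deviation, Eqs. (2-7b) and (2-7c)."""
    return a * x ** b

def K(x, y, zeta, alpha=ALPHA, wind=U):
    """Gaussian source-receptor function, Eq. (2-7a)."""
    a_y, b_y, a_z, b_z = alpha
    s_y, s_z = sigma(x, a_y, b_y), sigma(x, a_z, b_z)
    return (math.exp(-y * y / (2.0 * s_y * s_y))
            * math.exp(-zeta * zeta / (2.0 * s_z * s_z))
            / (math.pi * wind * s_y * s_z))

def G(x, zeta, alpha=ALPHA, wind=U):
    """Crosswind-integrated form, Eq. (2-10)."""
    s_z = sigma(x, alpha[2], alpha[3])
    return (math.sqrt(2.0 / math.pi)
            * math.exp(-zeta * zeta / (2.0 * s_z * s_z)) / (wind * s_z))

def integrate_K_over_y(x, zeta, n=4000):
    """Trapezoidal estimate of the integral in Eq. (2-8), truncated at +/- 8 sigma_y."""
    s_y = sigma(x, ALPHA[0], ALPHA[1])
    lo, hi = -8.0 * s_y, 8.0 * s_y
    h = (hi - lo) / n
    total = 0.5 * (K(x, lo, zeta) + K(x, hi, zeta))
    for k in range(1, n):
        total += K(x, lo + k * h, zeta)
    return total * h

x_down, height = 2000.0, 10.0   # illustrative distance and source height (assumed metres)
numeric = integrate_K_over_y(x_down, height)
closed_form = G(x_down, height)
```

The crosswind integral removes all dependence on $\sigma_y$, which is why the area-source calculation under the narrow-plume hypothesis constrains only $a_z$ and $b_z$.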
The basic method employed in this study is that of choosing $\alpha$ to minimize
the error between calculated and observed values of concentrations.* In order
to express this statement formally, we must elaborate our notation to indicate
explicitly the dependence on wind direction; thus $\chi(x_i,y_i,0;\theta;\alpha)$. For each
wind direction $\theta_j$ (j = 1, 2, ..., R) there is a concentration observation for each
receptor location (monitoring station). The receptor locations are denoted
$(x_i,y_i)$ for i = 1, 2, ..., N, and are assumed to be at ground level so that we may
omit the symbol 0 in the $\chi$ notation. Then the mean square error over all
observations is

$$e^2(\alpha) = \frac{1}{RN} \sum_{i=1}^{N} \sum_{j=1}^{R} \left[\chi(x_i,y_i;\theta_j;\alpha) - \chi^{\mathrm{obs}}(x_i,y_i;\theta_j)\right]^2$$
$$\phantom{e^2(\alpha)} = \frac{1}{RN} \sum_{i=1}^{N} \sum_{j=1}^{R} \left[\chi_P(x_i,y_i;\theta_j;\alpha) + \chi_A(x_i,y_i;\theta_j;\alpha) - \chi^{\mathrm{obs}}(x_i,y_i;\theta_j)\right]^2 \tag{2-12}$$

where $\chi_P$ and $\chi_A$ are given by Eqs. (2-5) and (2-6) (or (2-5) and (2-9)), and
$\chi^{\mathrm{obs}}$ denotes the observed concentration.
The problem of minimizing $e^2$ with respect to $\alpha$ is a standard optimization
problem. We will not discuss the particular method used here.

* "Observed" in the present case is model-created test data; the technique
is, of course, intended for practical use on measured data.
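The fitting idea can be sketched in a few lines. The example below builds model-created "observations" from the Gaussian kernel with the parameter values of Eq. (2-13), evaluates the mean square error of Eq. (2-12) for point sources only, and reduces it with a crude coordinate search. Everything specific here is an assumption made for illustration: the source and receptor layout, the eight wind directions, the convention that the angle gives the direction the wind blows toward, the search box, and the search routine itself, which is a stand-in since the report does not describe the method it used.

```python
import math

U = 5.0  # wind speed, as in the test data
# Hypothetical layout: sources are (Q, X, Y, height); receptors are (X, Y).
SOURCES = [(100.0, 0.0, 0.0, 20.0), (60.0, 500.0, 200.0, 30.0)]
RECEPTORS = [(2000.0, 0.0), (0.0, 2000.0), (-2000.0, 0.0),
             (0.0, -2000.0), (1500.0, 1500.0), (-1500.0, -1500.0)]
DIRECTIONS = [k * math.pi / 4.0 for k in range(8)]  # direction the wind blows toward

def kernel(dx, dy, height, alpha):
    """Gaussian K of Eq. (2-7a); zero upwind of the source (assumption)."""
    if dx <= 0.0:
        return 0.0
    a_y, b_y, a_z, b_z = alpha
    s_y, s_z = a_y * dx ** b_y, a_z * dx ** b_z
    return (math.exp(-dy * dy / (2.0 * s_y * s_y))
            * math.exp(-height * height / (2.0 * s_z * s_z))
            / (math.pi * U * s_y * s_z))

def predict(alpha):
    """Point-source concentration, Eq. (2-5), for every receptor and direction."""
    values = []
    for theta in DIRECTIONS:
        c, s = math.cos(theta), math.sin(theta)
        for rx, ry in RECEPTORS:
            total = 0.0
            for q, sx, sy, h in SOURCES:
                dx = (rx - sx) * c + (ry - sy) * s    # downwind separation
                dy = -(rx - sx) * s + (ry - sy) * c   # crosswind separation
                total += q * kernel(dx, dy, h, alpha)
            values.append(total)
    return values

TRUE_ALPHA = (0.072, 0.90, 0.038, 0.76)   # Eq. (2-13)
OBSERVED = predict(TRUE_ALPHA)            # model-created "observations"

def mse(alpha):
    """Mean square error of Eq. (2-12), point sources only."""
    predicted = predict(alpha)
    return sum((p - o) ** 2 for p, o in zip(predicted, OBSERVED)) / len(OBSERVED)

LOWER, UPPER = (0.01, 0.3, 0.01, 0.3), (1.0, 1.5, 1.0, 1.5)  # assumed search box

def pattern_search(objective, start, step=0.2, iterations=100):
    """Crude coordinate search; a move is kept only if it lowers the objective."""
    best, best_val = list(start), objective(start)
    for _ in range(iterations):
        improved = False
        for k in range(len(best)):
            for sign in (1.0, -1.0):
                trial = list(best)
                trial[k] = best[k] * (1.0 + sign * step)
                if not LOWER[k] <= trial[k] <= UPPER[k]:
                    continue
                value = objective(trial)
                if value < best_val:
                    best, best_val, improved = trial, value, True
        if not improved:
            step *= 0.5   # refine the step once no neighbour improves
    return best, best_val

start = (0.1, 1.0, 0.1, 1.0)              # initial guess quoted in Section 2.4
start_error = mse(start)
fitted, fitted_error = pattern_search(mse, start)
```

By construction the error is exactly zero at `TRUE_ALPHA`, and the search can only lower the error from its starting value. Whether it lands near `TRUE_ALPHA` depends on the layout and the starting point.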
2.3 TEST DATA
For a realistic distribution of point-sources, area-sources and receptor
locations, use was made of unpublished information from a 1968 air pollution
study conducted in St. Louis, Missouri. The area sources were gridded into
over 600 square regions; there were 60 point sources and errors were calculated
at 40 receptors for the 16 wind directions. (See Reference [7] for more details.)
The corresponding concentration data were generated by the EPA-developed RAM
algorithm [14], which is a specific implementation of the classical Gaussian
-------
plume formulation, that considers both point and area sources, with three
possible heights for the latter, and which uses the "narrow-plume" hypothesis
(i.e., Eq. (2-9)) to calculate the area-source concentration contribution $\chi_A$.
A constant wind speed U of 5 meters per second was employed, and sixteen wind
directions at the points of the compass were simulated. Infinite mixing depth
and a neutral atmospheric stability category were assumed. For the latter, in
Eqs. (2-7b) and (2-7c), we have
$$a_y = 0.072, \quad b_y = 0.90, \quad a_z = 0.038, \quad b_z = 0.76. \tag{2-13}$$
For this data, these values and the indicated equations are optimal and
would produce zero mean-square error. It is this result we hope to be able
to recover from the data by the optimization procedure.
2.4 OPTIMIZING PARAMETERS FOR THE GAUSSIAN FORM OF THE SOURCE-RECEPTOR FUNCTION
The data base described earlier contains concentration values at forty
receptors and sixteen wind directions, a total of 640 values (referred to
as "actual" values). The contribution to the concentration from point and
area sources was available separately, as well as in toto.
Equations (2-5) and (2-7a) provide a prediction of the point-source pol-
lutant concentration at any given receptor location once the four parameters
are specified. A comparison of values predicted by these equations versus
actual values allows calculation of the root-mean-square value of the error
with a given choice of parameter values. (See Eq. (2-12), with area sources
at zero.)
-------
With initial guesses of $a_y = a_z = 0.1$ and $b_y = b_z = 1.0$, the search routine
arrived at values of
$a_y = 0.074$, $b_y = 0.92$, $a_z = 0.039$, $b_z = 0.77$
when the "true" values (those used to create the data) were
$a_y = 0.072$, $b_y = 0.90$, $a_z = 0.038$, $b_z = 0.76$.
The root-mean-square (RMS) error initially was 157 µg/m³ and the maximum
error over the 640 values was 1205 µg/m³; the parameter values after 100
iterations yielded an RMS error of 14 µg/m³ and a maximum error of 175 µg/m³.
To place the size of the final error in perspective, we note that the actual
values (due to point sources alone) were as high as 1545 µg/m³.
Employing Eq. (2-9) for area sources and using only the area-source
contribution in the "actual" data, we get the values (0.037, 0.79) for $a_z$ and
$b_z$ versus the true values (0.038, 0.76).
The results of treating point and area sources simultaneously, representative
of the case which would be encountered with measurement data, are listed
in Table 2-1; the algorithm once again closely approaches the optimum values
in 100 iterations. Actual values of the total concentrations from both point
and area sources exceed 1600 µg/m³.
Table 2-1: Point and Area Sources Together; Parameter Values at Initial,
Mid, and Final Iteration During Search

Iteration         a_y      b_y     a_z      b_z     RMS Error   Max. Error
                                                    (µg/m³)     (µg/m³)
0 (initial)       0.100    1.00    0.100    1.00    157         1205
50 (mid)          0.055    0.79    0.044    0.67    79          583
100 (final)       0.074    0.89    0.036    0.74    24          194
ACTUAL VALUES:    (0.072)  (0.90)  (0.038)  (0.76)  (0)         (0)

While the initial parameter values we chose in these cases converged
toward the values used in creating the data, experimentation indicated that
this was not always the case. Small RMS errors could be achieved with
combinations of parameters significantly different in value from those used in
creating the data. It is easy to find rather different combinations of a
and b which yield very similar values of $\sigma$ over the range of x in which
we are interested. It is clear that an essentially equivalent combination
of values should not be deemed erroneous, since it yields an accurate
empirical model. We regard this as a characteristic of the formulation
chosen for calculating $\sigma$ and do not regard it as a difficulty of the
methodology proposed. Further, in practice, initial values for the parameters
would be chosen from the literature, and the solution obtained would be a
set of values similar to the initial values, but which minimized the
prediction error.
This aspect of implementation also suggests that a good initial guess
would be employed and, thus, that convergence to an "optimum" solution would
be rapid.
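The near-equivalence of different (a, b) combinations is easy to check numerically. The sketch below compares the power law $\sigma = a x^b$ for the crosswind values of Eq. (2-13) against a second pair. The second pair and the 1 to 10 km distance range are hypothetical choices made for this illustration only; the report does not state which combinations or which range it examined.

```python
def sigma(x, a, b):
    """Power-law standard deviation, as in Eqs. (2-7b) and (2-7c)."""
    return a * x ** b

reference = (0.072, 0.90)     # (a_y, b_y) from Eq. (2-13)
alternative = (0.055, 0.93)   # hypothetical pair chosen for illustration

# Downwind distances from 1 km to 10 km in 500 m steps (assumed range, metres).
distances = [1000.0 + 500.0 * k for k in range(19)]

# Relative difference between the two sigma curves at each distance.
rel_diffs = [abs(sigma(x, *alternative) / sigma(x, *reference) - 1.0)
             for x in distances]
worst = max(rel_diffs)
```

Over this range the two curves differ by only a few percent even though `a` differs by about a quarter, so a fit to concentration data alone cannot be expected to separate such pairs sharply.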
2.5 MORE GENERAL SOURCE-RECEPTOR FUNCTIONS
More complex source-receptor functions (such as multivariate polynomials
and piecewise quadratic functions) were tested with success [7], but broad
conclusions about alternative forms will not be forthcoming through the
analysis of the present test data. Analysis of measurement data may allow
meaningful comparison of the Gaussian and more general parameterized forms.
-------
3. EMPIRICAL MODELING OF THE OXIDANT FORMATION PROCESS
3.1 MOTIVATION
Typical objectives of a modeling effort are (1) qualitative understanding
and (2) quantitative impacts. In air quality modeling, these objectives are
aimed at the ultimate objectives of determining the effects of alternative
control policies and understanding which policies will be most productive.
Ozone air quality modeling efforts have been largely concentrated at extremes
of the spectrum of approaches to modeling: (1) simple statistical models
with limited applications, or (2) complex models based on the underlying
physics and chemistry of the process. The former class of models provides
easy-to-use, but rough, guidelines; the latter class of model is capable of
detailed temporal and spatial impact analysis, but is costly and difficult to use.
The study outlined in this section illustrates the feasibility of an
intermediate class of model which is relatively inexpensive and easy-to-use,
but which is capable of providing reasonably detailed temporal and spatial
estimates of oxidant concentration. Further, the form of the model makes it
possible to understand (with careful inspection) the qualitative implications
of the model as a guide to the design of control strategies. This study is
reported more fully elsewhere [3].
We hasten to emphasize, however, that a full model in this class was not
a result of this study; rather, we present an analysis which we believe
indicates the feasibility of the development of such a model. In particular,
we develop an empirical difference equation for the production of oxidant
from chemical precursors, as affected by meteorological variables. A full
model would involve difference equations for the precursor pollutants as
well. Further, data easily available did not include all meteorological
variables of possible interest or emission data. (Since ozone is a secondary
pollutant, emissions of primary pollutants over a brief interval, e.g., one
hour, will not affect the change in ozone levels over that interval to the
degree they affect the change in primary pollutant levels. Since we did
not derive difference equations for the primary pollutants in this study,
not including emissions did not prove serious.) The reader should therefore
interpret the results as indicating the degree to which the change
in ozone can be explained despite these limitations. Whatever degree of
explanation of the variance in one- or two-hour changes in ozone we can achieve
within these limitations can be improved when more of the omitted factors are
taken into account. This analysis will thus provide a pessimistic estimate
of the degree of success that can be expected in a full-scale implementation
of the approach.
3.2 FORM OF MODEL
We consider a "parcel" of air that moves along a trajectory to be determined
from the wind field, and we define $\mathrm{O_3}(t)$ as the oxidant concentration
averaged over the hour preceding time t. We further define $\Delta\mathrm{O_3}(t)$ as the
change in the hourly average oxidant concentration (in pphm) in the time
interval $\Delta t$ following t; explicitly,

$$\Delta\mathrm{O_3}(t) = \mathrm{O_3}(t + \Delta t) - \mathrm{O_3}(t). \tag{3-1}$$

(We will consider $\Delta t$ = 1 hour and $\Delta t$ = 2 hours.) We seek an equation predicting
the change in hourly average concentration after time t from measurements
of pollutants and meteorology available at time t. Pollutant measurements
other than ozone that we will consider as possible precursors include the
following, all of them in terms of concentration averaged over the hour
preceding time t:
$\mathrm{NO}(t)$ = NO concentration (pphm)
$\mathrm{NO_2}(t)$ = NO2 concentration (pphm)
$\mathrm{HC}(t)$ = non-methane hydrocarbon concentration (ppm)
$\mathrm{CH_4}(t)$ = methane concentration (ppm).
Meteorological variables considered explicitly include the following,
again averaged over the hour preceding time t:
$\mathrm{SR}(t)$ = solar radiation (gm-cal/cm²/hr)
$T(t)$ = temperature (°F).
Mixing height was not used in the present study.
We thus seek a relationship of the form
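$$\Delta\mathrm{O_3}(t) = F[\mathrm{O_3}(t), \mathrm{NO}(t), \mathrm{NO_2}(t), \mathrm{HC}(t), \mathrm{CH_4}(t), \mathrm{SR}(t), T(t)], \tag{3-2}$$

which accurately reflects observed data. Referring to (3-1), equation (3-2)
can be alternatively written as

$$\mathrm{O_3}(t+\Delta t) = \mathrm{O_3}(t) + F[\mathrm{O_3}(t), \mathrm{NO}(t), \mathrm{NO_2}(t), \mathrm{HC}(t), \mathrm{CH_4}(t), \mathrm{SR}(t), T(t)]. \tag{3-3}$$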
This form indicates explicitly how such a relationship, if derived, can
be used to compute a sequence of hourly or bi-hourly oxidant concentrations.
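The way Eq. (3-3) generates a sequence of concentrations can be sketched as a simple recurrence. The function `toy_F` below is a hypothetical placeholder with arbitrary coefficients, not the relationship fitted in the study, and the hourly input values are made up; only the stepping rule itself comes from Eq. (3-3).

```python
def toy_F(o3, no, no2, hc, ch4, sr, temp):
    """Hypothetical placeholder for F in Eq. (3-2); the coefficients are arbitrary."""
    return 0.02 * sr * no2 / (1.0 + no) - 0.1 * o3

def simulate_ozone(o3_initial, hourly_inputs, F):
    """Apply Eq. (3-3) once per step: O3(t + dt) = O3(t) + F[O3(t), inputs(t)]."""
    series = [o3_initial]
    for inputs in hourly_inputs:
        o3 = series[-1]
        series.append(o3 + F(o3, *inputs))
    return series

# Each tuple is (NO, NO2, HC, CH4, SR, T) along a parcel trajectory; made-up values.
hourly_inputs = [
    (4.0, 6.0, 0.5, 2.0, 40.0, 75.0),
    (2.0, 8.0, 0.4, 2.0, 60.0, 80.0),
    (1.0, 7.0, 0.3, 2.0, 70.0, 84.0),
]
ozone = simulate_ozone(3.0, hourly_inputs, toy_F)
```

In a complete model the precursor values would come from their own difference equations rather than from a fixed list, as the text notes.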
-------
(Similar equations would be derived for the primary pollutants to provide a
complete model.)
We must incorporate transport effects. We have adopted a rather simple
model. The model estimates the trajectory of a "parcel" of air from ground-
level measurements of the wind field. A parcel arriving at a given location
at a given time (e.g., Pasadena at 1600 hours) is estimated, from the current
wind direction, to have been at another location upwind one hour earlier. The
distance of that location upwind is given by the current wind speed. The
trajectory is tracked backwards to give a sequence of hourly locations. The
"actual" values of pollutant levels at these points at the given times are
obtained by an interpolation procedure from measured data at fixed monitoring
stations. The motivation for tracking parcels backwards rather than forwards
is, in part, to allow choice of parcels which end up at monitoring stations, so
that the last (and often highest) pollutant concentration need not be
interpolated. The air parcel trajectory approach is obviously a simplification of
the true physics of the system; this approach is similar to assumptions employed
in some physically based air quality models [26]. In the present empirical
modeling context, the trajectory approach is a statistical approximation rather
than an assumption; that is, the inaccuracy of the approximation will be
reflected in the overall error of the final empirical model.
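The backward tracking step described above can be sketched as follows. The function names, the coordinate convention (x east, y north, in miles), the meteorological convention that wind direction is the direction the wind blows from (degrees clockwise from north), and the uniform wind field are all assumptions made for illustration; in the study the wind at each point came from an interpolated field of ground-level measurements.

```python
import math

def upwind_step(x, y, speed, direction_deg, hours=1.0):
    """Estimate where a parcel now at (x, y) was `hours` earlier.

    Assumes direction_deg is the direction the wind blows FROM, in degrees
    clockwise from north, and speed is in miles per hour.
    """
    theta = math.radians(direction_deg)
    distance = speed * hours
    return x + distance * math.sin(theta), y + distance * math.cos(theta)

def back_trajectory(x, y, hour, wind_at, n_hours):
    """Track a parcel backwards, one hour at a time, using the local current wind."""
    points = [(hour, x, y)]
    for _ in range(n_hours):
        speed, direction_deg = wind_at(x, y, hour)
        x, y = upwind_step(x, y, speed, direction_deg)
        hour -= 1
        points.append((hour, x, y))
    return points

def uniform_westerly(x, y, hour):
    """Hypothetical wind field: 5 mph from the west everywhere, at all hours."""
    return 5.0, 270.0

# Parcel arriving at the origin at 1600 hours, tracked back three hours.
path = back_trajectory(0.0, 0.0, 16, uniform_westerly, 3)
```

With a steady wind from the west, each earlier position lies 5 miles further west, so the parcel is placed about 15 miles west of the arrival point at 1300 hours.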
3.3 THE DATA
Data collected by the Los Angeles Air Pollution Control District was
employed. Air quality data from the seven monitoring stations indicated
in Figure 3-1 was utilized.
-------
[Figure 3-1. The study region (monitoring stations are capitalized). Map of
the Los Angeles basin, scale in miles, showing BURBANK, PASADENA, AZUSA,
LOS ANGELES, POMONA, and WHITTIER, together with Playa Del Rey, Long Beach,
and Huntington Beach.]
-------
We interpolated the wind field in a region of the Los Angeles basin so
that we were able to track parcels of air as they moved through the basin.
The pollutant readings at seven APCD stations were also interpolated so
that we could keep hourly records of the pollutants discussed. We also had
the hourly solar radiation readings at the Los Angeles Civic Center location
of the APCD, and hourly temperature readings at three representative locations
in the basin.
Our study was carried out using data from five summer months, June
through October 1973. About 7000 trajectories were formed and placed in the
primary data base.
From the bank of 7000 trajectories, we extracted a sample of about 1900
data vectors of the form
(ΔO3, O3, NO, NO2, HC, CH4, SR, T) ,
where ΔO3 was a one-hour change, and about 1800 vectors where ΔO3 was a
two-hour change.
3.4 THE ANALYSIS
Since time is only implicit in (3-2), we search for a fixed relationship
$\Delta \mathrm{O_3} = F(\mathrm{O_3}, \mathrm{NO}, \mathrm{NO_2}, \mathrm{HC}, \mathrm{CH_4}, \mathrm{SR}, T)$ . (3-4)
Equally important, we want to determine which of the variables were most
significantly related to the change in ozone. Therefore, we really had two
objectives in this study:
(1) To find those subgroups of the variables most significantly
related to the ozone change.
(2) To find the form of the function F providing the best fit
to the data.
3.4.1 Variable Selection
For the variable selection and exploratory phase, we used INVAR, a general
nonparametric method for estimating efficiently how much of the variability
in the dependent variable can be explained by a subgroup of the independent
variables [2]. This technique estimates the limiting value of percent of
variance explained (PVE) by a "smooth" nonlinear model.* We first tested all
independent variables as individual predictors, then pairs of variables, and
then added variables to find the best three, etc. The most significant
individual variables (in approximate order of importance) are SR, NO2, T, and
O3.
Exploring pairs of variables, we found the best pair was (O3, SR) for
both one- and two-hour ΔO3.
Triplets of variables were then explored with one really significant
improvement showing. The best triple by a good margin was (O3, SR, NO2),
explaining 60% of the variance in one-hour ΔO3 and 71% of the two-hour ΔO3.
The final significant increase occurred when we added temperature to O3,
NO2, SR. But, somewhat strangely, the increase was significant only for the
data base of one-hour ΔO3. Here we obtained
Variables           One-Hour PVE
O3, NO2, SR, T      66%
*Percent of variance explained equals
100 x [1 - (variance of error in prediction) / (variance of dependent variable)].
In all of the INVAR runs using HC and CH4, neither of them significantly
increased the PVE. For instance, when HC and CH4 were individually added to
the variables NO2, NO, O3, and SR, the maximum increase in PVE was 2.1%.
These results are encouraging; the three variables O3, NO2, SR predict
about 71% of the variance in two-hour ozone changes, that is, with a
correlation between predicted and actual values of 0.84 over 1800 samples.
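As a rough check on how a PVE figure maps to the quoted correlation, the short sketch below computes PVE from the definition given above and converts a PVE to a correlation. It assumes the predictor is a least-squares fit, so that the squared correlation equals PVE/100; the sample values and function names are hypothetical and serve only as illustration.

```python
import math

def percent_variance_explained(actual, predicted):
    """PVE = 100 * (1 - variance of prediction error / variance of actual)."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mean_actual = sum(actual) / n
    mean_error = sum(errors) / n
    var_actual = sum((a - mean_actual) ** 2 for a in actual) / n
    var_error = sum((e - mean_error) ** 2 for e in errors) / n
    return 100.0 * (1.0 - var_error / var_actual)

def implied_correlation(pve):
    """Correlation implied by a PVE, assuming r**2 = PVE / 100."""
    return math.sqrt(pve / 100.0)

# Hypothetical ozone changes (pphm) and model predictions, for illustration only.
actual = [1.0, 3.0, 2.0, 6.0, 8.0]
predicted = [1.5, 2.5, 3.0, 5.0, 8.0]
pve = percent_variance_explained(actual, predicted)

# The 71% quoted for two-hour changes corresponds to a correlation near 0.84.
r_two_hour = implied_correlation(71.0)
```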
3.4.2. Specific Functional Relationship
The exploratory analysis provided nonparametric estimates of the degree
of predictability of two-hour ΔO3 as a function of O3, NO2, SR. In this
subsection we discuss the derivation of a specific simple functional form to
make explicit this relationship.
To get a continuous functional form for the relationship of ΔO3 to O3,
NO2, and SR, we used continuous piecewise linear regression [16,12]. Since the
function generated by this method is smoother and less general than that used
in INVAR estimates, we did not achieve the level of PVE obtained by INVAR.
The continuous piecewise linear function which minimizes the mean-square error
in the fit to the 1800 sample points is given by*
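$\Delta \mathrm{O_3} = -5.125 \cdot \max\{A, B, C\} - 1.167 \cdot \max\{D, E, F\} + 10.48$ , (3-5)
where
$A = -0.2146 \cdot \mathrm{O_3} - 0.0701 \cdot \mathrm{NO_2} - 0.002268 \cdot \mathrm{SR} + 0.9376$
$B = 0.02114 \cdot \mathrm{O_3} - 0.1013 \cdot \mathrm{NO_2} - 0.01075 \cdot \mathrm{SR} + 2.275$
$C = 0.1638 \cdot \mathrm{O_3} - 0.09855 \cdot \mathrm{NO_2} - 0.005938 \cdot \mathrm{SR} - 0.2263$
$D = 0.02709 \cdot \mathrm{O_3} - 0.3015 \cdot \mathrm{NO_2} + 0.001298 \cdot \mathrm{SR} + 2.304$
$E = -0.009565 \cdot \mathrm{O_3} + 0.0005252 \cdot \mathrm{NO_2} - 0.001079 \cdot \mathrm{SR} + 0.2306$
$F = -0.0144 \cdot \mathrm{O_3} + 0.2066 \cdot \mathrm{NO_2} - 0.003171 \cdot \mathrm{SR} - 2.943$
*The notation max {A,B,C} means the largest of the three values computed
by equations A, B, and C.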
(The unusual form of the equation has no physical interpretation, but is
simply a consequence of the particular methodology employed.) This equation
explained 60.7% of the variance, a correlation between predicted and actual
values of 0.78. Figure 3-2 illustrates the form of the function.
This equation can be used to calculate a sequence of oxidant concentrations
in a parcel of air by using known values of the other pollutant concentrations
(since difference equations for these pollutants have not been derived). Fig-
ures 3-3, 3-4, and 3-5 illustrate the result for three air parcel trajectories.
Numerical values are listed in Table 3-1.
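The step-by-step use of (3-5) along a trajectory can be sketched as follows. This is an illustrative sketch, not the code used in the study. It assumes that the O3 coefficient of plane C is positive (the reading that reproduces the Region 3 and Region 5 entries of Table 3-2), that each step covers two hours, and that the NO2 and SR values supplied for each step are hypothetical inputs standing in for measured or interpolated data. The function names are ours.

```python
# (O3, NO2, SR, constant) coefficients of the hyperplanes in equation (3-5).
FIRST_GROUP = {
    "A": (-0.2146, -0.0701, -0.002268, 0.9376),
    "B": (0.02114, -0.1013, -0.01075, 2.275),
    "C": (0.1638, -0.09855, -0.005938, -0.2263),  # O3 sign assumed positive
}
SECOND_GROUP = {
    "D": (0.02709, -0.3015, 0.001298, 2.304),
    "E": (-0.009565, 0.0005252, -0.001079, 0.2306),
    "F": (-0.0144, 0.2066, -0.003171, -2.943),
}

def plane(coeffs, o3, no2, sr):
    """Evaluate one hyperplane at the given O3, NO2 and SR values."""
    c_o3, c_no2, c_sr, const = coeffs
    return c_o3 * o3 + c_no2 * no2 + c_sr * sr + const

def delta_o3(o3, no2, sr):
    """Two-hour change in O3 given by equation (3-5)."""
    first = max(plane(c, o3, no2, sr) for c in FIRST_GROUP.values())
    second = max(plane(c, o3, no2, sr) for c in SECOND_GROUP.values())
    return -5.125 * first - 1.167 * second + 10.48

def track_parcel(o3_start, conditions):
    """Apply (3-5) recursively; `conditions` holds one (NO2, SR) pair per step."""
    levels = [o3_start]
    for no2, sr in conditions:
        levels.append(levels[-1] + delta_o3(levels[-1], no2, sr))
    return levels

# Hypothetical parcel: initial O3 of 1.0 pphm and three two-hour steps.
levels = track_parcel(1.0, [(8.0, 60.0), (10.0, 120.0), (9.0, 150.0)])
```

Only the first O3 value is taken from data; every later value is built from the previous prediction, so errors can accumulate along the trajectory.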
3.5 INTERPRETATION OF MODEL IMPLICATIONS
Let us attempt to interpret the functional form in (3-5). The final
fitted surface is fairly simple, consisting of a continuous patching together
of eight hyperplane segments. Of the eight regions, there are three small
regions that together contain only 1.0% of the total number of points. We
will ignore these and restrict our analysis to the information contained in
the functional fit to ΔO3 in the five other regions.
[Figure 3-2. Graph of regression surfaces with SR = 100.]
[Figure 3-3. Air parcel arriving at the Pomona station at 3 PM. Actual
and predicted O3 (pphm) versus hour of the day.]
[Figure 3-4. Air parcel arriving at the Pomona station at 4 PM. Actual
and predicted O3 (pphm) versus hour of the day.]
[Figure 3-5. Air parcel arriving at the Pomona station at 5 PM. Actual
and predicted O3 (pphm) versus hour of the day.]
Table 3-1. Comparison of Model Predictions to Actual Values of O3

(a) Air parcel arriving at the Pomona station at 3 PM

Hour of the Day    Actual Value (pphm)    Predicted Value (pphm)
7 AM               1.0
9 AM               1.0                    0.0
11 AM              1.1                    0.6
1 PM               1.6                    2.5
3 PM               10.2                   6.1
5 PM               15.1                   20.2

(b) Air parcel arriving at the Pomona station at 4 PM

Hour of the Day    Actual Value (pphm)    Predicted Value (pphm)
8 AM               1.0
10 AM              1.4                    0.0
12 N               1.2                    1.0
2 PM               7.8                    4.0
4 PM               12.7                   15.8

(c) Air parcel arriving at the Pomona station at 5 PM

Hour of the Day    Actual Value (pphm)    Predicted Value (pphm)
7 AM               1.0
9 AM               1.0                    0.0
11 AM              1.6                    0.4
1 PM               3.9                    2.0
3 PM               9.0                    8.7
Region 1, containing almost half of the sample points, is representative of
low pollution levels, low O3 production, and low solar radiation. Region 2,
with 33% of the points, contains data with above average mean NO2 and solar
radiation levels, below average O3 levels, and high positive changes in O3.
The other three regions, with a total of 20% of the sample points, represent
more extreme conditions.
Since the size of the coefficients depends on the scaling of the variables,
we introduce normalized variables by dividing the original variables by their
overall standard deviations; i.e., denoting normalized variables by primes:
$\mathrm{O_3}' = \mathrm{O_3}/6.2, \quad \mathrm{NO_2}' = \mathrm{NO_2}/5.2, \quad \mathrm{SR}' = \mathrm{SR}/52.8$ . (3-6)
The equations are given in terms of the normalized variables in Table 3-2.
These are derived from equation (3-5).
Table 3-2. Normalized Equations for ΔO3 (ΔO3 not normalized)

Region    Equation
1         ΔO3 = -0.9 (O3')  + 4.5 (NO2') + 2.8 (SR') - 3.9
2         ΔO3 = -0.6 (O3')  + 2.6 (NO2') + 2.9 (SR') - 1.4
3         ΔO3 = -5.4 (O3')  + 4.4 (NO2') + 1.5 (SR') + 9.0
4         ΔO3 = -0.57 (O3') + 1.4 (NO2') + 3.1 (SR') + 2.3
5         ΔO3 = -5.1 (O3')  + 1.4 (NO2') + 1.8 (SR') + 15.1
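The way the normalized equations follow from (3-5) and (3-6) can be sketched numerically. In each region one plane from {A, B, C} and one from {D, E, F} is the maximum, so the fitted surface reduces to a single hyperplane. The pairing of planes with regions used below is not stated in the text; it is an assumption, chosen because it reproduces the tabulated coefficients to within rounding. The positive O3 coefficient in plane C is likewise an assumed reading of the source.

```python
# (O3, NO2, SR, constant) coefficients of the six hyperplanes in (3-5).
PLANES = {
    "A": (-0.2146, -0.0701, -0.002268, 0.9376),
    "B": (0.02114, -0.1013, -0.01075, 2.275),
    "C": (0.1638, -0.09855, -0.005938, -0.2263),  # O3 sign assumed positive
    "D": (0.02709, -0.3015, 0.001298, 2.304),
    "E": (-0.009565, 0.0005252, -0.001079, 0.2306),
    "F": (-0.0144, 0.2066, -0.003171, -2.943),
}
STD = (6.2, 5.2, 52.8)  # standard deviations of O3, NO2, SR from (3-6)

# Assumed (inferred) pairing of active planes for the five major regions.
REGIONS = {1: ("B", "D"), 2: ("B", "E"), 3: ("C", "D"), 4: ("B", "F"), 5: ("C", "F")}

def normalized_equation(first, second):
    """Coefficients on O3', NO2', SR' and the constant for one region."""
    combined = [-5.125 * p - 1.167 * q
                for p, q in zip(PLANES[first], PLANES[second])]
    combined[3] += 10.48
    # Multiplying a raw coefficient by the standard deviation gives the
    # coefficient on the normalized variable; the constant is unchanged.
    return tuple(c * s for c, s in zip(combined, STD + (1.0,)))

table = {region: normalized_equation(*pair) for region, pair in REGIONS.items()}
for region in sorted(table):
    print(region, [round(value, 2) for value in table[region]])
```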
The major qualitative conclusions that can be inferred from this table
(see [3] for fuller discussion) are the following:
(1) At below average O3 levels, the O3 change is determined largely by
the SR and NO2 levels, with larger values of these latter two
related to larger values of the O3 change. The largest positive
changes in O3 occur in this regime.
(2) At above average O3 levels, the O3 level has a strong negative
association with O3 change, and moderate to high levels of NO2 and SR
are associated with low to only moderately above-average changes in O3.
The consistent negative sign on O3 suggests a possible
self-limiting effect.
3.6 CONCLUSIONS
It is possible to derive rather accurate empirical equations predicting
the short-term change in oxidant concentration (considering the limitations
of the data and the difficulty of the problem). These results are encouraging
in terms of the practicality of a full model involving emission variables and
all the major reactive pollutants.
4. EXTRACTION OF EMISSION TRENDS FROM AIR QUALITY TRENDS
4.1 MOTIVATION
While measured pollutant concentration is the final impact of a
given level of emissions, trends in pollutant concentration measure-
ments can be misleading if it is assumed that those trends represent
progress (or the lack thereof) in emission control. Since meteorology
need not be uniform from time period to time period, the measure of
progress should be more directly related to emissions. Emissions come
from a large number of diverse sources, however, and are difficult to
measure directly. Since air quality has been measured directly for a
number of years, it is of significant interest to understand if the
effect of meteorology can be removed from air quality trends to more
nearly elicit trends in emissions. Such an analysis of trends is the
subject of periodic reports both by the Council on Environmental Quality
and by the Environmental Protection Agency.
Such a study must implicitly extract information about the influence
of meteorological factors on pollution levels for a given level of emis-
sions. This information can be an important subsidiary benefit of an
analysis of the sort suggested.
We will discuss this concept by referring to a specific example
of a study of the improvement in emissions between the early and late
sixties in Oslo, Norway [11]. We will then relate this example to a
general formulation to highlight the assumptions involved in such a
study, to make the method more specific, and to provide a context for
broader application of this approach.
4.2 REPORT OF A COMPARISON OF EMISSION LEVELS OVER TWO TIME PERIODS
A study of the changes in emission levels of SO2 in Oslo, Norway, as
deduced from changes in measured SO2 concentrations, was undertaken to
compare the SO2 emissions of the periods 1959-1963 and 1969-1973. The
meteorological conditions during the former period were considerably
different from those during the latter period; hence, one could not ex-
pect a change in air quality to be directly related to a change in emis-
sions.
Data from the earlier period (1959-1963) was used to do a linear
regression analysis. It was discovered that two variables dominated
the estimate of SO2 concentration: a temperature difference between a
low altitude and high altitude measuring station and the temperature
at the lower station. For example, a typical regression equation for
one station was
$q_{SO_2} = 61.5\,(T_2 - T_1) - 11.6\,T_1 + 472$ , (4-1)
where
$q_{SO_2}$ = daily mean value of SO2 concentration in μg/m³ at the
particular station
$T_2$ = temperature at higher station at 7 P.M.
$T_1$ = temperature at lower station at 7 P.M.
This equation explained the observed values of SO2 concentration with a
multiple correlation coefficient of .80; that is, the correlation between
values predicted by this equation and observed values for the period
indicated was 0.80. Adding other variables did not result in a significantly
better predictor equation. It was suggested that the temperature difference
term expressed the ventilation in the Oslo area while the temperature term
measured the variation in the emission of SO2 due to space heating. Since
the temperature data for the later time period is known, the level of SO2
expected for the meteorological conditions during that time period can be
estimated by equation (4-1). This was done for the days on which data was
available in the later time period; the results are indicated and compared
with data from the earlier time period in Figure 4-1. The data from the
1959-1963 time period is scattered relatively uniformly about the line
of slope 1, as expected, since the regression was performed on that data.
However, the data from the later years evinces a much lower observed value
of SO2 concentration than would be expected from the meteorological
conditions. The referenced study attributed this to a reduction in emissions.
Figure 4-1 indicates qualitatively the emission reduction (or, if
the reader prefers, the "meteorologically normalized" reduction in pollu-
tant levels). A quantitative statement was made in the report that the
SO2 pollution was reduced 50 to 60%. According to a conversation with
one of the authors of the report, this latter statement was derived by
looking at the ratio of the coefficient on the temperature difference
term in the early time period to the coefficient of the temperature
difference term in a similarly derived equation for the later
time period. The intuitive justification for such a statement is that
the coefficient measures the degree to which a given temperature inversion
will be translated into SO2 concentrations. Thus a 50 or 60% reduction
in that coefficient might be thought of as a meteorologically adjusted
measure of the trend in air quality. The intent was to obtain a value
which can be interpreted as being proportional to the reduction in
emissions.
[Figure 4-1. Values of daily mean SO2 concentration computed from
temperature measurements at 7 P.M. versus daily mean SO2 concentration
observed. The fact that the values in the later period are much less
than would be expected from the meteorology suggests that emissions are
less. Ref. [11]. Legend: 1959/63; 1969/70; Jan-Mar 1971.]
4.3 GENERALIZATION AND MATHEMATICAL FORMULATION
The purpose of the Oslo study was to compare air quality for two
different periods rather than to obtain a continuous estimate of a
meteorologically normalized air quality trend. We will formulate the
problem in the former terms in order to relate it explicitly to that
study; however, this does not at all imply that the approach cannot be
modified to yield a continuous estimate of air quality trends. Assume
we are given two sets of observations, one set for the first period of
time:
$(q_i^{(1)}, \mathbf{m}_i^{(1)}), \quad i = 1, 2, \ldots, N_1$ , (4-2)
where
$q_i^{(1)}$ = an air quality measurement during the first period (e.g., a
daily mean value of pollutant concentration)
and
$\mathbf{m}_i^{(1)} = (m_{i1}, m_{i2}, \ldots, m_{in})$
= a vector of meteorological measurements corresponding to the
i-th air quality measurement $q_i^{(1)}$ (e.g., $m_{i1}$ might be a
temperature measurement at a particular station).
There is a similar set of measurements for a later period:
$(q_i^{(2)}, \mathbf{m}_i^{(2)}), \quad i = 1, 2, \ldots, N_2$ . (4-3)
It is from this information (and without an estimate of emissions
during the two periods) that we wish to determine a meteorologically ad-
justed estimate of the improvement or deterioration of air quality (i.e.,
to estimate the change in emissions from air quality and meteorological
measurements). Suppose there is some "true," but unknown, equation (or
model) which relates emissions and meteorological measurements to air
quality:
$q = F(\mathbf{e}, \mathbf{m})$ , (4-4)
where $\mathbf{e}$ is a vector of emission measurements.
This equation plus measurement error produced the measurement data of
(4-2) and (4-3). We are assuming that the equation does not differ
between the two periods.
For the sake of the present discussion, let us again assume that
emissions remain essentially constant over the first time period and over
the second time period:
$\mathbf{e} = \mathbf{e}_1$ in first period, (4-5a)
$\mathbf{e} = \mathbf{e}_2$ in second period . (4-5b)
Now let us suppose that we do a linear or nonlinear regression with the
data from the first period, equation (4-2), and obtain a best fit equation
to the data:
$q^{(1)} = f_1(\mathbf{m})$ . (4-6)
Equation (4-1) is such an equation.
Then obviously
$q^{(1)} = f_1(\mathbf{m}) \approx F(\mathbf{e}_1, \mathbf{m})$ . (4-7)
Now suppose we use the data of the second period, equation (4-3), to obtain
a similar empirical model:
$q^{(2)} = f_2(\mathbf{m})$ . (4-8)
Then, as before,
$q^{(2)} = f_2(\mathbf{m}) \approx F(\mathbf{e}_2, \mathbf{m})$ . (4-9)
Let us further assume that F is decomposable:
$q = F(\mathbf{e}, \mathbf{m}) = G(\mathbf{e}) H(\mathbf{m})$ . (4-10)
Equation (4-10) implies that the effect of emissions on air quality is
essentially independent of the effect of meteorology. The appropriateness
of this assumption clearly depends upon the particular definitions of the
emission, meteorological, and pollutant variables, as well as on the area in
question. If the pollutant concentration is location-specific (rather
than a spatial average or spatial maximum), then either emissions must be
spatially uniform or the direction of the wind field relatively constant
for (4-10) to be reasonable. (The latter seems to be the assumption
of the Norwegian study.) If the variables are aggregates (such as spatially
averaged SO2 concentrations, total emissions, and average wind speed), then
less severe assumptions need be made for (4-10) to be reasonable.
Given (4-10), the ratio of the empirical equations for the two
time periods is
$\frac{f_2(\mathbf{m})}{f_1(\mathbf{m})} \approx \frac{G(\mathbf{e}_2) H(\mathbf{m})}{G(\mathbf{e}_1) H(\mathbf{m})} = \frac{G(\mathbf{e}_2)}{G(\mathbf{e}_1)}$ , (4-11)
using (4-6), (4-7), and (4-8). Thus, the ratio should be very nearly
constant if (4-10) is valid, and that constant will be a measure of
the change in emissions between the two periods. (The function
$G(\mathbf{e})$ can be, for example, total emissions in tons.)
If (4-11) is not nearly constant, it can be interpreted as im-
plying that the improvement is a function of the meteorology. This might
easily be the case. For example, if there is substantial reduction in
industrial emissions but no improvement in emissions from space heating,
the improvement in emissions will be less when the temperature is lower.
If the improvement is a function of wind direction, the location of major
emission sources may be the cause. In the Oslo study, the ratio of the
temperature difference terms alone was taken and is exactly constant.
Since the full Oslo model, (4-1), contains other terms, however, the
ratio suggested by this discussion is not constant. Since the equation
for the later time period was not explicitly reported, we cannot calculate
the ratio. Let us examine, however, an analysis which is consistent with
Figure 4-1 and which provides an alternative approach.
Suppose we create a model $f_1$ for the first time period only and apply
it to the meteorological conditions for the second time period:
$\tilde{q}_i^{(2)} = f_1(\mathbf{m}_i^{(2)}), \quad i = 1, 2, \ldots, N_2$ . (4-12)
We obtain estimates for the air quality $\tilde{q}_i^{(2)}$ to be expected
if the emissions have not changed; these calculated values can be compared
with observed values. These are the values plotted in Figure 4-1. If we now
perform a linear regression of observed versus estimated values, i.e.,
$q_i^{(2)}$ versus $\tilde{q}_i^{(2)}$ for $i = 1, 2, \ldots, N_2$, we obtain
a regression equation:
$q^{(2)} = a\,\tilde{q}^{(2)} + b$ , (4-13)
with specific values of a and b. Suppose we then assume that the "true"
equation is of the form
$q = F(\mathbf{e}, \mathbf{m}) = G(\mathbf{e}) H(\mathbf{m}) + q_0$ , (4-14)
where $q_0$ is a "background" air quality level not related to local
emissions.
Then (4-13) is consistent with (4-14) if
$a = G(\mathbf{e}_2) / G(\mathbf{e}_1)$ (4-15a)
and
$b = q_0^{(2)} - a\,q_0^{(1)}$ , (4-15b)
where $q_0^{(1)}$ and $q_0^{(2)}$ denote the background levels in the two
periods. Then "a" can be interpreted as the ratio of emissions in the two
periods and, more controversially, "b" can be related to the change in
"background" level (where the background level may contain contributions
from sources outside the emissions inventory included in $\mathbf{e}$, for
example, long-range transport from other cities).
Estimating the best-linear-fit equations graphically, from Figure 4-1,
we find that the equation for the 1969/70 data is approximately
$q^{(2)} = 0.25\,\tilde{q}^{(2)} + 120$ (4-16a)
and for the 1971 data
$q^{(2)} = 0.25\,\tilde{q}^{(2)}$ . (4-16b)
Thus, the reduction in emissions is about 75% by this analysis for both
periods. The 1969/70 period had higher "background" than the 1959/63
period by 120 μg/m³, but the 1971 period had about the same background
as 1959/63. Thus, the improvement between 1969/70 and 1971 could be
attributed to improvements in areas other than Oslo.
Note that this latter approach requires that only one model be created.
Since the approach is symmetrical, the model can be created for the period
in which the most data is available and applied to the other period.
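The single-model approach can be illustrated with synthetic data. The sketch below is not the Oslo analysis: it assumes one meteorological variable, a hypothetical linear H(m), made-up emission factors and backgrounds, and Gaussian measurement noise. A model fitted to the first period is applied to the second-period meteorology, and the observed second-period values are then regressed on the expected ones. Under the form (4-14) with period-specific backgrounds, algebra gives a slope near the ratio of emission factors and an intercept near (second background) minus slope times (first background).

```python
import random

def fit_line(xs, ys):
    """Least-squares slope and intercept of ys regressed on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def simulate(n, emission_factor, background):
    """Synthetic (m, q) pairs obeying q = G * H(m) + q0 + noise."""
    data = []
    for _ in range(n):
        m = rng.uniform(-2.0, 8.0)   # e.g. a temperature difference
        h = 60.0 * m + 200.0         # hypothetical H(m)
        q = emission_factor * h + background + rng.gauss(0.0, 10.0)
        data.append((m, q))
    return data

period1 = simulate(200, emission_factor=1.0, background=40.0)
period2 = simulate(200, emission_factor=0.25, background=130.0)

# Model f_1 fitted on the first period only.
slope1, intercept1 = fit_line([m for m, _ in period1], [q for _, q in period1])

# Air quality expected in the second period had emissions not changed.
expected2 = [slope1 * m + intercept1 for m, _ in period2]
observed2 = [q for _, q in period2]

# Regression of observed on expected values, as in (4-13).
a, b = fit_line(expected2, observed2)
print(round(a, 3), round(b, 1))
```

With these made-up parameters the slope should come out near 0.25 and the intercept near 130 - 0.25 x 40 = 120.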
5. DETECTION OF INCONSISTENCIES IN AIR QUALITY/METEOROLOGICAL DATA BASES
5.1 MOTIVATION
Air quality and meteorological data bases are collected for many pur-
poses (and often used for purposes not intended when collected). An im-
portant objective either during collection or after the fact is the de-
tection of inconsistencies in the data. In most data collection efforts,
an attempt is made to study the data for strange behavior or to employ in-
tuition and problem knowledge to uncover sources of system changes causing
data problems, such as changes or discrepancies in monitoring techniques.
A recent example is the detection of a significant discrepancy in certain
calibration techniques used by the California Air Resources Board and
the Los Angeles Air Pollution Control District, making oxidant measurements
of the agencies inconsistent without a correction factor [9]. The fre-
quent occurrence of detected inconsistencies in data bases leads one to
expect the possibility of undetected inconsistencies. An automatic tech-
nique for flagging potential inconsistencies using the data itself would
be an important tool. Such a technique would take an existing data base
and detect potential problems for closer inspection or detect problems
occurring in an ongoing data collection effort before a substantial amount
of data was irretrievably lost.
In this section, we will indicate how data-analytic/statistical
techniques can be employed to achieve this objective. We will distinguish
the types of inconsistencies for which one might search, the appropriate
approaches to detecting these various types of inconsistencies, and the
potential difficulties in this formal approach to the detection of
inconsistencies.
The key concept will be that of using the data collected to form a
model of the relationship between selected sets of measurements and to
automatically detect the measurements or points in time when (1) the model
changes or (2) the data is least consistent with the model. Note that
the model need not be a prediction model or relate independent to depen-
dent variables. Any consistent relationships in the data can be employed
in detecting inconsistencies.
It is important to distinguish inconsistencies from extremes. An
extreme value of air pollution is not necessarily inconsistent; it may be
consistent with extreme meteorological conditions. If the model ade-
quately incorporates the extreme conditions, the extreme values would
be indicated as being consistent and not flagged. If, however, the ex-
treme conditions were not previously observed in the data base or not
otherwise represented by a similar condition in the data base, the ex-
treme conditions may not be incorporated in the model and may be flagged
as possible inconsistencies. We bring up these points to emphasize two
key concepts: (1) the intent of a consistency analysis is not to flag
simple extreme values but to flag values which are inconsistent, i.e., ex-
treme and inconsistent values are not equivalent; (2) the intent of a con-
sistency analysis is to flag potential inconsistencies for inspection.
An inconsistency analysis will be successful if it does not miss key in-
consistencies that could seriously damage an empirical analysis or data
collection effort. It will not have failed if it also flags potential
inconsistencies which upon further examination are more accurately cate-
gorized as extremes or unusual occurrences.
Let us structure these ideas more formally.
5.2 FORMULATION OF CONSISTENCY MODELS
We imagine the basic situation of the simultaneous collection of
air quality and meteorological data, as well as possible adjunct data
depending upon the application (e.g., health effects data, emissions
data, etc.). Suppose the basic data is a sequence of measurements over
time of a number of variables:
Measurement 1: $x_1(t_1), x_1(t_2), \ldots, x_1(t_N)$ ,
Measurement 2: $x_2(t_1), x_2(t_2), \ldots, x_2(t_N)$ ,
...
Measurement n: $x_n(t_1), x_n(t_2), \ldots, x_n(t_N)$ . (5-1)
There are three basic formulations of consistency models available.
Time Sequence Inconsistencies
The consistency of individual time series can be examined. The model
constructed can be a model which predicts the value at a given point in
time from past and future values of itself. An inconsistency will then
be detected as a significant discrepancy between the forecast and observed
value. That is, the model could be of the form
$\hat{x}_i(t_j) = F[x_i(t_1), \ldots, x_i(t_{j-1}), x_i(t_{j+1}), \ldots, x_i(t_N)]$ , (5-2)
where $\hat{x}_i(t_j)$ is the value of $x_i(t_j)$ predicted by the model. We emphasize
that since we are testing consistency rather than predicting behavior,
values occurring after the particular value tested can be used in the
model when available. While many time series techniques employ recursively
expressed predictor models, they imply a general dependence of the form
indicated.
An inconsistency would be a sufficiently large deviation between pre-
dicted and measured values, i.e., a large value of
$|\hat{x}_i(t_j) - x_i(t_j)|$ . (5-3)
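A minimal sketch of a time-sequence check follows. It assumes the simplest possible two-sided predictor, the average of the immediately preceding and following values, as a stand-in for the function F of (5-2); the threshold and the readings are hypothetical.

```python
def two_sided_prediction(series, j):
    """Predict series[j] from its immediate past and future neighbours."""
    return 0.5 * (series[j - 1] + series[j + 1])

def flag_time_inconsistencies(series, threshold):
    """Indices whose deviation |x_hat - x|, as in (5-3), exceeds `threshold`."""
    flagged = []
    for j in range(1, len(series) - 1):
        deviation = abs(two_sided_prediction(series, j) - series[j])
        if deviation > threshold:
            flagged.append(j)
    return flagged

# Hypothetical hourly readings with one anomalous value at index 5.
hourly = [10, 11, 12, 13, 14, 40, 16, 17, 18, 19]
suspect = flag_time_inconsistencies(hourly, threshold=20.0)
```

Because the anomalous value also enters the predictions for its neighbours, those neighbours show a smaller deviation of their own; a lower threshold would flag them as well.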
Cross Measurement Inconsistencies
This type of model is constructed by modeling the relationships be-
tween measurements at a given point in time. An example is a derived
relationship between a vertical temperature difference and average wind
speed at the same time. Formally, such a model is of the form
$\hat{x}_i(t_j) = G[x_1(t_j), \ldots, x_{i-1}(t_j), x_{i+1}(t_j), \ldots, x_n(t_j)]$ . (5-4)
An inconsistency would be detected by large values of (5-3), as before.
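A cross-measurement check can be sketched in the same spirit. The example assumes a straight-line relationship between two simultaneous measurements as a stand-in for G in (5-4), and flags points whose residual exceeds a multiple of the root-mean-square residual. The variable names, the data, and the factor of two are all hypothetical choices.

```python
import math

def fit_line(xs, ys):
    """Least-squares slope and intercept of ys regressed on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def flag_cross_inconsistencies(x1, x2, n_sigma=2.0):
    """Indices where x1 departs from its linear fit on the simultaneous x2."""
    slope, intercept = fit_line(x2, x1)
    residuals = [a - (slope * b + intercept) for a, b in zip(x1, x2)]
    rms = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    return [j for j, r in enumerate(residuals) if abs(r) > n_sigma * rms]

# Hypothetical simultaneous readings: a vertical temperature difference and
# an average wind speed, with one wind value (index 4) out of line.
temp_difference = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
wind_speed = [20, 18, 16, 14, 30, 10, 8, 6, 4, 2]
suspect = flag_cross_inconsistencies(wind_speed, temp_difference)
```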
Combined Model
In general, measurements will depend upon both past history and
concurrent measurements. A full model would then be a technique which used
data both at other times and from other variables:
$\hat{x}_i(t_j) = H[x_k(t_l);\ k = 1, \ldots, n;\ l = 1, \ldots, N;\ (k, l) \neq (i, j)]$ . (5-5)
Note that in many cases it is neither easy nor important to categorize
the type of modeling being employed. It might be unclear for example in what
category one should place a model where the time slice was fairly broad,
for example, where monthly averages of daily values were compared to one
another. If the daily values are considered the basic data, then the
model is a combined model; if the monthly averages are considered the
basic data, then the model is a cross-measurement model. It is clearly
less important to categorize a model than to create and use it appropriately.
5.3 TYPES OF INCONSISTENCIES
There are several types of inconsistencies one might be interested
in detecting in the data:
1. Abrupt, but persistent, changes;
2. Slow nonstationarities; and
3. Anomalous data (abrupt, nonpersistent changes).
Let us discuss these categories of problem and formulation of models for
their solution.
Abrupt, Persistent Changes
The change in the data may occur suddenly in time, i.e., at an identi-
fiable point in time.
There are generally two types of abrupt, persistent changes of interest:
1. Malfunctioning measurement or recording devices - If a measuring
device suddenly begins to malfunction, it will generally continue
to malfunction until repaired or replaced. The motivation for
detecting such a problem is obvious. In the present categoriza-
tion, we intend to mean by an abrupt, persistent change a change
in the underlying model which occurs over a relatively short
period of time. This is as distinguished from slow changes
or short-term changes.
2. Changes in the system - We refer to major changes in the system
which occur over a short period of time such as the opening of
a new freeway or the opening of a major indirect source. As
well as permanent changes, there may be temporary but signifi-
cant changes, such as if a city were to host the Olympic Games.
Without specific attention to such events, the conclusions of
an analysis could be misleading. The analysis of this type
of abrupt change has been called "intervention analysis" by
Box and Tiao [1].
There is also clearly a matter of degree. An event can have a rela-
tively mild effect, as might the closing of several on-ramps to a freeway.
One output of a consistency analysis should be a measurement of the de-
gree of inconsistency.
This category of inconsistency has the basic character of having
a significantly different relationship between variables in the time
periods before the event and after the event. The point in time sepa-
rating the two periods is assumed unknown (since the purpose of a con-
sistency analysis is to discover such points).
The first of two basic technical approaches to this problem consists
of creating a series of models and searching for a statistically significant
change in model structure or parameters. One may create a model over
the interval $[t_1, t_k]$ and predict $x_i(t_{k+1})$. If the prediction is
consistent with observation, then a model over $[t_1, t_{k+1}]$ is created
to predict $x_i(t_{k+2})$, and so on, until a discrepancy occurs. A simple
modeling technique or recursive procedure is probably a requirement if
a high computing cost is to be avoided.
The second approach does not require as abrupt a change as the first
but may be more computationally demanding. Here, one can create two models,
one for the period $[t_1, t_k]$ and one for the period $[t_k, t_N]$. One can
calculate an appropriate measure of the difference in the models, say $D_k$.
Repeating this for varying breakpoints $t_k$, one can determine the value at
which the difference $D_k$ is maximized, presumably the point when the change
occurred.
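The second approach can be sketched with the simplest possible model, a constant level fitted to each segment, so that the difference measure is the absolute difference between the two segment means. This choice of model and of D_k, the minimum segment length, and the readings are all assumptions made for illustration; in this sketch the two segments do not share the breakpoint sample.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)

def find_breakpoint(series, min_segment=2):
    """Return (k, D_k) maximising D_k = |mean(series[k:]) - mean(series[:k])|.

    k is the number of samples in the first segment, so the change is
    located between samples k - 1 and k (zero-based).
    """
    best_k, best_d = None, -1.0
    for k in range(min_segment, len(series) - min_segment + 1):
        d_k = abs(mean(series[k:]) - mean(series[:k]))
        if d_k > best_d:
            best_k, best_d = k, d_k
    return best_k, best_d

# Hypothetical readings with an abrupt, persistent shift after the sixth value.
readings = [5, 5, 6, 5, 5, 6, 9, 10, 9, 10, 9, 10]
k, d_k = find_breakpoint(readings)
```

Each candidate breakpoint requires refitting both segments, which is why this approach costs more than the sequential one; running sums would reduce that cost for the constant-level model used here.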
Slow Nonstationarities
Many types of change will occur gradually over a period of time.
For example, the retrofitting of emission control devices in automobiles
in California was mandated by law to occur in a month-by-month fashion
depending upon the digit of the car owner's license plate. The slow in-
troduction of the retrofitting might affect the time sequence of air
quality measurements. Another example is a slow but significant drift
in a measuring instrument. Such an inconsistency would be detected as
a systematic change in the appropriate model over time as opposed to an
abrupt inconsistency.
As with abrupt changes, categorizations of slow nonstationarities
are possible. They may be related both to measurement device drift or
to changes in the system, and they may be both temporary and permanent.
(An example of a temporary but slow nonstationarity is a slow but definite
degradation in the degree of compliance with the 55-miles-per-hour
speed limit.)
The most straightforward approach to this problem is to postulate
the form of the nonstationarity and test for it. For example, two
air-quality monitoring stations near each other might measure the same
pollutant, recording $x_1(t)$ and $x_2(t)$, respectively. One could then do a
linear regression of day-to-day changes of the stations against one
another, i.e., find the best-fit linear relationship between
$v_1(t_k) = x_1(t_k) - x_1(t_{k-1})$
and
$v_2(t_k) = x_2(t_k) - x_2(t_{k-1})$
for k = 2, 3, ..., N. The result will be of the form
$v_1 = a\,v_2 + b$ .
One can then test statistically whether b is significantly different from
zero. If it is, the values measured by one station are drifting relative
to the other. Unless this can be explained by a constantly increasing
(or decreasing) emission source affecting one of the stations selectively,
it is an inconsistency.
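The drift test can be sketched as an ordinary regression of one station's day-to-day changes on the other's, followed by a t-statistic for the intercept b. The synthetic station records, the drift of 0.1 unit per day, and the rough rule that |t| above about 2 indicates significance are all assumptions for illustration.

```python
import math

def drift_test(x1, x2):
    """Regress day-to-day changes of x1 on those of x2.

    Returns (a, b, t_b): slope, intercept, and the t-statistic of the intercept.
    """
    v1 = [x1[k] - x1[k - 1] for k in range(1, len(x1))]
    v2 = [x2[k] - x2[k - 1] for k in range(1, len(x2))]
    n = len(v1)
    m1, m2 = sum(v1) / n, sum(v2) / n
    s22 = sum((v - m2) ** 2 for v in v2)
    s12 = sum((u - m1) * (v - m2) for u, v in zip(v1, v2))
    a = s12 / s22
    b = m1 - a * m2
    sse = sum((u - (a * v + b)) ** 2 for u, v in zip(v1, v2))
    se_b = math.sqrt(sse / (n - 2) * (1.0 / n + m2 ** 2 / s22))
    return a, b, b / se_b

# Synthetic daily records: station 1 tracks station 2 plus a small wobble,
# with or without a steady drift of 0.1 unit per day.
days = range(60)
station2 = [10.0 + 3.0 * math.sin(0.7 * t) for t in days]
wobble = [0.2 * math.cos(1.3 * t) for t in days]
station1_drifting = [s + w + 0.1 * t for s, w, t in zip(station2, wobble, days)]
station1_steady = [s + w for s, w in zip(station2, wobble)]

a_d, b_d, t_d = drift_test(station1_drifting, station2)
a_s, b_s, t_s = drift_test(station1_steady, station2)
```

Differencing turns a steady drift in the levels into a constant offset in the changes, which is why the test is placed on the intercept rather than the slope.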
Another approach is to compare a model created on $[t_1, t_k]$ with a
model created on $[t_{k+\delta}, t_N]$, where the time gap $\delta$ between periods modeled
is sufficient to detect a slow drift. This approach requires fewer as-
sumptions regarding the form of a possible nonstationarity.
Anomalous Data
This type of inconsistency might be categorized as a "noisy" measure-
ment. It could be caused by erroneous recording or digitization of the
data by a temporarily malfunctioning instrument or by an anomalous occur-
rence such as might be caused by sidewalk repairs raising dust near a
site monitoring suspended particulate levels. Such an occurrence is a
short-term abrupt inconsistency in either a time sequence or cross-
measurement model. It is a relatively conventional type of problem en-
countered in data analysis and is often referred to as "outlier analysis."
This problem can be approached in the single variable case by
studying extreme values detected by creating a histogram (the empirical
distribution) of measured values. The more variables measured, the
greater the potential for outliers which are not obvious by looking at
individual variables. (The classical example is the existence of a
"pregnant male" in a medical data base; neither "pregnant" nor "male"
is illegal, only the combination.) In the multivariate case, the most
general class of techniques for detecting outliers is "cluster analysis" [10].
Very small clusters of points or single-point clusters in multivariate
space are inconsistencies which should be examined.
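As a minimal sketch of the multivariate idea, the following Python groups points that lie within a fixed distance of one another and reports the members of very small groups. It is one simple clustering rule (single linkage at a fixed threshold) chosen for illustration, not the specific method of [10]; the function name `small_clusters`, the `radius` value, and the data are hypothetical, and the variables are assumed to have been scaled to comparable units beforehand.

```python
import math

def small_clusters(points, radius, max_size=1):
    """Link any two points closer than `radius`, form the resulting
    connected groups, and return the indices of points that belong
    to groups of at most `max_size` members."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= radius:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(i for members in clusters.values()
                  if len(members) <= max_size for i in members)

# Hypothetical two-variable data. The last point is unremarkable in
# each variable taken alone (1.0 and 5.0 both occur elsewhere), but
# the combination occurs nowhere else.
observations = [
    (1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2),
    (5.0, 5.0), (5.1, 4.8), (4.9, 5.2), (5.2, 5.1),
    (1.0, 5.0),
]
print(small_clusters(observations, radius=1.0))  # -> [8]
```

A histogram of either variable alone would not isolate the flagged observation, which is the point of the "pregnant male" example.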
5.4 DIFFICULTIES
The major technical difficulties in consistency analysis are, first,
nonlinearities and, second, a lack of data relative to the number of variables
whose relationship is to be modeled. Most air quality and meteorological
parameters are nonlinearly related. Further, it often takes a
large number of variables to determine with accuracy other meteorological
or air-quality variables. This means that the diversity of joint obser-
vations of values of a large number of variables that one can expect in
a given data base or at the start of a measurement program is limited.
Compounding the problem, nonlinear models will, in general, require more
parameters than linear models and, hence, require more data for accurate
model determination.
These problems can be alleviated by both technical and operational
solutions. A technical consideration is that an efficient (low-parameter)
nonlinear form will require less data for the determination of the model
than an inefficient (overparameterized) nonlinear form; hence, efficient
functional forms, such as continuous piecewise linear functions, can help
alleviate this problem. A second technical point is that a set of models
of relatively simple form can be created with subsets of the relevant
variables.
The operational consideration is that one may
be able to tolerate a high level of "false alarms" in detecting inconsistencies
at the beginning of a data collection project or in analyzing a
data base in the initial stages. It is at this early point in most data
collection or data analysis efforts that most of the problems are en-
countered. As more data is collected, the model will become more re-
fined and flag fewer potential inconsistencies.
Another possible problem is the inclusion of inconsistencies into
the model. Without care, the data can be modeled including inconsis-
tencies in such a way that the inconsistencies are fitted and do not be-
come apparent as a discrepancy in the model. This pitfall can be avoided
by simply employing good data-analytic practices to avoid overfitting.
For many projects in data collection and analysis, the use of con-
ventional tools in a careful manner can provide a systematic analysis of
consistency which may avoid erroneous analyses and a great deal of wasted
effort.
6. REPRO-MODELING: EMPIRICAL APPROACHES TO THE
UNDERSTANDING AND EFFICIENT USE OF COMPLEX AIR QUALITY MODELS
Several computer-based mathematical models derived from basic
physical principles have been constructed to model air pollution and
meteorological phenomena. The diversity of inputs to such models and the
typically long running times often make it difficult to understand the
full implications of the models or to use the models in certain planning
applications where large numbers of alternatives must be rapidly evalu-
ated. The concept of "repro-modeling" is to treat a model as a source
of data for an empirical analysis [16]. Such an analysis will, in general,
have two major objectives:
1. To understand the implication of the model by discovering
which variables most affect the outputs of interest and
in what way they affect the outputs of interest; and
2. To construct as a simple functional form a model of the
relationship between key independent variables and key
model outputs.
Since this approach has been a subject of a previous EPA contract,
in which the technique of repro-modeling was applied to a reactive dis-
persive model of photochemical pollutant behavior in the Los Angeles
basin [12], we will not discuss it in further depth in this report. We
do wish to emphasize the role of such an analysis in evaluating, vali-
dating and comparing models, as well as in suggesting to modelers the
characteristics which a current version of the model implies which might
bear further investigation.
One point in earlier discussions of repro-modeling which has not
been emphasized is its use in model validation and sensitivity analysis.
Often sensitivity analysis is performed on models in order to determine
which parameters of the model are most critical in determining the model
output [23]. The change in model output with a small change in a given
parameter or input value is the sensitivity of the model to that param-
eter. Since the sensitivity of a model to a particular parameter will,
in general, depend upon the values of the other parameters, classical
sensitivity analysis is usually performed in one of two ways:
1. One set of typical values for the parameters and inputs is
chosen, and small changes in the parameters about that
nominal condition are made in order to examine
sensitivity. This obviously indicates only the sensitivity
at the particular nominal condition chosen.
2. A "factorial" analysis is performed, where a number of diverse
nominal values are chosen and the above analysis repeated for
this large number of diverse conditions. This exercises the
full range of potential operation of the model, but creates
the problem of commensurating the implications of what are
often thousands of model runs. It also has the obvious dis-
advantage of requiring a large number of model runs.
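The two classical procedures can be illustrated with a short Python sketch. Everything named here is an assumption made for illustration: `toy_model` is a hypothetical stand-in for an expensive simulation (output proportional to emissions divided by wind speed and mixing height), and the factor levels are arbitrary. The sketch estimates local sensitivities by central differences at one nominal point and then repeats the estimate over a small factorial grid.

```python
import itertools

def local_sensitivity(model, nominal, rel_step=1e-4):
    """Central-difference estimate of d(output)/d(parameter_i) at `nominal`.
    Costs two model runs per parameter."""
    sens = []
    for i, p in enumerate(nominal):
        h = rel_step * max(abs(p), 1.0)
        up = list(nominal)
        up[i] = p + h
        down = list(nominal)
        down[i] = p - h
        sens.append((model(up) - model(down)) / (2 * h))
    return sens

def toy_model(params):
    # Hypothetical stand-in for a complex model.
    emissions, wind, mixing = params
    return emissions / (wind * mixing)

# Procedure 1: a single nominal condition.
print(local_sensitivity(toy_model, (100.0, 2.0, 500.0)))

# Procedure 2: a factorial set of nominal conditions.
levels = {
    "emissions": [50.0, 100.0, 200.0],
    "wind": [1.0, 2.0, 5.0],
    "mixing": [250.0, 500.0, 1000.0],
}
grid = list(itertools.product(*levels.values()))
table = {point: local_sensitivity(toy_model, point) for point in grid}
model_runs = len(grid) * 2 * len(levels)  # 27 points * 6 runs each
print(len(grid), "nominal points,", model_runs, "model runs")
```

Even this three-parameter, three-level grid yields 27 separate sensitivity vectors to reconcile, which illustrates the interpretation problem noted in item 2.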
If one is willing to perform a given number of model runs to get
a number of nominal points for a sensitivity analysis, it is more ef-
ficient, rather than to do a sensitivity analysis at each point, to fit
the points with an appropriate functional form such as a continuous
piecewise linear form [16]. As demonstrated in the referenced report,
this results in regimes in which the model output is a linear function
of the model inputs and/or parameters and the sensitivity to those pa-
rameters and inputs is quite clearly displayed. This approach automati-
cally determines those regimes in which the sensitivity is relatively
constant over a large area of parameter/input variations. This "global"
sensitivity analysis approach can be more easily interpreted and more
efficient than a "local" sensitivity analysis approach.
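A heavily simplified sketch of the idea follows, restricted to one input and a single breakpoint; the procedure of [16] is multivariate and is not reproduced here. The sketch fits $y = c_0 + c_1 x + c_2 \max(0, x - \text{knot})$ by least squares for each candidate knot and keeps the best fit, so that the slope in each regime can be read off directly as the sensitivity there. The function names, the candidate knots, and the stand-in "model runs" are all hypothetical.

```python
def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    n = 3
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def fit_hinge(xs, ys, knot):
    """Least-squares fit of y = c0 + c1*x + c2*max(0, x - knot)."""
    rows = [(1.0, x, max(0.0, x - knot)) for x in xs]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    b = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    coef = solve3(A, b)
    sse = sum((y - sum(c * v for c, v in zip(coef, r))) ** 2
              for r, y in zip(rows, ys))
    return coef, sse

def fit_piecewise_linear(xs, ys, candidate_knots):
    """Try each candidate knot; keep the fit with the smallest squared error."""
    fits = [(fit_hinge(xs, ys, k), k) for k in candidate_knots]
    (coef, sse), knot = min(fits, key=lambda t: t[0][1])
    return {"knot": knot, "slope_below": coef[1],
            "slope_above": coef[1] + coef[2], "sse": sse}

# Hypothetical "model runs": output rises steeply, then more slowly.
xs = [0.5 * i for i in range(21)]                      # inputs 0.0 .. 10.0
ys = [2.0 * x if x < 5 else 10.0 + 0.5 * (x - 5) for x in xs]
result = fit_piecewise_linear(xs, ys, candidate_knots=range(1, 10))
print(result)
```

The two fitted slopes summarize the sensitivity over each whole regime, in place of a separate local estimate at every nominal point.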
7. OTHER APPLICATION AREAS
Two additional topics are treated briefly here. The brevity is not
related to a judgment of importance, but simply to the limited nature of
the remarks.
7.1 HEALTH EFFECTS OF AIR POLLUTION
Empirical approaches (in particular, linear and nonlinear regression
techniques) have been employed in estimating the effects of air pollution
levels on health. The main difficulty encountered in this type of anal-
ysis is that of determining an incremental effect on respiratory health
measurements which are often dominated by vagaries of general health prob-
lems such as flu epidemics or of individual differences such as the habit
of smoking or occupational environment. Yet, very strong relations must
be derived if causal effects are implied. In such conditions, the best
hope for improvement is in more highly controlled data collection efforts
(which are, however, very expensive).
This situation highlights an important aspect of data analysis pro-
jects: A legitimate result of the analysis is a negative conclusion, a
conclusion that the data does not admit of reliable results. A negative
result is constructive to the degree that it makes the strong statement
that the information desired is not present in the data; this settles the
matter unless the data base is augmented. A less conclusive culmination
of a data analysis effort is a limited negative statement, for example,
a conclusion that no linear function of the independent variables predicts
the desired variable with statistically significant accuracy.
We note, however, that a negative conclusion does not necessarily im-
ply a faulty data collection effort; it may instead imply that the rela-
tionship of interest is less pronounced than initially expected relative
to the effect of uncontrolled (or unmeasured) variables. Unfortunately,
a well-conceived data analysis or collection effort is often labeled a
failure when only negative results are produced—a charge which implies
that the knowledge which the study was designed to elicit should have
been obvious before the data was collected.
7.2 SHORT-TERM FORECASTING OF POLLUTANT LEVELS
The forecasting of pollutant levels the next day is of importance
for health warning systems and/or to initiate short-term control procedures.
Forecasting pollution levels and forecasting the weather are closely related
problems; it is not clear which is the more difficult, but certainly
neither is easy. The empirical approach attempts to model directly the
relation implicit in measured meteorological and air-quality data.
Persistence (i.e., assuming tomorrow's peak pollutant concentration
equals today's peak concentration) usually proves a reliable forecast at
lower pollution levels, but not necessarily at high levels when accuracy
is most critical [13]. Certainly persistence will not predict a high
pollutant level on a day following one with a low pollutant level. Regression
or time-series approaches tend to exploit persistence and may not be best
suited to a situation where the determinants of the future pollution level
can be considerably different depending on the level. Further, the
performance estimate evaluating the results can be misleadingly promising
due to the number of low or intermediate pollution days usually included
in the analysis.
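The effect on the performance estimate can be seen in a small Python sketch. The daily peak values and the threshold of 20 are invented for illustration, and the function name `persistence_errors` is hypothetical. The sketch scores the persistence forecast (tomorrow's peak equals today's) once over all days and once over only those days whose observed peak was high.

```python
def persistence_errors(peaks, high_threshold):
    """Mean absolute error of the persistence forecast, overall and on
    days whose observed peak is at or above `high_threshold`."""
    pairs = list(zip(peaks[:-1], peaks[1:]))  # (forecast, observed)
    all_err = [abs(obs - fc) for fc, obs in pairs]
    high_err = [abs(obs - fc) for fc, obs in pairs if obs >= high_threshold]
    mae_all = sum(all_err) / len(all_err)
    mae_high = sum(high_err) / len(high_err) if high_err else float("nan")
    return mae_all, mae_high

# Hypothetical daily peak concentrations containing one two-day episode.
peaks = [8, 9, 8, 10, 9, 30, 28, 9, 8, 10, 9, 8]
mae_all, mae_high = persistence_errors(peaks, high_threshold=20)
print(f"all days: {mae_all:.2f}   high days only: {mae_high:.2f}")
```

In this example the overall error is less than half the error on the high days, because nine of the eleven forecasts are for low days where persistence is nearly exact.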
Classification analysis is probably a more natural approach to the
problem. The joint distribution of attributes (i.e., descriptive variables)
of high-pollution days can be derived by looking at high-pollution days
alone and can be compared to the joint distribution of attributes of
intermediate-pollution days and to the joint distribution of attributes
of low-pollution days. The variables of importance in distinguishing the
three classes can be determined, and an algorithm to predict the classes can be
derived. Such an approach will build into the methodology the natural discontinuities
inherent in the onset and conclusion of a high air pollution
episode.
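A minimal sketch of the classification approach follows, assuming days have already been labeled low, intermediate, or high. It uses two deliberately crude stand-ins: a between-class to within-class sum-of-squares ratio to rank variables, and a nearest-class-mean rule as the prediction algorithm. Neither is claimed to be the method intended in the text, which speaks of full joint distributions. The attribute names and values are hypothetical.

```python
import math
from collections import defaultdict

def class_means(X, labels):
    """Mean attribute vector of each class."""
    groups = defaultdict(list)
    for row, lab in zip(X, labels):
        groups[lab].append(row)
    return {lab: [sum(col) / len(rows) for col in zip(*rows)]
            for lab, rows in groups.items()}

def separation_scores(X, labels):
    """Between-class over within-class sum of squares for each variable;
    larger values suggest the variable helps distinguish the classes."""
    means = class_means(X, labels)
    counts = defaultdict(int)
    for lab in labels:
        counts[lab] += 1
    grand = [sum(col) / len(X) for col in zip(*X)]
    scores = []
    for j in range(len(grand)):
        between = sum(counts[lab] * (m[j] - grand[j]) ** 2
                      for lab, m in means.items())
        within = sum((row[j] - means[lab][j]) ** 2
                     for row, lab in zip(X, labels))
        scores.append(between / within if within > 0 else math.inf)
    return scores

def predict(means, row):
    """Assign the class whose mean is closest (Euclidean distance)."""
    return min(means, key=lambda lab: math.dist(means[lab], row))

# Hypothetical attributes per day:
# (inversion strength, wind speed, an unrelated variable).
X = [(1, 6, 3), (2, 7, 5), (1, 6, 4),
     (4, 4, 5), (5, 4, 3), (4, 3, 4),
     (8, 1, 4), (9, 2, 3), (8, 1, 5)]
labels = ["low"] * 3 + ["intermediate"] * 3 + ["high"] * 3

scores = separation_scores(X, labels)
means = class_means(X, labels)
print("separation scores:", [round(s, 2) for s in scores])
print("forecast for (7, 2, 4):", predict(means, (7, 2, 4)))
```

Because the class is predicted directly, a jump from a low day to a high day is an ordinary outcome of the rule and not an extrapolation, which is the property the text attributes to classification analysis.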
REFERENCES
1. Box, G.E.P., and G.C. Tiao, "Intervention Analysis with Applications
to Economic and Environmental Problems," Technical Report No. 335,
Department of Statistics, University of Wisconsin, Madison, Oct. 1973.
2. Breiman, Leo, and W.S. Meisel, "General Estimates of the Intrinsic
Variability of Data in Nonlinear Regression Models," TSC Report,
Technology Service Corp., Santa Monica, Calif., October 1974, to be
published in J. Amer. Stat. Asso.
3. Breiman, Leo, and W.S. Meisel, "Short-term Changes in Ground-Level
Ozone Concentrations: An Empirical Analysis," Part III of "The
Role of Empirical Methods in Air Quality and Meteorological Analyses,"
Final Report for EPA Contract No. 68-02-1704, October 1975.
4. Bruntz, S.M., W.S. Cleveland, B. Kleiner and J.L. Warner, "The Depen-
dence of Ambient Ozone on Solar Radiation, Wind, Temperature, and
Mixing Height," Proc. Symp. on Atmospheric Diffusion and Air Pollution,
Santa Barbara, Calif., Sept. 9-13, 1974, American Meteorological
Society, Boston, Mass.
5. Calder, K.L., "Some Miscellaneous Aspects of Current Urban Pollution
Models," Proc. Symp. on Multiple Source Urban Diffusion Models, EPA,
Research Triangle Park, No. Carolina, 1970.
6. Calder, K.L., Quoted by Niels Busch in the proceedings of the fourth
meeting of the NATO/CCMS Panel on Air Pollution Modeling, from a
letter written in March 1973.
7. Calder, K.L., W.S. Meisel, and M.D. Teener, "Feasibility Study of a
Source-Oriented Empirical Air Quality Model," (Part II of "Empirical
Techniques for Analyzing Air Quality and Meteorological Data"), Final
Report on EPA Contract No. 68-02-1704, Dec. 1975.
8. Calder, K.L., "A Narrow Plume Simplification for Multiple Source Urban
Pollution Models," (informal unpublished note), Dec. 1969.
9. "Calibration Report: LAAPCD Method More Accurate; ARB More Precise,"
Calif. Air Resources Board Bulletin, Vol. 5, No. 11 (Dec. 1974), pp 1-2.
10. "Cluster Analysis," Chapter VIII of W.S. Meisel, Computer-Oriented
Approaches to Pattern Recognition, Academic Press, 1972.
11. Gronskei, K.E., E. Joranger and F. Gram, "Assessment of Air Quality in
Oslo, Norway," Published as Appendix D to the NATO/CCMS Air Pollution
Document "Guidelines to Assessment of Air Quality (Revised) SO , TSP,
CO, HC, NOX Oxidants," Norwegian Institute for Air Research, Kjeller,
Norway, Feb. 1973. (This document may be obtained from the Air Pollu-
tion Technical Information Center, Office of Air and Water Programs,
Environmental Protection Agency, Research Triangle Park, No. Carolina.)
12. Horowitz, A. and W.S. Meisel, "The Application of Repro-Modeling to the
Analysis of a Photochemical Air Pollution Model," EPA Report No.
EPA-650/4-74-001, NERC, Research Triangle Park, No. Carolina, Dec. 1973.
13. Horowitz, A. and W.S. Meisel, "On-time Series Models in the Short-term
Forecasting of Air Pollution Concentrations," Technology Service Corpo-
ration Report No. TSC-74-DS-101, Santa Monica, CA, Aug. 22, 1974.
14. Hrenko, J.M. and D.B. Turner, "An Efficient Gaussian-Plume Multiple
Source Air Quality Algorithm," Paper 75-04.3, 68th Annual APC Meeting,
Boston, June 1975.
15. Meisel, W.S., "Empirical Approaches to Air Quality and Meteorological
Modeling," Proc. of Expert Panel on Air Pollution Modeling, NATO Com-
mittee on Crises in Modern Society, Riso, Denmark, June 6, 1974. (This
document may be obtained from the Air Pollution Technical Information
Center, Office of Air and Water Programs, Environmental Protection
Agency, Research Triangle Park, No. Carolina 27711.)
16. Meisel, W.S. and D.C. Collins, "Repro-Modeling: An Approach to Efficient
Model Utilization and Interpretation," IEEE Transactions on Systems, Man,
and Cybernetics, Vol. SMC-3, No. 4, July 1973, pp 349-358.
17. Meisel, W.S., Computer-Oriented Approaches to Pattern Recognition,
Academic Press, New York, 1972.
18. Nadaraya, E.A., "On Estimating Regression," Theor. Probability Appl.,
Vol. 4, pp 141-142, 1965.
19. Nadaraya, E.A., "On Non-parametric Estimates of Density Functions and
Regression Curves," Theor. Probability Appl., Vol. 5, pp 186-190, 1965.
20. Nadaraya, E.A., "Remarks on Non-parametric Estimates for Density Func-
tions and Regression Curves," Theor. Probability Appl., Vol. 15,
pp 134-137, 1970.
21. Rosenblatt, M., "Conditional Probability Density and Regression Esti-
mators," Multivariate Analysis, Vol. II, pp 25-31, Academic Press,
New York, 1969.
22. Smith, F.B. and G.H. Jeffrey, "The Prediction of High Concentrations
of Sulfur Dioxide in London and Manchester Air," Proc. 3rd Meeting of
NATO/CCMS Expert Panel on Air Pollution Modeling, Paris, Oct. 2-3, 1972.
23. Thayer, S.D. and R.C. Koch, "Sensitivity Analysis of the Multiple-Source
Gaussian Plume Urban Diffusion Model," Preprint volume, Conference on
Urban Environment, Oct. 31-Nov. 2, 1972, Philadelphia, Pennsylvania
(published by American Meteorological Society, Boston, Mass.).
24. Tiao, G.C., G.E.P. Box, and W.J. Hamming, "Analysis of Los Angeles
Photochemical Smog Data: A Statistical Overview," Technical Rept.
No. 331, Dept. of Statistics, U. of Wisconsin, April 1973.
25. Tiao, G.C., et al., "Los Angeles Aerometric Ozone Data 1955-1972,"
Technical Rept. No. 346, Dept. of Statistics, U. of Wisconsin, Oct. 1973.
26. Wayne, Kokin, and Weisburd, "Controlled Evaluation of Reactive Environ-
mental Simulation Model (REM)," Vols. I & II, NTIS PB 220 456/8 and
PB 220 457/6, Feb. 1973.
-------
TECHNICAL REPORT DATA
(Please read Instructions on the reverse before completing)
1. REPORT NO.: EPA-600/4-76-029a
2.
3. RECIPIENT'S ACCESSION NO.:
4. TITLE AND SUBTITLE: EMPIRICAL TECHNIQUES FOR ANALYZING AIR
QUALITY AND METEOROLOGICAL DATA. Part I. The Role of
Empirical Methods in Air Quality and Meteorological
Analyses
5. REPORT DATE: July 1976
6. PERFORMING ORGANIZATION CODE:
7. AUTHOR(S): W.S. Meisel
8. PERFORMING ORGANIZATION REPORT NO.: TSC-PD-132-2
9. PERFORMING ORGANIZATION NAME AND ADDRESS:
Technology Service Corporation
2811 Wilshire Boulevard
Santa Monica, California 90403
10. PROGRAM ELEMENT NO.: 1AA009
11. CONTRACT/GRANT NO.: EPA 68-02-1704
12. SPONSORING AGENCY NAME AND ADDRESS:
Environmental Sciences Research Laboratory
Office of Research and Development
U.S. Environmental Protection Agency
Research Triangle Park, North Carolina 27711
13. TYPE OF REPORT AND PERIOD COVERED: Final, May 74-Dec 75
14. SPONSORING AGENCY CODE: EPA-ORD
15. SUPPLEMENTARY NOTES:
This is the first of three reports examining the potential role of state-of-the-art
empirical techniques in analyzing air quality and meteorological data.
16. ABSTRACT:
Empirical methods have found limited application in air quality and
meteorological analyses, largely because of a lack of good data and the large
number of variables in most applications. More and better data, along with
advances in methodology, have broadened the applicability of empirical approaches.
This report illustrates the types of problems for which creative empirical
approaches have the potential for significant contributions. The results of two
pilot projects are reported in some detail.
17. KEY WORDS AND DOCUMENT ANALYSIS
a. DESCRIPTORS:
*Air pollution
*Meteorological data
*Atmospheric diffusion
*Mathematical models
*Environment simulation
b. IDENTIFIERS/OPEN ENDED TERMS:
c. COSATI Field/Group: 13B, 04B, 04A, 12A, 14B
18. DISTRIBUTION STATEMENT: RELEASE TO PUBLIC
19. SECURITY CLASS (This Report): UNCLASSIFIED
20. SECURITY CLASS (This page): UNCLASSIFIED
21. NO. OF PAGES: 73
22. PRICE:
EPA Form 2220-1 (9-73)