Empirical Methods in the Analysis of Air Quality and Meteorological Problems: Draft


EMPIRICAL METHODS IN THE ANALYSIS OF
AIR QUALITY AND METEOROLOGICAL
PROBLEMS
William S. Meisel
Interim Report
December 1974


Technology Service Corporation

-------
Technology Service Corporation
2811 Wilshire Boulevard
Santa Monica, California 90403
(213) 829-7411
DRAFT: For Internal EPA Review
EMPIRICAL METHODS IN THE ANALYSIS OF
AIR QUALITY AND METEOROLOGICAL
PROBLEMS
William S. Meisel
Interim Report
December 1974
Contract No. 68-02-1704
EPA Project Officer: Ken Calder
Meteorology Laboratory
National Environmental Research Center
Research Triangle Park, North Carolina 27711
Prepared for
OFFICE OF RESEARCH AND DEVELOPMENT
U.S. ENVIRONMENTAL PROTECTION AGENCY
WASHINGTON, D.C. 20460

-------
PREFACE
This interim report serves two functions: it is (1) an outline
of the proposed Phase II projects, for comment, and (2) a draft of
the first volume of the three-volume final report:
I. Empirical Methods in the Analysis of Air Quality
and Meteorological Problems.
II. A Source-Oriented Empirical Model of the Dispersion
of Air Pollutants.
III. The Oxidant Formation Process in the Los Angeles Basin:
An Empirical Analysis.
This volume will be revised to serve as the introductory volume indicated;
the revised version will be delivered with the final report.
Discussions with EPA personnel led to inclusion of many of the sub-
jects dealt with in this report. The project monitor, Ken Calder, took
a particularly active and constructive role. Advice from Leo Breiman
and Alan Horowitz at Technology Service Corporation further improved the
report.

-------
TABLE OF CONTENTS
1.0 INTRODUCTION
2.0 A SOURCE-ORIENTED EMPIRICAL MODEL
2.1	Motivation
2.2	Formulation
2.3	Feasibility
2.4	The Inverse Problem
2.5	Testing the Approach
2.6	Research Plan
3.0 EMPIRICAL ANALYSIS OF THE OXIDANT FORMATION PROCESSES
IN THE LOS ANGELES BASIN
3.1	Motivation
3.2	The Data
3.3	The Problem Formulation
3.4	Research Plan
4.0 EXTRACTION OF EMISSION TRENDS FROM AIR QUALITY TRENDS
4.1	Motivation
4.2	Report of a Comparison of Emission Levels
over Two Time Periods
4.3	Generalization and Mathematical Formulation
5.0 DETECTION OF INCONSISTENCIES IN AIR QUALITY/
METEOROLOGICAL DATA BASES
5.1	Motivation
5.2	Formulation of Consistency Models
5.3	Types of Inconsistencies
5.4	Difficulties
6.0 REPRO-MODELING: EMPIRICAL APPROACHES TO THE
UNDERSTANDING AND EFFICIENT USE OF COMPLEX
AIR QUALITY MODELS
7.0 OTHER APPLICATION AREAS
7.1	Spatial Interpolation of Meteorological and
Air Quality Measurements
7.2	Health Effects of Air Pollution
7.3	Short-term Forecasting of Pollutant Levels
8.0 REFERENCES

-------
1.0 INTRODUCTION
The increased availability of appropriate data bases and improve-
ments in methodology have led to the increasing use of empirical and
statistical approaches to the analysis of air quality and meteorological
problems [1]. This volume suggests how these approaches might be applied
to a number of problems of interest to the Environmental Protection Agency,
particularly those with a meteorological aspect.
The objectives of this report are limited. The applications dis-
cussed at most length are those where empirical approaches have
not been fully exploited or where the problem can be formulated in an
innovative manner. The discussions are intended to highlight opportunities
rather than to provide detailed plans for problem solution. As part of
the present project, two problem areas will be explored as pilot studies
to more fully demonstrate modern empirical techniques; these projects will
be reported in separate volumes.
The subjects discussed in this volume are the following:
A source-oriented empirical model: It has often been assumed that
it is impractical to derive an empirical model relating emission
source distribution and meteorology to the resulting pollutant
concentration distribution. The basic argument against an empirical
approach has been the dearth of detailed emission inventories in
comparison to the relative abundance of air quality data. An ap-
proach is proposed whereby an empirical
meteorological dispersion function can be derived by indirect
means.

Empirical analysis of the oxidant formation processes: Empirical
approaches to the problem of determining the relationship between
oxidant precursor (HC and NOx) concentrations and resulting am-
bient oxidant levels are discussed.
Extraction of emission trends from air quality trends: The esti-
mation of air quality trends from air quality measurements is
complicated by the effect of meteorology. We discuss the deter-
mination of a "meteorologically adjusted" trend, i.e., a trend
more nearly related to the emissions trend.
Detection of inconsistencies in air quality/meteorological data
bases: In any data collection or data analysis effort, a major
concern is the integrity of the data. It is important to de-
tect problems with monitoring equipment or monitoring methods and
to note any important changes in the system monitored so that
such errors or changes do not distort the analysis of the data
or invalidate a portion of the data collected. We discuss auto-
matic procedures for detecting inconsistencies.
Empirical approaches to the understanding and efficient use of
complex air quality models: Computer-based models derived from
physical principles are tools which often should be analyzed
themselves for the sake of extracting their implications, for
modeling aspects of their behavior to reduce input data require-
ments and running time, for validation, or to suggest further
areas for model improvement. Model-generated input/output data
can be so analyzed by empirical techniques.

Spatial interpolation of meteorological and air quality measure-
ments: Interpolation of variables such as wind field or pollutant
concentration is of interest in several applications. We discuss
some general aspects of this problem.
Health effects of air pollution: We comment on this area in
which empirical approaches are at present heavily employed.
Short-term pollutant level forecasting: Short-term forecasting
for health warning systems or to invoke temporary controls can
be approached empirically. Several pitfalls are highlighted.
In discussions of the above subjects, we attempt to formulate each
problem so that it reduces to a straightforward
data-analytic technique. The techniques of empirical analysis which are
referenced include the following:
1. Hypothesis testing, statistical modeling, and other "classical"
statistical approaches: These "classical" approaches are by no
means without their subtleties or potential for misapplication,
but are the subject of many textbooks and conventional statis-
tics courses.
2. Linear and nonlinear regression: These techniques fit a function
to data to model the relationship between independent variables
and an ordered, many-valued dependent variable. Linear regression
in general, and nonlinear regression in a single independent
variable, are well understood and often used. Nonlinear regression
with multiple independent variables, particularly for the small-
sample case, is more difficult, but significant technical progress
has been made in the last few years.

3. Time-series analysis: Time-series analysis takes advantage of
the serial nature of the data and presumably of the underlying
model. The subject has been studied for many years (sometimes
as "signal processing"), but in recent years new developments
have arisen and the subject has been treated more systematically.
The linear case is much more highly developed than the nonlinear
case; however, not all problems involving time series are best
treated by techniques designed specifically for time series.
4. Classification analysis ("pattern recognition"): These tech-
niques use data to relate independent variables to a class
label (i.e., a possibly unordered, few-valued dependent variable).
Because earlier work in this field was oriented toward developing
hardware devices rather than analyzing data, its power as a data-
analytic tool has only been fully realized in the last few years.
5. Cluster analysis: Cluster analysis does not require a dependent
variable but analyzes the distribution of multivariate data,
i.e., the joint distribution of the independent variables, to
determine distinct groupings of data points in multivariate space.
Much work on the subject has been done recently, and it will be-
come better known when several textbooks in press are published.
Discussions of cluster analysis tend at present to appear as chap-
ters in books on pattern recognition.
Each of these data-analytic subjects is difficult and tends to
have its own language and proponents. Further, few universities
currently encourage students to become broadly based experts in data
analysis. Hence, tradeoffs among techniques are not always made with
obtaining the best problem solution as the only criterion. In the
present report, a sincere attempt has been made to formulate the
problem in the most general terms, pointing out the class of techniques
applicable but seldom specific algorithms.

2.0	A SOURCE-ORIENTED EMPIRICAL MODEL
2.1	Motivation
Multiple-source simulation models for urban air quality based on
meteorological dispersion functions are in broad use [2]; for example, the
Gaussian plume formulation is used in many models including the RAM model
presently in development at the Environmental Protection Agency [3].
The particular form of relationship between source and receptor used
in this formulation was originally developed to describe dispersion from
isolated sources and has been adapted to the urban environment. Because
a source-oriented model is extremely useful in examining the impact of
proposed emission controls, it is of interest to determine if a source-
receptor relationship which provides an alternative to the Gaussian plume
formulation can be determined empirically.
Since one cannot, in general, isolate the effects of single sources in
an urban area to determine the source-receptor relationship, we propose
that the relationship be extracted indirectly by determining a formulation
which will best predict the pollutant concentration distribution, given
the emission distribution and meteorological conditions.
2.2	Formulation
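The basic Gaussian plume equation predicts the concentration at a
point $(x,y,z)$ from a source of unit strength at $(\xi,\eta,\zeta)$ as
$$R(x,y,z;\xi,\eta,\zeta) = \frac{1}{2\pi\bar{u}\,\sigma_y(d)\,\sigma_z(d)} \exp\left[-\frac{(y-\eta)^2}{2\sigma_y^2(d)}\right] \left\{ \exp\left[-\frac{(z-\zeta)^2}{2\sigma_z^2(d)}\right] + \exp\left[-\frac{(z+\zeta)^2}{2\sigma_z^2(d)}\right] \right\} \tag{2-1}$$
where
$\bar{u}$ = mean wind speed,
$\zeta$ = effective source height, and
$\sigma_y(d)$, $\sigma_z(d)$ = horizontal and vertical diffusion functions a distance $d = x - \xi$
downwind from the source.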
The first term within the brackets of Eq. (2-1) denotes the dispersion of
the pollutant in the lateral direction; the second term, in the vertical
direction; and the third term represents the perfect reflection of the
pollutant-bearing diffusive eddies from the surface of the earth, i.e., there
is neither deposition nor reaction at the surface. The coordinates are
aligned such that x is along-wind and y is crosswind. In fact, Eq. (2-1)
holds only when $x-\xi$ is positive; the concentration is assumed to be zero
if the source is downwind of the receptor. The diffusion functions depend on meteorological
parameters, usually mixing layer depth and stability condition.
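As an illustration of the terms just described, the following is a minimal Python sketch of a unit-source Gaussian plume kernel with perfect ground reflection. It assumes the standard textbook form of the plume equation; the power-law diffusion functions `sigma_y` and `sigma_z` and their coefficients are hypothetical placeholders, since the report does not specify them.

```python
import math

def sigma_y(d, a=0.08):
    # Hypothetical lateral diffusion function (illustrative power law only).
    return a * d ** 0.9

def sigma_z(d, b=0.06):
    # Hypothetical vertical diffusion function (illustrative power law only).
    return b * d ** 0.9

def gaussian_plume(x, y, z, xi, eta, zeta, u):
    """Concentration at (x, y, z) from a unit source at (xi, eta, zeta).

    x is along-wind, y is crosswind, u is the mean wind speed,
    and zeta is the effective source height.
    """
    d = x - xi                      # downwind distance from source to receptor
    if d <= 0.0:
        return 0.0                  # source is downwind of the receptor
    sy, sz = sigma_y(d), sigma_z(d)
    lateral = math.exp(-((y - eta) ** 2) / (2.0 * sy ** 2))
    vertical = math.exp(-((z - zeta) ** 2) / (2.0 * sz ** 2))
    reflected = math.exp(-((z + zeta) ** 2) / (2.0 * sz ** 2))
    return lateral * (vertical + reflected) / (2.0 * math.pi * u * sy * sz)
```

In this sketch the concentration is symmetric about the plume centerline, vanishes for receptors upwind of the source, and is inversely proportional to the wind speed.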
The concentration from multiple sources in a region V with a source
strength distribution $Q(\xi,\eta,\zeta)$ is given by superposition:
$$\chi(x,y,z) = \int_V R(x,y,z;\xi,\eta,\zeta)\,Q(\xi,\eta,\zeta)\,d\xi\,d\eta\,d\zeta \tag{2-2}$$
where the integral over the volume V can be abstractly considered to include
both point and area sources.
Following Calder [4], we can assume horizontal homogeneity, as in
the Gaussian formulation:
$$R(x,y,z;\xi,\eta,\zeta) \equiv K(x-\xi,\,y-\eta,\,z,\,\zeta) \tag{2-3}$$
yielding an equivalent to (2-2):
$$\chi(x,y,z) = \int_V K(x',y',z,\zeta)\,Q(x-x',\,y-y',\,\zeta)\,dV' \tag{2-4}$$
where we have made the change of variable
$x' = x-\xi$,
$y' = y-\eta$,
and $dV' = dx'\,dy'\,d\zeta$.
Again following Calder [4], we note that (2-4) represents an integral
equation for the function K if the concentration distribution $\chi$ and emission
distribution Q are known. In other words, we might conceive of determining
the source-receptor function K empirically by examining observed concentra-
tion distributions resulting from observed (or estimated) emission distri-
butions. Calder notes that in the case where (1) we are predicting
ground-level concentrations from area sources at ground level, (2) the
integral is approximated by a summation over an M-by-N grid of
values, and (3) we have measured concentrations and emissions at each grid
point, K can be determined in tabular form by solving a set of linear
equations. The table specifying the function $K(x',y')$ in this case yields
values for any pair of grid points $(x,\xi)$ and $(y,\eta)$ and is valid for the
meteorological conditions which yielded the particular concentration
distribution used. (The table would appear as in Figure 2-1.) A
tabular formulation of the source-receptor function K
has the advantage of not restricting the class of functions being investi-
gated, but has the disadvantage of requiring values of the concentration
distribution at each grid point and making the dependence upon meteorological
factors difficult to extract explicitly.

                        $x' = x - \xi$
$y' = y - \eta$     .1     .2     .3     .4
      .1           5.0    5.5    4.5    3.0
      .2           4.0    4.5    3.0    2.5
      .3           3.0    2.0    1.5    1.4
      .4           2.0    1.6    1.2    1.0

Figure 2-1. A tabular representation of a hypothetical source-receptor
function K for a fixed meteorology. Another table would be required for
another set of meteorological conditions.
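The tabular approach can be sketched in Python for a one-dimensional analogue, assuming concentrations are related to emissions by a discrete sum chi[i] = sum over k of K[k] * q[i - k]. The grid, the emission values, and the helper names `solve_linear` and `recover_kernel_table` are hypothetical and serve only to show that the table follows from a set of linear equations.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def recover_kernel_table(q, chi, n):
    """Recover K[0..n-1] from chi[i] = sum_k K[k] * q[i - k].

    q maps a grid index (possibly negative) to the emission rate there.
    """
    a = [[q.get(i - k, 0.0) for k in range(n)] for i in range(n)]
    return solve_linear(a, chi)

# Synthetic check: generate concentrations from a known table, then recover it.
k_true = [5.0, 4.0, 3.0, 2.0]
emissions = {-3: 1.0, -2: 0.5, -1: 2.0, 0: 6.0, 1: 1.5, 2: 0.7, 3: 2.2}
n = len(k_true)
chi_obs = [sum(k_true[k] * emissions.get(i - k, 0.0) for k in range(n))
           for i in range(n)]
k_est = recover_kernel_table(emissions, chi_obs, n)
```

With exact synthetic data the table is recovered to rounding error. With measured data the same linear system would carry the measurement errors directly into the table, which is one reason the smoothing and parametric alternatives are of interest.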
An alternative approach to solving equation (2-4) is to restrict K
to membership in a family of functions $K(x',y',z,\zeta;\alpha)$, where choosing the
parameter vector $\alpha$ specifies a particular member of the family. A familiar
example is the family of multivariate polynomials, where a member is
specified by a particular choice of values for the coefficients. In this case,
K is specified as a specific functional form rather than as a table.

One approach to determining the "best-fitting" function of the
chosen family (or, equivalently, to finding the parameter vector $\alpha^*$
which gives the best fit) is to fit, by a classical least-squares method,
the values yielded by the set of linear equations suggested by Calder.
In the hypothetical example of Figure 2-1, the 16 values of the table
could be fit by a function of two variables. Since this is a two-step
approach, it does not overcome the problems of the first approach, but
amounts largely to smoothing the values yielded by that approach.
The more direct approach is to substitute $K(x',y',z,\zeta;\alpha)$ directly
in equation (2-4). If there were an $\alpha$ such that the equation could
be satisfied exactly, the concentration distribution could be predicted
exactly by the function given by that $\alpha$. If a perfect fit is not possible,
then one could find the parameters $\alpha$ which minimize the mean-square error
over a number of measurement points $(x_i,y_i,z_i)$, $i=1,2,\ldots,M$, perhaps
corresponding to monitoring stations:
$$e^2(\alpha) = \frac{1}{M}\sum_{i=1}^{M}\left[\chi_i - \int_V K(x',y',z_i,\zeta;\alpha)\,Q(x_i-x',\,y_i-y',\,\zeta)\,dV'\right]^2 \tag{2-5}$$
where $\chi_i = \chi(x_i,y_i,z_i)$.
Equation (2-5) can be minimized with respect to $\alpha$ by any number of
optimization techniques if the integral can be calculated; many numerical
integration techniques are suitable for that purpose. A key problem is
the choice of an appropriate family of parameterized functional forms for
K such that the error $e^2$ will be small, but such that the number of
parameters in $\alpha$ is small. The number of parameters is related to the number
of measurements required to make the problem well-determined and to its
computational feasibility. We will discuss this point after further
refinement of the problem formulation.
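The direct approach can be sketched in Python for a deliberately simple case. The sketch assumes a one-dimensional problem, a hypothetical one-parameter exponential family for K, a rectangle-rule approximation of the integral, and a golden-section search as the optimizer; none of these choices comes from the report, and any other family, quadrature, or optimizer could be substituted.

```python
import math

def kernel(xp, alpha):
    # Hypothetical one-parameter family: exponential decay with upwind distance.
    return math.exp(-alpha * xp)

def predict(x_i, alpha, q, dx, n_steps):
    # Rectangle-rule approximation of the integral in (2-4) along one axis.
    return sum(kernel(k * dx, alpha) * q(x_i - k * dx) * dx
               for k in range(1, n_steps + 1))

def mean_square_error(alpha, stations, chi_obs, q, dx, n_steps):
    # e^2(alpha) of (2-5): average squared misfit over the measurement points.
    errs = [(c - predict(x, alpha, q, dx, n_steps)) ** 2
            for x, c in zip(stations, chi_obs)]
    return sum(errs) / len(errs)

def golden_section(f, lo, hi, tol=1e-8):
    # Minimize a unimodal function of one variable on [lo, hi].
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    while b - a > tol:
        if f(c) < f(d):
            b, d = d, c
            c = b - g * (b - a)
        else:
            a, c = c, d
            d = a + g * (b - a)
    return 0.5 * (a + b)

def emission(x):
    # Hypothetical emission density along the wind axis.
    return 1.0 + 0.5 * math.sin(x)

# Synthetic "observations" generated with a known parameter, then refit.
stations = [2.0, 4.0, 6.0, 8.0]
dx, n_steps, alpha_true = 0.1, 60, 0.7
chi_obs = [predict(x, alpha_true, emission, dx, n_steps) for x in stations]
alpha_fit = golden_section(
    lambda a: mean_square_error(a, stations, chi_obs, emission, dx, n_steps),
    0.05, 3.0)
```

A single parameter is recovered here from four measurement points. With more parameters the search becomes multidimensional, which is where the trade-off between generality and the number of parameters arises.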
Explicit Dependence on Meteorology
Since the concentration distribution in (2-5) is determined by
meteorological parameters as well as the source distribution, K determined
by that formulation would be valid only for that particular set of meteor-
ological conditions. If we denote the vector of meteorological parameters
as m (e.g., wind speed, inversion height, and stability class) and express
the dependence of K and $\chi$ on meteorology as $K(x',y',z,\zeta,m;\alpha)$ and $\chi(x,y,z,m)$,
then the error $e^2$ is a function of the choice of parameters $\alpha$ and meteor-
ological conditions m:
$$e^2 = e^2(\alpha,m) \tag{2-6}$$
If a set of N meteorological conditions we wish to explore is given by
$m_1,m_2,\ldots,m_N$, then we may find $\alpha$ to minimize
$$E^2(\alpha) = \frac{1}{N}\sum_{j=1}^{N} e^2(\alpha,m_j) \tag{2-7}$$
where $e^2$ is defined by (2-5). This last equation gives the mean-square
error over all measurement points $(x_i,y_i,z_i)$ and over all the chosen
meteorological conditions. The function resulting from optimization,
$K(x',y',z,\zeta,m;\alpha^*)$, should predict these MN points accurately. If m is
three-dimensional, then K is a function of six variables and explicitly
contains dependence on the meteorology. Equation (2-1), the Gaussian
formulation, is an example of such a functional form (although not obtained
by optimization).
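Pooling the error over meteorological conditions can be sketched in Python as follows. The sketch assumes a single meteorological parameter (wind speed u), a hypothetical kernel whose dilution is inversely proportional to u, discrete upwind sources, and a coarse grid search; all names and numbers are illustrative assumptions rather than the report's formulation.

```python
import math

def kernel(xp, u, alpha):
    # Hypothetical form: decay rate alpha, dilution inversely proportional to u.
    return math.exp(-alpha * xp) / u

def predict(x_i, u, alpha, sources):
    # Sum over discrete upwind sources given as (position, strength) pairs.
    return sum(kernel(x_i - xs, u, alpha) * qs
               for xs, qs in sources if xs < x_i)

def e2(alpha, u, stations, chi_obs, sources):
    # Mean-square error for one meteorological condition, as in (2-6).
    return sum((c - predict(x, u, alpha, sources)) ** 2
               for x, c in zip(stations, chi_obs)) / len(stations)

def pooled_e2(alpha, winds, stations, chi_by_wind, sources):
    # Pooled error as in (2-7): average of e2 over the N conditions.
    return sum(e2(alpha, u, stations, chi_by_wind[u], sources)
               for u in winds) / len(winds)

sources = [(0.0, 2.0), (1.5, 1.0), (3.0, 4.0)]
stations = [2.0, 4.0, 6.0]
winds = [2.0, 5.0, 8.0]
alpha_true = 0.4
chi_by_wind = {u: [predict(x, u, alpha_true, sources) for x in stations]
               for u in winds}

# Coarse grid search, purely for illustration; any optimizer could be used.
grid = [0.05 * k for k in range(1, 41)]
alpha_best = min(grid, key=lambda a: pooled_e2(a, winds, stations,
                                               chi_by_wind, sources))
```

Because the kernel carries the wind-speed dependence explicitly, one parameter value serves all three conditions, and the fit uses M times N = 9 values rather than 3.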
Area and Point Sources
We can further refine our formulation by explicit consideration of
area and point sources.
Area sources at ground level yield concentrations at ground level
given by
$$\chi_A(x,y,m) = \chi_A(x,y,0,m) = \int_V K_A(x',y',m;\beta)\,Q_A(x-x',\,y-y')\,dx'\,dy' \tag{2-8}$$
Elevated point sources yield concentrations at ground level given by
$$\chi_P(x,y,m) = \chi_P(x,y,0,m) = \sum_{\ell=1}^{J} K_P(x'_\ell,y'_\ell,\zeta_\ell,m;\gamma)\,Q_P(x-x'_\ell,\,y-y'_\ell,\,\zeta_\ell) \tag{2-9}$$
where the J point sources are at $(\xi_\ell,\eta_\ell,\zeta_\ell)$, $\ell=1,2,\ldots,J$,
and
$x'_\ell = x-\xi_\ell$ and $y'_\ell = y-\eta_\ell$.
The total concentration at (x,y) with meteorological conditions m is
given by
$$\chi(x,y,m) = \chi_A(x,y,m) + \chi_P(x,y,m) \tag{2-10}$$
As before, the optimal source-receptor function K is determined by finding
the parameters $\beta$ and $\gamma$ which minimize the mean-square error in predicting
concentrations over a varying set of meteorological conditions:
$$E^2(\beta,\gamma) = \frac{1}{N}\sum_{j=1}^{N}\left\{\frac{1}{M}\sum_{i=1}^{M}\left[\chi(x_i,y_i,m_j) - \int_V K_A(x',y',m_j;\beta)\,Q_A(x_i-x',\,y_i-y')\,dx'\,dy' - \sum_{\ell=1}^{J} K_P(x'_\ell,y'_\ell,\zeta_\ell,m_j;\gamma)\,Q_P(x_i-x'_\ell,\,y_i-y'_\ell,\,\zeta_\ell)\right]^2\right\} \tag{2-11}$$
(Note that the effective stack height $\zeta$ may also be a function of the
meteorology: $\zeta = \zeta(m)$.)
Equation (2-11) summarizes the basic method proposed. Presuming the
integral is approximated using a numerical integration technique, the
error $E^2$ can be calculated for any choice of parameters. A number of
optimization techniques can be employed to find the parameters which give
the best fit. Given those optimal parameter values, the optimal source-
receptor functions $K_A$ and $K_P$ are fully specified. It remains to consider
the feasibility of this approach.

2.3 Feasibility
Feasibility of the proposed method depends upon two closely related
problems:
(1)	Does the problem as posed have a well-defined solution?
(2)	Even if the solution is theoretically well-defined, is
it computationally feasible to obtain?
Both questions are heavily dependent on the number of parameters
defining $K_A$ and $K_P$, i.e., the dimensionality of $\beta$ and $\gamma$. The number of values
to be fit (M·N) should be greater than the number of free parameters; if
so, the solution will in most cases be well-defined (perhaps within some
limited region of parameter space).
Computational feasibility also depends on the number of parameters.
The cost of most optimization algorithms will tend to go up as a power of
the number of parameters whose values must be determined. A major objective
of this approach must hence be to find a parameterized functional form which
is sufficiently general to be able to model the source-receptor function,
but which does not require a large number of free parameters to achieve
this generality. Continuous piecewise linear functions, as used in a
recent EPA study [5], provide such a class of functions and are a promising
candidate for achieving a feasible solution.
Whatever form of approximating function is used, however, one can
simplify the problem by making certain assumptions regarding the source-
receptor function; for example:

(1)	One might assume specific dependencies on some meteorological
parameters, e.g., by assuming that the concentration is
inversely proportional to the average wind speed rather than
extracting that dependence empirically.
(2)	One might assume the Gaussian form and determine the dispersion
functions empirically.
(3)	One might make the narrow-plume assumption for area sources [6].
The more assumptions made, the less general, but also less difficult, the
analysis will be.
2.4 The Inverse Problem
Suppose a good source-receptor function has been determined. Then
one might pose the problem: Given meteorological conditions, measurements
of the pollutant concentration distribution, and source locations, determine
the distribution of source strengths.
Equation (2-11) provides a formulation of this problem if $\beta$ and $\gamma$ are
assumed known and the area and/or point source strengths assumed unknown.
For example, suppose the area sources are assumed known, and the point
source emission rates are to be determined. There are then J unknown values
for the J point sources, and $E^2$, the mean-square error in predicting the
measured concentrations, is a function of those J values. The values
which minimize $E^2$ are good estimates of the source emission rates. Since
the unknowns appear linearly within the brackets, the minimum of (2-11) can
be found by solving a set of J linear equations in J unknowns. The solution
will be well-determined (except in degenerate cases) if the number of con-
centration measurements times the number of meteorological conditions
exceeds the number of point sources.
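The linear solution for the point-source strengths can be sketched in Python via the normal equations. The coefficient matrix `k_coeff` is assumed to hold already-determined source-receptor values (one row per receptor and meteorological condition, one column per point source), and `chi_meas` is assumed to be the measured concentration with the known area-source contribution subtracted; both are hypothetical numbers for illustration.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def estimate_source_strengths(k, chi):
    """Least-squares source strengths q from chi ~= k q (normal equations)."""
    n_obs, n_src = len(k), len(k[0])
    ata = [[sum(k[r][p] * k[r][s] for r in range(n_obs))
            for s in range(n_src)] for p in range(n_src)]
    atb = [sum(k[r][p] * chi[r] for r in range(n_obs)) for p in range(n_src)]
    return solve_linear(ata, atb)     # J equations in J unknowns

# Six observations (e.g., 3 receptors x 2 conditions) and J = 2 point sources.
k_coeff = [[0.9, 0.1], [0.4, 0.5], [0.2, 0.8],
           [0.6, 0.3], [0.1, 0.7], [0.5, 0.5]]
q_true = [12.0, 7.0]
chi_meas = [sum(kr * q for kr, q in zip(row, q_true)) for row in k_coeff]
q_est = estimate_source_strengths(k_coeff, chi_meas)
```

Here six observations determine two unknowns, matching the condition that the number of measurements times the number of meteorological conditions exceed the number of point sources.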

2.5	Testing the Approach
A common problem in testing meteorological and air quality models
is that the data base required for the models is subject to errors which
may be of the same size as errors introduced by the models. Emission
inventories and estimates of diurnal variations in emissions, for example,
may tend to be correct on the average, yet be considerably in error in
any given hour. A least-squares formulation such as that proposed tends
to average out errors and will tend to produce good models. In order to
gain confidence in the approach and to explore alternative levels of
assumptions, however, it is desirable to use a case where the source of
errors will arise from the model formulation rather than measurement error.
One means to this end is to use data generated by a model such as the
Gaussian-plume RAM model referenced earlier. If the source-receptor
function derived by the proposed approach closely approximates the Gaussian
form used in generating the data, one would have increased confidence in
the applicability of the proposed technique to measured data. Further,
alternative versions of the methodology could be analyzed in a controlled
environment. The proposed Phase II study thus suggests this approach and
follows a work plan suggested by Calder [4].
2.6	Research Plan
Task 1
Select a real urban location (e.g., St. Louis, New York, or Chicago)
for which ground-level area- and point-source, short-term emissions distri-
butions are available for SO2. For a typical one-hour emissions distribution,
use a multiple-source Gaussian dispersion model (probably the RAM model of
the EPA Meteorology Laboratory), for one wind speed (5 m/sec), one stability
class (neutral), and infinite mixing depth, to calculate total one-hour con-
centrations $\chi(P_i,\theta_j)$ at ground level at a number of receptor locations
$P_i$ (e.g., as for the St. Louis RAPS network) and for various wind directions
$\theta_j$. Apply the methodology proposed to attempt to recover the meteorological
dispersion function K. Determine
(a)	the degree of error in predicting concentrations, for the re-
ceptor locations and wind directions actually used to derive the empirical
dispersion function,
(b)	the degree of error in predicting concentrations at receptor
locations and for wind directions not used in the derivation (a measure
of interpolation accuracy),
(c)	the degree of error in predicting results for a somewhat dif-
ferent emissions distribution (a test of extrapolation accuracy), and
(d)	a comparison of the empirical dispersion function with the
Gaussian form used to compute input concentrations for the analysis.
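The error measures in (a) and (b) could be computed along the following lines. This Python sketch assumes a root-mean-square error as the measure of "degree of error" and a simple every-third-case holdout; the report specifies neither, and the case list is hypothetical.

```python
import math

def rms_error(predicted, observed):
    """Root-mean-square difference between predicted and observed values."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2
                         for p, o in zip(predicted, observed)) / n)

def split_cases(cases, holdout_every=3):
    """Hold out every k-th (receptor, wind direction) case for checking."""
    fitted = [c for i, c in enumerate(cases) if i % holdout_every != 0]
    held_out = [c for i, c in enumerate(cases) if i % holdout_every == 0]
    return fitted, held_out

# Hypothetical cases: receptor index paired with wind direction in degrees.
cases = [(p, theta) for p in range(3) for theta in (0, 90, 180, 270)]
fitted, held_out = split_cases(cases)
```

The error on `fitted` corresponds to item (a) and the error on `held_out` to item (b); a large gap between the two would suggest that the fitted function has too many free parameters for the data available.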
Task 2
Test the sensitivity of the method to the number of "observed"
concentrations used and to random errors in the emissions inventory.
Task 3
Extend the preceding to a range of wind speeds, atmospheric
stability classes, and to several different emissions distributions.

3.0	EMPIRICAL ANALYSIS OF THE OXIDANT FORMATION PROCESSES IN THE
LOS ANGELES BASIN
3.1	Motivation
Oxidant is a difficult pollutant to deal with in terms of understanding
the effect of particular controls on the resultant level of its concentra-
tion. The principal reason for this difficulty is that oxidant is largely
an end product of a chemical process rather than being directly emitted
from pollutant sources. Oxidant is related to emissions not only through
transport and diffusion, but by a complex chemical reaction in which meteor-
ology can play a significant part. The principal pollutants leading to
the formation of oxidant are reactive hydrocarbons (HC) and oxides of ni-
trogen (NOx). (We shall refer to these "raw" pollutants as "oxidant pre-
cursors.") Since emission control policies can affect not only the overall
level of emissions but the ratio of emissions of NOx to hydrocarbons, it
becomes important to understand the effect of this ratio as well as of the
absolute level of emissions upon the end concentrations of oxidant. These
effects are by no means fully agreed upon.
One approach to understanding this problem is a very detailed inspec-
tion of all the physical processes involved, including meteorological ef-
fects and chemistry. One then obtains a model which, if successful, relates
a detailed emissions inventory and detailed meteorological conditions to a
resulting time and spatial distribution of oxidant values over the area
modeled.
An alternative approach is an empirical analysis of the relationship
between observed concentrations of oxidant precursors and meteorological
variables and the resulting observed distribution of oxidant concentration.
The objectives of such an analysis would generally be more limited than
in the development of detailed models arising from fundamental physical
and chemical principles; however, an insight into the relationships which
are observed, even for a limited range of conditions, can provide both
guidance for setting control policy and guidance as to the dominant effects
which should be considered in a chemical/physical model.
3.2 The Data
Any empirical analysis must proceed from a data base. For the analysis
proposed, a great deal of data is available on the Los Angeles basin,
particularly from the Los Angeles Air Pollution Control District (LAAPCD)
and the California Air Resources Board (ARB).
There are at present approximately 30 stations monitoring air quality
in the South Coast (Los Angeles area) Basin. Almost all of the stations
monitor oxidant and NOx. Several monitor HC.
The earliest station records date back to 1955, but few stations have
such long histories. The early records are of somewhat doubtful value in
some cases, due to changes in monitoring technology and standards. There
are, however, more than 20 stations with histories of several years.
There are continuing questions about the comparability of data taken
by different agencies. While such comparability (and indeed absolute
accuracy) is critical for use with deterministic models, it is less
important for statistical models as long as a consistent basis is used
for each station reporting. The more recent data is generally charac-
terized by such consistency, although data from the ARB may have to be
adjusted downward 20 to 25 percent to be consistent with LAAPCD data due
to differing calibration techniques [7,8].
Mesometeorological data, such as wind speed and direction, surface
temperature, the vertical temperature profile, pressure, humidity, pre-
cipitation, visibility, and insolation, is collected at airports and
other Weather Service stations and at meteorological stations run by
other organizations, for example, Air Pollution Control Districts and the
Armed Forces. Data from the Weather Service and Armed Forces stations are
available from NOAA. Data from a sizable number of other meteorological
stations in Los Angeles County is available from the LAAPCD.
An initial statistical analysis of LAAPCD data has been performed by
Tiao, Box, et al. [9,10]. At present, only preliminary results have been
reported.
3.3 The Problem Formulation
The general empirical approach is to postulate the possible independent
variables which affect the dependent variable to be predicted. In the
present case, the independent variables are measures of the precursor
pollutant concentrations and of meteorological variables and the dependent
variable is the oxidant concentration at a given location (or an aggregate
measure such as peak oxidant concentration throughout the basin). (We re-
fer for the sake of conciseness to variables such as averages or peaks
which remove either a spatial or time variation from a given independent
or dependent variable as "aggregate" variables.) This analysis has two
major steps:

(1)	Find the independent variables which best explain the
behavior of the dependent variable; and
(2)	Model that relationship mathematically.
While the second step generally receives the most attention in empirical
analyses, the first step is the more difficult and often the more revealing.
One can apply both linear and nonlinear models in both steps. If linear
models are applied in the first step, the question answered will be whether
the variables predict the dependent variable linearly. If general nonlinear
forms are allowed, the question answered will be whether the independent
variables predict dependent variables in either a linear or nonlinear manner.
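The distinction between a linear and a nonlinear screen in the first step can be illustrated with a small Python sketch. It assumes a synthetic relationship y = x squared and uses the Pearson correlation coefficient as the linear measure; the data are invented solely to show how a purely linear screen can miss a strong dependence.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Synthetic data: the dependent variable is an exact function of x.
x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [v * v for v in x]

r_linear = pearson(x, y)                          # linear screen on x itself
r_with_square = pearson([v * v for v in x], y)    # screen allowing x squared
```

The linear screen reports no relationship although y is fully determined by x, whereas allowing the squared term reveals it. This is the sense in which the question answered depends on the class of models admitted in the first step.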
A good example of an analysis of the linear dependence of ambient ozone
on meteorological parameters is research performed at Bell Labora-
tories [11]. There the logarithms of the meteorological variables were used
to predict the logarithm of the oxidant concentration by a linear equation;
this equation produced a correlation between predicted and actual concentra-
tion of 0.84. A single location in New York was analyzed, and the only
variables used in attempting to predict the oxidant concentration were solar
radiation, wind speed, and temperature. Mixing height was determined to
offer no additional information beyond that of the three independent variables
indicated. This analysis did not contain any dependence upon precursor pol-
lutants since the oxidant concentration was specific to a certain location
and the data was collected over a relatively short period of time; hence,
the emissions might be expected to be relatively constant. The results,
although highly encouraging in terms of the potential for empirical anal-
ysis, should be qualified:

(1)	The model correlation was based upon a limited time period
and one location and, while a very careful and credible
statistical analysis of the prediction errors was made,
no test was performed upon independent data not used in
creating the linear equation.
(2)	Since the errors in predicting the logarithm of the
ozone value were found to be normally distributed, the
error in predicting the ozone concentration itself tended
to be largest at the higher values of ozone concentra-
tion. It is at the higher values where difficulty in
forecasting is generally encountered but where accuracy
of the model is most critical.
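The second qualification can be made concrete with a short Python sketch. It assumes a normally distributed error in the natural logarithm with a hypothetical standard deviation of 0.3 and arbitrary concentration units; these numbers are not taken from the study cited.

```python
import math

def concentration_interval(predicted, log_sigma, z=1.0):
    """Concentrations spanned by +/- z standard deviations of log error."""
    return (predicted * math.exp(-z * log_sigma),
            predicted * math.exp(z * log_sigma))

def interval_width(predicted, log_sigma, z=1.0):
    lo, hi = concentration_interval(predicted, log_sigma, z)
    return hi - lo

# Hypothetical numbers: same log-scale error at a low and a high prediction.
width_low = interval_width(5.0, 0.3)
width_high = interval_width(25.0, 0.3)
```

A constant error on the logarithmic scale is a constant percentage error on the original scale, so the absolute uncertainty grows in proportion to the predicted concentration: five times the concentration gives five times the interval width.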
We note briefly another study as an example of the possibility
of creativity in the definition of potential independent variables.
Smith and Jeffrey attempted to predict, by a simple formula, the high
concentration of sulfur dioxide in London and Manchester air [12]. One
variable they found quite useful was the number of hours when the wind
speed was less than three knots. This variable apparently summarized the
key aspect of the temporal variation of wind speed as a single number. It
is the intent of the proposed project to exercise similar, limited
creativity in defining potential predictors of oxidant concentration.
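A derived variable of the kind described above is simple to compute. The following is a minimal sketch; the function name and the sample wind record are hypothetical, and only the three-knot threshold comes from the text.

```python
def calm_hours(wind_speeds_knots, threshold=3.0):
    """Count the hours in a record whose wind speed is below the threshold."""
    return sum(1 for speed in wind_speeds_knots if speed < threshold)

# Hypothetical hourly wind speeds in knots (illustration only).
hourly_wind = [1.0, 2.0, 2.5, 4.0, 6.0, 8.0, 3.0, 2.9]
print(calm_hours(hourly_wind))
```

The single count replaces the full hourly series as a candidate independent variable.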
A key characteristic of the problem is that the level of ozone at
a given time may be the result of precursor concentrations at a different
point at an earlier time. (There is even some tentative empirical evidence
that one day's hydrocarbons may be important in producing high levels of
oxidant on the following day [13].) The particular location and relevant
time delay will be a function of meteorological parameters such as the
wind field, temperature, solar radiation, and mixing height. There are
several possible ways of handling this key problem:
(1)	Stratify the data by general meteorological or wind-field
classes, e.g., "a light wind from the ocean." Data for each
class of wind field could then be analyzed to discover the
location and time delay which best explain oxidant concentra-
tion at a given location. One could thus determine empiri-
cally the precursor location and time delays for the specific
meteorological class.
(2)	Aggregate precursor and oxidant values spatially and/or
temporally. If the average hydrocarbon and NOx concentrations
across the basin for a given hour are considered
independent variables and the peak oxidant reading throughout
the basin for the day is considered a dependent variable,
then one may seek a direct relationship between those variables.
This is much the sort of aggregation attempted successfully
in an earlier analysis of data produced by a photochemical
smog model [5].
(3)	Perform a trajectory analysis. Let the precursors of oxidant
at a given location be the hydrocarbon and NOx concentration
in the parcel of air at its location at an earlier time
as obtained through analysis of the trajectory of the
parcel. The independent variables in this case would be
the precursor concentrations in the parcel three hours
earlier, four hours earlier, etc. This approach has the
advantage of making it unnecessary to specifically include
the wind field as an independent variable; instead, the wind
field is used in defining a more complex independent variable.
The third approach involves the interpolation of the wind field and the
tracing of the trajectories. In a later section we will discuss objective
interpolation of meteorological parameters and specifically interpolation
of the wind field. Given a methodology for interpolating the wind field
from a limited number of measurements, trajectories such as those in
Figures 3-1 and 3-2 can be estimated. The precursor pollutant concen-
trations in the parcel of air at an earlier point can be estimated by
interpolation of the concentrations of the precursor pollutants between
measurement stations. The independent variable can then be the pollutant
concentration in the parcel of air at an earlier point in time (or perhaps
a weighted average of the earlier pollutant concentrations throughout the
trajectory).
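The trajectory construction just described can be sketched in a few lines. This is an illustration under stated assumptions, not the interpolation method proposed in this report: inverse-distance weighting is swapped in as a stand-in interpolator, the parcel is stepped backward with a simple one-hour Euler step, and the station coordinates and winds are hypothetical.

```python
def interpolate_wind(x, y, stations, power=2.0):
    """Inverse-distance-weighted wind (u, v) at (x, y).

    stations is a list of (xs, ys, u, v) tuples; this interpolator is an
    assumption made for illustration only.
    """
    num_u = num_v = den = 0.0
    for xs, ys, u, v in stations:
        d2 = (x - xs) ** 2 + (y - ys) ** 2
        if d2 == 0.0:
            return u, v  # exactly at a station: use its measurement
        w = 1.0 / d2 ** (power / 2.0)
        num_u += w * u
        num_v += w * v
        den += w
    return num_u / den, num_v / den

def back_trajectory(x, y, stations_by_hour, dt_hours=1.0):
    """Trace a parcel backward in time from its arrival point.

    stations_by_hour lists the station winds for each hour, ordered from
    the arrival hour backward. Returns the list of parcel positions.
    """
    path = [(x, y)]
    for stations in stations_by_hour:
        u, v = interpolate_wind(x, y, stations)
        x -= u * dt_hours  # move against the wind to go back in time
        y -= v * dt_hours
        path.append((x, y))
    return path

# Hypothetical case: a steady 2 m.p.h. wind from the west at two stations.
stations = [(0.0, 0.0, 2.0, 0.0), (10.0, 10.0, 2.0, 0.0)]
path = back_trajectory(10.0, 5.0, [stations] * 3)
print(path[-1])
```

The precursor concentration at each earlier position could then be interpolated between monitoring stations in the same way and used as an independent variable.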
The end objective is to obtain a functional relationship between a
variable measuring the concentration of ozone and a limited number of
meteorological and chemical precursor variables. The tool for the
ultimate determination of the functional relationship will be nonlinear:
continuous piecewise-linear regression, as used in an earlier study [5].
(A comparison between a nonlinear fit and a linear fit will be made to
determine the degree of improvement obtained by allowing nonlinearity.)

[Figure 3-1 (map not reproduced): ESTIMATED TRAJECTORY OF AIR ARRIVING AT
PASADENA, EL MONTE, LONG BEACH, AND SANTA ANA AT 0400 SEPTEMBER 29, 1969.
Stations shown: West L.A., Burbank, Los Angeles, Pasadena, El Monte, Azusa,
Pomona, Redondo Beach, Long Beach, Santa Ana. Points along each trajectory
are labeled 100 through 400; scale in miles, 0 to 10.]
Figure 3-1: Trajectories
The figure shows that an irregular early morning meandering
pattern exists at Long Beach, Santa Ana, and El Monte. Pasadena,
on the other hand, shows a northerly flow pattern due to nocturnal
air drainage down the mountains combined with an offshore wind
flow. The lengths of the arrows give an indication of how much
the air has moved during an hour interval. None of the stations
show more than 4 m.p.h. air movement for the early morning hours.

[Figure 3-2 (map not reproduced): ESTIMATED TRAJECTORY OF AIR ARRIVING AT
PASADENA, EL MONTE, LONG BEACH, SANTA ANA, AND POMONA AT 1600 SEPTEMBER 29,
19.. Stations shown: Burbank, Pasadena, Azusa, L.A., El Monte, West L.A.,
Long Beach, Redondo Beach, Santa Ana. Points along each trajectory are
labeled 1200 through 1600; scale in miles, 0 to 10.]
Figure 3-2: Trajectories
The figure shows the estimated air trajectories for the afternoon
hours. All of the stations show the dominance of onshore
sea breezes with a tendency of higher velocities later in the
afternoon. The more regular air trajectories of afternoon also
show a greater air movement than early in the morning as shown
by the greater lengths of the arrows. The deflecting influence
of the Santa Monica mountains causes the air trajectory to curve
northward as it approaches Pasadena.
The technique used to determine which variables will be used in the equation
will be a combination of linear and nonlinear variable selection techniques
[14].
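The comparison between a linear fit and a continuous piecewise-linear fit can be illustrated with a minimal sketch. It is not the procedure of the earlier study [5]: here the breakpoint (knot) is fixed in advance, a single independent variable is used, and the data are hypothetical. The piecewise-linear form is written as an ordinary linear regression on the extra "hinge" term max(0, x - knot), which keeps the fitted function continuous.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def least_squares(rows, targets):
    """Least-squares coefficients obtained from the normal equations."""
    p = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    atb = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(p)]
    return solve_linear_system(ata, atb)

def sum_squared_error(rows, targets, coef):
    return sum((t - sum(c * v for c, v in zip(coef, r))) ** 2
               for r, t in zip(rows, targets))

# Hypothetical data: flat up to x = 5, then rising with slope 2.
xs = [float(i) for i in range(11)]
ys = [1.0 if x <= 5 else 1.0 + 2.0 * (x - 5) for x in xs]
knot = 5.0  # assumed known here; in practice it would have to be estimated

linear_rows = [[1.0, x] for x in xs]
hinge_rows = [[1.0, x, max(0.0, x - knot)] for x in xs]

linear_coef = least_squares(linear_rows, ys)
hinge_coef = least_squares(hinge_rows, ys)
linear_sse = sum_squared_error(linear_rows, ys, linear_coef)
hinge_sse = sum_squared_error(hinge_rows, ys, hinge_coef)
print(linear_sse, hinge_sse)
```

The drop in squared error from the linear to the piecewise-linear fit is the kind of "degree of improvement" the comparison is meant to measure.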
3.4 Research Plan
Task 1
Collect data from the California ARB and LAAPCD (and limited meteorological
data from other sources) into a common format. The data will be
limited to the South Coast Basin and the years 1968-1974.
Task 2
Analyze data as stratified by classes of wind field (and, perhaps,
other meteorological variables). Data from days with meteorological con-
ditions conducive to high oxidant concentration will be emphasized. Per-
form a linear and nonlinear analysis to determine appropriate independent
variables. Examine the utility of using aggregate variables within the
stratified classes.
Task 3
Examine the consistency of results obtained in Task 2 with those ob-
tained from a trajectory analysis. It is understood that practical limita-
tions on time and funds available will restrict the extensiveness of this
analysis.
Task 4
Summarize the implications of the analysis and of the generality and
validity of the models obtained.
The key difference between similar projects and the proposed project
is the use of a class of powerful nonlinear techniques and the resulting
generality of the conclusions.
4.0 EXTRACTION OF EMISSION TRENDS FROM AIR QUALITY TRENDS
4.1 Motivation
While measured pollutant concentration is the final impact of a
given level of emissions, trends in pollutant concentration measure-
ments can be misleading if it is assumed that those trends represent
progress (or the lack thereof) in emission control. Since meteorology
need not be uniform from time period to time period, the measure of
progress should be more directly related to emissions. Emissions come
from a large number of diverse sources, however, and are difficult to
measure directly. Since air quality has been measured directly for a
number of years, it is of significant interest to understand if the
effect of meteorology can be removed from air quality trends to more
nearly elicit trends in emissions. Such an analysis of trends is the
subject of periodic reports both by the Council on Environmental Quality
and by the Environmental Protection Agency.
Such a study must implicitly extract information about the influence
of meteorological factors on pollution levels for a given level of emis-
sions. This information can be an important subsidiary benefit of an
analysis of the sort suggested.
We will discuss this concept by referring to a specific example
of a study of the improvement in emissions between the early and late
sixties in Oslo, Norway [15]. We will then relate this example to a
general formulation to highlight the assumptions involved in such a
study, to make the method more specific, and to provide a context for
broader application of this approach.

4.2 Report of a Comparison of Emission Levels over Two Time Periods
A study of the changes in emission levels of SO2 in Oslo, Norway, as
deduced from changes in measured SO2 concentrations, was undertaken to
compare the SO2 emissions of the periods 1959-1963 and 1969-1973. The
meteorological conditions during the former period were considerably
different from those during the latter period; hence, one could not expect
a change in air quality to be directly related to a change in emissions.
Data from the earlier period (1959-1963) was used to do a linear
regression analysis. It was discovered that two variables dominated
the estimate of SO2 concentration: a temperature difference between a
low altitude and high altitude measuring station and the temperature
at the lower station. For example, a typical regression equation for
one station was
    q_{SO_2} = 61.5 (T_2 - T_1) - 11.6 T_1 + 472 ,    (4-1)
where
q_{SO_2} = daily mean value of SO2 concentration in µg/m³ at the
           particular station
T_2 = temperature at higher station at 7 P.M.
T_1 = temperature at lower station at 7 P.M.
This equation explained the observed values of SO2 concentration with a
multiple correlation coefficient of 0.80; that is, the correlation between
values predicted by this equation and observed values for the period
indicated was 0.80. Adding other variables did not result in a significantly
better predictor equation. It was suggested that the temperature difference
term expressed the ventilation in the Oslo area while the temperature term
measured the variation in the emission of SO2 due to space heating. Since
the temperature data for the later time period is known, the level of SO2
expected for the meteorological conditions during that time period can be
estimated by equation (4-1). This was done for the days on which data was
available in the later time period; the results are indicated and compared
with data from the earlier time period in Figure 4-1. The data from the
1959-1963 time period is scattered relatively uniformly about the line
of slope 1, as expected, since the regression was performed on that data.
However, the data from the later years evinces a much lower observed value
of SO2 concentration than would be expected from the meteorological
conditions. The referenced study attributed this to a reduction in emissions.
Figure 4-1 indicates qualitatively the emission reduction (or, if
the reader prefers, the "meteorologically normalized" reduction in pollutant
levels). A quantitative statement was made in the report that the
SO2 pollution was reduced 50 to 60%. According to a conversation with
one of the authors of the report, this latter statement was derived by
looking at the ratio of the coefficient on the temperature difference
term in the early time period to the coefficient of the
temperature difference term in a similarly derived equation for the later
time period. The intuitive justification for such a statement is that
the coefficient measures the degree to which a given temperature inversion
will be translated into SO2 concentrations. Thus a 50 or 60% reduction
in that coefficient might be thought of as a meteorologically adjusted
measure of the trend in air quality. The intent was to obtain a value
which can be interpreted as being proportional to the reduction in emissions.
[Figure 4-1 (plot not reproduced): computed versus observed daily mean SO2
concentration in µg/m³, with separate symbols for 1959/63, 1969/70, and 1971.]
Figure 4-1: Values of daily mean SO2 concentration computed
from temperature measurements at 7 P.M. versus
daily mean SO2 concentration observed. The fact
that the values in the later period are much less
than would be expected from the meteorology suggests
that emissions are less. Ref. [15]
4.3 Generalization and Mathematical Formulation
The purpose of the Oslo study was to compare air quality for two
different periods rather than to obtain a continuous estimate of a
meteorologically normalized air quality trend. We will formulate the
problem in the former terms in order to relate it explicitly to that
study; however, this does not at all imply that the approach cannot be
modified to yield a continuous estimate of air quality trends. Assume
we are given two sets of observations, one set for the first period of
time:
    q_1^{(1)}, m_1^{(1)}
    q_2^{(1)}, m_2^{(1)}
    ...
    q_{N_1}^{(1)}, m_{N_1}^{(1)}    (4-2)
where
q_i^{(1)} = an air quality measurement during the first period (e.g., a
daily mean value of pollutant concentration)
and
m_i^{(1)} = (m_{i1}^{(1)}, m_{i2}^{(1)}, ..., m_{in}^{(1)})
= a vector of meteorological measurements corresponding to the
ith air quality measurement q_i^{(1)} (e.g., one component of m_i^{(1)}
might be a temperature measurement at a particular station).
There is a similar set of measurements for a later period:
    q_1^{(2)}, m_1^{(2)}
    q_2^{(2)}, m_2^{(2)}
    ...
    q_{N_2}^{(2)}, m_{N_2}^{(2)} .    (4-3)
It is from this information (and without an estimate of emissions
during the two periods) that we wish to determine a meteorologically ad-
justed estimate of the improvement or deterioration of air quality (i.e.,
to estimate the change in emissions from air quality and meteorological
measurements). Suppose there is some "true," but unknown, equation (or
model) which relates emissions and meteorological measurements to air
quality:
    q = F(e, m) .    (4-4)
This equation plus measurement error produced the measurement data of
(4-2) and (4-3). We are assuming that the equation does not differ
between the two periods, that any changes in air quality are explained
either by a change in meteorology or a change in emissions.
For the sake of the present discussion, let us again assume that emis-
sions remain essentially constant over the first time period and over the
second time period:
    e = e_1 in first period,    (4-5a)
    e = e_2 in second period .    (4-5b)
Now let us suppose that we do a linear or nonlinear regression with the
data from the first period, equation (4-2), and obtain a best fit equation
to the data:
    q = f_1(m) .    (4-6)
Equation (4-1) is such an equation.
Since (4-6) was derived with constant emissions e_1, and since "truth"
is assumed to be given by equation (4-4), f_1 represents the relation
between meteorological conditions and pollutant concentration for fixed
emissions e_1:
    f_1(m) \approx F(e_1, m) .    (4-7)
Now suppose we use the data of the second period, equation (4-3), to obtain
a similar empirical model:
    q = f_2(m) .    (4-8)
Then, as before,
    f_2(m) \approx F(e_2, m) .    (4-9)
Let us further assume that F is decomposable:
    q = F(e, m) = G(e) H(m) .    (4-10)
Equation (4-10) implies that the effect of emissions on air quality is
essentially independent of the effect of meteorology. The appropriateness
of this assumption clearly depends upon the particular definitions of the
emission, meteorological, and pollutant variables, as well as the area in
question. If the pollutant concentration is location-specific (rather
than a spatial average or spatial maximum), then either emissions must be
spatially uniform or the direction of the wind field relatively consistent
for (4-10) to be reasonable. (The latter seems to be the assumption
of the Norwegian study.) If the variables are aggregates (such as spatially
averaged SO2 concentrations, total emissions, and average wind speed), then
less severe assumptions need be made for (4-10) to be reasonable.
Given (4-10), the ratio of the empirical equations for the two
time periods is
    f_2(m) / f_1(m) \approx F(e_2, m) / F(e_1, m) = G(e_2) / G(e_1) ,    (4-11)
using (4-6), (4-7), and (4-8). Thus, the ratio of the two equations
should be very nearly constant if (4-10) is valid, and that constant
will be a measure of the change in emissions between the two
periods. (The function G(e) can be, for example, total emissions in tons.)
If (4-11) is not nearly constant, it can be interpreted as im-
plying that the improvement is a function of the meteorology. This might
easily be the case. For example, if there is substantial reduction in
industrial emissions but no improvement in emissions from space heating,
the improvement in emissions will be less when the temperature is lower.
If the improvement is a function of wind direction, the location of major
emission sources may be the cause. In the Oslo study, the ratio of the
temperature difference terms alone was taken and is exactly constant.
Since the full Oslo model, (4-1), contains other terms, however, the
ratio suggested by this discussion is not constant. Since the equation
for the later time period was not explicitly reported, we cannot calculate
the ratio. Let us examine, however, an analysis which is consistent with
Figure 4-1 and which provides an alternative approach.
Suppose we create a model f_1 for the first time period only and apply
it to the meteorological conditions for the second time period:
    \hat{q}_1^{(2)} = f_1(m_1^{(2)})
    \hat{q}_2^{(2)} = f_1(m_2^{(2)})
    ...
    \hat{q}_{N_2}^{(2)} = f_1(m_{N_2}^{(2)}) .    (4-12)
We obtain estimates \hat{q}_i^{(2)} for the air quality to be expected if the
emissions have not changed; these calculated values can be compared with
observed values. These are the values plotted in Figure 4-1. If we now
perform a linear regression of observed versus estimated values, i.e.,
q_i^{(2)} versus \hat{q}_i^{(2)} for i = 1, 2, ..., N_2, we obtain a regression
equation:
    q = a \hat{q} + b ,    (4-13)
with specific values of a and b. Suppose we then assume that the "true"
equation is of the form
    q = F(e, m) = G(e) H(m) + q_0 ,    (4-14)
where
q_0 = a "background" air quality level not related to local emissions.
Then (4-13) is consistent with (4-14) if
    a = G(e_2) / G(e_1)    (4-15a)
and
    b = q_0^{(2)} - a q_0^{(1)} ,    (4-15b)
the superscript on q_0 denoting the period.
Then "a" can be interpreted as the ratio of emission levels in the second
period to those in the first and, more controversially, "b" can be related
to the change in "background" level (where the background level may contain
contributions from sources outside the emissions inventory included in e,
for example, long-range transport from other cities).
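The step from (4-14) to the two conditions (4-15a) and (4-15b) can be written out as follows. This is a sketch of the algebra under the stated decomposition; the superscripts (1) and (2) on q_0, marking the background level of each period, are notation assumed here.

```latex
\begin{align*}
\hat{q} &= f_1(m) \approx G(e_1)\,H(m) + q_0^{(1)}
  \quad\Longrightarrow\quad
  H(m) \approx \frac{\hat{q} - q_0^{(1)}}{G(e_1)}, \\[4pt]
q &= G(e_2)\,H(m) + q_0^{(2)}
  \approx \frac{G(e_2)}{G(e_1)}\,\hat{q}
  + \Bigl( q_0^{(2)} - \frac{G(e_2)}{G(e_1)}\, q_0^{(1)} \Bigr).
\end{align*}
```

Matching the last line term by term with q = a \hat{q} + b gives a = G(e_2)/G(e_1) and b = q_0^{(2)} - a q_0^{(1)}.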
Estimating the best-linear-fit equations graphically from Figure 4-1,
we find that the equation for the 1969/70 data is approximately
    q = 0.25 \hat{q} + 120    (4-16a)
and for the 1971 data
    q = 0.25 \hat{q} .    (4-16b)
Thus, the reduction in emissions is about 75% by this analysis for both
periods. The 1969/70 period had higher "background" than the 1959/63
period by 120 µg/m³, but the 1971 period had about the same background
as 1959/63. Thus, the improvement between 1969/70 and 1971 could be
attributed to improvements in areas other than Oslo.
Note that this latter approach requires that only one model be created.
Since the approach is symmetrical, the model can be created for the period
in which the most data is available and applied to the other period.
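The one-model procedure can be sketched as follows. All numbers are hypothetical and noise-free, a single meteorological variable stands in for the vector m, and the "true" relation is constructed so that emissions fall to one quarter while the background stays at 40; none of this is data from the Oslo study.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical period 1: q = 100 * h + 40 (h is one meteorological variable).
met_1 = [1.0, 2.0, 3.0, 4.0, 5.0]
q_1 = [100.0 * h + 40.0 for h in met_1]

# Hypothetical period 2: emissions at one quarter, same background of 40.
met_2 = [1.5, 2.5, 3.5, 4.5]
q_2 = [25.0 * h + 40.0 for h in met_2]

# Step 1: model f_1 from the first period only.
slope_1, intercept_1 = fit_line(met_1, q_1)

# Step 2: apply f_1 to second-period meteorology, as in (4-12).
q_hat_2 = [slope_1 * h + intercept_1 for h in met_2]

# Step 3: regress observed on estimated values, as in (4-13).
a, b = fit_line(q_hat_2, q_2)
print(a, b)
```

In this constructed case the slope recovers the emission ratio of 0.25, and the intercept equals the second-period background minus 0.25 times the first-period background, i.e. 40 - 0.25(40) = 30.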
5.0 DETECTION OF INCONSISTENCIES IN AIR QUALITY/METEOROLOGICAL DATA BASES
5.1 Motivation
Air quality and meteorological data bases are collected for many pur-
poses (and often used for purposes not intended when collected). An im-
portant objective either during collection or after the fact is the de-
tection of inconsistencies in the data. In most data collection efforts,
an attempt is made to study the data for strange behavior or to employ in-
tuition and problem knowledge to uncover sources of system changes causing
data problems, such as changes or discrepancies in monitoring techniques.
A recent example is the detection of a significant discrepancy in cer-
tain calibration techniques used by the California Air Resources Board and
the Los Angeles Air Pollution Control District, making oxidant measurements
of the agencies inconsistent without a correction factor [8]. The frequent
occurrence of detected inconsistencies in data bases leads one to
expect the possibility of undetected inconsistencies. An automatic technique
for flagging potential inconsistencies using the data itself would
be an important tool. Such a technique would take an existing data base
and detect potential problems for closer inspection or detect problems
occurring in an ongoing data collection effort before a substantial amount
of data was irretrievably lost.
In this section, we will indicate how data-analytic/statistical techniques
can be employed to achieve this objective. We will distinguish the
types of inconsistencies for which one might search, the appropriate
approaches to detecting these various types of inconsistencies, and the
potential difficulties in this formal approach to the detection of
inconsistencies.
The key concept will be that of using the data collected to form a
model of the relationship between selected sets of measurements and to
automatically detect the measurements or points in time when (1) the model
changes or (2) the data is least consistent with the model. Note that
the model need not be a prediction model or relate independent to depen-
dent variables. Any consistent relationships in the data can be employed
in detecting inconsistencies.
It is important to distinguish inconsistencies from extremes. An
extreme value of air pollution is not necessarily inconsistent—it may be
consistent with extreme meteorological conditions. If the model ade-
quately incorporates the extreme conditions, the extreme values would
be indicated as being consistent and not flagged. If, however, the ex-
treme conditions were not previously observed in the data base or not
otherwise represented by a similar condition in the data base, the ex-
treme conditions may not be incorporated in the model and may be flagged
as possible inconsistencies. We bring up these points to emphasize two
key concepts: (1) the intent of a consistency analysis is not to flag
simple extreme values but to flag values which are inconsistent, i.e.,
extreme and inconsistent values are not equivalent; (2) the intent of a
consistency analysis is to flag potential inconsistencies for inspection.
An inconsistency analysis will be successful if it does not miss key
inconsistencies that could seriously damage an empirical analysis or data
collection effort. It will not have failed if it also flags potential
inconsistencies which upon further examination are more accurately
categorized as extremes or unusual occurrences.
Let us structure these ideas more formally.
5.2 Formulation of Consistency Models
We imagine the basic situation of the simultaneous collection of
air quality and meteorological data, as well as possible adjunct data
depending upon the application (e.g., health effects data, emissions
data, etc.). Suppose the basic data is a sequence of measurements over
time of a number of variables:
    Measurement 1: x_1(t_1), x_1(t_2), ..., x_1(t_N) ,
    Measurement 2: x_2(t_1), x_2(t_2), ..., x_2(t_N) ,
    ...
    Measurement n: x_n(t_1), x_n(t_2), ..., x_n(t_N) .    (5-1)
There are three basic formulations of consistency models available.
Time Sequence Inconsistencies
The consistency of individual time series can be examined. The model
constructed can be a model which predicts the value at a given point in
time from past and future values of itself. An inconsistency will then
be detected as a significant discrepancy between the forecast and observed
value. That is, the model could be of the form
    \hat{x}_i(t_j) = F[x_i(t_1), ..., x_i(t_{j-1}), x_i(t_{j+1}), ..., x_i(t_N)] ,    (5-2)
where \hat{x}_i(t_j) is the value of x_i(t_j) predicted by the model. We emphasize
that since we are testing consistency rather than predicting behavior,
values occurring after the particular value tested can be used in the
model when available. While many time series techniques employ recursively
expressed predictor models, they imply a general dependence of the form
indicated.
An inconsistency would be a sufficiently large deviation between predicted
and measured values, i.e., a large value of
    |\hat{x}_i(t_j) - x_i(t_j)| .    (5-3)
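A minimal sketch of a time-sequence check follows. The model F used here, the average of the immediately preceding and following values, is a deliberately simple choice made for illustration, as are the threshold and the sample series; any model that uses both past and future values would serve.

```python
def flag_time_sequence(series, threshold):
    """Flag interior points that deviate from the mean of their two neighbors.

    The predictor is the average of the previous and next values (an
    assumed, deliberately simple model); indices whose absolute deviation
    from it exceeds the threshold are returned.
    """
    flags = []
    for j in range(1, len(series) - 1):
        predicted = 0.5 * (series[j - 1] + series[j + 1])
        if abs(predicted - series[j]) > threshold:
            flags.append(j)
    return flags

# Hypothetical series with one anomalous value at index 3.
readings = [10.0, 11.0, 12.0, 30.0, 14.0, 15.0, 16.0]
print(flag_time_sequence(readings, threshold=10.0))
```

Note that an anomalous value also distorts the predictions for its neighbors, so the threshold must be chosen with some care if adjacent points are not to be flagged as well.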
Cross Measurement Inconsistencies
This type of model is constructed by modeling the relationships between
measurements at a given point in time. An example is a derived
relationship between a vertical temperature difference and average wind
speed at the same time. Formally, such a model is of the form
    \hat{x}_i(t_j) = F[x_1(t_j), ..., x_{i-1}(t_j), x_{i+1}(t_j), ..., x_n(t_j)] .    (5-4)
An inconsistency would be detected by large values of (5-3), as before.
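A cross-measurement check can be sketched in the same spirit. The example below assumes a straight-line relationship between two concurrent measurements and flags points whose residual exceeds a multiple of the root-mean-square residual; the linear form, the multiplier of 2, and the data are all hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def flag_cross_measurement(xs, ys, k=2.0):
    """Flag times at which ys is inconsistent with its linear relation to xs."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    rms = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    return [j for j, r in enumerate(residuals) if abs(r) > k * rms]

# Hypothetical concurrent measurements: wind speed and a vertical
# temperature difference that normally falls as wind speed rises.
wind_speed = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
temp_diff = [9.0, 8.0, 7.0, 6.0, 12.0, 4.0, 3.0, 2.0]  # index 4 is anomalous
print(flag_cross_measurement(wind_speed, temp_diff))
```

The flagged value of 12 is not extreme by a wide margin on its own; it is flagged because it is inconsistent with the concurrent wind speed.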
Combined Model
In general, measurements will depend upon both past history and con-
current measurements. A full model would then be a technique which used
data both at other times and from other variables:
    \hat{x}_i(t_j) = F[x_1(t_1), ..., x_1(t_N); ...;
                       x_i(t_1), ..., x_i(t_{j-1}), x_i(t_{j+1}), ..., x_i(t_N); ...;
                       x_n(t_1), ..., x_n(t_N)] .    (5-5)
Note that in many cases it is neither easy nor important to categorize
the type of modeling being employed. It might be unclear, for example, in
what category one should place a model where the time slice was fairly
broad, as when monthly averages of daily values were compared to one
another. If the daily values are considered the basic data, then the
model is a combined model; if the monthly averages are considered the
basic data, then the model is a cross-measurement model. It is clearly
less important to categorize a model than to create and use it appropriately.
5.3 Types of Inconsistencies
There are several types of inconsistencies one might be interested
in detecting in the data:
1.	Abrupt, but persistent, changes;
2.	Slow nonstationarities; and
3.	Anomalous data (abrupt, nonpersistent changes).
Let us discuss these categories of problems and the formulation of models
for their solution.
Abrupt, Persistent Changes
The change in the data may occur suddenly in time, i.e., at an identifiable
point in time.
There are generally two types of abrupt, persistent changes of interest:
1. Malfunctioning measurement or recording devices - If a measuring
device suddenly begins to malfunction, it will generally continue
to malfunction until repaired or replaced. The motivation for
detecting such a problem is obvious. In the present categorization,
we mean by an abrupt, persistent change a change
in the underlying model which occurs over a relatively short
period of time. This is as distinguished from slow changes
or short-term changes.
2. Changes in the system - We refer to major changes in the system
which occur over a short period of time such as the opening of
a new freeway or the opening of a major indirect source. As
well as permanent changes, there may be temporary but signifi-
cant changes, such as if a city were to host the Olympic Games.
Without specific attention to such events, the conclusions of
an analysis could be misleading. The analysis of this type
of abrupt change has been called "intervention analysis" by
Box and Tiao [16].
There is also clearly a matter of degree. An event can have a rela-
tively mild effect, as might the closing of several on-ramps to a freeway.
One output of a consistency analysis should be a measurement of the de-
gree of inconsistency.
This category of inconsistency has the basic character of having
a significantly different relationship between variables in the time
periods before the event and after the event. The point in time sepa-
rating the two periods is assumed unknown (since the purpose of a con-
sistency analysis is to discover such points).
The first of two basic technical approaches to this problem consists of
creating a series of models and searching for a statistically significant
change in model structure or parameters. One may create a model over
the interval [t_1, t_k] and predict x(t_{k+1}). If the prediction is
consistent with observation, then a model over [t_1, t_{k+1}] is created
to predict x(t_{k+2}), and so on, until a discrepancy occurs. A simple
modeling technique or recursive procedure is probably a requirement if
a high computing cost is to be avoided.
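The sequential procedure can be sketched with the simplest possible model, the mean of all values observed so far. The choice of model, the threshold, and the data are assumptions made for illustration; a regression or time-series model would take the place of the mean in practice.

```python
def first_discrepancy(series, threshold, start=3):
    """Return the first index whose value the growing-window model fails to predict.

    The model over the first k values is simply their mean (an assumed,
    minimal choice). Returns None if no discrepancy is found.
    """
    for k in range(start, len(series)):
        history = series[:k]
        predicted = sum(history) / len(history)
        if abs(series[k] - predicted) > threshold:
            return k
        # Otherwise the window grows by one point and the model is rebuilt.
    return None

# Hypothetical record whose level shifts abruptly at index 6.
record = [5.0, 6.0, 5.0, 6.0, 5.0, 6.0, 11.0, 12.0, 11.0, 12.0, 11.0, 12.0]
print(first_discrepancy(record, threshold=3.0))
```

Because the mean can be updated recursively from one step to the next, this kind of model keeps the computing cost low, in line with the remark above.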
The second approach does not require as abrupt a change as the first
but may be more computationally demanding. Here, one can create two models,
one for the period [t_1, t_k] and one for the period [t_{k+1}, t_N]. One can
calculate an appropriate measure of the difference in the models, say D_k.
Repeating this for varying breakpoints t_k, one can determine the value at which
the difference is maximized, presumably the point when the change
occurred.
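The breakpoint scan can be sketched as follows, again with the segment mean standing in for the model and the absolute difference of the two means standing in for D_k. Both choices, the minimum segment length, and the data are hypothetical.

```python
def best_breakpoint(series, min_segment=3):
    """Scan candidate breakpoints and return (k, D_k) with the largest D_k.

    The first k values form one segment and the remainder the other.
    D_k is the absolute difference of the two segment means, an assumed
    stand-in for a more general measure of model difference.
    """
    best_k, best_d = None, -1.0
    for k in range(min_segment, len(series) - min_segment + 1):
        before, after = series[:k], series[k:]
        d = abs(sum(after) / len(after) - sum(before) / len(before))
        if d > best_d:
            best_k, best_d = k, d
    return best_k, best_d

# Hypothetical record whose level shifts abruptly at index 6.
record = [5.0, 6.0, 5.0, 6.0, 5.0, 6.0, 11.0, 12.0, 11.0, 12.0, 11.0, 12.0]
print(best_breakpoint(record))
```

Each candidate breakpoint requires two model fits, which is why this approach costs more than the sequential one.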
Slow Nonstationarities
Many types of change will occur gradually over a period of time.
For example, the retrofitting of emission control devices in automobiles
in California was mandated by law to occur in a month-by-month fashion
depending upon the digit of the car owner's license plate. The slow in-
troduction of the retrofitting might affect the time sequence of air
quality measurements. Another example is a slow but significant drift
in a measuring instrument. Such an inconsistency would be detected as
a systematic change in the appropriate model over time as opposed to an
abrupt inconsistency.
As with abrupt changes, categorizations of slow nonstationarities
are possible. They may be related either to measurement device drift or
to changes in the system, and they may be either temporary or permanent.
(An example of a temporary but slow nonstationarity is a slow but defi-
nite degradation in the degree of compliance with the 55-miles-per-hour
speed limit.)
The most straightforward approach to this problem is to postulate
the form of the nonstationarity and test for it. For example, two
air-quality monitoring stations near each other might measure the same
pollutant, recording x_1(t) and x_2(t), respectively. One could then do a
linear regression of day-to-day changes of the stations against one
another, i.e., find the best-fit linear relationship between
    v_1(t_k) = x_1(t_k) - x_1(t_{k-1})
and
    v_2(t_k) = x_2(t_k) - x_2(t_{k-1})
for k = 2, 3, ..., N. The result will be of the form
    v_1 = a v_2 + b .
One can then test statistically whether b is significantly different from
zero. If it is, the values measured by one station are drifting relative
to the other. Unless this can be explained by a constantly increasing
(or decreasing) emission source affecting one of the stations selectively,
it is an inconsistency.
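The drift test can be sketched as below. The station records are hypothetical, with a drift of 0.5 units per day built into the first station, and the test statistic is the usual ratio of the intercept to its standard error from simple linear regression. Treating a magnitude above roughly 2 as significant is a rule of thumb assumed here; the exact critical value depends on the number of observations.

```python
import math

def drift_test(x1, x2):
    """Regress day-to-day changes of x1 on those of x2.

    Returns (a, b, t_b): slope, intercept, and the intercept divided by
    its standard error. Requires at least four observations and a
    nonzero residual variance.
    """
    v1 = [x1[k] - x1[k - 1] for k in range(1, len(x1))]
    v2 = [x2[k] - x2[k - 1] for k in range(1, len(x2))]
    n = len(v1)
    m1 = sum(v1) / n
    m2 = sum(v2) / n
    s22 = sum((v - m2) ** 2 for v in v2)
    s12 = sum((u - m1) * (v - m2) for u, v in zip(v1, v2))
    a = s12 / s22
    b = m1 - a * m2
    sse = sum((u - (a * v + b)) ** 2 for u, v in zip(v1, v2))
    s = math.sqrt(sse / (n - 2))                      # residual standard error
    se_b = s * math.sqrt(1.0 / n + m2 ** 2 / s22)     # standard error of b
    return a, b, b / se_b

# Hypothetical records: station 1 tracks station 2 but drifts upward by
# 0.5 per day, with small measurement noise added.
station_2 = [10.0, 12.0, 9.0, 11.0, 13.0, 10.0, 12.0, 9.0, 11.0, 14.0]
noise = [0.1, -0.1, 0.05, -0.05, 0.1, -0.1, 0.0, 0.05, -0.05, 0.1]
station_1 = [x + 0.5 * t + e for t, (x, e) in enumerate(zip(station_2, noise))]

a, b, t_b = drift_test(station_1, station_2)
print(a, b, t_b)
```

Here the intercept estimates the built-in drift per day, and its large ratio to the standard error marks the drift as significant.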
Another approach is to compare a model created on [t_1, t_k] with a
model created on [t_{k+\delta}, t_N], where the time gap \delta between periods
modeled is sufficient to detect a slow drift. This approach requires fewer
assumptions regarding the form of a possible nonstationarity.
Anomalous Data
This type of inconsistency might be categorized as a "noisy" measurement.
It could be caused by erroneous recording or digitization of the
data, by a temporarily malfunctioning instrument, or by an anomalous
occurrence such as might be caused by sidewalk repairs raising dust near a
site monitoring suspended particulate levels. Such an occurrence is a
short-term abrupt inconsistency in either a time sequence or cross-measurement
model. It is a relatively conventional type of problem encountered
in data analysis and is often referred to as "outlier analysis."
This problem can be approached in the single variable case by
studying extreme values detected by creating a histogram (the empirical
distribution) of measured values. The more variables measured, the
greater the potential for outliers which are not obvious by looking at
individual variables. (The classical example is the existence of a
"pregnant male" in a medical data base; neither "pregnant" nor "male"
is illegal, only the combination.) In the multivariate case, the most
general class of techniques for detecting outliers is "cluster analysis" [17].
Very small clusters of points or single-point clusters in multivariate
space are inconsistencies which should be examined.
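The multivariate case can be sketched with a very simple clustering rule: points closer than a chosen radius are linked, linked points form a cluster, and clusters at or below a size limit are flagged. This single-linkage rule is one assumed choice among the many techniques of cluster analysis, and the radius and data are hypothetical.

```python
import math

def small_clusters(points, radius, max_size=1):
    """Return indices of points belonging to clusters of at most max_size.

    Points within `radius` of each other are linked (single linkage,
    implemented with a union-find structure).
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= radius:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(i for members in clusters.values()
                  if len(members) <= max_size for i in members)

# Hypothetical two-variable observations. The last point has an ordinary
# value of each variable taken separately; only the combination is unusual.
observations = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1),
                (5.0, 5.0), (5.1, 4.9), (4.9, 5.2),
                (1.0, 5.0)]
print(small_clusters(observations, radius=1.0))
```

The flagged point would pass a histogram check on either variable alone, which is the situation the "pregnant male" example describes.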
5.4 Difficulties
The major technical difficulties in consistency analysis are, first,
nonlinearities and, second, lack of data relative to the number of variables
whose relation is to be modeled. Most air quality and meteorological
parameters are nonlinearly related. Further, it often takes a
large number of variables to determine with accuracy other meteorological
or air-quality variables. This means that the diversity of joint obser-
vations of values of a large number of variables that one can expect in
a given data base or at the start of a measurement program is limited.
Compounding the problem, nonlinear models will, in general, require more
parameters than linear models and, hence, require more data for accurate
model determination.
These problems can be alleviated by both technical and operational
solutions. A technical consideration is that an efficient (low-parameter)
nonlinear form will require less data for the determination of the model
than an inefficient (overparameterized) nonlinear form; hence, efficient
functional forms, such as continuous piecewise linear functions, can help
alleviate this problem. A second technical point is that a set of models
of relatively simple form can be created with subsets of the relevant
variables.
The operational consideration is that one may be able to tolerate
a high level of "false alarms" in detecting inconsis-
tencies at the beginning of a data collection project or in analyzing a
data base in the initial stages. It is at this early point in most data
collection or data analysis efforts that most of the problems are en-
countered. As more data is collected, the model will become more re-
fined and flag fewer potential inconsistencies.
Another possible problem is the inclusion of inconsistencies into
the model. Without care, the data can be modeled including inconsis-
tencies in such a way that the inconsistencies are fitted and do not be-
come apparent as a discrepancy in the model. This pitfall can be avoided
by simply employing good data-analytic practices to avoid overfitting.
For many projects in data collection and analysis, the use of con-
ventional tools in a careful manner can provide a systematic analysis of
consistency which may avoid erroneous analyses and a great deal of wasted
effort.
6.0	REPRO-MODELING: EMPIRICAL APPROACHES TO THE UNDERSTANDING AND
EFFICIENT USE OF COMPLEX AIR QUALITY MODELS
Several computer-based mathematical models derived from basic
physical principles have been constructed to model air pollution and
meteorological phenomena. The diversity of inputs to such models and the
typically long running times often make it difficult to understand the
full implications of the models or to use the models in certain planning
applications where large numbers of alternatives must be rapidly evalu-
ated. The concept of "repro-modeling" is to treat a model as a source
of data for an empirical analysis [18]. Such an analysis will, in general,
have two major objectives:
1.	To understand the implication of the model by discovering
which variables most affect the outputs of interest and
in what way they affect the outputs of interest; and
2.	To construct as a simple functional form a model of the
relationship between key independent variables and key
model outputs.
Since this approach has been a subject of a previous EPA contract,
in which the technique of repro-modeling was applied to a reactive dis-
persive model of photochemical pollutant behavior in the Los Angeles
basin [5], we will not discuss it in further depth in this report. We
do wish to emphasize the role of such an analysis in evaluating and
validating models, as well as in suggesting to modelers the charac-
teristics which a current version of the model implies which might bear
further investigation.
One point in earlier discussions of repro-modeling which has not
been emphasized is its use in model validation and sensitivity analysis.
Often sensitivity analysis is performed on models in order to determine
which parameters of the model are most critical in determining the model
output [19]. The change in model output with a small change in a given
parameter or input value is the sensitivity of the model to that param-
eter. Since the sensitivity of a model to a particular parameter will,
in general, depend upon the values of the other parameters, classical
sensitivity analysis is usually performed in one of two ways:
1.	One set of typical values for the parameters and inputs is
chosen, and small changes in the parameters about that
nominal condition are made in order to examine
sensitivity. This obviously indicates only the sensitivity
at the particular nominal condition chosen.
2.	A "factorial" analysis is performed, where a number of diverse
nominal values are chosen and the above analysis repeated for
each of these conditions. This exercises the
full range of potential operation of the model, but creates
the problem of summarizing the implications of what are
often thousands of model runs. It also has the obvious dis-
advantage of requiring a large number of model runs.
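The first ("local") procedure can be sketched as a central-difference calculation about a nominal point. The function `local_sensitivity`, the step-size rule, and the two-parameter `toy_model` are hypothetical and serve only to show how the result depends on the nominal condition chosen.

```python
def local_sensitivity(model, nominal, index, rel_step=1e-3):
    """Central-difference estimate of d(output)/d(parameter[index]) at `nominal`."""
    step = rel_step * max(abs(nominal[index]), 1.0)
    up = list(nominal)
    down = list(nominal)
    up[index] += step
    down[index] -= step
    return (model(up) - model(down)) / (2.0 * step)

def toy_model(p):
    """Hypothetical two-parameter model; the p[0]*p[1] term couples the parameters."""
    return p[0] * p[1] + p[1] ** 2

# Sensitivity to the first parameter at two different nominal conditions.
s_a = local_sensitivity(toy_model, [2.0, 3.0], index=0)
s_b = local_sensitivity(toy_model, [2.0, 5.0], index=0)
```

The two estimates differ because the sensitivity to the first parameter equals the value of the second; a single nominal point cannot reveal this.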
If one is willing to perform a given number of model runs to get
a number of nominal points for a sensitivity analysis, it is more ef-
ficient, rather than to do a sensitivity analysis at each point, to fit
the points with an appropriate functional form such as a continuous
piecewise linear form [5]. As demonstrated in the referenced report,
this results in regimes in which the model output is a linear function
of the model inputs and/or parameters and the sensitivity to those pa-
rameters and inputs is quite clearly displayed. This approach automati-
cally determines those regimes in which the sensitivity is relatively
constant over a large area of parameter/input variations. This "global"
sensitivity analysis approach can be more easily interpreted and more
efficient than a "local" sensitivity analysis approach.
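As an illustration of the "global" approach, the sketch below fits a one-hinge continuous piecewise linear form to a set of model runs and reads the sensitivity in each regime directly from the fitted slopes. The function `toy_model` is a hypothetical stand-in for an expensive simulation, and the grid search over candidate breakpoints is an assumption made for brevity; it is not the fitting procedure of [5].

```python
import numpy as np

def toy_model(x):
    """Hypothetical stand-in for an expensive simulation run."""
    return np.where(x < 0.4, 1.0 + 0.5 * x, 1.2 + 3.0 * (x - 0.4))

def fit_hinge(x, y, knots):
    """Least-squares fit of y ~ b0 + b1*x + b2*max(0, x - knot).

    The knot is chosen from the candidate grid `knots` by minimum squared error.
    Returns the knot and the slope on each side of it.
    """
    best = None
    for knot in knots:
        basis = np.column_stack([np.ones_like(x), x, np.maximum(0.0, x - knot)])
        coef, *_ = np.linalg.lstsq(basis, y, rcond=None)
        err = float(np.sum((basis @ coef - y) ** 2))
        if best is None or err < best[0]:
            best = (err, knot, coef)
    _, knot, coef = best
    return knot, coef[1], coef[1] + coef[2]

x_runs = np.linspace(0.0, 1.0, 21)           # 21 model runs
y_runs = toy_model(x_runs)
knot, slope_low, slope_high = fit_hinge(x_runs, y_runs, np.linspace(0.1, 0.9, 9))
```

The two slopes are the sensitivities in the two regimes, and the knot marks where the regime changes, so one fit summarizes what would otherwise require a separate perturbation study at every run.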
7.0	OTHER APPLICATION AREAS
Three additional topics are treated briefly here. The brevity is
not related to a judgment of importance, but simply to the limited
nature of the remarks.
7.1	Spatial Interpolation of Meteorological and Air Quality Measurements
Several recent studies have adopted a simple interpolation formula
to construct continuous wind fields. (The approach is applicable to
the interpolation of other quantities as well.) This formula includes
every monitoring station with the weight of each measurement inversely
proportional to the distance to the monitoring station location raised
to a power. More explicitly, the interpolation formula is

$$V_j = \frac{\sum_{i=1}^{n} v_i / R_{ij}^{a}}{\sum_{i=1}^{n} 1 / R_{ij}^{a}} \tag{7-1}$$

where there are n measurements within a prespecified distance of the
point j and where v_i is the measurement at the i-th location; R_ij is the
distance between point i and j, and a is the exponent. This formula
is applied separately to each vector component of the wind vector and
the two resulting estimates are combined to recover an interpolated
wind speed and direction. The value of a has been chosen to be either
1 or 2 in previous studies.
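A minimal sketch of this interpolation, applied to the two wind components and then converted back to speed and direction, might read as follows. The function names, the tuple layout of `stations`, the `max_dist` cutoff argument, and the direction convention (degrees counterclockwise from the +x axis, pointing along the vector) are all assumptions made for illustration.

```python
import math

def idw_wind(stations, point, a=2.0, max_dist=float("inf")):
    """Interpolate a wind vector at `point` with inverse-distance weights, as in (7-1).

    stations: list of (x, y, u, v), where (u, v) are the wind components.
    """
    num_u = num_v = den = 0.0
    for x, y, u, v in stations:
        r = math.hypot(x - point[0], y - point[1])
        if r == 0.0:
            return u, v              # the point coincides with a station
        if r > max_dist:
            continue                 # outside the prespecified distance
        w = 1.0 / r ** a
        num_u += w * u
        num_v += w * v
        den += w
    if den == 0.0:
        raise ValueError("no stations within max_dist")
    return num_u / den, num_v / den

def speed_and_direction(u, v):
    """Speed, and direction in degrees counterclockwise from the +x axis."""
    return math.hypot(u, v), math.degrees(math.atan2(v, u)) % 360.0

stations = [(0.0, 0.0, 2.0, 0.0), (2.0, 0.0, 0.0, 2.0)]
u, v = idw_wind(stations, (1.0, 0.0))
speed, direction = speed_and_direction(u, v)
```

Midway between two stations the weights are equal, so the estimate is the plain average of the two vectors regardless of the exponent.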
This approach is closely related to some recent Russian work
[20,21,22] and work by M. Rosenblatt [23]. In these papers the concept
of nonlinear regression is explored by means of kernel functions and
density estimates. The use of these methods in the wind field problem
would involve estimates of the type

$$V(x) = \sum_{i=1}^{N} v_i K_i(x - x_i) \tag{7-2}$$

where x_i is the location of the i-th station and v_i the measured wind
vector at the i-th station. The kernel functions K_i(y) have generally
been taken, in the statistical literature, to be the same for all i and
generally to be a smooth Gaussian-type function with the sharpness of
its peak determined by a shape parameter σ. Equation (7-1) fits the
formulation using instead an inverse-distance kernel function with shape
parameter a. The referenced papers and on-going research in probability
density estimation are thus relevant to a deeper understanding of the
implications of using (7-1) and to the development of alternative ap-
proaches.
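For comparison with the inverse-distance form, a Gaussian-kernel estimate can be sketched as below. Dividing by the sum of the kernel values is an assumption: it makes the estimate a weighted average in the same way that the denominator of (7-1) does, and corresponds to the normalized regression estimate studied in [20]. The function name and the array layout are likewise illustrative.

```python
import numpy as np

def kernel_estimate(x, station_xy, station_v, sigma=1.0):
    """Normalized Gaussian-kernel estimate of the wind vector at location x.

    station_xy: (N, 2) station locations; station_v: (N, 2) measured wind vectors.
    sigma is the shape parameter controlling the sharpness of the kernel peak.
    """
    d2 = ((station_xy - x) ** 2).sum(axis=1)      # squared distance to each station
    k = np.exp(-0.5 * d2 / sigma ** 2)            # same kernel for every station
    return (k[:, None] * station_v).sum(axis=0) / k.sum()

station_xy = np.array([[0.0, 0.0], [2.0, 0.0]])
station_v = np.array([[2.0, 0.0], [0.0, 2.0]])
v_mid = kernel_estimate(np.array([1.0, 0.0]), station_xy, station_v)
```

Unlike the inverse-distance kernel, the Gaussian kernel stays finite at a station location, so the estimate smooths the measurements rather than reproducing them exactly.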
7.2 Health Effects of Air Pollution
Empirical approaches (in particular, linear and nonlinear regression
techniques) have been employed in estimating the effects of air pollution
levels on health. The main difficulty encountered in this type of anal-
ysis is that of determining an incremental effect on respiratory health
measurements which are often dominated by vagaries of general health prob-
lems such as flu epidemics or of individual differences such as the habit
of smoking or occupational environment. Yet, very strong relations must
be derived if causal effects are implied. In such conditions, the best
hope for improvement is in more highly controlled data collection efforts
(which are, however, very expensive).
This situation highlights an important aspect of data analysis pro-
jects: A legitimate result of the analysis is a negative conclusion, a
conclusion that the data does not admit of reliable results. A negative
result is constructive to the degree that it makes the strong statement
that the information desired is not present in the data; this settles the
matter unless the data base is augmented. A less conclusive culmination
of a data analysis effort is a limited negative statement, for example,
a conclusion that no linear function of the independent variables predicts
the desired variable with statistically significant accuracy.
We note, however, that a negative conclusion does not necessarily im-
ply a faulty data collection effort; it may instead imply that the rela-
tionship of interest is less pronounced than initially expected relative
to the effect of uncontrolled (or unmeasured) variables. Unfortunately,
a well-conceived data analysis or collection effort is often labeled a
failure when only negative results are produced--a charge which implies
that the knowledge which the study was designed to elicit should have
been obvious before the data was collected.
7.3 Short-term Forecasting of Pollutant Levels
Forecasting the next day's pollutant levels is important for health
warning systems and for initiating short-term control procedures.
Forecasting pollution levels and forecasting the weather are closely re-
lated problems; it is not clear which is the more difficult, but certainly
neither is easy. The empirical approach attempts to model directly the
relation implicit in measured meteorological and air-quality data.
Persistence (i.e., assuming tomorrow's peak pollutant concentration
equals today's peak concentration) usually proves a reliable forecast at
lower pollution levels, but not necessarily at high levels when accuracy
is most critical [13]. Certainly persistence will not predict a high
pollutant level on a day following a low pollutant level. Regression
or time-series approaches tend to exploit persistence and may not be best
suited to a situation where the determinants of the future pollution level
can be considerably different depending on the level. Further, the per-
formance estimate can be misleadingly high due to the number of low or
intermediate pollution days usually included in the analysis.
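The effect on the performance estimate can be seen by scoring the persistence forecast separately for low and high target days. In the sketch below, the daily peak values, the threshold separating "high" days, and the tolerance defining a correct forecast are all invented for illustration.

```python
def persistence_hit_rate(peaks, threshold, tolerance):
    """Score the persistence forecast separately for low and high target days.

    A forecast counts as a hit when today's peak is within `tolerance`
    of tomorrow's peak. Days are grouped by tomorrow's (the target) level.
    """
    hits = {"low": 0, "high": 0}
    totals = {"low": 0, "high": 0}
    for today, tomorrow in zip(peaks, peaks[1:]):
        level = "high" if tomorrow >= threshold else "low"
        totals[level] += 1
        if abs(tomorrow - today) <= tolerance:
            hits[level] += 1
    return {k: hits[k] / totals[k] for k in totals if totals[k]}

# Invented daily peak concentrations (arbitrary units).
peaks = [5, 6, 5, 7, 6, 20, 8, 6, 5, 22, 21, 6]
rates = persistence_hit_rate(peaks, threshold=15, tolerance=3)
```

In this invented series, persistence is right on most low days and wrong on most high days, yet the pooled hit rate (7 of 11) looks respectable because low days dominate the sample.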
Classification analysis is probably a more natural approach to the
problem. The joint distribution of attributes (i.e., descriptive variables)
of high-pollution days can be derived by looking at high-pollution days
alone and can be compared to the joint distribution of attributes of in-
termediate days and to the joint distribution of attributes of low-pollution
days. The variables of importance in distinguishing the three classes can be
determined, and an algorithm to predict the classes can be derived.
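One very simple form such an algorithm could take is a nearest-class-mean rule, sketched below. The two attributes (taken here to be a temperature and a wind speed), the training values, and the function names are hypothetical, and the rule is only a stand-in for a classifier built from the full joint distributions. In practice the attributes would first be scaled to comparable units.

```python
import numpy as np

def fit_class_means(features, labels):
    """Mean attribute vector of each class, computed from that class's days alone."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    return {c: features[labels == c].mean(axis=0) for c in np.unique(labels).tolist()}

def predict_class(means, x):
    """Assign x to the class whose mean attribute vector is closest."""
    x = np.asarray(x, dtype=float)
    return min(means, key=lambda c: float(np.sum((means[c] - x) ** 2)))

# Hypothetical attributes per day: [temperature, wind speed].
features = [[18, 6.0], [20, 5.0], [25, 3.5], [27, 3.0], [33, 1.5], [35, 1.0]]
labels = ["low", "low", "intermediate", "intermediate", "high", "high"]
means = fit_class_means(features, labels)
hot_calm = predict_class(means, [32, 2.0])
cool_windy = predict_class(means, [19, 6.0])
```

Because each class is described from its own days, the high-pollution class is not swamped by the more numerous low and intermediate days.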
8.0	REFERENCES
1.	Meisel, W. S., "Empirical Approaches to Air Quality and Meteoro-
logical Modeling," Proc. of Expert Panel on Air Pollution Modeling,
NATO Committee on Challenges of Modern Society, [location illegible], June 6,
1974. (This document may be obtained from the Air Pollution Tech-
nical Information Center, Office of Air and Water Programs, Environ-
mental Protection Agency, Research Triangle Park, North Carolina 27711.)
2.	Calder, K. L., "Some Miscellaneous Aspects of Current Urban Pollution
Models," Proc. Symp. on Multiple Source Urban Diffusion Models, EPA,
Research Triangle Park, North Carolina, 1970.
3.	Hrenko, J. M., and D. B. Turner, "RAM: Real-Time Air-Quality Simulation
Model," EPA, Research Triangle Park, North Carolina (Preliminary draft,
July 12, 1974).
4.	Calder, K. L., "The Feasibility of Formulation of a Source-Oriented Air
Quality Simulation Model that Uses Atmospheric Dispersion Functions
Empirically Derived from Joint Historical Data for Air Quality and
Pollutant Emissions," EPA, Research Triangle Park, North Carolina
(draft, August 1974).
5.	Horowitz, Alan, and W. S. Meisel, "The Application of Repro-Modeling
to the Analysis of a Photochemical Air Pollution Model," EPA Report
No. EPA-650/4-74-001, NERC, Research Triangle Park, North Carolina,
December 1973.
6.	Calder, K. L., "A Narrow Plume Simplification for Multiple Source
Urban Pollution Models" (informal unpublished note), December 31, 1969.
7.	"ARB Oxidant Readings to Be Adjusted Downward," Calif. ARB Bulletin,
Vol. 5, No. 8 (September 1974), pp 1-2.
8.	"Calibration Report: LAAPCD Method More Accurate; ARB More Precise,"
Calif. Air Resources Board Bulletin, Vol. 5, No. 11 (December 1974),
pp 1-2.
9.	Tiao, G. C., G. E. P. Box, and W. J. Hamming, "Analysis of Los
Angeles Photochemical Smog Data: A Statistical Overview," Tech-
nical Rept. No. 331, Dept. of Statistics, U. of Wisconsin, April 1973.
10.	Tiao, G. C., et al., "Los Angeles Aerometric Ozone Data 1955-1972,"
Technical Rept. No. 346, Dept. of Statistics, U. of Wisconsin,
October 1973.
11.	Bruntz, S. M., W. S. Cleveland, B. Kleiner and J. L. Warner, "The
Dependence of Ambient Ozone on Solar Radiation, Wind, Temperature,
and Mixing Height," Proc. Symp. on Atmospheric Diffusion and Air
Pollution, Santa Barbara, Calif., September 9-13, 1974, American
Meteorological Society, Boston, Mass.
12.	Smith, F. B., and G. H. Jeffrey, "The Prediction of High Concentra-
tions of Sulphur Dioxide in London and Manchester Air," Proc. 3rd
Meeting of NATO/CCMS Expert Panel on Air Pollution Modeling, Paris,
13.	Horowitz, A. J., and W. S. Meisel, "On Time-Series Models in the
Short-term Forecasting of Air Pollution Concentrations," Technology
Service Corporation Report No. TSC-74-DS-101, Santa Monica, Calif.,
August 22, 1974.
14.	Breiman, Leo, and W. S. Meisel, "General Estimates of the Intrinsic
Variability of Data in Nonlinear Regression Models," TSC Report,
Technology Service Corp., Santa Monica, Calif., October 1974.
15.	Gronskei, K. E., E. Joranger and F. Gram, "Assessment of Air Quality
in Oslo, Norway," Published as Appendix D to the NATO/CCMS Air Pol-
lution Document "Guidelines to Assessment of Air Quality (Revised)
SOx, TSP, CO, HC, NOx Oxidants," Norwegian Institute for Air Research,
Kjeller, Norway, February 1973. (This document may be obtained from
the Air Pollution Technical Information Center, Office of Air and
Water Programs, Environmental Protection Agency, Research Triangle
Park, North Carolina.)
16.	Box, G. E. P., and G. C. Tiao, "Intervention Analysis with Applications
to Economic and Environmental Problems," Technical Report No. 335,
Department of Statistics, University of Wisconsin, Madison, Oct. 1973.
17.	"Cluster Analysis," Chapter VIII of W. S. Meisel, Computer-Oriented
Approaches to Pattern Recognition, Academic Press, 1972.
18.	Meisel, William S., and D. C. Collins, "Repro-Modeling: An Approach
to Efficient Model Utilization and Interpretation," IEEE Transactions
on Systems, Man, and Cybernetics, Vol. SMC-3, No. 4, July 1973,
pp 349-358.
19.	Thayer, S. D., and R. C. Koch, "Sensitivity Analysis of the Multiple-
Source Gaussian Plume Urban Diffusion Model," Preprint volume, Con-
ference on Urban Environment, October 31-Nov. 2, 1972, Philadelphia,
Pennsylvania (published by American Meteorological Society, Boston,
Mass.).
20.	Nadaraya, E. A., "On Estimating Regression," Theor. Probability Appl.,
Vol. 4, pp 141-142, 1965.
21.	Nadaraya, E.A., "On Non-parametric Estimates of Density Functions
and Regression Curves," Theor. Probability Appl., Vol. 5, pp 186-190,
1965.			
22.	Nadaraya, E.A., "Remarks on Non-parametric Estimates for Density
Functions and Regression Curves," Theor. Probability Appl., Vol. 15,
pp 134-137, 1970.		
23.	Rosenblatt, M., "Conditional Probability Density and Regression
Estimators," Multivariate Analysis, Vol. II, pp 25-31, Academic
Press, New York, 1969.
-------