EMPIRICAL METHODS IN THE ANALYSIS OF AIR QUALITY AND METEOROLOGICAL PROBLEMS

William S. Meisel

Interim Report
December 1974

Technology Service Corporation
2811 Wilshire Boulevard
Santa Monica, California 90403
(213) 829-7411

DRAFT: For Internal EPA Review

Contract No. 68-02-1704

EPA Project Officer: Ken Calder
Meteorology Laboratory
National Environmental Research Center
Research Triangle Park, North Carolina 27711

Prepared for
OFFICE OF RESEARCH AND DEVELOPMENT
U.S. ENVIRONMENTAL PROTECTION AGENCY
WASHINGTON, D.C. 20460

PREFACE

This interim report serves two functions: it is (1) an outline of the proposed Phase II projects, for comment, and (2) a draft of the first volume of the three-volume final report:

I. Empirical Methods in the Analysis of Air Quality and Meteorological Problems
II. A Source-Oriented Empirical Model of the Dispersion of Air Pollutants
III. The Oxidant Formation Process in the Los Angeles Basin: An Empirical Analysis

This volume will be revised to serve as the introductory volume indicated; the revised version will be delivered with the final report.

Discussions with EPA personnel led to inclusion of many of the subjects dealt with in this report. The project monitor, Ken Calder, took a particularly active and constructive role. Advice from Leo Breiman and Alan Horowitz at Technology Service Corporation further improved the report.

TABLE OF CONTENTS

1.0 INTRODUCTION
2.0 A SOURCE-ORIENTED EMPIRICAL MODEL
    2.1 Motivation
    2.2 Formulation
    2.3 Feasibility
    2.4 The Inverse Problem
    2.5 Testing the Approach
    2.6 Research Plan
3.0 EMPIRICAL ANALYSIS OF THE OXIDANT FORMATION PROCESSES IN THE LOS ANGELES BASIN
    3.1 Motivation
    3.2 The Data
    3.3 The Problem Formulation
    3.4 Research Plan
4.0 EXTRACTION OF EMISSION TRENDS FROM AIR QUALITY TRENDS
    4.1 Motivation
    4.2 Report of a Comparison of Emission Levels over Two Time Periods
    4.3 Generalization and Mathematical Formulation
5.0 DETECTION OF INCONSISTENCIES IN AIR QUALITY/METEOROLOGICAL DATA BASES
    5.1 Motivation
    5.2 Formulation of Consistency Models
    5.3 Types of Inconsistencies
    5.4 Difficulties
6.0 REPRO-MODELING: EMPIRICAL APPROACHES TO THE UNDERSTANDING AND EFFICIENT USE OF COMPLEX AIR QUALITY MODELS
7.0 OTHER APPLICATION AREAS
    7.1 Spatial Interpolation of Meteorological and Air Quality Measurements
    7.2 Health Effects of Air Pollution
    7.3 Short-term Forecasting of Pollutant Levels
8.0 REFERENCES

1.0 INTRODUCTION

The increased availability of appropriate data bases and improvements in methodology have led to the increasing use of empirical and statistical approaches to the analysis of air quality and meteorological problems [1]. This volume suggests how these approaches might be applied to a number of problems of interest to the Environmental Protection Agency, particularly those with a meteorological aspect.

The objectives of this report are limited. The applications discussed at greatest length are those where empirical approaches have not been fully exploited or where the problem can be formulated in an innovative manner. The discussions are intended to highlight opportunities rather than to provide detailed plans for problem solution. As part of the present project, two problem areas will be explored as pilot studies to demonstrate modern empirical techniques more fully; these projects will be reported in separate volumes.
The subjects discussed in this volume are the following:

A source-oriented empirical model: It has often been assumed that it is impractical to derive an empirical model relating emission source distribution and meteorology to the resulting pollutant concentration distribution. The basic argument against an empirical approach has been the dearth of detailed emission inventories in comparison to the relative abundance of air quality data. We postulate an approach whereby an empirical meteorological dispersion function can be derived by indirect means.

Empirical analysis of the oxidant formation processes: Empirical approaches to the problem of determining the relationship between oxidant precursor (HC and NOx) concentrations and resulting ambient oxidant levels are discussed.

Extraction of emission trends from air quality trends: The estimation of air quality trends from air quality measurements is complicated by the effect of meteorology. We discuss the determination of a "meteorologically adjusted" trend, i.e., a trend more nearly related to the emissions trend.

Detection of inconsistencies in air quality/meteorological data bases: In any data collection or data analysis effort, a major concern is the integrity of the data. It is important to detect problems with monitoring equipment or monitoring methods and to note any important changes in the system monitored, so that such errors or changes do not distort the analysis of the data or invalidate a portion of the data collected. We discuss automatic procedures for detecting inconsistencies.
Empirical approaches to the understanding and efficient use of complex air quality models: Computer-based models derived from physical principles are tools which often should themselves be analyzed, whether to extract their implications, to model aspects of their behavior so as to reduce input data requirements and running time, to validate them, or to suggest further areas for model improvement. Model-generated input/output data can be so analyzed by empirical techniques.

Spatial interpolation of meteorological and air quality measurements: Interpolation of variables such as wind field or pollutant concentration is of interest in several applications. We discuss some general aspects of this problem.

Health effects of air pollution: We comment on this area, in which empirical approaches are at present heavily employed.

Short-term pollutant level forecasting: Short-term forecasting for health warning systems or to invoke temporary controls can be approached empirically. Several pitfalls are highlighted.

In discussions of the above subjects, the attempt is to formulate each problem so that it reduces to a straightforward data-analytic technique. The techniques of empirical analysis which are referenced include the following:

1. Hypothesis testing, statistical modeling, and other "classical" statistical approaches: These "classical" approaches are by no means without their subtleties or potential for misapplication, but are the subject of many textbooks and conventional statistics courses.

2. Linear and nonlinear regression: These techniques fit a function to data to model the relationship between independent variables and an ordered, many-valued dependent variable. Linear regression, in general, and nonlinear regression in a single independent variable are well understood and often used.
Nonlinear regression with multiple independent variables, particularly for the small-sample case, is more difficult, but significant technical progress has been made in the last few years.

3. Time-series analysis: Time-series analysis takes advantage of the serial nature of the data and presumably of the underlying model. The subject has been studied for many years (sometimes as "signal processing"), but in recent years new developments have arisen and the subject has been treated more systematically. The linear case is much more highly developed than the nonlinear case; however, not all problems involving time series are best treated by techniques designed specifically for time series.

4. Classification analysis ("pattern recognition"): These techniques use data to relate independent variables to a class label (i.e., a possibly unordered, few-valued dependent variable). Because earlier work in this field was oriented toward developing hardware devices rather than analyzing data, its power as a data-analytic tool has only been fully realized in the last few years.

5. Cluster analysis: Cluster analysis does not require a dependent variable but analyzes the distribution of multivariate data, i.e., the joint distribution of the independent variables, to determine distinct groupings of data points in multivariate space. Much work on the subject has been done recently, and it will become better known when several textbooks in press are published. Discussions of cluster analysis tend at present to appear as chapters in books on pattern recognition.

Each of these data-analytic subjects is difficult and tends to have its own language and proponents. Further, few universities currently encourage students to become broadly based experts in data analysis. Hence, tradeoffs among techniques are not always made with obtaining the best problem solution as the only criterion.
In the present report, a sincere attempt has been made to formulate each problem in the most general terms, pointing out the applicable class of techniques but seldom specific algorithms.

2.0 A SOURCE-ORIENTED EMPIRICAL MODEL

2.1 Motivation

Multiple-source simulation models for urban air quality based on meteorological dispersion functions are in broad use [2]; for example, the Gaussian plume formulation is used in many models, including the RAM model presently in development at the Environmental Protection Agency [3]. The particular form of relationship between source and receptor used in this formulation was originally developed to describe dispersion from isolated sources and has been adapted to the urban environment. Because a source-oriented model is extremely useful in examining the impact of proposed emission controls, it is of interest to determine whether a source-receptor relationship which provides an alternative to the Gaussian plume formulation can be determined empirically. Since one cannot, in general, isolate the effects of single sources in an urban area to determine the source-receptor relationship, we propose that the relationship be extracted indirectly by determining a formulation which will best predict the pollutant concentration distribution, given the emission distribution and meteorological conditions.

2.2 Formulation

The basic Gaussian plume equation predicts the concentration at a point $(x, y, z)$ from a source of unit strength at $(\xi, \eta, \zeta)$ as

$$R(x,y,z;\xi,\eta,\zeta) = \frac{1}{2\pi \bar{u}\,\sigma_y(d)\,\sigma_z(d)} \left[ \exp\!\left(-\frac{(y-\eta)^2}{2\sigma_y^2(d)}\right) \left\{ \exp\!\left(-\frac{(z-\zeta)^2}{2\sigma_z^2(d)}\right) + \exp\!\left(-\frac{(z+\zeta)^2}{2\sigma_z^2(d)}\right) \right\} \right] \tag{2-1}$$

where $\bar{u}$ = mean wind speed, $\zeta$ = effective source height, and $\sigma_y(d)$, $\sigma_z(d)$ = horizontal and vertical diffusion functions a distance $d$ downwind from the source. The first term within the brackets of Eq.
(2-1) denotes the dispersion of the pollutant in the lateral direction; the second term, in the vertical direction; and the third term represents the perfect reflection of the pollutant-bearing diffusive eddies from the surface of the earth, i.e., there is neither deposition nor reaction at the surface. The coordinates are aligned such that $x$ is along-wind and $y$ is crosswind. In fact, Eq. (2-1) holds only when $x - \xi$ is positive; the concentration is assumed to be zero if the source is downwind of the receptor. The diffusion functions depend on meteorological parameters, usually mixing layer depth and stability condition.

The concentration from multiple sources in a region $V$ with a source strength distribution $Q(\xi,\eta,\zeta)$ is given by superposition:

$$\chi(x,y,z) = \int_V R(x,y,z;\xi,\eta,\zeta)\,Q(\xi,\eta,\zeta)\,d\xi\,d\eta\,d\zeta \tag{2-2}$$

where the integral over the volume $V$ can be abstractly considered to include both point and area sources. Following Calder [4], we can assume horizontal homogeneity, as in the Gaussian formulation:

$$R(x,y,z;\xi,\eta,\zeta) \equiv K(x-\xi,\,y-\eta,\,z,\,\zeta) \tag{2-3}$$

yielding an equivalent to (2-2):

$$\chi(x,y,z) = \int_V K(x',y',z,\zeta)\,Q(x-x',\,y-y',\,\zeta)\,dV' \tag{2-4}$$

where we have made the change of variable $x' = x - \xi$, $y' = y - \eta$, and $dV' = dx'\,dy'\,d\zeta$.

Again following Calder [4], we note that (2-4) represents an integral equation for the function $K$ if the concentration distribution $\chi$ and emission distribution $Q$ are known. In other words, we might conceive of determining the source-receptor function $K$ empirically by examining observed concentration distributions resulting from observed (or estimated) emission distributions. Calder notes that in the case where (1) we are predicting ground-level concentrations from area sources at ground level, (2) the integral is approximated by a summation over an M-by-N grid of values, and (3) we have measured concentrations and emissions at each grid point, $K$ can be determined in tabular form by solving a set of linear equations.
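The tabular determination can be illustrated on a toy grid. The following is a minimal sketch rather than Calder's procedure: it assumes a 4-by-4 grid, a table of K restricted to four nonnegative offsets, zero emissions outside the grid, and more grid points than table entries, so the linear equations are solved in the least-squares sense through the normal equations. The function names and the emission values are hypothetical; the four "true" K values are borrowed from one corner of Figure 2-1.

```python
def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def emission(Q, i, j):
    """Emission at grid cell (i, j); assumed zero outside the grid."""
    return Q[i][j] if 0 <= i < len(Q) and 0 <= j < len(Q[0]) else 0.0


def predict(K, Q):
    """Discrete form of Eq. (2-4): chi[i][j] = sum over offsets (a, b) of K[a][b] * Q[i-a][j-b]."""
    n = len(K)
    return [[sum(K[a][b] * emission(Q, i - a, j - b)
                 for a in range(n) for b in range(n))
             for j in range(len(Q[0]))]
            for i in range(len(Q))]


def recover_table(chi, Q, n):
    """Estimate the n-by-n table K from gridded concentrations and emissions."""
    rows, rhs = [], []
    for i in range(len(Q)):
        for j in range(len(Q[0])):
            # One linear equation per grid point, one unknown per table entry.
            rows.append([emission(Q, i - a, j - b)
                         for a in range(n) for b in range(n)])
            rhs.append(chi[i][j])
    p = n * n
    AtA = [[sum(r[s] * r[t] for r in rows) for t in range(p)] for s in range(p)]
    Atb = [sum(r[s] * y for r, y in zip(rows, rhs)) for s in range(p)]
    flat = solve(AtA, Atb)
    return [flat[a * n:(a + 1) * n] for a in range(n)]


# Hypothetical data: generate concentrations from a known table, then recover it.
K_true = [[5.0, 5.5], [4.0, 4.5]]
Q = [[1.0, 2.0, 0.0, 3.0],
     [0.0, 1.0, 4.0, 1.0],
     [2.0, 0.0, 1.0, 5.0],
     [3.0, 1.0, 0.0, 2.0]]
chi = predict(K_true, Q)
K_est = recover_table(chi, Q, 2)
```

With error-free synthetic data the recovered table matches the generating one to rounding error; with measured data the same equations would return the least-squares table for the meteorology that produced the observations.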
The table specifying the function $K(x',y')$ in this case yields a value for any receptor grid point $(x, y)$ and source grid point $(\xi, \eta)$, and is valid for the meteorological conditions which yielded the particular concentration distribution used. (The table would appear as in Figure 2-1.)

                  x' = x - ξ
    y' = y - η    .1     .2     .3     .4
        .1       5.0    5.5    4.5    3.0
        .2       4.0    4.5    3.0    2.5
        .3       3.0    2.0    1.5    1.4
        .4       2.0    1.6    1.2    1.0

Figure 2-1. A tabular representation of a hypothetical source-receptor function K for a fixed meteorology. Another table would be required for another set of meteorological conditions.

A tabular formulation of the source-receptor function $K$ has the advantage of not restricting the class of functions being investigated, but has the disadvantage of requiring values of the concentration distribution at each grid point and making the dependence upon meteorological factors difficult to extract explicitly.

An alternative approach to solving equation (2-4) is to restrict $K$ to membership in a family of functions $K(x',y',z,\zeta;\alpha)$, where choosing the parameter vector $\alpha$ specifies a particular member of the family. A familiar example is the family of multivariate polynomials, where a member is specified by a particular choice of values for the coefficients. In this case, $K$ is specified as a specific functional form rather than as a table.

One approach to determining the "best-fitting" function of the chosen family (or, equivalently, to finding the parameter vector $\alpha$ which gives the best fit) is to fit, by a classical least-squares method, the values yielded by the set of linear equations suggested by Calder. In the hypothetical example of Figure 2-1, the 16 values of the table could be fit by a function of two variables. Since this is a two-step approach, it does not overcome the problems of the first approach, but amounts largely to smoothing the values yielded by that approach.

The more direct approach is to substitute $K(x',y',z,\zeta;\alpha)$ directly in equation (2-4). If there were an $\alpha$
such that the equation could be satisfied exactly, the concentration distribution could be predicted exactly by the function given by that $\alpha$. If a perfect fit is not possible, then one could find the parameters $\alpha$ which minimize the mean-square error over a number of measurement points $(x_i, y_i, z_i)$, $i = 1, 2, \ldots, M$, perhaps corresponding to monitoring stations:

$$e^2(\alpha) = \frac{1}{M}\sum_{i=1}^{M}\left[\chi_i - \int_V K(x',y',z_i,\zeta;\alpha)\,Q(x_i-x',\,y_i-y',\,\zeta)\,dV'\right]^2 \tag{2-5}$$

where $\chi_i = \chi(x_i, y_i, z_i)$.

Equation (2-5) can be minimized with respect to $\alpha$ by any number of optimization techniques if the integral can be calculated; many numerical integration techniques are suitable for that purpose. A key problem is the choice of an appropriate family of parameterized functional forms for $K$ such that the error $e^2$ will be small while the number of parameters in $\alpha$ remains small. The number of parameters is related to the number of measurements required to make the problem well-determined and to its computational feasibility. We will discuss this point after further refinement of the problem formulation.

Explicit Dependence on Meteorology

Since the concentration distribution in (2-5) is determined by meteorological parameters as well as the source distribution, $K$ determined by that formulation would be valid only for that particular set of meteorological conditions. If we denote the vector of meteorological parameters as $m$ (e.g., wind speed, inversion height, and stability class) and express the dependence of $K$ and $\chi$ on meteorology as $K(x',y',z,\zeta,m;\alpha)$ and $\chi(x,y,z,m)$, then the error $e^2$ is a function of the choice of parameters $\alpha$ and meteorological conditions $m$:

$$e^2 = e^2(\alpha, m) \tag{2-6}$$

If a set of $N$ meteorological conditions we wish to explore is given by $m_1, m_2, \ldots, m_N$, then we may find $\alpha$ to minimize

$$E^2(\alpha) = \frac{1}{N}\sum_{j=1}^{N} e^2(\alpha, m_j) \tag{2-7}$$

where $e^2$ is defined by (2-5). This last equation gives the mean-square error over all measurement points $(x_i, y_i, z_i)$ and over all the chosen meteorological conditions.
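Equations (2-5) through (2-7) can be exercised on a deliberately small case. The sketch below is illustrative only and rests on several assumptions that are not part of the report: a one-dimensional row of grid cells, a hypothetical one-parameter family K = exp(-αx')/u with the wind speed u as the only meteorological parameter, the integral replaced by a sum, and golden-section search standing in for "any number of optimization techniques." All names and numerical values are made up for the illustration.

```python
import math


def kernel(x_rel, wind_speed, alpha):
    """Hypothetical one-parameter family: exp(-alpha * x') / u downwind, zero upwind."""
    if x_rel < 0:
        return 0.0
    return math.exp(-alpha * x_rel) / wind_speed


def predict(Q, wind_speed, alpha, dx=1.0):
    """One-dimensional discrete form of Eq. (2-4): chi_i = sum_k K(x_i - x_k) Q_k dx."""
    n = len(Q)
    return [sum(kernel((i - k) * dx, wind_speed, alpha) * Q[k] * dx
                for k in range(n))
            for i in range(n)]


def mean_square_error(alpha, Q, observations):
    """E^2(alpha) as in Eq. (2-7): the average over conditions of e^2 from Eq. (2-5)."""
    total = 0.0
    for wind_speed, chi_obs in observations:
        chi_hat = predict(Q, wind_speed, alpha)
        total += sum((o - p) ** 2 for o, p in zip(chi_obs, chi_hat)) / len(chi_obs)
    return total / len(observations)


def golden_section(f, lo, hi, tol=1e-8):
    """Minimize a unimodal function f on [lo, hi] by golden-section search."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - g * (b - a)
        d = a + g * (b - a)
        if f(c) < f(d):
            b = d
        else:
            a = c
    return (a + b) / 2.0


# Hypothetical emissions and three wind-speed conditions; "observations" are
# generated from the family itself, so a perfect fit exists at alpha_true.
Q = [3.0, 0.0, 1.0, 4.0, 2.0, 0.5]
alpha_true = 0.7
observations = [(u, predict(Q, u, alpha_true)) for u in (2.0, 5.0, 8.0)]
alpha_fit = golden_section(lambda a: mean_square_error(a, Q, observations), 0.01, 3.0)
```

A single scalar parameter keeps the search trivial. With a realistic multi-parameter family the same error function would be handed to a general-purpose optimizer, and the number of parameters becomes the feasibility question taken up in Section 2.3.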
The function resulting from optimization, $K(x',y',z,\zeta,m;\alpha^*)$, should predict these MN points accurately. If $m$ is three-dimensional, then $K$ is a function of six variables and explicitly contains dependence on the meteorology. Equation (2-1), the Gaussian formulation, is an example of such a functional form (although not obtained by optimization).

Area and Point Sources

We can further refine our formulation by explicit consideration of area and point sources. Area sources at ground level yield concentrations at ground level given by

$$\chi_A(x,y,m) = \chi(x,y,0,m) = \int_V K_A(x',y',m;\beta)\,Q_A(x-x',\,y-y')\,dx'\,dy' \tag{2-8}$$

Concentrations from elevated point sources, measured at ground level, are given by

$$\chi_P(x,y,m) = \chi(x,y,0,m) = \sum_{\ell} K_P(x'_\ell,\,y'_\ell,\,\zeta_\ell,\,m;\gamma)\,Q_P(x-x'_\ell,\,y-y'_\ell,\,\zeta_\ell) \tag{2-9}$$

where the point sources are at $(\xi_\ell, \eta_\ell, \zeta_\ell)$ and $x'_\ell = x - \xi_\ell$, $y'_\ell = y - \eta_\ell$. The total concentration at $(x, y)$ with meteorological conditions $m$ is given by

$$\chi(x,y,m) = \chi_A(x,y,m) + \chi_P(x,y,m) \tag{2-10}$$

As before, the optimal source-receptor function is determined by finding the parameters $\beta$ and $\gamma$ which minimize the mean-square error in predicting concentrations over a varying set of meteorological conditions:

$$E^2(\beta,\gamma) = \frac{1}{N}\sum_{j=1}^{N}\left\{\frac{1}{M}\sum_{i=1}^{M}\left[\chi(x_i,y_i,m_j) - \int_V K_A(x',y',m_j;\beta)\,Q_A(x_i-x',\,y_i-y')\,dx'\,dy' - \sum_{\ell} K_P(x'_{i\ell},\,y'_{i\ell},\,\zeta_\ell,\,m_j;\gamma)\,Q_P(x_i-x'_{i\ell},\,y_i-y'_{i\ell},\,\zeta_\ell)\right]^2\right\} \tag{2-11}$$

where $x'_{i\ell} = x_i - \xi_\ell$ and $y'_{i\ell} = y_i - \eta_\ell$. (Note that the effective stack height $\zeta$ may also be a function of the meteorology: $\zeta = \zeta(m)$.)

Equation (2-11) summarizes the basic method proposed. Presuming the integral is approximated using a numerical integration technique, the error $E^2$ can be calculated for any choice of parameters. A number of optimization techniques can be employed to find the parameters which give the best fit. Given those optimal parameter values, the optimal source-receptor functions $K_A$ and $K_P$ are fully specified. It remains to consider the feasibility of this approach.

2.3 Feasibility

Feasibility of the proposed method depends upon two closely related problems:

(1) Does the problem as posed have a well-defined solution?
(2) Even if the solution is theoretically well-defined, is it computationally feasible to obtain?

Both questions are heavily dependent on the number of parameters defining $K_A$ and $K_P$, i.e., the dimensionality of $\beta$ and $\gamma$. The number of values to be fit ($M \cdot N$) should be greater than the number of free parameters; if so, the solution will in most cases be well-defined (perhaps within some limited region of parameter space).

Computational feasibility also depends on the number of parameters. The cost of most optimization algorithms will tend to go up as a power of the number of parameters whose values must be determined.

A major objective of this approach must hence be to find a parameterized functional form which is sufficiently general to be able to model the source-receptor function, but which does not require a large number of free parameters to achieve this generality. Continuous piecewise linear functions, as used in a recent EPA study [5], provide such a class of functions and are a promising candidate for achieving a feasible solution.

Whatever form of approximating function is used, however, one can simplify the problem by making certain assumptions regarding the source-receptor function; for example:

(1) One might assume specific dependencies on some meteorological parameters, e.g., by assuming that the concentration is inversely proportional to the average wind speed rather than extracting that dependence empirically.

(2) One might assume the Gaussian form and determine the dispersion functions empirically.

(3) One might make the narrow-plume assumption for area sources [6].

The more assumptions made, the less general, but also the less difficult, the analysis will be.

2.4 The Inverse Problem

Suppose a good source-receptor function has been determined. Then one might pose the problem: Given meteorological conditions, measurements of the pollutant concentration distribution, and source locations, determine the distribution of source strengths.
Equation (2-11) provides a formulation of this problem if $\beta$ and $\gamma$ are assumed known and the area and/or point source strengths are assumed unknown. For example, suppose the area sources are assumed known and the point source emission rates are to be determined. There are then $J$ unknown values for the $J$ point sources, and $E^2$, the mean-square error in predicting the measured concentrations, is a function of those $J$ values. The values which minimize $E^2$ are good estimates of the source emission rates. Since the unknowns appear linearly within the brackets, the minimum of (2-11) can be found by solving a set of $J$ linear equations in $J$ unknowns. The solution will be well-determined (except in degenerate cases) if the number of concentration measurements times the number of meteorological conditions exceeds the number of point sources.

2.5 Testing the Approach

A common problem in testing meteorological and air quality models is that the data base required for the models is subject to errors which may be of the same size as errors introduced by the models. Emission inventories and estimates of diurnal variations in emissions, for example, may tend to be correct on the average, yet be considerably in error in any given hour.

A least-squares formulation such as that proposed tends to average out errors and will tend to produce good models. In order to gain confidence in the approach and to explore alternative levels of assumptions, however, it is desirable to use a case where the errors arise from the model formulation rather than from measurement error. One means to this end is to use data generated by a model such as the Gaussian-plume RAM model referenced earlier. If the source-receptor function derived by the proposed approach closely approximates the Gaussian form used in generating the data, one would have increased confidence in the applicability of the proposed technique to measured data.
Further, alternative versions of the methodology could be analyzed in a controlled environment. The proposed Phase II study thus takes this approach and follows a work plan suggested by Calder [4].

2.6 Research Plan

Task 1

Select a real urban location (e.g., St. Louis, New York, or Chicago) for which ground-level area- and point-source, short-term emissions distributions are available for SO2. For a typical one-hour emissions distribution, use a multiple-source Gaussian dispersion model (probably the RAM model of the EPA Meteorology Laboratory), for one wind speed (5 m/sec), one stability class (neutral), and infinite mixing depth, to calculate total one-hour concentrations $\chi(P_i, \theta_j)$ at ground level at a number of receptor locations $P_i$ (e.g., as for the St. Louis RAPS network) and for various wind directions $\theta_j$.

Apply the methodology proposed to attempt to recover the meteorological dispersion function $K$. Determine

(a) the degree of error in predicting concentrations for the receptor locations and wind directions actually used to derive the empirical dispersion function,

(b) the degree of error in predicting concentrations at receptor locations and for wind directions not used in the derivation (a measure of interpolation accuracy),

(c) the degree of error in predicting results for a somewhat different emissions distribution (a test of extrapolation accuracy), and

(d) a comparison of the empirical dispersion function with the Gaussian form used to compute input concentrations for the analysis.

Task 2

Test the sensitivity of the method to the number of "observed" concentrations used and to random errors in the emissions inventory.

Task 3

Extend the preceding to a range of wind speeds, atmospheric stability classes, and to several different emissions distributions.
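The inverse problem of Section 2.4 lends itself to a compact numerical illustration, since it reduces to J linear equations in J unknowns. The sketch below is a toy case under stated assumptions: the "known" point-source function is a hypothetical stand-in (inverse wind speed times a simple distance decay), not a fitted K_P; the area-source contribution is taken as given; and the receptor positions, wind speeds, and source strengths are invented. The normal equations of the least-squares problem are solved by Gaussian elimination.

```python
def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def kernel(receptor, source, wind_speed):
    """Hypothetical known point-source function (a stand-in for a fitted K_P)."""
    dx = receptor[0] - source[0]
    dy = receptor[1] - source[1]
    return 1.0 / (wind_speed * (1.0 + dx * dx + dy * dy))


def estimate_point_strengths(receptors, sources, wind_speeds, chi_obs, chi_area):
    """Least-squares point-source strengths: minimize Eq. (2-11) with K and Q_A known."""
    rows, rhs = [], []
    for j, u in enumerate(wind_speeds):
        for i, r in enumerate(receptors):
            # The unknown strengths enter linearly, one coefficient per source.
            rows.append([kernel(r, s, u) for s in sources])
            rhs.append(chi_obs[j][i] - chi_area[j][i])
    J = len(sources)
    AtA = [[sum(r[s] * r[t] for r in rows) for t in range(J)] for s in range(J)]
    Atb = [sum(r[s] * y for r, y in zip(rows, rhs)) for s in range(J)]
    return solve(AtA, Atb)  # J linear equations in J unknowns


# Hypothetical setup: 3 receptors x 2 meteorological conditions, 2 point sources.
receptors = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
sources = [(1.0, 1.0), (3.0, 0.0)]
wind_speeds = [2.0, 5.0]
q_true = [10.0, 4.0]
chi_area = [[0.5, 0.5, 0.5], [0.2, 0.2, 0.2]]
chi_obs = [[chi_area[j][i] + sum(kernel(r, s, u) * q for s, q in zip(sources, q_true))
            for i, r in enumerate(receptors)]
           for j, u in enumerate(wind_speeds)]
q_est = estimate_point_strengths(receptors, sources, wind_speeds, chi_obs, chi_area)
```

Here six concentration values determine two strengths, which satisfies the condition that measurements times meteorological conditions exceed the number of point sources. Perturbing `chi_obs` with random errors would give a rough feel for the sensitivity question raised in Task 2.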
3.0 EMPIRICAL ANALYSIS OF THE OXIDANT FORMATION PROCESSES IN THE LOS ANGELES BASIN

3.1 Motivation

Oxidant is a difficult pollutant to deal with in terms of understanding the effect of particular controls on the resultant level of its concentration. The principal reason for this difficulty is that oxidant is largely an end product of a chemical process rather than being directly emitted from pollutant sources. Oxidant is related to emissions not only through transport and diffusion, but by a complex chemical reaction in which meteorology can play a significant part. The principal pollutants leading to the formation of oxidant are reactive hydrocarbons (HC) and oxides of nitrogen (NOx). (We shall refer to these "raw" pollutants as "oxidant precursors.") Since emission control policies can affect not only the overall level of emissions but the ratio of emissions of NOx to hydrocarbons, it becomes important to understand the effect of this ratio, as well as of the absolute level of emissions, upon the end concentrations of oxidant. These effects are by no means fully agreed upon.

One approach to understanding this problem is a very detailed inspection of all the physical processes involved, including meteorological effects and chemistry. One then obtains a model which, if successful, relates a detailed emissions inventory and detailed meteorological conditions to a resulting time and spatial distribution of oxidant values over the area modeled.

An alternative approach is an empirical analysis of the relationship between observed concentrations of oxidant precursors and meteorological variables and the resulting observed distribution of oxidant concentration.
The objectives of such an analysis would generally be more limited than in the development of detailed models arising from fundamental physical and chemical principles; however, an insight into the relationships which are observed, even for a limited range of conditions, can provide both guidance for setting control policy and guidance as to the dominant effects which should be considered in a chemical/physical model.

3.2 The Data

Any empirical analysis must proceed from a data base. For the analysis proposed, a great deal of data is available on the Los Angeles basin, particularly from the Los Angeles Air Pollution Control District (LAAPCD) and the California Air Resources Board (ARB).

There are at present approximately 30 stations monitoring air quality in the South Coast (Los Angeles area) Basin. Almost all of the stations monitor oxidant and NOx. Several monitor HC. The earliest station records date back to 1955, but few stations have such long histories. The early records are of somewhat doubtful value in some cases, due to changes in monitoring technology and standards. There are, however, more than 20 stations with histories of several years.

There are continuing questions about the comparability of data taken by different agencies. While such comparability (and indeed absolute accuracy) is critical for use with deterministic models, it is less important for statistical models as long as a consistent basis is used for each station reporting. The more recent data is generally characterized by such consistency, although data from the ARB may have to be adjusted downward 20 to 25 percent to be consistent with LAAPCD data due to differing calibration techniques [7,8].
Mesometeorological data, such as wind speed and direction, surface temperature, the vertical temperature profile, pressure, humidity, precipitation, visibility, and insolation, is collected at airports and other Weather Service stations and at meteorological stations run by other organizations, for example, Air Pollution Control Districts and the Armed Forces. Data from the Weather Service and Armed Forces stations are available from NOAA. Data from a sizable number of other meteorological stations in Los Angeles County is available from the LAAPCD.

An initial statistical analysis of LAAPCD data has been performed by Tiao, Box, et al. [9,10]. At present, only preliminary results have been reported.

3.3 The Problem Formulation

The general empirical approach is to postulate the possible independent variables which affect the dependent variable to be predicted. In the present case, the independent variables are measures of the precursor pollutant concentrations and of meteorological variables, and the dependent variable is the oxidant concentration at a given location (or an aggregate measure such as peak oxidant concentration throughout the basin). (For the sake of conciseness, we refer to variables such as averages or peaks, which remove either a spatial or time variation from a given independent or dependent variable, as "aggregate" variables.) This analysis has two major steps:

(1) Find the independent variables which best explain the behavior of the dependent variable; and

(2) Model that relationship mathematically.

While the second step generally receives the most attention in empirical analyses, the first step is the more difficult and often the more revealing. One can apply both linear and nonlinear models in both steps. If linear models are applied in the first step, the question answered will be whether the variables predict the dependent variable linearly.
If general nonlinear forms are allowed, the question answered will be whether the independent variables predict the dependent variable in either a linear or nonlinear manner.

A good example of an analysis of the linear dependence of ambient ozone on meteorological parameters is research performed at Bell Laboratories [11]. There the logarithms of the meteorological variables were used to predict the logarithm of the oxidant concentration by a linear equation; this equation produced a correlation between predicted and actual concentration of 0.84. A single location in New York was analyzed, and the only variables used in attempting to predict the oxidant concentration were solar radiation, wind speed, and temperature. Mixing height was determined to offer no additional information beyond that of the three independent variables indicated. This analysis did not contain any dependence upon precursor pollutants, since the oxidant concentration was specific to a certain location and the data was collected over a relatively short period of time; hence, the emissions might be expected to be relatively constant. The results, although highly encouraging in terms of the potential for empirical analysis, should be qualified:

(1) The model correlation was based upon a limited time period and one location and, while a very careful and credible statistical analysis of the prediction errors was made, no test was performed upon independent data not used in creating the linear equation.

(2) Since the errors in predicting the logarithm of the ozone value were found to be normally distributed, the error in predicting the ozone concentration itself tended to be largest at the higher values of ozone concentration. It is at the higher values where difficulty in forecasting is generally encountered but where accuracy of the model is most critical.
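Qualification (2) follows from the arithmetic of a log-linear model: an additive, normally distributed error in the logarithm acts as a multiplicative error on the concentration, so the absolute error band widens in proportion to the predicted level. The short sketch below makes this concrete. The log-scale standard deviation and the two predicted concentrations are hypothetical values chosen for illustration, not figures from the Bell Laboratories study.

```python
import math


def error_band(predicted, sigma_log):
    """One-sigma band in concentration units when the error in ln(concentration)
    is normal with standard deviation sigma_log."""
    return predicted * math.exp(-sigma_log), predicted * math.exp(sigma_log)


sigma_log = 0.2                          # hypothetical log-scale error (assumption)
low_day = error_band(0.05, sigma_log)    # hypothetical low-ozone prediction, ppm
high_day = error_band(0.30, sigma_log)   # hypothetical high-ozone prediction, ppm

width_low = low_day[1] - low_day[0]
width_high = high_day[1] - high_day[0]
ratio = width_high / width_low           # equals the ratio of the two predictions
```

Under these assumptions the band on the high-ozone day is six times as wide as on the low-ozone day, the same factor by which the predictions differ. The model is therefore least precise in absolute terms where forecasts matter most.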
We note briefly another study as an example of the possibility of creativity in the definition of potential independent variables. Smith and Jeffrey attempted to predict, by a simple formula, the high concentration of sulfur dioxide in London and Manchester air [12]. One variable they found quite useful was the number of hours when the wind speed was less than three knots. This variable apparently summarized the key aspect of the temporal variation of wind speed as a single number. It is the intent of the proposed project to exercise limited creativity in defining potential predictors of oxidant concentration.

A key characteristic of the problem is that the level of ozone at a given time may be the result of precursor concentrations at a different point at an earlier time. (There is even some tentative empirical evidence that one day's hydrocarbons may be important in producing high levels of oxidant on the following day [13].) The particular location and relevant time delay will be a function of meteorological parameters such as the wind field, temperature, solar radiation, and mixing height. There are several possible ways of handling this key problem:

(1) Stratify the data by general meteorological or wind-field classes, e.g., "a light wind from the ocean." Data for each class of wind field could then be analyzed to discover the location and time delay which best explain oxidant concentration at a given location. One could thus determine empirically the precursor location and time delays for the specific meteorological class.

(2) Aggregate precursor and oxidant values spatially and/or temporally. If the average hydrocarbon and NOx concentrations across the basin for a given hour are considered independent variables and the peak oxidant reading throughout the basin for the day is considered a dependent variable, then one may seek a direct relationship between those variables.
This is much the sort of aggregation attempted successfully in an earlier analysis of data produced by a photochemical smog model [5].

(3) Perform a trajectory analysis. Let the precursors of oxidant at a given location be the hydrocarbon and NOx concentrations in the parcel of air at its location at an earlier time, as obtained through analysis of the trajectory of the parcel. The independent variables in this case would be the precursor concentrations in the parcel three hours earlier, four hours earlier, etc. This approach has the advantage of making it unnecessary to specifically include the wind field as an independent variable; instead, the wind field is used in defining a more complex independent variable.

The third approach involves the interpolation of the wind field and the tracing of the trajectories. In a later section we will discuss objective interpolation of meteorological parameters and specifically interpolation of the wind field. Given a methodology for interpolating the wind field from a limited number of measurements, trajectories such as those in Figures 3-1 and 3-2 can be estimated. The precursor pollutant concentrations in the parcel of air at an earlier point can be estimated by interpolation of the concentrations of the precursor pollutants between measurement stations. The independent variable can then be the pollutant concentration in the parcel of air at an earlier point in time (or perhaps a weighted average of the earlier pollutant concentrations throughout the trajectory).

The end objective is to obtain a functional relationship between a variable measuring the concentration of ozone and a limited number of meteorological and chemical precursor variables.
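The trajectory approach can be sketched under strong simplifying assumptions. Here inverse-distance weighting stands in for whatever wind-field interpolation is eventually chosen (the report defers that choice to a later section), the parcel is stepped backward one hour at a time, and all station coordinates, winds, and concentrations are invented for illustration. The function names `idw` and `back_trajectory` are hypothetical.

```python
import numpy as np

def idw(stations, values, point, power=2.0):
    """Inverse-distance-weighted interpolation of station values at `point`."""
    d = np.linalg.norm(stations - point, axis=1)
    if d.min() < 1e-9:              # point coincides with a station
        return values[d.argmin()]
    w = d ** -power
    return (w @ values) / w.sum()

def back_trajectory(end_xy, stations, winds_by_hour, dt_hours=1.0):
    """Trace a parcel backward from end_xy.

    winds_by_hour is ordered from the most recent hour to the earliest;
    each entry is an (n_stations, 2) array of (u, v) in miles per hour.
    Returns the parcel positions, starting with end_xy.
    """
    pos = np.asarray(end_xy, dtype=float)
    path = [pos.copy()]
    for winds in winds_by_hour:
        # Step against the interpolated wind to find the earlier position.
        pos = pos - idw(stations, winds, pos) * dt_hours
        path.append(pos.copy())
    return np.array(path)

# Invented station layout (miles) and a uniform 3 mph wind toward +x.
stations = np.array([[0.0, 0.0], [20.0, 0.0], [0.0, 20.0], [20.0, 20.0]])
uniform_wind = np.tile([3.0, 0.0], (4, 1))
winds_by_hour = [uniform_wind] * 3

path = back_trajectory([10.0, 5.0], stations, winds_by_hour)

# Invented hydrocarbon readings at the stations three hours earlier; the
# interpolated value at the parcel's earlier position is the candidate
# independent variable.
hc_3h_ago = np.array([0.8, 0.3, 0.9, 0.4])
precursor = idw(stations, hc_3h_ago, path[-1])
print(path)
print(precursor)
```

With a uniform wind the parcel simply moves 3 miles per hour against the flow; a real wind field would bend the path as in Figures 3-1 and 3-2.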
The tool for the ultimate determination of the functional relationship will be nonlinear: continuous piecewise-linear regression, as used in an earlier study [5]. (A comparison between a nonlinear fit and a linear fit will be made to determine the degree of improvement obtained by allowing nonlinearity.)

Figure 3-1: Trajectories. Estimated trajectory of air arriving at Pasadena, El Monte, Long Beach, and Santa Ana at 0400, September 29, 1969. (Map showing West L.A., Burbank, Los Angeles, Pasadena, El Monte, Azusa, Pomona, Redondo Beach, Long Beach, and Santa Ana, with hourly positions marked along each trajectory and a scale of 0 to 10 miles.) The figure shows that an irregular early morning meandering pattern exists at Long Beach, Santa Ana, and El Monte. Pasadena, on the other hand, shows a northerly flow pattern due to nocturnal air drainage down the mountains combined with an offshore wind flow. The lengths of the arrows give an indication of how much the air has moved during an hour interval. None of the stations show more than 4 m.p.h. air movement for the early morning hours.

Figure 3-2: Trajectories. Estimated trajectory of air arriving at Pasadena, El Monte, Long Beach, Santa Ana, and Pomona at 1600, September 29, 1969. (Same map, with hourly positions from 1200 to 1600 and a scale of 0 to 10 miles.) The figure shows the estimated air trajectories for the afternoon hours. All of the stations show the dominance of onshore sea breezes, with a tendency toward higher velocities later in the afternoon. The more regular air trajectories of the afternoon also show a greater air movement than early in the morning, as shown by the greater lengths of the arrows. The deflecting influence of the Santa Monica mountains causes the air trajectory to curve northward as it approaches Pasadena.
The technique used to determine which variables will be used in the equation will be a combination of linear and nonlinear variable selection techniques [14].

3.4 Research Plan

Task 1

Collect data from the California ARB and LAAPCD (and limited meteorological data from other sources) into a common format. The data will be limited to the South Coast Basin and the years 1968-1974.

Task 2

Analyze data as stratified by classes of wind field (and, perhaps, other meteorological variables). Data from days with meteorological conditions conducive to high oxidant concentration will be emphasized. Perform a linear and nonlinear analysis to determine appropriate independent variables. Examine the utility of using aggregate variables within the stratified classes.

Task 3

Examine the consistency of results obtained in Task 2 with those obtained from a trajectory analysis. It is understood that practical limitations on time and funds available will restrict the extensiveness of this analysis.

Task 4

Summarize the implications of the analysis and of the generality and validity of the models obtained.

The key difference between similar projects and the proposed project is the use of a class of powerful nonlinear techniques and the resulting generality of the conclusions.

4.0 EXTRACTION OF EMISSION TRENDS FROM AIR QUALITY TRENDS

4.1 Motivation

While measured pollutant concentration is the final impact of a given level of emissions, trends in pollutant concentration measurements can be misleading if it is assumed that those trends represent progress (or the lack thereof) in emission control. Since meteorology need not be uniform from time period to time period, the measure of progress should be more directly related to emissions. Emissions come from a large number of diverse sources, however, and are difficult to measure directly.
Since air quality has been measured directly for a number of years, it is of significant interest to understand whether the effect of meteorology can be removed from air quality trends to more nearly elicit trends in emissions. Such an analysis of trends is the subject of periodic reports both by the Council on Environmental Quality and by the Environmental Protection Agency.

Such a study must implicitly extract information about the influence of meteorological factors on pollution levels for a given level of emissions. This information can be an important subsidiary benefit of an analysis of the sort suggested.

We will discuss this concept by referring to a specific example of a study of the improvement in emissions between the early and late sixties in Oslo, Norway [15]. We will then relate this example to a general formulation to highlight the assumptions involved in such a study, to make the method more specific, and to provide a context for broader application of this approach.

4.2 Report of a Comparison of Emission Levels over Two Time Periods

A study of the changes in emission levels of SO2 in Oslo, Norway, as deduced from changes in measured SO2 concentrations, was undertaken to compare the SO2 emissions of the periods 1959-1963 and 1969-1973. The meteorological conditions during the former period were considerably different from those during the latter period; hence, one could not expect a change in air quality to be directly related to a change in emissions.

Data from the earlier period (1959-1963) was used to do a linear regression analysis. It was discovered that two variables dominated the estimate of SO2 concentration: a temperature difference between a low-altitude and a high-altitude measuring station, and the temperature at the lower station.
For example, a typical regression equation for one station was

$$q_{\mathrm{SO_2}} = 61.5\,(T_2 - T_1) - 11.6\,T_1 + 472, \tag{4-1}$$

where

$q_{\mathrm{SO_2}}$ = daily mean value of SO2 concentration in µg/m³ at the particular station,
$T_2$ = temperature at higher station at 7 P.M.,
$T_1$ = temperature at lower station at 7 P.M.

This equation explained the observed values of SO2 concentration with a multiple correlation coefficient of 0.80; that is, the correlation between values predicted by this equation and observed values for the period indicated was 0.80. Adding other variables did not result in a significantly better predictor equation.

It was suggested that the temperature difference term expressed the ventilation in the Oslo area while the temperature term measured the variation in the emission of SO2 due to space heating. Since the temperature data for the later time period is known, the level of SO2 expected for the meteorological conditions during that time period can be estimated by equation (4-1). This was done for the days on which data was available in the later time period; the results are indicated and compared with data from the earlier time period in Figure 4-1. The data from the 1959-1963 time period is scattered relatively uniformly about the line of slope 1, as expected, since the regression was performed on that data. However, the data from the later years shows a much lower observed value of SO2 concentration than would be expected from the meteorological conditions. The referenced study attributed this to a reduction in emissions.

Figure 4-1 indicates qualitatively the emission reduction (or, if the reader prefers, the "meteorologically normalized" reduction in pollutant levels). A quantitative statement was made in the report that the SO2 pollution was reduced 50 to 60%.
According to a conversation with one of the authors of the report, this latter statement was derived by looking at the ratio of the coefficient on the temperature difference term in the early time period to the coefficient of the temperature difference term in a similarly derived equation for the later time period. The intuitive justification for such a statement is that the coefficient measures the degree to which a given temperature inversion will be translated into SO2 concentrations. Thus a 50 or 60% reduction in that coefficient might be thought of as a meteorologically adjusted measure of the trend in air quality. The intent was to obtain a value which can be interpreted as being proportional to the reduction in emissions.

Figure 4-1: Values of daily mean SO2 concentration computed from temperature measurements at 7 P.M. versus daily mean SO2 concentration observed (µg/m³), with separate symbols for 1959/63, 1969/70, and 1971. The fact that the values in the later period are much less than would be expected from the meteorology suggests that emissions are less. Ref. [15]

4.3 Generalization and Mathematical Formulation

The purpose of the Oslo study was to compare air quality for two different periods rather than to obtain a continuous estimate of a meteorologically normalized air quality trend. We will formulate the problem in the former terms in order to relate it explicitly to that study; however, this does not at all imply that the approach cannot be modified to yield a continuous estimate of air quality trends.

Assume we are given two sets of observations, one set for the first period of time:

$$\begin{array}{ll}
q_1^{(1)}, & \mathbf{m}_1^{(1)} \\
q_2^{(1)}, & \mathbf{m}_2^{(1)} \\
\ \vdots & \ \vdots \\
q_{N_1}^{(1)}, & \mathbf{m}_{N_1}^{(1)}
\end{array} \tag{4-2}$$

where

$q_i^{(1)}$ = an air quality measurement during the first period (e.g., a daily mean value of pollutant concentration)

and

$\mathbf{m}_i^{(1)} = (m_{i1}^{(1)}, \ldots, m_{in}^{(1)})$ = a vector of meteorological measurements corresponding to the $i$th air quality measurement $q_i^{(1)}$ (e.g., $m_{i1}^{(1)}$ might be a temperature measurement at a particular station).
There is a similar set of measurements for a later period:

$$\begin{array}{ll}
q_1^{(2)}, & \mathbf{m}_1^{(2)} \\
q_2^{(2)}, & \mathbf{m}_2^{(2)} \\
\ \vdots & \ \vdots \\
q_{N_2}^{(2)}, & \mathbf{m}_{N_2}^{(2)}
\end{array} \tag{4-3}$$

It is from this information (and without an estimate of emissions during the two periods) that we wish to determine a meteorologically adjusted estimate of the improvement or deterioration of air quality (i.e., to estimate the change in emissions from air quality and meteorological measurements).

Suppose there is some "true," but unknown, equation (or model) which relates emissions and meteorological measurements to air quality:

$$q = F(\mathbf{e}, \mathbf{m}). \tag{4-4}$$

This equation plus measurement error produced the measurement data of (4-2) and (4-3). We are assuming that the equation does not differ between the two periods, that is, that any changes in air quality are explained either by a change in meteorology or a change in emissions.

For the sake of the present discussion, let us again assume that emissions remain essentially constant over the first time period and over the second time period:

$$\mathbf{e} = \mathbf{e}_1 \ \text{in first period}, \tag{4-5a}$$

$$\mathbf{e} = \mathbf{e}_2 \ \text{in second period}. \tag{4-5b}$$

Now let us suppose that we do a linear or nonlinear regression with the data from the first period, equation (4-2), and obtain a best-fit equation to the data:

$$q = f_1(\mathbf{m}). \tag{4-6}$$

Equation (4-1) is such an equation. Since (4-6) was derived with constant emissions $\mathbf{e}_1$, and since "truth" is assumed to be given by equation (4-4), $f_1$ represents the relation between meteorological conditions and pollutant concentration for fixed emissions $\mathbf{e}_1$:

$$f_1(\mathbf{m}) \approx F(\mathbf{e}_1, \mathbf{m}). \tag{4-7}$$

Now suppose we use the data of the second period, equation (4-3), to obtain a similar empirical model:

$$q = f_2(\mathbf{m}). \tag{4-8}$$

Then, as before,

$$f_2(\mathbf{m}) \approx F(\mathbf{e}_2, \mathbf{m}). \tag{4-9}$$

Let us further assume that $F$ is decomposable:

$$q = F(\mathbf{e}, \mathbf{m}) = G(\mathbf{e})\,H(\mathbf{m}). \tag{4-10}$$

Equation (4-10) implies that the effect of emissions on air quality is essentially independent of the effect of meteorology. The appropriateness of this assumption clearly depends upon the particular definitions of the emission, meteorological, and pollutant variables, as well as the area in question. If the pollutant concentration is location-specific (rather than a spatial average or spatial maximum), then either emissions must be spatially uniform or the direction of the wind field relatively consistent for (4-10) to be reasonable. (The latter seems to be the assumption of the Norwegian study.) If the variables are aggregates (such as spatially averaged SO2 concentrations, total emissions, and average wind speed), then less severe assumptions need be made for (4-10) to be reasonable.

Given (4-10), the ratio of the empirical equations for the two time periods is

$$\frac{f_2(\mathbf{m})}{f_1(\mathbf{m})} \approx \frac{F(\mathbf{e}_2, \mathbf{m})}{F(\mathbf{e}_1, \mathbf{m})} = \frac{G(\mathbf{e}_2)}{G(\mathbf{e}_1)}, \tag{4-11}$$

using (4-7), (4-9), and (4-10). Thus, the ratio of the two equations should be very nearly constant if (4-10) is valid, and that constant will be a measure of the change in emissions between the two periods. (The function $G(\mathbf{e})$ can be, for example, total emissions in tons.)

If (4-11) is not nearly constant, it can be interpreted as implying that the improvement is a function of the meteorology. This might easily be the case. For example, if there is substantial reduction in industrial emissions but no improvement in emissions from space heating, the improvement in emissions will be less when the temperature is lower. If the improvement is a function of wind direction, the location of major emission sources may be the cause.

In the Oslo study, the ratio of the temperature difference terms alone was taken and is exactly constant. Since the full Oslo model, (4-1), contains other terms, however, the ratio suggested by this discussion is not constant.
Since the equation for the later time period was not explicitly reported, we cannot calculate the ratio.

Let us examine, however, an analysis which is consistent with Figure 4-1 and which provides an alternative approach. Suppose we create a model $f_1$ for the first time period only and apply it to the meteorological conditions for the second time period:

$$\hat{q}_i^{(2)} = f_1(\mathbf{m}_i^{(2)}), \quad i = 1, 2, \ldots, N_2. \tag{4-12}$$

We obtain estimates $\hat{q}_i^{(2)}$ for the air quality to be expected if the emissions have not changed; these calculated values can be compared with observed values. These are the values plotted in Figure 4-1. If we now perform a linear regression of observed versus estimated values, i.e., $q_i^{(2)}$ versus $\hat{q}_i^{(2)}$ for $i = 1, 2, \ldots, N_2$, we obtain a regression equation:

$$q = a\,\hat{q} + b, \tag{4-13}$$

with specific values of $a$ and $b$. Suppose we then assume that the "true" equation is of the form

$$q = F(\mathbf{e}, \mathbf{m}) = G(\mathbf{e})\,H(\mathbf{m}) + q_0, \tag{4-14}$$

where

$q_0$ = a "background" air quality level not related to local emissions.

Then (4-13) is consistent with (4-14) if

$$a = \frac{G(\mathbf{e}_2)}{G(\mathbf{e}_1)} \tag{4-15a}$$

and

$$b = q_0^{(2)} - a\,q_0^{(1)}, \tag{4-15b}$$

where $q_0^{(1)}$ and $q_0^{(2)}$ are the background levels in the first and second periods. Then "a" can be interpreted as the relative change in emissions (the ratio of later to earlier emission levels) and, more controversially, "b" can be related to the change in "background" level (where the background level may contain contributions from sources outside the emissions inventory included in $\mathbf{e}$; for example, long-range transport from other cities).

Estimating the best-linear-fit equations graphically from Figure 4-1, we find that the equation for the 1969/70 data is approximately

$$q = 0.25\,\hat{q} + 120 \tag{4-16a}$$

and for the 1971 data

$$q = 0.25\,\hat{q}. \tag{4-16b}$$

Thus, the reduction in emissions is about 75% by this analysis for both periods. The 1969/70 period had higher "background" than the 1959/63 period by 120 µg/m³, but the 1971 period had about the same background as 1959/63. Thus, the improvement between 1969/70 and 1971 could be attributed to improvements in areas other than Oslo.
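The one-model procedure of (4-12) through (4-15) can be sketched on synthetic data. Everything numerical below is invented: the meteorological factor `H`, the emission factors `g` (standing in for G(e)), and the backgrounds `q0` are assumptions chosen so that the expected answer is known, namely a = 0.4 and b = 50 - 0.4 x 20 = 42. The sketch does not use the Oslo data.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n, g, q0, noise=20.0):
    """Synthetic (m, q) pairs generated from q = g * H(m) + q0 + error."""
    dT = rng.uniform(-2.0, 6.0, n)        # T2 - T1, inversion strength
    T1 = rng.uniform(-15.0, 5.0, n)       # lower-station temperature
    H = 40.0 * dT - 10.0 * T1 + 300.0     # hypothetical meteorological factor
    q = g * H + q0 + rng.normal(0.0, noise, n)
    return np.column_stack([dT, T1]), q

def design(m):
    """Regression matrix with an intercept column."""
    return np.column_stack([m, np.ones(len(m))])

# Period 1: reference emissions. Period 2: emissions cut to 40%, and a
# higher background level.
m1, q1 = simulate(500, g=1.0, q0=20.0)
m2, q2 = simulate(500, g=0.4, q0=50.0)

# f1: linear model fitted on the first period only, as in (4-6).
coef1, *_ = np.linalg.lstsq(design(m1), q1, rcond=None)

# (4-12): concentrations expected in period 2 had emissions not changed.
q_hat2 = design(m2) @ coef1

# (4-13): regress observed on estimated values; np.polyfit returns
# the slope first, then the intercept.
a, b = np.polyfit(q_hat2, q2, 1)
print(f"a = {a:.3f}  (ratio of emission factors in the simulation: 0.4)")
print(f"b = {b:.1f}  (q0_2 - a * q0_1 in the simulation: 42)")
```

The slope recovers the emission ratio only because the simulated data satisfy the decomposition (4-14) exactly; with real data the scatter about the fitted line is itself a check on that assumption.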
Note that this latter approach requires that only one model be created. Since the approach is symmetrical, the model can be created for the period in which the most data is available and applied to the other period.

5.0 DETECTION OF INCONSISTENCIES IN AIR QUALITY/METEOROLOGICAL DATA BASES

5.1 Motivation

Air quality and meteorological data bases are collected for many purposes (and often used for purposes not intended when collected). An important objective either during collection or after the fact is the detection of inconsistencies in the data. In most data collection efforts, an attempt is made to study the data for strange behavior or to employ intuition and problem knowledge to uncover sources of system changes causing data problems, such as changes or discrepancies in monitoring techniques. A recent example is the detection of a significant discrepancy in certain calibration techniques used by the California Air Resources Board and the Los Angeles Air Pollution Control District, making oxidant measurements of the agencies inconsistent without a correction factor [8]. The frequent occurrence of detected inconsistencies in data bases leads one to expect the possibility of undetected inconsistencies. An automatic technique for flagging potential inconsistencies using the data itself would be an important tool. Such a technique would take an existing data base and detect potential problems for closer inspection, or detect problems occurring in an ongoing data collection effort before a substantial amount of data was irretrievably lost.

In this section, we will indicate how data-analytic/statistical techniques can be employed to achieve this objective. We will distinguish the types of inconsistencies for which one might search, the appropriate approaches to detecting these various types of inconsistencies, and the potential difficulties in this formal approach to the detection of inconsistencies.
The key concept will be that of using the data collected to form a model of the relationship between selected sets of measurements and to automatically detect the measurements or points in time when (1) the model changes or (2) the data is least consistent with the model. Note that the model need not be a prediction model or relate independent to dependent variables. Any consistent relationships in the data can be employed in detecting inconsistencies.

It is important to distinguish inconsistencies from extremes. An extreme value of air pollution is not necessarily inconsistent; it may be consistent with extreme meteorological conditions. If the model adequately incorporates the extreme conditions, the extreme values would be indicated as being consistent and not flagged. If, however, the extreme conditions were not previously observed in the data base or not otherwise represented by a similar condition in the data base, the extreme conditions may not be incorporated in the model and may be flagged as possible inconsistencies. We bring up these points to emphasize two key concepts: (1) the intent of a consistency analysis is not to flag simple extreme values but to flag values which are inconsistent, i.e., extreme and inconsistent values are not equivalent; (2) the intent of a consistency analysis is to flag potential inconsistencies for inspection. An inconsistency analysis will be successful if it does not miss key inconsistencies that could seriously damage an empirical analysis or data collection effort. It will not have failed if it also flags potential inconsistencies which upon further examination are more accurately categorized as extremes or unusual occurrences.

Let us structure these ideas more formally.
5.2 Formulation of Consistency Models

We imagine the basic situation of the simultaneous collection of air quality and meteorological data, as well as possible adjunct data depending upon the application (e.g., health effects data, emissions data, etc.). Suppose the basic data is a sequence of measurements over time of a number of variables:

$$\begin{array}{ll}
\text{Measurement 1:} & x_1(t_1),\ x_1(t_2),\ \ldots,\ x_1(t_N) \\
\text{Measurement 2:} & x_2(t_1),\ x_2(t_2),\ \ldots,\ x_2(t_N) \\
\quad\vdots & \\
\text{Measurement } n\text{:} & x_n(t_1),\ x_n(t_2),\ \ldots,\ x_n(t_N)
\end{array} \tag{5-1}$$

There are three basic formulations of consistency models available.

Time Sequence Inconsistencies

The consistency of individual time series can be examined. The model constructed can be a model which predicts the value at a given point in time from past and future values of itself. An inconsistency will then be detected as a significant discrepancy between the forecast and observed value. That is, the model could be of the form

$$\hat{x}_i(t_j) = F[x_i(t_1), \ldots, x_i(t_{j-1}),\ x_i(t_{j+1}), \ldots, x_i(t_N)], \tag{5-2}$$

where $\hat{x}_i(t_j)$ is the value of $x_i(t_j)$ predicted by the model. We emphasize that since we are testing consistency rather than predicting behavior, values occurring after the particular value tested can be used in the model when available. While many time series techniques employ recursively expressed predictor models, they imply a general dependence of the form indicated.

An inconsistency would be a sufficiently large deviation between predicted and measured values, i.e., a large value of

$$|\hat{x}_i(t_j) - x_i(t_j)|. \tag{5-3}$$

Cross Measurement Inconsistencies

This type of model is constructed by modeling the relationships between measurements at a given point in time. An example is a derived relationship between a vertical temperature difference and average wind speed at the same time. Formally, such a model is of the form

$$\hat{x}_i(t_j) = F[x_1(t_j), \ldots, x_{i-1}(t_j),\ x_{i+1}(t_j), \ldots, x_n(t_j)]. \tag{5-4}$$

An inconsistency would be detected by large values of (5-3), as before.
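A time-sequence check in the spirit of (5-2) and (5-3) can be sketched as follows. The particular choice of F is an assumption made for illustration: each value is predicted by the median of its nearest neighbors on both sides (so values after the tested point are used, as the text permits), and deviations are flagged against a robust spread estimate. The series, the injected error, and the threshold are all invented.

```python
import numpy as np

def neighbor_prediction(x, half_width=2):
    """Predict x[j] from up to `half_width` values on each side, excluding x[j]."""
    n = len(x)
    pred = np.empty(n)
    for j in range(n):
        lo, hi = max(0, j - half_width), min(n, j + half_width + 1)
        neighbors = np.concatenate([x[lo:j], x[j + 1:hi]])
        pred[j] = np.median(neighbors)   # median limits the influence of a bad neighbor
    return pred

def flag_inconsistencies(x, half_width=2, threshold=5.0):
    """Return indices whose deviation (5-3) is large relative to a robust scale."""
    dev = np.abs(neighbor_prediction(x, half_width) - x)
    center = np.median(dev)
    scale = 1.4826 * np.median(np.abs(dev - center))   # median absolute deviation
    return np.flatnonzero(dev > center + threshold * scale), dev

# Invented hourly series: a diurnal cycle plus noise, with one bad reading.
rng = np.random.default_rng(2)
t = np.arange(200)
x = 50.0 + 20.0 * np.sin(2 * np.pi * t / 24.0) + rng.normal(0.0, 1.0, t.size)
x[120] += 30.0

flagged, dev = flag_inconsistencies(x)
print(flagged)
```

Note that the diurnal peaks, although extreme, are tracked by the neighbor-based prediction and are not what drives the flag; the injected reading at index 120 is flagged because it disagrees with its own neighborhood.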
Combined Model

In general, measurements will depend upon both past history and concurrent measurements. A full model would then be a technique which used data both at other times and from other variables:

$$\hat{x}_i(t_j) = F[x_1(t_1), \ldots, x_1(t_N);\ \ldots;\ x_i(t_1), \ldots, x_i(t_{j-1}),\ x_i(t_{j+1}), \ldots, x_i(t_N);\ \ldots;\ x_n(t_1), \ldots, x_n(t_N)]. \tag{5-5}$$

Note that in many cases it is neither easy nor important to categorize the type of modeling being employed. It might be unclear, for example, in what category one should place a model where the time slice is fairly broad, such as one in which monthly averages of daily values are compared to one another. If the daily values are considered the basic data, then the model is a combined model; if the monthly averages are considered the basic data, then the model is a cross-measurement model. It is clearly less important to categorize a model than to create and use it appropriately.

5.3 Types of Inconsistencies

There are several types of inconsistencies one might be interested in detecting in the data:

1. Abrupt, but persistent, changes;
2. Slow nonstationarities; and
3. Anomalous data (abrupt, nonpersistent changes).

Let us discuss these categories of problem and the formulation of models for their solution.

Abrupt, Persistent Changes

The change in the data may occur suddenly in time, i.e., at an identifiable point in time. There are generally two types of abrupt, persistent changes of interest:

1. Malfunctioning measurement or recording devices - If a measuring device suddenly begins to malfunction, it will generally continue to malfunction until repaired or replaced. The motivation for detecting such a problem is obvious. In the present categorization, we intend to mean by an abrupt, persistent change a change in the underlying model which occurs over a relatively short period of time. This is as distinguished from slow changes or short-term changes.

2. Changes in the system - We refer to major changes in the system which occur over a short period of time, such as the opening of a new freeway or the opening of a major indirect source. As well as permanent changes, there may be temporary but significant changes, such as if a city were to host the Olympic Games. Without specific attention to such events, the conclusions of an analysis could be misleading. The analysis of this type of abrupt change has been called "intervention analysis" by Box and Tiao [16].

There is also clearly a matter of degree. An event can have a relatively mild effect, as might the closing of several on-ramps to a freeway. One output of a consistency analysis should be a measurement of the degree of inconsistency.

This category of inconsistency has the basic character of having a significantly different relationship between variables in the time periods before the event and after the event. The point in time separating the two periods is assumed unknown (since the purpose of a consistency analysis is to discover such points).

The first of two basic technical approaches to this problem consists of creating a series of models and searching for a statistically significant change in model structure or parameters. One may create a model over the interval $[t_1, t_k]$ and predict $x(t_{k+1})$. If the prediction is consistent with observation, then a model over $[t_1, t_{k+1}]$ is created to predict $x(t_{k+2})$, and so on, until a discrepancy occurs. A simple modeling technique or recursive procedure is probably a requirement if a high computing cost is to be avoided.

The second approach does not require as abrupt a change as the first but may be more computationally demanding. Here, one can create two models, one for the period $[t_1, t_k]$ and one for the period $[t_{k+1}, t_N]$. One can calculate an appropriate measure of the difference in the models, say $D_k$.
Repeating this for varying breakpoints $t_k$, one can determine the value at which the difference $D_k$ is maximized, presumably the point when the change occurred.

Slow Nonstationarities

Many types of change will occur gradually over a period of time. For example, the retrofitting of emission control devices in automobiles in California was mandated by law to occur in a month-by-month fashion depending upon the digit of the car owner's license plate. The slow introduction of the retrofitting might affect the time sequence of air quality measurements. Another example is a slow but significant drift in a measuring instrument. Such an inconsistency would be detected as a systematic change in the appropriate model over time, as opposed to an abrupt inconsistency.

As with abrupt changes, categorizations of slow nonstationarities are possible. They may be related either to measurement device drift or to changes in the system, and they may be either temporary or permanent. (An example of a temporary but slow nonstationarity is a slow but definite degradation in the degree of compliance with the 55-miles-per-hour speed limit.)

The most straightforward approach to this problem is to postulate the form of the nonstationarity and test for it. For example, two air-quality monitoring stations near each other might measure the same pollutant, recording $x_1(t)$ and $x_2(t)$, respectively. One could then do a linear regression of day-to-day changes of the stations against one another, i.e., find the best-fit linear relationship between

$$v_1(t_k) = x_1(t_k) - x_1(t_{k-1})$$

and

$$v_2(t_k) = x_2(t_k) - x_2(t_{k-1})$$

for $k = 2, 3, \ldots, N$. The result will be of the form

$$v_1 = a\,v_2 + b.$$

One can then test statistically whether $b$ is significantly different from zero. If it is, the values measured by one station are drifting relative to the other. Unless this can be explained by a constantly increasing (or decreasing) emission source affecting one of the stations selectively, it is an inconsistency.
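The drift test on day-to-day changes can be sketched as follows. The two station series are synthetic, and the drift rate and noise levels are invented. The t statistic for b uses the ordinary least-squares standard error, which assumes independent residuals; differencing induces correlation between successive residuals, so the statistic should be read as a rough screening value rather than an exact test.

```python
import numpy as np

def drift_test(x1, x2):
    """Regress v1 = a * v2 + b on day-to-day changes; return (a, b, t statistic of b)."""
    v1, v2 = np.diff(x1), np.diff(x2)
    A = np.column_stack([v2, np.ones(v2.size)])
    coef, *_ = np.linalg.lstsq(A, v1, rcond=None)
    resid = v1 - A @ coef
    sigma2 = resid @ resid / (v1.size - 2)                  # residual variance
    se_b = np.sqrt(sigma2 * np.linalg.inv(A.T @ A)[1, 1])   # OLS standard error of b
    return coef[0], coef[1], coef[1] / se_b

# Two nearby stations seeing the same pollutant signal (all values invented).
rng = np.random.default_rng(3)
t = np.arange(365)
common = 60.0 + 15.0 * np.sin(2 * np.pi * t / 30.0)
x2 = common + rng.normal(0.0, 0.2, t.size)
x1_ok = common + rng.normal(0.0, 0.2, t.size)
x1_drift = x1_ok + 0.1 * t      # instrument drifting upward 0.1 units per day

a_ok, b_ok, t_ok = drift_test(x1_ok, x2)
a_dr, b_dr, t_dr = drift_test(x1_drift, x2)
print(f"no drift:   b = {b_ok:.3f}, t = {t_ok:.2f}")
print(f"with drift: b = {b_dr:.3f}, t = {t_dr:.2f}")
```

A steady drift adds a constant to every day-to-day change at one station, which is why it appears in the intercept b rather than in the slope a.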
Another approach is to compare a model created on $[t_1, t_k]$ with a model created on $[t_{k+\delta}, t_N]$, where the time gap $\delta$ between the periods modeled is sufficient to detect a slow drift. This approach requires fewer assumptions regarding the form of a possible nonstationarity.

Anomalous Data

This type of inconsistency might be categorized as a "noisy" measurement. It could be caused by erroneous recording or digitization of the data, by a temporarily malfunctioning instrument, or by an anomalous occurrence such as sidewalk repairs raising dust near a site monitoring suspended particulate levels. Such an occurrence is a short-term abrupt inconsistency in either a time sequence or cross-measurement model. It is a relatively conventional type of problem encountered in data analysis and is often referred to as "outlier analysis."

This problem can be approached in the single-variable case by studying extreme values detected by creating a histogram (the empirical distribution) of measured values. The more variables measured, the greater the potential for outliers which are not obvious by looking at individual variables. (The classical example is the existence of a "pregnant male" in a medical data base; neither "pregnant" nor "male" is illegal, only the combination.) In the multivariate case, the most general class of techniques for detecting outliers is "cluster analysis" [17]. Very small clusters of points or single-point clusters in multivariate space are inconsistencies which should be examined.

5.4 Difficulties

The major technical difficulties in consistency analysis are, first, nonlinearities and, second, a lack of data relative to the number of variables whose relation is to be modeled. Most air quality and meteorological parameters are nonlinearly related. Further, it often takes a large number of variables to determine with accuracy other meteorological or air-quality variables.
This means that the diversity of joint observations of values of a large number of variables that one can expect in a given data base or at the start of a measurement program is limited. Compounding the problem, nonlinear models will, in general, require more parameters than linear models and, hence, require more data for accurate model determination.

These problems can be alleviated by both technical and operational solutions. A technical consideration is that an efficient (low-parameter) nonlinear form will require less data for the determination of the model than an inefficient (overparameterized) nonlinear form; hence, efficient functional forms, such as continuous piecewise-linear functions, can help alleviate this problem. A second technical point is that a set of models of relatively simple form can be created with subsets of the relevant variables.

The operational consideration is that one may be able to tolerate a high level of "false alarms" in detecting inconsistencies at the beginning of a data collection project or in analyzing a data base in the initial stages. It is at this early point in most data collection or data analysis efforts that most of the problems are encountered. As more data is collected, the model will become more refined and flag fewer potential inconsistencies.

Another possible problem is the inclusion of inconsistencies into the model. Without care, the data can be modeled including inconsistencies in such a way that the inconsistencies are fitted and do not become apparent as a discrepancy in the model. This pitfall can be avoided by simply employing good data-analytic practices to avoid overfitting.

For many projects in data collection and analysis, the use of conventional tools in a careful manner can provide a systematic analysis of consistency which may avoid erroneous analyses and a great deal of wasted effort.
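As one example of a conventional tool applied carefully, the two-model approach to abrupt, persistent changes in Section 5.3 can be sketched with simple linear fits. The cross-measurement relationship, the size of the shift, and the difference measure are all assumptions made for illustration: here the difference D_k is taken as the mean squared gap between the two fitted lines over the observed inputs, which is only one of many possible choices and is not normalized for sampling uncertainty.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares slope and intercept of y on x."""
    A = np.column_stack([x, np.ones(x.size)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def breakpoint_scan(x, y, min_seg=20):
    """For each candidate k, fit separate lines to samples [0, k) and [k, N).

    D[k] is the mean squared difference between the two fitted lines,
    evaluated at every observed x.
    """
    A = np.column_stack([x, np.ones(x.size)])
    ks = np.arange(min_seg, x.size - min_seg + 1)
    D = np.array([
        np.mean((A @ fit_line(x[:k], y[:k]) - A @ fit_line(x[k:], y[k:])) ** 2)
        for k in ks
    ])
    return ks, D

# Invented cross-measurement data: y depends linearly on x, and a
# calibration-like offset of 15 units appears from sample 150 onward.
rng = np.random.default_rng(4)
n = 300
x = rng.uniform(-2.0, 6.0, n)
y = 2.0 * x + 10.0 + rng.normal(0.0, 2.0, n)
y[150:] += 15.0

ks, D = breakpoint_scan(x, y)
k_star = ks[D.argmax()]
print(k_star)
```

Keeping both models low-parameter, as recommended above, is what makes it feasible to refit them at every candidate breakpoint; the `min_seg` guard avoids fitting a model to a handful of points at either end.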
6.0 REPRO-MODELING: EMPIRICAL APPROACHES TO THE UNDERSTANDING AND EFFICIENT USE OF COMPLEX AIR QUALITY MODELS

Several computer-based mathematical models derived from basic physical principles have been constructed to model air pollution and meteorological phenomena. The diversity of inputs to such models and the typically long running times often make it difficult to understand the full implications of the models or to use the models in certain planning applications where large numbers of alternatives must be rapidly evaluated. The concept of "repro-modeling" is to treat a model as a source of data for an empirical analysis [18]. Such an analysis will, in general, have two major objectives:

1. To understand the implications of the model by discovering which variables most affect the outputs of interest and in what way they affect them; and

2. To construct, as a simple functional form, a model of the relationship between key independent variables and key model outputs.

Since this approach has been the subject of a previous EPA contract, in which the technique of repro-modeling was applied to a reactive dispersive model of photochemical pollutant behavior in the Los Angeles basin [5], we will not discuss it in further depth in this report. We do wish to emphasize the role of such an analysis in evaluating and validating models, as well as in suggesting to modelers characteristics implied by a current version of the model which might bear further investigation.

One point which has not been emphasized in earlier discussions of repro-modeling is its use in model validation and sensitivity analysis. Often sensitivity analysis is performed on models in order to determine which parameters of the model are most critical in determining the model output [19]. The change in model output with a small change in a given parameter or input value is the sensitivity of the model to that parameter.
Since the sensitivity of a model to a particular parameter will, in general, depend upon the values of the other parameters, classical sensitivity analysis is usually performed in one of two ways:

1. One set of typical values for the parameters and inputs is chosen, and small changes in the parameters are made about that nominal condition in order to examine sensitivity. This obviously indicates only the sensitivity at the particular nominal condition chosen.

2. A "factorial" analysis is performed, where a number of diverse nominal values are chosen and the above analysis is repeated for this large number of diverse conditions. This exercises the full range of potential operation of the model, but creates the problem of commensurating the implications of what are often thousands of model runs. It also has the obvious disadvantage of requiring a large number of model runs.

If one is willing to perform a given number of model runs to get a number of nominal points for a sensitivity analysis, it is more efficient to fit the points with an appropriate functional form, such as a continuous piecewise linear form [5], than to do a sensitivity analysis at each point. As demonstrated in the referenced report, this results in regimes in which the model output is a linear function of the model inputs and/or parameters and in which the sensitivity to those parameters and inputs is quite clearly displayed. This approach automatically determines those regimes in which the sensitivity is relatively constant over a large area of parameter/input variations. This "global" sensitivity analysis approach can be more easily interpreted and more efficient than a "local" sensitivity analysis approach.

7.0 OTHER APPLICATION AREAS

Three additional topics are treated briefly here. The brevity is not related to a judgment of importance, but simply to the limited nature of the remarks.

7.1 Spatial Interpolation of Meteorological and Air Quality Measurements

Several recent studies have adopted a simple interpolation formula to construct continuous wind fields. (The approach is applicable to the interpolation of other quantities as well.) This formula includes every monitoring station, with the weight of each measurement inversely proportional to the distance to the monitoring station location raised to a power. More explicitly, the interpolation formula is

$$ v_j = \frac{\sum_{i=1}^{n} v_i / R_{ij}^{a}}{\sum_{i=1}^{n} 1 / R_{ij}^{a}} \tag{7-1} $$

where there are $n$ measurements within a prespecified distance of the $j$th point and where $v_i$ is the measurement at the $i$th location; $R_{ij}$ is the distance between points $i$ and $j$, and $a$ is the exponent. This formula is applied separately to each component of the wind vector, and the two resulting estimates are combined to recover an interpolated wind speed and direction. The value of $a$ has been chosen to be either 1 or 2 in previous studies.

This approach is closely related to some recent Russian work [20,21,22] and work by M. Rosenblatt [23]. In these papers the concept of nonlinear regression is explored by means of kernel functions and density estimates. The use of these methods in the wind field problem would involve estimates of the type

$$ V(x) = \sum_{i=1}^{N} v_i \, K_i(x - x_i) \tag{7-2} $$

where $x_i$ is the location of the $i$th station and $v_i$ the measured wind vector at the $i$th station. The kernel functions $K_i(y)$ have generally been taken, in the statistical literature, to be the same for all $i$ and generally to be a smooth Gaussian-type function with the sharpness of its peak determined by a shape parameter $\sigma$. Equation (7-1) fits the formulation using instead an inverse-distance kernel function with shape parameter $a$. The referenced papers and ongoing research in probability density estimation are thus relevant to a deeper understanding of the implications of using (7-1) and to the development of alternative approaches.
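
The inverse-distance interpolation of equation (7-1) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used in the studies cited: the function names, the planar station coordinates, the `max_dist` cutoff argument, and the sample stations are hypothetical, and the direction returned is simply the mathematical angle of the interpolated vector (counterclockwise from the positive x axis) rather than any particular meteorological convention.

```python
from math import atan2, degrees, hypot


def idw_wind(target, stations, a=2.0, max_dist=float("inf")):
    """Interpolate (u, v) wind components at `target` by inverse-distance weighting.

    stations: list of ((x, y), (u, v)) pairs, one per monitoring station.
    a:        exponent applied to the distance (1 or 2 in the studies cited).
    max_dist: stations farther than this from the target are ignored.
    """
    num_u = num_v = den = 0.0
    for (x, y), (u, v) in stations:
        r = hypot(target[0] - x, target[1] - y)
        if r == 0.0:
            return (u, v)  # target coincides with a station; use its value
        if r > max_dist:
            continue
        w = 1.0 / r ** a   # weight inversely proportional to distance^a
        num_u += w * u     # each component is interpolated separately
        num_v += w * v
        den += w
    if den == 0.0:
        raise ValueError("no stations within max_dist of target")
    return (num_u / den, num_v / den)


def speed_and_direction(u, v):
    """Combine components into speed and vector angle in degrees (assumed convention)."""
    return hypot(u, v), degrees(atan2(v, u)) % 360.0


# Hypothetical stations: position (x, y) and measured wind components (u, v).
stations = [((0.0, 0.0), (1.0, 0.0)), ((2.0, 0.0), (3.0, 0.0))]
u, v = idw_wind((1.0, 0.0), stations)      # midway between the two stations
speed, direction = speed_and_direction(u, v)
```

At the midpoint the two stations receive equal weight, so the interpolated u component is the simple average of the two measurements. Replacing the weight `1.0 / r ** a` with a Gaussian function of `r` would give a kernel estimate of the kind discussed in connection with (7-2).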

7.2 Health Effects of Air Pollution

Empirical approaches (in particular, linear and nonlinear regression techniques) have been employed in estimating the effects of air pollution levels on health. The main difficulty encountered in this type of analysis is that of determining an incremental effect on respiratory health measurements which are often dominated by vagaries of general health problems, such as flu epidemics, or by individual differences, such as smoking habits or occupational environment. Yet very strong relations must be derived if causal effects are to be inferred. In such conditions, the best hope for improvement is in more highly controlled data collection efforts (which are, however, very expensive).

This situation highlights an important aspect of data analysis projects: a legitimate result of the analysis is a negative conclusion, a conclusion that the data does not admit of reliable results. A negative result is constructive to the degree that it makes the strong statement that the information desired is not present in the data; this settles the matter unless the data base is augmented. A less conclusive culmination of a data analysis effort is a limited negative statement, for example, a conclusion that no linear function of the independent variables predicts the desired variable with statistically significant accuracy.

We note, however, that a negative conclusion does not necessarily imply a faulty data collection effort; it may instead imply that the relationship of interest is less pronounced than initially expected relative to the effect of uncontrolled (or unmeasured) variables. Unfortunately, a well-conceived data analysis or collection effort is often labeled a failure when only negative results are produced, a charge which implies that the knowledge which the study was designed to elicit should have been obvious before the data was collected.

7.3 Short-term Forecasting of Pollutant Levels

The forecasting of next-day pollutant levels is of importance for health warning systems and/or for initiating short-term control procedures. Forecasting pollution levels and forecasting the weather are closely related problems; it is not clear which is the more difficult, but certainly neither is easy.

The empirical approach attempts to model directly the relation implicit in measured meteorological and air-quality data. Persistence (i.e., assuming tomorrow's peak pollutant concentration equals today's peak concentration) usually proves a reliable forecast at lower pollution levels, but not necessarily at high levels, when accuracy is most critical [13]. Certainly persistence will not predict a high pollutant level on a day following a low pollutant level. Regression or time-series approaches tend to exploit persistence and may not be best suited to a situation where the determinants of the future pollution level can differ considerably depending on the level. Further, the performance estimate can be misleadingly high due to the number of low or intermediate pollution days usually included in the analysis.

Classification analysis is probably a more natural approach to the problem. The joint distribution of attributes (i.e., descriptive variables) of high-pollution days can be derived by looking at high-pollution days alone and can be compared to the joint distribution of attributes of intermediate days and to that of low-pollution days. The variables of importance in distinguishing the three classes can be determined, and an algorithm to predict the classes can be derived.

8.0 REFERENCES

1. Meisel, W.
S., "Empirical Approaches to Air Quality and Meteorological Modeling," Proc. of Expert Panel on Air Pollution Modeling, NATO Committee on Challenges of Modern Society, [location illegible], June 6, 1974. (This document may be obtained from the Air Pollution Technical Information Center, Office of Air and Water Programs, Environmental Protection Agency, Research Triangle Park, North Carolina 27711.)
2. Calder, K. L., "Some Miscellaneous Aspects of Current Urban Pollution Models," Proc. Symp. on Multiple Source Urban Diffusion Models, EPA, Research Triangle Park, North Carolina, 1970.
3. Hrenko, J. M., and D. B. Turner, "RAM: Real-Time Air-Quality Simulation Model," EPA, Research Triangle Park, North Carolina (preliminary draft, July 12, 1974).
4. Calder, K. L., "The Feasibility of Formulation of a Source-Oriented Air Quality Simulation Model that Uses Atmospheric Dispersion Functions Empirically Derived from Joint Historical Data for Air Quality and Pollutant Emissions," EPA, Research Triangle Park, North Carolina (draft, August 1974).
5. Horowitz, Alan, and W. S. Meisel, "The Application of Repro-Modeling to the Analysis of a Photochemical Air Pollution Model," EPA Report No. EPA-650/4-74-001, NERC, Research Triangle Park, North Carolina, December 1973.
6. Calder, K. L., "A Narrow Plume Simplification for Multiple Source Urban Pollution Models" (informal unpublished note), December 31, 1969.
7. "ARB Oxidant Readings to Be Adjusted Downward," Calif. ARB Bulletin, Vol. 5, No. 8 (September 1974), pp. 1-2.
8. "Calibration Report: LAAPCD Method More Accurate; ARB More Precise," Calif. Air Resources Board Bulletin, Vol. 5, No. 11 (December 1974), pp. 1-2.
9. Tiao, G. C., G. E. P. Box, and W. J. Hamming, "Analysis of Los Angeles Photochemical Smog Data: A Statistical Overview," Technical Rept. No. 331, Dept. of Statistics, U. of Wisconsin, April 1973.
10. Tiao, G. C., et al., "Los Angeles Aerometric Ozone Data 1955-1972," Technical Rept. No.
346, Dept. of Statistics, U. of Wisconsin, October 1973.
11. Bruntz, S. M., W. S. Cleveland, B. Kleiner and J. L. Warner, "The Dependence of Ambient Ozone on Solar Radiation, Wind, Temperature, and Mixing Height," Proc. Symp. on Atmospheric Diffusion and Air Pollution, Santa Barbara, Calif., September 9-13, 1974, American Meteorological Society, Boston, Mass.
12. Smith, F. B., and G. H. Jeffrey, "The Prediction of High Concentrations of Sulphur Dioxide in London and Manchester Air," Proc. 3rd Meeting of NATO/CCMS Expert Panel on Air Pollution Modeling, Paris.
13. Horowitz, A. J., and W. S. Meisel, "On Time-Series Models in the Short-term Forecasting of Air Pollution Concentrations," Technology Service Corporation Report No. TSC-74-DS-101, Santa Monica, Calif., August 22, 1974.
14. Breiman, Leo, and W. S. Meisel, "General Estimates of the Intrinsic Variability of Data in Nonlinear Regression Models," TSC Report, Technology Service Corp., Santa Monica, Calif., October 1974.
15. Gronskei, K. E., E. Joranger and F. Gram, "Assessment of Air Quality in Oslo, Norway," published as Appendix D to the NATO/CCMS Air Pollution Document "Guidelines to Assessment of Air Quality (Revised) SOx, TSP, CO, HC, NOx, Oxidants," Norwegian Institute for Air Research, Kjeller, Norway, February 1973. (This document may be obtained from the Air Pollution Technical Information Center, Office of Air and Water Programs, Environmental Protection Agency, Research Triangle Park, North Carolina.)
16. Box, G. E. P., and G. C. Tiao, "Intervention Analysis with Applications to Economic and Environmental Problems," Technical Report No. 335, Department of Statistics, University of Wisconsin, Madison, Oct. 1973.
17. "Cluster Analysis," Chapter VIII of W. S. Meisel, Computer-Oriented Approaches to Pattern Recognition, Academic Press, 1972.
18. Meisel, William S., and D. C.
Collins, "Repro-Modeling: An Approach to Efficient Model Utilization and Interpretation," IEEE Transactions on Systems, Man, and Cybernetics, Vol. SMC-3, No. 4, July 1973, pp. 349-358.
19. Thayer, S. D., and R. C. Koch, "Sensitivity Analysis of the Multiple-Source Gaussian Plume Urban Diffusion Model," Preprint volume, Conference on Urban Environment, October 31-Nov. 2, 1972, Philadelphia, Pennsylvania (published by American Meteorological Society, Boston, Mass.).
20. Nadaraya, E. A., "On Estimating Regression," Theor. Probability Appl., Vol. 4, pp. 141-142, 1965.
21. Nadaraya, E. A., "On Non-parametric Estimates of Density Functions and Regression Curves," Theor. Probability Appl., Vol. 5, pp. 186-190, 1965.
22. Nadaraya, E. A., "Remarks on Non-parametric Estimates for Density Functions and Regression Curves," Theor. Probability Appl., Vol. 15, pp. 134-137, 1970.
23. Rosenblatt, M., "Conditional Probability Density and Regression Estimators," Multivariate Analysis, Vol. II, pp. 25-31, Academic Press, New York, 1969.