EPA United States Environmental Protection Agency
Environmental Sciences Research Laboratory
Research Triangle Park NC 27711

EPA-600/4-80-013b
February 1980

Research and Development

Evaluation of the Real-Time Air-Quality Model Using the RAPS Data Base
Volume 2. Statistical Procedures

-------

RESEARCH REPORTING SERIES

Research reports of the Office of Research and Development, U.S. Environmental Protection Agency, have been grouped into nine series. These nine broad categories were established to facilitate further development and application of environmental technology. Elimination of traditional grouping was consciously planned to foster technology transfer and a maximum interface in related fields. The nine series are:

1. Environmental Health Effects Research
2. Environmental Protection Technology
3. Ecological Research
4. Environmental Monitoring
5. Socioeconomic Environmental Studies
6. Scientific and Technical Assessment Reports (STAR)
7. Interagency Energy-Environment Research and Development
8. "Special" Reports
9. Miscellaneous Reports

This report has been assigned to the ENVIRONMENTAL MONITORING series. This series describes research conducted to develop new or improved methods and instrumentation for the identification and quantification of environmental pollutants at the lowest conceivably significant concentrations. It also includes studies to determine the ambient concentrations of pollutants in the environment and/or the variance of pollutants as a function of time or meteorological factors.

This document is available to the public through the National Technical Information Service, Springfield, Virginia 22161.

-------

EPA-600/4-80-013b
February 1980

EVALUATION OF THE REAL-TIME AIR-QUALITY MODEL USING THE RAPS DATA BASE
Volume 2. Statistical Procedures

by

Harold S. Javitz and Ronald E. Ruff
Atmospheric Science Center
SRI International
Menlo Park, California 94025

Contract No. 68-02-2770

Project Officer

John S.
Irwin
Meteorology and Assessment Division
Environmental Sciences Research Laboratory
Research Triangle Park, North Carolina 27711

ENVIRONMENTAL SCIENCES RESEARCH LABORATORY
OFFICE OF RESEARCH AND DEVELOPMENT
U.S. ENVIRONMENTAL PROTECTION AGENCY
RESEARCH TRIANGLE PARK, NORTH CAROLINA 27711

-------

DISCLAIMER

This report has been reviewed by the Environmental Sciences Research Laboratory, U.S. Environmental Protection Agency, and approved for publication. Approval does not signify that the contents necessarily reflect the views and policies of the U.S. Environmental Protection Agency, nor does mention of trade names or commercial products constitute endorsement or recommendation for use.

ii

-------

ABSTRACT

The theory and programming of statistical tests for evaluating the Real-Time Air-Quality Model (RAM) using the Regional Air Pollution Study (RAPS) data base are fully documented in four volumes. Moreover, the tests are generally applicable to other model evaluation problems. Volume 2 presents the tests considered for evaluating air-quality simulation models in general and those that would be useful in evaluating the RAM. The capability of the RAM to predict sulfur dioxide (SO2) concentrations was of particular interest. Specific tests for both intermediate and final evaluations are recommended, with accompanying descriptions of formats, plots, and procedures to establish confidence limits. Discussion focuses on the mathematics, procedures, and interpretation of the individual tests; their relevance to the project objectives; and possible trade-offs among tests.

This report was submitted in fulfillment of Contract No. 68-02-2770 by SRI International under the sponsorship of the U.S. Environmental Protection Agency. This report covers a period from 1 October 1977 to 1 April 1979, and work was completed as of April 1979.

iii

-------

CONTENTS

Abstract ........................... iii
Figures ........................... vi
Tables ............................
vi 1. Overview ........................ 1 Introduction .................... 1 Intermediate and final evaluation in model verification .............. 1 2. Evaluation Statistics .................. 3 General ...................... 3 The steps involved in selecting intermediate-evaluation statistics ........ 3 The steps involved in selecting final-evaluation statistics ........... 4 The degree of interchangeability of intermediate- and final-evaluation statistics .................... 6 Final-evaluation statistics ............ 7 Intermediate-evaluation statistics ........ 10 References .......................... 14 Appendices A. Common notation and conventions .... ......... 15 B. Final-evaluation statistics .............. 17 C. Confidence limits ................... 28 D. Intermediate-evaluation statistics .......... 30 ------- FIGURES Number Page B-l Geographical map 25 B-2 Translated vectors 26 B-3 Average of vectors 26 B-4 Residual vectors 26 B-5 Tolerance ellipse 26 D-l Scattergram, regression line, and bands tolerance 30 D-2 Visual assessment of correlation 33 D-3 Nonlinear scatterplot relationships 33 D-4 Sample residual versus parameter plot 40 D-5 Sample autocorrelation plot 44 D-6 Sample plot of cumulative periodogram 45 D-7 Sample plot of residuals versus time 46 D-8 Adjusted sample plot of residuals versus time 47 TABLES Number Page D-l Post-Agreement 48 D-2 Prefigurance 48 ------- SECTION 1 OVERVIEW INTRODUCTION Two types of model verification are distinguished: (1) intermediate evaluation for model modification and (2) final evaluation for model com- parison. These two types of evaluation are performed via evaluation sta- tistics, and the steps required in the selection of appropriate statistics are discussed. For intermediate evaluation, these steps are fairly straightforward. For final evaluation, the investigator must consider the relative seriousness of different types of prediction errors and must specify a "loss function." 
Because of the uncertainty concerning the impact of a misprediction, these are difficult tasks. The particular statistics we have chosen for intermediate and final evaluation are explained briefly in the text and more thoroughly in the appendices. In particular, considerable attention is devoted to the accuracy score--a general-purpose final-evaluation statistic. INTERMEDIATE AND FINAL EVALUATION IN MODEL VERIFICATION Verification of an air-pollution model is the process of comparing predicted and observed readings to evaluate model performance. The evaluation may be used for (1) modifying the air-pollution model or (2) quantifying its relative or absolute accuracy. In the first case (which we shall term "intermediate" evaluation), our test might show that the model tends to overpredict pollution concentrations at times when low wind speeds prevail. We may, therefore, choose to modify the model to compensate for the overprediction. In the second case (which we shall term "final" evaluation), we may demonstrate that the average error in prediction of a particular model is only 18 percent of the observed con- centration, and that this is a lower average error than any competing model. ------- Intermediate evaluation is often the basis for model modification and precedes the final evaluation of the "best" model. The terminology assumes there are several models (or versions of the same model). The "best" model is that one which most accurately represents the physical processes occurring in the atmosphere as determined by comparison of the final evaluation statistic for each model. Thus, final evaluation is the basis for model comparison and precedes the selection of the best model. During the intermediate evaluation phase, we are generally "fine tuning" a model after analyzing results from our statistical tests applied to a test data base. Hence, we run the risk of our results being unique to the test data base. 
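One safeguard against tuning a model to the quirks of a single test data base, in the spirit of the subset approach the text goes on to describe, is to hold out a statistically independent subset before any fine tuning begins. A minimal sketch in Python (the function and variable names are our own illustration, not from the report):

```python
import random

def split_pairs(pairs, holdout_fraction=0.5, seed=1):
    """Randomly partition (observed, predicted) pairs into a subset for
    intermediate evaluation (model tuning) and a statistically independent
    hold-out subset reserved for final evaluation."""
    rng = random.Random(seed)        # fixed seed keeps the split reproducible
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]

# Toy prediction-observation pairs for demonstration only.
pairs = [(oc, 1.1 * oc) for oc in range(40)]
tuning, holdout = split_pairs(pairs)
```

Because the hold-out subset is never consulted during tuning, a final-evaluation statistic computed on it is not contaminated by the modifications made during intermediate evaluation.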
To ensure that we are in fact improving a model, we must expand our evaluation using other data bases. (Alternatively, we may use a subset of a data base for intermediate evaluation and a separate, statistically independent, subset for final evaluation.) ------- SECTION 2 EVALUATION STATISTICS GENERAL Evaluation statistics are indices or scores computed from the data base of observed and predicted concentrations. These statistics consti- tute the method by which the evaluation is performed. For example, an intermediate evaluation of a model may involve the computation of many correlation coefficients between observed and predicted concentrations under different meteorological conditions. Certain types of statistics are better suited to intermediate evaluations and others are better suited to final evaluations. (In addition, graphical displays and accompanying subjective judgments may be appropriate for intermediate evaluations.) There are no hard and fast lines, however, and a statistic may be applied to both types of evaluations. THE STEPS INVOLVED IN SELECTING INTERMEDIATE-EVALUATION STATISTICS Different steps are involved in selecting the appropriate statistic to accomplish an intermediate or a final evaluation. For an intermediate evaluation, we first specify an aspect of the agreement (or disagreement) between the predicted and observed concen- trations we intend to investigate. For example, we may be interested in investigating whether high (low) predicted concentrations are associated with high (low) observed concentrations. Or we may be interested in investigating cyclical patterns of overprediction and underprediction. Secondly, we translate the aspect of the agreement we wish to investigate into a model for agreement. Such a model for agreement might be that observed and predicted concentrations are random variables following a bivariate normal distribution with an unknown correlation coefficient. 
Alternatively, we might postulate that the error in prediction is composed of a sinusoidal component and a white-noise component. Thirdly, we ------- specify a statistic that is optimal in some sense for estimating or test- ing the postulated model or some aspect thereof. For example, Pearson's sample correlation coefficient is often used to estimate the true corre- lation coefficient. Or the normalized cumulative periodogram may be used to test for the presence of cyclical trends in prediction error. (Both are described in more detail later in this Section.) THE STEPS INVOLVED IN SELECTING FINAL-EVALUATION STATISTICS The Relative Seriousness of Errors in Prediction For the final evaluation, the first step is determining the relative seriousness (or "loss") associated with different types of errors in air pollution prediction. Consideration should be given to the following types of questions: • Is an underprediction more or less serious than an overprediction? • Is the assessment of the seriousness of an underprediction or an overprediction also dependent upon the magnitude of the observed concentration? • Can misprediction errors be categorized according to quantifiable losses (such as the costs incurred in calling erroneous air pol- lution alerts)? • How serious is an error in predicting the location of the maxi- mum air pollution? • Is a string of overpredictions that is followed by a string of underpredictions more or less serious than a more interspersed pattern? Specifying a Loss Function The end result of these considerations is the specification of a loss function. In its most general form, the loss function quantitatively determines, for any set of observed and predicted concentrations, the loss associated with any errors in prediction. The data base to which the loss function is applied consists of the observed and predicted concentrations at one or more spatial locations and at one or more time periods. 
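The data base described here can be pictured as a collection of records, each carrying the observed and predicted concentrations for one station and time period, over which a loss function is evaluated. A sketch (the field names and values are hypothetical, chosen only for illustration):

```python
# One record per station-hour; field names are our own illustration.
record = {
    "date": "760704", "hour": 14, "station": 108,
    "wind_speed": 2.1, "stability_class": 4,   # optional meteorological fields
    "observed": 87.0, "predicted": 71.5,       # concentrations, e.g. ppb
}

def total_loss(records, loss):
    """Apply a per-record loss function over the whole data base; the
    evaluation statistic is then the loss associated with the data set."""
    return sum(loss(r["observed"], r["predicted"]) for r in records)

records = [record, {"observed": 10.0, "predicted": 12.0}]
abs_total = total_loss(records, lambda oc, pc: abs(oc - pc))  # 15.5 + 2.0
```

The meteorological fields are carried along because, as the text notes, a loss function may consult data beyond the concentration pair itself.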
The data base may also include meteorological and other types of data that are relevant to the computation of loss. Consequently, the loss function may be quite complex. The evaluation statistic associated with an

-------

air-pollution model and a data set is simply the loss associated with the data set. For example, a relatively simple evaluation statistic would be the average-squared error between observed and predicted concentrations over n time periods at a single location or group of locations. Mathematically, this statistic may be written

    A = (1/n) Σ (OC_i - PC_i)² ,  i = 1, ..., n

where OC_i is the observed concentration at the ith time period and PC_i is the corresponding predicted concentration.* Using this statistic implies that the investigator considers underprediction and overprediction to be equally serious, and the assessment of the seriousness of an error in prediction to be independent of the magnitude of the observed concentration, the existence of any threshold, the location of the maximum predicted or observed concentration, or any temporal arrangement of the underpredictions and overpredictions.

The Difficulty in Specifying a Loss Function

Specifying a final-evaluation statistic is equivalent to specifying the loss function that quantifies the losses associated with every type of error in prediction. While generally acceptable final-evaluation statistics (such as mean square error) have been used in the past, the associated loss function and its implications have not been uniformly considered. This omission appears to be linked to the difficulty in quantifying the impacts of misprediction.

As an evaluation tool in the model-development phase, the loss function must be objective so that we can evaluate changes to a specific model or make comparisons among competing models. However, when a model is applied in an operational mode, many subjectively derived factors may

* Refer to Appendix A for a complete description of the notation used throughout this text.
-------

figure in the evaluation process. Difficulties are encountered both in the quantification of the utility or lack thereof to each user of a prediction and in the aggregations of these over many users. Consider the following examples:

• A pollution alert may convince an individual to refrain from any vigorous physical activity for a day or longer, with the individual incurring both an opportunity loss and a possible economic loss, balanced against a projected medical gain.
• A pollution alert might result in the voluntary or mandatory shutdown of local factories.
• The failure to call an alert might result in an unusually large number of emphysema attacks or deaths.
• Decisions to allow or forbid the construction of factories might be partially based on prediction forecasts.

In these examples, the impacts of misprediction can be economic, medical, and political. Such impacts are difficult to quantify and aggregate. Nevertheless, specifying the relative seriousness of different types of errors in air pollution prediction is an absolutely vital step in final evaluation. In the absence of a prior study on relative losses, we advocate the specification of a reasonable loss function, obtained through a process for generating a consensus opinion.

THE DEGREE OF INTERCHANGEABILITY OF INTERMEDIATE- AND FINAL-EVALUATION STATISTICS

As mentioned earlier, although a particular statistic may be particularly well suited for intermediate or final evaluation, there is no logical prohibition against using it for the other evaluation purpose.

* An "alternative" to the specification of a single final-evaluation statistic is the specification of a battery of statistics, some of which are appropriate to final evaluation and some to intermediate evaluation. Investigators use the battery of statistics to gain an appreciation of the strengths and weaknesses of the model.
They then subjectively weight the relative importance of these strengths and weaknesses in order to compare models. This type of final evaluation may be useful, but it is less rigorous, scientific, objective, or clearly specified than the type of final evaluation we have advocated.

-------

The average-squared-error statistic just discussed is an example of a statistic that may be profitably used in an intermediate evaluation. In general, we find that the final-evaluation statistics associated with simpler loss functions are also useful for intermediate evaluation; however, statistics appropriate to intermediate evaluation appear in general to be rather poorly suited to final evaluation.

A particularly glaring example of this phenomenon is Pearson's correlation coefficient (which, unfortunately, is often used in final evaluations). When performing a final evaluation, it is reasonable to implicitly associate a loss of zero with a Pearson's correlation coefficient of +1 and a large loss with a Pearson's correlation coefficient of -1. This loss function necessitates some very strange considerations when final models are chosen. For example, if Model A predicts the observed concentrations very accurately and Model B always predicts ten times the concentrations of Model A, then both models have the same Pearson's correlation coefficient. Again, if Model C always predicts 50 ppb more pollution than Model A, then both these models have the same Pearson's correlation coefficient. The investigator would presumably find all these models equally acceptable. The near-irrationality of these acceptability conditions highlights the unsuitability of using Pearson's correlation coefficient as a single final-evaluation statistic.*

FINAL-EVALUATION STATISTICS

The Accuracy Score -- A General Final-Evaluation Statistic

We have formalized a number of reasonable loss functions and their associated statistics, which we have termed accuracy scores.
* If, in the intermediate-evaluation phase, all prediction models are adjusted so that the average predicted concentration is the same as the average observed concentration, and the variance of the predicted concentration is the same as the variance of the observed concentration, then Pearson's correlation coefficient is mathematically equivalent to mean-square error. In this event, either Pearson's correlation coefficient or mean-square error may be used as a final-evaluation statistic, with equivalent results.

-------

Accuracy scores are based on a particularly simple, but quite general, type of loss function. These are the loss functions that are additive over time periods. A loss function is additive over time periods if (1) a loss can be computed due to the error in prediction at one particular time period and (2) the total loss due to error in prediction over a set of time periods is the arithmetic sum of the losses in each time period in the set. The accuracy score is the average loss over the set of time periods. An example of an accuracy score is average-squared loss, which has already been discussed.

Eight different accuracy scores are described in the first section of Appendix B. Seven of these are concerned with the losses due to errors in prediction at a particular location. If data are available for many locations, one may base the final evaluation on the average loss over the total number of prediction-observation pairs.* The eighth accuracy score is concerned with losses due to an error in predicting the location of the maximum observed concentration.

More complicated accuracy scores are also possible. For example, one might specify an accuracy score based on the loss for a particular time period while simultaneously considering both the error in the prediction of the maximum concentration and the error in the prediction of the location of the maximum concentration.
In addition, one might also base final evaluation on a statistic other than an accuracy score (for example, a statistic based on a nonadditive loss function ) if this better reflected the concept of relative losses. * Alternatively, one could base the final evaluation on the largest value of the accuracy score when a particular accuracy score is computed over many locations. This final-evaluation statistic should not be used without careful thought. For example, it implies that if Model A pre- dicts perfectly accurately at Location 1 and poorly at Location 2, and Model B predicts, at Locations 1 and 2, as poorly as Model A does at Location 2, then Models A and B are equally desirable. A nonadditive loss function might be appropriate in certain situations. For example, suppose that an alert is called only after three consecu- tive hourly predictions of excessive pollution for the next 24-hour period. In this case, very little loss would be associated with over- prediction for a single time period, since it would take three consecu- tive overpredictions to trigger a false alert. ------- To supplement the accuracy score, we recommend two features that enhance its meaning and interpretation. First, confidence limits can be computed so that its statistical significance can be assessed. (Refer to Appendix C for a description of the general procedure.) As a second feature, the accuracy score for each observation point (monitoring loca- tion) can be plotted on a geographical map of the area. The resulting display can then be analyzed for geographical trends. For instance, we could determine if the higher accuracy scores corresponded to urban areas or rural areas, or if the highest scores were associated with one geographical sector (e.g., northwest). Thus, such a display could serve to enhance the accuracy score's overall utility. 
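The additive structure of an accuracy score is easy to state in code. The sketch below is our own illustration (the report's eight loss functions are specified in Appendix B): A is simply the per-period loss averaged over the prediction-observation pairs, with the loss function supplied by the user.

```python
def accuracy_score(observed, predicted, loss):
    """Accuracy score A for a loss function that is additive over time
    periods: the average per-period loss (lower is better)."""
    assert len(observed) == len(predicted)
    return sum(loss(oc, pc) for oc, pc in zip(observed, predicted)) / len(observed)

# Two of the simple additive losses discussed in the text:
squared_error = lambda oc, pc: (oc - pc) ** 2
absolute_error = lambda oc, pc: abs(oc - pc)

oc = [10.0, 20.0, 30.0]
pc = [12.0, 18.0, 33.0]
a_sq = accuracy_score(oc, pc, squared_error)    # (4 + 4 + 9) / 3
a_abs = accuracy_score(oc, pc, absolute_error)  # (2 + 2 + 3) / 3
```

Because the loss function is a parameter, the same machinery serves any of the additive losses; only the nonadditive case (such as the three-consecutive-overpredictions alert example) falls outside it.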
Other Final-Evaluation Statistics

Prior to arriving at our recommendation of the accuracy score for evaluating the final model, we considered several other tests. Each has its own advantages. These other tests were considered in detail but not included in our final package because they repeated, to a large extent, the accuracy-score statistics. The three tests (skill score, P-score, and maximum concentration tolerance intervals) are described below and in Appendix B.

Skill Score --

The skill score discussed in Appendix B (see p. 21) is a statistic that can be used either as an intermediate- or as a final-evaluation statistic. The skill score is an index of model performance, which may be adjusted to reflect how well a baseline model (such as chance or persistence) would have performed. This statistic is substantially the same as the accuracy score, with a user-supplied loss matrix (see the write-up of the accuracy score in Appendix B) with zeros on the diagonal and a loss of 1 elsewhere.

P-Score --

The P-score statistic, as described in Appendix B (see p. 22), is useful in evaluating models that assign probabilities to various levels

-------

of concentration. This type of prediction is not common, however, and the accuracy score could be adjusted to accommodate this type of model.

Maximum Concentration Tolerance Intervals --

The purpose of this test is to evaluate a model's ability to predict the location of the maximum (observed) concentration (see p. 23 of Appendix B). This test computes the mean distance between the "observed" and predicted locations of maximum concentration and a spatial tolerance interval (which would basically be defined by an ellipsoid). The drawback with this test is that the monitoring network must be of sufficient spatial density that the observed concentration is known everywhere in the region of interest. Realistically, we must expand our number of observation locations through interpolation techniques.
If the monitoring network is not dense enough to support an accurate interpolation, the test is neither objective nor statistically sound. For the RAPS network, monitoring stations are separated by about 5 km, and the ability to interpolate air-quality concentrations between observational stations is largely unknown. Therefore, we have not recommended the inclusion of this test in our statistical package. Instead, we have incorporated a test with a similar objective into our accuracy score, using a maximum concentration location loss function. However, when (or if) it is proven that accurate interpolation between monitoring sites can be achieved, inclusion of the maximum concentration tolerance intervals should be reconsidered.

INTERMEDIATE-EVALUATION STATISTICS

In addition to the recommended final-evaluation statistical tests contained in the accuracy score, we are recommending several tests that are useful in diagnosing problem areas within models. Results from these "intermediate-evaluation" tests serve to identify components of the model most in need of improvement. In some cases, these tests can serve as final-evaluation criteria as well. The following tests and displays are recommended for the above type of intermediate evaluation:

10

-------

• Scatterplot
• Linear regression with confidence and prediction bands
• Pearson's correlation coefficient
• Chi-square test and frequency distribution displays
• Interstation error correlation
• Multiple regression of error residuals
• Analysis of residual time series.

The above and other tests considered in our study are discussed below and in Appendix D.

Discussion of the Tests Selected

Perhaps the most basic and most useful intermediate-evaluation technique is the graphical display of the data in a scatterplot.
The evaluation of the scatterplot of predicted concentration (on the vertical axis) versus the observed concentration (on the horizontal axis) is aided by calculating the linear regression line and the sample variance around this line. The linear regression line may be used directly to modify the model, so that the sample variance around the line becomes the new root- mean-squared error. These techniques are discussed in Appendix D (see p. 29). This section also discusses the confidence bands that can be placed around the linear regression line and the prediction bands for additional observations. The width of the first of these bands indicates how well the position of the "true" underlying linear regression is known from the sample (if indeed the relationship is linear), while the width of the second band indicates how far from the linear regression a new observation could be expected to fall. Closely allied to linear regression is Pearson's correlation coef- ficient, which has long been used as a model evaluation statistic. This statistic is described in Appendix D (see p. 31), along with the confi- dence bands for the correlation coefficient. In Appendix D (p. 34), the spatial and temporal correlation coef- ficients are presented along with their appropriate confidence bands. The spatial correlation coefficient measures the agreement between the 11 ------- predicted and observed spatial trends of air pollution over a prediction period. The temporal correlation coefficient measures the agreement of the predicted and observed temporal trends of air pollution over the whole network of monitoring stations. These statistics are essentially derivatives of Pearson's correlation coefficient and contain much the same information. The chi-square test is a standard statistic used for examining the fit of the distribution of predicted concentrations to the distribution of observed concentrations. 
This statistic and some useful accompanying graphics are discussed in Appendix D (see p. 36). While useful in general, this statistic does not examine the observed and predicted concentrations pairwise. For this reason we suggest that it be used only as a back-up statistic for a procedure such as linear regression, which does examine the data pairwise. The chi-square test is also a useful final-evaluation statistic for those cases where the accuracy of the predicted concentration distribution is important.

A number of techniques can be used to examine directly the error residuals (that is, the differences between the observed and predicted concentrations). Appendix D (p. 38) introduces the interstation error correlation coefficient -- a statistic that examines the correlation between station errors, which may be helpful in uncovering influences accounting for systematic biases in the prediction model. Another, more direct, method is the multiple regression of error residuals, also discussed in Appendix D (see p. 39). This technique attempts to isolate meteorological variables highly correlated with the errors in prediction.

Finally, the analysis of residual time series discussed in Appendix D (p. 41) examines the error residuals for any cyclical behavior using two types of statistics -- the autocorrelation coefficient and the normalized cumulative periodogram.

Other Tests

Appendix D also includes four statistics seriously considered by SRI as evaluation statistics but finally rejected. The runs test (see p. 45) is a nonparametric test that examines for time trends

12

-------
However, the scatterplot contains the same information in a more rapidly digestible form, without the inconvenience of the user specifying pollution concentration categories. Quasi-logistic modeling of the correlation coefficient (see p. 48) is a statistical technique for determining the meteorological conditions under which the correlation coefficient is unusually high or low. The data set must be quite extensive before this technique can be used, how- ever, and the multiple regression of error residuals can extract similar information more efficiently. 13 ------- REFERENCES 1. R. G. Miller, Simultaneous Statistical Inference, pp. 110-116 (McGraw-Hill, New York, New York, 1967). 2. T. W. Anderson, An Introduction to Multivariate Statistical Analysis, pp. 77-79 (John Wiley and Sons, New York, New York, 1958). 3. R. Wonnacott and T. Wonnacott, Econometrics, pp. 103-125 (John Wiley and Sons, New York, New York, 1970). 4. Carmen J. Nappo, "A Method for Evaluating the Accuracy of Air Pollu- tion Prediction Models," Proc. Conf. on Atmospheric Diffusion and Air Pollution, 9-13 September 1974, Santa Barbara, California. 5. Robert Hogg and Allen Craig, Introduction to Mathematical Statistics, pp. 308-315 (The Macmillan Company, New York, New York, 1970). 6. N. R. Draper and H. Smith, Applied Regression Analysis, pp. 104-195 (John Wiley and Sons, New York, New York, 1966). 7. George E. Box and Gwilym M. Jenkins, Time Series Analysis Forecast- ing and Control, pp. 34-36 and 294-298 (Holden-Day, San Francisco, California, 1976). 8. Gwilym M. Jenkins and Donald G. Watts, Spectral Analysis and Its Applications, pp. 183-189 and 234-237 (Holden-Day, San Francisco, California, 1968). 9. W. J. Conover, Practical Nonparametric Statistics, pp. 350-356 and 414 (John Wiley and Sons, New York, New York, 1971). 10. H. A. Panofsky and G. W. Brier, Some Applications of Statistics to Meteorology, pp. 191-208 (Pennsylvania State University Press, University Park, Pennsylvania, 1965). 11. 
Norman Nie et al., Statistical Package for the Social Sciences, pp. 320-328 (McGraw-Hill, New York, New York, 1975).

14

-------

APPENDIX A

COMMON NOTATION AND CONVENTIONS

To shorten the descriptions of the statistical tests, we have assembled a common set of notation and conventions. Any exceptions to this common set are specifically mentioned in the descriptions of particular statistical tests.

In simplest form, the data for a statistical test consist of n pairs of observed and predicted pollutant concentrations. These data are denoted (OC_1, PC_1), (OC_2, PC_2), ..., (OC_n, PC_n). This data set may include all the data gathered or only a subset. In general, a final-evaluation statistic will make use of all the available data, and intermediate-evaluation statistics will most often make use of subsets. The concentrations might refer to hourly averages, three-hour averages, daily averages, etc., at a particular station or averaged over the station network. The pairs need not be sequential nor time-ordered in any particular fashion.

Since it is desirable to assess model fit under a variety of meteorological conditions, the data base is arranged so that it can be subdivided into many nonexclusive subsets according to the following parameters:

• Date, D
• Time period, T
• Location, L
• Pasquill-Gifford stability class, G_1
• Wind-speed class, G_2
• Wind-direction class, G_3
• Mixing-height class, G_4
• Emission class, Q
• Observed concentration, OC
• Predicted concentration, PC.

15

-------

The data for a statistical test shall then be selected from any subset of the data base (in order to assess model fit under those conditions characterizing that data subset).

In quantifying confidence intervals, we use Student's t-distribution. For large sample sizes, the t-distribution approximates the standard normal curve. In the notation, T(m, α/2) represents the upper α/2 cutoff point for a t-distribution with m degrees of freedom.¹

¹Selby, S.
M., CRC Standard Mathematical Tables (Chemical Rubber Company, Cleveland, 1968).

APPENDIX B

FINAL-EVALUATION STATISTICS

ACCURACY SCORE

The accuracy score offers the user several tests to assess, either objectively or subjectively, the "loss" due to differences between predicted concentrations (PC) and observed concentrations (OC). Each test is based on the quantification of a loss function, which measures the loss associated with an error in prediction at a particular time period. The accuracy score (denoted A) is the average loss for the prediction model computed over the n time periods in the data base. Consequently, the accuracy score resembles a golf score: the lower, the better. (The text by Ferguson¹ provides material complementary to the concepts presented herein.)

We consider eight loss functions, which are naturally associated with eight different accuracy scores. The first five rely on (essentially) prespecified loss functions, which are easily interpretable and have therefore been called objective. The next two require that the user specify loss indices associated with various types of misprediction and have therefore been called subjective. The eighth loss function provides an index of the spatial accuracy of the model in predicting the location of the maximum observed concentration.

The importance of correctly specifying the loss function cannot be overemphasized. The loss function embodies our conception of how poorly or well the model fits and the relative seriousness of different departures from perfect fit. For example, a meteorologist who is concerned only about false pollution alerts will choose a very different loss function than a meteorologist who is concerned both about false pollution alerts and failure to predict actual alert-level pollution.

¹Ferguson, Thomas S., Mathematical Statistics, pp. 1-11 (Academic Press, New York, 1967).
We now describe the eight loss functions and accuracy scores: five objective and three subjective.

Objective Loss Functions

Average Absolute Error--

$$A = \frac{1}{n} \sum_{i=1}^{n} |OC_i - PC_i|$$

With this loss function (absolute error), underprediction and overprediction are judged to be equally undesirable. Furthermore, a misprediction of 2 ppm is twice as undesirable as a misprediction of 1 ppm.

Average Squared Error--

$$A = \frac{1}{n} \sum_{i=1}^{n} (OC_i - PC_i)^2$$

With this loss function (squared error) also, underprediction and overprediction are judged to be equally undesirable. A misprediction of 2 ppm is four times as undesirable as a misprediction of 1 ppm. This type of loss function is used in many statistical tests.

Percent Incorrect Predictions with Absolute Error Threshold ($\epsilon$)--

$$A = \frac{1}{n} \sum_{i=1}^{n} \ell_i \;, \qquad \ell_i = \begin{cases} 0 & \text{if } |OC_i - PC_i| \le \epsilon \\ 1 & \text{if } |OC_i - PC_i| > \epsilon \end{cases}$$

With this loss function, only prediction errors that exceed a certain absolute threshold are counted against the prediction model. Underprediction and overprediction are judged to be equally undesirable. Whenever the misprediction exceeds the threshold, its magnitude is ignored; thus, moderate and huge misprediction errors count equally. (The error threshold must be user-supplied.)

Percent Incorrect Predictions with Constant Percentage Error Threshold ($\epsilon$)--

$$A = \frac{1}{n} \sum_{i=1}^{n} \ell_i \;, \qquad \ell_i = \begin{cases} 0 & \text{if } |OC_i - PC_i| / OC_i \le \epsilon \\ 1 & \text{if } |OC_i - PC_i| / OC_i > \epsilon \end{cases}$$

This loss function is similar to the preceding one, except that the threshold criterion is percentage error. (In practice, the case where $OC_i = 0$ is not allowed.)

Symmetric High-Low Loss Function (Cutoff)--

$$\ell_i = \begin{cases} 0 & \text{if both } PC_i \text{ and } OC_i \le \text{cutoff, or both } PC_i \text{ and } OC_i > \text{cutoff} \\ 1 & \text{otherwise} \end{cases}$$

This loss function is particularly appropriate when an alert-level cutoff exists. Here A is the percentage of false alerts and failures to predict actual pollutant alert levels. The magnitude of the false alert is ignored, as is the magnitude of the underprediction when an actual pollutant alert condition is not predicted.
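The objective accuracy scores above can be sketched in a few lines of code. This is an illustrative sketch only (the report specifies the formulas, not an implementation, and the helper names are ours):

```python
# Objective accuracy scores: average loss over n (OC_i, PC_i) pairs.
# Lower scores indicate better model fit, as with a golf score.

def avg_absolute_error(oc, pc):
    """Average absolute error: A = (1/n) * sum |OC_i - PC_i|."""
    return sum(abs(o - p) for o, p in zip(oc, pc)) / len(oc)

def avg_squared_error(oc, pc):
    """Average squared error: A = (1/n) * sum (OC_i - PC_i)^2."""
    return sum((o - p) ** 2 for o, p in zip(oc, pc)) / len(oc)

def pct_incorrect_abs(oc, pc, eps):
    """Fraction of predictions whose absolute error exceeds the threshold eps."""
    return sum(1 for o, p in zip(oc, pc) if abs(o - p) > eps) / len(oc)

def symmetric_high_low(oc, pc, cutoff):
    """Fraction of false alerts plus missed alerts relative to a cutoff."""
    return sum(1 for o, p in zip(oc, pc)
               if (o > cutoff) != (p > cutoff)) / len(oc)

observed  = [3.0, 5.0, 8.0, 2.0]   # hypothetical concentrations
predicted = [2.0, 6.0, 3.5, 2.0]
print(avg_absolute_error(observed, predicted))      # 1.625
print(avg_squared_error(observed, predicted))       # 5.5625
print(pct_incorrect_abs(observed, predicted, 1.0))  # 0.25
print(symmetric_high_low(observed, predicted, 4.0)) # 0.25
```

The subjective scores differ only in the loss assigned to each case, so the same averaging skeleton applies.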
Subjective Loss Functions

Asymmetric High-Low Loss Function (Cutoff)--

$$\ell_i = \begin{cases} 0 & \text{if both } PC_i \text{ and } OC_i \le \text{cutoff, or both } PC_i \text{ and } OC_i > \text{cutoff} \\ \ell_1 & \text{if } PC_i > \text{cutoff and } OC_i \le \text{cutoff} \\ \ell_2 & \text{if } PC_i \le \text{cutoff and } OC_i > \text{cutoff} \end{cases}$$

Here, $\ell_1$, $\ell_2$, and the cutoff value are user-specified loss quantities. This loss function is similar to the preceding one, except that underprediction and overprediction are weighted differently.

User-Supplied Loss Matrix--

The predicted and observed concentrations are divided into categories whose increments are about 20 percent of the applicable standard. The user enters the loss numbers in the matrix. An example of a loss table is shown below:

                       OC Category (k)
  PC Category (j)    1    2    3    4    5
        1            0    1    2    3    4
        2            1    0    1    2    3
        3            2    1    0    1    2
        4            3    2    1    0    1
        5            4    3    2    1    0

Let $\ell_{jk}$ be the number in the jth row and the kth column. The accuracy score is

$$A = \frac{1}{n} \sum_{i=1}^{n} \ell_{jk}$$

where $PC_i$ is in the jth category and $OC_i$ is in the kth category. The loss function $\ell_{jk}$ is extremely general, since the number of cells and the loss numbers in the cells are at the user's discretion. The sample loss matrix approximates the average absolute error loss function; the other loss functions could be similarly approximated.

Loss Function for the Location of Maximum Concentration--

The previous concepts of the accuracy score can be slightly modified to provide an index of the spatial accuracy of the prediction model in forecasting the location of the maximum-concentration reading. This accuracy score can be defined in the following fashion:

Step 1: For the ith time period, identify the monitoring locations, k(i) and k'(i), where the highest observed and predicted concentrations occur.

Step 2: From the symmetric m × m matrix consisting of the distances between monitoring stations, select the distance $d_{k(i),k'(i)}$ between the locations for the highest observed and predicted concentrations.
Step 3: Repeat Steps 1 and 2 for all samples and then compute the accuracy score from the formula

$$A = \frac{1}{n} \sum_{i=1}^{n} d_{k(i),k'(i)}$$

Many other types of objective or subjective loss indices $\ell_{k(i),k'(i)}$ could be used instead of geographical distance. For instance, the loss index could be constructed to take into account topographical features that might cause two geographically close locations to differ substantially in pollutant levels.

SKILL SCORE

The skill score² is an index of model performance that partially summarizes the information contained in the contingency table of predicted and observed pollution concentrations. The skill score may be adjusted to reflect how well a baseline model (chance, persistence, climatology, etc.) would have performed.

The skill score S is given by

$$S = \frac{C - E}{T - E}$$

where C is the number of correct forecasts and T is the total number of forecasts. If no adjustment for a baseline model is desired, then E is set to zero and the skill score is just the percentage of correct forecasts. If an adjustment for a baseline model is desired, then E is set to the expected number of correct forecasts for the baseline model. In this case, the skill score (1) has a value of unity when all forecasts are correct, (2) has a value of zero when the number of correct forecasts equals the expected number correct, and (3) is negative when the number of correct forecasts is less than the expected number correct.

The skill score may easily be modified to summarize how well a model predicts the location of maximum pollution concentration. To this end, C is defined as the number of correct location forecasts and T as the total number of location forecasts. If an adjustment for a baseline model is desired, then E is set to the expected number of correct location forecasts for the baseline model.

²Panofsky, H. A., and G. W. Brier, Some Applications of Statistics to Meteorology, pp.
191-208 (Pennsylvania State University Press, University Park, Pennsylvania, 1965).

P-SCORE

The P-score³ is a method for evaluating a forecast model and is particularly well suited to models that predict the probabilities of certain pollutant levels occurring. This score has the desirable property that it exerts little influence on the model builder to choose a particular model in order to score well on the statistical evaluation. The forecaster is encouraged to get the forecast exactly right or, if he cannot forecast perfectly, to state unbiased estimates of the probability of each possible event.

The P-score ranges from zero to one. A score of zero implies that the prediction model always forecasts the observed pollutant level with a probability of one. A score of one implies that the prediction model always forecasts a probability of zero for the pollutant level that did materialize and a probability of one for a pollutant level that did not materialize.

To compute the P-score, the user must specify various levels of pollutant concentration, say K of them. For the ith time period, the model is assumed to forecast probabilities that the observed concentration will be at particular levels. Let $f_{i1}$ be the forecast probability that the observed concentration at time i will be at level 1, let $f_{i2}$ be the forecast probability that it will be at level 2, and so on. Clearly, $f_{i1} + f_{i2} + \cdots + f_{iK}$ equals one. The P-score is given by

$$P = \frac{1}{2n} \sum_{i=1}^{n} \sum_{j=1}^{K} (f_{ij} - E_{ij})^2$$

where there are n time periods, and where $E_{ij}$ takes the value one if level j was observed at time i and zero otherwise.

³Morrison, D. F., Multivariate Statistical Methods, pp. 131-134 (McGraw-Hill, New York, 1976).

In the absence of any forecasting skill, the expected P-score is minimized by choosing $f_{ij}$ to be the climatological probability of observing level j.
If these probabilities are denoted $p_j$, then the minimum expected P-score is

$$\frac{1}{2} \left( 1 - \sum_{j=1}^{K} p_j^2 \right)$$

The climatological probabilities may be computed either from the sample or from the larger set of data it represents.

TOLERANCE INTERVALS FOR THE LOCATION OF MAXIMUM CONCENTRATION

The ability to predict the location of the maximum concentration of pollutants is an important characteristic of a prediction model. In this section we derive a tolerance interval (or ellipsoid) to describe this ability.

Consider a geographical map of a region. One may locate on this map the monitoring stations and the predicted location of the maximum pollution. Using the results obtained at the monitoring stations, one may determine an interpolated location for the observed maximum concentration. In Figure B-1 an arrow has been drawn from the location of observed maximum concentration to the location of predicted maximum concentration.

Figure B-1. Geographical map.

Two techniques are now available to describe the ability of the model to predict the location of the maximum concentration. The first technique is concerned only with the length of the arrow (i.e., the distance between the observed and predicted locations). Letting $D_i$ be the distance for the ith time period, we compute $\bar{D}$ and $S_D$, the usual estimates of the mean and standard deviation. A confidence interval for $\mu_D$, the mean of the distribution of $D_i$, is given by

$$\bar{D} \pm t_{n-1,.95} \frac{S_D}{\sqrt{n}}$$

where $t_{n-1,.95}$ is the 95th percentile point of the t-distribution with n - 1 degrees of freedom.

Under the assumption that the $D_i$ follow a normal distribution, a 90-percent tolerance interval can be developed. Consider an additional observation, $D_o$, not used in the original calculations. We are 90-percent confident that $D_o$ will be in the interval

$$\bar{D} \pm t_{n-1,.95} \, S_D \sqrt{1 + \frac{1}{n}}$$

The second technique is concerned both with the distance between the observed and predicted locations and with the direction of the error. For each time period, an arrow may be drawn.
Translating these arrows so that their bases are located at a common origin, we obtain Figure B-2. It is easily seen that most of the predictions are to the northeast of the observations, fewer are to the southwest, and still fewer to the northwest or southeast. Taking the average of the vectors, we find the average direction and magnitude of the error in prediction, as shown in Figure B-3. The residual vectors (from the average vector to the translated vectors) may also be drawn, as shown in Figure B-4.

Using statistical techniques, we may draw an ellipsoid around the average vector that captures most of the residual vectors (see Figure B-5). The ellipsoid is large enough that the investigator can be 90-percent confident it will capture a translated vector, when such a vector is obtained under conditions similar to those previously existing. A description of the general procedure follows.

Figures B-2 through B-5. Translated error vectors, their average, the residual vectors, and the 90-percent ellipsoid.

Let X denote the east-west axis and Y denote the north-south axis. Then the vector of differences between the observed and predicted locations for the ith time period is $D_i = (X_i, Y_i)$. We assume that the vector (X,Y) follows a bivariate normal probability distribution with mean $(\mu_x, \mu_y)$, variances $(\sigma_x^2, \sigma_y^2)$, and correlation $\rho$.

The best estimate of the mean $(\mu_x, \mu_y)$ is the sample mean $(\bar{X}, \bar{Y})$. A confidence region for $(\mu_x, \mu_y)$ consists of all points satisfying

$$n \, (\bar{X} - \mu_x, \; \bar{Y} - \mu_y) \begin{pmatrix} S_x^2 & S_{xy} \\ S_{xy} & S_y^2 \end{pmatrix}^{-1} \begin{pmatrix} \bar{X} - \mu_x \\ \bar{Y} - \mu_y \end{pmatrix} \le T^2_{.90}$$

where $S_x^2$ and $S_y^2$ are the usual estimates of the variances of X and Y, $S_{xy}$ is the estimate of the covariance of X and Y, and $T^2_{.90}$ is the cutoff value of Hotelling's $T^2$ statistic.⁴

Consider an additional vector of differences, not used in the original calculations, say $(X_o, Y_o)$. Assuming that $(X_o, Y_o)$ follows the same bivariate normal distribution as (X,Y), we obtain the following result: we are 90-percent confident that $(X_o, Y_o)$ will lie in the ellipsoid

$$(X_o - \bar{X}, \; Y_o - \bar{Y}) \begin{pmatrix} S_x^2 & S_{xy} \\ S_{xy} & S_y^2 \end{pmatrix}^{-1} \begin{pmatrix} X_o - \bar{X} \\ Y_o - \bar{Y} \end{pmatrix} \le \left( 1 + \frac{1}{n} \right) T^2_{.90}$$

⁴Morrison, D. F., Multivariate Statistical Methods, pp. 131-134 (McGraw-Hill, New York, 1976).
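The bivariate procedure above can be sketched numerically. This is a minimal sketch (with numpy; the function name is ours, and the Hotelling $T^2$ cutoff is supplied by the user from a table rather than computed):

```python
# Check whether a new (X, Y) location-error vector falls inside the
# prediction ellipsoid built from past error vectors.
import numpy as np

def prediction_ellipsoid_contains(errors, new_error, t2_cutoff):
    """errors: (n, 2) array of past (X, Y) errors; new_error: length-2 vector;
    t2_cutoff: user-supplied critical value of Hotelling's T^2 at the desired
    confidence level. Returns True if new_error lies inside the ellipsoid."""
    errors = np.asarray(errors, dtype=float)
    n = len(errors)
    mean = errors.mean(axis=0)              # (X-bar, Y-bar), the average error
    cov = np.cov(errors, rowvar=False)      # sample covariance matrix
    d = np.asarray(new_error, dtype=float) - mean
    quad_form = d @ np.linalg.inv(cov) @ d  # quadratic form in the inverse covariance
    return bool(quad_form <= (1.0 + 1.0 / n) * t2_cutoff)
```

The quadratic form plays the role of the left-hand side of the ellipsoid inequality; the (1 + 1/n) factor widens the region to account for estimating the mean.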
APPENDIX C

CONFIDENCE LIMITS

The accuracy score and a number of the intermediate-evaluation statistics presented in this report do not have formulas for confidence limits. However, confidence limits for these statistics may be derived.

To illustrate the procedure for the accuracy score, let A be an accuracy score computed from a data set. Subdivide the data set into k data subsets; we recommend k = 5 as a reasonable value. In this case, the first subset should contain the data from time periods 1, 6, 11, 16, etc. The second subset should contain data from time periods 2, 7, 12, 17, etc. In this fashion each subset contains data from every fifth time period.*

The accuracy scores for each of the subsets can be computed. If these accuracy scores are labeled $A_1, A_2, \ldots, A_k$, then the sample standard deviation of these subset scores is

$$S = \sqrt{\frac{1}{k-1} \sum_{i=1}^{k} (A_i - \bar{A})^2} \;, \qquad \text{where } \bar{A} = \frac{1}{k} \sum_{i=1}^{k} A_i$$

The accuracy score A computed from the total data set has a smaller standard error than the accuracy scores computed from the smaller data subsets. We may approximate the standard error of A by $S/\sqrt{k}$. Thus a 90-percent confidence interval for the true value of the accuracy score is given by

$$A \pm t_{k-1,.95} \frac{S}{\sqrt{k}}$$

where $t_{k-1,.95}$ is the upper 95-percent cutoff point of the Student's t-distribution with k - 1 (four) degrees of freedom.

*This procedure ensures that the data subsets are comparable with each other and with the full data set, except when the data set has a "periodicity" of five time periods. As an example of such periodicity, consider the case wherein the periods were days and the full data set consisted only of weekdays. Then the five data subsets would not be comparable (since one would consist only of Mondays, another only of Tuesdays, etc.). In this case, one should construct six data subsets.
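The subset procedure above can be sketched directly. This is an illustrative sketch (function name ours; the t critical value is passed in by the user, e.g. 2.132 for four degrees of freedom at the 95-percent point):

```python
# Interleaved-subset confidence interval for an accuracy score, following
# the Appendix C procedure: split the time-ordered losses into k subsets
# taking every k-th period, score each subset, and form a t-based interval.
import math

def subset_confidence_interval(losses, k, t_crit):
    """losses: per-time-period loss values in time order.
    Returns (A, half_width): the overall accuracy score A and the
    half-width of the confidence interval A +/- half_width."""
    subsets = [losses[i::k] for i in range(k)]       # every k-th time period
    scores = [sum(s) / len(s) for s in subsets]      # subset accuracy scores
    a_bar = sum(scores) / k
    s = math.sqrt(sum((x - a_bar) ** 2 for x in scores) / (k - 1))
    overall = sum(losses) / len(losses)              # score from the full data set
    return overall, t_crit * s / math.sqrt(k)
```

If the data have a known periodicity equal to k (e.g., five weekday subsets), choose a different k, as the appendix notes.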
APPENDIX D

INTERMEDIATE-EVALUATION STATISTICS

LINEAR REGRESSION WITH CONFIDENCE AND PREDICTION BANDS

The scattergram of observed-versus-predicted concentrations is a visual presentation of the accuracy of the prediction model. The regression line summarizes the average relationship between predicted and observed concentrations.

Two types of bands can be drawn around the regression line. A 90-percent confidence band has a 90-percent probability of capturing the true underlying linear relationship between predictions and observations. A 90-percent prediction band has a 90-percent probability of capturing an arbitrary prediction-observation pair, when that pair is obtained under conditions similar to those existing when the data used in the scattergram were gathered. Typical bands are shown in Figure D-1.

Figure D-1. Scattergram, regression line, 90-percent confidence bands on the regression line, and 90-percent prediction bands on an additional prediction-observation pair.

The regression line and bands can be obtained by applying normal regression theory. For the sake of notation, let $(X_i, Y_i) = (OC_i, PC_i)$. Then the least-squares regression of $Y_i$ on $X_i$ for n samples may be performed, yielding an intercept estimate a, a slope estimate b, and a variance estimate $S^2$, as follows:

$$b = \frac{\sum_{i=1}^{n} X_i Y_i - \frac{1}{n} \left( \sum_{i=1}^{n} X_i \right) \left( \sum_{i=1}^{n} Y_i \right)}{\sum_{i=1}^{n} X_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} X_i \right)^2}$$

$$a = \bar{Y} - b\bar{X}$$

$$S^2 = \frac{1}{n-2} \sum_{i=1}^{n} (Y_i - a - bX_i)^2$$

In contrast to the usual convention, the vertical axis has been reserved for the predicted concentration and the horizontal axis for the observed concentration. This arrangement more closely approximates the underlying structure of normal regression theory, since the predominant error is in the prediction rather than in the observation of air pollution.
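The least-squares quantities above can be computed directly. This is a minimal sketch (function name ours) of the estimates a, b, and $S^2$ from n pairs $(X_i, Y_i) = (OC_i, PC_i)$:

```python
# Least-squares regression of predicted (Y) on observed (X) concentrations,
# returning the intercept a, slope b, and variance estimate S^2.
def least_squares(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    # Slope: centered cross-products over centered sum of squares.
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
        / sum((xi - xbar) ** 2 for xi in x)
    a = ybar - b * xbar
    # Residual variance with n - 2 degrees of freedom.
    s2 = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    return a, b, s2
```

The centered form used here is algebraically identical to the raw-sum form in the text but is less prone to round-off.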
The usual 90-percent confidence interval for the mean a + bX is given by

$$a + bX \pm t_{n-2,.95} \, S \left[ \frac{1}{n} + \frac{(X - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \right]^{1/2}$$

Now consider an additional observation $(X_o, Y_o)$. For given $X_o$, it is possible to derive an interval that possesses a 90-percent probability of capturing $Y_o$. This interval is given by the formula

$$a + bX_o \pm t_{n-2,.95} \, S \left[ 1 + \frac{1}{n} + \frac{(X_o - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \right]^{1/2}$$

The estimates of the intercept and slope may also be used to correct the model for linear trend. To accomplish this, we replace the model prediction Y by the corrected prediction (Y - a)/b. To supplement the material presented herein, the reader is referred to the text by Miller.¹

PEARSON'S CORRELATION COEFFICIENT

Direct correlation between observed and predicted concentrations exists when a rise or fall in predicted concentrations is associated with a corresponding linear rise or fall in observed concentrations.

The degree of correlation between predicted and observed concentrations can often be visually assessed through the use of a scatterplot, as illustrated in Figure D-2. In the situations shown in Figure D-2, the more closely the plotted concentrations can be fitted by a line, the more highly correlated are the predicted and observed concentrations. A correlation of +1 denotes that all the plotted points fall along a line; a correlation close to zero denotes that the data scatter widely.

Figure D-2. Visual assessment of correlation: (a) low correlation, (b) moderate correlation, (c) high correlation.

Two anomalous situations may also occur. In the first, the correlation coefficient is negative; the line that best fits the scatterplot has a negative slope. A correlation of -1 denotes that all the plotted points fall along such a line. In the second, the predicted and observed concentration scatterplot shows a definite nonlinear relationship.
As illustrated in Figure D-3, the correlation will be low in such cases unless the data can also be fit well by a line.

Figure D-3. Nonlinear scatterplot relationships: (a) zero correlation, (b) moderate positive correlation.

When the relationship between predicted and observed concentrations is linear, the square of the correlation coefficient may be interpreted as the percentage of the variance of the predicted concentrations "explained" or "accounted for" by the observed concentrations. The converse interpretation (that the squared correlation coefficient represents the percentage of the variance of the observed concentrations explained by the predicted concentrations) also holds.

Pearson's formula for the correlation coefficient is given by

$$r = \frac{\sum_{i=1}^{n} (OC_i - \overline{OC})(PC_i - \overline{PC})}{\sqrt{\sum_{i=1}^{n} (OC_i - \overline{OC})^2} \, \sqrt{\sum_{i=1}^{n} (PC_i - \overline{PC})^2}}$$

In discussing the correlation coefficient, we have been concerned with the correlation exhibited in the sample. We now consider the correlation coefficient in the population from which the sample has been drawn. Due to sampling variability, Pearson's correlation coefficient for the sample will not usually equal the correlation coefficient in the population. With a large sample, however, the two correlation coefficients will be close. The correlation coefficient for the population is usually denoted $\rho$.

It can be shown that if the observed and predicted concentrations are jointly normally distributed, there is a 95-percent probability that $\rho$ is contained in the interval

$$\tanh\left( \tanh^{-1} r - \frac{1.96}{\sqrt{n-3}} \right) \le \rho \le \tanh\left( \tanh^{-1} r + \frac{1.96}{\sqrt{n-3}} \right)$$

This interval is called a 95-percent confidence interval for the population correlation coefficient. The texts by Anderson² and Wonnacott³ provide more detailed information on the derivation of the confidence interval.
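The Fisher z-transform interval above is straightforward to compute. This is an illustrative sketch (function name ours):

```python
# 95-percent confidence interval for the population correlation rho,
# given a sample correlation r from n jointly normal pairs.
import math

def fisher_ci(r, n, z=1.96):
    """Confidence interval for rho via the Fisher z-transform."""
    half = z / math.sqrt(n - 3)
    zr = math.atanh(r)                 # tanh^{-1}(r)
    return math.tanh(zr - half), math.tanh(zr + half)

lo, hi = fisher_ci(0.8, 100)           # interval around r = 0.8 with n = 100
```

If the interval fails to contain zero, the correlation is significantly different from zero at the 5-percent level, the same criterion used later for the interstation error correlation test.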
TEMPORAL AND SPATIAL CORRELATION COEFFICIENTS

As described by Nappo⁴, an accurate air-prediction model must reproduce at each monitoring station the observed time-varying pollution concentration, and reproduce at each monitoring time the observed space-varying pollution pattern. A measure of the agreement of the predicted and observed temporal trends of air pollution over the whole network of monitoring stations is given by the temporal correlation coefficient $\bar{R}(t)$, the average over stations (spatial average) of the temporal correlation coefficient formed at each station. A measure of the agreement of the predicted and observed spatial trends of air pollution over a prediction period is given by the spatial correlation coefficient $\bar{R}(s)$, the average over time of the correlation coefficient between the predicted and observed patterns of the concentration isopleths at each monitoring time.

Let the collection of observed and predicted concentrations, $O_{s,t}$ and $P_{s,t}$, be indexed by the monitoring station s = 1,2,...,S and by the monitoring time t = 1,2,...,T. It is not necessary for the stations or the times to be contiguous.
The time-averaged, station-specific predicted and observed concentrations are given by

$$\bar{P}_{s,\cdot} = \frac{1}{T} \sum_{t=1}^{T} P_{s,t} \;, \qquad \bar{O}_{s,\cdot} = \frac{1}{T} \sum_{t=1}^{T} O_{s,t}$$

The station-specific temporal correlation coefficient is given by the usual formula,

$$R_s(t) = \frac{\sum_{t=1}^{T} (P_{s,t} - \bar{P}_{s,\cdot})(O_{s,t} - \bar{O}_{s,\cdot})}{\sqrt{\sum_{t=1}^{T} (P_{s,t} - \bar{P}_{s,\cdot})^2} \, \sqrt{\sum_{t=1}^{T} (O_{s,t} - \bar{O}_{s,\cdot})^2}}$$

The spatial average of the temporal correlation coefficient is given by

$$\bar{R}(t) = \frac{1}{S} \sum_{s=1}^{S} R_s(t)$$

Under the assumption that the $R_s(t)$ are not related to each other (i.e., statistically independent) and have identical frequency distributions with mean $E[R_s(t)] = R(t)$, a 95-percent confidence interval for R(t) is given by

$$\tanh\left[ \frac{1}{S} \sum_{s=1}^{S} \tanh^{-1} R_s(t) - \frac{1.96}{\sqrt{S(T-3)}} \right] \le R(t) \le \tanh\left[ \frac{1}{S} \sum_{s=1}^{S} \tanh^{-1} R_s(t) + \frac{1.96}{\sqrt{S(T-3)}} \right]$$

The station-averaged, time-specific predicted and observed concentrations are given by

$$\bar{P}_{\cdot,t} = \frac{1}{S} \sum_{s=1}^{S} P_{s,t} \;, \qquad \bar{O}_{\cdot,t} = \frac{1}{S} \sum_{s=1}^{S} O_{s,t}$$

The time-specific spatial correlation coefficient is given by the usual formula,

$$R_t(s) = \frac{\sum_{s=1}^{S} (P_{s,t} - \bar{P}_{\cdot,t})(O_{s,t} - \bar{O}_{\cdot,t})}{\sqrt{\sum_{s=1}^{S} (P_{s,t} - \bar{P}_{\cdot,t})^2} \, \sqrt{\sum_{s=1}^{S} (O_{s,t} - \bar{O}_{\cdot,t})^2}}$$

The temporal average of the time-specific spatial correlation coefficient is given by

$$\bar{R}(s) = \frac{1}{T} \sum_{t=1}^{T} R_t(s)$$

Under the assumption that the $R_t(s)$ are independent and identically distributed with mean $E[R_t(s)] = R(s)$, a 95-percent confidence interval for R(s) can be formulated as

$$\tanh\left[ \frac{1}{T} \sum_{t=1}^{T} \tanh^{-1} R_t(s) - \frac{1.96}{\sqrt{T(S-3)}} \right] \le R(s) \le \tanh\left[ \frac{1}{T} \sum_{t=1}^{T} \tanh^{-1} R_t(s) + \frac{1.96}{\sqrt{T(S-3)}} \right]$$

CHI-SQUARE GOODNESS-OF-FIT TEST

The chi-square goodness-of-fit test can be used to assess whether the distribution of predicted concentrations deviates significantly from the distribution of observed concentrations, and to graphically depict such departure. Differences between the observed and predicted concentration distributions, such as different variances or shapes, may be identified from appropriate data displays. For more detailed information on this test, the reader is referred to the text by Hogg and Craig⁵.

The observed concentrations $OC_1, \ldots, OC_n$ are used to define cell boundaries for air-pollution concentration with an approximately equal number of observed concentrations in each cell. If possible, each cell should contain approximately ten observations. Let $E_j$ denote the number of observed concentrations in the jth cell, where there are J cells and $1 \le j \le J$, and let $P_j$ denote the number of predicted concentrations in the jth cell. If the model is accurate, the $E_j$ and the $P_j$ should be approximately equal. Under the assumption that the $E_j$ are the expected numbers of predicted concentrations in the jth cell, the statistic

$$\chi^2 = \sum_{j=1}^{J} \frac{(P_j - E_j)^2}{E_j}$$

is distributed as a chi-square random variable with J - 1 degrees of freedom. Larger values of the chi-square statistic than might reasonably be anticipated from a chi-square random variable are indicative of poor model fit. The significance of the chi-square statistic may be obtained from a chi-square distribution table.

It is usual practice to report the value of the chi-square statistic ($\chi^2$) as well as its significance (obtained from appropriate tables). However, both these values are very sensitive to sample size. We suggest also reporting $\chi^2/n$, which may be interpreted as the sum over the cells of the percentage deviation from the expected cell proportions.

INTERSTATION ERROR CORRELATION TEST

Any correlation between the errors in concentration prediction at two receptor stations is indicative of a systematic bias in the prediction model. The significance of these error correlations may be tested and confidence bounds placed on the error correlations.

A correlation coefficient that is approximately zero between the errors in concentration prediction at two receptor sites is a highly desirable situation. It indicates that no important condition "linking" the two receptor sites exists that might account for underprediction or overprediction.
If the correlation coefficient is high (near 1.0), it indicates that whenever the model overestimates concentrations at one receptor site it tends to do so at the other, and similarly for underestimation. The presence of a factor influencing both observed concentrations but unaccounted for by the model is suggested.

If the correlation coefficient is low (near -1.0), it indicates that whenever the model overestimates concentrations at one receptor site, it tends to underestimate concentrations at the other, and, conversely, whenever the model underestimates at one receptor site it tends to overestimate at the other. The presence of a factor affecting the two observed concentrations in opposite directions, but unaccounted for by the model, is suggested.

For two stations j and k, let the sets of observed and predicted concentrations be given by $(OC_{j1}, PC_{j1}), \ldots, (OC_{jn}, PC_{jn})$ and $(OC_{k1}, PC_{k1}), \ldots, (OC_{kn}, PC_{kn})$. The differences between the observed and predicted concentrations will be denoted

$$DC_{jt} = OC_{jt} - PC_{jt}$$

where j (or k) is the station identifier and t is the time identifier. Let r be the correlation coefficient between $(DC_{jt}: t = 1, \ldots, n)$ and $(DC_{kt}: t = 1, \ldots, n)$. Under the assumption that $DC_j$ and $DC_k$ follow
In this case, the prediction model tends to underestimate concen- trations at high temperatures and overestimate concentrations at low temperatures. The elimination of this linear trend through model 201 15- 10 I g -5 3 LU CC -10 -15 -20 TEMPERATURE- Figure G-4. Sample residual versus parameter plot. 40 ------- adjustment will improve the accuracy of the prediction. Multiple linear regression can be used to uncover these linear relationships when there are many air pollution/meteorological variables. To perform a multiple linear regression, we must have data consist- ing of a set of n paired measurements of observed concentrations, pre- dicted concentrations, and air-pollution/meteorological variables, (OCI,PCI,XII,...,XI ), ..., (OCn,PCn,Xnl,...,Xnp). The air-pollution/ meteorological variables can be interval, ordinal, or nominal 0-1 vari- ables. They may even include predicted and observed concentrations. The residual Z, is the difference between the k-th observed and predicted observation Zk = OCk - PCk (1 S k * n) To uncover linear relationships between the residual and the air-pollution/ meteorological variables, we fit the following model: AJ + ek (1 * k * n) 2 Here the e, are independent variables following a normal (0,o~ ) distribu- tion. The usual least-square procedures are used to estimate p, and 3 coefficients . The estimate of p, is the amount by which the model, on the average, underestimates or overestimates the observed concentration. The esti- mates of the p. are the slopes of the linear relationship of the residuals to the air-pollution/meteorological variables. The statistical significance of the estimates of the (3. may be determined by using the F-test. If the F-test is significant at the 5-percent level, then consideration should be given to using the regres- sion results to modify the model. 
The multiple regression should be performed in a stepwise fashion in order to isolate the most important subset of the air-pollution/meteorological variables. These variables can be identified as providing the largest increments to the proportion of explained variance in the residuals. Efforts at model modification should be focused on these variables.

ANALYSIS OF THE RESIDUAL TIME SERIES

Statistical techniques, as described by Box and Jenkins⁷ and Jenkins and Watts⁸, may be used to examine the residual time series (observed concentrations less predicted concentrations) to determine whether components other than white noise are present.

Due to the serial and cyclical time dependence of meteorological conditions and pollutant emissions, both the observed-concentration series and the predicted-concentration series will exhibit time dependencies. To the extent, however, that the predicted time series agrees with the observed concentrations, the residual time series will consist primarily of white noise. If there is a serial or cyclical component in the residual time series, then either the prediction model is failing to account for a time-dependent relationship or it is inducing an extraneous time-dependent relationship. In either event, a fault in the prediction model is indicated.

To perform the time-series analysis, we need data consisting of a set of n paired measurements of observed and predicted concentrations equally spaced in time: $(OC_1, PC_1), (OC_2, PC_2), \ldots, (OC_n, PC_n)$. These data are used to form the residual time series

$$r_t = OC_t - PC_t \qquad (t = 1, \ldots, n)$$

The residual time series is then investigated for serial and cyclical departures from white noise.

The presence of serial departure from white noise may be detected by using the autocorrelation function.
The autocorrelation function of the residual time series is given by

$$A(k) = \frac{\sum_{t=1}^{n-k} (r_t - \bar{r})(r_{t+k} - \bar{r})}{\sum_{t=1}^{n} (r_t - \bar{r})^2} \;, \qquad \text{where } \bar{r} = \frac{1}{n} \sum_{t=1}^{n} r_t$$

and k indexes the lag of the serial dependence (e.g., daily, weekly, bimonthly, monthly).

If the $r_t$ series is a white-noise process, then any A(k) is asymptotically normally distributed with mean zero and variance 1/n, and should lie inside the 95-percent confidence limits of $\pm 1.96/\sqrt{n}$. If an A(k) value falls outside these limits, the possibility of a time dependency should be closely investigated.

Figure D-5 illustrates the confidence bands for the autocorrelation function. Two autocorrelations (at lags of one and seven days) fall outside the confidence bands. These autocorrelations indicate that if the model overestimates on one day, it tends to underestimate the next, and that if the model overestimates on one day of the week, it tends to overestimate on the same day of the next week.

Finally, it should be noted that of every twenty or so autocorrelations tested, about one will fall outside the bands by mere chance. Therefore, each autocorrelation falling outside the bands needs to be more closely investigated (either analytically or by means of another time series) to establish that the observed result is not a chance artifact.

Figure D-5. Sample autocorrelation plot, with 95-percent limits.

The presence of cyclical deviation from white noise may be detected by using the normalized cumulative periodogram, given by

$$C(f_j) = \frac{\sum_{i=1}^{j} I(f_i)}{\sum_{t=1}^{n} (r_t - \bar{r})^2}$$

where

$$I(f_i) = \frac{2}{n} \left[ \left( \sum_{t=1}^{n} r_t \cos 2\pi f_i t \right)^2 + \left( \sum_{t=1}^{n} r_t \sin 2\pi f_i t \right)^2 \right]$$

is the periodogram of the residual time series and $f_i = i/n$ is the frequency.

For a white-noise series, the plot of $C(f_j)$ against $f_j$ would be close to a straight line joining the points (0,0) and (0.5,1). Ninety-five-percent confidence limits can be drawn at a distance of $\pm 1.96/\sqrt{n}$ above and below this straight line, as shown in Figure D-6. Note that $C(f_j)$
crosses the confidence limits. This indicates that there are cyclical deviations from white noise. The largest jumps in C(f.) occur in a pattern of about seven days and therefore the model should be inves- tigated to determine why it tends to overestimate and underestimate observed concentrations with this frequency. To use these two time-series analysis techniques, we must have a data set consisting of equally spaced readings in time. These readings could be hourly, daily, weekly, monthly, etc. It is recommended that at least 50 consecutive readings be included in each autocorrelation and < tn o o D O tr UJ Q. UJ 1 O o UJ N cc o z ill D i.o 0.5 -— • •PERIOD 1/fj .14 .25 FREQUENCY- .5 Figure D-6. Sample plot of cumulative periodogram. ------- periodogram calculation. It is not necessary that all values be consecu- tive since the computational formulas may be adjusted for occasional missing values. RUNS TEST The Wald-Wolfowitz runs test is a method for determining whether the prediction model overestimates or underestimates the observed concen- tration for long periods of time. Consider a plot of the amount of over- or under-prediction of S09 concentrations, as shown in Figure D-7. The dashed line represents the median overprediction (nine observations are below the median and nine above). Adjusting for the median overprediction, we obtain Figure D-8. Note that the plotted points tend to stay above or below the median for extended periods of time. This indicates that there is a nonrandom com- ponent of prediction error present (we have already subtracted out the constant overprediction component). If only random noise were present, the plotted errors in prediction would fluctuate more erratically around the median. The presence of only four runs (sequence above or below the median of one or more points) indicates the presence of a nonrandom com- ponent . 10 pprn 8 ppm- 6 ppm- 4 ppm 2 ppnr 0 ppm i —2 ppm —4 ppm —6 pprr>' TIME •• • • Figure D-7. 
Sample plot of residuals versus time. 46 ------- 8 ppm 6 ppm 4 ppm 2 ppm MEDIAN -2 ppm —4 ppm- —6 ppm. —8 ppm - RUN 2 RUN 4 *• TIME- RUN 1 RUN 3 Figure D-8. Adjusted sample plot of residuals versus time. If the Wald-Wolfowitz runs test is used, the observed and predicted concentration readings should be consecutively, and approximately equally, spaced in time. We can conclude that there is a nonrandom component in the sequence if there are too few runs. The minimum allowable number of runs over n time periods is n + 2 1.65 Fewer runs than this indicates, with 95-percent confidence, the presence of a nonrandom component of error. This formula is fairly accurate for over 40 data points. Otherwise, appropriate tables must be used. POST-AGREEMENT AND PREFIGURANCE TABLES The post-agreement and prefigurance tables are two methods that are useful in assessing the accuracy of a forecast model.10 The post- agreement table (Table D-l) allows an assessment of the extent to which subsequent observations confirm the prediction of a certain pollution level. The prefigurance table (Table D-2) allows an assessment of the extent to which the forecasts give advance notice of a certain pollution level. 47 ------- TABLE D-l. POST-AGREEMENT Observed Level Heavy Moderate Light Total Forecast Pollution Level (Percent) Heavy 57.1 19.0 23.9 100 Moderate 21.1 53.3 25.6 100 Light 2.5 10.0 87.5 100 TABLE D-2. PREFIGURANCE Observed Level Heavy Moderate Light Forecast Pollution Level (Percent) Heavy 54.5 14.3 10.0 Moderate 40.9 71.4 20.0 Light 4.6 14.3 70.0 Total 100 100 100 In Tables D-l and D-2, three levels (heavy, moderate, and light) have been used. In general, the number of levels and their specification in terms of pollutant concentrations are determined by the user. The post-agreement table will be important in deciding whether a forecast model is accurate enough to call an alert when heavy pollution is forecast. 
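Tables of this kind can be computed directly from paired forecast/observed level sequences. The following Python sketch (the level names and sample data are hypothetical, not taken from the report) builds both conditional-percentage tables:

```python
from collections import Counter

LEVELS = ["heavy", "moderate", "light"]  # hypothetical level names

def _percent_table(pairs, condition_on):
    # pairs: list of (forecast, observed) tuples.
    # condition_on = "forecast": post-agreement (Table D-1 style) --
    #   of the cases with a given forecast level, percent at each observed level.
    # condition_on = "observed": prefigurance (Table D-2 style) --
    #   of the cases with a given observed level, percent at each forecast level.
    counts = Counter(pairs)
    table = {}
    for a in LEVELS:
        if condition_on == "forecast":
            row = {b: counts[(a, b)] for b in LEVELS}
        else:
            row = {b: counts[(b, a)] for b in LEVELS}
        total = sum(row.values())
        table[a] = {b: (100.0 * c / total if total else 0.0) for b, c in row.items()}
    return table

def post_agreement(forecast, observed):
    return _percent_table(list(zip(forecast, observed)), "forecast")

def prefigurance(forecast, observed):
    return _percent_table(list(zip(forecast, observed)), "observed")

# Illustrative usage
fcst = ["heavy", "heavy", "light", "moderate"]
obsv = ["heavy", "light", "light", "moderate"]
pa = post_agreement(fcst, obsv)   # e.g., pa["heavy"]["heavy"] is the percent
pf = prefigurance(fcst, obsv)     # of heavy forecasts verified as heavy
```

Each post-agreement row (conditioned on a forecast level) and each prefigurance row (conditioned on an observed level) sums to 100 percent, mirroring the Total entries of Tables D-1 and D-2.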
The prefigurance table will be important in deciding whether a forecast model is correctly forecasting a large enough percentage of heavy-pollution days.

QUASI-LOGISTIC MODELING OF THE CORRELATION COEFFICIENT

The Pearson correlation coefficient between observed and predicted concentrations may be computed for many different subsets of the data (such as days with a particular mixing height, wind speed, ambient air temperature, etc.). Some subsets will be associated with high correlation coefficients and others with low correlation coefficients. Quasi-logistic modeling of the correlation coefficient is a method of assessing which of the subset-defining variables (mixing height, wind speed, etc.) are contributing to or reducing the correlation coefficient. That is, quasi-logistic modeling allows a determination of the conditions under which the observed and predicted concentrations are well or poorly correlated.

Suppose we are interested in the effect of m subset-defining variables on the correlation coefficient. These "independent" variables will be denoted X_1, ..., X_m. In the spirit of the logistic model, we model the true correlation coefficient as

    ρ = [ exp(A_0 + \sum_{i=1}^{m} A_i X_i) - 1 ] / [ exp(A_0 + \sum_{i=1}^{m} A_i X_i) + 1 ]

Here A_0, A_1, ..., A_m are coefficients. Of course, we have not yet specified how to estimate the coefficients or how to assess whether the model fits the correlation coefficient. But we should note some features of this model. If A_0 is large in comparison with the other A_i (even when the latter are multiplied by the X_i), then the true correlation coefficient is essentially constant over all subsets of the data. On the other hand, whenever any A_i X_i becomes large and positive, the correlation coefficient is especially large due to the contribution of the i-th subset-defining variable.

For any particular subset of the data, we can compute the Pearson correlation coefficient r.
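A numerical sketch of the model may help here (the coefficient values below are purely illustrative). Note that the transformation Z = ln[(1 + ρ)/(1 − ρ)] used in the estimation step recovers the linear combination A_0 + Σ A_i X_i exactly:

```python
import math

def model_rho(x, a0, a):
    # Quasi-logistic model: rho = (e^u - 1)/(e^u + 1), u = A0 + sum_i Ai*Xi
    u = a0 + sum(ai * xi for ai, xi in zip(a, x))
    return (math.exp(u) - 1.0) / (math.exp(u) + 1.0)

def z_transform(r):
    # Z = ln((1 + r)/(1 - r)); approximately normal with variance 4/(n - 3)
    return math.log((1.0 + r) / (1.0 - r))

# Illustrative coefficients: A0 = 0.5, A1 = 0.2, evaluated at X1 = 1.0
rho = model_rho([1.0], 0.5, [0.2])
z   = z_transform(rho)  # recovers u = 0.5 + 0.2 * 1.0 = 0.7
```

Because Z of the modeled ρ is linear in the X_i, the coefficients can be estimated by the weighted least-squares regression described in this section; and a large A_0 relative to the A_i X_i leaves ρ nearly constant across subsets, as noted above.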
Under the usual bivariate normality assumptions for predicted and observed concentrations, the quantity

    Z = ln [ (1 + r) / (1 - r) ]

is approximately normally distributed with variance equal to 4/(n - 3). Here n refers to the number of data points in the subset used to calculate r. The mean of Z is

    ln [ (1 + ρ) / (1 - ρ) ]

which under our model for ρ reduces to

    A_0 + \sum_{i=1}^{m} A_i X_i

A weighted least-squares regression of Z on the X_i may be used to estimate the coefficients A_0, A_1, ..., A_m. Many packaged computer routines, such as SPSS, exist for performing this regression.

The overall fit of the model may be assessed through the use of the multiple R^2 statistic and plots of the residuals (calculated Z values less estimated Z values). The multiple R^2 statistic (and its associated F value) indicates the percentage reduction in the variance of Z accounted for by the regression. The residuals pinpoint those data subsets for which the model is particularly ill-fitting, perhaps indicating the presence of "interactions" among variables.

The partial R^2 statistics (and their associated F values) may be used to single out the regression coefficients that are particularly effective in accounting for the variability in the correlation coefficient. This may allow a simplification of the model by reducing the number of coefficients.

TECHNICAL REPORT DATA
1. REPORT NO.: EPA-600/4-80-013b
4. TITLE AND SUBTITLE: Evaluation of the Real-Time Air-Quality Model Using the RAPS Data Base. Volume 2. Statistical Procedures
5. REPORT DATE: February 1980
7. AUTHOR(S): Harold S. Javitz and Ronald E. Ruff
8. PERFORMING ORGANIZATION REPORT NO.: Final Report, SRI Project 6868
9. PERFORMING ORGANIZATION NAME AND ADDRESS: SRI International, 333 Ravenswood Avenue, Menlo Park, California 94025
10. PROGRAM ELEMENT NO.: 1AA603 A-26 (FY-77)
11. CONTRACT/GRANT NO.: 68-02-2770
12. SPONSORING AGENCY NAME AND ADDRESS: Environmental Sciences Research Laboratory, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, North Carolina 27711
13. TYPE OF REPORT AND PERIOD COVERED: Final, 8/77-4/79
14. SPONSORING AGENCY CODE: EPA/600/09
16. ABSTRACT: The theory and programming of statistical tests for evaluating the Real-Time Air-Quality Model (RAM) using the Regional Air Pollution Study (RAPS) data base are fully documented in four volumes. Moreover, the tests are generally applicable to other model evaluation problems. Volume 2 presents the tests considered for evaluating air-quality simulation models in general and those that would be useful in evaluating the RAM. The capability of the RAM to predict sulfur dioxide (SO2) concentrations was of particular interest. Specific tests for both intermediate and final evaluations are recommended, with accompanying descriptions of formats, plots, and procedures to establish confidence limits. Discussion focuses on the mathematics, procedures, and interpretation of the individual tests; their relevance to the project objectives; and possible trade-offs among tests.
17. KEY WORDS AND DOCUMENT ANALYSIS: Air pollution; Mathematical models; Evaluation; Tests; Computer systems programs; Statistical tests. Identifiers: Real-Time Air-Quality Model; Regional Air Pollution Study Data Base. COSATI Field/Group: 13B, 12A, 14B, 09B
18. DISTRIBUTION STATEMENT: Release to public
19. SECURITY CLASS (This Report): Unclassified
20. SECURITY CLASS (This Page): Unclassified
21. NO. OF PAGES: 57
EPA Form 2220-1 (9-73)