EPA
            United States
            Environmental Protection
            Agency
            Environmental Sciences
            Research Laboratory
            Research Triangle Park NC 27711
EPA-600/4-80-013b
February 1980
            Research and Development
Evaluation of the
Real-Time Air-Quality
Model Using the RAPS
Data Base

Volume 2.
Statistical Procedures

-------
                RESEARCH REPORTING SERIES

Research reports of the Office of Research and Development, U.S. Environmental
Protection Agency, have been grouped into nine series.  These nine broad cate-
gories were established to facilitate further development and application of en-
vironmental technology.  Elimination of traditional grouping was consciously
planned to foster technology transfer and a maximum interface in related fields.
The nine series are:

      1   Environmental  Health  Effects Research
      2   Environmental  Protection Technology
      3   Ecological Research
      4   Environmental  Monitoring
      5   Socioeconomic Environmental Studies
      6   Scientific and Technical Assessment Reports (STAR)
      7   Interagency Energy-Environment Research and  Development
      8   "Special" Reports
      9   Miscellaneous Reports

This  report has been assigned to the ENVIRONMENTAL MONITORING series.
This  series describes research conducted to develop new  or improved methods
and  instrumentation for the identification and quantification of  environmental
pollutants at the lowest conceivably significant concentrations.  It also includes
studies to determine the ambient concentrations of pollutants in the environment
and/or the variance of pollutants as a function of time or meteorological factors.
This document is available to the public through the National Technical Informa-
tion Service, Springfield, Virginia 22161.

-------
                                               EPA-600/4-80-013b
                                               February  1980
       EVALUATION OF THE REAL-TIME
             AIR-QUALITY MODEL
         USING THE RAPS DATA BASE

         Volume 2.  Statistical Procedures
                       by
          Harold S. Javitz and Ronald E. Ruff
             Atmospheric Science Center
                 SRI International
            Menlo Park, California 94025
              Contract No. 68-02-2770
                  Project Officer
                  John S. Irwin
         Meteorology and Assessment Division
       Environmental Sciences Research Laboratory
      Research Triangle Park, North Carolina 27711
ENVIRONMENTAL SCIENCES RESEARCH LABORATORY
     OFFICE OF RESEARCH AND DEVELOPMENT
    U.S. ENVIRONMENTAL PROTECTION AGENCY
RESEARCH TRIANGLE PARK, NORTH CAROLINA 27711

-------
                                 DISCLAIMER
     This report has been reviewed by the Environmental Sciences Research
Laboratory, U.S. Environmental Protection Agency,  and approved for publica-
tion.  Approval does not signify that the contents necessarily reflect the
views and policies of the U.S. Environmental Protection Agency, nor does men-
tion of trade names or commercial products constitute endorsement or recom-
mendation for use.
                                      ii

-------
                               ABSTRACT

     The theory and programming of statistical tests for evaluating the
Real-Time Air-Quality Model (RAM) using the Regional Air Pollution Study
(RAPS) data base are fully documented in four volumes.  Moreover, the
tests are generally applicable to other model evaluation problems.  Vol-
ume 2 presents the tests considered for evaluating air-quality simulation
models in general and those that would be useful in evaluating the RAM.
The capability of the RAM to predict sulfur dioxide (SO2) concentrations
was of particular interest.  Specific tests for both intermediate and
final evaluations are recommended, with accompanying descriptions of
formats, plots, and procedures to establish confidence limits.  Discus-
sion focuses on the mathematics, procedures, and interpretation of the
individual tests; their relevance to the project objectives; and possible
trade-offs among tests.  This report was submitted in fulfillment of
Contract No. 68-02-2770 by SRI International under the sponsorship of
the U.S. Environmental Protection Agency.  This report covers a period
from 1 October 1977 to 1 April 1979, and work was completed as of April
1979.
                                  iii

-------
                            CONTENTS
Abstract ........................... iii
Figures  ...........................  vi
Tables ............................  vi

   1.  Overview  ........................ 1
            Introduction .................... 1
            Intermediate and final evaluation
              in model verification  .............. 1
   2.  Evaluation Statistics .................. 3
            General  ...................... 3
            The steps involved in selecting
              intermediate-evaluation statistics ........ 3
            The steps involved in selecting
              final-evaluation statistics  ........... 4
            The degree of interchangeability
              of intermediate- and final-evaluation
              statistics .................... 6
            Final-evaluation statistics  ............ 7
            Intermediate-evaluation statistics ........  10
References ..........................  14
Appendices

   A.  Common notation and conventions ..............  15
   B.  Final-evaluation statistics ..............  17
   C.  Confidence limits ...................  28
   D.  Intermediate-evaluation statistics  ..........  30

-------
                                FIGURES

Number                                                             Page
  B-1  Geographical map ...........................   25
  B-2  Translated vectors .........................   26
  B-3  Average of vectors .........................   26
  B-4  Residual vectors ...........................   26
  B-5  Tolerance ellipse ..........................   26
  D-1  Scattergram, regression line, and tolerance bands .   30
  D-2  Visual assessment of correlation ...........   33
  D-3  Nonlinear scatterplot relationships ........   33
  D-4  Sample residual versus parameter plot ......   40
  D-5  Sample autocorrelation plot ................   44
  D-6  Sample plot of cumulative periodogram ......   45
  D-7  Sample plot of residuals versus time .......   46
  D-8  Adjusted sample plot of residuals versus time   47
                                TABLES
Number                                                             Page
  D-1  Post-Agreement .............................   48
  D-2  Prefigurance ...............................   48

-------
                               SECTION 1
                                OVERVIEW

INTRODUCTION
     Two types of model verification are distinguished:  (1) intermediate
evaluation for model modification and (2) final evaluation for model com-
parison.  These two types of evaluation are performed via evaluation sta-
tistics, and the steps required in the selection of appropriate statistics
are discussed.  For intermediate evaluation, these steps are fairly
straightforward.  For final evaluation, the investigator must consider the
relative seriousness of different types of prediction errors and must
specify a "loss function."  Because of the uncertainty concerning the
impact of a misprediction, these are difficult tasks.  The particular
statistics we have chosen for intermediate and final evaluation are
explained briefly in the text and more thoroughly in the appendices.  In
particular, considerable attention is devoted to the accuracy score--a
general-purpose final-evaluation statistic.

INTERMEDIATE AND FINAL EVALUATION IN MODEL VERIFICATION
     Verification of an air-pollution model is the process of comparing
predicted and observed readings to evaluate model performance.  The
evaluation may be used for (1) modifying the air-pollution model or
(2) quantifying its relative or absolute accuracy.  In the first case
(which we shall term "intermediate" evaluation), our test might show that
the model tends to overpredict pollution concentrations at times when low
wind speeds prevail.  We may, therefore, choose to modify the model to
compensate for the overprediction.  In the second case (which we shall
term  "final" evaluation), we may demonstrate that the average error in
prediction of a particular model is only 18 percent of the observed con-
centration, and that this is a lower average error than any competing
model.

-------
     Intermediate evaluation is often the basis for model modification
and precedes the final evaluation of the "best" model.  The terminology
assumes there are several models (or versions of the same model).   The
"best" model is that one which most accurately represents the physical
processes occurring in the atmosphere as determined by comparison of the
final evaluation statistic for each model.  Thus, final evaluation is
the basis for model comparison and precedes the selection of the best
model.
     During the intermediate evaluation phase, we are generally "fine
tuning" a model after analyzing results from our statistical tests applied
to a test data base.  Hence, we run the risk of our results being unique
to the test data base.  To ensure that we are in fact improving a model,
we must expand our evaluation using other data bases.  (Alternatively,
we may use a subset of a data base for intermediate evaluation and a
separate, statistically independent, subset for final evaluation.)
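As a concrete illustration of the second approach, the partition of the data base into separate subsets might be sketched as follows (the function and data names are ours, not part of this report; a chronological split is one simple way to keep the intermediate- and final-evaluation subsets distinct):

```python
def split_data_base(records, frac=0.5):
    # Partition chronologically ordered (observed, predicted) pairs into
    # an earlier subset used for intermediate evaluation and a later,
    # separate subset reserved for final evaluation.
    cut = int(len(records) * frac)
    return records[:cut], records[cut:]

# Hypothetical hourly prediction-observation pairs.
records = list(zip(range(10), range(1, 11)))
intermediate, final = split_data_base(records)
```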

-------
                               SECTION 2
                         EVALUATION STATISTICS

GENERAL
     Evaluation statistics are indices or scores computed from the data
base of observed and predicted concentrations.   These statistics consti-
tute the method by which the evaluation is performed.  For example, an
intermediate evaluation of a model may involve the computation of many
correlation coefficients between observed and predicted concentrations
under different meteorological conditions.  Certain types of statistics
are better suited to intermediate evaluations and others are better
suited to final evaluations.  (In addition, graphical displays and
accompanying subjective judgments may be appropriate for intermediate
evaluations.)  There are no hard and fast lines, however, and a statistic
may be applied to both types of evaluations.

THE STEPS INVOLVED IN SELECTING
INTERMEDIATE-EVALUATION STATISTICS
     Different steps are involved in selecting the appropriate statistic
to accomplish an intermediate or a final evaluation.
     For an intermediate evaluation, we first specify an aspect of the
agreement (or disagreement) between the predicted and observed concen-
trations we intend to investigate.  For example, we may be interested in
investigating whether high (low) predicted concentrations are associated
with high (low) observed concentrations.  Or we may be interested in
investigating cyclical patterns of overprediction and underprediction.
Secondly, we translate the aspect of the agreement we wish to investigate
into a model for agreement.  Such a model for agreement might be that
observed and predicted concentrations are random variables following a
bivariate normal distribution with an unknown correlation coefficient.
Alternatively, we might postulate that the error in prediction is composed
of a sinusoidal component and a white-noise component.  Thirdly, we

-------
specify a statistic that is optimal in some sense for estimating or test-
ing the postulated model or some aspect thereof.  For example, Pearson's
sample correlation coefficient is often used to estimate the true corre-
lation coefficient.  Or the normalized cumulative periodogram may be used
to test for the presence of cyclical trends in prediction error.  (Both
are described in more detail later in this Section.)
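For reference, Pearson's sample correlation coefficient can be computed directly from its textbook definition (an illustrative sketch; the function and variable names are ours):

```python
def pearson_r(oc, pc):
    # Pearson's sample correlation coefficient between observed (oc)
    # and predicted (pc) concentrations.
    n = len(oc)
    mo = sum(oc) / n
    mp = sum(pc) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(oc, pc))
    so = sum((o - mo) ** 2 for o in oc) ** 0.5
    sp = sum((p - mp) ** 2 for p in pc) ** 0.5
    return cov / (so * sp)
```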

THE STEPS INVOLVED IN SELECTING FINAL-EVALUATION STATISTICS
The Relative Seriousness of Errors in Prediction
     For the final evaluation, the first step is determining the relative
seriousness (or "loss") associated with different types of errors in
air pollution prediction.  Consideration should be given to the following
types of questions:
     •  Is an underprediction more or less serious than an overprediction?
     •  Is the assessment of the seriousness of an underprediction or an
        overprediction also dependent upon the magnitude of the observed
        concentration?
     •  Can misprediction errors be categorized according to quantifiable
        losses (such as the costs incurred in calling erroneous air pol-
        lution alerts)?
     •  How serious is an error in predicting the location of the maxi-
        mum air pollution?
     •  Is a string of overpredictions that is followed by a string of
        underpredictions more or less serious than a more interspersed
        pattern?

Specifying a Loss Function
     The end result of these considerations is the specification of a
loss function.  In its most general form, the loss function quantitatively
determines, for any set of observed and predicted concentrations, the loss
associated with any errors in prediction.  The data base to which the loss
function is applied consists of the observed and predicted concentrations
at one or more spatial locations and at one or more time periods.  The
data base may also include meteorological and other types of data that
are relevant to the computation of loss.  Consequently, the loss function
may be quite complex.  The evaluation statistic associated with an

-------
air-pollution model and a data set is simply the loss associated with the
data set.  For example, a relatively simple evaluation statistic would be
the average-squared error between observed and predicted concentrations
over n time periods at a single location or group of locations.  Mathe-
matically, this statistic may be written

                                 n
                         (1/n)  Σ  (OC_i - PC_i)²
                                i=1
where OC_i is the observed concentration at the ith time period and PC_i
is the corresponding predicted concentration.*  Using this statistic
implies that the investigator considers underprediction and overpredic-
tion to be equally serious, and the assessment of the seriousness of an
error in prediction to be independent of the magnitude of the observed
concentration, the existence of any threshold, the location of the maxi-
mum predicted or observed concentration, or any temporal arrangement of
the underpredictions and overpredictions.
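A minimal sketch of this statistic (the function name is ours; OC_i and PC_i are supplied as paired lists of concentrations):

```python
def average_squared_error(oc, pc):
    # Average-squared-error evaluation statistic over n time periods:
    # (1/n) * sum of (OC_i - PC_i)**2.
    n = len(oc)
    return sum((o - p) ** 2 for o, p in zip(oc, pc)) / n
```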

The Difficulty in Specifying a Loss Function
     Specifying a final-evaluation statistic is equivalent to specifying
the loss function that quantifies the losses associated with every type
of error in prediction.  While generally acceptable final-evaluation
statistics (such as mean square error) have been used in the past, the
associated loss function and its implications have not been uniformly
considered.  This omission appears to be linked to the difficulty in
quantifying the impacts of misprediction.
     As an evaluation tool in the model-development phase, the loss
function must be objective so that we can evaluate changes to a specific
model or make comparisons among competing models.  However, when a model
is applied in an operational mode, many subjectively derived factors may
*
 Refer to Appendix A for a complete description of the notation used
 throughout this text.

-------
figure in the evaluation process.  Difficulties are encountered both in
the quantification of the utility or lack thereof to each user of a pre-
diction and in the aggregation of these over many users.  Consider the
following examples:

     •  A pollution alert may convince an individual to refrain from any
        vigorous physical activity for a day or longer, with the indi-
        vidual incurring both an opportunity loss and a possible economic
        loss, balanced against a projected medical gain.
     •  A pollution alert might result in the voluntary or mandatory
        shutdown of local factories.

     •  The failure to call an alert might result in an unusually large
        number of emphysema attacks or deaths.
     •  Decisions to allow or forbid the construction of factories might
        be partially based on prediction forecasts.

In these examples, the impacts of misprediction can be economic, medical,
and political.  Such impacts are difficult to quantify and aggregate.

     Nevertheless, specifying the relative seriousness of different types
of errors in air pollution prediction is an absolutely vital step in
final evaluation.  In the absence of a prior study on relative losses,
we advocate the specification of a reasonable loss function, obtained
through a process for generating a consensus opinion.*
THE DEGREE OF INTERCHANGEABILITY OF INTERMEDIATE-
AND FINAL-EVALUATION STATISTICS

     As mentioned earlier, although a particular statistic may be par-
ticularly well suited for intermediate or final evaluation, there is no
logical prohibition against using it for the other evaluation purpose.
*
 An "alternative" to the specification of a single final-evaluation sta-
 tistic is the specification of a battery of statistics, some of which
 are appropriate to final evaluation and some to intermediate evaluation.
 Investigators use the battery of statistics to gain an appreciation of
 the strengths and weaknesses of the model.  They then subjectively
 weight the relative importance of these strengths and weaknesses in
 order to compare models.  This type of final evaluation may be useful,
 but it is less rigorous, scientific, objective, or clearly specified
 than the type of final evaluation we have advocated.

-------
The average-squared-error statistic just discussed is an example of a
statistic that may be profitably used in an intermediate evaluation.
In general, we find that the final-evaluation statistics associated with
simpler loss functions are also useful for intermediate evaluation; how-
ever, statistics appropriate to intermediate evaluation appear in general
to be rather poorly suited to final evaluation.
     A particularly glaring example of this phenomenon is Pearson's corre-
lation coefficient (which, unfortunately, is often used in final evalua-
tions).  When performing a final evaluation, it is reasonable to implic-
itly associate a loss of zero with a Pearson's correlation coefficient of
+1 and a large loss with a Pearson's correlation coefficient of -1.
This loss function necessitates some very strange considerations when
final evaluations are chosen.  For example, if Model A predicts the
observed concentrations very accurately and Model B always predicts ten
times the concentrations of Model A, then both models have the same
Pearson's correlation coefficient.  Again, if Model C always predicts
50 ppb more pollution than Model A, then both these models have the same
Pearson's correlation coefficient.  The investigator would presumably
find all these models equally acceptable.  The near-irrationality of
these acceptability conditions highlights the unsuitability of using
Pearson's correlation coefficient as a single final-evaluation statistic.*

FINAL-EVALUATION STATISTICS
The Accuracy Score--A General Final-Evaluation Statistic
     We have formalized a number of reasonable loss functions and their
associated statistics, which we have termed accuracy scores.  Accuracy
*
 If, in the intermediate-evaluation phase, all prediction models are
 adjusted so that the average predicted concentration is the same as the
 average observed concentration, and the variance of the predicted con-
 centration is the same as the variance of the observed concentration,
 then Pearson's correlation coefficient is mathematically equivalent to
 mean-square error.  In this event, either Pearson's correlation coef-
 ficient or mean-square error may be used as a final-evaluation statis-
 tic, with equivalent results.

-------
scores are based on a particularly simple, but quite general, type of
loss function.  These are the loss functions that are additive over time
periods.  A loss function is additive over time periods if (1) a loss
can be computed due to the error in prediction at one particular time
period and (2) the total loss due to error in prediction over a set of
time periods is the arithmetic sum of the losses in each time period in
the set.  The accuracy score is the average loss over the set of time
periods.  An example of an accuracy score is average-squared loss, which
has already been discussed.  Eight different accuracy scores are described
in the first section of Appendix B.  Seven of these are concerned with
the losses due to errors in prediction at a particular location.  If
data are available for many locations, one may base the final evaluation
on the average loss over the total number of prediction-observation
pairs.*  The eighth accuracy score is concerned with losses due to an
error in predicting the location of the maximum observed concentration.
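Because any additive loss function yields an accuracy score in the same way, the computation can be sketched generically (an illustration only; the function names and the asymmetric loss shown are our own examples, not among the eight scores of Appendix B):

```python
def accuracy_score(oc, pc, loss):
    # Accuracy score: the average per-period loss over all
    # prediction-observation pairs (an additive loss function).
    losses = [loss(o, p) for o, p in zip(oc, pc)]
    return sum(losses) / len(losses)

def squared_loss(o, p):
    # Loss yielding the average-squared-error statistic.
    return (o - p) ** 2

def asymmetric_loss(o, p):
    # Hypothetical loss that penalizes underprediction (p < o)
    # twice as heavily as overprediction.
    return 2 * (o - p) if p < o else p - o
```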
     More complicated accuracy scores are also possible.  For example,
one might specify an accuracy score based on the loss for a particular
time period while simultaneously considering both the error in the pre-
diction of the maximum concentration and the error in the prediction of
the location of the maximum concentration.  In addition, one might also
base final evaluation on a statistic other than an accuracy score (for
example, a statistic based on a nonadditive loss function†) if this
better reflected the concept of relative losses.
*
 Alternatively, one could base the final evaluation on the largest value
 of the accuracy score when a particular accuracy score is computed over
 many locations.  This final-evaluation statistic should not be used
 without careful thought.  For example, it implies that if Model A pre-
 dicts perfectly accurately at Location 1 and poorly at Location 2, and
 Model B predicts, at Locations 1 and 2, as poorly as Model A does at
 Location 2, then Models A and B are equally desirable.
†
 A nonadditive loss function might be appropriate in certain situations.
 For example, suppose that an alert is called only after three consecu-
 tive hourly predictions of excessive pollution for the next 24-hour
 period.  In this case, very little loss would be associated with over-
 prediction for a single time period, since it would take three consecu-
 tive overpredictions to trigger a false alert.

-------
     To supplement the accuracy score, we recommend two features that
enhance its meaning and interpretation.  First, confidence limits can be
computed so that its statistical significance can be assessed.   (Refer
to Appendix C for a description of the general procedure.)  As a second
feature, the accuracy score for each observation point (monitoring loca-
tion) can be plotted on a geographical map of the area.  The resulting
display can then be analyzed for geographical trends.  For instance, we
could determine if the higher accuracy scores corresponded to urban
areas or rural areas, or if the highest scores were associated with one
geographical sector (e.g., northwest).  Thus, such a display could serve
to enhance the accuracy score's overall utility.

Other Final-Evaluation Statistics
     Prior to arriving at our recommendation of the accuracy score for
evaluating the final model, we considered several other tests.  Each has
its own advantages.  These other tests were considered in detail but not
included in our final package because they repeated, to a large extent,
the accuracy-score statistics.  The three tests (skill score, P-score,
and maximum location tolerance intervals) are described below and in
Appendix B.

Skill Score—
     The skill score, discussed in Appendix B (see p. 21), is a statis-
tic that can be used either as an intermediate- or as a final-evaluation
statistic.  The skill score is an index of model performance, which may
be adjusted to reflect how well a baseline model (such as chance or per-
sistence) would have performed.  This statistic is substantially the
same as the accuracy score, with a user-supplied loss matrix (see the
write-up of the accuracy score in Appendix B) with zeros on the diagonal
and a loss of 1 elsewhere.
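The loss-matrix formulation can be sketched as follows (an illustration; the category breakpoints are hypothetical, and with the 0/1 matrix the score reduces to the fraction of miscategorized predictions):

```python
def categorize(c, breakpoints):
    # Assign a concentration to a category index using breakpoints
    # (e.g., breakpoints [50, 100] define three categories).
    for i, b in enumerate(breakpoints):
        if c < b:
            return i
    return len(breakpoints)

def matrix_loss_score(oc, pc, breakpoints, loss_matrix):
    # Average loss over all pairs, using a user-supplied loss matrix over
    # concentration categories (zeros on the diagonal: no loss when the
    # predicted category matches the observed one).
    total = 0.0
    for o, p in zip(oc, pc):
        total += loss_matrix[categorize(o, breakpoints)][categorize(p, breakpoints)]
    return total / len(oc)

# 0/1 loss matrix for three categories: zeros on the diagonal, 1 elsewhere.
zero_one = [[0, 1, 1],
            [1, 0, 1],
            [1, 1, 0]]
```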

P-Score—
     The P-score statistic, as described in Appendix B (see p. 22),  is
useful in evaluating models that assign probabilities to various levels

-------
of concentration.  This type of prediction is not common, however, and
the accuracy score could be adjusted to accommodate this type of model.

Maximum Concentration Tolerance Intervals--
     The purpose of this test is to evaluate a model's ability to predict
the location of the maximum (observed) concentration.  (See p. 23 of
Appendix B).  This test computes the mean distance between the "observed"
and predicted locations of maximum concentration and a spatial tolerance
interval (which would basically be defined by an ellipsoid).   The draw-
back with this test is that the monitoring network must be of sufficient
spatial density that the observed concentration is known everywhere in
the region of interest.
     Realistically, we must expand our number of observation locations
through interpolation techniques.  If the monitoring network is not
dense enough to support an accurate interpolation, the test is neither
objective nor statistically sound.  For the RAPS network, monitoring
stations are separated by about 5 km, and the ability to interpolate
air-quality concentrations between observational stations is largely
unknown.  Therefore, we have not recommended the inclusion of this test
in our statistical package.  Instead, we have incorporated a test with
a similar objective into our accuracy score, using a maximum concentra-
tion location loss function.  However, when  (or if) it is proven that
accurate interpolation between monitoring sites can be achieved, inclu-
sion of the maximum concentration tolerance intervals should be recon-
sidered.

INTERMEDIATE-EVALUATION STATISTICS
     In addition to the recommended final-evaluation statistical tests
contained in the accuracy score, we are recommending several tests that
are useful in diagnosing problem areas within models.  Results from
these "intermediate-evaluation" tests serve to identify components of
the model most in need of improvement.  In some cases, these tests can
serve as final-evaluation criteria as well.  The following tests and
displays are recommended for the above type of intermediate evaluation:
                                   10

-------
     •  Scatterplot
     •  Linear regression with confidence and prediction bands
     •  Pearson's correlation coefficient
     •  Chi-square test and frequency distribution displays
     •  Interstation error correlation
     •  Multiple regression of error residuals
     •  Analysis of residual time series.
The above and other tests considered in our study are discussed below and
in Appendix D.

Discussion of the Tests Selected
     Perhaps the most basic and most useful intermediate-evaluation
technique is the graphical display of the data in a scatterplot.  The
evaluation of the scatterplot of predicted concentration (on the vertical
axis) versus the observed concentration (on the horizontal axis) is aided
by calculating the linear regression line and the sample variance around
this line.  The linear regression line may be used directly to modify the
model, so that the sample variance around the line becomes the new root-
mean-squared error.   These techniques are discussed in Appendix D
(see p. 29).  This section also discusses the confidence bands that can
be placed around the linear regression line and the prediction bands for
additional observations.  The width of the first of these bands indicates
how well the position of the "true" underlying linear regression is known
from the sample (if indeed the relationship is linear),  while the width
of the second band indicates how far from the linear regression a new
observation could be expected to fall.
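The fitted line and the sample variance around it might be computed as follows (a sketch from the standard least-squares formulas; the function name is ours, and the confidence and prediction bands, which additionally require a t-quantile, are omitted):

```python
def linear_regression(oc, pc):
    # Least-squares line pc ≈ a + b * oc, with the sample variance of
    # the residuals around the fitted line (divisor n - 2).
    n = len(oc)
    mo = sum(oc) / n
    mp = sum(pc) / n
    sxx = sum((o - mo) ** 2 for o in oc)
    sxy = sum((o - mo) * (p - mp) for o, p in zip(oc, pc))
    b = sxy / sxx
    a = mp - b * mo
    resid = [p - (a + b * o) for o, p in zip(oc, pc)]
    s2 = sum(r ** 2 for r in resid) / (n - 2)
    return a, b, s2
```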
     Closely allied to linear regression is Pearson's correlation coef-
ficient, which has long been used as a model evaluation statistic.  This
statistic is described in Appendix D (see p. 31), along with the confi-
dence bands for the correlation coefficient.
     In Appendix D (p. 34), the spatial and temporal correlation coef-
ficients are presented along with their appropriate confidence bands.
The spatial correlation coefficient measures the agreement between the
                                   11

-------
predicted and observed spatial trends of air pollution over a prediction
period.  The temporal correlation coefficient measures the agreement of
the predicted and observed temporal trends of air pollution over the
whole network of monitoring stations.  These statistics are essentially
derivatives of Pearson's correlation coefficient and contain much the
same information.
     The chi-square test is a standard statistic used for examining the
fit of the distribution of predicted concentrations to the distribution
of observed concentrations.  This statistic and some useful accompanying
graphics are discussed in Appendix D (see p. 36).  While useful in gen-
eral, this statistic does not examine the observed and predicted concen-
trations pairwise.  For this reason we suggest that it be used only as a
back-up statistic for a procedure such as linear regression, which does
examine the data pairwise.  The chi-square test is also a useful final-
evaluation statistic for those cases where the accuracy of the predicted
concentration distribution is important.
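A sketch of the chi-square computation over user-chosen concentration categories (the breakpoints are hypothetical; here the observed-category counts serve as the expected frequencies):

```python
def chi_square(oc, pc, breakpoints):
    # Chi-square statistic comparing the frequency distribution of
    # predicted concentrations against that of the observed ones.
    k = len(breakpoints) + 1

    def counts(values):
        c = [0] * k
        for v in values:
            i = 0
            while i < len(breakpoints) and v >= breakpoints[i]:
                i += 1
            c[i] += 1
        return c

    obs_counts = counts(oc)
    pred_counts = counts(pc)
    # Sum of (predicted - observed)^2 / observed over nonempty categories.
    return sum((p_i - o_i) ** 2 / o_i
               for p_i, o_i in zip(pred_counts, obs_counts) if o_i > 0)
```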
     A number of techniques can be used to examine directly the error
residuals (that is, the difference between the observed and predicted
concentrations).  Appendix D (p. 38) introduces the interstation error
correlation coefficient--a statistic that examines the correlation
between station errors, which may be helpful in uncovering influences
accounting for systematic biases in the prediction model.  Another, more
direct, method is the multiple regression of error residuals, also dis-
cussed in Appendix D (see p. 39).  This technique attempts to isolate
meteorological variables highly correlated with the errors in prediction.
Finally, the analysis of residual time series discussed in Appendix D
(p. 41) examines the error residuals for any cyclical behavior using
two types of statistics--the autocorrelation coefficient and the normal-
ized cumulative periodogram.
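The two residual time-series statistics can be sketched from their standard definitions (function names are ours; for a white-noise residual series the normalized cumulative periodogram rises roughly linearly from 0 to 1):

```python
import math

def autocorrelation(resid, lag):
    # Sample autocorrelation of the error residuals at a given lag.
    n = len(resid)
    m = sum(resid) / n
    denom = sum((r - m) ** 2 for r in resid)
    num = sum((resid[t] - m) * (resid[t + lag] - m) for t in range(n - lag))
    return num / denom

def cumulative_periodogram(resid):
    # Normalized cumulative periodogram of the residual series, evaluated
    # at the Fourier frequencies 2*pi*j/n for j = 1 .. n/2.
    n = len(resid)
    periodogram = []
    for j in range(1, n // 2 + 1):
        f = 2 * math.pi * j / n
        a = sum(r * math.cos(f * t) for t, r in enumerate(resid))
        b = sum(r * math.sin(f * t) for t, r in enumerate(resid))
        periodogram.append((a * a + b * b) / n)
    total = sum(periodogram)
    cum, out = 0.0, []
    for p in periodogram:
        cum += p
        out.append(cum / total)  # cumulative fraction of total power
    return out
```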

Other Tests
     Appendix D also includes four statistics seriously considered by
SRI as evaluation statistics but finally rejected.  The runs test
(see p. 45) is a nonparametric test that examines for time trends
                                   12

-------
in error residuals.  The analysis of residual time series accomplishes
more than the runs test, however, and is more easily interpreted.  The
prefigurance and post-agreement tables (see p.  46) allow an assessment
of the extent to which the forecasts give advance notice of a certain
pollution level and the extent to which subsequent observations confirm
this prediction.  However, the scatterplot contains the same information
in a more rapidly digestible form, without the inconvenience of requiring
the user to specify pollution concentration categories.
     Quasi-logistic modeling of the correlation coefficient (see p. 48)
is a statistical technique for determining the  meteorological conditions
under which the correlation coefficient is unusually high or low.  The
data set must be quite extensive before this technique can be used, how-
ever, and the multiple regression of error residuals can extract similar
information more efficiently.
                                  13

-------
                             REFERENCES
 1.   R.  G.  Miller,  Simultaneous Statistical Inference, pp. 110-116
     (McGraw-Hill,  New York, New York, 1967).

 2.   T.  W.  Anderson, An Introduction to Multivariate Statistical Analysis,
     pp. 77-79  (John Wiley  and Sons, New York, New York, 1958).

 3.   R.  Wonnacott and T. Wonnacott, Econometrics, pp. 103-125  (John
     Wiley  and  Sons, New York, New York, 1970).

 4.   Carmen J.  Nappo, "A Method for Evaluating the Accuracy of Air Pollu-
     tion Prediction Models," Proc. Conf. on Atmospheric Diffusion and
     Air Pollution, 9-13 September 1974, Santa Barbara, California.

 5.   Robert Hogg and Allen  Craig, Introduction to Mathematical Statistics,
     pp. 308-315  (The Macmillan Company, New York, New York, 1970).

 6.   N.  R.  Draper and H. Smith, Applied Regression Analysis, pp. 104-195
     (John  Wiley and Sons,  New York, New York, 1966).

 7.   George E. Box and Gwilym M. Jenkins, Time Series Analysis: Fore-
      casting and Control, pp. 34-36 and 294-298 (Holden-Day, San
      Francisco, California, 1976).

 8.   Gwilym M.  Jenkins and  Donald G. Watts, Spectral Analysis and Its
     Applications,  pp. 183-189 and 234-237 (Holden-Day, San Francisco,
     California, 1968).

 9.   W.  J.  Conover, Practical Nonparametric Statistics, pp. 350-356 and
     414 (John  Wiley and Sons, New York, New York, 1971).

10.   H.  A.  Panofsky and G.  W. Brier, Some Applications of Statistics to
     Meteorology, pp. 191-208 (Pennsylvania State University Press,
     University Park, Pennsylvania, 1965).

11.   Norman Nie et  al., Statistical Package for  the Social Sciences,
     pp. 320-328  (McGraw-Hill, New York, New York, 1975).
                              APPENDIX A
                    COMMON NOTATION AND CONVENTIONS

     To shorten the descriptions of the statistical tests, we have
assembled a common set of notation and conventions.  Any exceptions to
this common set are specifically mentioned in the descriptions of par-
ticular statistical tests.
     In simplest form, the data for a statistical test consist of n
pairs of observed and predicted pollutant concentrations.  These data
are denoted (OC_1, PC_1), (OC_2, PC_2), ..., (OC_n, PC_n).  This data set
may include all the data gathered or only a subset.  In general, a final-
evaluation statistic will make use of all the available data, and
intermediate-evaluation statistics will most often make use of subsets.
The concentrations might refer to hourly averages, three-hour averages,
daily averages, etc., at a particular station or averaged over the sta-
tion network.  The pairs need not be sequential  nor time-ordered in any
particular fashion.
     Since it is desirable to assess model fit under a variety of meteor-
ological conditions, the data base is arranged so that it can be sub-
divided into many nonexclusive subsets according to the following
parameters:
     •  Date, D
     •  Time period, T
     •  Location, L
     •  Pasquill-Gifford stability class, G_1
     •  Wind-speed class, G_2
     •  Wind-direction class, G_3
     •  Mixing-height class, G_4
     •  Emission class, Q
     •  Observed concentration, OC
     •  Predicted concentration, PC.
The data for a statistical test shall then be selected from any subset
of the data base (in order to assess model fit under those conditions
characterizing that data subset).
     In quantifying confidence intervals, we use Student's t-distribution.1
For large sample sizes, the t-distribution approximates the standard normal
curve.  In the notation, T_{m,α/2} represents the upper α/2 cutoff point
for a t-distribution with m degrees of freedom.
1Selby, S. M., CRC Standard Mathematical Tables (Chemical Rubber Company,
 Cleveland, 1968).
                               APPENDIX B
                      FINAL-EVALUATION STATISTICS

ACCURACY SCORE
     The accuracy score offers the user several tests to assess either
objectively or subjectively the "loss" due to differences in predicted
concentrations (PC) and observed concentrations (OC).  Each test is
based on the quantification of a loss function, which measures the loss
associated with an error in prediction at a particular time period.
The accuracy score (denoted A) is the average loss for the prediction
model computed over the n time periods in the data base.  Consequently,
the accuracy score resembles a golf score--the lower, the better.  (The
text by Ferguson1 provides material complementary to the concepts pre-
sented herein.)
     We consider eight loss functions, which naturally are associated
with eight different accuracy scores.  The first five rely on (essentially)
prespecified loss functions, which are easily interpretable and have
therefore been called objective.  The next two require that the user
specify loss indices associated with various types of misprediction and
have therefore been called subjective.  The eighth loss function provides
an index of the spatial accuracy of the model in predicting the location
of the maximum observed concentration.
     The importance of correctly specifying the loss function cannot be
overemphasized.  The loss function embodies our conception of how poorly
or well the model fits and the relative seriousness of different depar-
tures from perfect fit.  For example, a meteorologist who is concerned
only about false pollution alerts will choose a very different loss
1Ferguson, Thomas S., Mathematical Statistics, pp. 1-11 (Academic Press,
 New York, 1967) .
function than a meteorologist who is concerned both about false pollu-
tion alerts and failure to predict actual alert-level pollution.
     We now describe the eight loss functions and accuracy scores--five
objective, two subjective, and one spatial.

Objective Loss Functions
Average Absolute Error--

                        A = (1/n) Σ_{i=1}^{n} |OC_i - PC_i|

With this loss function (absolute error), underprediction and overpre-
diction are judged to be equally undesirable.  Furthermore, a mispredic-
tion of 2 ppm is twice as undesirable as a misprediction of 1 ppm.

Average-Squared Error--

                        A = (1/n) Σ_{i=1}^{n} (OC_i - PC_i)^2

With this loss function (squared error) also, underprediction and over-
prediction are judged to be equally undesirable.  A misprediction of
2 ppm is four times as undesirable as a misprediction of 1 ppm.  This
type of loss function is used in many statistical tests.
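These two objective scores reduce to one-line computations; the sketch
below (function names ours) assumes the observed and predicted concen-
trations arrive as equal-length sequences:

```python
def average_absolute_error(oc, pc):
    # A = (1/n) * sum of |OC_i - PC_i|
    return sum(abs(o - p) for o, p in zip(oc, pc)) / len(oc)

def average_squared_error(oc, pc):
    # A = (1/n) * sum of (OC_i - PC_i)^2
    return sum((o - p) ** 2 for o, p in zip(oc, pc)) / len(oc)
```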
Percent Incorrect Predictions
with Absolute Error Threshold (ε)--

          A = (1/n) Σ_{i=1}^{n} L_i ,   L_i = { 0  if |OC_i - PC_i| ≤ ε
                                              { 1  if |OC_i - PC_i| > ε

With this loss function, only prediction errors that exceed a certain
absolute threshold are counted against the prediction model.
Underprediction and overprediction are judged to be equally undesirable.
Whenever the misprediction exceeds the threshold, its magnitude is
ignored.  Thus, moderate and huge misprediction errors count equally.
(The error threshold ε must be user-supplied.)
Percent Incorrect Predictions
with Constant Percentage Error Threshold (ε)--

     A = (1/n) Σ_{i=1}^{n} L_i ,   L_i = { 0  if |OC_i - PC_i|/OC_i ≤ ε
                                         { 1  if |OC_i - PC_i|/OC_i > ε

This loss function is similar to the preceding, except that the threshold
criterion is percentage error.  (In practice, the case where OC_i = 0 is
not allowed.)





Symmetric High-Low Loss Function (Cutoff)--

     A = (1/n) Σ_{i=1}^{n} L_i ,   L_i = { 0  if PC_i and OC_i ≤ cutoff
                                         {    or PC_i and OC_i > cutoff
                                         { 1  otherwise

This loss function is particularly appropriate when an alert-level cut-
off exists.  Here A is the percentage of false alerts and failures to
predict actual pollutant alert levels.  The magnitude of the false alert
is ignored, as is the magnitude of the underprediction when an actual
pollutant alert condition is not predicted.
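The threshold and cutoff loss functions can be sketched the same way
(illustrative names; the threshold ε and the cutoff are user-supplied,
as in the text):

```python
def percent_incorrect_absolute(oc, pc, eps):
    # Fraction of periods whose absolute error exceeds the threshold.
    return sum(1 for o, p in zip(oc, pc) if abs(o - p) > eps) / len(oc)

def symmetric_high_low(oc, pc, cutoff):
    # Fraction of periods where prediction and observation fall on
    # opposite sides of the cutoff (false alerts plus missed alerts).
    return sum(1 for o, p in zip(oc, pc)
               if (o > cutoff) != (p > cutoff)) / len(oc)
```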
Subjective Loss Functions
Asymmetric High-Low Loss Function (Cutoff)--

     A = (1/n) Σ_{i=1}^{n} L_i ,  L_i = { 0    if PC_i and OC_i ≤ cutoff
                                        {      or PC_i and OC_i > cutoff
                                        { ℓ_1  if PC_i > cutoff and OC_i ≤ cutoff
                                        { ℓ_2  if PC_i ≤ cutoff and OC_i > cutoff

Here, ℓ_1, ℓ_2, and the cutoff value are user-specified loss quantities.
This loss function is similar to the preceding one, except that under-
prediction and overprediction are weighted differently.

User-Supplied Loss Matrix--
     The predicted and observed concentrations are divided into categories
whose increments are about 20 percent of the applicable standard.  The
user enters the loss numbers in the matrix.  An example of the loss table
is shown below:

                                OC Categories (k)
                              1    2    3    4    5
                         1    0    1    2    3    4
          PC             2    1    0    1    2    3
          Categories     3    2    1    0    1    2
          (j)            4    3    2    1    0    1
                         5    4    3    2    1    0

Let ℓ_jk be the number in the jth row and the kth column.  The accuracy
score is
                          A = (1/n) Σ_{i=1}^{n} ℓ_jk

where PC_i is in the jth category and OC_i is in the kth category.

     The loss function, ℓ_jk, is extremely general, since the number of
cells and the loss numbers in the cells are at the user's discretion.
The sample loss matrix approximates the average absolute error loss
function.  The other loss functions could be similarly approximated.
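A sketch of the matrix score follows (the `category` mapping passed in is
an assumption for illustration; in practice the categories are increments
of about 20 percent of the applicable standard):

```python
def matrix_accuracy_score(oc, pc, loss, category):
    # loss[j][k]: user-supplied loss when the prediction falls in
    # category j and the observation in category k (zero-based).
    pairs = zip(oc, pc)
    return sum(loss[category(p)][category(o)] for o, p in pairs) / len(oc)
```

With loss[j][k] = |j - k|, this reproduces an approximation to the average
absolute error, as in the sample matrix above.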

Loss Function for the Location
of Maximum Concentration--
     The previous concepts of the accuracy score can be slightly modified
to provide an index of the spatial accuracy of the prediction model in
forecasting the location of the maximum-concentration reading.  This
accuracy score can be defined in the following fashion:
Step 1:   For the ith time period, identify the monitoring locations,
          k(i) and k'(i), where the highest observed and predicted con-
          centrations occur.
Step 2:   From the symmetric m X m matrix, consisting of the distances
          between monitoring stations, select the distance d_{k(i),k'(i)}
          between the locations for highest observed and predicted con-
          centrations.
Step 3:   Repeat Steps 1 and 2 for all samples and then compute the
          accuracy score from the formula

                        A = (1/n) Σ_{i=1}^{n} d_{k(i),k'(i)}
     Many other types of objective or subjective loss indices ℓ_{k(i),k'(i)}
could be used instead of geographical distance.  For instance, the  loss
index could be constructed to take into account topographical features
that might cause two geographically close locations to differ substan-
tially in pollutant levels.
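The three steps can be sketched as follows, given for each time period
the index of the station with the highest observed value, the index with
the highest predicted value, and the symmetric station-distance matrix
(names ours):

```python
def location_accuracy_score(obs_max_station, pred_max_station, dist):
    # Mean distance between the station recording the highest observed
    # concentration and the station with the highest predicted one.
    pairs = zip(obs_max_station, pred_max_station)
    return sum(dist[k][kp] for k, kp in pairs) / len(obs_max_station)
```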

SKILL SCORE
     The skill score2 is an index of model performance that partially
summarizes the information contained in the contingency table of pre-
dicted and observed pollution concentrations.  The skill score may be
adjusted to reflect how well a baseline model (chance, persistence,
climatology, etc.) would have performed.
     The skill score S is given by
                            S = (C - E)/(T - E)
where C is the number of correct forecasts and T is the total number of
forecasts.  If no adjustment for a baseline model is desired, then E is
set to zero and the skill score is just the percentage of correct fore-
casts.  If an adjustment for a baseline model is desired, then E is set
to the expected number of correct forecasts for the baseline model.  In
this case, the skill score (1) has a value of unity when all forecasts
are correct, (2) has a value of zero when the number of correct forecasts
equals the expected number correct, and (3) is negative when the number
of correct forecasts is less than the expected number correct.
     The skill score may easily be modified to summarize how well a model
predicts the location of maximum pollution concentration.  To this end, C
is defined as the number of correct location forecasts and T is the total
number of location forecasts.  If an adjustment for a baseline model is
desired, then E is set to the expected number of correct location fore-
casts for the baseline model.
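S = (C - E)/(T - E) is a one-line computation; the sketch below (name
ours) defaults E to zero, in which case the score is simply the fraction
of correct forecasts:

```python
def skill_score(correct, total, expected=0.0):
    # S = (C - E) / (T - E); unity when all forecasts are correct,
    # zero at the baseline's expected performance, negative below it.
    return (correct - expected) / (total - expected)
```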
2Panofsky, H. A., and G. W. Brier, Some Applications of Statistics to
 Meteorology, pp. 191-208 (Pennsylvania State University Press, Univers-
 ity Park, Pennsylvania, 1965).

P-SCORE
     The P-score3 is a method for evaluating a forecast model and is
particularly well suited for models that predict the probabilities of
certain pollutant levels occurring.  This score has the desirable property
that it exerts little influence on the model builder to choose a particu-
lar model in order to score well on the statistical evaluation.  The
forecaster is encouraged to get the forecast exactly right, or, if he
cannot forecast perfectly, to state unbiased estimates of the probability
of each possible event.
     The P-score ranges from zero to one.  A score of zero implies that
the prediction model always forecasts the observed pollutant level with
a probability of one.  A score of one implies that the prediction model
always forecasts a probability of zero for the pollutant level that did
materialize and forecasts a probability of one for a pollutant level that
did not materialize.
     To compute the P-score, the user must specify various levels for
pollutant concentration, say K of them.  For the ith time period, the
model is assumed to forecast probabilities that the observed concentra-
tion will be at particular levels.  Let f_i1 be the forecast probability
that the observed concentration at time i will be at level 1; let f_i2
be the forecast probability that the observed concentration at time i
will be at level 2, etc.  Clearly, f_i1 + f_i2 + ... + f_iK equals one.
The P-score is given by

               P = (1/2n) Σ_{i=1}^{n} Σ_{j=1}^{K} (f_ij - E_ij)^2

where there are n time periods, and where E_ij takes the value one or
zero according to whether or not level j was observed at time i.
3Morrison, D. F., Multivariate Statistical Methods, pp. 131-134 (McGraw-
 Hill, New York, 1976) .
     In the absence of any forecasting skill, the expected P-score is
minimized by choosing f_ij to be the climatological probability of observ-
ing level j.  If these probabilities are denoted p_j, then the minimum
expected P-score is

                     P_min = (1/2)(1 - Σ_{j=1}^{K} p_j^2)
The climatological probabilities may be computed either from the sample
or from the larger set of data it represents.
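The P-score computation can be sketched as follows (names ours;
forecast_probs[i][j] holds f_ij and observed_levels[i] the level that
materialized, coded 0 to K-1):

```python
def p_score(forecast_probs, observed_levels, k_levels):
    # P = (1/2n) * sum over i, j of (f_ij - E_ij)^2, where E_ij is 1
    # when level j was observed at time i and 0 otherwise.
    n = len(forecast_probs)
    total = 0.0
    for i in range(n):
        for j in range(k_levels):
            e = 1.0 if observed_levels[i] == j else 0.0
            total += (forecast_probs[i][j] - e) ** 2
    return total / (2 * n)
```

A perfect forecaster scores 0; a forecaster who always assigns probability
one to a level that does not materialize scores 1.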

TOLERANCE INTERVALS FOR THE LOCATION
OF MAXIMUM CONCENTRATION
     The ability to predict the location of the maximum concentration of
pollutants is an important characteristic of a prediction model.  In this
section we derive a tolerance interval (or ellipsoid) to describe this
ability.
     Consider a geographical map of a region.  One may locate on this
map the monitoring stations and the predicted location of the maximum
pollution.  Using the results obtained at the monitoring stations, one
may determine an interpolated location for the observed maximum concen-
trations.
     In Figure B-l an arrow has been drawn from the location of observed
maximum concentration to the location of predicted maximum concentration.
Two techniques are now available to describe the ability of the model to
predict the location of the maximum concentration.
     The first technique is concerned only with the length of the arrow
(i.e., the distance between the observed and predicted locations).
Letting D_i be the distance for the ith time period, we compute D̄ and S_D,
the usual estimates of the mean and standard deviation.  A confidence
interval for μ_D, the mean of the distribution of D_i, is given by

                          D̄ ± t_{n-1,.95} S_D/√n
          [Figure B-1. Geographical map of the monitoring stations.]
where t_{n-1,.95} is the 95th percentile point of the t-distribution with
n-1 degrees of freedom.  Under the assumption that the D_i follow a normal
distribution, a 90-percent tolerance interval can be developed.  Consider
an additional observation, D_o, not used in the original calculations.
We are 90-percent confident that D_o will be in the interval

                       D̄ ± t_{n-1,.95} S_D √(1 + 1/n)
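As a minimal sketch (assuming normally distributed distances, and taking
D̄ ± t·S_D·√(1 + 1/n) as the standard normal-theory interval for one
additional observation; the caller supplies the 95th-percentile t value):

```python
import math

def distance_interval(distances, t95):
    # Mean, sample standard deviation, and a 90-percent interval
    # D_bar +/- t95 * S_D * sqrt(1 + 1/n) for a new distance D_o;
    # t95 is the 95th percentile of t with n-1 degrees of freedom.
    n = len(distances)
    d_bar = sum(distances) / n
    s_d = math.sqrt(sum((d - d_bar) ** 2 for d in distances) / (n - 1))
    half = t95 * s_d * math.sqrt(1 + 1 / n)
    return d_bar - half, d_bar + half
```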
     The second technique is concerned both with the distance between the
observed and predicted locations and with the direction of the error.
For each time period, an arrow may be drawn.  Translating these arrows
so that the bases are located at an origin, we obtain Figure B-2.  It is
easily seen that most of the predictions are to the northeast of the
observations, fewer are to the southwest, and still fewer to the north-
west or southeast.
     Taking the average of the vectors, we find the average direction
and magnitude of the error in prediction, as shown in Figure B-3.  The
residual vectors (from the average vector to the translated vectors) may
also be drawn, as shown in Figure B-4.
     Using statistical techniques, we may draw an ellipsoid, which captures
most of the residual vectors, around the average vector.  (See Figure
B-5.)  The ellipsoid is large enough that the investigator can be


          [Figures B-2 through B-5. Translated difference vectors, the
           average vector, the residual vectors, and the tolerance
           ellipsoid.]
90-percent confident it will capture a translated vector, when such a
vector is obtained under conditions similar to those previously existing.
A description of the general procedure follows.
     Let X denote the east-west axis and Y denote the north-south axis.
Then the vector of differences between observed and predicted locations
for the ith time period is (X_i, Y_i).  We assume that the vector (X,Y)
follows a bivariate normal probability distribution with mean (μ_x, μ_y),
variances (σ_x^2, σ_y^2), and correlation ρ.
     The best estimate of the mean (μ_x, μ_y) is the sample mean (X̄, Ȳ).
A confidence region for (μ_x, μ_y) is the ellipsoid of points satisfying

   (X̄ - μ_x)^2 S_y^2 - 2(X̄ - μ_x)(Ȳ - μ_y) S_xy + (Ȳ - μ_y)^2 S_x^2
   ------------------------------------------------------------------  ≤  T_o^2/n
                        S_x^2 S_y^2 - S_xy^2

where S_x^2 and S_y^2 are the usual estimates of the variances of X and Y,
and S_xy is the estimate of the covariance of X and Y.  Here T_o^2 is the
cutoff value of Hotelling's T^2 statistic.4
     Consider an additional vector of differences, not used in the orig-
inal calculations, say (X_o, Y_o).  Assuming that (X_o, Y_o) follows the
same bivariate normal distribution as (X,Y), we obtain the following
result: we are 90-percent confident that (X_o, Y_o) will lie in the
ellipsoid

   (X_o - X̄)^2 S_y^2 - 2(X_o - X̄)(Y_o - Ȳ) S_xy + (Y_o - Ȳ)^2 S_x^2
   ------------------------------------------------------------------  ≤  (1 + 1/n) T_o^2
                        S_x^2 S_y^2 - S_xy^2

4Morrison, D. F., Multivariate Statistical Methods, pp. 131-134 (McGraw-
 Hill, New York, 1976).
                              APPENDIX C
                           CONFIDENCE LIMITS

     The accuracy score and a number of the intermediate-evaluation sta-
tistics presented in this report do not have formulas for confidence
limits.  However, confidence limits for these statistics may be derived.
     To illustrate the procedure for the accuracy score, let A be an
accuracy score computed from a data set.  Subdivide the data set into k
data subsets.  We recommend k = 5 as a reasonable value.  In this case,
the first subset should contain the data from time-periods 1, 6, 11,
16, etc.  The second subset should contain data from time-periods 2, 7,
12, 17, etc.  In this fashion each subset contains data from every fifth
time period.*
     The accuracy scores for each of the subsets can be computed.  If
these accuracy scores are labeled A_1, A_2, ..., A_k, then the sample
standard deviation of these subset scores is

                 S = [ (1/(k-1)) Σ_{i=1}^{k} (A_i - Ā)^2 ]^{1/2}

where

                          Ā = (1/k) Σ_{i=1}^{k} A_i
*
 This procedure ensures that the data subsets are comparable with each
 other and to the full data set, except when the data set has a "period-
 icity" of five time periods.  As an example of such periodicity, consider
 the case wherein the periods were days and the full data set consisted
 only of weekdays.  Then the five data sets would not be comparable (since
 one would consist only of Mondays, another only of Tuesdays, etc.).  In
 this case, one should construct six data subsets.

     The accuracy score A computed from the total data set has a smaller
standard error than the accuracy scores computed from the smaller data
subsets.  We may approximate the standard error of A by

                                  S/√k

Thus a 90-percent confidence interval for the true value of the accuracy
score is given by

                          A ± t_{k-1,.95} S/√k

where t_{k-1,.95} is the upper 95-percent cutoff point of the Student's
t-distribution with k - 1 (four) degrees of freedom.
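The interleaved-subset procedure can be sketched as follows (names ours;
per_period_loss holds the loss for each time period in order, and the
caller supplies the t percentile for k - 1 degrees of freedom):

```python
import math

def subset_confidence_interval(per_period_loss, k, t95):
    # Split the losses into k interleaved subsets (every k-th period),
    # average each subset, and form A_bar +/- t95 * S / sqrt(k).
    subsets = [per_period_loss[j::k] for j in range(k)]
    scores = [sum(s) / len(s) for s in subsets]
    a_bar = sum(scores) / k
    s = math.sqrt(sum((x - a_bar) ** 2 for x in scores) / (k - 1))
    half = t95 * s / math.sqrt(k)
    return a_bar - half, a_bar + half
```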
                               APPENDIX D
                   INTERMEDIATE-EVALUATION STATISTICS

LINEAR REGRESSION  WITH CONFIDENCE
AND PREDICTION BANDS
     The scattergram  of observed-versus-predicted  concentrations is a
visual presentation of the accuracy of the prediction model.  The regres-
sion line  summarizes  the average relationship between predicted and
observed concentrations.
     Two types of  bands can be drawn around  the  regression line.  A 90-
percent confidence band has a 90-percent probability  of capturing a true
underlying  linear  relationship between predictions and observations.
A 90-percent prediction band has a 90-percent probability of capturing
an arbitrary prediction-observation pair, when that pair is obtained under
conditions similar to those existing when the data used in the scattergram
were gathered.  Typical bands are shown in Figure D-1.
     [Figure D-1. Scattergram of predicted versus observed concentration,
      showing the regression line, 90% confidence bands on the regression
      line, and 90% prediction bands for an additional prediction-
      observation pair.]
                                    30

-------
     The regression line and bands can be obtained by applying normal
regression theory.  For the sake of notation, let

                          (X_i, Y_i) = (OC_i, PC_i)

Then the least-squares regression of Y_i on X_i for n samples may be per-
formed, yielding an intercept estimate a, a slope estimate b, and a vari-
ance estimate S^2 as follows:

     b = [n Σ_{i=1}^{n} X_i Y_i - (Σ_{i=1}^{n} X_i)(Σ_{i=1}^{n} Y_i)]
         / [n Σ_{i=1}^{n} X_i^2 - (Σ_{i=1}^{n} X_i)^2]

     a = Ȳ - bX̄

     S^2 = (1/(n-2)) Σ_{i=1}^{n} (Y_i - a - bX_i)^2
     In contrast to the usual convention,  the vertical axis has been
reserved for the predicted observation and the horizontal axis has  been
reserved for the observed concentration.   This arrangement more closely
approximates the underlying structure of  normal regression theory since
the predominant error is in the prediction rather than in the  observa-
tion of air pollution.
     The usual 90-percent confidence interval for the mean a + bX is
given by

  a + bX ± t_{n-2,.95} S [ 1/n + (X - X̄)^2 / Σ_{i=1}^{n} (X_i - X̄)^2 ]^{1/2}

Now consider an additional observation (X_o, Y_o).  For given X_o, it is
possible to derive an interval that possesses a 90-percent probability
of capturing Y_o.  This interval is given by the formula

  a + bX_o ± t_{n-2,.95} S [ 1 + 1/n + (X_o - X̄)^2 / Σ_{i=1}^{n} (X_i - X̄)^2 ]^{1/2}
     The estimates of the intercept and slope may also be used to correct
the model for linear trend.  To accomplish this, we replace the model
prediction Y by the corrected prediction (Y - a)/b.
     To supplement the material presented herein, the reader is referred
to the text by Miller.1
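The regression and prediction band can be sketched as follows (names
ours; the caller supplies the 95th-percentile t value with n - 2 degrees
of freedom):

```python
import math

def regression_with_bands(x, y, x0, t95):
    # Least-squares fit y = a + b*x, plus the 90-percent prediction
    # interval for an additional observation at x0.
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(u * u for u in x)
    sxy = sum(u * v for u, v in zip(x, y))
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    s2 = sum((v - a - b * u) ** 2 for u, v in zip(x, y)) / (n - 2)
    xbar = sx / n
    ssx = sum((u - xbar) ** 2 for u in x)
    half = t95 * math.sqrt(s2 * (1 + 1 / n + (x0 - xbar) ** 2 / ssx))
    fit = a + b * x0
    return a, b, fit - half, fit + half
```

The intercept and slope returned here are also what the linear-trend
correction (Y - a)/b requires.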

PEARSON'S CORRELATION COEFFICIENT
     Direct correlation between observed and predicted concentrations
exists when a rise or fall in predicted concentrations is associated
with a corresponding linear rise or fall in observed concentration.
     The degree of correlation between predicted and observed concentra-
tions can often be visually assessed through the use of a scatterplot,
as illustrated in Figure D-2.
     In the situations shown in Figure D-2, the more closely the plotted
concentrations can be fitted by a line, the more highly correlated are
the predicted and observed concentrations.  A correlation of +1 denotes
that all the plotted points fall along a line; a correlation close to
zero denotes that the data scatter widely.
     [Figure D-2. Visual assessment of correlation, predicted versus
      observed concentration: (a) low correlation, (b) moderate
      correlation, (c) high correlation.]
     Two  anomalous situations may also occur.  In  the first, the correla-
tion coefficient is negative.   The line that best  fits the scatterplot
has a negative slope.  A correlation of -1 denotes  that all the plotted
points  fall  along such a line.   In the second, the  predicted and observed
concentration scatterplot shows a definite nonlinear  relationship.  As
illustrated  in Figure D-3, the  correlation will be  low in such cases
unless  the data can also be fit well by a line.
     When the relationship between predicted and observed concentration
is linear, the square of the correlation coefficient  may be interpreted
as the  percentage of the variance of the predicted  concentrations

     [Figure D-3. Nonlinear scatterplot relationships: (a) zero
      correlation, (b) moderate positive correlation.]
                                    33

-------
"explained" or "accounted for" by the observed observations.  The con-
verse interpretation—that the squared correlation coefficient represents
the percentage of the variance of the observed concentrations explained
by the predicted observations—also holds.
     Pearson's formula for the correlation coefficient is given by

   r = [n Σ_{i=1}^{n} OC_i PC_i - (Σ_{i=1}^{n} OC_i)(Σ_{i=1}^{n} PC_i)]
       / { [n Σ_{i=1}^{n} OC_i^2 - (Σ_{i=1}^{n} OC_i)^2]^{1/2}
           [n Σ_{i=1}^{n} PC_i^2 - (Σ_{i=1}^{n} PC_i)^2]^{1/2} }
     In discussing the correlation coefficient, we have been concerned
with the correlation exhibited in the sample.  We now consider the corre-
lation coefficient in the population from which the sample has been
drawn.  Due to sampling variability, Pearson's correlation coefficient
for the sample will not usually equal the correlation coefficient in the
population.  With a large sample, however, the two correlation coefficients
will be close.
     The correlation coefficient for the population is usually denoted
"p."  It can be shown that if the observed and predicted concentrations
are jointly normally distributed, there is a 95-percent probability that
p is contained in the interval

   tanh( tanh^{-1} r - 1.96/√(n-3) )  ≤  ρ  ≤  tanh( tanh^{-1} r + 1.96/√(n-3) )

This interval is called a 95-percent confidence interval for the popula-
tion correlation coefficient.  The texts by Anderson2 and Wonnacott3
provide more detailed information on the derivation of the confidence
interval.
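Both the sample coefficient and the Fisher-transform confidence interval
are short computations (function names ours):

```python
import math

def pearson_r(oc, pc):
    # Pearson's product-moment correlation coefficient.
    n = len(oc)
    so, sp = sum(oc), sum(pc)
    sop = sum(o * p for o, p in zip(oc, pc))
    soo = sum(o * o for o in oc)
    spp = sum(p * p for p in pc)
    num = n * sop - so * sp
    den = math.sqrt(n * soo - so * so) * math.sqrt(n * spp - sp * sp)
    return num / den

def rho_confidence_interval(r, n):
    # 95-percent interval for the population coefficient rho:
    # tanh(atanh(r) -/+ 1.96 / sqrt(n - 3)).
    z = math.atanh(r)
    h = 1.96 / math.sqrt(n - 3)
    return math.tanh(z - h), math.tanh(z + h)
```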
TEMPORAL AND SPATIAL CORRELATION COEFFICIENTS
     As described by Nappo,4 an accurate air-prediction model must repro-
duce at each monitoring station the observed time-varying pollution con-
centration, and reproduce at each monitoring time the observed space-
varying pollution pattern.  A measure of the agreement of the predicted
and observed temporal trends of air pollution over the whole network of
monitoring stations is given by the temporal correlation coefficient
R(t) , the average over stations (spatial average) of the temporal cor-
relation coefficient formed at each station.  A measure of the agreement
of the predicted and observed spatial trends of air pollution over a
prediction period is given by the spatial correlation coefficient R(s) ,
the average over time of the correlation coefficient between the predicted
and observed patterns of the concentration isopleths at each monitoring
time.
     Let the collection of observed and predicted concentrations, O_{s,t}
and P_{s,t}, be indexed by the monitoring station s = 1,2,...,S and by the
monitoring time t = 1,2,...,T.  It is not necessary for the stations or
the times to be contiguous.
     The time-averaged, station-specific, predicted and observed concen-
trations are given by

                     P̄_s = (1/T) Σ_{t=1}^{T} P_{s,t}

                     Ō_s = (1/T) Σ_{t=1}^{T} O_{s,t}
     The station-specific temporal correlation coefficient is given by
the usual formula,
     R_s(t) = Σ_{t=1}^{T} (P_{s,t} - P̄_s)(O_{s,t} - Ō_s)
              / { [Σ_{t=1}^{T} (P_{s,t} - P̄_s)^2]^{1/2}
                  [Σ_{t=1}^{T} (O_{s,t} - Ō_s)^2]^{1/2} }

     The spatial average of the temporal correlation coefficient is
given by

                     R̄(t) = (1/S) Σ_{s=1}^{S} R_s(t)
     Under the assumption that the R_s(t) are not related to each other
(i.e., statistically independent) and have identical frequency distribu-
tions with mean E[R_s(t)] = R(t), a 95-percent confidence interval for
R(t) is given by

    \tanh\left( \frac{1}{S} \sum_{s=1}^{S} \tanh^{-1} R_s(t) - \frac{1.96}{\sqrt{S(T - 3)}} \right)
        \le R(t) \le
    \tanh\left( \frac{1}{S} \sum_{s=1}^{S} \tanh^{-1} R_s(t) + \frac{1.96}{\sqrt{S(T - 3)}} \right)
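As a concrete sketch of the computation above, the following NumPy routine (illustrative only; the function name and the (S, T) array layout are assumptions of this presentation, not software from the report) returns the spatial average of the station-specific temporal correlations together with its 95-percent confidence interval.

```python
import numpy as np

def temporal_correlation_ci(P, O):
    """P, O: predicted and observed concentrations, shape (S, T).
    Returns (R_bar, lo, hi): the spatial average of the station-specific
    temporal correlations R_s(t), and its 95-percent confidence interval
    from the averaged Fisher z-transform."""
    S, T = P.shape
    Pc = P - P.mean(axis=1, keepdims=True)   # P_{s,t} - Pbar_s
    Oc = O - O.mean(axis=1, keepdims=True)   # O_{s,t} - Obar_s
    Rs = (Pc * Oc).sum(axis=1) / np.sqrt(
        (Pc ** 2).sum(axis=1) * (Oc ** 2).sum(axis=1))
    z_bar = np.arctanh(Rs).mean()            # (1/S) sum tanh^{-1} R_s(t)
    half = 1.96 / np.sqrt(S * (T - 3))
    return Rs.mean(), np.tanh(z_bar - half), np.tanh(z_bar + half)
```

The same sketch applies to the spatial correlation R(s) after transposing the arrays and exchanging the roles of S and T.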
     The station-averaged, time-specific, predicted and observed concen-
trations are given by

    \bar{P}_t = \frac{1}{S} \sum_{s=1}^{S} P_{s,t}        \bar{O}_t = \frac{1}{S} \sum_{s=1}^{S} O_{s,t}
     The time-specific station correlation coefficient is given by the
usual formula,

    R_t(s) = \frac{\sum_{s=1}^{S} (P_{s,t} - \bar{P}_t)(O_{s,t} - \bar{O}_t)}
                  {\sqrt{\sum_{s=1}^{S} (P_{s,t} - \bar{P}_t)^2 \, \sum_{s=1}^{S} (O_{s,t} - \bar{O}_t)^2}}
     The temporal average of the time-specific spatial correlation
coefficient is given by

    R(s) = \frac{1}{T} \sum_{t=1}^{T} R_t(s)
     Under the assumption that the R_t(s) are independent and identically
distributed with mean E[R_t(s)] = R(s), a 95-percent confidence interval
for R(s) can be formulated as

    \tanh\left( \frac{1}{T} \sum_{t=1}^{T} \tanh^{-1} R_t(s) - \frac{1.96}{\sqrt{T(S - 3)}} \right)
        \le R(s) \le
    \tanh\left( \frac{1}{T} \sum_{t=1}^{T} \tanh^{-1} R_t(s) + \frac{1.96}{\sqrt{T(S - 3)}} \right)
CHI-SQUARE GOODNESS-OF-FIT TEST
     The chi-square goodness-of-fit test can be used to assess whether
the distribution of predicted concentrations deviates significantly from
the distribution of observed concentrations, and to graphically depict
such departure.  Differences between the observed and predicted concen-
tration distributions, such as different variances or shapes, may be
identified from appropriate data displays.  For more detailed information
on this test, the reader is referred to the text by Hogg and Craig.
     The observed concentrations OC_1, ..., OC_n are used to define cell
boundaries for air-pollution concentration with an approximately equal
number of observed concentrations in each cell.  If possible, each cell
should have approximately ten observations.  By E_j, denote the number of
observed concentrations in the j-th cell, where there are J cells and
1 \le j \le J.  By P_j, denote the number of predicted concentrations in
the j-th cell.  If the model is accurate, the E_j and the P_j should be
approximately equal.
     Under the assumption that the E_j are the expected number of predicted
concentrations in the j-th cell, the statistic

    \chi^2 = \sum_{j=1}^{J} \frac{(P_j - E_j)^2}{E_j}

is distributed as a chi-square random variable with J - 1 degrees of
freedom.
     Larger values of the chi-square statistic than might reasonably be
anticipated from a chi-square random variable are indicative of poor
model fit.  The significance of the chi-square statistic may be obtained
from a chi-square distribution table.  It is usual practice to report the
value of the chi-square statistic (\chi^2) as well as its significance
(obtained from appropriate tables).  However, both these values are very
sensitive to sample size.  We suggest also reporting \chi^2/n, which may be
interpreted as the sum over the cells of the percentage deviation from
the expected cell proportions.
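The cell construction and test statistic described above can be sketched as follows (illustrative NumPy code, not the report's own software; it assumes continuous concentration data so that the quantile cell boundaries are distinct).

```python
import numpy as np

def chi_square_fit(observed, predicted, n_cells=10):
    """Chi-square goodness-of-fit comparison of the predicted and observed
    concentration distributions.  Cell boundaries are quantiles of the
    observed concentrations, so each cell holds roughly equal numbers of
    observations."""
    n = len(observed)
    edges = np.quantile(observed, np.linspace(0.0, 1.0, n_cells + 1))
    edges[0], edges[-1] = -np.inf, np.inf      # open-ended outer cells
    E = np.histogram(observed, bins=edges)[0].astype(float)   # E_j
    P = np.histogram(predicted, bins=edges)[0].astype(float)  # P_j
    chi2 = ((P - E) ** 2 / E).sum()
    return chi2, chi2 / n     # chi2 has J - 1 degrees of freedom
```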
INTERSTATION ERROR CORRELATION TEST
     Any correlation between the errors in concentration prediction of
two receptor stations is indicative of a systematic bias in the predictor
model.  The significance of these error correlations may be tested and
confidence bounds placed on the error correlations.
     A correlation coefficient that is approximately zero between the
errors in concentration prediction of two receptor sites is a highly
desirable situation.  It indicates that no important condition "linking"
the two receptor sites exists that might account  for underprediction or
overprediction.
     If the correlation coefficient is high  (near 1.0), it indicates that
whenever the model overestimates concentrations at one receptor site it
tends to do so at the other, and similarly for underestimation.  The
presence of a factor influencing both observed concentrations but
unaccounted for by the model is suggested.
     If the correlation coefficient is low (near  -1.0), it indicates that
whenever the model overestimates concentrations at one receptor site, it
tends to underestimate concentrations at another, and, conversely, when-
ever the model underestimates at one receptor site it tends to overestimate
at the other.  The presence of a factor affecting both observed concen-
trations in opposite directions, but unaccounted for by the model, is
suggested.
     For two stations j and k, let the sets of observed and predicted
concentrations be given by (OC_{j1}, PC_{j1}), ..., (OC_{jn}, PC_{jn}) and
(OC_{k1}, PC_{k1}), ..., (OC_{kn}, PC_{kn}).  The differences between the
observed and predicted concentrations will be denoted

    DC_{jt} = OC_{jt} - PC_{jt}

where j (or k) is the station identifier and t is the time identifier.
     Let r be the correlation coefficient between (DC_{ji} : i = 1, ..., n)
and (DC_{ki} : i = 1, ..., n).  Under the assumption that DC_j and DC_k follow

a bivariate normal distribution with correlation coefficient \rho, a
95-percent confidence interval is given by

    \tanh\left( \tanh^{-1} r - \frac{1.96}{\sqrt{n - 3}} \right)
        \le \rho \le
    \tanh\left( \tanh^{-1} r + \frac{1.96}{\sqrt{n - 3}} \right)

If this confidence interval fails to capture the value zero, then the
correlation coefficient \rho is significantly different from zero.
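The confidence-interval computation above can be sketched in a few lines (illustrative NumPy code; the function name and return convention are assumptions of this presentation).

```python
import numpy as np

def error_correlation_ci(DCj, DCk):
    """95-percent confidence interval, via the Fisher z-transform, for the
    correlation rho between the prediction errors of two stations.
    Returns (r, lo, hi, significant)."""
    n = len(DCj)
    r = np.corrcoef(DCj, DCk)[0, 1]
    half = 1.96 / np.sqrt(n - 3)
    lo, hi = np.tanh(np.arctanh(r) - half), np.tanh(np.arctanh(r) + half)
    return r, lo, hi, not (lo <= 0.0 <= hi)  # significant if zero not captured
```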

MULTIPLE REGRESSION OF ERROR  RESIDUALS
     Multiple regression  is  a  technique for uncovering linear relation-
ships between the residuals  (observed concentrations less predicted con-
centrations) and the air-pollution/meteorological variables.  Figure D-4
illustrates a hypothetical plot of residuals versus ambient air tempera-
ture.  In this case, the prediction model tends to underestimate concen-
trations at high temperatures and  overestimate concentrations at low
temperatures.  The elimination of this linear trend through model
adjustment will improve the accuracy of the prediction.  Multiple linear
regression can be used to uncover these linear relationships when there
are many air-pollution/meteorological variables.

Figure D-4.  Hypothetical plot of residuals versus ambient air temperature.
     To perform a multiple linear regression, we must have data consist-
ing of a set of n paired measurements of observed concentrations, pre-
dicted concentrations, and air-pollution/meteorological variables,
(OC_1, PC_1, X_{11}, ..., X_{1p}), ..., (OC_n, PC_n, X_{n1}, ..., X_{np}).  The air-pollution/
meteorological variables can be interval, ordinal, or nominal 0-1 vari-
ables.  They may even include predicted and observed concentrations.
     The residual Z_k is the difference between the k-th observed and
predicted observation

    Z_k = OC_k - PC_k    (1 \le k \le n)

To uncover linear relationships between the residual and the air-pollution/
meteorological variables, we fit the following model:

    Z_k = \mu + \sum_{j=1}^{p} \beta_j X_{kj} + e_k    (1 \le k \le n)

Here the e_k are independent variables following a normal (0, \sigma^2)
distribution.  The usual least-squares procedures are used to estimate
the \mu and \beta_j coefficients.
     The estimate of \mu is the amount by which the model, on the average,
underestimates or overestimates the observed concentration.  The esti-
mates of the \beta_j are the slopes of the linear relationship of the
residuals to the air-pollution/meteorological variables.
     The statistical significance of the estimates of the \beta_j may be
determined by using the F-test.  If the F-test is significant at the
5-percent level, then consideration should be given to using the regres-
sion results to modify the model.
     The multiple regression should be performed in a stepwise fashion
in order to isolate the most important subset of the air-pollution/
meteorological variables.  These variables can be identified as providing
the largest increments to the proportion of the explained variance in the
residuals.  Efforts at model modification should be focused on these
variables.
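A minimal least-squares sketch of the residual regression described above follows (illustrative only; stepwise selection and the F-test are omitted, and the function name is an assumption of this presentation).

```python
import numpy as np

def regress_residuals(OC, PC, X):
    """Least-squares fit of the residuals Z_k = OC_k - PC_k on the
    air-pollution/meteorological variables X (shape (n, p)).
    Returns (mu_hat, beta_hat): the estimated constant bias and the
    slopes of the linear relationships."""
    Z = OC - PC
    A = np.column_stack([np.ones(len(Z)), X])   # design matrix with intercept
    coef, *_ = np.linalg.lstsq(A, Z, rcond=None)
    return coef[0], coef[1:]
```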

ANALYSIS OF THE RESIDUAL TIME SERIES
     Statistical techniques, as described by Box and Jenkins7 and Jenkins
and Watts,8 may be used to examine the residual time series (observed
concentrations less predicted concentrations) to determine if components
other than white noise are present.
     Due to the serial and cyclical time dependence of meteorological
conditions and pollutant emissions, both the observed-concentrations
series and predicted-concentrations series will also exhibit time depen-
dencies.  To the extent, however, that the predicted time series agrees
with the observed concentrations, the residual time series (observed
concentrations less predicted concentrations) will consist primarily of
white noise.
     If there is a serial or cyclical component in the residual time
series, then either the prediction model is failing to account for a
time-dependent relationship or it is inducing an extraneous time-dependent
relationship.  In either event, a fault in the prediction model is
indicated.
     To perform the time-series analysis, we need data consisting of a
set of n paired measurements of observed concentrations and predicted
concentrations equally spaced in time:  (OC_1, PC_1), (OC_2, PC_2), ...,
(OC_n, PC_n).  These data are used to form the residual time series

    r_t = OC_t - PC_t    (t = 1, ..., n)

The residual time series is then investigated for serial and cyclical
departures from white noise.

     The presence of serial departure from white noise may be detected
by using the autocorrelation function.  The autocorrelation function of
the residual time series is given by

    A(k) = \frac{\frac{1}{n} \sum_{t=1}^{n-k} (r_t - \bar{r})(r_{t+k} - \bar{r})}
                {\frac{1}{n} \sum_{t=1}^{n} (r_t - \bar{r})^2}

where

    \bar{r} = \frac{1}{n} \sum_{t=1}^{n} r_t

and k indexes the lag of the serial dependence (i.e., daily, weekly,
bimonthly, monthly).
     If the r_t series is a white-noise process, then any A(k) is asymp-
totically normally distributed with mean zero and variance 1/n, and
should lie inside the 95-percent confidence limits of \pm 1.96/\sqrt{n}.
If an A(k) value is outside these limits, the possibility of a time
dependency should be closely investigated.
     Figure D-5 illustrates the confidence bands for the autocorrelation
function.  Two autocorrelations (at lag one and seven days) fall outside
the confidence bands.  These autocorrelations indicate that if the model
overestimates on one day, it tends to underestimate the next, and if the
model overestimates on one day of the week, it tends to overestimate on
the same day of the next week.
     Finally, it should be noted that of every twenty or so autocorrela-
tions tested, about one will fall outside the bands by mere chance.
Therefore, each autocorrelation falling outside the bands needs to be
more closely investigated (either analytically or by means of another
time series) to establish that the observed result is not a chance
artifact.
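The autocorrelation formula above can be sketched as follows (illustrative NumPy code, an assumption of this presentation rather than the report's software).

```python
import numpy as np

def autocorrelation(r, max_lag):
    """Sample autocorrelation A(k), k = 1..max_lag, of the residual
    series r, and the +/-1.96/sqrt(n) white-noise confidence limit."""
    r = np.asarray(r, dtype=float)
    n = len(r)
    d = r - r.mean()
    denom = (d ** 2).sum() / n                 # (1/n) sum (r_t - rbar)^2
    A = np.array([(d[:n - k] * d[k:]).sum() / n / denom
                  for k in range(1, max_lag + 1)])
    return A, 1.96 / np.sqrt(n)
```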
Figure D-5.  Sample autocorrelation plot (lags 1 through 15, with 95% limits).
     The presence of cyclical deviation from white noise may be detected
by using the normalized cumulative periodogram.  The normalized cumula-
tive periodogram is given by

    C(f_j) = \frac{\sum_{i=1}^{j} I(f_i)}{\sum_{t=1}^{n} (r_t - \bar{r})^2}

where

    I(f_i) = \frac{2}{n} \left[ \left( \sum_{t=1}^{n} r_t \cos 2\pi f_i t \right)^2
             + \left( \sum_{t=1}^{n} r_t \sin 2\pi f_i t \right)^2 \right]

is the periodogram of the residual time series and where f_i = i/n is the
frequency.
     For a white-noise series, the plot of C(f_j) against f_j would be
close to a straight line joining the points (0,0) and (0.5,1).  Ninety-
five-percent confidence limits can be drawn at a distance of \pm 1.96/\sqrt{n}
above and below this straight line, as shown in Figure D-6.  Note that
C(f_j) crosses the confidence limits.  This indicates that there are
cyclical deviations from white noise.  The largest jumps in C(f_j) occur
in a pattern of about seven days and therefore the model should be inves-
tigated to determine why it tends to overestimate and underestimate
observed concentrations with this frequency.
Figure D-6.  Sample plot of cumulative periodogram.

     To use these two time-series analysis techniques, we must have a
data set consisting of equally spaced readings in time.  These readings
could be hourly, daily, weekly, monthly, etc.  It is recommended that at
least 50 consecutive readings be included in each autocorrelation and
periodogram calculation.  It is not necessary that all values be consecu-
tive since the computational formulas may be adjusted for occasional
missing values.
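The periodogram formulas above can be sketched as follows (illustrative NumPy code; mean-removal of the series changes nothing at the Fourier frequencies f_i = i/n, and for even n the Nyquist term makes the final value of C land slightly above 1).

```python
import numpy as np

def cumulative_periodogram(r):
    """Normalized cumulative periodogram C(f_j) of the residual series r,
    with the +/-1.96/sqrt(n) white-noise band about the diagonal."""
    r = np.asarray(r, dtype=float)
    n = len(r)
    d = r - r.mean()
    t = np.arange(1, n + 1)
    f = np.arange(1, n // 2 + 1) / n          # frequencies f_i = i/n up to 0.5
    # Periodogram I(f_i) = (2/n)[(sum d cos 2*pi*f_i*t)^2 + (sum d sin ...)^2]
    arg = 2 * np.pi * np.outer(f, t)
    I = (2.0 / n) * ((d * np.cos(arg)).sum(axis=1) ** 2
                     + (d * np.sin(arg)).sum(axis=1) ** 2)
    C = np.cumsum(I) / (d ** 2).sum()         # normalize by sum (r_t - rbar)^2
    return f, C, 1.96 / np.sqrt(n)
```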

RUNS TEST
     The Wald-Wolfowitz runs  test  is a  method for determining whether
the prediction model overestimates or underestimates  the observed concen-
tration for long periods of time.
     Consider a plot of the amount of over- or under-prediction of SO2
concentrations, as shown in Figure D-7.  The dashed line represents the
median overprediction  (nine observations are  below the median and nine
above).  Adjusting for the median overprediction,  we  obtain Figure D-8.
Note that the plotted points  tend to stay above or below the median for
extended periods of time.  This indicates that there  is a nonrandom com-
ponent of prediction error present (we have already subtracted out the
constant overprediction component).  If  only  random noise were present,
the plotted errors in prediction would fluctuate more erratically around
the median.  The presence of only four runs (a run is a sequence of one
or more consecutive points above or below the median) indicates the
presence of a nonrandom component.
Figure D-7.  Sample plot of residuals versus time.
Figure D-8.  Adjusted sample plot of residuals versus time (runs 1 through 4 marked).
     If  the Wald-Wolfowitz runs test is used, the observed and predicted
concentration  readings  should be consecutively, and approximately equally,
spaced in time.  We  can conclude that there is a nonrandom component in
the sequence if  there are  too few runs.  The minimum allowable number of
runs over n time periods is

    \frac{n + 2}{2} - 1.65 \, \frac{\sqrt{n}}{2}
Fewer runs than this indicates, with 95-percent confidence, the presence
of a nonrandom component of  error.   This formula is fairly accurate for
over 40 data points.  Otherwise,  appropriate tables must be used.
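The runs count and the minimum-allowable threshold from the text can be sketched as follows (illustrative code; the tie-handling convention of dropping points equal to the median is an assumption of this presentation).

```python
import math
from statistics import median

def runs_test(residuals):
    """Wald-Wolfowitz runs test about the median.  Returns the observed
    number of runs and the minimum allowable number
    (n + 2)/2 - 1.65*sqrt(n)/2; fewer runs than the minimum indicates,
    at the 95-percent level, a nonrandom error component."""
    m = median(residuals)
    signs = [x > m for x in residuals if x != m]   # drop ties with the median
    n = len(signs)
    runs = 1 + sum(a != b for a, b in zip(signs, signs[1:]))
    return runs, (n + 2) / 2 - 1.65 * math.sqrt(n) / 2
```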

POST-AGREEMENT AND PREFIGURANCE TABLES
     The post-agreement and  prefigurance tables are two methods that are
useful in assessing the accuracy  of  a forecast model.10  The post-
agreement table (Table D-l)  allows an assessment of the extent to which
subsequent observations confirm the  prediction of a certain pollution
level.  The prefigurance table (Table D-2)  allows an assessment of the
extent to which the forecasts give advance  notice of a certain pollution
level.
                       TABLE D-1.  POST-AGREEMENT

                         Forecast Pollution Level (Percent)
     Observed Level        Heavy      Moderate      Light
     ----------------------------------------------------
     Heavy                  57.1        21.1          2.5
     Moderate               19.0        53.3         10.0
     Light                  23.9        25.6         87.5
     Total                 100         100          100
                        TABLE D-2.  PREFIGURANCE

                         Forecast Pollution Level (Percent)
     Observed Level        Heavy      Moderate      Light      Total
     ---------------------------------------------------------------
     Heavy                  54.5        40.9          4.6       100
     Moderate               14.3        71.4         14.3       100
     Light                  10.0        20.0         70.0       100
     In Tables D-l and D-2,  three levels (heavy,  moderate,  and light)
have been used.  In general, the number of levels and their specification
in terms of pollutant concentrations are determined by the  user.
     The post-agreement table will be important in deciding whether a
forecast model is accurate enough to call an alert when heavy pollution
is forecast.  The prefigurance table will be important in deciding whether
a forecast model is correctly forecasting a large enough percentage of
heavy-pollution days.
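The two tables can be built from the same cross-tabulation, normalized by column (post-agreement) or by row (prefigurance); the following sketch (illustrative NumPy code, not from the report) assumes every level occurs at least once both as a forecast and as an observation.

```python
import numpy as np

def agreement_tables(observed, forecast, levels=("Heavy", "Moderate", "Light")):
    """Cross-tabulate observed vs. forecast levels, then express the counts
    as percentages by forecast column (post-agreement) and by observed row
    (prefigurance)."""
    idx = {lev: i for i, lev in enumerate(levels)}
    counts = np.zeros((len(levels), len(levels)))
    for o, f in zip(observed, forecast):
        counts[idx[o], idx[f]] += 1.0
    post = 100.0 * counts / counts.sum(axis=0, keepdims=True)  # columns sum to 100
    pre = 100.0 * counts / counts.sum(axis=1, keepdims=True)   # rows sum to 100
    return post, pre
```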
QUASI-LOGISTIC MODELING OF THE CORRELATION COEFFICIENT
     The Pearson's correlation coefficient between observed and predicted
concentrations may be computed for many different subsets of the data
(such as days with a particular mixing height, wind speed, ambient air
temperature, etc.).  Some subsets will be associated with high correla-
tion coefficients and others with low correlation coefficients.  Quasi-
logistic modeling of the correlation coefficient is a method of assessing
which of the subset-defining variables (mixing height, wind speed, etc.)
are contributing to or reducing the correlation coefficient.  That is,
quasi-logistic modeling allows a determination of the conditions under
which the observed and predicted concentrations are well or poorly
correlated.
     Suppose we are interested in the effect of m subset-defining vari-
ables on the correlation coefficient.  These "independent" variables will
be denoted X_1, ..., X_m.  In the spirit of the logistic model, we model
the true correlation coefficient as

    \rho = \frac{\exp\left( A_0 + \sum_{i=1}^{m} A_i X_i \right) - 1}
                {\exp\left( A_0 + \sum_{i=1}^{m} A_i X_i \right) + 1}

Here the A_0, A_1, ..., A_m are coefficients.  Of course, we have not yet
specified how to estimate the coefficients nor how to assess whether the
model fits the correlation coefficient.  But we should note some features
of this model.  If A_0 is large in comparison with the other A_i (even when
multiplied by the X_i), then the true correlation coefficient is essentially
constant over all subsets of the data.  On the other hand, whenever any
A_i X_i becomes large and positive, it follows that the correlation coeffi-
cient is especially large due to the contribution of the i-th subset-
defining variable.
     For any particular subset of the data, we can compute Pearson's
correlation coefficient r.  Under the usual bivariate normality assump-
tions for predicted and observed concentrations, the quantity

    Z = \ln \frac{1 + r}{1 - r}

is approximately normally distributed with variance equal to 4/(n - 3).
Here n refers to the number of data points in the subset used to calcu-
late r.  The mean for Z is

    E[Z] = \ln \frac{1 + \rho}{1 - \rho}

which under our model for \rho reduces to

    E[Z] = A_0 + \sum_{i=1}^{m} A_i X_i

A weighted least-squares regression of Z on the X_i may be used to esti-
mate the A_0, A_1, ..., A_m coefficients.  Many packaged computer routines,
such as SPSS, exist for performing this regression.
     The overall fit of the model may be assessed through the use of the
multiple R^2 statistic and plots of the residuals (calculated Z values
less estimated Z values).  The multiple R^2 statistic (and its associated
F value) indicates the percentage reduction in the variance of Z accounted
for by the regression.  The residuals pinpoint those data subsets for
which the model is particularly ill-fitting, perhaps indicating the
presence of "interactions" among variables.
     The partial R^2 statistics (and their associated F values) may be
used to single out the regression coefficients that are particularly
effective in accounting for the variability in the correlation coefficient.
This may allow a simplification of the model by reducing the number of
coefficients.
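The weighted least-squares fit described above can be sketched as follows (illustrative NumPy code standing in for a packaged routine such as SPSS; the function name and argument layout are assumptions of this presentation).

```python
import numpy as np

def fit_quasi_logistic(r, n, X):
    """Weighted least-squares fit of Z = ln[(1 + r)/(1 - r)] on the
    subset-defining variables X (shape (k, m)); r and n hold the sample
    correlation and number of data points for each of the k subsets.
    Weights are the reciprocal variances (n - 3)/4.
    Returns the coefficients [A0, A1, ..., Am]."""
    r, n = np.asarray(r, float), np.asarray(n, float)
    Z = np.log((1.0 + r) / (1.0 - r))
    w = np.sqrt((n - 3.0) / 4.0)               # sqrt of reciprocal variance
    A = np.column_stack([np.ones(len(r)), X])
    coef, *_ = np.linalg.lstsq(A * w[:, None], Z * w, rcond=None)
    return coef
```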
                         TECHNICAL REPORT DATA
            (Please read Instructions on the reverse before completing)

 1. REPORT NO.:              EPA-600/4-80-013b
 3. RECIPIENT'S ACCESSION NO.:
 4. TITLE AND SUBTITLE:      EVALUATION OF THE REAL-TIME AIR-QUALITY MODEL
                             USING THE RAPS DATA BASE
                             Volume 2.  Statistical Procedures
 5. REPORT DATE:             February 1980
 6. PERFORMING ORGANIZATION CODE:
 7. AUTHOR(S):               Harold S. Javitz and Ronald E. Ruff
 8. PERFORMING ORGANIZATION REPORT NO.:  Final Report, SRI Project 6868
 9. PERFORMING ORGANIZATION NAME AND ADDRESS:
                             SRI International
                             333 Ravenswood Avenue
                             Menlo Park, California 94025
10. PROGRAM ELEMENT NO.:     1AA603 A-26 (FY-77)
11. CONTRACT/GRANT NO.:      68-02-2770
12. SPONSORING AGENCY NAME AND ADDRESS:
                             Environmental Sciences Research Laboratory
                             Office of Research and Development
                             U.S. Environmental Protection Agency
                             Research Triangle Park, North Carolina 27711
13. TYPE OF REPORT AND PERIOD COVERED:  Final, 8/77-4/79
14. SPONSORING AGENCY CODE:  EPA/600/09
15. SUPPLEMENTARY NOTES:
16. ABSTRACT:
    The theory and programming of statistical tests for evaluating the
    Real-Time Air-Quality Model (RAM) using the Regional Air Pollution
    Study (RAPS) data base are fully documented in four volumes.
    Moreover, the tests are generally applicable to other model
    evaluation problems.  Volume 2 presents the tests considered for
    evaluating air-quality simulation models in general and those that
    would be useful in evaluating the RAM.  The capability of the RAM
    to predict sulfur dioxide (SO2) concentrations was of particular
    interest.  Specific tests for both intermediate and final
    evaluations are recommended, with accompanying descriptions of
    formats, plots, and procedures to establish confidence limits.
    Discussion focuses on the mathematics, procedures, and
    interpretation of the individual tests; their relevance to the
    project objectives; and possible trade-offs among tests.
17. KEY WORDS AND DOCUMENT ANALYSIS
    a. DESCRIPTORS:  Air pollution; Mathematical models; Evaluation;
       Tests; Computer systems programs; Statistical tests
    b. IDENTIFIERS/OPEN-ENDED TERMS:  Real-Time Air-Quality Model;
       Regional Air Pollution Study Data Base
    c. COSATI Field/Group:  13B, 12A, 14B, 09B
18. DISTRIBUTION STATEMENT:  RELEASE TO PUBLIC
19. SECURITY CLASS (This Report):  UNCLASSIFIED
20. SECURITY CLASS (This Page):    UNCLASSIFIED
21. NO. OF PAGES:  57
22. PRICE:

EPA Form 2220-1 (9-73)