RESEARCH REPORTING SERIES
Research reports of the Office of Research and Development, U.S. Environmental
Protection Agency, have been grouped into nine series. These nine broad cate-
gories were established to facilitate further development and application of en-
vironmental technology. Elimination of traditional grouping was consciously
planned to foster technology transfer and a maximum interface in related fields.
The nine series are:
1. Environmental Health Effects Research
2. Environmental Protection Technology
3. Ecological Research
4. Environmental Monitoring
5. Socioeconomic Environmental Studies
6. Scientific and Technical Assessment Reports (STAR)
7. Interagency Energy-Environment Research and Development
8. "Special" Reports
9. Miscellaneous Reports
This report has been assigned to the ENVIRONMENTAL MONITORING series.
This series describes research conducted to develop new or improved methods
and instrumentation for the identification and quantification of environmental
pollutants at the lowest conceivably significant concentrations. It also includes
studies to determine the ambient concentrations of pollutants in the environment
and/or the variance of pollutants as a function of time or meteorological factors.
This document is available to the public through the National Technical Informa-
tion Service, Springfield, Virginia 22161.
-------
EPA-600/4-80-013d
February 1980
EVALUATION OF THE REAL-TIME
AIR-QUALITY MODEL
USING THE RAPS DATA BASE
Volume 4. Evaluation Guide
by
Ronald E. Ruff
Atmospheric Science Center
SRI International
Menlo Park, California 94025
Contract No. 68-02-2770
Project Officer
John S. Irwin
Meteorology and Assessment Division
Environmental Sciences Research Laboratory
Research Triangle Park, North Carolina 27711
ENVIRONMENTAL SCIENCES RESEARCH LABORATORY
OFFICE OF RESEARCH AND DEVELOPMENT
U.S. ENVIRONMENTAL PROTECTION AGENCY
RESEARCH TRIANGLE PARK, NORTH CAROLINA 27711
-------
DISCLAIMER
This report has been reviewed by the Environmental Sciences Research
Laboratory, U.S. Environmental Protection Agency, and approved for publica-
tion. Approval does not signify that the contents necessarily reflect the
views and policies of the U.S. Environmental Protection Agency, nor does men-
tion of trade names or commercial products constitute endorsement or recom-
mendation for use.
-------
ABSTRACT
The theory and programming of statistical tests for evaluating the
Real-Time Air-Quality Model (RAM) using the Regional Air Pollution Study
(RAPS) data base are fully documented in four volumes. Moreover, the
tests are generally applicable to other model evaluation problems. Vol-
ume 4 discusses the application and interpretation of the statistical
programs, particularly with regard to use on the RAM. In general, there
is no set procedure for evaluating an air-quality model because of the
different reasons for evaluating models and many subjective decisions
to be made during the process. However, guidelines are presented to
cover a wide variety of evaluation needs, with attention to data prepara-
tion, classification, analysis, selection and application of tests, and
interpretation of results. Several methods of diagnosing causes of poor
model performance are discussed and some sample program outputs are also
provided. This report was submitted in fulfillment of Contract No.
68-02-2770 by SRI International under the sponsorship of the U.S. Envi-
ronmental Protection Agency. This report covers a period from 1
October 1977 to 1 April 1979, and work was completed as of 1 April 1979.
-------
CONTENTS
Abstract iii
Figures vi
Tables vi
1. Overview 1
Introduction 1
General procedures 1
2. The Model Evaluation Process 3
Discussion .....3
Data preparation 4
Data analysis 12
Selection and application of the
statistical tests 14
Interpretation of results 16
Diagnosing the causes of poor
model performance 25
Combinations of poor performance indicators .... 33
References 35
Appendices
A. SPSS multiple regression output
for test data base 36
B. Partial sample of output for SPSSSORT program 45
-------
FIGURES
Number Page
1 General procedure for evaluating models 2
2 RAM evaluation data bases 11
3 General logic to compute multihour averages 13
4 Frequency distribution for test data base 17
5 Residual frequency distribution for test data base 17
6 Actual versus ideal fit for test data base 19
7 Scattergram with bounds for the test data base 19
8 Autocorrelation function of test data base 21
9 Normalized cumulative periodogram for test data base .... 21
10 SPSS summary output for test data base 24
11 Types of bias 27
12 Temporal bias example 31
TABLES
Number Page
1 Wind-Speed Classes 6
2 Mixing-Depth Classes 6
3 Temperature Classes 7
4 Modified Test Data Base Hourly Format 9
5 Description of RAM Evaluation Format and Parameters 10
-------
SECTION 1
OVERVIEW
INTRODUCTION
In the previous volumes, a distinction was made between final-
evaluation statistics and intermediate-evaluation statistics. Final-
evaluation tests compute accuracy scores, which are used to compare a
model's performance against some standard, such as a competing model or
a different version of the same model. These accuracy scores do not pro-
vide any insight as to why a model is performing poorly (or well).
Instead, they provide a bottom-line figure (score) which sums up the
performance.
Intermediate evaluation statistics are used to diagnose the reasons
for a model's poor performance, either in general or for certain meteoro-
logical or emission conditions. In all, there are five tests that can
be used for intermediate evaluation: residual time series, chi-square
goodness-of-fit, bivariate regression and correlation, interstation error
correlation, and multiple regression of error residuals. Graph displays
are incorporated into most of the programs to facilitate interpretation.
GENERAL PROCEDURES
As will be discussed in the following pages, the general method of
applying the programs remains essentially the same regardless of the
specific air-quality model being tested. To adapt the programs to the
RAM evaluation, only minor software modifications must be made. These
include: (1) scaling the graphical displays, (2) annotating the output,
(3) adjusting DIMENSION statements to accommodate the proper number of
observations and stations, and (4) formatting the input data. The docu-
mentation on the above items is contained in Volume 3 of this report,
"Program User's Guide."
-------
As will be discussed in the following sections, the recommended model
evaluation procedure is an iterative process whereby results from one
test are used to decide which test to apply next. Before delving into
the specifics, consider the generalized procedure shown in Figure 1.
Our first step is to analyze the data base in a number of ways. After we
find out a little about the data base, we can make a well-founded decision
about which statistical test to apply. We then apply that test and ana-
lyze the results. Then we can opt to (1) end our evaluation, (2) conduct
another test, or (3) modify the model and proceed through the evaluation
process again.
[Flow chart: prepare and analyze data base; select the statistical
test; apply test; analyze results. Option 2 returns to select another
test; Option 3 modifies the model and repeats the evaluation.]
Figure 1. General procedure for evaluating models.
The flow chart in Figure 1 depicts a human-decision process. (It
is not a flow chart for a computer program.) Because subjective human
decisions are involved, there are many possible variations in the speci-
fic choices. Guidelines for these choices are presented in the following
sections.
-------
SECTION 2
THE MODEL EVALUATION PROCESS
DISCUSSION
As we discussed in the previous section, there will be many varia-
tions possible in our testing strategy for the RAM or any other model.
The particular strategy depends on (1) the user-defined evaluation cri-
teria, (2) the interpretation of intermediate test results, and (3) a
subjective evaluation of the best course of action. Item (1) above recog-
nizes that the most important evaluation criteria can vary among applica-
tions. For instance, one user might be most interested in evaluating
performance for a 24-hour averaging period while another may select a
1-hour averaging period. The interpretation of the results of a particu-
lar test, item (2), will determine which test will be conducted next and
how to stratify the data base accordingly. Finally, for item (3), we
recognize that there are many "gray" areas in selecting a future course
of action. After evaluation of some intermediate test results, one user
may choose to run additional tests while another may decide to modify
the model and then rerun the same test.
There are probably numerous reasonable approaches to the model-
evaluation process, each of which could produce the same successful
results. Therefore, our discussion by necessity must dwell on a general
testing procedure, with some assumptions on how the RAM test would pro-
ceed. We also attempt to discuss the various alternatives along the way.
For evaluating the RAM, we are assuming that the primary objective
is to improve the model for the one-hour averaging time. Secondary objec-
tives would be to compare the RAM with other models; assess its accuracy
against some user-defined standards; or evaluate it for other averaging
times, such as 3-hour, 24-hour, monthly, seasonal, and annual. For these
secondary objectives, the final evaluation statistics are recommended.
-------
As we will be noting in the following sections, some intermediate evalua-
tion statistics may also be helpful, but only to complement the final-
evaluation statistics.
In the remainder of this section we discuss the steps in the model-
evaluation process, starting with data preparation and continuing through
the last computation of the final-evaluation statistics.
DATA PREPARATION
In this volume we have assumed that the primary objective is to eval-
uate and improve the RAM by testing it with the RAPS data base. As a
start, the test data base consists of hourly values covering a one-year
period for (1) observed and predicted concentrations at 13 locations;
(2) wind measurements at these locations; (3) derived wind speed and
direction, atmospheric stability class, mixing height, and temperature
parameters representing the St. Louis area (as a whole); and (4) an
emission inventory. The 13 locations are for the monitoring stations in
the RAPS network. The observed concentrations are averaged from the SO2
monitoring instruments; the predicted concentrations result from applying
the RAM to the input data from (3) and (4) above.
Classification of Parameters
Except for atmospheric stability, all the RAM input data are numerical
values which are directly averaged from the instrument readings. To stra-
tify test results according to meteorological and/or emission conditions,
the data are divided into classes for each relevant RAM input parameter.
The particular parameters and recommended number of classes are:
Parameter Number of Classes
Atmospheric stability 7
Wind speed 6
Wind direction 16
Mixing height 7
Temperature 5
Emissions 16
-------
If we examine the number of combinations from the above classes, we see
that there are 376,320 possible categories. Further, we note that there
will be only 8,784 hourly SO2 concentrations for the yearly (1976) test
data base. Therefore, we will have to combine the parameter categories
in a reasonable manner so that within each category there is a statisti-
cally valid sample (e.g., 10 cases*). For our purposes, it is better not
to limit the number of categories at this time. This is best done after
the RAPS data base has been reviewed, as will be discussed later.
Our classification structure, at this point, is taken mostly from
the literature. As stated earlier, stability is classified in one of the
seven atmospheric stability categories for an urban area, as calculated
in the RAM preprocessor. Wind directions are divided into 16 sectors of
22.5° each, with the first sector centered on north.† Wind speed is
divided into six categories according to the scheme used by the Climato-
logical Dispersion Model.1‡ (See Table 1.) Mixing-depth classes corres-
pond to those used in the APRAC urban diffusion model.2 (See Table 2.)
Ambient temperature measurements are used both explicitly by RAM,
in the plume-rise calculation, and implicitly, in the emissions inven-
tory. (SO2 emissions from space-heating sources are computed when ambi-
ent temperatures are below 68°F, or 20°C.) Our temperature categorization
scheme uses the space-heating cutoff temperature, 20°C, as a dividing
point between the third and fourth categories. Table 3 describes the
temperature categories.
For emission categories we suggest using the actual RAPS emission
rates. Prior to analyzing the data base, we tentatively suggest dividing
the inventory into 4 categories of area-source emissions and 4 categories
* For statistical significance, the required number of cases depends on
  the standard deviation of the sample.
† In our terminology, this corresponds to the direction from which the
  wind is blowing.
‡ References are listed at the end of this report.
-------
TABLE 1. WIND-SPEED CLASSES
Wind-Speed Class    Speed Interval (knots)    Central Wind Speed (m/sec)

       1                     0-3                        1.50
       2                     4-6                        2.46
       3                     7-10                       4.47
       4                     11-16                      6.93
       5                     17-21                      9.61
       6                     >21                       12.51
TABLE 2. MIXING-DEPTH CLASSES
Mixing-Depth Class    Mixing-Depth Interval (meters)    Geometric Mean of Depth

        1                      <100†                            70.0*
        2                     100-200                          141
        3                     200-400                          283
        4                     400-800                          566
        5                     800-1,600                      1,131
        6                   1,600-3,200                      2,262
        7                      >3,200                        4,525*

* For classes 1 and 7, the geometric mean was calculated as
  though the classes were bounded.
† For the RAPS data base, mixing depths are a minimum of
  100 meters.
-------
TABLE 3. TEMPERATURE CLASSES
Temperature Class    Temperature Interval (°C)

        1                     <0
        2                     0-10
        3                    10-20
        4                    20-30
        5                     >30
of point-source emissions. Alternatively, one could use the four seasons
of the year, a weekend-weekday separation, and a day-night separation.
(This results in 16 possible emissions categories: 4 seasons X 2 weekend-
weekday X 2 day-night.) In addition, as mentioned earlier, ambient tem-
perature could also be used in conjunction with the time period to fur-
ther categorize emissions.
Formats
The format used for our test data base, described in Volume 3,
must be modified somewhat for evaluating the RAM. The difference is that
for RAM we have derived wind, temperature, stability, and mixing-height
data, which are used for model input. In addition we have wind measure-
ments averaged hourly for each station, which we also recommend incor-
porating into the format. (However, this is optional.) Therefore, the
recommended data base consists of values averaged hourly for the
following:
Interpolated wind-speed class
Interpolated wind-direction class
Mixing-height class
Temperature class
Site-specific wind speed (optional)
-------
Site-specific wind direction (optional)
RAM-predicted concentration for each site
RAPS-observed concentration for each site.
For the last four items, there will be 13 observations for each time
period. (This corresponds to the number of sites for which SO2 monitoring
data are available.) For the first four items there is just one observa-
tion per time period. We also suggest adding the following logistic
parameters:
Date (one per time period)
Time (one per time period)
Site identification (13 per time period).
Therefore, for a full (leap) year (8,784 hours), the size of our data
base is:
8,784 X 6 + 8,784 X 5 X 13 = 623,664 words*
The manner in which these are ordered is considered next.
The general format of the test data base was given in Volume 3.
To process the RAM data base efficiently, we suggest a slight modifica-
tion in the hourly average format as indicated in Table 4. The parameters
are defined in Table 5. Note that for each hourly time period there are
NS + 1 lines. (In a mass-storage file, line numbers are irrelevant;
however, they simplify our discussion here.) In general, the
first line of the group contains the categorized equivalent to the RAM
input data. The next 13 lines contain the observed and predicted concen-
trations and the wind measurements for each of the 13 stations.
* If we eliminate the site-specific wind speed and direction, the size
  reduces to 395,280 words (8,784 X 6 + 8,784 X 3 X 13).
-------
TABLE 4. MODIFIED TEST DATA BASE HOURLY FORMAT
Line            Data

 1              D1, T1, ASC1, WSC1, WDC1
 2              1, OC1,1, PC1,1, WS1,1, WD1,1
 3              2, OC1,2, PC1,2, WS2,1, WD2,1
 .              .
 NS + 1 (14)    NS, OC1,NS, PC1,NS, WSNS,1, WDNS,1
 NS + 2 (15)    D2, T2, ASC2, WSC2, WDC2
 16             1, OC2,1, PC2,1, WS1,2, WD1,2
 .              .
 2NS + 2 (28)   NS, OC2,NS, PC2,NS, WSNS,2, WDNS,2
 29             D3, T3, ASC3, WSC3, WDC3
-------
TABLE 5. DESCRIPTION OF RAM EVALUATION FORMAT AND PARAMETERS
[The table body, which defines the parameters appearing in Table 4
(date, time, stability, wind, and concentration fields), is not
legible in the source scan.]
-------
For the RAM evaluation, data exist in three forms, as illustrated
in Figure 2. These data will be supplied in three segments, as shown
on the left-hand side of the figure. The data will be processed into a
single data base in the format described in the previous paragraph. Fur-
thermore, we recommend an abbreviated data base consisting only of the
observed and predicted concentrations covering the time period of interest
for the 13 stations (e.g., for the year). Because most of the statistical
tests use only the observed and predicted concentrations, the abbreviated
data bases are generated to reduce processing costs. For the year, the
maximum size of this test data base is 228,384 words (8,784 hrs X 2 con-
centrations X 13 sites). However, some of these data bases will be segre-
gated by given meteorological and emissions conditions and thus could be
much smaller in size. The SPSS software described in Volume 3 can be
used to segregate the data.
Calculation of Nonhourly Averages
The statistical package can be used to evaluate the RAM for non-
hourly averages. Typical averaging periods of interest would include
3-hour, 24-hour, monthly, seasonal, and annual. To form the data base
for these periods, a simple FORTRAN program that converts the hourly
averaged data into the desired form is recommended.
[Diagram: RAM input data, RAPS monitoring data, and RAM predictions
are combined into the RAM evaluation data base, from which abbreviated
data bases are extracted.]
Figure 2. RAM evaluation data bases.
-------
Figure 3 illustrates a suggested logic for the FORTRAN program,
which assumes that the data are time-sequential for each group of 13
sites. The date and time are also needed because the time index is com-
puted from the time, for the 3-hour average, and from the date, for the
24-hour, monthly, seasonal, and annual averages. (For instance, if we
assume time ranges from 0000 to 2300, the 3-hour average time index can
be computed from the following: 1 + ITIME/3, taking ITIME as the hour of
day, 0 through 23. Since the second term will truncate to an integer, we
note that the index will range between 1 and 8.) The time index is needed
only to indicate when to compute the
specified average. After the time index is computed, the program deter-
mines if the index has changed. If it has not, the observed (OC) and
predicted concentrations (PC) are accumulated, along with the count. If
the time index has changed, the observed and predicted concentrations are
averaged (if valid) and written on the output file. This process conti-
nues until an end-of-file is encountered.
Checks are needed to separate out bad data (denoted by a "-99") so
they are not counted in the multihour average. In addition, after the
time index changes, the number of data points comprising the average must
be checked. Do we need three for the 3-hour average, or will two suffice?
Such criteria must be established for each average of interest. For the
RAPS data base, these criteria will be fairly rigid, provided we can get
a valid statistical sample.
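To make the logic concrete, the sketch below implements the 3-hour case
of Figure 3 in Fortran. The file names, the list-directed record layout
(IDATE, ITIME, ISITE, OC, PC), and the rule that two of three hours
suffice are illustrative assumptions, not part of the recommended
program; daily and longer averages would compute the index from IDATE
instead of ITIME.

    ! Minimal Fortran sketch of the Figure 3 logic for 3-hour averages.
    ! Assumptions (not from the original program): input file HOURLY.DAT
    ! with one record per site and hour, times coded 0000-2300, 13 sites,
    ! bad data flagged -99, and two of three hours accepted as valid.
    program multihour
      implicit none
      integer, parameter :: nsite = 13, navg = 3
      integer :: idate, itime, isite, idx, iold, ios, i
      integer :: n(nsite)
      real    :: oc, pc, o(nsite), p(nsite)

      open (10, file='HOURLY.DAT', status='old')
      open (20, file='AVG3HR.DAT', status='replace')
      n = 0;  o = 0.0;  p = 0.0;  idx = -1;  iold = -1
      do
         ! IDATE is read but would be used only for daily/monthly indices.
         read (10, *, iostat=ios) idate, itime, isite, oc, pc
         if (ios == 0) idx = 1 + (itime / 100) / navg  ! hours 0-23 -> 1..8
         if (ios /= 0 .or. (iold /= -1 .and. idx /= iold)) then
            do i = 1, nsite          ! index changed or EOF: flush averages
               if (n(i) >= navg - 1) then          ! assumed validity rule
                  write (20, *) i, o(i) / n(i), p(i) / n(i)
               end if
            end do
            n = 0;  o = 0.0;  p = 0.0
            if (ios /= 0) exit       ! end-of-file encountered: done
         end if
         iold = idx
         if (oc > -99.0 .and. pc > -99.0) then      ! skip bad data (-99)
            n(isite) = n(isite) + 1
            o(isite) = o(isite) + oc
            p(isite) = p(isite) + pc
         end if
      end do
    end program multihour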
DATA ANALYSIS
Prior to conducting statistical tests, we recommend first analyzing
the data in an elementary manner to gain a better perspective on the
distribution of the observed and predicted concentrations and the cate-
gorized RAM input parameters. The EPA program, stored in SRI.FREQ (as
documented in Volume 3, p. 13), produces distributions on logarithm-
frequency axes. The SPSS3 procedure, FREQUENCIES, can be used to obtain
line-printer histograms for the RAM input parameters (see Volume 3,
p. 10).
-------
[Flow chart:
 1. Specify averaging time.
 2. Read IDATE, ITIME, ISITE, OC, PC.
 3. Compute time index.
 4. If the time index has not changed and end-of-file has not been
    encountered, accumulate N(ISITE) = N(ISITE) + 1,
    O(ISITE) = O(ISITE) + OC, P(ISITE) = P(ISITE) + PC; return to step 2.
 5. Otherwise, if there are enough values, average
    O(ISITE) = O(ISITE)/N(ISITE), P(ISITE) = P(ISITE)/N(ISITE), and
    write ISITE, O(ISITE), P(ISITE).
 6. Change the time index and continue until end-of-file.]
Figure 3. General logic to compute multihour averages.
-------
The distribution of concentrations, observed and predicted, will
show whether it is worthwhile to proceed further with our analysis. If
the RAPS data base is to prove useful, the distribution must cover a
reasonable range of observed concentrations (preferably from zero through
levels above the air-quality standards) and a statistically sound sample
size of these concentrations. This distribution should be plotted for
all 13 sites to provide insight into the usefulness of the data base for
each site.
The line-printer histograms of the RAM input parameters indicate how
representative our parameter classification scheme is. (The analysis may
suggest that a different scheme might be more appropriate. For instance,
if 90 percent of the temperatures are contained in two categories we may
choose to assign narrower ranges to the categories.) The histograms will
also be useful at a later stage in our analysis when we are relating
residuals (differences between observed and predicted concentrations) to input-
parameter categories.
SELECTION AND APPLICATION OF THE STATISTICAL TESTS
The basic order of events in evaluating and improving the RAM is
Apply the final-evaluation statistics
Run the intermediate-evaluation statistics
Interpret the results
Run tests on subgroups
Modify the model (RAM)
Apply the final-evaluation statistics to the modified model
Repeat the above process until the desired improvement is
achieved.
In the following subsections, we examine each of the above steps in more
detail.
-------
Application of the Final-Evaluation Statistics
The final-evaluation statistics are the gauge by which we judge a
model's overall performance. Because our goal is to improve the RAM,
its current performance becomes the baseline from which to compare future
"improved" versions. Hence, the final-evaluation statistics should be
applied for each version of the model.
There are eight unique tests in our final-evaluation package.
Because the incremental (computer) cost per test is small compared to the
input-output costs, we suggest application of most tests with exceptions
as noted below. The following parameter values are suggested (Refer to
Appendix B of Volume 2):
Test 1 - No user-specified parameters required.
Test 2 - No user-specified parameters required.
Test 3 - The absolute-error threshold, which is set at 10 ppb (or
26.1 ug/m3); this error threshold is consistent with instrument
accuracy determined in the quality-assurance program.
Test 4 - The percentage-error threshold, which is set at 0.20.
Using the percentage error can be misleading if the data base is
biased toward lower concentrations, which is usually the case.
Therefore, after analyzing the RAPS data base, we may suggest
omitting this test.
Test 5 - As a start, set the cutoff = 500 ppb (or 1,307.5 ug/m3),
which corresponds to the Federal Air Quality Standard averaged
over 3 hours. This may have to be modified downward if there are
not enough concentrations at or above 500 ppb (say, a minimum of
200 hourly averages per station).
Test 6 - The same criterion (Test 5) for the cutoff value applies
here; arbitrarily, we suggest a value of 1 for the first weighting
parameter and 2 for the second.
However, this test is very subjective and may not be too useful
in the RAM evaluation.
Test 7 - Again, this is a very subjective test requiring the entry
of a 5 x 5 loss matrix. Arbitrarily, we recommend the symmetric
loss matrix shown in Appendix B of Volume 2.
Test 8 - This loss function depends on the distance between the
maximum-observed and maximum-predicted locations. Therefore, we
recommend a 13 X 13 loss matrix consisting of the distances
between the RAPS monitoring stations, as sketched below.
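Building that matrix reduces to computing interstation distances. In the
Fortran sketch below, the coordinate arrays are hypothetical placeholders;
the actual values would come from the RAPS network description.

    ! Sketch of the 13 x 13 loss matrix for Test 8: entry (i,j) is the
    ! distance between RAPS stations i and j.  XKM and YKM are assumed,
    ! hypothetical station coordinates in kilometers.
    subroutine lossmatrix(xkm, ykm, ns, d)
      implicit none
      integer, intent(in)  :: ns
      real,    intent(in)  :: xkm(ns), ykm(ns)
      real,    intent(out) :: d(ns, ns)
      integer :: i, j

      do i = 1, ns
         do j = 1, ns
            d(i, j) = sqrt((xkm(i) - xkm(j))**2 + (ykm(i) - ykm(j))**2)
         end do
      end do
    end subroutine lossmatrix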
-------
In summary, we suggest conducting the final-evaluation tests for
each version of the RAM. For reasons stated above, Tests 4, 6, and 7
should be less valuable than the others.
Application of the Intermediate-Evaluation Statistics
Each of the intermediate evaluation statistics can tell us something
different about the RAM's performance. Hence, at least for the first
application of intermediate statistics, it is a good idea to run all the
tests. In conjunction with the displays described on page 12, the
user now has a complete record of statistical figures-of-merit, associated
graphic displays and auxiliary displays.
The actual mechanics of running each test are documented in Volume
3. We now focus our attention on how to interpret and categorize the
results, prior to conducting more tests and subsequently modifying the
RAM.
INTERPRETATION OF RESULTS
There is some redundancy in our statistics package, which means that
conclusions about RAM performance may be arrived at in a number of dif-
ferent manners. When evaluating poor performance, we are looking for
one or more types of poor-performance indicators. For each such indica-
tor, there are one or more statistical tests that will tell us something
about the RAM performance. These indicators of poor performance include:
General bias--if the residuals between observed and predicted
concentrations are mostly of the same magnitude and sign, then
we have a general bias. General bias will be graphically dis-
played by the following:
- Logarithm-frequency plots of residuals or observed and pre-
dicted concentrations on the same plots. These plots for the
test data base are shown in Figures 4 and 5, respectively.
From Figure 4 we note, for the 10-percent point, that the dif-
ference is less than a factor of 2. Thus, for over 90 percent
of the time the agreement between observation and prediction
is within a factor of 2. Figure 5 presents another view of
the above comparison by displaying the distribution of the
residual concentration (magnitude and sign).
-------
FREQUENCY DISTRIBUTION FOR OBSERVED AND PREDICTED CONCENTRATION
[Plot of the observed and predicted concentration distributions versus
cumulative frequency (%).]
Figure 4. Frequency distribution for test data base.
[Plot of the residual (observed minus predicted) concentration
distribution versus cumulative frequency (%).]
Figure 5. Residual frequency distribution for test data base.
-------
- Chi-square test. The graphic output of this test closely par-
allels that of the log-frequency plots above, but is not as
effective in displaying bias.
- Bivariate regression and correlation test. Two of the plots
resulting from application to the test data base are presented
in Figures 6 and 7. In Figure 6, a comparison between the
least-squares regression line and "ideal" agreement line pro-
vides perhaps the most valuable indicator of bias. Bias is
indicated if the difference is large between the two. The 95-
percent confidence level shown in Figure 7 should be analyzed
prior to drawing any conclusions.
Spatial bias--is similar to general bias, but differs in that it
focuses on the spatial distribution. Mainly, we wish to deter-
mine if there is a spatial pattern to bias. While clever appli-
cation of other tests could serve the purpose, the interstation
error correlation test specifically addresses spatial bias. Also,
the graphical output feature from the accuracy score tests pro-
vides a visual correlation for "spatial" performance.
Temporal bias--the residual time series test is the only test
specifically designed to provide a direct indicator of temporal
(or cyclic) bias.
Randomness--the bivariate regression and correlation test provides
the best indicators. A low Pearson's correlation coefficient
summarizes "randomness", and the scatterplot provides a visual
indication.
Input-parameter dependence--the multiple regression of error
residuals test specifically assesses the effect of each input
parameter using an SPSS procedure.3 In addition, it ranks each
parameter's relative contribution to the error residual. Other
statistical tests will also be used for this purpose. However,
these other tests must first be applied to subsets of the data
base that are categorized by input parameters; then comparisons
among categorized results are made. We discuss this further
later in this section.
The above breakdown summarizes what is initially looked for in performance
indicators.
Up to now we have been concentrating on application of the tests to
the entire data base. The preceding paragraph discusses the general types
of interpretation that lend themselves to our model evaluation process.
In the next subsection, we describe what actions can be taken after this
interpretation phase. First, however, we will look at each test individ-
ually and discuss how to interpret the results. (Refer to Volume 2 for
a more comprehensive description.)
-------
LINEAR REGRESSION ANALYSIS, CASE 2
[Scatterplot of predicted versus observed concentration (0-600 units)
with the least-squares regression line and the ideal-fit line.]
Figure 6. Actual versus ideal fit for test data base.
REGRESSION ANALYSIS, CASE 6
[Scattergram of predicted versus observed concentration with the
least-squares line and confidence and probability bounds.]
Figure 7. Scattergram with bounds for the test data base.
-------
Accuracy
The output of each accuracy score test is a number that becomes
lower as agreement between observed and predicted concentrations becomes
better. A confidence interval is also computed. A narrow range about the
accuracy score is desired; the narrower the range, the more confidence we
have in the accuracy score computation.
Residual Time Series
Sample outputs for the residual time series test are given in Fig-
ures 8 and 9. The autocorrelation plot is examined to see if there is a
large cyclic component in the error residual. The normalized cumulative
periodogram yields much the same information but provides a cumulative
display. The autocorrelation function is somewhat easier to interpret.
Here one looks for the highest "spike" and, if it is above the 95-percent
confidence line, one can assume that the model performance is significantly
poorer for that time lag. In our example (Figure 8), we see that the
autocorrelation is greatest for a time lag of 1.
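For reference, the core of this computation is small enough to sketch
directly. The following Fortran routine computes the sample
autocorrelation of a residual series; the +/-1.96/sqrt(N) large-sample
bound is our assumed form of the 95-percent confidence line, not
necessarily the exact expression used by the packaged test.

    ! Sample autocorrelation of the residual series R(1..N), as plotted
    ! in Figure 8.  The confidence bound is an assumed large-sample form.
    subroutine autocorr(r, n, maxlag, a, conf95)
      implicit none
      integer, intent(in)  :: n, maxlag
      real,    intent(in)  :: r(n)
      real,    intent(out) :: a(maxlag), conf95
      real    :: rbar, var, cov
      integer :: k, t

      rbar = sum(r) / n
      var  = sum((r - rbar)**2)           ! n times the sample variance
      do k = 1, maxlag
         cov = 0.0
         do t = 1, n - k
            cov = cov + (r(t) - rbar) * (r(t+k) - rbar)
         end do
         a(k) = cov / var                 ! autocorrelation at lag k
      end do
      conf95 = 1.96 / sqrt(real(n))       ! spikes beyond this are significant
    end subroutine autocorr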
Chi-Square Goodness-of-Fit
The chi-square goodness-of-fit compares the distribution of observed
and predicted concentrations (example outputs are given in Volume 3).
As noted before, the logarithm-frequency plots provide the same graphical
information in much the same form. The chi-square statistic summarizes the
overall agreement of the two distributions in a single figure of merit.
Like the accuracy score tests, a lower chi-square value corresponds to
better agreement.
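A minimal sketch of the comparison follows; the actual binning used by
the packaged test is defined in Volume 2, so the equal-width bins and the
use of predicted-bin counts as expected frequencies here are illustrative
assumptions.

    ! Chi-square comparison of the observed and predicted concentration
    ! distributions, using NBIN equal-width bins on (0, CMAX) -- an
    ! assumed binning, not necessarily that of the packaged program.
    real function chisq(obs, pred, n, nbin, cmax)
      implicit none
      integer, intent(in) :: n, nbin
      real,    intent(in) :: obs(n), pred(n), cmax
      integer :: fobs(nbin), fpred(nbin), i, j

      fobs  = 0
      fpred = 0
      do i = 1, n                         ! bin both distributions
         j = max(1, min(nbin, 1 + int(obs(i) / cmax * nbin)))
         fobs(j) = fobs(j) + 1
         j = max(1, min(nbin, 1 + int(pred(i) / cmax * nbin)))
         fpred(j) = fpred(j) + 1
      end do
      chisq = 0.0                         ! lower value: closer agreement
      do j = 1, nbin
         if (fpred(j) > 0) then
            chisq = chisq + real(fobs(j) - fpred(j))**2 / real(fpred(j))
         end if
      end do
    end function chisq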
Bivariate Regression and Correlation
The bivariate regression and correlation test is perhaps one of the
most useful of our intermediate evaluation statistics. The output plots
contain many useful visual displays to go along with useful numeric indi-
cators. Individual features are described below:
Pearson's correlation coefficient is an indicator of the linear-
ity between predicted and observed concentrations. As such, it
-------
TIME SERIES AUTOCORRELATION, CASE 1
[Plot of the residual autocorrelation versus time lag (0-16 hours),
with 95-percent confidence limits.]
Figure 8. Autocorrelation function of test data base.
TIME SERIES, CASE 2, STATION 1
[Normalized cumulative periodogram versus frequency (0.0-0.5); dashed
lines mark the 95% confidence limits.]
Figure 9. Normalized cumulative periodogram for test data base.
-------
becomes an invaluable detector of randomness. The highest score,
unity, basically rules out a random lack of agreement, while the
lowest score, zero, indicates a completely random lack of agreement.
The scatterplot is a visual display of the point-by-point compar-
ison of predicted versus observed concentrations. It is best
used with an auxiliary regression analysis.
Regression analysis is another indicator of general bias. The
slope and intercept of the least-squares line numerically quantify
the bias. As shown in Figure 6, the agreement between the
least-squares (bias) line and the ideal (no bias) line can be
visually compared on the scatterplot. The 90-percent confidence
bounds (Figure 7) can also be displayed. These bounds tell the
viewer that the real least-squares line will fall within the
designated bounds at a statistically significant level. The 90-
percent probability bounds tell the viewer that there are 9 chan-
ces out of 10 that a new point will fall within those designated
bounds.
User-supplied sensitivity bounds can also be entered. These sen-
sitivity bounds essentially establish a dead zone on the scatter-
plot. In the subsequent calculations, the regression coefficients
are based solely on the points falling outside the zone defined
by dashed lines on the scatterplot. Thus, we can determine a
trend in the so-called "outlying" comparison points.
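The numeric quantities behind these displays (slope, intercept, and
Pearson's r) reduce to a few sums. The sketch below computes them for
predicted (y) on observed (x) concentrations; unbiased agreement gives a
slope near 1 and an intercept near 0.

    ! Least-squares fit of predicted (y) on observed (x) concentrations
    ! and Pearson's correlation coefficient, the numeric core of the
    ! bivariate regression and correlation display.
    subroutine bivar(x, y, n, slope, yint, r)
      implicit none
      integer, intent(in)  :: n
      real,    intent(in)  :: x(n), y(n)
      real,    intent(out) :: slope, yint, r
      real :: xbar, ybar, sxx, syy, sxy

      xbar = sum(x) / n
      ybar = sum(y) / n
      sxx  = sum((x - xbar)**2)
      syy  = sum((y - ybar)**2)
      sxy  = sum((x - xbar) * (y - ybar))
      slope = sxy / sxx                  ! ideal (no-bias) value: 1
      yint  = ybar - slope * xbar        ! ideal (no-bias) value: 0
      r     = sxy / sqrt(sxx * syy)      ! Pearson's correlation
    end subroutine bivar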
Interstation Error Correlation
Pearson's correlation coefficient is computed for the residual con-
centrations between each pair of monitoring stations. If all correlations
are about the same, we can probably conclude that the model is erroneously
biasing the results in a uniform manner for all sites. If there is a
large variation in the correlation, then we should examine the results
for trends, such as differences between urban and rural sites, proximity
to known local meteorological or emission factors, and so forth.
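The matrix itself is simply Pearson's r applied to every station pair. A
sketch follows, assuming the residual series are aligned in an array
RES(t,i) with missing hours already removed.

    ! Interstation error correlation: Pearson's r of the residual time
    ! series for every pair of the NS monitoring stations.  RES(t,i) is
    ! assumed to hold the hour-t residual at station i.
    subroutine interstation(res, nt, ns, c)
      implicit none
      integer, intent(in)  :: nt, ns
      real,    intent(in)  :: res(nt, ns)
      real,    intent(out) :: c(ns, ns)
      real    :: avg(ns), dev(nt, ns), ss(ns)
      integer :: i, j

      do i = 1, ns
         avg(i)    = sum(res(:, i)) / nt
         dev(:, i) = res(:, i) - avg(i)
         ss(i)     = sum(dev(:, i)**2)
      end do
      do i = 1, ns
         do j = 1, ns                    ! c(i,i) = 1 on the diagonal
            c(i, j) = sum(dev(:, i) * dev(:, j)) / sqrt(ss(i) * ss(j))
         end do
      end do
    end subroutine interstation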
Multiple Regression of Error Residuals
The multiple regression of error residuals test is the only device
to explicitly evaluate model performance as a function of the input
parameters. (The other tests can be used for the same purpose but the
data base must first be subdivided by parameter.) For the multiple
regression test, essentially all model input parameters become independent
-------
variables in the linear multiple regression equation; the concentration
error residual is the dependent variable. The SPSS program computes the
coefficients in the regression equation, the multiple correlation coeffi-
cient, and the contribution to this coefficient by each parameter. Hence,
we have explicit indicators of which parameters contribute most to a
model's poor performance.
Let us consider our example, in which the multiple regression test
is applied to our test data base. The SPSS summary output for the test
data base is given in Figure 10. There are three independent variables:
Atmospheric-stability class (ASC)
Wind-direction class (WDC)
Wind-speed class (WSC).
The summary table illustrates the multiple correlation coefficient
(MULTIPLE R) in the first column. The SPSS program arranged the param-
eters in the order of their contribution to MULTIPLE R. By far the most
substantial contribution comes from the ASC parameter, with a value of
0.62443. Going down the list, the contribution from each of the other
parameters is negligible because the MULTIPLE R increases only to
0.62586. For another way of looking at the results, note that the R
SQUARE column shows that the first parameter, ASC, explains about 39
percent of the variance in the residuals; the other two, WDC and WSC, do
not substantially account for more of the variance. The third and fourth
columns, RSQ CHANGE and SIMPLE R, give the incremental change to the
MULTIPLE R and the bivariate correlation coefficient, respectively.
The fifth column, B, gives the regression coefficients. Hence, the
linear equation becomes:
RESIDUAL = 487.89 - 125.27 ASC + 5.21 WDC + 2.23 WSC + ε1
where ε1 is the residual, or the number required to equate both sides of
the equation. However, the table shows that the contribution from the
-------
[SPSS multiple regression summary table for the test data base,
listing MULTIPLE R, R SQUARE, RSQ CHANGE, SIMPLE R, and B for the
variables ASC, WDC, and WSC.]
Figure 10. SPSS summary output for test data base.
-------
last two parameters is negligible. Therefore, as a good approximation,
our equation simplifies to:
RESIDUAL = 487.89 - 125.27 ASC + ε2
where ε2 differs from ε1.
Basically, we are being told that improvement in the ASC parameter
alone offers the potential for decreasing the variance in concentration
residuals. Therefore, our approach would be to improve our scheme in
estimating atmospheric stability (i.e., improve our model). On the other
hand, it appears that our parameterization of wind speed and wind direc-
tion is all right. Therefore, little is to be gained from improving the
model interface for wind speed and direction.
The summary table is only one part of the output resulting from the
multiple regression test. The remaining output and further description
of the SPSS results for the test data are given in Appendix A. In par-
ticular, we discuss statistical significance (the F statistic).
DIAGNOSING THE CAUSES OF POOR MODEL PERFORMANCE
The example of the previous subsection illustrates results that were
quite easy to interpret--namely, that atmospheric stability parameteriza-
tion was a primary contributor to poor model performance. The results
from the multiple regression tests will not always be so easy to inter-
pret. Instead, they will be used to augment results from other tests to
obtain a clear definition of the causes of poor model performance.
Our recommended approach in evaluating RAM is first to conduct any
test that will provide useful insight into the problem prior to attempt-
ing any improvements to the RAM. The rationale is simplein contrast to
the model modification process, the statistical tests are inexpensive to
run. After any model modification, it is necessary to generate a new
test data base by running RAM using RAPS data for one year. This is a
very expensive process which should take place only after we have obtained
the maximum information from our diagnostic tests.
-------
Diagnosis of the causes of poor model performance will normally be
an iterative process, as shown in Figure 1 (Section 1). The iterative
diagnostic process is illustrated as OPTION 3. As stated earlier in Sec-
tion 2, we recommend application of all six tests to the entire data base
as a first step in the diagnostic process. The results are then analyzed
for our indicators of poor model performance: general bias, spatial bias,
temporal bias, and randomness (see Section 2). Next, we must use our
tests to relate these indicators to the causes of poor performance in
terms of RAM parameters. There is no predefined prescription of how to
proceed with our diagnosis. However, if we look at each of the indicators
separately, there are certain recommendations that can be made on how to
apply and interpret further tests.
In the following subsections, we address possible actions to explain
and diagnose indicators of poor model performance.
General Bias
General bias exists when predicted and observed concentrations differ
by a uniform amount (high or low) throughout the data base. This type of
bias can easily be diagnosed by several of our tests, as described ear-
lier. Using assumed displays from the bivariate regression and correla-
tion test, let us consider various ways bias can manifest itself by com-
paring the least-squares regression line and the ideal fit. Figure 11
shows three different types of overprediction bias. Of the three, the
second (diverging) is generally the most common in model evaluation.
(This happens because the general wind direction is usually known, so
the model correctly predicts low concentrations when the monitor is
upwind of the source area and recording low concentrations. For downwind
conditions, the full diffusion equation comes into play, and small errors
in formulation or input parameters lead to large errors in the predicted
concentrations.) The problem is, once we have
detected a bias, what action do we take to explain it?
-------
[Figure panels: (a) constant offset, (b) diverging, (c) converging.]
Figure 11. Types of bias.
At this point we must go into more detail. The first question con-
cerns whether we are dealing with general bias or spatial bias, so the
next step is to run the bivariate regression and correlation test for each
monitoring site. If the comparisons between the ideal and least-squares
lines are essentially the same for each station, then we have general
bias. Next, we have to try to isolate the cause.
Results from other tests can help in diagnosing the causes of general
bias. In particular, the correlation of error residuals among monitoring
sites (interstation error correlation) can provide useful information.
If all correlations are high, we are being told that the overprediction
(or underprediction) generally exists for all stations at the same time.
This not only confirms our diagnosis of general bias but also tells us
that the causes of mispredictions are probably the same for all locations.
Next, we examine the results of the multiple regression of error residuals
test. This will tell us which RAM input parameter is most highly corre-
lated with the error residuals. For example, we might find that the error
residuals are inversely correlated to wind-speed class (WSC) or, as in
our test data base, atmospheric stability class (ASC).
-------
Unexplainable Biases--
At this point, let us depart from the above example and recognize
one limitation of our diagnostic tests for finding correlations between
RAM input parameters and error residuals. If a RAM input parameter
causes a high bias throughout the data base, none of our tests may help
us explain this. For instance, consider the relationship between pre-
dicted concentration (X) and emissions (Q) as follows:
X = kQ f(M)
where f(M) is the meteorological function and k is a constant. If Q is
biased high, then X will always be high if all other representations are
accurate. If Q is 25 percent high, then a constant 25-percent offset
would exist in our error residuals. Our tests are geared to correlate
error residual changes (variance) to input parameters. For a constant
25-percent offset in the error residuals, there would be no change and
hence no correlation.
It is unlikely that the above condition will exist in reality; emis-
sions will most likely be biased differently according to the type of
source. Hence, certain stations would exhibit a different bias from
others because they would be influenced by different sources.
Second Iteration--
After the detection of general bias and analysis of other results,
we will want to take a closer look at the dependence of error residuals
on RAM input parameters. As stated, the multiple regression test pro-
vides us with some indication of input parameter dependence. For general
bias, we will want to stratify the data base by classes of that parameter
which influences the error residual most (as determined by the multiple
regression of error residuals test).
Stratification of the data base can be accomplished by the
STAT06*SPSSSORT program (see Volume 3). In our previous example, where
the multiple regression test showed a high correlation between error
-------
residuals and wind-speed class, we might choose to conduct our evaluations
with the data base divided into low and high wind-speed categories. We
would then apply our SPSSSORT program twice and generate two files, one
for wind-speed classes of 1, 2, and 3, and one for classes 4, 5, and 6.
We can then apply our bivariate regression and correlation tests to each
file separately. If we find that one file shows good agreement while the
other shows poor performance, we have identified problems in the formula-
tion. Depending on how clear-cut the results are, we may wish to conduct
further tests or proceed to consider modifications.
It must be recognized that there is no preset prescription for diag-
nosing model performance, and that our description here is for illustra-
tive purposes only.
Spatial Bias
Probably the best initial test for spatial bias is the interstation
error correlation test. As stated in Section 2, the output correla-
tion matrix indicates the existence of a general bias if the correlation
coefficients among all stations are high (say, 0.5 or above). Conversely,
low correlation coefficients suggest the possibility of spatial bias.
However, further test results are needed to confirm our suspicion because
low correlation coefficients also result when any of the other indicators
(e.g., randomness) of poor model performance exist.
The map display of the accuracy score test can also lend insight
into spatial bias. If accuracy scores vary greatly from station to sta-
tion, we again suspect spatial bias, but need to confirm our suspicion.
Perhaps the best indicators for spatial bias are the bivariate correlation
and regression displays applied to each station. The test for site-
specific bias is then exactly the same as it was for general bias (see
Section 2).
For the RAM evaluation, we must establish the degree of spatial bias
for 13 stations. Next, our results must be categorized--namely, which
stations can be thought of as being similar in performance? Referring
-------
to our least-squares test versus the ideal fit (Figures 6 and 7), we can
broadly categorize stations according to actual and ideal agreement.
Results from the interstation error correlation test provide indicators
of which stations behave similarly. Hence, we have the means for esta-
blishing reasonable categorizations of stations with similar performance.
Using the resulting categorizations, we look for rather obvious dif-
ferences among locations, such as:
Urban versus rural sites
Eastern versus western sites
Local influences (anomalous locations)
Distribution in relation to large point-sources.
Furthermore, if we can group by similar stations, the next step is to use
our sorting program to establish data bases for these similar stations.
Then we apply the multiple regression of error residuals tests to these
smaller data bases and proceed as we described in Section 2 (General
Bias). It must be recognized, however, that each categorized data base
may lead to different conclusions and subsequent remedial actions.
Before concluding our description, we caution that it is possible to
have a situation in which certain stations exhibit spatial bias while
others do not; other sites, for instance, could exhibit completely random
but unbiased behavior.
Temporal Bias
The residual time series test is the only one capable of directly
detecting temporal bias. Either the autocorrelation function or the nor-
malized cumulative periodogram contains the same diagnostic information.
Consider the example for the autocorrelation function A(k) in Figure 12.
Temporal bias manifests itself when A(k) exceeds the 95-percent confidence
limits. In Figure 12, temporal bias is shown to occur every 5 hours.
-------
[Plot of the autocorrelation A(k), from -1.0 to 1.0, versus time lag
(1-8 hours), with the 95-percent limits marked.]
Figure 12. Temporal bias example.
Once detected, the cause of temporal bias must be diagnosed. We can
look for the obvious, such as weekend versus weekday (lags of 6 or 7
days), day versus night (lags of 12 hours) or daily differences (lags of
1 day). When obvious conclusions cannot be reached, we suggest conducting
tests for randomness.
Randomness
When randomness exists in the absence of bias (general or spatial),
we would expect large error residuals but fairly good agreement between
observed and predicted concentration distributions. Likewise, the least-
squares regression line could compare quite favorably with the ideal line;
however, the scatter by visual inspection would appear quite high and the
correlation coefficient would be low (about 0.6 or less).
After randomness is detected, we suggest analyzing those points out-
side the 90-percent probability bounds (or subjectively selected sensi-
tivity bounds) in conjunction with the results from our multiple regres-
sion of error residuals test, and with the results from the previous
analysis of our data base (Section 2). Basically, we wish to determine
which model input conditions exhibit the most influence on poor perfor-
mance. Therefore, by comparing the input parameter distributions for
-------
points outside the probability (or sensitivity) bounds with the input
parameter distributions for the entire data base, we can determine which
conditions most often lead to poor performance.
Let us consider the specific, step-by-step procedure along with an
example from our test data base to illustrate the recommendations in the
preceding paragraph:
Step 1: Conduct the bivariate regression and correlation test, plotting
either probability or sensitivity bounds on the scatter-plot.
For our example, consider those points outside the probability
bounds (see Figure 7).
Step 2: Determine some logical criteria for specifying the majority of
points falling outside the probability bounds. Referring to
Figure 7, we see that most of the outlying points occur when
predicted concentrations exceed 400 units and observed concen-
trations are below 300 units.
Step 3: Modify the SPSSSORT program so that it displays the input
parameters for those cases that meet the criteria established
in Step 2. For application to our test data base, we added the
following statements to our SPSSSORT program:
*SELECT IF (PC GT 400 AND OC LT 300)
LIST CASES VARIABLES=ALL
FREQUENCIES INTEGER=WSC(1,6)
The first statement performs the sorting of the data base, given
our Step 2 criteria. LIST CASES then lists all the parameters
in the data base for those cases satisfying our criteria. The
FREQUENCIES statement plots a histogram of the desired input
parameter for the criteria case. A complete listing of program
and output is given in Appendix B.
Step 4: Analyze the Step 3 results and compare the frequency distribu-
tions with those obtained earlier for the whole data base.
Referring to our examples in Appendix B, we see that four cases
met our Step 2 criteria. All of these were for an atmospheric-
stability class (ASC) of 5--as shown on page 3 (Appendix B) of
output. Further, on page 4, we see that all 4 cases were in
the low wind-speed category. (Generally, we will be working
with frequency distributions for other input parameters. How-
ever, for 4 cases we can easily review the case-by-case
* For our test data base, units are g/m3 X 10**-6; for RAM, units will
  be either ug/m3 or ppb.
-------
listings.) If we examine the atmospheric stability categories
(ASC) for the entire data base (as shown in Volume 3), we see
that the most frequent ASC is 4. Therefore, our results suggest
that our prediction mechanism is giving us the greatest error
when an ASC of 5 is input.
Step 5: Refine the analysis for the input parameter to which the model
is most sensitive. Basically, we would use the SPSSSORT pro-
gram to stratify our data base by that parameter. For our test
example, we would form a small data base for each ASC class.
Then we would apply our bivariate regression and correlation
test to each of these small data bases.
Step 6: Identify other sensitive input parameters using the above
steps. We wish to determine if our model is mispredicting
because of poor parameterizations for more than one input
parameter.
The above procedure is by no means unique and will not answer all our
questions about why the model is doing poorly in certain situations.
However, in concert with other tests, we have enough diagnostic tools to
quantify conditions leading to poor performance.
Input-Parameter Dependence
As stated above, the multiple regression of error residuals is the
only test that directly addresses input-parameter dependence. However,
as shown in the previous four subsections, other tests in conjunction
with SPSSSORT-stratified data bases can address the problem. The multi-
ple regression test itself can be applied to the stratified data bases.
The net result would then be a clearer indication of the second most
important parameter contributing to poor performance, i.e., high residuals.
COMBINATIONS OF POOR PERFORMANCE INDICATORS
Our previous discussion very neatly separated poor performance indi-
cators into categories. Now, it must be recognized that there may be
several indicators that must be treated. For instance, we could very
easily have high general bias and randomness at the same time. As a
further complication, the underlying causes may be the same or different.
-------
No matter how complex the problem appears to be, our tests can be
applied in the manner described in this text. As stated before, there
is no set formula for applying a specific test at a certain point. Some-
times highly subjective decisions will enter into the process. Neverthe-
less, the tools of analysis remain the same and provide sufficient means
for assisting in the diagnosis of poor model performance.
-------
REFERENCES
1. A. D. Busse and J. R. Zimmerman, "User's Guide to the Climatologi-
cal Dispersion Model," Publication EPA-R4-73-024, Environmental
Protection Agency, Research Triangle Park, North Carolina (1973).
2. F. L. Ludwig et al., "A Practical Multipurpose Urban Diffusion
Model for Carbon Monoxide," Final Report, SRI Project 7874, SRI
International, Menlo Park, California (1970).
3. N. H. Nie, C. H. Hull, J. G. Jenkins, K. Steinbrenner, and D. H. Bent,
Statistical Package for the Social Sciences (McGraw-Hill Book Co.,
New York, New York, 1975).
4. F. Smith and R. Strong, "Independent Performance Audit #4," Interim
Report, EPA Contract 68-02-2407, Research Triangle Institute Project
43U-1291-1, Research Triangle Park, North Carolina (1977).
-------
APPENDIX A
SPSS MULTIPLE REGRESSION OUTPUT FOR TEST DATA BASE
SOURCE LISTING
Specification of the SPSS REGRESSION procedure is described in Vol-
ume 3 of this report. Referring to the first page of the program out-
put, the source listing (located at the end of this appendix), note in
particular the "(1)" in the field of the REGRESSION specification. This
makes the program order the input parameters as a function of their sig-
nificance or correlation with the error residual. Hence, the parameter
entered first (by the program) is the most highly correlated variable;
the second is the second most highly correlated variable, and so forth.
UNIVARIATE STATISTICS
When "ALL" statistics are specified, the program automatically com-
putes some auxiliary statistics. These include the mean, standard devi-
ation, and number of cases, as illustrated on the second page of our test
example.
BIVARIATE CORRELATIONS
Bivariate correlation coefficients are also printed out. See the
third page of the output listing.
STEP-BY-STEP PARAMETER ENTRIES
The step-by-step entry of variables and associated statistics is
illustrated on the fourth and fifth pages of the sample output. In Table
A-1, definitions of a number of the output statistics are presented. Of
particular importance are the MULTIPLE R (as described in the text) and
-------
TABLE A-1. DEFINITION OF SELECTED REGRESSION STATISTICS*

Name                 Description

MULTIPLE R           Multiple correlation coefficient

R SQUARE             Square of the correlation coefficient, which is
                     normally equated to the fraction of variance of
                     the dependent variable explained by the
                     independent variables

ADJUSTED R SQUARE    R SQUARE corrected by an SPSS procedure
                     according to the number of variables and cases

STANDARD ERROR       Standard error of estimate for the prediction of
                     the residual concentration

DF                   Degrees of freedom (or the number of independent
                     variables)

RESIDUAL             Number of cases - DF - 1

F                    Significance indicator, which is the same as t**2
                     of Student's t test. A table must be used to
                     equate the F value to "statistical significance"
                     or confidence levels. More formally,
                     F = (R SQUARE/DF) / [(1 - R SQUARE)/RESIDUAL]

B                    Coefficient of the independent variable in the
                     linear regression equation

BETA                 Same as B except normalized to a unity value for
                     the dependent variable

CONSTANT             Constant in the linear regression equation

* Refer to Section 20 of the SPSS Manual3 for comprehensive
  definitions of all parameters.
-------
F, which is a measure of statistical significance. The reader is referred
to the SPSS manual for a comprehensive description of all the statistics.
Examining the sample output, one can see that atmospheric stability
class (ASC) enters at the first step with a correlation coefficient of
0.62443 and a highly significant F value of 63.3. In step 2, wind-
direction class (WDC) is entered but fails to improve the correlation
coefficient appreciably while decreasing the significance (F = 31.5).
On the third step (fifth page), wind-speed class (WSC) is entered with
results even less significant than in the previous step.
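As a check on the Table A-1 formula, the step-1 values quoted above
can be plugged in directly. The case count used below is a hypotheti-
cal assumption chosen only to illustrate the arithmetic; the report
does not restate it at this point.

    # F = (R SQUARE / DF) / ((1 - R SQUARE) / RESIDUAL), per Table A-1,
    # with RESIDUAL = number of cases - DF - 1.
    def f_statistic(multiple_r: float, df: int, n_cases: int) -> float:
        r_square = multiple_r ** 2
        residual = n_cases - df - 1
        return (r_square / df) / ((1.0 - r_square) / residual)

    # Step 1 above reports MULTIPLE R = 0.62443 with one variable (ASC)
    # entered. Assuming roughly 101 cases (hypothetical), the formula
    # reproduces an F near the reported 63.3:
    print(round(f_statistic(0.62443, df=1, n_cases=101), 1))  # ~63.3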
SUMMARY TABLE
A summary table is printed on the last page of the output listing.
The input parameters are listed in the order of their importance.
38
-------
[Pages 39 through 44 reproduce the SPSS REGRESSION output for the test
data base: the source listing, the univariate statistics, the bivari-
ate correlations, the step-by-step parameter entries, and the summary
table. The scanned computer printout on these pages is not recover-
able as text.]
-------
APPENDIX B
PARTIAL SAMPLE OF OUTPUT FOR SPSSSORT PROGRAM
45
-------
[Pages 46 through 50 reproduce a partial sample of the SPSSSORT pro-
gram output. The scanned computer printout on these pages is not
recoverable as text.]
-------
TECHNICAL REPORT DATA
(Please read Instructions on the reverse before completing)
1 REPORT NO.
EPA-600/4-80-013d
2.
3. RECIPIENT'S ACCESSION-NO.
4. TITLE AND SUBTITLE
EVALUATION OF THE REAL-TIME AIR-QUALITY MODEL USING
THE RAPS DATA BASE
Volume 4. Evaluation Guide
5. REPORT DATE
February 1980
6. PERFORMING ORGANIZATION CODE
7. AUTHOR(S)
Ronald E. Ruff
8. PERFORMING ORGANIZATION REPORT NO
Final Report
SRI Project 6868
9. PERFORMING ORGANIZATION NAME AND ADDRESS
SRI International
333 Ravenswood Avenue
Menlo Park, California 94025
10. PROGRAM ELEMENT NO.
1AA603 AA-26 (FY-77)
11. CONTRACT/GRANT NO.
68-02-2770
12. SPONSORING AGENCY NAME AND ADDRESS
Environmental Sciences Research Laboratory, RTP, NC
Office of Research and Development
U.S. Environmental Protection Agency
Research Triangle Park, North Carolina 27711
13. TYPE OF REPORT AND PERIOD COVERED
FINAL 8/77-4/79
14. SPONSORING AGENCY CODE
EPA/600/09
15. SUPPLEMENTARY NOTES
16. ABSTRACT
The theory and programming of statistical tests for evaluating the Real-Time Air-Quality Model (RAM) using the
Regional Air Pollution Study (RAPS) data base are fully documented in four volumes. Moreover, the tests are
generally applicable to other model evaluation problems. Volume 4 discusses the application and interpretation
of the statistical programs, particularly with regard to use on the RAM. In general, there is no set procedure for
evaluating an air-quality model because of the different reasons for evaluating models and many subjective
decisions to be made during the process. However, guidelines are presented to cover a wide variety of evaluation
needs, with attention to data preparation, classification, analysis, selection and application of tests, and interpre-
tation of results. Several methods of diagnosing causes of poor model performance are discussed and some
sample program outputs are also provided.
17.
KEY WORDS AND DOCUMENT ANALYSIS
a. DESCRIPTORS
b. IDENTIFIERS/OPEN ENDED TERMS
c. COSATI Field/Group
* Air Pollution
* Mathematical models
* Evaluation
* Tests
* Computer systems programs
* Statistical tests
Real-Time Air-Quality Model
Regional Air Pollution Study
Data Base
13B
12A
14B
09B
18. DISTRIBUTION STATEMENT
RELEASE TO PUBLIC
19. SECURITY CLASS (This Report)
UNCLASSIFIED
20. SECURITY CLASS (This page)
UNCLASSIFIED
21. NO. OF PAGES
57
22. PRICE
EPA Form 2220-1 (9-73)
51
------- |