United States
Environmental Protection
Agency
Office of Air Quality
Planning and Standards
Research Triangle Park, NC 27711
EPA-454/R-92-025
September 1992
AIR
PROTOCOL FOR DETERMINING THE BEST PERFORMING MODEL
-------
EPA-454/R-92-025
Protocol for Determining the Best Performing Model
U. S. ENVIRONMENTAL PROTECTION AGENCY
Office of Air Quality Planning and Standards
Technical Support Division
Research Triangle Park, NC 27711
December 1992
-------
DISCLAIMER
This report has been reviewed by the Office of Air Quality Planning
and Standards, EPA, and approved for publication. Mention of
trade names or commercial products is not intended to constitute
endorsement or recommendation for use.
-------
CONTENTS
Section
Page
Figures ii
Tables iii
1.0 Background and Purpose 1
2.0 Screening Test 3
3.0 Statistical Test 5
Test Statistic 6
Performance Measures 7
Composite Performance Measures 8
Selecting the Best Models 11
Limitations 12
4.0 Display of Results 12
5.0 References 14
Appendix A Example Comparison of Model Performance A-1
-------
FIGURES
Number Page
1 Example Fractional Bias Plot for a Given Averaging Period,
e.g., 3-hour 4
A-1 Fractional Bias of the Average and Standard Deviation:
3-hour Averages A-4
A-2 Fractional Bias of the Average and Standard Deviation:
24-hour Averages A-5
A-3 Fractional Bias and Bootstrap Percentiles (5 and 95) for
Clifty Creek (1975): MPTER and Alternate Model A-8
A-4 Absolute Fractional Bias and Bootstrap Percentiles (5 and 95),
Composite for Six Inventories: MPTER and Alternate Model A-13
A-5 Absolute Fractional Bias and Bootstrap Percentiles (5 and 95),
Composite Difference for Six Inventories:
MPTER and Alternate Model A-14
ii
-------
TABLES
Number Page
A-1 Composite Performance of MPTER and the Alternative Model
for Six Rural Data Bases A-10
iii
-------
1.0 BACKGROUND AND PURPOSE
EPA has an extensive ongoing program to evaluate the performance of air quality
simulation models. The program has resulted in a standard set of statistical and
graphical procedures for analyzing and displaying the performance of models, both
from an operational and scientific viewpoint.1,2,3,4 Application of these procedures has
produced considerable information about the performance of models. While the
information has provided numerous insights into model behavior, it is not entirely
suitable for comparing the overall performance of the models. Model comparisons are
difficult because the statistics that were generated cannot be easily composited and also
because methods for calculating confidence limits for complex statistics were
unavailable at that time.
Since these studies were published, advances have been made in the statistical
methodology needed to compare the performance of models.5 With this newer
methodology, it is feasible to combine results from different averaging periods and
different data bases into a probabilistic framework. For example, it is possible to make
statements such as "The overall difference in performance between model A and model
B is X units and this difference is significant at the 95 percent confidence level". The
purpose of this document is to present a statistical method for comparing the
performance of models using the statistical techniques that are now available.
EPA has also published a document which describes a process for selecting a best
model for case-by-case regulatory applications.6 The document, "Interim Procedures
for Evaluation of Air Quality Models", provides guidance in areas such as selection of
1
-------
the data bases, the statistics, the performance measures, and the weights to be given to
each evaluation component. The statistical procedures discussed in the interim
procedures document are still believed to be sound. However, if computer resources
are available, the newer statistical procedures described in this document (see also
reference 7) may be more appropriate.
The statistical approach for determining which model(s) perform better than other
competing models involves two steps. The first step is a screening test to eliminate
models that fail to perform at a minimum operational level. Although a completely
objective basis for choosing a minimum level of performance is lacking, accumulated
results from a number of model evaluation studies suggest that a factor-of-two is a
reasonable performance target that a model should achieve before it is used for refined
regulatory analyses. The second step applies only to those models that pass the
screening test. The analysis is based on a computer-intensive resampling technique
known as bootstrapping, which generates a probability distribution of feasible data
outcomes. The outcomes are used to calculate a composite measure of performance
that combines information across averaging periods and data bases and integrates both
the scientific and operational components of model performance. Comparison of the
distributions of the composite measures of performance for each pair of models
provides evidence of the degree to which one model performs better than other
competing models.
-------
Appendix A provides an example application of the protocol using data from six
large mid-western power plants to compare the performance of two rural air quality
models.
-------
2.0 SCREENING TEST
Each competing model is subjected to a screening test to determine if it meets
minimum standards for operational performance. The fractional bias is used as the
performance measure. The general expression for the fractional bias (FB) is given by:
FB = 2 (OB - PR) / (OB + PR)
The fractional bias of the average is computed using this equation where OB and
PR refer to the averages of the observed and predicted highest 25 values. The same
expression is used to calculate the fractional bias of the standard deviation where OB
then refers to the standard deviation of the 25 highest observed values and PR refers to
the standard deviation of the 25 highest predicted values.
The fractional bias has been selected as the basic measure of performance in this
evaluation because it has two desirable features. First, the fractional bias is
symmetrical and bounded. Values for the fractional bias range between -2.0 (extreme
overprediction) and +2.0 (extreme underprediction). Second, the fractional bias is a
dimensionless number which is convenient for comparing the results from studies
involving different concentration levels or even different pollutants.
Figure 1 is a graphical illustration of model performance in which the fractional
bias of the standard deviation (y-axis) is plotted against the fractional bias of the
average (x-axis). Models that plot close to the center of the graph (0,0) are relatively
-------
free from bias, while models that plot farther from the center tend to over- or
underpredict.
-------
[Figure 1. Example fractional bias plot for a given averaging period, e.g., 3-hour:
fractional bias of the standard deviation (y-axis) plotted against fractional bias of
the average (x-axis), each ranging from -2.0 to 2.0.]
-------
Values of the fractional bias that are equal to -0.67 are equivalent to overpredictions
by a factor-of-two, while values that are equal to +0.67 are equivalent to
underpredictions by a factor-of-two.
Since neither underprediction nor overprediction is desirable, the calculations are
simplified by determining the absolute fractional bias (AFB), which is the absolute
value of the fractional bias computed for the average and for the standard deviation.
The absolute fractional bias is calculated for each averaging period for which ambient
standards or goals have been established. Separate calculations are made using
information from each available data base. If the computed AFB statistic tends to
exceed the value of 0.67 for any averaging period or data base, consideration may be
given to excluding that model from further evaluation due to its limited credibility for
refined regulatory analysis.
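To make the screening calculation concrete, the sketch below (Python with NumPy; the function names and argument conventions are illustrative, not part of the protocol) computes the fractional bias of the average and of the standard deviation from the 25 highest observed and predicted values and applies the factor-of-two screen:

```python
import numpy as np

def fractional_bias(ob, pr):
    """General fractional bias: FB = 2(OB - PR)/(OB + PR), bounded in [-2, +2]."""
    return 2.0 * (ob - pr) / (ob + pr)

def screening_test(observed, predicted, n_top=25, limit=0.67):
    """Apply the screening test for one data base and averaging period.

    observed, predicted: pooled concentrations, unpaired in space or time
    (each assumed to contain at least n_top values).
    Returns the FB of the average, the FB of the standard deviation, and
    whether both absolute fractional biases pass the factor-of-two screen.
    """
    obs_top = np.sort(np.asarray(observed))[-n_top:]   # 25 highest observed
    prd_top = np.sort(np.asarray(predicted))[-n_top:]  # 25 highest predicted
    fb_avg = fractional_bias(obs_top.mean(), prd_top.mean())
    fb_std = fractional_bias(obs_top.std(ddof=1), prd_top.std(ddof=1))
    passes = max(abs(fb_avg), abs(fb_std)) <= limit
    return fb_avg, fb_std, passes
```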
3.0 STATISTICAL TEST
Models that pass the screening test are then subjected to a more comprehensive
statistical comparison that involves both an operational and scientific component. The
rationale for the operational component is to measure the model's ability to estimate
concentration statistics most directly used for regulatory purposes. For a pollutant such
as SO2 for which short-term ambient standards exist, the statistic of interest involves
the network-wide highest concentrations. In this example, the precise time, location,
and meteorological conditions are of minor concern compared to the magnitude of the
highest concentrations actually occurring. The scientific component is necessary to
evaluate the model's ability to perform accurately throughout (1) the range of
-------
meteorological conditions that might be expected to occur and (2) the geographic area
immediately surrounding the source(s) for which model estimates are needed.
Because of the emphasis on highest concentrations, a robust test statistic is
calculated that represents a "smoothed" estimate of the highest concentration.* A
performance measure, based on the fractional bias, is calculated which compares the
air quality and model test statistics. The performance measures obtained from the
operational and scientific components and from among the various data bases are
combined to create a composite performance measure. The bootstrap procedure is used
to estimate the standard error for the composite performance measure for each model.
Using the estimate of standard error obtained from the bootstrap, the statistical
significance of the difference between models is assessed.
Test Statistic
The test statistic used to compare the performance of the models is a robust
estimate of the highest concentration (RHC) using the largest concentrations within a
given data category. The same robust estimator is used in both the operational and
scientific phases of the statistical comparison. The robust estimate is based on a tail
exponential fit to the upper end of the distribution and is computed as follows:7,8
*Typically, the network-wide highest value from among the annual second highest
concentrations at each monitor is used to determine attainment/non-attainment of ambient
standards. Because the highest concentration value is subject to extreme variations, the
robust highest concentration, which is more stable, is preferable in this analysis. Also,
the bootstrap distribution of robust highest concentrations is not artificially bounded at the
maximum concentration, which allows for a continuous range of concentrations that in fact
may exceed the highest concentration.
8
-------
RHC = X(N) + [X̄ - X(N)] ln[(3N - 1)/2] ,
where:
X̄ = average of the N-1 largest values
X(N) = Nth largest value
N = number of values exceeding the threshold value (N ≤ 26)
The value of N is nominally set equal to 26 so that the number of values averaged
(in X̄) is 25. The value of N may be lower than 26 whenever the number of
values exceeding the threshold is lower than 26. Whenever N is less than 3, the RHC
statistic is set equal to the threshold value where the threshold is defined as a
concentration near background which has no impact on the determination of the robust
highest concentration.
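A minimal sketch of this calculation, assuming NumPy and the formula as reconstructed above (the function name and argument conventions are illustrative):

```python
import numpy as np

def robust_highest_concentration(values, threshold, n_max=26):
    """Robust highest concentration (RHC) from a tail exponential fit:
    RHC = X(N) + [Xbar - X(N)] * ln((3N - 1)/2), where Xbar is the average
    of the N-1 largest values and X(N) is the Nth largest value."""
    vals = np.asarray(values, dtype=float)
    exceed = np.sort(vals[vals > threshold])[::-1]  # exceedances, descending
    n = min(len(exceed), n_max)     # N is nominally 26, lower if need be
    if n < 3:                       # too few exceedances: use the threshold
        return threshold
    x_n = exceed[n - 1]             # Nth largest value
    x_bar = exceed[: n - 1].mean()  # average of the N-1 largest values
    return x_n + (x_bar - x_n) * np.log((3.0 * n - 1.0) / 2.0)
```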
The robust estimator of the highest value is strongly related to the two statistics
used in the screening test. Increasing values of the average and standard deviation
have the effect of increasing the central location and spread of the 25 highest values.
Increases in the central location and spread tend to increase the magnitude of the
highest value within the 25 highest concentrations. The robust highest value in effect
is a direct measurable result of the composite impact of the central location of the
highest values and their spread about that central location.
Performance Measures
The operational component of the evaluation compares the performance of the
models in terms of the largest network-wide RHC test statistic. The robust highest
value is calculated separately for each monitoring station within the network. The
-------
largest measurement-based RHC value in the monitoring network and the largest
model-based RHC value from the model estimates are used to calculate the absolute fractional
bias for each model. An absolute fractional bias is calculated for each averaging
period for which short-term ambient standards exist, e.g., 3- and 24-hour.
The scientific component of the evaluation is also based on the absolute fractional
bias as the basic measure of performance. The absolute fractional bias for each model
is calculated using the robust highest statistic determined for each meteorological
condition and monitoring station. Only data for 1-hour averaging periods are used in
this component of the evaluation. The meteorological conditions used are a function of
atmospheric stability and wind speed. Six unique meteorological conditions are
defined from two wind speed categories (below and above 4.0 m/s) and three stability
categories: unstable (A,B,C), neutral (D), and stable (E and F). Other categories or
combinations of meteorological conditions might be chosen at the discretion of the
evaluator.
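As an illustration, the fragment below assigns an hourly record to one of the six categories just described (Python; the category labels, the function name, and the handling of the 4.0 m/s cut point are assumptions of this sketch):

```python
def met_category(stability_class, wind_speed):
    """Map a Pasquill stability class ('A'..'F') and a wind speed (m/s)
    to one of six categories: {unstable, neutral, stable} x {low, high}."""
    if stability_class in ('A', 'B', 'C'):
        stability = 'unstable'
    elif stability_class == 'D':
        stability = 'neutral'
    else:                                  # classes E and F
        stability = 'stable'
    wind = 'low' if wind_speed < 4.0 else 'high'
    return stability, wind
```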
Composite Performance Measures
A composite performance measure is computed for each model as a weighted
linear combination of the individual fractional bias components. Within the operational
evaluation component, results for the various averaging periods are given equal weight.
For the scientific component, each of the combinations of the meteorological
conditions is given equal weight. Because the operational evaluation component is
deemed to be the more important of the two, it is given a weight that is twice the
weight of the scientific component. Finally, results from the various data bases are
10
-------
given equal weight unless it is determined that differences in such factors as data
quality and geographical coverage of the monitoring networks suggest otherwise. The
algebraic expression for the composite performance measure (CPM) is:
CPM = (1/3) [ (1/IJ) ΣiΣj Xij + (AFB)3 + (AFB)24 ] ,
where:
Xij = Absolute Fractional Bias for meteorological category i at station j
I, J = number of meteorological categories and stations, respectively
(AFB)3 = Absolute Fractional Bias for 3-hour averages
(AFB)24 = Absolute Fractional Bias for 24-hour averages
Because the purpose of the analysis is to contrast the performance among the
models, the composite performance measure is used to calculate pairs of differences
between the models. For discussion purposes, the difference between the composite
performance of one model and another is referred to as the model comparison measure.
The expression for the model comparison measure is given by:
(MCM)AB = (CPM)A - (CPM)B ,
where:
(CPM)A = Composite Performance Measure for Model A
(CPM)B = Composite Performance Measure for Model B
When more than two models are being compared simultaneously, the number of
model comparison measure statistics is equal to the number of unique
combinations of two models. For example, for three models, three comparison
measures are computed; for four models, six comparison measures are computed; and so on.
The model comparison measure is used in judging the statistical significance of the
apparent superiority of any one particular model over another.
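The weighting scheme above reduces to a short calculation. The sketch below (illustrative function names; one data base, with equal data-base weighting applied afterwards) computes a model's composite performance measure and the comparison measure for a pair of models:

```python
import numpy as np

def composite_performance_measure(afb_1hr, afb_3hr, afb_24hr):
    """CPM for one data base. The scientific term is the mean absolute
    fractional bias over the meteorological-category/station cells;
    averaging it with the two operational terms gives the operational
    component (3- and 24-hour) twice the weight of the scientific one."""
    scientific = np.mean(afb_1hr)   # equal weight per category/station cell
    return (scientific + afb_3hr + afb_24hr) / 3.0

def model_comparison_measure(cpm_a, cpm_b):
    """MCM = CPM(A) - CPM(B); a negative value favors Model A."""
    return cpm_a - cpm_b
```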
11
-------
Standard Error Determination
The yardstick used to determine if the composite difference between models is
statistically significant is known as the standard error. At the simplest level, the ratio
of the composite difference to the standard error provides a convenient measure of the
significance for the resulting difference. Nominally, ratios whose absolute value is
larger than 1.7 are significant at approximately the 90 percent confidence level, while
smaller values indicate no significant difference.
Because the model comparison measure is a rather involved statistic, the usual
statistical methods for estimating the standard error do not apply. Fortunately,
computing speeds have increased so that resampling techniques such as the "jackknife"
and "bootstrap" are feasible methods for estimating the standard error and for
determining confidence limits. Because of its simplicity, the blocked bootstrap
method9 is used to generate an estimate of the standard error. The bootstrap is
basically a resampling technique whereby the desired performance measure is
recalculated for a number of "trial" years. To do this, the original year of data (nominally
365 days) is partitioned into 3-day blocks. Within each season, 3-day blocks
(approximately 30 blocks per season) are sampled with replacement until a total season
is created. This process is repeated using each of the four seasons to construct a
complete bootstrap year. Three-day blocks are chosen to preserve day-to-day
meteorological persistence, while sampling within seasons guarantees that each season
will be represented only by days chosen from that season. Since sampling is done
with replacement, some days are represented more than once, while other days are not
represented at all. Next, the data generated for the bootstrap year are used to calculate
12
-------
the composite performance measures for each model. This process is repeated until
sufficient samples are available to calculate a meaningful standard error for each of the
model performance statistics. The standard error is calculated as simply the standard
deviation of the bootstrap generated outcomes for the model comparison measure.
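The following sketch illustrates the blocked bootstrap as described, assuming the original year has already been partitioned into four per-season arrays of 3-day blocks (NumPy; the names and data layout are assumptions of this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def bootstrap_year(season_blocks):
    """Build one bootstrap 'trial year' from 3-day blocks.

    season_blocks: list of four arrays, one per season, each of shape
    (n_blocks, 3, ...) holding that season's 3-day blocks of daily data.
    Sampling blocks with replacement within each season preserves
    day-to-day persistence and keeps each season built only from its
    own days; some days recur while others are omitted.
    """
    seasons = []
    for blocks in season_blocks:                     # four seasons
        idx = rng.integers(0, len(blocks), size=len(blocks))
        seasons.append(np.concatenate(blocks[idx]))  # rebuilt season
    return np.concatenate(seasons)

# The standard error is then simply the standard deviation of the model
# comparison measure over the bootstrap trial years, e.g.:
#   std_error = np.std(mcm_trials, ddof=1)
```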
Selecting the Best Models
The magnitude and sign of the model comparison measure are indicative of the
relative performance of each pair of models. The smaller the composite performance
measure, the better the overall performance of the model. This means that for two
arbitrary models, Model A and Model B, a negative difference between the composite
performance measure for Model A and Model B implies that model A is performing
better (Model A has the smaller composite performance measure), while a positive
value indicates that model B is performing better. When only two models are
compared, the test statistic is simply the ratio of the composite difference to the
standard error calculated from the bootstrap outcomes.
When more than two models are being compared, it is convenient to calculate
simultaneous confidence intervals for each pair of model comparisons.10 For each pair
of model comparisons, the significance of the model comparison measure depends
upon whether or not the confidence interval overlaps zero (0). If the confidence
interval overlaps zero, the two models are not performing at a level which is
statistically different. If the confidence interval does not overlap zero, (upper and
lower limits are both negative or both positive), then there exists a statistically
significant difference between the two models at the stated level of confidence.
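For a single pair of models, the decision rule sketched above can be written as follows (illustrative only; when more than two models are compared, simultaneous confidence intervals would replace the single interval, per reference 10):

```python
import numpy as np

def compare_models(mcm_trials, confidence=0.90):
    """Judge the significance of the model comparison measure from its
    bootstrap outcomes: compute the ratio of the mean MCM to its standard
    error and a central confidence interval; the difference is significant
    only when the interval does not overlap zero."""
    mcm = np.asarray(mcm_trials)
    ratio = mcm.mean() / mcm.std(ddof=1)    # nominal cutoff: |ratio| > 1.7
    alpha = (1.0 - confidence) / 2.0
    lo, hi = np.quantile(mcm, [alpha, 1.0 - alpha])
    significant = lo > 0.0 or hi < 0.0      # interval excludes zero
    return ratio, (lo, hi), significant
```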
13
-------
The level of confidence chosen has an important impact on the decision. The
larger the probability or confidence level, the larger the length of the confidence limits
required to satisfy the confidence statement. Choosing a confidence level that is overly
demanding (e.g., 99.9999%) would almost surely result in such wide limits that no
decision could be reached regarding which model(s) are performing better. At the
other extreme, choosing a confidence level that is very lenient (e.g., 70%) may lead to
a decision that one or more models are superior when in fact no real difference exists.
This choice must balance the two competing needs, which requires judgement from the
judgement from the evaluators. A confidence level in the vicinity of 90 to 95 percent
represents a reasonable compromise between these two needs.
Limitations
This protocol document contains very specific requirements for conducting the
statistical comparisons believed necessary to compare the performance of models.
These requirements are based on experiences gained from EPA's model evaluation
activities over the past several years. The reader is reminded that there may be more
logical choices of meteorological conditions and specific weights for compositing
performance among various data categories. Likewise, the specific test statistic,
performance measure and range of data may be different depending on the nature of
the data bases being used and the judgement of those conducting the evaluation.
4.0 DISPLAY OF RESULTS
To fully understand the final outcome from applying the methodology, each of the
component results should be examined. For example, the absolute fractional bias does
14
-------
not provide any information about the direction of the bias, i.e., it does not indicate if
a particular model tends to under- or overpredict. Greater understanding about the
relative performance of each model can be obtained through graphic display of the
fractional bias for the various data categories used in the evaluation.
For the screening test, results are displayed graphically using the fractional bias of
the average vs the fractional bias of the standard deviation as illustrated in Figure 1
(see reference 2). Information is presented separately for each averaging period and
each of the data bases used in the analysis. For the statistical test, this is accomplished
with the use of box diagrams (see Appendix A) which display the magnitude of
selected percentiles for the fractional bias using the outcomes of the bootstrap process.
Although these diagrams are not intended for use in making the final decision, they are
useful in summarizing and presenting the outcome. Also, the scientific results should
prove useful for the improvement of existing models and in the development of new
models.
15
-------
5.0 REFERENCES
1. Environmental Protection Agency, 1982. Evaluation of Rural Air Quality
Simulation Models (EPA-450/4-83-003). U.S. Environmental Protection Agency,
Research Triangle Park, NC.
2. Environmental Protection Agency, 1985. Evaluation of Rural Air Quality
Simulation Models, Addendum B: Graphical Display of Model Performance Using
the Clifty Creek Data Base (EPA-450/4-83-003b). U.S. Environmental Protection
Agency, Research Triangle Park, NC.
3. Cox, W. M. and J. A. Tikvart. Assessing the Performance of Air Quality Models,
Paper presented at the 15th International Technical Meeting on Air Pollution Modeling and its
Application (NATO/CCMS), April 16-19, 1985, St. Louis, MO.
4. Baldridge, K. W., 1985. Standardized SAS Graphics Subsystem User's Manual.
Prepared by Computer Sciences Corporation for EPA, Computer Sciences
Corporations, Research Triangle Park, NC.
5. Efron B., 1982. The Jackknife, the Bootstrap and Other Resampling Plans. Society
for Industrial and Applied Mathematics, Philadelphia, PA.
6. Environmental Protection Agency, 1984. Interim Procedures for Evaluating Air
Quality Simulation Models (Revised) (EPA-450/4-83-023). U.S. Environmental
Protection Agency, Research Triangle Park, NC.
7. Cox, W. M. and J. A. Tikvart, 1990. A statistical procedure for determining the
best performing air quality simulation model. Atmos. Environ., 24A(9): 2387-2395.
8. Breiman, L., J. Gins and C. Stone, 1978. Statistical Analysis and Interpretation of
Peak Air Pollution Measurements (TSC-PD-A190-10). Technology Service
Corporation, Santa Monica, CA.
9. Tukey, J. W., 1987. Kinds of bootstraps and kinds of jackknifes, discussed in
terms of a year of weather-related data (Technical Report No. 292). Department of
Statistics, Princeton University, Princeton, NJ.
10. Cleveland, W. S., and R. McGill, 1984. Graphical Perception: Theory,
Experimentation, and Application to the Development of Graphical Methods.
J. Am. Stat. Assoc., 79(387): 531.
16
-------
APPENDIX A
EXAMPLE COMPARISON OF MODEL PERFORMANCE
A-1
-------
A.1 INTRODUCTION
This appendix demonstrates an example application of the protocol for comparing
models using six data bases available around four large power plants located in the
midwest. For purposes of this example, MPTER (Version 6) and an alternative point
source model are evaluated and compared. The MPTER model was chosen since it is the
EPA preferred model for regulatory applications. The alternative model was chosen as
a state-of-the-art rural point source model incorporating advanced features for simulating
plume behavior in flat terrain.
The six model evaluation data bases used in the analysis consisted of Clifty Creek
(1975 and 1976), Muskingum River (1975 and 1976), Paradise (1976) and Kincaid
(1980/1981). The Clifty Creek plant is a coal fired, base-load facility located along the
Ohio River in southern Indiana. Terrain surrounding the plant consists of low ridges and
rolling hills at elevations that are below the top of the stacks. Hourly SO2 data are
available for six monitoring stations located at distances ranging from approximately
3 to 15 km from the power plant. The Muskingum River plant is also a coal-fired plant,
located in Ohio and surrounded by low ridges and rolling hills. Four SO2 monitoring
stations are located at distances ranging from 4 to 20 km from the plant. The Paradise
plant, located in Kentucky, has 12 monitors located at distances from 3 to 17 km from
the plant. The Kincaid plant, located in central Illinois, employed a dense network of
SO2 monitors ranging from approximately 2 to 20 km from the plant. Each of these data
bases is documented in greater detail elsewhere.1,2,3,4 For each of these data bases, 1-, 3- and 24-
hour average measured and predicted concentrations have been assembled for each of the
operating monitoring stations. In addition, wind speed and atmospheric stability are
A-2
-------
available for each of the hourly records. An hourly background concentration was
estimated and subtracted from the measured hourly concentrations using the EPA method.5
In addition, a threshold check is used to eliminate low observed or predicted values that
have no effect on the performance statistics. For 1-hour averages, a threshold value of
25 μg/m³ is used while, for 3-hour and 24-hour averages, a value of 5 μg/m³ is used. The
threshold checks are applied independently to the measured and predicted concentrations.
A.2 SCREENING TEST RESULTS
For each data base, the observed concentrations from all monitoring stations were
pooled and sorted by averaging period. From the sorted data, the 25 highest observed
concentrations, unpaired in space or time, were used to calculate a mean and standard
deviation. The same procedure was applied to the predicted concentrations obtained from
MPTER (Version 6) and the alternative model. Using these statistics, a fractional bias for
the mean and a fractional bias for the standard deviation were determined for each model
for 3-hour averages and for 24-hour averages.
Figure A-1 shows the results in which the fractional bias of the average and fractional
bias of the standard deviation are plotted for 3-hour averages. For both MPTER (left
panel) and the alternative model (right panel), the results for all six of the data bases are
shown. For both models, the predicted and observed averages and standard deviations are
within a factor-of-two except at Kincaid, where underpredictions are apparent. Figure A-2
shows similar results for the 24-hour averages. Again, most of the data points indicate
performance within a factor-of-two except for the alternative model at Clifty Creek where
a tendency for overpredictions is evident. Since both models satisfactorily meet the
A-3
-------
High 25 Concentrations: 3-hour Averages
1 = Clifty Creek (1975) 2 = Clifty Creek (1976) 3 = Muskingum River (1975)
4 = Muskingum River (1976) 5 = Paradise (1976) 6 = Kincaid (1980/81)
[Two-panel plot: MPTER V6 (left) and Alternate (right). Each panel plots bias of the
standard deviation (y-axis, -2.0 to 2.0) against bias of the average (x-axis, -2.0 to 2.0),
with each data base identified by its number.]
Figure A-1. Fractional bias of the average and standard deviation: 3-hour averages.
A-4
-------
High 25 Concentrations: 24-hour Averages
1 = Clifty Creek (1975) 2 = Clifty Creek (1976) 3 = Muskingum River (1975)
4 = Muskingum River (1976) 5 = Paradise (1976) 6 = Kincaid (1980/81)
[Two-panel plot: MPTER V6 (left) and Alternate (right). Each panel plots bias of the
standard deviation (y-axis, -2.0 to 2.0) against bias of the average (x-axis, -2.0 to 2.0),
with each data base identified by its number.]
Figure A-2. Fractional bias of the average and standard deviation: 24-hour averages.
A-5
-------
minimum level of performance, the two models are subjected to a more comprehensive
statistical comparison.
A.3 STATISTICAL COMPARISONS
The performance of MPTER is compared with the performance of the alternative
model using a composite statistical measure that combines the performance within the
operational component (3-hour and 24-hour averages) and the scientific component (1-hour
averages). For purposes of the operational component, the observed and predicted
concentrations were sorted separately by station and averaging period. Using the 25
largest values, the statistical procedures described in the protocol were applied to calculate
the robust highest concentration (RHC) at each station. A network based absolute
fractional bias was computed for each averaging period and model using the largest
observed RHC and the largest predicted RHC value from among the monitoring stations
in each data base.
For the scientific component, six meteorological categories were defined from two
wind speed categories and three stability categories. The two wind speed categories are:
low (<4.0 m/s) and high (>4.0 m/s). The three stability categories are: unstable (class A,
B, C), neutral (class D), and stable (class E, F). To minimize distortions associated with
small counts, data categories having fewer than 100 observations were eliminated from
the analysis.
The hourly observed and hourly predicted concentrations within each data category
were sorted. The 25 highest values were used to calculate a separate robust highest
A-6
-------
concentration for each of the station/meteorological data categories. A composite absolute
fractional bias was computed by averaging the individual absolute fractional biases. A
composite performance measure for each model was then calculated by averaging three
quantities: (1) the absolute fractional bias based on 3-hour averages, (2) the absolute
fractional bias based on 24-hour averages, and (3) the composite absolute fractional bias
based on 1-hour averages. The difference between the composite performance measures
for MPTER and the alternative model (the model comparison measure) is the statistic
actually used in judging the overall difference in performance between the two models.
Following the procedure outlined in the protocol, 100 bootstrap trial years were
generated.* For each trial year, the statistics and model performance measures described
above were recalculated resulting in 100 sets of statistical outputs. The statistics in each
set included the fractional biases, absolute fractional biases, composite absolute fractional
biases, composite performance measure and the differences between the composite
performance measures for MPTER and the alternative model. For this example
demonstration, a confidence level of 90 percent was selected for determining statistical
significance for the difference in performance between the two models.
A.4 STATISTICAL RESULTS: CLIFTY CREEK
Figure A-3 presents an example comparison of the bias for the two models using the
1975 Clifty Creek data. The figure presents the results for 1-, 3- and 24-hour averages
*The number of bootstrap trials was limited to 100 by available computing resources.
Nominally, 500 to 1000 bootstrap trials would be used if computational resources were
not a prime consideration.
A-7
-------
Clifty Creek (1975)
MPTER V6 (D) and Alternate (X)
[Four-panel plot: fractional bias with 5th and 95th bootstrap percentiles for 1-hour,
3-hour, and 24-hour averages and the composite.]
Figure A-3. Fractional bias and bootstrap percentiles (5 and 95) for Clifty Creek (1975): MPTER and Alternate Model
A-8
-------
in terms of the RHC test statistic for Clifty Creek (1975). The data displayed consist of
the fractional bias for each averaging period along with the 5th and 95th percentiles
resulting from the bootstrap. The results for 1-hour averages are the composite average
over the individual stations and six meteorological conditions, while the 3-hour and 24-
hour results are based on the largest RHC test statistic across the six monitoring stations.
The composite 1-hour fractional bias indicates an overall tendency for MPTER to
underpredict peak 1-hour concentrations at Clifty Creek.* Since the upper and lower
percentiles are far above the zero reference line, the underpredictions are "significant" in
a statistical sense. Composite 1-hour results for the alternative model indicate a clear
tendency for overprediction at Clifty Creek. For 3-hour averages, MPTER appears to be
essentially unbiased while the alternative model shows a tendency for overpredictions.
For 24-hour averages, MPTER shows a tendency for modest underpredictions while the
alternative model shows a clear tendency for overprediction. The overall composite
fractional bias shown in the last panel suggests a tendency for underpredictions by
MPTER and overpredictions by the alternative model. The composite underprediction by
MPTER is dominated by the 1-hour results while the composite overpredictions by the
alternative model are more evenly spread across each of the three averaging periods.
Table A-1 summarizes the results of the comparison in performance between the two models
in terms of the absolute fractional bias for each averaging period. The ratio of the
difference between the two models to the standard error provides a rough measure of the
statistical significance of the difference in composite performance between the two
*For 1-hour concentrations unpaired in space or time, both models actually overpredict
the observed 1-hour concentrations. If ambient standards existed for 1-hour average data,
the operational component of this evaluation would include 1-hour average comparisons
equivalent to those described for 3-hour and 24-hour averages.
A-9
-------
Table A-1. Composite Performance of MPTER and the Alternative Model
for Six Rural Data Bases

Data Base        Averaging             Absolute Fractional Bias
                 Period       MPTER   Alternate   Diff. (d)   Std. Dev. (s)   Ratio (d/s)

Clifty Creek     1-hr          0.81     0.83       -0.02          0.05          -0.4
(1975)           3-hr          0.01     0.71       -0.70          0.14          -5.0
                 24-hr         0.31     0.64       -0.33          0.24          -1.4
                 Composite     0.37     0.73       -0.35          0.10          -3.5

Clifty Creek     1-hr          0.58     0.47        0.11          0.04           2.8
(1976)           3-hr          0.25     0.73       -0.48          0.12          -4.0
                 24-hr         0.12     0.91       -0.79          0.14          -5.6
                 Composite     0.41     0.78       -0.37          0.07          -5.3

Muskingum River  1-hr          1.04     0.60        0.43          0.07           6.1
(1975)           3-hr          0.10     0.22       -0.12          0.15          -0.8
                 24-hr         0.08     0.02        0.06          0.17           0.4
                 Composite     0.40     0.28        0.12          0.08           1.5

Muskingum River  1-hr          0.54     0.38        0.41          0.08           5.1
(1976)           3-hr          0.04     0.11       -0.06          0.10          -0.6
                 24-hr         0.34     0.03        0.31          0.12           2.6
                 Composite     0.44     0.23        0.21          0.05           4.2

Paradise         1-hr          1.25     0.75        0.50          0.03          16.7
(1976)           3-hr          0.09     0.55       -0.46          0.10          -4.6
                 24-hr         0.25     0.48       -0.23          0.14          -1.6
                 Composite     0.53     0.59       -0.06          0.13          -0.5

Kincaid          1-hr          0.68     0.59        0.09          0.04           2.2
(1980/81)        3-hr          0.29     0.53       -0.24          0.39          -0.6
                 24-hr         0.29     0.69       -0.40          0.35          -1.1
                 Composite     0.42     0.60       -0.18          0.22          -0.8

Grand Composite
(All 6 Data Bases)             0.43     0.54       -0.11          0.05          -2.2
A-10
-------
models. Absolute values for the ratio that exceed a nominal value of 1.7 indicate
significance at approximately the 90 percent confidence level. For the Clifty Creek 1975
data base, the composite results indicate that the overall performance of MPTER is
significantly better than the performance of the alternative model. The difference between
the composite absolute fractional bias statistics for the two models is -0.35 which is 3.5
times as large as the standard error for the difference. Before discussing the results for
all six data bases, it is instructive to examine the results at Clifty Creek more closely.
Although the composite difference between the two models is large, there are noticeable
differences among the three averaging periods. For 1-hour averages, the two models
performed about the same but for different reasons. By referring to Figure A-3, it is clear
that underpredictions by MPTER and overpredictions by the alternative model are of
approximately the same magnitude. The net effect is that both models are penalized by
approximately the same degree. The performance of the two models for 3-hour averages
appears to be the dominant factor contributing to the overall difference. The fractional
bias for MPTER is essentially zero while for the alternative model the fractional bias is
0.71 leading to a large difference compared with its standard error (-0.70 vs. 0.14). For
24-hour averages, MPTER is also the better performing model. The absolute fractional
bias for MPTER is low (0.31) but not significantly lower than for the alternative model
(0.64).
A.5 STATISTICAL RESULTS: ALL DATA BASES
The relative performance between the two models varies somewhat among the six
data bases. The ratio of the composite difference to its standard error ranged from -5.3
at Clifty Creek (1976) to 4.2 at Muskingum (1976). At Clifty Creek, MPTER clearly
A-11
-------
performs better than the alternative model while at Muskingum River, the alternative
model performs at least as well or better than MPTER. For Paradise and Kincaid the
composite results indicate that MPTER is performing slightly better than the alternative
model; however, the results are not statistically significant. The grand composite result
over the six data bases indicates that MPTER performs statistically better than the
alternative model. The composite difference between the two models is -0.11 which is
more than twice the estimated standard error.
Figures A-4 and A-5 present the composite results graphically for each averaging period
and for the overall grand composite. Clearly, the overall tendency is for MPTER to
perform better for 3-hour and 24-hour averages (operational component), while the
alternative model performs better for 1-hour averages (scientific component). The
composite statistics shown in the last panel of Figures A-4 and A-5 suggest that the better
performance of MPTER in the operational component more than compensates for the
better performance by the alternative model in the scientific component. Note that in this
example comparison, the overall statistical significance was small and also there were
rather large differences in performance between data bases. In practice, these facts might
be taken into consideration when choosing the model for applications and/or in
considering whether additional data were necessary before arriving at a final decision.
A-12
-------
Composite for All Inventories
MPTER V6 (D) and Alternate (X)
[Four-panel plot: absolute fractional bias with 5th and 95th bootstrap percentiles for
1-hour, 3-hour, and 24-hour averages and the composite.]
Figure A-4. Absolute fractional bias and bootstrap percentiles (5 and 95), composite for six inventories:
MPTER and Alternate Model
A-13
-------
Composite for All Inventories
Difference between MPTER V6 and Alternate
[Four-panel plot: composite difference in absolute fractional bias (MPTER minus
Alternate) with 5th and 95th bootstrap percentiles for 1-hour, 3-hour, and 24-hour
averages and the composite.]
Figure A-5. Absolute fractional bias and bootstrap percentiles (5 and 95), composite difference for six inventories:
MPTER and Alternate Model
A-14
-------
A.6 REFERENCES
1. Environmental Protection Agency, 1982. Evaluation of Rural Air Quality
Simulation Models (EPA-450/4-83-003). U.S. Environmental Protection Agency,
Research Triangle Park, NC.
2. Environmental Protection Agency, 1985. Evaluation of Rural Air Quality
Simulation Models, Addendum A: Muskingum River Data Base
(EPA-450/4-83-003a). U.S. Environmental Protection Agency, Research Triangle
Park, NC.
3. Environmental Protection Agency, 1986. Evaluation of Rural Air Quality
Simulation Models, Addendum C: Kincaid SO2 Data Base (EPA-450/4-83-003c).
U.S. Environmental Protection Agency, Research Triangle Park, NC.
4. Environmental Protection Agency, 1987. Evaluation of Rural Air Quality
Simulation Models, Addendum D: Paradise SO2 Data Base (EPA-450/4-83-003d).
U.S. Environmental Protection Agency, Research Triangle Park, NC.
5. Environmental Protection Agency, 1987. Guideline on Air Quality Models
(Revised) and Supplement A (EPA-450/2-78-027R). U.S. Environmental
Protection Agency, Research Triangle Park, NC.
A-15
-------
TECHNICAL REPORT DATA
1. REPORT NO.: EPA-454/R-92-025
4. TITLE AND SUBTITLE: Protocol for Determining the Best Performing Model
5. REPORT DATE: December 1992
7. AUTHOR(S): William M. Cox
9. PERFORMING ORGANIZATION NAME AND ADDRESS: U.S. Environmental Protection Agency,
Office of Air Quality Planning and Standards, Technical Support Division,
Research Triangle Park, NC 27711
16. ABSTRACT: This document describes a recommended procedure for evaluating the
performance of air quality dispersion models, and for selecting the best performing
model for particular regulatory applications. The procedure is based on direct
comparisons between measured and model-predicted concentrations to establish the
accuracy of each model. The document includes an example evaluation using SO2 data
collected at four midwestern power plants in which the performance of MPTER and one
other rural air quality model were compared.
17. KEY WORDS AND DOCUMENT ANALYSIS, DESCRIPTORS: Air Pollution; Atmospheric
Dispersion Modeling; Model Performance; Statistical Evaluation
18. DISTRIBUTION STATEMENT: Release Unlimited
19. SECURITY CLASS (Report): Unclassified
20. SECURITY CLASS (Page): Unclassified
21. NO. OF PAGES: 30 (incl. appendix)
EPA Form 2220-1 (Rev. 4-77) PREVIOUS EDITION IS OBSOLETE
------- |