United States Environmental Protection Agency
Office of Air Quality Planning and Standards
Research Triangle Park NC 27711
EPA-450/4-84-023
September 1984
Air
Interim Procedures
For Evaluating Air
Quality Models
(Revised)
Disclaimer
This report has been reviewed by the Office of Air Quality
Planning and Standards, U.S. Environmental Protection
Agency, and has been approved for publication. Mention of
trade names or commercial products is not intended to
constitute endorsement or recommendation for use.
EPA-450/4-84-023
Interim Procedures for Evaluating Air
Quality Models (Revised)
U.S. ENVIRONMENTAL PROTECTION AGENCY
Monitoring and Data Analysis Division
Office of Air Quality Planning and Standards
Research Triangle Park, North Carolina 27711
September 1984
Preface
The quantitative evaluation and comparison of models for application
to specific air pollution problems is a relatively new problem area for
the modeling community. Although considerable experience has been gained
in applying the procedures contained in an earlier version of this document,
it is expected that there will continue to be a number of problems in
carrying out the procedures described herein. Thus, procedures discussed
in this document should continue to be considered interim.
EPA Regional Offices and State air pollution control agencies are
encouraged to use this document to judge the appropriateness of a proposed
model for a specific application. However, they must exercise judgment
where individual recommendations are not of practical value. After a
period of time during which further experience is gained, problem areas
will become better defined and will be addressed in additional revisions
to this document.
The procedures described herein are specifically tailored to
operational evaluation, as opposed to scientific evaluation. The main
goal of operational evaluation as applied here is to determine whether
a proposed model is that which is most reliable for use in a specific
regulatory action. The ability of various sub-models (plume rise, etc.)
to accurately reproduce reality or to add basic knowledge, which is
assessed by scientific evaluation, is not specifically addressed by these procedures.
An example illustrating the procedures described in this document
has been prepared, and is attached as Appendix B. As noted in the preface
to Appendix B, the primary utility of the example is to illustrate some
considerations in designing the performance evaluation protocol. The
example is not intended to be a "model" to be followed in an individual
application of these procedures.
Table of Contents
Preface iii
Table of Contents v
List of Tables vii
List of Figures ix
Summary xi
1.0 INTRODUCTION 1
1.1 Need for Model Evaluation Procedures 2
1.2 Basis for Evaluation of Models 4
1.3 Coordination with Control Agency 5
2.0 PRELIMINARY ANALYSIS 7
2.1 Regulatory Aspects of the Application 7
2.2 Source and Source Environment 8
2.3 Reference Model 10
2.4 Proposed Model 11
2.5 Preliminary Estimates 12
2.6 Technical Comparison with the Reference Model 13
2.7 Technical Evaluation When No Reference Model Is Used 14
2.8 Technical Summary 16
3.0 PROTOCOL FOR PERFORMANCE EVALUATION 19
3.1 Performance Measures 20
3.1.1 Model Bias 21
3.1.2 Model Precision 23
3.1.3 Correlation Analysis 24
3.2 Data Organization 25
3.3 Protocol Requirements 27
3.3.1 Performance Evaluation Objectives 29
3.3.2 Selecting Data Sets and Performance Measures 31
3.3.3 Weighting the Performance Measures 34
3.3.4 Determining Scores for Model Performance 36
3.3.5 Format for the Model Comparison Protocol 38
3.4 Protocol When No Reference Model Is Available 42
4.0 DATA BASES FOR THE PERFORMANCE EVALUATION 45
4.1 On-Site Data 46
4.1.1 Air Quality Data 46
4.1.2 Meteorological and Emissions Data 50
4.2 Tracer Studies 51
4.3 Off-Site Data 53
5.0 MODEL ACCEPTANCE 57
5.1 Execution of the Model Performance Protocol 57
5.2 Overall Acceptability of the Proposed Model 59
5.3 Model Application 60
6.0 REFERENCES 63
APPENDIX A. Reviewer's Checklist A-1
APPENDIX B. Narrative Example B-1
APPENDIX C. Procedure for Calculating Non-Overlapping Confidence
Intervals C-1
List of Tables
Number Title Page
3.1 Statistical Estimators and Basis for Confidence
Limits on Performance Measures 22
3.2 Summary of Candidate Data Sets for Model
Evaluation 28
3.3 Summary of Data Sets and Performance Statistics for
Various Performance Evaluation Objectives 32
3.4 Suggested Format for the Model Comparison Protocol 39
5.1 Suggested Format for Scoring the Model Comparison 58
List of Figures
Number Title Page
1 Decision Flow Diagram for Evaluating a Proposed
Air Quality Model xii
3.1 Observed and Predicted Concentration Pairings Used
in Model Performance Evaluations 26
Summary
This document describes interim procedures for use in accepting, for
a specific application, a model that is not recommended in the Guideline
on Air Quality Models.¹ The primary basis for the model evaluation
assumes the existence of a reference model which has some pre-existing
status and to which the proposed nonguideline model can be compared from
a number of perspectives. However, for some applications it may not be
possible to identify an appropriate reference model, in which case specific
requirements for model acceptance must be identified. Figure 1 provides
an outline of the procedures described in this document.
After analysis of the intended application, or the problem to be
modeled, a decision must be made on the reference model to which the
proposed model can be compared. If an appropriate reference model can be
identified, then the relative acceptability of the two models is determined
as follows. The model is first compared on a technical basis to the
reference model to determine if it can be expected to more accurately
estimate the true concentrations. Next a protocol for model performance
comparison is written and agreed to by the applicant and the appropriate
regulatory agency. This protocol describes how an appropriate set of
field data will be used to judge the relative performance of the proposed
and the reference model. Performance measures recommended by the American
Meteorological Society² will be used in describing the comparative perfor-
mance of the two models in an objective scheme. That scheme should
consider the relative importance to the problem of various modeling
objectives and the degree to which the individual performance measures
support those objectives. Once the plan for performance evaluation is
[Figure 1. Decision Flow Diagram for Evaluating a Proposed Air Quality Model
(applicable sections of the document are indicated in parentheses). When a
reference model is available, the path runs: write technical description of
the proposed model (2.4), technical comparison of models (2.6), write
performance evaluation protocol (3.3), collect performance evaluation data
(4.0), conduct performance evaluation (5.1), and determine acceptability (5.2).
When no reference model is available: technical analysis of the proposed
model (2.7); if unsound or not applicable, the exercise ends; if acceptable
or marginal, write protocol (3.4), collect data (4.0), conduct evaluation
(5.1), and determine acceptability (5.2).]
written and the data to be used are collected/assembled, the performance
measure statistics are calculated and the weighting scheme described in
the protocol is executed. Execution of the decision scheme will lead to
a determination that the proposed model performs better, worse or about
the same as the reference model for the given applications. The final
determination on the acceptability of the proposed model should be primarily
based on the outcome of the comparative performance evaluation. However,
it may also be based, if so specified in the protocol, on results of the
technical evaluation, the ability of the proposed model to meet minimum
standards of performance, and/or other specified criteria.
If no appropriate reference model is identified, the proposed model
is evaluated as follows. First the proposed model is evaluated from a
technical standpoint to determine if it is well founded in theory, and is
applicable to the situation. This involves a careful analysis of the
model features and intended usage in comparison with the source configura-
tion, terrain and other aspects of the intended application. Secondly,
if the model is considered applicable to the problem, it is examined to
see if the basic formulations and assumptions are sound and appropriate
to the problem. (If the model is clearly not applicable or cannot be
technically supported, it is recommended that no further evaluation of
the model be conducted and that the exercise be terminated.) Next, a
performance evaluation protocol is prepared that specifies what data
collection and performance criteria will be used in determining whether
the model is acceptable or unacceptable. Finally, results from the
performance evaluation should be considered together with the results of
the technical evaluation to determine acceptability.
INTERIM PROCEDURES FOR
EVALUATING AIR QUALITY MODELS
1.0 INTRODUCTION
This document describes interim procedures that can be used in judging
whether a model, not specifically recommended for use in the Guideline on
Air Quality Models¹, is acceptable for a given regulatory action. It
identifies the documentation, model evaluation and data analyses desirable
for establishing the appropriateness of a proposed model.
This document is only intended to assist in determining the accepta-
bility of a proposed model for a specific application (on a case-by-case
basis). It is not intended for use in determining which of several
models and/or model options (similar to an optimization procedure) are
best for application to a given situation, nor does it address procedures
to be used in model development and "validation." It is not for use in
determining whether a new model could be acceptable for general use
and/or should be included in the Guideline on Air Quality Models. This
document also does not address criteria for determining the adequacy of
alternative data bases to be used in models, except in the case where a
nonguideline model requires the use of a unique data base. The criteria
or procedures generally applicable to the review of fluid modeling procedures
are contained elsewhere.³,⁴,⁵
The remainder of Section 1 describes the need for a consistent set of
evaluation procedures, provides the basis for performing the evaluation,
and suggests how the task of model evaluation should be coordinated
between the applicant and the control agency. Section 2 describes the
preliminary technical analysis needed to define the regulatory problem,
the choice of the reference and proposed models, and the regulatory
consequences of applying these models. Section 2 also contains a suggested
method of analysis to determine the applicability of the proposed model
to the situation. Section 3 discusses the protocol to be used in judging
the performance of the proposed model. Section 4 describes the design of
the data base for the performance evaluation. Section 5 describes the
execution of the performance evaluation and provides guidance for combining
these results with other criteria to judge the overall acceptability of
the proposed model. Appendix A provides a reviewer's checklist which can
be used by the appropriate control agency in determining the acceptability
of the applicant's evaluation. Appendix B provides an example illustrating
the use of the procedures. Appendix C describes a procedure for calculating
non-overlapping confidence intervals.
1.1 Need for Model Evaluation Procedures
The Guideline on Air Quality Models makes specific recommenda-
tions concerning air quality models and the data bases to be used with
these models. The recommended models should be used in all evaluations
relative to State Implementation Plans (SIPs) and Prevention of Signifi-
cant Deterioration (PSD) unless it is found that the recommended model is
inappropriate for a particular application and/or a more appropriate
model or analytical procedure is available. However, for some applications
the guideline does not recommend specific models and the appropriate model
must be chosen on a case-by-case basis. Similarly, the recommended data
bases should be used unless such data bases are unavailable or inappropriate.
In these cases, the guideline states that other models and/or data bases
deemed appropriate by the EPA Regional Administrator may be used.
Models are used to determine the air quality impact of both new
and existing sources. The majority of cases where nonguideline models
have been proposed in recent years have involved the review of new sources,
especially in connection with PSD permit applications. However, most
Regional Offices have also received proposals to use nonguideline models
for SIP relaxations and for general area-wide control strategies.
Many of the proposals to use nonguideline models involve modeling
of point sources in complex terrain and/or a shoreline environment.
Other applications have included modeling point sources of photochemical
pollutants, modeling in extreme environments (arctic/tropics/deserts),
modeling of fugitive emissions and modeling of burning where smoke manage-
ment (a form of intermittent control) is practiced. For these applications
a refined approach is not identified in the Guideline on Air Quality Models.
Also a relatively small number of proposals have involved applications
where a recommended model is appropriate, but another model is considered
preferable.
The types of nonguideline models proposed have included: (1)
minor modification of computer codes to allow a different configuration/
number of sources and receptors that essentially do not change the estimates
from those of the basic model; (2) modifications of basic components in
recommended models, e.g., different dispersion coefficients (measured or
estimated), wind profiles, averaging times, etc.; and (3) completely new
models that involve non-Gaussian approaches and/or phenomenological
modeling, e.g. temporal/spatial modeling of the wind flow field.
The Guideline on Air Quality Models, while allowing for the use
of alternative models in specific situations, does not provide a technical
basis for deciding on the acceptability of such techniques. To assure a
more equitable approach in dealing with sources of pollution in all
sections of the country it is important that both the regulatory agencies
and the entire modeling community strive toward a consistent approach in
judging the adequacy of techniques used to estimate concentrations in the
ambient air. The Clean Air Act⁶ recognizes this goal and states that the
"Administrator shall specify with reasonable particularity each air quality
model or models to be used under specified sets of conditions ..."
The use of a consistent set of procedures to determine the accep-
tability of nonguideline models should also serve to better ensure that
the state-of-the-science is reflected. A properly constructed set of
evaluation criteria should not only serve to promote consistency, but
should better serve to ensure that the best technique is applied. It
should be noted that a proposed model cannot be proprietary since it may
be subject to public examination and could be the focus of a public
hearing or other legal proceeding.
1.2 Basis for Evaluation of Models
The basis for accepting a proposed model for a specific appli-
cation, as described in this document, involves a comparison of performance
between the proposed model and an applicable reference model. The proposed
model would be acceptable for regulatory application if its performance
is better than that of the reference model. It should not be applied to
the problem if its performance were inferior to that of the reference
model. This model should also meet other criteria that may be specified
in the protocol.
A second basis for accepting or rejecting a proposed model
could involve the use of performance criteria written specifically for
the intended application. While this procedure is limited by a lack of
experience in writing such criteria and the necessity of considerable
subjectivity, it is recognized that in some situations it may not be
possible to specify an appropriate reference model. Such a scheme should
ensure that the proposed model is technically sound and applicable to the
problem. Further, the model should pass certain performance requirements
that are acceptable to all parties involved. Marginal performance together
with a marginal determination on technical acceptability would suggest
that the model should not be used.
At the present time one cannot set down a complete set of
objective evaluation criteria and standards for acceptance of models
using these concepts. Bases for such objective criteria are lacking in
a number of areas, including a consistent set of standards for model
performance, scientific consensus on the nature of certain flow phenomena
such as interactions with complex terrain, etc. However, this document
provides a framework for inclusion of future technical criteria, as well
as currently available criteria.
1.3 Coordination with Control Agency
The general philosophy of this document is that the applicant
or the developer of the model should perform the analysis. The reviewing
agency should review this analysis, perform checks, and/or perform an
independent analysis. The reviewing agency must have access to all of
the basic information that went into the applicant's analysis (model
computer code, all input data, all air quality data) so that an indepen-
dent judgment is possible.
To avoid costly and time-consuming delays in execution of the
model evaluation, the applicant is strongly urged to maintain close
liaison with the reviewing agency(s), both at the beginning and throughout
the project. A minimum* of two reports should be submitted to the control
agency for review and subsequent negotiation. The first report should
contain the preliminary analysis, the protocol for the performance evalua-
tion and the design of the data base network. Before any monitors are
deployed or data collection begins, it is important that the control
agency concur on all aspects of the planned evaluation, including choice
of the reference model, design of the performance evaluation protocol and
the design of the data base network. The second report would be submitted
at the conclusion of the study. It should describe the data base, the
results of executing the protocol, and the model chosen for application.
*As a mechanism to maintain close liaison between the source and
control agency, the submission of other periodic progress reports is
encouraged.
2.0 PRELIMINARY ANALYSIS
As a prerequisite to design of the performance evaluation and the
data base network, it is necessary to develop and document a complete
understanding of all regulatory and technical aspects of the model appli-
cation. This preliminary analysis establishes the regulatory requirements
of the application and describes the source and its surroundings. Based
on these factors, the analysis identifies and describes a reference model
or historically based regulatory model which would normally be applied to
the source(s). The preliminary analysis includes concentration estimates
from the reference and proposed models, based on existing data and appro-
priate emission rates. If the protocol specifies that the technical
analysis is to be considered in the final decision (see Section 3) then
the application-specific technical aspects of the two models are compared
using techniques described in the Workbook for Comparison of Air Quality
Models.⁷ This workbook is used to develop a judgment on the scientific
credibility of the models for the regulatory application.
The outcome and primary purpose of the preliminary analysis is to
provide a focus for the performance evaluation (Section 3) and for identi-
fication of the requisite data bases (Section 4). A secondary purpose is
to provide a technical basis for judging the model, in the event that the
performance evaluation is inconclusive. The preliminary analysis require-
ments are detailed in the following subsections.
2.1 Regulatory Aspects of the Application
The preliminary analysis should establish the pollutant or
pollutants to be modeled, the averaging times, e.g. 3-hour, 24-hour, etc.
for these pollutants, and the limiting ambient criteria (standards, PSD
increments, etc.). The current regulatory classification, e.g., attain-
ment, nonattainment, PSD Class I, should be documented. The regulatory
boundaries of the area for which concentration estimates apply should also
be established. Existing emission limits, if any, should be identified.
For example, there may be a question whether the SO₂ emission
limits from several sources in an attainment area can be relaxed, and if
so, by how much. In this case, the 3-hour, 24-hour and annual SO₂ ambient
air standards apply, as do the Class II Increments for these averaging
times. There may also be a distant Class I PSD area for which any emission
relaxation could result in increment consumption; as such, incremental
concentration estimates corresponding to the three averaging times would
be required in that area. The allowable time frame for regulatory action
should also be identified since the evaluation of model performance
involves a significant amount of time and expenditure of resources.
Allowable emission rates during the period of model evaluation should be
specified.
2.2 Source and Source Environment
To define the important source-receptor relationships involved
in a regulatory modeling problem, it is necessary to assemble a complete
description of the source and its surroundings. Information on the source
or sources involved includes the configuration of the sources, location
and heights of stacks, stack parameters (flow rates and gas temperature)
and location of any fugitive emissions to be included. Existing and
proposed emission rates should be identified for each averaging time that
corresponds to an ambient air quality standard applicable to pollutants
under consideration. In the case of complex industrial sources it is
also generally necessary to obtain a plant layout including dimensions
of plant buildings and other nearby buildings/obstacles. Sources should
be characterized in as much detail as possible, i.e., commensurate with
the input requirements of the models (see Sections 2.3 and 2.4). For
example, source emissions should be assembled as mobile and area/line
source segments, grid squares, etc.
Information on the source surroundings is usually best identi-
fied on a topographic map or maps that cover the modeling area. The
areal coverage is sometimes predetermined by political jurisdiction
boundaries, i.e., an air quality region. More often, however, modeling
is confined to the region where any significant threat to the standards
or PSD increments is likely to exist. The locations of major existing
sources (for the pollutants in question), urban areas, PSD Class I areas,
and existing meteorological and air quality data should be identified on
the maps.
A determination should be made whether the source(s) in question
are located in an urban or rural setting. The recommended procedure for
making this determination utilizes the techniques of Auer⁸ where the land
use within a 3 km radius of the source is classified. Other techniques,
based on population and judgmental considerations, may be used if they can
be shown to be more appropriate.
The method to be used in establishing the ambient concentration
due to all other existing sources should be established. If nearby sources
are to be modeled, then their emissions and source characterization need
to be specified. Applicable background concentrations and the method
used to estimate them should be documented.
2.3 Reference Model
The reference model is the model that would normally be used
by the regulatory agency in setting emission limits for the source. The
choice of reference model should be made by the appropriate regulatory
agency and follow from guidance provided in the Guideline on Air Quality
Models.
However, not all modeling situations are covered by recommended
models. For example, models for point sources of reactive pollutants or
shoreline fumigation problems are not included. In other cases the model
normally used by the regulatory agency might be a screening technique
that does not lend itself easily to performance evaluations. In these
circumstances the applicant and the reviewing agency should attempt to
agree on an appropriate and technically defensible reference model, which
provides for hour-by-hour estimates based on the current technical literature
and on past experience. Major considerations are that the reference
model is applicable to the type of problem in question, has been described
in published reports or the open literature, and is capable of producing
concentration estimates for all averaging times for which a performance
measure statistic must be calculated (usually 1-hour and the averaging
times associated with the standards/increments). This latter requirement
usually* precludes the use of screening techniques which rely on assumed
meteorological conditions for a worst case.
Where it is clearly not possible to specify a reference model,
the proposed model must "stand alone" in the evaluation. In such cases
*Some screening techniques do contain provisions for hour-by-hour
estimates and as such they may be used.
the technical justification and the performance evaluation necessary to
determine acceptability should be more substantial. Section 2.7 discusses
a rationale for determining if the model is technically justified for use
in the application. Section 3.4 discusses some considerations in designing
the performance evaluation protocol when no reference model comparison is
involved.
2.4 Proposed Model
The model proposed for use in the intended application must be
capable of estimating concentrations corresponding to the regulatory
requirements of the problem as identified in Section 2.1. In order to
conduct the performance evaluation, the model should be capable of sequen-
tially estimating hourly concentrations based on meteorological and
emission inputs.
A complete technical description of the model is needed for the
analysis in Sections 2.6 or 2.7. This technical description should
include a discussion of the features of the proposed model, the types of
modeling problems for which the model is applicable, the mathematical
relationships involved and their basis, and the limitations of the model.
The model description should take the form of a report or user's manual
that completely describes its operation. Published articles which describe
the model are also useful. If the model has been applied to other problems,
a review of these applications should be documented. For models designed
to handle complex terrain, land/water interfaces and/or other special
situations, the technical description should focus on how the model
treats these cases. To the maximum extent possible, evidence for the
validity of the methodologies should be included.
2.5 Preliminary Estimates
Once the reference and proposed models are identified, it is
essential that, at least in a preliminary sense, the consequences of
applying each of these models to the regulatory problem be established.
The questions to be answered are: (1) What are the preliminary concentra-
tion estimates for each model that would be used to establish emission
limits? (2) Where are the locations of such critical concentrations? and
(3) What are the differences between estimates at these locations? The
preliminary estimates should utilize the appropriate emission rates for
the regulatory problem and whatever representative meteorological data
are available before the evaluation.* In those infrequent cases where no
representative meteorological data can be identified, it may be necessary
to collect on-site data before making preliminary estimates.
It is recommended that two or three separate preliminary estimates
of the concentration field be made. The first set of estimates should be
made with the screening techniques mentioned or referenced in the Guideline
on Air Quality Models. The second set of estimates should be done with
the proposed model and the third set with the reference model. Estimates
for all applicable averaging times should be calculated. The three sets
of estimates serve to define the modeling domain and critical receptors.
They also aid in determining the applicability of the proposed model (Sec-
tions 2.6 and 2.7), the development of a performance evaluation protocol,
and the design of requisite data networks (Sections 3 and 4).
*A final set of model estimates, to be used in decision making, may
utilize additional data collected during the performance evaluation as
input to the appropriate model.
2.6 Technical Comparison with the Reference Model
When an appropriate reference model can be identified it may
prove useful to compare the proposed model with the reference model.
Emphasis should be on dispersion conditions and subareas of the modeling
domain that are most germane to the regulatory and technical aspects of
the problem (Sections 2.1 and 2.2). The procedures described in the Work-
book for Comparison of Air Quality Models are appropriate for this compari-
son. This Workbook contains a procedure whereby a proposed model is
qualitatively compared, on technical grounds, to the reference model, and
the intended use of the two models and the specific application are taken
into account.
The Workbook procedure is application-specific; that is, the
results depend upon the specific situation to be modeled. The reference
model serves as a standard of comparison against which the user gages the
proposed model. The way in which the proposed model treats twelve aspects
of atmospheric dispersion, called "application elements," is determined.
These application elements represent physical and chemical phenomena that
govern atmospheric pollutant concentrations and include such aspects as
horizontal and vertical dispersion, emission rate, and chemical reactions.
The importance of each element to the application is defined in terms of
an "importance rating." Tables giving the importance ratings for each
element are provided in the Workbook, although they may be modified under
some circumstances. The heart of the procedure involves an element-by-
element comparison of the way in which each element is treated by the two
models. These individual comparisons, together with the importance
ratings for each element in the given application, form the basis upon
which the final comparative evaluation of the two models is made.
It is especially important that the user understand the physi-
cal phenomena involved, because the comparison of two models with respect
to the way that they treat these phenomena is basic to the procedure.
Sufficient information is provided in the Workbook to permit these compari-
sons. Expert advice may be required in some circumstances. If alternate
procedures are used to complete the technical comparison of models, they
should be discussed with the reviewing agency. The results of the compari-
son may be used in the overall model evaluation in Section 5.
2.7 Technical Evaluation When No Reference Model Is Used
If it is not possible to identify an appropriate reference model
(Section 2.3), then the procedures of Section 2.6 cannot be used and the
proposed model must be technically evaluated on its own merits. The
technical analysis of the proposed model should attempt to qualitatively
answer the following questions:
1. Are the formulations and internal constructs of the model
well founded in theory?
2. Does the theory fit the practical aspects and constraints
of the problem?
To determine whether or not the underlying assumptions have
been correctly and completely stated requires an examination of the basic
theory employed by the model. The technical description of the model
discussed in Section 2.4 should provide the primary basis for this exami-
nation. The examination of the model should be divided into several
subparts that address various aspects of the formulation. For example,
for some models it might be logical to separately examine the methodologies
used to characterize the emissions, the transport, the diffusion, the
plume rise, and the chemistry. For each of these model elements it should
be determined whether the formulations are based on sound scientific,
engineering and meteorological principles and whether all aspects of each
element are considered. Unsound or incomplete specification of assumptions
should be flagged for consideration of their importance to the actual
modeling problem.
For some models, e.g., those that entail a modification to a
model recommended in the Guideline on Air Quality Models or to the reference
model, the entire model would not need to be examined for scientific credi-
bility. In such cases only the submodel or modification should be examined.
Where the phenomenological formulations are familiar and have been used
before, support for their scientific credibility can be cited from the
literature.
For models that are relatively new or utilize a novel approach
to some of the phenomenological formulations, an in-depth examination of
the theory should be undertaken. The scientific support for such models
should be established and reviewed by those individuals who have broad
expertise in the modeling science and who have some familiarity with the
approach and phenomena to be modeled.
To determine how well the model fits the specific application,
the assumptions involved in the methodologies proposed to handle each
phenomenon should be examined to see if they are reasonable for the given
situation. To determine whether the assumptions are germane to the
situation, particular attention should be paid to assumptions that are
marginally valid from a basic standpoint or those that are implicit and
unstated. For assumptions that are not met, it should be established
that these deficiencies do not cause significant differences in the
estimated concentrations. The most desirable approach takes the form of
sensitivity testing by the applicant to determine whether variations in
these assumptions are indeed critical. Such an exercise should be conducted,
if possible, and should involve estimates that reflect alternate assumptions
before and after modification of formulas or data. However, in many
cases this exercise may be too resource consumptive and the proof of model
validity should still rest with the performance evaluation described in
Section 3.
Execution of the procedures in this section should lead to a
judgment on whether the proposed model is applicable to the problem and
can be scientifically supported. If these criteria are met, the model
can be designated as appropriate and should be applied if its field perfor-
mance (Section 5) is acceptable. When a model cannot be supported for
use based on this technical evaluation, it should be rejected. When it
is found that the model could be appropriate, but there are questionable
assumptions, then the model may be designated as marginal and carried
forward through the performance evaluation.
2.8 Technical Summary
The final step in the technical analysis is to combine the
results of Sections 2.1 through 2.6/2.7 into a technical summary. This
summary should serve to define (1) the scope of the issues to be resolved
by the performance evaluation, (2) the areal and temporal extent of the
differences in estimates between the proposed and the reference models,
and (3) the reasons why the two models produce different estimates and/or
different concentration patterns.
The technical summary provides a focus for the performance
evaluation and the design of the requisite data base network. The results
of the technical summary are used in Section 3 to establish criteria for
the performance evaluation protocol and in Section 4 to define the requisite
data base.
3.0 PROTOCOL FOR PERFORMANCE EVALUATION
The goal of the model performance evaluation is to determine whether
the proposed model provides better estimates of concentrations germane
to the regulatory aspects of the problem than does the reference model.
To achieve this goal, model concentration estimates are compared with
observed concentrations in a variety of ways.* The primary methods of
comparison produce statistical information and constitute a statistical
performance evaluation.
This section describes a procedure for evaluating the performance of
the proposed model and for determining whether that performance is adequate
for the specific application. The procedure requires that a protocol be
prepared for comparing the performance of the reference and proposed
models. The protocol must be agreed upon by the applicant and appropriate
regulatory agencies prior to collection of the requisite data bases. The
description of the protocol includes a scheme to (1) weight the relative
importance of various performance measures to the regulatory goals of the
evaluation and (2) objectively discriminate between the relative performance
of the proposed and reference models. Some guidance is also provided on
how to write a protocol and evaluate model performance when comparison
with a reference model is not possible.
Before going into the details of the protocol, it is important to
review briefly some of the statistical performance measures that are commonly
used to assess the performance of a model against measured data. It is al-
so useful to consider how the ambient data base is commonly broken down into
*Concentration and meteorological data needed for the performance evaluation
are discussed in Section 4. The data base network design/requirements are
partially determined by the nature of and amount of performance statistics
defined in the protocol.
data subsets which are operated on by the statistical measures. Section
3.1 describes these performance measures and Section 3.2 briefly describes
some of the commonly used data subsets.
Model performance should be evaluated for each of the averaging times
specified in the appropriate regulation(s). In addition, performance for
models whose basic averaging time is shorter than the regulatory averaging
time should also be evaluated for that shorter period, provided, of course,
that the measurements are available for shorter averaging periods. For
example, a model may calculate sequential 1-hour concentrations for SO₂
from which concentrations for longer averaging periods can be computed.
Performance of this model can thus be evaluated separately for 1-, 3-,
and 24-hour averages and, if appropriate, for the annual mean. It should
be noted that although frequency distribution statistics are indicated in
Table 3.1, they may be considered somewhat redundant when performance
measures of both bias and precision are used. For this reason, graphical
displays of the cumulative frequency distribution of observed and predicted
values may be useful as supplementary aids in the evaluation process.
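To make the averaging-time computation concrete, the sketch below collapses
a model's sequential hourly output into the block averages used for the
3-hour and 24-hour comparisons. It is a minimal illustration, not part of
the original procedures; the series and variable names are hypothetical.

    import numpy as np

    def block_averages(hourly, hours_per_block):
        """Collapse a sequential hourly concentration series into
        non-overlapping block averages (e.g., 3-hour or 24-hour)."""
        n_blocks = len(hourly) // hours_per_block
        trimmed = np.asarray(hourly[:n_blocks * hours_per_block], dtype=float)
        return trimmed.reshape(n_blocks, hours_per_block).mean(axis=1)

    # Hypothetical year of hourly SO2 estimates from a model (ug/m3)
    hourly_so2 = np.random.default_rng(0).gamma(2.0, 15.0, size=8760)

    avg_3hr = block_averages(hourly_so2, 3)     # 2920 three-hour averages
    avg_24hr = block_averages(hourly_so2, 24)   # 365 twenty-four-hour averages
    annual_mean = hourly_so2.mean()             # annual average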
3.1 Performance Measures
The basic tools used in determining how well a model performs
in any given situation are performance measures. Performance measures can
be thought of as surrogate quantities whose values serve to characterize
the discrepancy between predictions and observations. Values obtained
from applying the performance measures to a given data base are most
often statistical in nature; however, certain performance measures (e.g.,
frequency distributions) may be more qualitative than quantitative in
nature.
Performance measures may be classified as magnitude of differ-
ence measures and correlation or association measures. Magnitude of
difference measures present a quantitative estimate of the discrepancy
between measured concentrations and concentrations estimated by a model
at the monitoring sites. Correlation measures quantitatively delineate
the degree of association between estimations and observations.
Table 3.1 lists a number of the more commonly used and recom-
mended performance statistics for model evaluation purposes. These
statistics and the corresponding nomenclature are taken from Fox⁹ and are
based primarily on the recommendations of an AMS Workshop on Dispersion
Model Performance held in 1980. Since the statistics and basis for
confidence limits are described extensively in most statistical texts,
only a brief description of how these measures apply to model performance
is presented below. Although each of the statistics provide a quantitative
measure of model performance, they are somewhat easier to interpret when
accompanied by graphical techniques such as histograms, isopleth analyses
and scatter diagrams.
3.1.1 Model Bias
Many of the performance statistics serve to characterize,
in a variety of ways, the behavior of the model residual, defined as the
observed concentration minus the estimated concentration. For example,
model bias is determined by the value of the model residual averaged over
an appropriate range of values. Large over- and underestimations may
cancel in computing this average. Supplementary information concerning
the distribution of residuals should therefore be supplied. This supple-
mentary information consists of confidence intervals about the mean
value, and histograms or frequency distributions of model residuals.
[Table 3.1. Statistical Estimators and Basis for Confidence Limits on
Performance Measures. The tabular layout is not recoverable from the scanned
source; as described in Sections 3.1.1 through 3.1.3, the table lists the
statistical estimator and the basis for confidence limits for each
recommended performance statistic: model bias (the mean residual d̄), noise
(the variance s²d and standard deviation sd of the residuals), gross
variability (mean square error, root-mean-square error and mean absolute
residual), correlation and regression statistics, and frequency
distributions.]
For certain applications, especially cases in which the
proposed model is designed to simulate concentrations occurring during
important meteorological processes, it is important to estimate model
bias under different meteorological conditions. The degree of data dis-
aggregation is a compromise between the desired goals of defining a large
enough number of meteorological categories to cover a wide range of con-
ditions and having a sufficient number of observations in each category
to calculate statistically meaningful values. For example, it may be
appropriate to stratify data by lumped stability classes, unstable (A-C),
neutral (D) and stable (E-F) rather than by individual classes A, B, C,
D, E and F. The use of wind speed classes may also be appropriate.
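The bias calculation can be made concrete as follows: a sketch that computes
the mean residual (observed minus predicted) with a Student's t confidence
interval, repeated within lumped stability classes. The paired arrays, class
labels, and data are hypothetical; this is an illustrative aid, not a
procedure prescribed by the document.

    import numpy as np
    from scipy import stats

    def bias_with_ci(observed, predicted, confidence=0.95):
        """Model bias (mean residual, observed minus predicted) with a
        Student's t confidence interval about the mean."""
        d = np.asarray(observed, float) - np.asarray(predicted, float)
        mean_d = d.mean()
        half = stats.t.ppf(0.5 + confidence / 2.0, df=d.size - 1) * stats.sem(d)
        return mean_d, (mean_d - half, mean_d + half)

    # Hypothetical paired hourly values with lumped stability classes
    rng = np.random.default_rng(1)
    obs = rng.gamma(2.0, 20.0, size=500)
    pred = obs * rng.normal(1.1, 0.3, size=500)      # a model with some bias
    stability = rng.choice(["A-C", "D", "E-F"], size=500)

    for cls in ("A-C", "D", "E-F"):
        b, ci = bias_with_ci(obs[stability == cls], pred[stability == cls])
        print(f"class {cls}: bias {b:6.1f}, 95% CI ({ci[0]:.1f}, {ci[1]:.1f})")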
3.1.2 Model Precision
Model precision refers to the average amount by which
estimated and observed concentrations differ, as measured by a different
type of residual than that used for bias, one that carries no
algebraic sign. While large positive and negative residuals can cancel
when model bias is calculated, the unsigned residuals comprising the
precision measures do not cancel. Thus, they provide an estimate of the
error scatter about some reference point. This reference point can be
the mean error or zero error. Two types of precision measures are the
noise, which delineates the error scatter about the mean error, and the
gross variability, which delineates the error scatter about zero error.
The performance measure for noise is either the variance
of the residuals, s²d, or the standard deviation of the residuals, sd.
The performance measure for gross variability is the mean square error,
or the root-mean-square error. An alternate performance measure for
the gross variability is the mean absolute residual, |d̄|. The mean
absolute residual is statistically more robust than the root-mean-square-
error; that is, it is less affected by removal of a few extreme values.
Supplementary analyses for model precision should include
confidence limits, as appropriate, and computation of these measures for
selected meteorological categories as discussed in Section 3.1.1.
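The precision measures named above follow directly from the residuals; the
sketch below (a minimal illustration, not part of the original text) computes
the noise and the two gross-variability measures.

    import numpy as np

    def precision_measures(observed, predicted):
        """Noise and gross-variability measures computed from residuals."""
        d = np.asarray(observed, float) - np.asarray(predicted, float)
        return {
            "noise_sd": d.std(ddof=1),              # scatter about the mean error
            "rmse": np.sqrt(np.mean(d ** 2)),       # scatter about zero error
            "mean_abs_residual": np.abs(d).mean(),  # robust alternative to RMSE
        }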
3.1.3 Correlation Analyses
Correlation analyses involve parameters calculated from
linear least squares regression and associated graphical analyses. The
numerical results constitute quantitative measures of the association
between estimated and observed concentrations. The graphical analyses
constitute supplementary qualitative measures of the same information.
There are three types of correlation analyses: coupled space-time, spatial,
and temporal analyses.
Coupled space-time correlation analysis involves computing
the Pearson's correlation coefficient, r, or an equivalent nonparametric
coefficient such as Spearman's ρ or Kendall's τ. The parameters a and b
of the linear least squares regression equation should be included. A
scattergram of the observed and predicted data pairs is supplementary information
which should be presented.
Spatial correlation analysis involves calculating the
spatial correlation coefficient and presenting isopleth analyses of the
predicted and observed concentrations for particular periods of interest.
The spatial coefficient measures the degree of spatial alignment between
the estimated and observed concentrations. The method of calculation
involves computing the correlation coefficient for each time period and
determining an average over all time periods. Estimates of the spatial
correlation coefficient for single source models are most reliable for
calculations based on data intensive networks such as those contained in
a tracer study. Isopleths of the distributions of estimated and observed
concentrations for periods of interest should be presented and discussed.
Temporal correlation analysis involves calculating the
temporal correlation coefficient and presenting time series of observed
and estimated concentrations or of the model residual for each monitoring
location. The temporal correlation coefficient measures the degree of
temporal alignment between observed (Co) and predicted (Cp) concentrations.
The method of calculation is similar to that for the spatial correlation
coefficient. Time series of Co and Cp or of model residuals should be
presented and discussed for each monitoring location.
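The three correlation analyses can be sketched as follows, assuming observed
and predicted values arranged in a periods-by-stations matrix. The layout is
a hypothetical assumption for illustration; the document prescribes no
particular data structure.

    import numpy as np

    def coupled_space_time(obs, pred):
        """Pearson r plus least-squares regression (pred = a + b * obs)
        over all station-hour pairs; obs, pred: (n_periods, n_stations)."""
        o, p = obs.ravel(), pred.ravel()
        r = np.corrcoef(o, p)[0, 1]
        b, a = np.polyfit(o, p, 1)      # slope b, intercept a
        return r, a, b

    def spatial_correlation(obs, pred):
        """Correlate across stations within each period, then average."""
        return np.mean([np.corrcoef(o, p)[0, 1] for o, p in zip(obs, pred)])

    def temporal_correlations(obs, pred):
        """Correlation over time at each monitoring station."""
        return [np.corrcoef(obs[:, j], pred[:, j])[0, 1]
                for j in range(obs.shape[1])]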
3.2 Data Organization
The performance measures described above may be applied to various
combinations of observed and predicted values depending on the objectives
of the evaluation and the nature of the regulatory problem (i.e., the
intended application). For example, when "once per year" ambient stan-
dards are of primary concern, observed and predicted maximum (or near
maximum) concentrations should be compared. Since complete space and/or
time pairing is often not important from a regulatory point of view, the
appropriate data combination need not be restricted to only concentration
pairs having the same hour or location.
There are many possible combinations of observed and predicted
concentrations that may be chosen for evaluation. Thus, it is useful to
organize, at least conceptually, the complete data set into a matrix of
observed and predicted values as exhibited in Figure 3.1.
[Figure 3.1. Observed and Predicted Concentration Pairings Used in Model
Performance Evaluations.]
Entries in the center of the figure are completely paired in time and space.
Entries
shown in the bottom two rows and last two columns represent, respectively,
maximum concentrations paired in space only and in time only.
Since the figure permits illustration of only a few data combina-
tions which may be of interest from a regulatory viewpoint, a more complete
tabulation of data combinations (data sets) is shown in Table 3.2. The
first type of data combination refers to "peak concentration" which by defi-
nition excludes the low concentration comparisons. Except for combination
A-3 (completely paired peak residuals), all data sets involve some degree
of spatial and/or temporal unpairing between observed and predicted
values. The second type of data combination refers to "all concentrations"
which comprise complete time and space pairing for all predicted and
observed values within a defined category. For example, data set B-l
refers to the set of all observed and predicted values at a given station,
paired in time. Since each station is evaluated separately, the total
number of data combinations is equal to the total number of stations.
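To show how candidate data sets of this kind can be extracted from such a
matrix, the sketch below forms data sets A-1, A-4a and B-1 from a
hypothetical hours-by-stations array (the layout and data are illustrative
assumptions; Table 3.2 below defines the data sets themselves).

    import numpy as np

    # Hypothetical matrices: rows are hourly events, columns are stations
    rng = np.random.default_rng(2)
    obs = rng.gamma(2.0, 20.0, size=(8760, 6))
    pred = obs * rng.normal(1.0, 0.4, size=(8760, 6))

    # Data set A-1: highest observed vs. highest predicted value for each
    # event (paired in time, not location)
    a1_obs, a1_pred = obs.max(axis=1), pred.max(axis=1)

    # Data set A-4a: highest 25 observed and highest 25 predicted values,
    # regardless of time or location (fully unpaired)
    a4a_obs = np.sort(obs, axis=None)[-25:]
    a4a_pred = np.sort(pred, axis=None)[-25:]

    # Data set B-1: all values at one station, paired in time
    b1_obs, b1_pred = obs[:, 0], pred[:, 0]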
The rationale for selecting particular data combinations and
statistics to evaluate various aspects of model performance is simplified
by first establishing major objectives to be accomplished by the performance
evaluation. The procedure for establishing these objectives and for assign-
ing levels of importance to each objective is discussed in the following
section.
3.3 Protocol Requirements
Because of the variety of statistical measures and data combi-
nations that might be considered for evaluation purposes, it is essential
that a written protocol be prepared and agreed to by the applicant and
appropriate control agency before the data collection and evaluation
Table 3.2. Summary of Candidate Data Sets for Model Evaluation
A. Peak Concentration
Comparisons
(A-1) Compare highest observed
value for each event with
highest prediction for
same event (paired in
time, not location).
(A-2) Compare highest observed
value for the year at
each monitoring station
with the highest prediction
for the year at the same
station (paired in location,
not time).
(A-3a) Compare maximum observed
value for the year with
highest predicted values
representing different time
or space pairing (fully
unpaired, paired in location,
paired in time, paired in
space and time).
(A-3b) Compare maximum predicted
value for the year with
highest observed values for
various pairings, as in (A-3a).
(A-4a) Compare highest N (=25)
observed and highest N
predicted values, regardless
of time or location.
(A-4b) Compare highest N (=25)
observed and highest N
predicted values, regardless
of time, for a given monitor-
ing location. (A data set
for each station.)
(A-5) Same as (A-4a), but for sub-
sets of events by meteoro-
logical conditions (stability
and wind speed) and by time
of day.
B. All-Concentrations
Comparisons
(B-l) Compare observed and
predicted values at a
given station, paired
in time. (A data set
for each station.)
(B-2) Compare observed and
predicted values for
a given time period,
paired in space (not
appropriate for data
sets with few moni-
toring sites).
(B-3) Compare observed and
predicted values at all
stations, paired in
time and location (one
data set) and by time
of day.
(B-4) Same as (B-3), but for
subsets of events by
meteorological condi-
tions (stability and
wind speed) and by time
of day.
process is initiated. Conceptually, the protocol describes how various
performance measures will be used to compare the relative performance of
the proposed and reference models in a manner that is most relevant to
the regulatory need (the intended application as described in Section
2.1). To organize this concept it is suggested that the protocol contain
four major components as follows: (1) a definition of the performance
evaluation objectives to be accomplished in terms of their relevance to
the regulatory application; (2) a compilation of specific data sets and
performance measures that will be applied under each performance objective;
(3) an objective scheme for assigning weights to each performance measure
and data set combination; and (4) an objective scheme for scoring the
performance of the proposed model relative to the reference model.
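As a minimal sketch of how components (3) and (4) might be made objective,
consider a weighted tally of measure-by-measure outcomes; the weights,
measure names, and scoring rule below are illustrative assumptions, not
values prescribed by this document.

    # Hypothetical weights: each performance measure/data set combination
    # receives a weight reflecting its regulatory importance.
    weights = {"peak_bias": 0.5, "all_hours_rmse": 0.3, "temporal_r": 0.2}

    # +1 where the proposed model performed significantly better, -1 where
    # the reference model did, 0 where the difference was not significant.
    outcomes = {"peak_bias": +1, "all_hours_rmse": 0, "temporal_r": -1}

    score = sum(weights[m] * outcomes[m] for m in weights)
    verdict = ("proposed better" if score > 0 else
               "reference better" if score < 0 else
               "about the same")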
This section discusses the factors to be considered in estab-
lishing such a protocol for an individual performance evaluation. Although
some experience has been gained in applying the techniques to actual
regulatory situations, it remains clear that the procedures described
below must remain general enough to adequately cover all types of regulatory
problems.
3.3.1 Performance Evaluation Objectives
The first step in developing the model performance protocol
is to translate the regulatory purposes associated with the intended model-
ing application into performance evaluation objectives, which, in turn,
can be linked to specific performance measures and data sets. This step
is important since each intended modeling application is unique with
respect to source configuration, the critical source-receptor relationships,
the types of ambient levels to be protected (e.g., NAAQS vs. PSD), averaging
times of concern (e.g., 1-hr, 3-hr, 24-hr, etc.), and the form of the
ambient standard (e.g., not to be exceeded more than once per year vs.
annual).
In most applications, the primary regulatory purposes
can be stated clearly in terms that relate directly to certain performance
measures and data sets. For example, if the primary regulatory purpose
is to prevent violations of short-term ambient standards which might be
threatened by construction of a large isolated SO₂ source, then the ability
of the models to accurately predict highest 3-hour and 24-hour concentrations
is critical. In this example situation, the primary performance objective
might be stated as: "Determine the accuracy of peak estimates in the
vicinity of the proposed plant."
While "peak accuracy" is the first order objective in
this example, other performance objectives can be stated that relate to
the ability of the selected models to perform over a variety of concen-
trations levels and conditions. For example, additional confidence can
be placed in a model if it is also accurate in estimating the magnitude
of lower concentrations at specific stations and for specific meteoro-
logical events. Thus, a second order objective might be stated as "Deter-
mine the accuracy of estimates of concentrations over a range of concentra-
tions, time periods, and stations."
A third performance evaluation objective which can be
derived from this example regulatory application is related to measures
of spatial and temporal correlation. While a model may adequately predict
peaks and average levels at given stations, a measure of additional
confidence can be gained if the model also traces the time sequence of
concentrations reasonably well. Thus, a third order performance evalua-
tion objective might be stated as "Determine the degree of correlation
between predicted and measured values in time and space." While corre-
lation is a reasonably stringent performance measure (time and/or space
pairing is required), it is ranked below the previous two performance
objectives. Even good correlation can be obtained in cases where the
magnitude of peak levels is poorly predicted and for which a large
overall bias exists.
It should be noted that the generic formulation and
number of performance objectives for any given application may differ
substantially from those illustrated here. In other words, the specific
regulatory purpose should be the guide for the selection of those perfor-
mance objectives that are most directly relevant to the intended application.
3.3.2 Selecting Data Sets and Performance Measures
Once the performance evaluation objectives are established,
it is necessary to choose among the various data combinations and performance
statistics listed in Tables 3.1 and 3.2. These are used to characterize
the ability of the models to meet the evaluation objective. Table 3.3
summarizes the more important data sets and performance statistics relevant
to each generic objective described above. These objectives have been
arbitrarily numbered in relative order of importance as they might pertain
to the hypothetical SO2 regulatory application described above. For an
actual application, any of the three generic objectives (or some other
derived objective) could have a higher level of importance depending on
the nature of the regulatory problem.
Table 3.3 shows, for each performance evaluation objective,
a suggested list of the most relevant data sets and performance measures
Table 3.3. Summary of Data Sets and Performance
Statistics for Various Performance Evaluation
Objectives

Performance                Data Sets        Performance               Supplementary
Evaluation                 (Table 3.2)      Statistics                Graphics
Objectives                                  (Table 3.1)

1. Determine Model         A-3a, A-3b       Single Valued Residuals   None
   Accuracy for            A-4a, A-4b, A-5  s²Co/s²Cp, d̄             Freq Dist of Top 25
   Peak Values             A-1, A-2         s²d, d̄                   Freq Dist of All Values

2. Determine Model         B-1, B-2,        s²d, d̄                   Selected Isopleths and
   Accuracy Over Entire    B-3, B-4                                   Time Series Plots
   Concentration Domain

3. Spatial and Temporal    B-1, B-2,        r, Regression Statistics  Scattergrams
   Correlation             B-3, B-4
                           A-1, A-2         r, Regression Statistics  Scattergrams

Notes: (1) If particular site(s) are crucial (i.e., PSD), then analyses
           should be confined to a site or a subset of important sites.
       (2) For reactive pollutants, performance measures should be
           developed for each of a number of selected days.
along with supplementary graphical displays that may prove useful in the
evaluation process. For example, the first order objective shown as
"Accuracy of Peak Values" has three data sets. Except for selected cases,
these data sets correspond to the peak concentration category shown in
Table 3.2. While each data set offers some measure of information regarding
accuracy of peak estimates, the focus is on different aspects of peak
levels that may be of greater importance in some applications. Data sets
A-3a and A-3b relate most directly to short-term ambient standards.
However, they suffer by being statistically non-robust compared to data
sets A-4 and A-5 which involve a greater number of highest values in the
computation of the performance measures. Data set A-1, since it consists
of a large number of values (one pair for each time period), is subject
to the least statistical variability but suffers by including many events
that may be below the concentration range of primary concern. Thus, the
tradeoff is between the degree of confidence desired and the degree of
regulatory relevance associated with each candidate data set/performance
statistic.
The performance statistics are directly tied to the
nature of the performance evaluation objective and the degree of natural
pairing between the measured and predicted values. Since the first and
second objectives both relate to accuracy, measures of bias and precision
(noise and variability) are indicated. The third performance evaluation
objective, by virtue of its definition, involves correlation and hence
the correlation coefficient, r, is indicated. Note that whenever perfor-
mance measures are applied to paired data (e.g., A-1, A-2), the measure
of precision is the noise, s²d, while for unpaired data (e.g., A-4a, A-4b,
A-5), the ratio of variances is indicated.
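These statistics are simple to compute once the data sets are
assembled. The following minimal sketch (in Python, with hypothetical
concentration values; the sketch is illustrative only and is not part of
Tables 3.1 or 3.2) shows how the bias, noise, variance ratio and
correlation coefficient might be calculated for a single data set:

    import numpy as np

    def performance_statistics(c_obs, c_pred):
        """Illustrative computation of the basic performance statistics
        for one data set of observed and predicted concentrations."""
        d = c_pred - c_obs                 # residuals: deviation of estimates
                                           # from measured concentrations
        d_bar = d.mean()                   # bias (d-bar)
        s2_d = d.var(ddof=1)               # noise (s2d), for paired data
        var_ratio = c_obs.var(ddof=1) / c_pred.var(ddof=1)  # s2Co/s2Cp, unpaired
        r = np.corrcoef(c_obs, c_pred)[0, 1]                # correlation coefficient
        return d_bar, s2_d, var_ratio, r

    # Hypothetical 3-hour SO2 concentrations (ug/m3) at one station:
    c_obs = np.array([120.0, 95.0, 60.0, 210.0, 33.0])
    c_pred = np.array([100.0, 80.0, 75.0, 260.0, 20.0])
    print(performance_statistics(c_obs, c_pred))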
A precise procedure for choosing data sets and perform-
ance statistics for each objective cannot be illustrated here. It must
be determined by consideration of the nature of the regulatory purpose(s),
the degree of confidence desired in the final result and the resources
available for the evaluation. In specific applications, some of the
statistics and/or data sets may be omitted depending upon the degree of
redundancy or relevance to the regulatory problem. For example, data set
B-3 uses all available pairs of data but requires that only one set of
statistics be calculated. This contrasts with data set B-1 which also
makes use of all data pairs but requires a separate calculation for each
station. The decision regarding the use of both data sets (i.e., B-3
and B-1) depends to some extent on the need to know how well the models
perform at specific station locations over all concentration levels.
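For bookkeeping purposes, the combinations finally chosen can be
recorded in a simple data structure keyed on the performance objectives.
A minimal sketch (in Python; the labels follow Tables 3.1 through 3.3,
but the representation itself is purely illustrative):

    # Illustrative encoding of the Table 3.3 choices; data set and
    # statistic labels refer to Tables 3.2 and 3.1, respectively.
    protocol = {
        "1. peak accuracy": [
            (["A-3a", "A-3b"], ["single valued residuals"]),
            (["A-4a", "A-4b", "A-5"], ["s2Co/s2Cp", "d-bar"]),
            (["A-1", "A-2"], ["s2d", "d-bar"]),
        ],
        "2. concentration domain": [
            (["B-1", "B-2", "B-3", "B-4"], ["s2d", "d-bar"]),
        ],
        "3. correlation": [
            (["B-1", "B-2", "B-3", "B-4"], ["r", "regression statistics"]),
            (["A-1", "A-2"], ["r", "regression statistics"]),
        ],
    }
    # Omitting a redundant combination (e.g., dropping B-3 where B-1 is
    # retained) is then a matter of deleting a single entry.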
3.3.3 Weighting the Performance Measures
Once the appropriate performance evaluation objectives,
data sets and performance measures are specified, it is necessary to
establish the relative importance each performance measure should hold
in the final decision scheme. It is suggested that the relative impor-
tance of the performance measures be objectively established by assigning
weights to the performance evaluation objectives and also to each per-
formance measure according to how well that measure characterizes the
objective. The assignment of weights in any given situation is somewhat
judgmental and may differ slightly among trained analysts. Thus, it is
important in the protocol to document the rationale used to establish
the relative weights. It is suggested that, in order to keep the problem
simple, weights be established on the basis of a percentage or fraction
of a total of 100 points.
Generally the first order objective would be weighted
most heavily while less important objectives would be weighted less
heavily depending upon their importance in the application. As an example,
the first order objectives might be weighted 50 percent, the second order
objectives 30 percent, and the third order objectives 20 percent. For
each performance objective, each combination of performance measures and
data sets must also be given a weight. Again the determination of the
appropriate weight for each performance measure is judgmental and should
be accompanied by a rationale. Some of the judgments involved, for
example, are: (1) Is model bias a more important factor than gross varia-
bility? (2) Is accurate prediction of the magnitude of the peak more
important than accurate prediction of the location of that peak? Answers
to these questions vary with the application and will result in different
assignment of weights accordingly. Those measures of performance which
best characterize the ability of either model to more accurately estimate
the concentrations that are critical to decision-making should carry the
most weight. If the estimated maximum concentration controls the emission
limit for the source(s), then more weight should be given to performance
measures that assess the models' ability to accurately estimate the maximum
concentration.
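Under the hypothetical 50/30/20 split above, the weights assigned to
the measure/data set combinations within each objective must themselves
exhaust that objective's share of the 100 points. A minimal sketch (in
Python; all weights are hypothetical):

    # Hypothetical allocation of 100 points across the three objectives
    # and, within each objective, across measure/data set combinations.
    objective_weights = {"peak accuracy": 50,
                         "concentration domain": 30,
                         "correlation": 20}

    measure_weights = {
        "peak accuracy": {"residuals (A-3a, A-3b)": 20,
                          "s2Co/s2Cp, d-bar (A-4, A-5)": 20,
                          "s2d, d-bar (A-1, A-2)": 10},
        "concentration domain": {"s2d, d-bar (B-1..B-4)": 30},
        "correlation": {"r (B-1..B-4)": 10, "r (A-1, A-2)": 10},
    }

    # Consistency checks: measure weights exhaust each objective's share,
    # and the objective weights exhaust the 100 available points.
    for objective, share in objective_weights.items():
        assert sum(measure_weights[objective].values()) == share
    assert sum(objective_weights.values()) == 100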
The magnitude of the weights should also take into
consideration the degree of confidence that can reasonably be assigned
to the performance statistic to be calculated, i.e., only minimal confi-
dence can be placed in single-valued residuals since these values are
non-robust and sensitive to unusual conditions. Generally, there will be
some trade-off between degree of confidence and relevance of the particular
performance measure. This means that the most relevant performance
measures may be given less weight than otherwise might be assigned were
confidence in the result not a critical factor.
3.3.4 Determining Scores for Model Performance
The final step in writing the protocol is to establish
how each performance statistic (calculated by applying a performance meas-
ure to a given data set) can be translated into a performance evaluation
score. Such a scheme involves definition of the rationale to be used in
determining the degree to which each pair of performance measure statistics
supports the advantage of one model over the other. Stated differently,
it is necessary to have a measure of the degree to which better performance
of one model over the other can be established for each performance measure.
It seems apparent that the more confidence one has that one model is per-
forming better than the other, the higher that model should score in the
final decision on the appropriateness of using that model. Clearly this
is important when at least one of the models is performing moderately
well. For example, if only one model appears to be unbiased, the degree
to which the other is biased can be a factor in quantifying the relative
advantage of the apparently unbiased model.
Qualitatively, the problem of determining which model is
performing better is straightforward. Clearly, the model with the smaller
residuals, the smaller bias, the smaller noise and the higher correlation
coefficient is better. The difficulty, which is not straightforward, is
how to meaningfully quantify the comparative advantage that one model has
over the other. There are several approaches that can be used.
In one approach, a "score" is derived for each pair (one
for each model) of performance statistics. The number of points which
are awarded is based on the degree of statistical significance attached
to the difference in each model's ability to reproduce the observed data.
The level of significance could be determined by the degree to which con-
fidence limits on performance measures of each model overlap or, alterna-
tively, on a hypothesis-testing method in which a specified confidence
level is assigned. A procedure for awarding points using confidence limits
is outlined in Appendix B. In the "example problem," positive points
are awarded for each performance statistic if the proposed model performs
better than the reference model and negative points if the reference
model performs better. The (absolute) magnitude of the score depends
on the relative difference in the performance of the two models but
is limited to the "maximum score" established for each measure. Such a
maximum score is directly proportional (or perhaps equivalent) to the
weight for each measure.
The reader is cautioned that the actual level of statis-
tical significance is based to a varying degree on the assumption that
model residuals are independent of one another, an assumption that is
clearly not true. For example, model residuals from adjacent time periods
(e.g., hour-to-hour) are known to be positively correlated. Also, the
proposed and reference model residuals for a given time period are related
since the residual for each model involves the same observed concen-
tration for a given data pairing. However, if such statistical limitations
are recognized, this approach can be useful as a quantitative indicator
for determining which model is performing better in a particular situation.
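A minimal sketch of such a scoring rule (in Python; the scaling of
points for overlapping intervals is one arbitrary choice among several,
and the numbers are hypothetical):

    def ci_score(ci_proposed, ci_reference, max_score):
        """Signed score from confidence intervals on, e.g., the absolute
        bias of each model; each interval is a (low, high) tuple, and a
        positive score favors the proposed model."""
        lo_p, hi_p = ci_proposed
        lo_r, hi_r = ci_reference
        if hi_p < lo_r:        # no overlap; proposed model clearly better
            return max_score
        if hi_r < lo_p:        # no overlap; reference model clearly better
            return -max_score
        # Overlapping intervals: scale the points by the separation of the
        # interval midpoints relative to the combined spread, so the score
        # never exceeds the maximum established for the measure.
        mid_p, mid_r = (lo_p + hi_p) / 2.0, (lo_r + hi_r) / 2.0
        spread = max(hi_p, hi_r) - min(lo_p, lo_r)
        return max_score * (mid_r - mid_p) / spread if spread > 0 else 0.0

    # Hypothetical 95% confidence limits on |bias| for a 10-point measure:
    print(ci_score((2.0, 8.0), (5.0, 12.0), max_score=10))  # 3.5, favors proposed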
A second approach for assigning points is to assign
points separately to each model for each performance measure; then, by
difference, derive a net point total for the proposed model; the point
total can be positive or negative, as discussed below. Various schemes,
both statistical and nonstatistical, have been proposed for assigning
points based on the numerical difference between measured and predicted
levels (i.e., the performance measures). A predetermined function of the
performance measure could be used to award points for each model. The
number of possible points could range from zero when the model performs
unacceptably (e.g., the bias exceeds the observed average by more than 50
percent) up to a maximum when the model performs perfectly (e.g., the
bias is zero). The net number of points assigned to the proposed model
would then be the number of points awarded to the proposed model minus
the number of points awarded to the reference model. A positive difference
favors the proposed model while a negative difference favors the reference
model. In essence, this second approach involves a subjective decision
as to what constitutes acceptable performance. Although this suggests a
"de facto" performance standard, the result may be informative since the
total accumulated points for each model would serve as an indicator of
how poorly or how well the models are doing overall in terms of the
particular application.
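A sketch of this second approach (in Python; the linear form and the
50 percent cutoff below are hypothetical choices, not prescribed values):

    def points_for_model(bias, obs_mean, max_points):
        """Predetermined scoring function: full points for zero bias,
        declining linearly to zero points when |bias| reaches 50 percent
        of the observed average (a hypothetical acceptability cutoff)."""
        return max_points * max(0.0, 1.0 - abs(bias) / (0.5 * obs_mean))

    obs_mean = 100.0   # hypothetical average observed concentration (ug/m3)
    net = (points_for_model(-12.0, obs_mean, 10)    # proposed model
           - points_for_model(30.0, obs_mean, 10))  # reference model
    # A positive net favors the proposed model; a negative net favors
    # the reference model.
    print(net)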
3.3.5 Format for the Model Comparison Protocol
A suggested format for the model comparison protocol,
based on the weighting and scoring scheme discussed above, is provided by
Table 3.4. The example format in Table 3.4 is for the first order perfor-
mance objectives. A similar format would be used for second order objec-
tives, third order objectives, etc. Examples of "filled out" tables
using this format are provided in Tables 1-4 of Appendix B.
In the first column of Table 3.4 the various data sets
or subsets which will be used to generate statistical or other information
are listed. The second column specifies the various combinations of time
and space pairing between estimates and measurements. The third column
lists the performance measures to be employed on each data set and time/
space pairing. The fourth column contains the numerical scheme that will
be used to determine the points to be awarded to the proposed model. The
fifth column lists the averaging times for which statistical or other
information will be obtained, and the sixth column lists the maximum
points or "weighting" for each statistic (or other objective quantity).
In the last column a rationale is to be provided for the choices made in
the preceding columns.

[Table 3.4. Format for the Model Comparison Protocol (illustrated for
the first order performance objectives), with columns for data sets or
subsets, time/space pairing, performance measures, the point-awarding
scheme, averaging times, maximum points (weighting), and rationale, as
described above.]
This format is intended to provide a quick visual summary
of the overall scheme for scoring the relative performance of the models
and for use in establishing criteria for selecting the best model. The
actual scoring should proceed in a straightforward manner once the perfor-
mance statistics have been calculated and used to allocate points for
each indicated data set. A total score can be derived by simply summing
the individual scores which will result in a net positive score if the
proposed model scores higher and a net negative score if the reference
model scores higher.
Although it is tempting to choose the higher scoring
model for use in the regulatory application, two additional criteria may
be considered in arriving at the final determination. First, it may be
desirable to establish, a priori, standards of performance that must be
met before either model may be selected. For example, a limit (positive
or negative) on peak bias could be set that, if exceeded, would be suffi-
cient for rejecting the proposed model.
A second selection criterion that may be considered is
establishment of a scoring point range that serves to separate outcomes
that clearly favor the proposed or reference model and outcomes that do
not clearly favor either model. When the score falls within the scoring
range or "window" where neither model is clearly favored, the final
rejection or acceptance of the proposed model could be decided by the
outcome of the technical evaluation (Section 2.6 or 2.7). Under this
scheme a marginal outcome of the performance evaluation coupled with a
marginal or unfavorable outcome of the technical evaluation would suggest
that the model not be accepted. Conversely, if the proposed model is
clearly technically well founded or superior to the reference model but
its performance score falls in the window, it probably should be accepted.
Several factors might influence the width of such a scoring margin, including
the representativeness and completeness of the data base and the need to
choose a model having a clear performance edge.
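The decision logic implied by such a window might be summarized as in
the sketch below (in Python; the window bounds are hypothetical, and the
use of the technical evaluation as the deciding factor follows the
discussion above):

    def select_model(total_score, window=(-10.0, 10.0),
                     technical_favors_proposed=False):
        """Illustrative acceptance rule: outside the window the total
        performance score decides; inside the window the outcome of the
        technical evaluation (Section 2.6 or 2.7) decides."""
        low, high = window
        if total_score > high:
            return "accept proposed model"
        if total_score < low:
            return "retain reference model"
        # Marginal performance outcome: fall back on the technical evaluation.
        return ("accept proposed model" if technical_favors_proposed
                else "retain reference model")

    print(select_model(4.5, technical_favors_proposed=True))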
If any or all of the above suggested additional criteria
are to be considered, then these criteria and their objective use in the
decision process need to be specified in the protocol. This requirement
is in concert with the basic philosophy of this document that the entire
decision-making scheme is specified "up-front" before any data are collected/
analyzed which might provide insight into the possible outcome.
After the model selection process is completed it is still
desirable to ensure that the chosen model will not underpredict measured
concentrations to the extent that the emission limit inferred from appli-
cation of the model would likely result in violations of the NAAQS or PSD
increments. This could occur in those cases where one model outscores the
other, and is thus judged to be the better performer, yet still underpredicts
the highest concentrations. To cover such an eventuality it may be
desirable to include criteria in the protocol that allow the emission
limits or the model to be adjusted to such an extent that attainment of
the ambient criteria will be ensured.
3.4 Protocol When No Reference Model Is Available
When no reference model is available, it is necessary to write
a different type of protocol based on case-specific criteria for the
model performance. However, at the present time, there is a lack of
scientific understanding and consensus among experts necessary to provide a
supportable basis for establishing such criteria for models. Thus the
guidance provided in this subsection is quite general in nature. It is
based primarily on the presumption that the applicant and the regulatory
agency can agree to certain performance attributes which, if met, would
indicate within an acceptable level of uncertainty that the model predic-
tions could be used in decision-making.
A set of procedures should be established based on objective
criteria that, when executed, will result in a decision on the accept-
ability of the model from a performance standpoint. As was the case for
the model comparison protocol, it is suggested that the relative importance
of the various performance measures be established. Table 3.3 may serve
as a guide. However, the performance score for each measure should be
based on statistics of d, or the deviation of the model estimates from
the true concentration, as indicated by the measured concentrations. For
each performance measure, criteria should be written in terms of a quanti-
tative statement. For example, it might be stated that the average model
bias should not be greater than plus or minus X at the Y percent signifi-
cance level (a minimal sketch of such a check follows the list below).
Some considerations in writing such criteria are:
(1) Conservatism. This involves the introduction of a
purposeful bias that is protective of the ambient standards or increments,
i.e., overprediction may be more desirable than underprediction.
(2) Risk. It might be useful to establish the maximum or
average deviation from the measured concentrations that could be allowed.
(3) Experience in the performance of models. Several
references in the literature9,10,11,12 describe the performance of various
models. These references can serve as a guide in determining the performance
that can be expected from the proposed model, given that an analogy with
the proposed model and application can be drawn.
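A minimal sketch of the bias criterion mentioned above (in Python;
the residuals, the limit X and the t value are hypothetical, and the
confidence-interval form of the check is one possible choice):

    import math

    def bias_within_limit(residuals, limit, t_crit):
        """Check that the mean bias (d-bar) lies within +/- limit at the
        chosen significance level: the confidence interval on the mean
        residual must fall entirely inside [-limit, +limit].  t_crit is
        the two-sided t value for n-1 degrees of freedom."""
        n = len(residuals)
        d_bar = sum(residuals) / n
        s = math.sqrt(sum((d - d_bar) ** 2 for d in residuals) / (n - 1))
        half_width = t_crit * s / math.sqrt(n)
        return -limit <= d_bar - half_width and d_bar + half_width <= limit

    # Hypothetical residuals (ug/m3), limit X = 25 ug/m3, 95% confidence:
    print(bias_within_limit([-10.0, 5.0, 12.0, -3.0, 8.0],
                            limit=25.0, t_crit=2.776))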
As was the case for the model comparison protocol, a decision
format or table analogous to Table 3.4 should be established. Execution
of the procedures in the table may lead to a conclusion that the performance
is acceptable, unacceptable or marginal.
4.0 DATA BASES FOR THE PERFORMANCE EVALUATION
This section describes interim procedures for choosing, collecting
and analyzing field data to be used in the performance evaluation. In
general there must be sufficient accurate field data available to adequately
judge the performance of the model in estimating all the concentrations
of interest for the given application.
Three types of data can be used to evaluate the performance of a
proposed model. The preferred approach is to utilize meteorological and
air quality data from a specially designed network of monitors and instru-
ments in the vicinity of the source(s) to be modeled (on-site data). In
some cases, especially for new sources, it is advantageous to use on-site
tracer data from a specifically designed experiment to augment, or be used
in lieu of, long-term continuous data. Occasionally, where an appropriate
analogy to the modeling problem can be identified, it may be possible to
utilize off-site data to evaluate the performance of the model.
As a general reference for this section the criteria and requirements
contained in the Ambient Monitoring Guideline for Prevention of Signifi-
cant Deterioration (PSD)13 should be used. Much of the information con-
tained in the PSD monitoring guideline deals with acquiring information
on ambient conditions in the vicinity of a proposed source, but such data
may not entirely fulfill the input needs for model evaluation.
All data used as input to the air quality model and its evaluation
should meet standard requirements or commonly accepted criteria for
quality assurance. New site-specific data should be subjected to a
quality assurance program. Quality assurance requirements for criteria
pollutant measurements are given in Section 4 of the PSD monitoring
guideline. Section 7 of the PSD monitoring guideline describes quality
assurance requirements for meteorological data. For any time periods
involving missing data, it should be specified how such periods will be
handled, e.g., by data substitution.
4.1 On-Site Data
The preferred approach to performance evaluation is to collect
an on-site data base consisting of concurrent measurements of emissions,
stack gas parameters, meteorological data and air quality data. Given an
adequate sample of these data, an on-site data base designed to evaluate
the proposed model relevant to its intended application should lead to a
definitive conclusion on its applicability. The most important goal of
the data collection network is to ensure adequate spatial and temporal
coverage of model input and air quality data.
4.1.1 Air Quality Data
The analysis performed in Section 2 serves to define the
requisite areal and temporal coverage of the data base and the range of
meteorological conditions over which data must be acquired. Once the
scope of the data base is established the remaining problem is to define
the density of the monitoring network, the specific locations of ambient
monitors and the period of time for which data are to be recorded. In
general it can be said that the type and quantity of data to be collected
must be sufficient to meet the needs of the protocol developed from the
guidance provided in Section 3. This determination is a judgment that
must be made in advance of the network design; some more specific con-
siderations are now provided.
The number of monitors needed to adequately conduct a
performance evaluation is often the subject of considerable controversy.
It has been argued that one monitor located at the point of maximum con-
centration for each averaging time corresponding to the standards or
increments should be sufficient. However, the points of maximum concen-
tration are not known but are estimated using the models that are them-
selves the subject of the performance evaluation, which of course unaccept-
ably compromises the evaluation. It is possible that the use of data
from one or two monitors in a performance evaluation may actually be worse
than no evaluation at all since no meaningful statistics can be generated.
Attempts to rationalize this problem may lead to erroneous conclusions on
the suitability of the models. When the data field is sparse, confidence
bands on the residuals for the two models will be broad. As a consequence,
the probability of statistically distinguishing the difference between
the performance of the two models may be unacceptably low.
At the other extreme is a large number of monitors, perhaps
40 or more. The monitors may cover the entire modeling domain or area
where significant concentrations, above a small cutoff, can be reasonably
expected. The monitors may be sufficiently dense that the entire concen-
tration field (isopleths) is established. Such a concentration field allows
the calculation of the needed performance statistics and, given adequate temporal
coverage, as discussed below, would likely result in narrow confidence bands
on the model residuals. With these narrow confidence bands it is easier
to distinguish between the relative capabilities of the proposed model vs.
the reference model. However, costs associated with such a network would
likely be large.
Thus, the number of monitors needed to conduct a statistically
meaningful performance evaluation should be judged in advance.
Some other factors that should be considered are:
1. Models or submodels that are designed to handle special
phenomena should only be evaluated over the spatial domain where those
phenomena would result in significant concentrations. Thus, the monitor-
ing network should be concentrated in that area, perhaps with a few out-
lying monitors for a safety factor.
2. In areas where the concentration gradient is expected
to be high (based on preliminary estimates) a high density of monitors
should be considered, while in areas of low concentration gradient a less
dense network is often adequate.
3. If historical on-site air quality and/or meteorological
data are available, these data should also be used to define the locations
and coverage of monitors.
In the temporal sense, some of the above rationale is also
appropriate. A short-term study may lead to low or no confidence in the
ability of the models (proposed and reference) to reproduce reality. A
multi-year effort will yield several samples and model estimates of the
second-highest short-term concentrations, thus providing some basis for a
statistically significant comparison of models for this frequently critical
estimate. Realistically, multi-year efforts usually have prohibitive
costs and one has to rely on somewhat circumstantial evidence, e.g. the
upper end of the frequency distribution, to establish confidence in the
models' capabilities to reproduce the second-highest concentration.
In general, the data collected should cover a period of re-
cord that is truly representative of the site in question, taking into account
variations in meteorological conditions, variations in emissions and
expected frequency of phenomena leading to high concentrations. One year
of data is normally the minimum, although short-term studies are sometimes
acceptable if the results are representative and the appropriate critical
concentrations can be determined from the data base. Thus short-term
studies are adequate if it can be shown that "worst case conditions" are
limited to a specific period of the year and that the study covers that
period. Examples might be ozone problems (summer months), shoreline
fumigation (summer months) and certain episode phenomena.
Models designed to handle special phenomena need only
have enough temporal coverage to provide an adequate sample of those
phenomena, i.e., one sufficient to produce statistically significant
results. For example, a downwash
algorithm might be evaluated on the basis of 50 or so observations in the
critical wind speed range.
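In practice this amounts to filtering the evaluation data down to the
conditions the submodel is meant to treat, as in the sketch below (in
Python; the record format and the 10-20 m/s critical range are
hypothetical):

    # Keep only the hours in the critical wind speed range when
    # evaluating, e.g., a downwash algorithm.
    records = [
        {"hour": 1, "wind_speed": 4.2, "c_obs": 15.0, "c_pred": 12.0},
        {"hour": 2, "wind_speed": 13.5, "c_obs": 310.0, "c_pred": 275.0},
        {"hour": 3, "wind_speed": 17.1, "c_obs": 280.0, "c_pred": 330.0},
    ]
    critical = [rec for rec in records if 10.0 <= rec["wind_speed"] <= 20.0]
    print(len(critical), "of", len(records), "hours fall in the critical range")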
It is important that the data used in model development
or model selection be independent of those data used in the performance
evaluation. In most cases, this is not a problem because the model is
either based on general scientific principles or is based on air quality
data from an analogous situation. However, in some semi-empirical approaches
where site-specific levels of pollutants are either an integral part of
the model or are used to select certain model options, an independent set
of data must be used for performance evaluation. Such an independent
data set may be collected at the same site as the one used in model
development, but the data set should be separated in time, e.g. use one
year of data for model development/tuning and a second year for performance
evaluation purposes.
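A sketch of such a temporal split (in Python; the years and record
format are hypothetical):

    # Reserve one year of data for model development/tuning and a
    # separate year for the independent performance evaluation.
    def split_by_year(hourly_records, development_year=1982,
                      evaluation_year=1983):
        development = [r for r in hourly_records if r["year"] == development_year]
        evaluation = [r for r in hourly_records if r["year"] == evaluation_year]
        return development, evaluation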
For air quality measurements used in the performance
evaluation, it is necessary to distinguish between (1) the contribution
of sources that are included in the model and (2) the contribution attri-
butable to background (or baseline levels). The Guideline on Air Quality
Models discusses some methods for estimating background. Considerable
care should be taken in estimating background so as not to bias the
performance evaluation.
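For example, if a single representative background value has been
estimated (the estimation methods themselves are discussed in the
Guideline on Air Quality Models), the source contribution to be compared
with the model estimates might be obtained as in this sketch (in Python;
the background and measured values are hypothetical):

    # Remove the estimated background before computing residuals, so that
    # the comparison reflects only the sources included in the model.
    background = 8.0                        # hypothetical background SO2 (ug/m3)
    measured = [42.0, 57.0, 23.0, 88.0]     # total measured concentrations
    source_contribution = [max(0.0, c - background) for c in measured]
    print(source_contribution)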
4.1.2 Meteorological and Emissions Data
Requisite supporting data such as meteorological and
emissions data should be collected concurrently with the ambient data.
The degree of temporal resolution of such data should be comparable to
that of the ambient data (usually 1-hour) or shorter if model input needs
so require. The location and type of meteorological sensors are generally
defined by the model input requirements. The more accurately one can pin-
point the location of the plume(s), the less noise will occur in the
model residuals. This can be done by increasing the spatial density and
degree of sophistication in meteorological input data, for models that
are capable of accepting such data. Continuous collection of representative
meteorological input data is important. If multiple (redundant) sensors
are to be deployed, a statement should be included in the protocol as to
how these data will be used in the performance evaluation.
Accurate data on emissions and stack gas parameters, over
the period of record, diminish the noise in the temporal statistics. The
more accurate the emissions data are, the less noise in the residuals.
Although data contained in a standard emissions inventory can sometimes
be used, it is generally necessary to obtain, and model explicitly with,
real-time emissions data (concurrent with the air quality data used in the
performance evaluation). "In-stack" monitoring is highly recommended
to ensure the use of emission rates and stack gas parameter data comparable
in time to measured ground-level concentrations.
4.2 Tracer Studies
The use of on-site tracer material to simulate transport and dis-
persion in the vicinity of a point or line source has received increasing
attention in recent years as a methodology for evaluating the performance
of air quality simulation models. This technique is attractive from a
number of standpoints:
1. It allows the impacts from an individual source to be
isolated from those of other nearby sources which may be emitting the
same pollutants;
2. It is generally possible to have a reasonably dense net-
work of receptors in areas not easily accessible for placement of a
permanent monitor;
3. It allows a precise definition of the emission rate;
4. It allows for the emissions from a proposed source to be
simulated.
There are some serious difficulties in using tracers to demon-
strate the validity of a proposed model application. The execution of
the field study is quite resource intensive, especially in terms of man-
power. Samplers need to be manually placed and retrieved after each test
and the samples need to be analyzed in a laboratory. Careful attention
must be paid to quality control of data and documentation of meteorological
conditions. As a result most tracer studies are conducted as a short-term
(a few days to a few weeks) intensive campaign where large amounts of
data are collected. If conducted carefully, such studies provide a
considerable amount of useful data for evaluating the performance of the
model. However, the performance evaluation is limited to those meteorolo-
gical conditions that occur during the campaign. Thus, while a tracer
study may allow for excellent spatial coverage of pollutant concentrations,
it provides a limited sample, biased in the temporal sense, and leaves an
unanswered question as to the validity of the model for all points on the
annual frequency distribution of pollutants at each receptor.
Another problem with tracer studies is that the plume rise
phenomena may not be properly simulated unless the tracer material can be
injected into the gas stream from an existing stack. Thus, for new
sources where the material is released from some kind of platform, the
effects of any plume rise submodel cannot be evaluated.
Given these problems, the following criteria should be considered
in determining the acceptability of tracer tests:
1. The tracer samples should be easily related to the averaging
time of the standards in question;
2. The tracer data should be representative of "worst case
meteorological conditions";
3. The number and location of the samplers should be sufficient
to ensure measurement of maximum concentrations;
4. Tracer releases should represent plume rise under varying
meteorological conditions;
5. Quality assurance procedures should be in accordance with
those specified or referenced in the PSD monitoring guideline, as well as
other commonly accepted procedures for tracer data;
6. The on-site meteorological data base should be adequate;
7. All sampling and meteorological instruments should be
properly maintained;
8. Provisions should be made for analyzing tracer samples at
remote locations and for maintaining continuous operations during adverse
weather conditions, where necessary.
Of these criteria, items 1 and 2 are the most difficult to
satisfy because the cost of the study precludes collection of data over
an annual period. Because of this problem it is generally necessary to
augment the tracer study by collecting data from strategically placed
monitors that are operated over a full year. The data are used to establish
the validity of the model in estimating the second-highest short term and
the annual mean concentration. Although it is preferable to collect these
data "on-site," this is usually not possible where a new plant is proposed.
It may be possible to use data collected at a similar site in a model
evaluation, as discussed in the next subsection.
As with performance evaluations using routine air quality data,
sufficient meteorological data must be collected during the tracer study
to characterize transport and dispersion input requirements of the model.
Since tracer study data are difficult to interpret, it is suggested that
the data and methodologies used to collect the data be reviewed by indivi-
duals who have experience with such studies.
4.3 Off-Site Data
Infrequently, data collected in another location may be sufficiently
representative of a new site so that additional meteorological and air
quality data need not be collected. The acceptability of such data rests
on a demonstration of the similarity of the two sites. The existing
monitoring network should meet minimum requirements for a network required
at the new site. The source parameters at the two sites should be similar.
The source variables that should be considered are stack height, stack gas
characteristics and the correlation between load and climatological con-
ditions.
A comparison should be made of the terrain surrounding each source.
The following factors should be considered:
1. The two sites should fall into the same generic category of
terrain:
a. flat terrain;
b. shoreline conditions;
c. complex terrain;
(1) three-dimensional terrain elements, e.g., isolated
hill,
(2) simple valley,
(3) two-dimensional terrain elements, e.g., ridge, and
(4) complex valley.
2. In complex terrain the following factors should be considered
in determining the similarity of the two sites:
a. aspect ratio of terrain, i.e., ratio of
(1) height of valley walls to width of valley,
(2) height of ridge to length of ridge, and
(3) height of isolated hill to width of hill base;
b. slope of terrain;
c. ratio of terrain height to stack/plume height;
d. distance of source from terrain, i.e., how close to
valley wall, ridge, isolated hill;
e. correlation of terrain feature with prevailing winds;
f. the relative size (length, height, depth) of the terrain
features.
It is very difficult to secure data sets with the above emission
configuration/terrain similarities. Nevertheless, such similarities are
of considerable importance in establishing confidence in the representa-
tiveness of the performance statistics. The degree to which the sites
and emission configuration are dissimilar is a measure of the degree to
which the performance evaluation is compromised.
More confidence can be placed in a performance evaluation which
uses data collected off-site if such data are augmented by an on-site
tracer study (See Section 4.2). In this case the considerations for
terrain similarities still hold, but more weight is given to the compara-
bility of the two sets of observed concentrations. On-site tracer
data can be used to test the ability of the model to spatially define the
concentration patterns if a variety of meteorological conditions are
observed during the tracer tests. Off-site data must be adequate to test
the validity of the model in estimating maximum concentrations.
5.0 MODEL ACCEPTANCE
This section describes interim criteria which can be used to judge
the acceptability of the proposed model for the specific regulatory appli-
cation. This involves execution of the performance protocol which will
lead to a determination as to whether the proposed model performs better
than the reference model. Alternatively, when no reference model is
available, the proposed model may be found to perform acceptably,
marginally, or unacceptably in relation to established site-specific
criteria. Depending on the
results of the performance evaluation, the overall decision on the accepta-
bility of the model might also consider the results of the technical
evaluation of Section 2.
5.1 Execution of the Model Performance Protocol
Execution of the model performance protocol involves: (1) collect-
ing the performance data to be used (Section 4); (2) calculating and/or
analyzing the model performance measures (Section 3.1); and (3) combining
the results in the objective manner described in the protocol (Section
3.3 or Section 3.4) to arrive at a decision on the relative performance
of the two models.
Table 5.1 provides a format which may be used to accommodate the
results of the model comparison protocol described in Section 3.3.5. If
a different protocol format is prepared, it should have the same goal,
i.e., to arrive at a decision on how the proposed model performs relative
to the reference model.
The first column lists the performance objectives. The next
three columns in Table 5.1 are analogous to the first three columns in
Table 3.4. The fifth column contains the actual score for each modeling
objective as well as the sub-scores for each supporting performance measure.
The scores in this column cannot exceed the maximum scores allowed in the
protocol. The last column is for the statistics, graphs, analyses and cal-
culations that determine the score for each performance measure, although
most of this information would probably be in the form of attachments.

[Table 5.1. Format for Recording the Results of the Model Comparison
Protocol, with columns for performance evaluation objectives, data sets,
time/space pairing, performance measures, actual scores and sub-scores,
and supporting statistics and analyses, as described above.]
5.2 Overall Acceptability of the Proposed Model
Until more objective techniques are available, it is suggested
that the final decision on the acceptability of the proposed model be based
primarily on the results of the performance evaluation. The rationale is
that the overall state of modeling science has many uncertainties regard-
less of what model is used, and that the most weight should be given to
actual proven performance. Thus when a proposed model is found to perform
better than the reference model, it should be accepted for use in the
regulatory application. If the model performance is clearly worse than
that of the reference model, it should not be used. Similarly, if the
performance evaluation is not based on comparison with a reference model,
acceptable performance should imply that the model be accepted, while
unacceptable performance would indicate that it is inappropriate.
As mentioned at the end of Section 3.3.5, the protocol may contain
other criteria, beyond the simple consideration of the score, to determine
whether a proposed model is acceptable. For example, the protocol might
specify that when the results of the performance evaluation are marginal
or inconclusive, the results of the technical evaluation discussed in
Section 2 should be used as an aid to deciding on the overall acceptability.
In this case, a favorable (better than the reference model) technical
review would suggest that the model be used, while a marginal or worse
determination would indicate that the model offers no improvement over
existing reference techniques. If Section 2.7 were used to determine
technical acceptability, a marginal or inconclusive determination on
scientific supportability combined with a marginal performance evaluation
would suggest that the model not be applied to the regulatory problem.
As mentioned in Section 3.3.5, the protocol might also
specify standards of performance or provisions to guard against underpre-
diction of critical concentrations. If so, these additional criteria must
be compared against the performance of the model (in the manner specified
in the protocol) before a final decision on the acceptability of the
model can be made.
5.3 Model Application
If, as a result of execution of the procedures described in this
document, the proposed model is found to be acceptable, then the model
should be appropriately applied to the intended application. The data
base requirements, the requirements for concentration estimates and other
applicable regulatory constraints described in the Guideline on Air
Quality Models should be considered.
Much of the data collected during the performance evaluation
may also be used during the application phase. For example, the meteoro-
logical data records may be used as model input. However, in order to
ensure that temporal variations of critical meteorological conditions are
adequately accounted for, it may be necessary to include a longer period
of record. Source characterization data collected during the performance
evaluation can be used to the extent that they reflect operating conditions
corresponding to the proposed emission limits.
The "proven" model is only applicable for the source-receptor
relationship for which the performance evaluation was carried out. Any
new application, even for a similar source-receptor relationship, in a
different location would generally require a new evaluation. Significant
differences in the source configuration, e.g., doubling the stack height
from that in existence during the model performance evaluation, may necessitate
a new evaluation.
6.0 REFERENCES
1. Environmental Protection Agency. "Guideline on Air Quality Models,"
EPA-450/2-78-027, Office of Air Quality Planning and Standards, Research
Triangle Park, N.C., April 1978.
2. Fox, D. G. "Judging Air Quality Model Performance," Bull. Am. Meteor.
Soc. 62, 599-609, May 1981.
3. Environmental Protection Agency. "Guideline for Use of Fluid Modeling
to Determine Good Engineering Practice Stack Height," EPA 450/4-81-003,
Office of Air Quality Planning and Standards, Research Triangle Park,
N.C., July 1981.
4. Environmental Protection Agency. "Guideline for Fluid Modeling of
Atmospheric Diffusion," EPA 600/8-81-008, Environmental Sciences Research
Laboratory, Research Triangle Park, N.C., April 1981.
5. Environmental Protection Agency. "Guideline for Determination of
Good Engineering Practice Stack Height (Technical Support Document for
Stack Height Regulations)," EPA 450/4-80-023, Office of Air Quality
Planning and Standards, Research Triangle Park, N.C., July 1981.
6. U. S. Congress. "Clean Air Act Amendments of 1977," Public Law 95-95,
Government Printing Office, Washington, D.C., August 1977.
7. Environmental Protection Agency. "Workbook for Comparison of Air
Quality Models," EPA 450/2-78-028a, EPA 450/2-78-028b, Office of Air
Quality Planning and Standards, Research Triangle Park, N.C., May 1978.
8. Auer, A. H., "Correlation of Land Use and Cover with Meteorological
Anomalies," J. Appl. Meteor. 17, 636-643, May 1978.
9. Bowne, N. E. "Preliminary Results from the EPRI Plume Model Validation
Project—Plains Site." EPRI EA-1788-SY, Project 1616, Summary Report, TRC
Environmental Consultants Inc., Wethersfield, Connecticut, April 1981.
10. Lee, R. F., et al. "Validation of a Single Source Dispersion Model,"
Proceedings of the Sixth International Technical Meeting on Air Pollution
Modeling and Its Application, NATO/CCMS, September 1975.
11. Mills, M. T., et al. "Evaluation of Point Source Dispersion Models,"
EPA 450/4-81-032, Teknekron Research, Inc., September 1981.
12. Londergan, R. J., et al. "Study Performed for the American Petroleum
Institute—An Evaluation of Short-Term Air Quality Models Using Tracer
Study Data," Submitted by TRC Environmental Consultants, Inc. to API,
October 1980.
13. Environmental Protection Agency. "Ambient Monitoring Guideline for
Prevention of Significant Deterioration (PSD)," EPA 450/4-80-012, Office
of Air Quality Planning and Standards, Research Triangle Park, N.C.,
November 1980.
Appendix A
Reviewer's Checklist
Preface
Each proposal to apply a nonguideline model to a specific situation
needs to be reviewed by the appropriate control agency which has jurisdiction
in the matter. The reviewing agency must make a judgment on whether the
proposed model is appropriate to use and should justify this judgment with
a critique of the applicant's analysis or with an independent analysis.
This critique or analysis normally becomes part of the record in the case.
It should be made available to the public hearing process used to justify
SIP revisions or used in support of other proceedings.
The following checklist serves as a guide for writing this critique or
analysis. It essentially follows the rationale in this document and is
designed to ensure that all of the required elements in the analysis are
addressed. Although it is not necessary that the review follow the format
of the checklist, it is important that each item be addressed and that the
basis or rationale for the determination on each item is indicated.
CHECKLIST FOR REVIEW OF MODEL EVALUATIONS
1. Technical Evaluation
A. Is all of the information necessary to understand the intended
application available?
1. Complete listing of sources to be modeled including source
parameters and location?
2. Maps showing the physiography of the surrounding area?
3. Preliminary meteorological and climatological data?
4. Preliminary estimates of air quality sufficient to (a) determine
the areas of probable maximum concentrations, (b) identify the probable
issues regarding the proposed model's estimates of ambient concentrations
and, (c) form a partial basis for design of the performance evaluation data
base?
B. Is the reference model appropriate?
C. Is enough information available on the proposed model to understand
its structure and assumptions?
D. Was a technical comparison of the proposed and reference models
conducted?
1. Were procedures contained in the Workbook for Comparison of Air
Quality Models followed? Are deviations from these procedures supportable
or desirable?
2. Are the comparisons for each application element complete and
supportable?
3. Do the results of the comparison for each application element
support the overall determination of better, same or worse?
E. For cases where a reference model is not used, is the proposed
model shown to be applicable and scientifically supportable?
II. Model Performance Protocol
A. Are all the performance measures recommended in the document to be
used? For those performance measures that are not to be used, are valid
reasons provided?
B. Is the relative importance of performance measures stated?
1. Have performance evaluation objectives that best characterize
the regulatory problem been properly chosen and objectively ranked?
2. Are the performance measures that characterize each objective
appropriate? Is the relative weighting among the performance measures
supportable?
C. How are the performance measure statistics for the proposed and
the reference model to be compared?
1. Are significance criteria used to discriminate between the
performance of the two models established for each performance measure?
2. Is the rationale to be used in scoring the significance criteria
supportable?
3. Is the proposed "scoreboard" associated with marginal model
performance supported?
4. Are there appropriate performance limits or absolute criteria
which must be met before the model could be accepted?
D. How is performance to be judged when no reference model is used?
1. Has an objective performance protocol been written?
2. Does this protocol establish appropriate site-specific performance
criteria and objective techniques for determining model performance relative
to these criteria?
3. Are the performance criteria in keeping with experience, with
the expectations of the model and with the acceptable levels of uncertainty
for application of the model?
III. Data Bases
A. Are monitors located in areas of expected maximum concentration
and other critical receptor sites?
B. Is there a long enough period of record in the field data to
judge the performance of the model under transport/dispersion conditions
associated with the maximum or critical concentrations?
C. Are the field data completely independent of the model development
data?
D. Where off-site data are used, is the situation sufficiently
analogous to the application to justify the use of the data in the model
performance evaluation?
E. Will enough data be available to allow calculation of the various
performance measures defined in the protocol? Will sufficient data be
available to reasonably expect that the performance of the model relative
to the reference model or to site-specific criteria can be established?
IV. Is the Model Acceptable
A. Was execution of the performance protocol carried out as planned?
B. Is the model acceptable considering the results of the performance
evaluation and the technical evaluation?
Appendix B
Narrative Example
Preface
This narrative example was developed to illustrate the use of the
Interim Procedures for Evaluating Air Quality Models. Although the
example substantially abbreviates many of the tasks involved in a real
model comparison problem and recommended in the interim procedures, it
does illustrate the task with which users are most unfamiliar, i.e., the
development and execution of the performance evaluation protocol. The
following comments/caveats are in order to help better understand and
utilize the example:
1. The preliminary technical/regulatory analysis of the intended
model application, while included in the example, is significantly fore-
shortened from that which would normally be needed for an actual problem.
2. The example was specifically designed to illustrate in a very
general way the components of the decision making process and the protocol
for performance evaluation. As such, the protocol incorporates a broad
spectrum of performance statistics with associated weights. The number of
statistics contained in this example is probably overly broad for most per-
formance evaluations, and perhaps even for the problem illustrated. Thus
its use is not intended to be a "model" for actual situations encountered.
For an individual performance evaluation it is recommended that a subset
of statistics be used, tailored to the performance evaluation objectives
of the problem. The statistical performance measures and associated weight-
ing scheme should be kept as simple (and as understandable) as possible.
Complexity implies more precision than exists in the performance measures
and weighting schemes and does not reflect the current level of knowledge
and experience in conducting performance evaluations.
3. Similarly, the method used to assign scores to each performance
statistic (non-overlapping confidence intervals) is not intended to be a
"model" to be followed but should be viewed as only one of several possible
techniques to accomplish the same goal.
4. The example does not illustrate the design of the field measurement
program required to obtain model evaluation data.
The original narrative example was developed in 1982 by TRC Inc.,
under contract to EPA. This revised example was adapted from the TRC
contract report to reflect the revisions made in the Interim Procedures
in September 1984.
Table of Contents
Preface B-3
Table of Contents B-5
List of Tables B-7
List of Figures B-9
1.0 Introduction B-11
2.0 Preliminary Analysis B-13
3.0 Model Evaluation Protocol B-19
3.1 NAAQS Attainment B-19
3.2 PSD Analysis B-31
4.0 Field Measurements B-35
5.0 Performance Evaluation Results and Model Selection B-37
5.1 Results for Model Performance Comparison in the NAAQS
Analysis B-46
5.2 Results for Model Performance Comparison in the PSD
Analysis B-47
6.0 Summary B-49
7.0 References B-51
List of Tables
Number
1 Model Comparison Protocol for NAAQS Analysis.
First-Order Objective: Predicted Highest
Concentrations B-21
2 Model Comparison Protocol for NAAQS Analysis.
Second-Order Objective: Predict the Domain of
Concentrations B-24
3 Model Comparison Protocol for NAAQS Analysis.
Third-Order Objective: Predict the Pattern
(Spatial and Temporal) of Concentrations B-26
4 Model Comparison Protocol for PSD Analysis.
First-Order Objective: Predict Highest
Concentrations in PSD Area B-32
5 Model Comparison Results for NAAQS Analysis.
First-Order Objective: Predict Highest
Concentrations B-38
6 Model Comparison Results for NAAQS Analysis.
Second-Order Objective: Predict the Domain of
Concentrations B-41
7 Model Comparison Results for NAAQS Analysis.
Third-Order Objective: Predict the Pattern
(Spatial and Temporal) of Concentrations B-44
8 Model Comparison Results for PSD Analysis.
First-Order Objectives: Predict Highest
Concentrations in PSD Areas B-45
List of Figures
Number Page
1 Field Monitoring Network Near the
Clifty Creek Power Plant B-15
2 Example of Overlapping 95% Confidence
Intervals on Bias for Two Models B-30
3 Example of "Tightened" Confidence
Intervals to Result in Non-Overlapping Biases. B-30
1.0 Introduction
The Interim Procedures for Evaluating Air Quality Models1 provide a
methodology to judge whether a proposed model, not specifically recommended
for use by the Guideline on Air Quality Models,2 is acceptable for a
particular regulatory application. This example model evaluation illustrates
the methodology set forth in the Interim Procedures.
The Interim Procedures provide a basis for objectively selecting
between the proposed model and a reference model that is either recommended
in the Guideline on Air Quality Models or is otherwise agreed to be
acceptable for the particular application. To judge which model is more
acceptable, the technical features of the two models are compared and
then a site-specific performance evaluation of both models is carried
out. (For certain regulatory applications, EPA does not designate a
reference model. In these cases the Interim Procedures provide a method
for assessing the suitability of the proposed model, based on a technical
review of the model's applicability and a model performance evaluation).
This example application illustrates the use of the Interim Procedures
to select between a proposed model and the reference model for a specific
regulatory application. The proposed model is AQ40, a hypothetical air
quality dispersion model defined for the purpose of this narrative example.
The reference model will be selected as a step in applying the Interim
Procedures. The regulatory issue of interest is the short-term air
quality impact from a coal-fired power plant in relation to maintaining
the National Ambient Air Quality Standards (NAAQS) and the Prevention of
Significant Deterioration (PSD) requirements. The Interim Procedures
specify the following major steps:
1. Perform a preliminary technical/regulatory analysis of intended model
application. This includes a definition of the regulatory issues of con-
cern, a description of the source and physical situation being modeled,
identification of the appropriate reference model, identification and tech-
nical description of the proposed model, preliminary estimates of air quality
impacts of the two models, and an application-specific technical comparison
of the proposed and reference models.
2. Prepare a model performance evaluation protocol which specifies the
statistical performance comparisons for selecting the appropriate model.
3. Describe the proposed field measurements program required to obtain
model evaluation data.
4. Carry out the field measurements program, conduct the performance
evaluation of the proposed and reference models with the data collected
in the field measurements program using the statistical performance mea-
sures specified in the protocol and, based upon an objective comparison
of performance results, select/reject the proposed model.
Although each of these steps is discussed in sequence in the narrative
example, resources precluded a rigorous illustration of Steps 1 and 3.
Thus the primary utility of the example is the detailed illustration of
Steps 2 and 4, the design and execution of the performance evaluation.
B-12
-------
2.0 Preliminary Analysis
The first step in applying the Interim Procedures for Evaluating Air
Quality Models is to analyze the regulatory issues, physical setting and
pollutant source to which the proposed and reference models are to be
applied. The regulatory requirements dictate the impact region and the
averaging periods of interest for the model applications. The physical
setting and source characteristics are the basis for selecting the appro-
priate reference model for the comparative model evaluations. Additionally,
preliminary modeling estimates of the expected air quality impacts are
made at this time. These estimates are used subsequently in designing
the performance evaluation data network and the statistical performance
evaluation field program methodology.
The regulatory issue addressed in this example evaluation is the
short-term (3- and 24-hour average) air quality impact from a coal-fired
power plant located in the Midwest. The power plant used for this example
is the Clifty Creek generating station located in southern Indiana and
operated by the Indiana-Kentucky Electric Corporation. Compliance with
NAAQS and PSD Class I requirements for 3- and 24-hour average sulfur
dioxide (SO2) impacts is the specific regulatory concern to be addressed.
No actual PSD Class I region exists in the vicinity of the Clifty Creek
station; therefore, a hypothetical Class I region 15 kilometers northeast
of the plant is assumed for this example. To assess compliance with NAAQS,
model prediction of the highest, second-highest concentration per year
within 50 kilometers of the Clifty Creek station is required. PSD regula-
tions are based on the predicted highest, second-highest impact per year
of the source within the Class I region.
B-13
-------
The physical setting for this example case is the region surrounding
the Clifty Creek generating station. The plant is located in the Ohio
River Valley in southern Indiana. The Clifty Creek station is a baseload
facility and has three 208-meter stacks, with combined average emissions
of 8600 g/s SO2. The average exit temperature is approximately 445 K and
the exit velocity ranges from approximately 25 to 50 m/s, depending on
the load. The plant is on a flood plain located on a bend in the river.
Bluffs rise approximately 60 meters along the Ohio River near the plant.
The terrain beyond the bluffs from south-southwest clockwise to north-
northeast is quite flat. In the other directions are several streams
cutting down to the Ohio River which have created dendritic drainage
valleys. The maximum relief in the area (plant grade to the highest
monitor, located on a ridge) is about 130 meters and is associated with a
ridge between stream cuts. The terrain surrounding Clifty Creek is not
"ideally flat"; however, terrain is well below stack height. The site-
specific monitoring program includes a 60-meter meteorological tower to
measure winds and vertical temperature gradients, and six SO2 stations 3
to 16 kilometers from the stack. A map of the monitoring network is
included as Figure 1.
Selection of the reference model for this application is based upon
the recommendations of the Guideline on Air Quality Models. The Guideline
recommends the CRSTER model as appropriate for point sources with collocated
stacks located in regions where the terrain does not exceed stack height.
For this example evaluation the CRSTER-equivalent model, MPTER, cited in
the Guideline on Air Quality Models, will serve as the reference model.
(Unlike CRSTER, MPTER permits the user to specify receptor locations exactly.)
B-14
-------
[Figure 1 is a map of the monitoring network; only the legend and the
station list are legible in the scan.]
Key: surface wind measurement; SO2 monitor; meteorological tower.
Stations (the map tabulates distance from plant (km), elevation MSL (m),
and azimuth from north of plant (degrees); the values are not recoverable):
1. Bacon Ridge
2. Rykers Ridge
3. North Madison
4. Hebron Church
5. Liberty Ridge
6. Canip Creek
The map also notes the Clifty Creek plant elevation and the elevation at
the top of the stacks.
Figure 1. Field monitoring network near the Clifty Creek Power Plant.
B-15
-------
The proposed model for this narrative example of the Interim
Procedures is AQ40, a hypothetical dispersion model. The computer code
for AQ40 embodies the features of several publicly available Gaussian
dispersion models. Because of resource constraints for preparing this
example, the proposed model description and the technical model comparisons
have been foreshortened. For this example, familiarity with the features
of the MPTER model is assumed, and the technical comparison presents only
the key technical differences between AQ40 and MPTER. For an actual
application of the Interim Procedures, a complete technical description
of the proposed model should be prepared.
Preliminary estimates of the SO2 impact of the Clifty Creek plant were
obtained using EPA screening techniques as recommended in the Interim Pro-
cedures. These estimates indicate that maximum concentrations occur within
10 kilometers of the Clifty Creek generating station. Refined modeling
using the proposed and reference models AQ40 and MPTER, respectively,
with 1975 hourly meteorological data has also been done. On the basis of
the AQ40 modeling results, the 3- and 24-hour average maximum SO2 concen-
trations would be expected to occur approximately 3 kilometers south of
the plant. MPTER predicts that both the 3- and 24-hour average maximum
SO2 impacts will occur approximately 7 kilometers northeast of the Clifty
Creek station. Results of this preliminary modeling are to be used in
designing an appropriate performance evaluation data network by indicating
potential maximum impact areas and are useful in designing the statistical
model comparison methodology required by the Interim Procedures. Once the
preliminary ambient estimates have been made, the next step in applying
the Interim Procedures for Evaluating Air Quality Models is to perform a
technical comparison of the proposed and reference models. The technical
B-16
-------
comparison of the proposed and reference models should then be performed
following the methodology set forth in the Workbook For Comparison of Air
Quality Models.3 The purpose of the technical model comparison is to
determine which model would be expected to predict more accurately concen-
trations for the source being considered. If results of the statistical
performance comparisons, carried out in a subsequent step, are inconclusive,
the results of the technical model comparison can serve as the basis for
determining the acceptability of the proposed model.
The important technical differences between AQ40 and MPTER are:
(a) Terrain considerations. MPTER simulates the effect of terrain
by subtracting the full terrain height from the effective plume height.
AQ40 uses full terrain subtraction from the effective plume height for
stable atmospheric mixing conditions and half terrain height subtraction
for neutral and unstable meteorology (see the sketch following this list).
(b) Dispersion coefficients. MPTER uses the Pasquill-Gifford hori-
zontal and vertical dispersion coefficients and six stability classes.
AQ40 uses the rural ASME4 horizontal and vertical dispersion coefficients
and five stability classes (one stable class).
(c) Stack tip downwash. MPTER, as run for this example evaluation,
does not invoke this option. AQ40 does simulate this phenomenon.
(d) Plume rise. MPTER uses the final Briggs' plume rise approximation.
AQ40 uses the transitional or distance-dependent Briggs' plume rise
formulation.
(e) Buoyancy induced dispersion. MPTER does not enhance dispersion
due to buoyantly rising plumes, but AQ40 does employ this option.
(f) Wind profile. MPTER and AQ40 both use a power law for adjusting
wind speed with height, but use different coefficients, as presented in
B-17
-------
the following table. AQ40 uses the predicted wind speed at final plume
height in the denominator of the Gaussian dispersion equation. MPTER
uses the predicted wind speed at stack height in the Gaussian equation.
Stability      1      2      3      4      5      6
MPTER         .10    .15    .20    .25    .30    .30
AQ40          .10    .11    .12    .15    .20    none
(g) Mixing height. With both MPTER and AQ40, the mixing height rises
and falls to maintain a constant height above local terrain. For MPTER,
however, plumes rising above the mixing height have no ground-level impact
and plumes below the mixing height are fully reflected. With AQ40, on
the other hand, unlimited mixing heights are used for stable atmospheric
conditions, while a partial plume penetration algorithm is employed for
nonstable conditions.
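To make differences (a), (d), and (f) concrete, the sketch below applies
each stated rule to sample numbers. It is an illustration only: AQ40 is
hypothetical, the Briggs-form constants are the commonly quoted ones rather
than code from either model, and the anemometer height, effective plume
height, buoyancy flux, and wind speed are assumed values.

```python
# (a) Terrain: height of the plume centerline above an elevated receptor.
def height_above_receptor(h_eff, terrain, stable):
    """MPTER-style rule: full terrain subtraction in all conditions.
    AQ40-style rule: full subtraction when stable, half when
    neutral/unstable. Heights are in meters above plant grade."""
    mpter = h_eff - terrain
    aq40 = h_eff - (terrain if stable else 0.5 * terrain)
    return mpter, aq40

# (d) Plume rise: final vs. distance-dependent buoyant rise (commonly
# quoted Briggs forms, assumed here for illustration).
def briggs_rise(F, u, x):
    """Rise (m) at downwind distance x (m), for buoyancy flux F (m^4/s^3)
    and stack-top wind speed u (m/s); rise grows as x**(2/3) and is
    constant beyond the final-rise distance."""
    x_star = 14.0 * F ** 0.625 if F < 55.0 else 34.0 * F ** 0.4
    return 1.6 * F ** (1.0 / 3.0) * min(x, 3.5 * x_star) ** (2.0 / 3.0) / u

# (f) Wind profile: power law u(z) = u_ref * (z / z_ref) ** p, with the
# exponents tabulated above.
MPTER_P = {1: .10, 2: .15, 3: .20, 4: .25, 5: .30, 6: .30}
AQ40_P = {1: .10, 2: .11, 3: .12, 4: .15, 5: .20}  # no sixth class

def wind_at_height(u_ref, z_ref, z, p):
    """Extrapolate the wind measured at height z_ref (m) to height z (m)."""
    return u_ref * (z / z_ref) ** p

# Sample numbers: a 130 m ridge (the local maximum relief) under an
# assumed 400 m effective plume height, neutral conditions:
print(height_above_receptor(400.0, 130.0, stable=False))  # (270.0, 335.0)
# Assumed F = 1500 m^4/s^3, u = 5 m/s: rise 500 m downwind vs. final rise.
print(briggs_rise(1500.0, 5.0, 500.0))                    # ~231 m
print(briggs_rise(1500.0, 5.0, 1e9))                      # ~623 m (final)
# 5 m/s at an assumed 10 m anemometer height, stability class 4, evaluated
# at stack height (MPTER, 208 m) and at an assumed 500 m plume height (AQ40):
print(wind_at_height(5.0, 10.0, 208.0, MPTER_P[4]))       # ~10.7 m/s
print(wind_at_height(5.0, 10.0, 500.0, AQ40_P[4]))        # ~9.0 m/s
```

Taken together, these differences mean the two models can place their
maximum impacts at quite different distances and directions, as the
preliminary modeling in Section 2 showed.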
In an actual model evaluation, a complete technical model comparison
using the "Workbook" procedures would be carried out and submitted to the
control agency for review and agreement that both the proposed and reference
models are appropriate for the regulatory application at hand.
B-18
-------
3.0 Model Evaluation Protocol
As previously stated, the two principal regulatory purposes that
this evaluation protocol addresses are the following:
o Compliance with National Ambient Air Quality Standards (NAAQS)
for sulfur dioxide (SO2) for 3-hour and 24-hour averaging times.
o Assessment of plant SO2 impact on a hypothetical Class I Prevention
of Significant Deterioration (PSD) region located 15 kilometers northeast
of the plant (3-hour and 24-hour averaging times in the vicinity of the
Bacon Ridge site).
The performance of the proposed and reference models, AQ40 and MPTER,
respectively, will be compared based upon each model's ability to simulate
air quality impacts measured on a monitoring network of six SO2 stations
in the vicinity of the Clifty Creek generating station. The period of
record for the concurrent air quality, meteorological and source data
proposed for these evaluations is January 1 through December 31, 1976.
Since the projected impact areas are different for each regulatory
purpose, the performance of the models for NAAQS and PSD applications
will be assessed independently. The performance of the models for NAAQS
will be judged based upon the data from the entire six station network
for 1-, 3- and 24-hour averaging times. Model performance for the PSD
application will emphasize data from the Bacon Ridge Station within the
hypothetical Class I region. It is possible that different models may be
selected as being most appropriate for each of the above issues.
3.1 NAAQS Attainment
Three performance evaluation objectives have been established
which are important with respect to this primary regulatory purpose.
B-19
-------
The first-order objective is to test the ability of the models to predict
successfully the highest concentrations for use in the regulatory decision-
making process. It is recognized that the single-point prediction of the
highest, second-highest concentration is not statistically meaningful;
therefore, performance measures in this group also include analysis of
the uppermost predicted and observed concentrations for the data period
of record.
The second-order objective is to test the ability of the models to
predict successfully the entire domain of concentrations.
The third-order objective is to test the ability of the models to
predict successfully the spatial and temporal patterns of concentrations.
Tables 1 through 3 summarize the model comparison protocol for the NAAQS
analysis. The tables describe the evaluation data sets, the performance
measures, bases for calculating confidence intervals, averaging times to
which the performance measures will be applied, and the point assignments
for each measure that will be used to score and compare the predictive
abilities of the two models.
The performance measures listed in Tables 1 through 3 were selected
to reflect the spirit of the American Meteorological Society (AMS)5
recommendations. The listed performance measures are specifically those
required to test the ability of the models to meet the model performance
objectives stated above. The AMS recommendations define statistical
procedures for comparing model predictions with observed concentration
values. In this example, two models are being compared based on how each
performs against the same set of observations. This three-way comparison
(proposed model vs. reference model vs. observations) poses a formidable
B-20
-------
[Table 1. Model Comparison Protocol for NAAQS Analysis. First-Order
Objective: Predict Highest Concentrations. The table's entries (evaluation
data sets, performance measures, statistical bases for the 95 percent
confidence intervals, averaging times, point assignments, and comments)
are not recoverable from the scanned original; 500 of the 1,000 total
points are assigned to this objective, as discussed following the tables.]
B-23
-------
[Table 2. Model Comparison Protocol for NAAQS Analysis. Second-Order
Objective: Predict the Domain of Concentrations. Entries not recoverable
from the scanned original; 300 points are assigned to this objective.]
B-25
-------
[Table 3. Model Comparison Protocol for NAAQS Analysis. Third-Order
Objective: Predict the Pattern (Spatial and Temporal) of Concentrations.
Entries not recoverable from the scanned original; 200 points are
assigned to this objective.]
B-26
-------
problem for which appropriate statistical methods have not yet been
devised. The procedures described below for comparing models provide a
decision making framework based upon standard statistical measures.
The first three columns in Tables 1 through 3 describe the data sets
being used. The letter and number code in parentheses in the first column
are included for cross reference to a numbering system recently prepared by
EPA (see Table 3.2 of the Interim Procedures for Evaluating Air Quality Models1).
The fourth and fifth columns specify the performance measure being addres-
sed and the statistical method that will be used to assign the 95 percent
confidence band about each performance measure. The sixth column lists the
averaging times to which each performance measure will be applied. The
seventh column lists the points assigned to each performance measure for
scoring model performance. The final column briefly discusses each of the
performance measures, providing a rationale for using each data subset and
group of performance measures.
Tables 1 through 3 contain 67 performance measures designed to test
the relative abilities of AQ40 and MPTER to meet the three evaluation
objectives. In assigning points to each performance measure, an attempt
was made to balance the regulatory importance, statistical significance
and scientific value of each performance measure. A total of 1,000
possible points has been divided among the three model evaluation
objectives. In recognition of the regulatory importance of the first-
order model evaluation objective, one-half of the total available points
(500) have been assigned to the set of performance measures grouped under
that objective, that is, the ability of the models to predict the highest
concentration values. The second performance objective, prediction of
B-27
-------
the domain of concentration, has been assigned 300 points, and the third
performance objective, prediction of the pattern of concentrations, has
been allotted the remaining 200 points.
Four types of performance measures and associated statistical tests
are being used to judge model performance. The performance measures (and
associated statistical tests) are the absolute value of the bias (t-test),
variance (F-test and χ2), goodness of fit (Kolmogorov-Smirnov test), and
correlation coefficient (Fisher-z test). Errors of magnitude of prediction
are considered to be more critical than errors of scatter of prediction;
therefore, measures of bias and goodness of fit (which test magnitude errors)
have been allotted more points than measures of variance and correlation.
Since the basic prediction time step of both MPTER and AQ40 is 1 hour, the
1-hour averaging time measures have received more points than those for the
regulatory averaging times (3- and 24-hour). This is done following the recommendations
of the AMS Workshop on Dispersion Model Performance.
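For concreteness, the four measures and their companion tests map onto
standard statistical routines roughly as follows. This is an illustrative
sketch on synthetic data, not the evaluation's actual scoring code; the
distributions and the sample size are assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
obs = rng.lognormal(3.0, 0.8, 200)   # assumed paired 1-hour observed SO2 (ug/m3)
pred = rng.lognormal(3.1, 0.7, 200)  # assumed model predictions at the same hours

resid = pred - obs

# Bias: mean residual, with a 95% confidence interval from the t distribution.
bias = resid.mean()
bias_ci = stats.t.interval(0.95, len(resid) - 1, loc=bias, scale=stats.sem(resid))

# Variance comparison: F ratio of predicted to observed variance.
f_ratio = pred.var(ddof=1) / obs.var(ddof=1)

# Goodness of fit between the two concentration distributions:
# two-sample Kolmogorov-Smirnov test.
ks_stat, ks_p = stats.ks_2samp(pred, obs)

# Correlation, with a Fisher z-transform for its 95% confidence interval.
r, _ = stats.pearsonr(pred, obs)
z, se_z = np.arctanh(r), 1.0 / np.sqrt(len(obs) - 3)
r_ci = np.tanh([z - 1.96 * se_z, z + 1.96 * se_z])

print(bias, bias_ci, f_ratio, ks_stat, r, r_ci)
```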
Performance measures and confidence intervals for each performance
measure will be calculated for both MPTER and AQ40 for the averaging
times indicated in Tables 1 through 3. The performance of the models
will be compared, performance measure by performance measure and averaging
time by averaging time. If the performance of the two models is signifi-
cantly different statistically (that is, the 95 percent confidence interval
for either model does not include the value of the performance measure
for the other model) the points indicated in Tables 1 through 3 will be
awarded to the model that calculates closest to the observed value.
Positive points are accumulated for each performance measure if the
proposed model performs better; negative points are accumulated if the
B-28
-------
reference model shows superior performance. For goodness of fit all the
points (plus or minus) will be awarded based upon which model has better
statistical performance.
If, for the two models, the 95 percent confidence intervals for the
absolute value of the bias, variance, or correlation measures do contain
the values of the performance measures for both models, the non-overlapping
confidence intervals for those measures will be calculated, and the
corresponding percentage of the maximum available points will be assigned.
For example, if the 95 percent confidence intervals for the bias of the
two models overlap each other's mean bias (see Figure 2), the confidence
intervals for both models will be "tightened" (see Appendix C)
until the two biases are mutually different statistically at some level
of significance, as in Figure 3. To illustrate, assume that bias is a
10-point measure and assume the biases become statistically distinct at
the 90 percent confidence level; then nine (90 percent) of the possible
10 points would be awarded to the model that better predicts the bias.
Only integer points will be awarded; fractions will be rounded. (Although
it is recognized that this methodology may not be ideal in the strictest
statistical sense, it is acceptable for example purposes and is easy to
apply. Others may wish to propose another methodology for scoring.)
Following the completion of all the performance measure comparisons,
the points awarded will be totalled. If the grand total is +100 points or more,
the proposed model will be deemed more suitable for assessing the plant
impact for the appropriate NAAQS averaging time. If the grand total is
between -100 and +100 points, no decisive conclusion may be reached
regarding the superiority of either model and further analysis would be
considered (for example, technical comparisons or further evaluations
B-29
-------
[Figure 2 shows two mean biases, each with its 95% confidence interval,
plotted on an axis running from -Bias through 0 to +Bias; the intervals
overlap.]
Figure 2. Example of Overlapping 95% Confidence Intervals
on Bias for Two Models
[Figure 3 shows the same two mean biases with tightened confidence
intervals that no longer overlap.]
Figure 3. Example of "Tightened" Confidence Intervals to Result
in Non-Overlapping Biases
B-30
-------
with more data - see Section 2.) If the grand total is -100 points or less, the
reference model will be judged to have the better performance.
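Stripped of the bookkeeping, the decision rule is a three-way threshold on
the signed grand total (positive points favor the proposed model); a
minimal sketch:

```python
def protocol_decision(grand_total):
    """Three-way acceptance rule on the signed grand total of points."""
    if grand_total >= 100:
        return "proposed model accepted"
    if grand_total <= -100:
        return "reference model judged better"
    return "inconclusive; further technical comparison or data needed"

print(protocol_decision(100))   # the NAAQS outcome in Section 5.1
print(protocol_decision(-416))  # the PSD outcome in Section 5.2
```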
3.2 PSD Analysis
As with the NAAQS analysis, three performance objectives have
been established to assess the performance of the two models. The major
difference between the two analyses results from the fact that only one
station is available in the Class I PSD area, which reduces the number of
data sets used in the PSD portion of the model evaluation.
Table 4 summarizes the first-order objectives and associated
performance measures designed to assess the ability of the models to
predict concentrations as required for a Class I PSD analysis. That is,
the data subsets and performance measures in Table 4 evaluate the ability
of the models to predict highest impacts within the hypothetical Class I
area described previously. The second- and third-order objectives and
performance measures for evaluating the models in this PSD application are
identical to those presented previously in Tables 2 and 3 for NAAQS analysis.
Again, one-half the points have been allotted to the group of performance
measures included under the first-order objective (500 points), that is,
testing the ability of the models to predict highest concentrations at the
station located within the PSD Class I area. The remaining points are
assigned to the second- and third-order objectives in the same manner as
was done for the NAAQS model evaluation. The performance of the models for
the PSD application will be compared after summing the points that each model
scores in Tables 2, 3, and 4. Awarding of points and identification of the
model with superior performance are accomplished in a manner identical to
the method used for the NAAQS analysis.
B-31
-------
[Table 4. Model Comparison Protocol for PSD Analysis. First-Order
Objective: Predict Highest Concentrations in PSD Area. Entries not
recoverable from the scanned original; 500 points are assigned to this
objective, which tests the ability of the models to predict the highest
concentrations at the Bacon Ridge station within the hypothetical Class I
area.]
B-32
-------
The above methodology will result in the objective selection of
the reference or proposed model for each of the regulatory situations of
concern.
B-33
-------
B-34
-------
4.0 Field Measurements
Following EPA concurrence that the protocol for model performance
evaluation is technically sound, the field measurements program is
undertaken. The purpose of the field measurements program is to generate
a data base to be used for the comparative model performance evaluations.
The field program design is based upon the results of the preliminary
modeling, as discussed in Section 2, and the requirements of the protocol,
as discussed in Section 3. For this example, resource constraints pre-
cluded the design of a data acquisition network and collection of the
requisite field data. Instead a historical data base is used and a
hypothetical regulatory problem is constructed around that data base with
the primary goal to illustrate the use of the statistical performance
evaluation methodology described in the Interim Procedures. A real
regulatory problem would require an in-depth analysis of the data require-
ments for a comparative model evaluation.
B-35
-------
B-36
-------
5.0 Performance Evaluation Results and Model Selection
After completion of the field measurements program, the data collected
are used for the comparative model performance evaluation. The performance
evaluation follows the plan presented in the protocol. Once the results
of the performance evaluation are compiled, the decision is made whether
to accept or reject the proposed model based upon the objective scoring
scheme presented in the protocol. A report containing the results of the
evaluations, the results of the comparative model scoring, and the decision
whether or not to accept the proposed model is submitted to the control
agency for review. It is essential to calculate the statistical performance
measures and apply the decision criteria exactly as specified in the
preplanned protocol. Adherence to the protocol ensures that the decision
is completely objective.
For this example evaluation, it is assumed that the control agency
approved the performance measures and scoring scheme proposed in the
protocol as presented in Section 3. Tables 5 through 8 present the
results of the example model evaluations called for in Tables 1 through 4.
The first three columns of each table list the data sets being compared
and indicate whether or not the observed and predicted concentrations are
paired in time or space. The next two columns list the statistical
performance measures and the statistical bases for calculating confidence
intervals for use in scoring performance. The sixth column gives the
averaging time for which each comparison has been made. The next three
columns list the actual performance of MPTER and AQ40, and, where
applicable, the level of significance at which the models demonstrate
statistically different performance. Where the variance ratio is utilized
B-37
-------
[Table 5. Model Comparison Results for NAAQS Analysis. First-Order
Objective: Predict Highest Concentrations. The tabulated results (data
sets, pairing in time or space, performance measures, confidence-interval
bases, averaging times, MPTER and AQ40 performance, significance levels,
and maximum and awarded points) are not recoverable from the scanned
original. Per Section 5.1, the subtotal for this objective was +52 of a
possible +500 points.]
B-40
-------
[Table 6. Model Comparison Results for NAAQS Analysis. Second-Order
Objective: Predict the Domain of Concentrations. Tabulated results not
recoverable from the scanned original. Per Section 5.1, the subtotal for
this objective was +103 of a possible +300 points.]
-------
[Table 7. Model Comparison Results for NAAQS Analysis. Third-Order
Objective: Predict the Pattern (Spatial and Temporal) of Concentrations.
Tabulated results not recoverable from the scanned original. Per Section
5.1, the subtotal for this objective was -55 of a possible +200 points,
and the NAAQS grand total was +100 points.]
B-44
-------
[Table 8. Model Comparison Results for PSD Analysis. First-Order
Objective: Predict Highest Concentrations in PSD Area. Tabulated results
not recoverable from the scanned original. Per Section 5.2, the
first-order PSD subtotal was -464 of a possible +500 points, and the PSD
grand total was -416 points.]
B-45
-------
as a performance measure, the variance of the observed concentrations is
also provided to aid in data interpretation. The last two columns list
the maximum points that may be awarded for each performance measure/averaging
time combination, and the points that actually were awarded. At the end
of each table the subtotal of points awarded is indicated. At the end of
Tables 7 and 8, the grand total points for the model performance comparisons
are presented for the NAAQS application and the PSD application,
respectively.
5.1 Results for Model Performance Comparisons in the NAAQS Analysis
The grand total points awarded for the model performance
comparison in the NAAQS analysis is +100 points out of a possible +1000
points. As set forth in the protocol, a score from +100 to +1000 points
results in the acceptance of AQ40, the proposed model, over MPTER, the
reference model, for NAAQS regulatory applications at the Clifty Creek
station. With a score of +100, AQ40 has attained the minimum score
required for acceptance over the reference model.
Inspection of Table 5 reveals that in the NAAQS application
AQ40 better predicted the highest, second-highest observed SO2 concentra-
tions for both 3- and 24-hour averaging times and showed slightly better
performance overall for the first performance evaluation objective,
predicting the highest concentrations (+52 points out of a possible +500
points). AQ40 also outperformed MPTER on the second objective, predicting
the domain of concentrations, by scoring +103 out of a possible +300
points (see Table 6). MPTER, however, scored better (with -55 points out
of +200 possible points) for the third objective which tests the ability
of the models to match spatial and temporal patterns of concentration
B-46
-------
(see Table 7). The three subtotals (+52, +103, and -55) sum to the +100
grand total. Based upon the grand total points awarded, AQ40 is accepted
as suitable for the regulatory analyses of 3- and 24-hour average SO2
impacts from the Clifty Creek generating station in relation to NAAQS
requirements.
5.2 Results for Model Performance Comparisons in the PSD Analysis
The grand total points awarded for the model performance
comparison in the PSD analysis is -416 points out of a possible +1000
points, as shown in Table 8. As set forth in the protocol, a score from
-100 to -1000 points results in the rejection of the proposed model. The
score of -416 for the PSD comparative performance evaluation indicates a
decisive margin in favor of the reference model MPTER over the proposed
model AQ40.
Inspection of Table 8 shows that MPTER outperformed AQ40 for the
first-order performance evaluation objective of predicting the highest
concentrations within the PSD region. MPTER scored -464 points out of a
possible +500 points for this first-order objective, and more accurately
predicted both the 3- and 24-hour average highest, second-highest concen-
trations as monitored within the PSD region.
As stipulated by the protocol, the model performance measures
and scoring scheme used for the second- and third-order PSD evaluation
objectives, predicting the domain and the pattern of concentrations,
respectively, are identical to the performance measures and scoring scheme used for the
NAAQS model comparisons. The performance results and points awarded for
these comparisons were presented previously in Tables 6 and 7. Recall
that AQ40 scored +103 points out of +300 possible points for the second-
order objective, and that MPTER scored -55 points out of +200 possible points
B-47
-------
for the third-order objective; the three subtotals (-464, +103, and -55)
sum to the -416 grand total. Based upon the grand total points awarded,
AQ40 is rejected as suitable for the regulatory analyses of 3- and
24-hour average SO2 impacts from the Clifty Creek generating station within
the hypothetical PSD Class I area.
B-48
-------
6.0 Summary
This narrative example of the Interim Procedures For Evaluating Air
Quality Models illustrates the analytical steps necessary to judge the
acceptability of a proposed, non-Guideline model for a specific regulatory
application. The statistical performance measures and scoring methodology
used in the narrative example have been selected for this hypothetical
application and set forth in the preplanned protocol. In actual applica-
tions of the Interim Procedures, the performance evaluation methodology
must be designed to meet the specific objectives of the intended regulatory
use of the proposed model. It is especially important that close liaison
be maintained with the control agency throughout the model evaluation
process to ensure agreement on the objectivity of the model comparison
results.
B-49
-------
7.0 References
1. Environmental Protection Agency. "Interim Procedures For
Evaluating Air Quality Models." Office of Air Quality Planning
and Standards, Research Triangle Park, North Carolina, August 1981.
2. Environmental Protection Agency. "Guideline On Air Quality
Models." EPA-450/2-78-027. Office of Air Quality Planning and
Standards, Research Triangle Park, North Carolina, April 1981.
3. Environmental Protection Agency. "Workbook For Comparison Of
Air Quality Models." EPA-450/2-78-028a and EPA-450/2-78-028b.
Office of Air Quality Planning and Standards, Research Triangle
Park, North Carolina, May 1978.
4. American Society of Mechanical Engineers. "Recommended Guide
for the Prediction of Dispersion of Airborne Effluents." M. Smith
(editor). American Society of Mechanical Engineers, New York, New
York, 1968.
5. Fox, D. G. "Judging Air Quality Model Performance - A Summary
of the AMS Workshop on Dispersion Model Performance, Woods Hole,
Massachusetts, September 8-11, 1980." Bulletin of the American
Meteorological Society, 62:599-609, May 1981.
-------
Appendix C
Procedure For Calculating Non-Overlapping Confidence Intervals
C-l
-------
This Appendix illustrates the procedure used to calculate non-
overlapping confidence intervals as discussed in Section 4.3 of the
narrative example. This procedure is used when the 95 percent confidence
intervals of the performance measures (absolute value of bias, variance, or
correlation) contain the value of the performance measure for both
models, as illustrated in Figure 2 of Appendix B. The following example
demonstrates this procedure.
Suppose that for Model A the value of the bias performance measure
is 105 µg/m3, the standard error is 20 µg/m3, and the sample size is 600;
and for Model B, these values are
-------
20 percent, until the non-overlapping level was identified. For the example
in Table C-1, the 90 percent confidence intervals are 72 µg/m3 to 138 µg/m3
for Model A and 34 µg/m3 to 116 µg/m3 for Model B. Again, both confidence
TABLE C-1
EXAMPLE CONFIDENCE INTERVALS AT FOUR CONFIDENCE LEVELS

                        Model A                      Model B
                   Bias = 105 µg/m3              Bias = 75 µg/m3

Confidence   Lower Bound  Upper Bound     Lower Bound  Upper Bound
  Level        (µg/m3)      (µg/m3)         (µg/m3)      (µg/m3)
   95%           66           144              26           124
   90%           72           138              34           116
   80%           79           131              43           107
   60%           88           122              54            96
intervals include the value of the bias for the other model, and therefore
the confidence level must be decreased by another step. The 80 percent
confidence interval for Model A (79 µg/m3 to 131 µg/m3) does not include
the value of the bias for Model B (75 µg/m3). However, since the 80 percent
confidence interval for Model B (43 µg/m3 to 107 µg/m3) does include
the value of the bias for Model A (105 µg/m3), the confidence level must
be decreased by another step. The 60 percent confidence intervals are 88
µg/m3 to 122 µg/m3 for Model A and 54 µg/m3 to 96 µg/m3 for Model B.
Since neither interval includes the value of the bias for the other model,
the non-overlapping confidence level has been identified as 60
percent. Thus, in the scoring scheme for this example, 60 percent of the
total possible points would be awarded to Model B, the model with the
lower bias. For the case when the 20 percent level fails to produce
non-overlapping confidence intervals, neither model is awarded any points.
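The stepping procedure lends itself to a short routine. The sketch below
assumes a normal approximation for the interval (bias ± z × standard
error) and a Model B standard error of 25 µg/m3, the value implied by the
bounds in Table C-1; under those assumptions it reproduces the 60 percent
result above.

```python
from scipy import stats

LEVELS = [0.95, 0.90, 0.80, 0.60, 0.40, 0.20]  # stepped down as in the text

def interval(bias, se, level):
    """Two-sided normal-approximation confidence interval for a bias."""
    z = stats.norm.ppf(0.5 + level / 2.0)
    return bias - z * se, bias + z * se

def non_overlap_level(bias_a, se_a, bias_b, se_b):
    """First stepped level at which neither interval contains the other bias.

    Returns None when even the 20 percent intervals still overlap, in
    which case neither model is awarded any points.
    """
    for level in LEVELS:
        lo_a, hi_a = interval(bias_a, se_a, level)
        lo_b, hi_b = interval(bias_b, se_b, level)
        if not (lo_a <= bias_b <= hi_a or lo_b <= bias_a <= hi_b):
            return level
    return None

# Model A: bias 105, SE 20; Model B: bias 75, SE 25 (implied); units µg/m3.
print(non_overlap_level(105.0, 20.0, 75.0, 25.0))  # 0.6 -> 60% of the points
```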
C-4
-------
TECHNICAL REPORT DATA
(Please read instructions on the reverse before completing)
1. REPORT NO.: EPA-450/4-84-023
3. RECIPIENT'S ACCESSION NO.:
4. TITLE AND SUBTITLE: Interim Procedures for Evaluating Air Quality
Models (Revised)
5. REPORT DATE: September 1984
6. PERFORMING ORGANIZATION CODE:
7. AUTHOR(S):
8. PERFORMING ORGANIZATION REPORT NO.:
9. PERFORMING ORGANIZATION NAME AND ADDRESS:
Monitoring and Data Analysis Division
Office of Air Quality Planning and Standards
U.S. Environmental Protection Agency
Research Triangle Park, N.C. 27711
10. PROGRAM ELEMENT NO.:
11. CONTRACT/GRANT NO.:
12. SPONSORING AGENCY NAME AND ADDRESS:
Monitoring and Data Analysis Division
Office of Air Quality Planning and Standards
U.S. Environmental Protection Agency
Research Triangle Park, N.C. 27711
13. TYPE OF REPORT AND PERIOD COVERED:
14. SPONSORING AGENCY CODE: EPA-450/4-84-023
15. SUPPLEMENTARY NOTES:
16. ABSTRACT:
This document describes interim procedures for use in accepting, for a specific
regulatory application, a model that is not recommended in the Guideline on Air Quality
Models. The procedure involves a technical evaluation and a performance evaluation,
utilizing measured ambient data, of the proposed nonguideline model. The primary
basis for accepting the proposed model is a demonstration that it performs better
(better agreement with measured data) than the guideline model or the model that EPA
would normally use in the given situation. The acceptance procedure may also consider
the technical merits of the proposed model and, especially in cases where an EPA
recommended model cannot be identified, the performance of the model in comparison to
a set of specially designed performance standards. A major component of the proce-
dure is the development of a protocol which describes exactly how the performance
evaluation will be conducted and what the specific basis for accepting or rejecting
the proposed model will be.
17. KEY WORDS AND DOCUMENT ANALYSIS
a. DESCRIPTORS: Air Pollution; Meteorology; Mathematical Models;
Performance Evaluation; Performance Standards; Statistics
b. IDENTIFIERS/OPEN ENDED TERMS: Performance Measures; Technical Evaluation
c. COSATI Field/Group: 4B, 12A
18. DISTRIBUTION STATEMENT: Release Unlimited
19. SECURITY CLASS (This Report): Unclassified
20. SECURITY CLASS (This page): Unclassified
21. NO. OF PAGES:
22. PRICE:
EPA Form 2220-1 (Rev. 4-77) PREVIOUS EDITION IS OBSOLETE