UNITED STATES ENVIRONMENTAL PROTECTION AGENCY
Office of Air Quality Planning and Standards
Research Triangle Park, North Carolina 27711

DATE: 7/30/81
SUBJECT: Interim Procedures for Evaluating Air Quality Models
FROM: Joseph A. Tikvart, Chief, Source Receptor Analysis Branch
TO: Chief, Air Programs Branch, Regions I - X
Attached is a report entitled "Interim Procedures for Evaluating
Air Quality Models." The purpose of the report is to provide a general
framework for the quantitative evaluation and comparison of air quality
models. It is intended to help you decide whether a proposed model, not
specifically recommended in the Guideline on Air Quality Models, is
acceptable on a case-by-case basis for specific regulatory application.
The need for such a report is identified in Section 7 of "Regional
Workshops on Air Quality Modeling: A Summary Report."
An earlier draft (Guideline for Evaluation of Air Quality Models)
was provided to you for comment in January 1981. We received comments
from four Regional Offices and have incorporated many of the suggestions.
These comments reflected a diversity of opinion on how rigid the pro-
cedures and criteria should be for demonstrating the acceptability of a
nonguideline model. One Region maintained that EPA should establish
minimum acceptable requirements on data bases, decision rationale, etc.
Others felt that we should be more flexible in our approach. This
report defines the steps that should be followed in evaluating a model
but leaves room for considerable flexibility in details for each step.
The procedures and criteria presented in this new report are con-
sidered interim. They are an extension of recommendations resulting
from the Woods Hole Workshop on Dispersion Model Performance held in
September 1980. That workshop was sponsored under a cooperative agree-
ment between EPA and the American Meteorological Society. Thus, while
some of the performance evaluation procedures may be resource intensive,
they reflect most of the requirements identified by an appropriate
scientific peer group. However, since the concepts are relatively new
and untested, problems may be encountered in their initial application.
Thus, the report provides suggested procedures; it is not a "guideline."
We recommend that you begin using the procedures on actual situations
within the context of the caveats expressed in the Preface and in Section
5.3. Where suggestions are inappropriate, the use of alternative techniques
to accomplish the desired goals is encouraged. Feedback on your experience
and problems is important to us. After a period of time during which
experience is gained and problems are identified, the report will be
updated and guidance will gradually evolve. Questions on the use of the
procedures and feedback on your experiences with their application
should be directed to the Model Clearinghouse (Dean Wilson, 629-5681).
An example of the procedures applied to a real data base is being devel-
oped under contract and should be completed in early 1982.
Attachment.
cc: Regional Modeling Contacts, Region I - X
W. Barber
D. Fox
T. Helms
W. Keith
M. Muirhead
L. Niemeyer
R. Smith
F. White
INTERIM PROCEDURES FOR EVALUATING
AIR QUALITY MODELS
August 1981
United States Environmental Protection Agency
Office of Air, Noise and Radiation
Office of Air Quality Planning and Standards
Source Receptor Analysis Branch
Research Triangle Park, North Carolina 27711
Preface
The quantitative evaluation and comparison of models for application
to specific air pollution problems is a relatively new problem area for
the modeling community. It is expected that initially there will be a
number of problems in this evaluation and comparison. Also, several projects
are underway that will subsequently provide better insight into the model
evaluation problem and its limitations. Thus, procedures discussed in
this document are considered to be interim.
Where material presented is inappropriate, the use of alternative
techniques to accomplish the desired goals is encouraged. EPA Regional
Offices and State air pollution control agencies are encouraged to use this
information to judge the appropriateness of a proposed model for a specific
application, but still must exercise judgment where specific recommendations
are not of practical value. After a period of time during which experience
is gained, problem areas will be identified and addressed in revisions to
this document.
The procedures described herein are specifically tailored to oper-
ational evaluation, as opposed to scientific evaluation. The main goal
of operational evaluation is to determine whether a proposed model is
appropriate for use in regulatory decision making. The ability of various
sub-modules (plume rise, etc.) to accurately reproduce reality or to add
basic knowledge, which is assessed by scientific evaluation, is not specifically
addressed by these procedures.
An example illustrating the procedures described in this document is
currently being prepared and should be available in early 1982.
TABLE OF CONTENTS

Preface
Summary
1.0 INTRODUCTION
    1.1 Need for Model Evaluation Procedures
    1.2 Basis for Evaluation of Models
    1.3 Coordination with Control Agency
2.0 TECHNICAL EVALUATION
    2.1 Intended Application
    2.2 Reference Model
    2.3 Proposed Model
    2.4 Comparison with the Reference Model
    2.5 Technical Evaluation When No Reference Model Is Used
3.0 PROTOCOL FOR PERFORMANCE EVALUATION
    3.1 Performance Measures
        3.1.1 Accuracy of Peak Prediction
        3.1.2 Average Model Bias
        3.1.3 Model Precision
        3.1.4 Correlation Analysis
    3.2 Protocol for Model Comparison
        3.2.1 Relative Importance of Performance Measures
        3.2.2 Comparison of Performance Statistics for the Proposed and Reference Models
        3.2.3 Format for Model Comparison Protocol
    3.3 Protocol When No Reference Model Is Available
4.0 DATA BASES FOR THE PERFORMANCE EVALUATION
    4.1 On-Site Data
    4.2 Tracer Data
    4.3 Off-Site Data
5.0 MODEL ACCEPTANCE
    5.1 Execution of the Model Performance Protocol
    5.2 Overall Acceptability of the Proposed Model
    5.3 Common Sense Perspective
6.0 REFERENCES
APPENDIX A. Reviewer's Checklist
APPENDIX B. Calculation of Performance Measures and Related Parameters
APPENDIX C. Review of the Woods Hole Workshop on Model Performance
Summary
This document describes interim procedures for use in accepting, for
a specific application, a model that is not specifically identified in the
Guideline on Air Quality Models. The primary basis for the model evaluation
assumes the existence of a reference model which has some pre-existing status
and to which the proposed nonguideline model can be compared from a number
of perspectives. However, for some applications it may not be possible to
identify an appropriate reference model, in which case specific standards for
model acceptance must be identified. Figure 1 provides an outline of the
procedures described in this document.
After analysis of the intended application or the problem to be modeled,
a decision is made on the reference model to which the proposed model could
be compared. If an appropriate reference model can be identified, then the
relative acceptability of the two models is determined as follows. The model
is first compared on a technical basis to the reference model to determine
if it would be expected to more accurately estimate the true concentrations.
Next, a protocol for model performance comparison is written. This protocol
describes how an appropriate set of field data will be used to judge the
relative performance of the proposed and the reference model. Performance
2
measures recommended by the American Meteorological Society are used to
'i
describe the comparative performance of the two models in an objective scheme.
That scheme considers the relative importance to the problem of various model-
ing objectives and the degree to which the individual performance measures
support those objectives. Once the plan for performance evaluation is
written and the data to be used are collected/assembled, the performance
measure statistics are calculated and the weighting scheme described in
the protocol is executed. Execution of the decision scheme will lead to
a determination that the proposed model performs better, worse or about
the same as the reference model for the given applications. The results
of the technical and performance evaluations are considered together
to determine the overall acceptability of the proposed model.
If no appropriate reference model is identified, the proposed model is
evaluated as follows. First the proposed model is evaluated from a technical
standpoint to determine if it is well founded in theory, and is applicable
to the situation. This involves a careful analysis of the model features
and intended usage in comparison with the source configuration, terrain and
other aspects of the intended application. Secondly, if the model is con-
sidered applicable to the problem, it is examined to see if the basic
formulations and assumptions are sound and appropriate for the problem. If
the model is clearly not applicable or cannot be technically supported, it
is recommended that no further evaluation of the model be conducted and that
the exercise be terminated. Next, a performance protocol is prepared that
specifies certain criteria that should be met. Data collection and execution
of the performance protocol will lead to a determination that the model is
acceptable or unacceptable. Finally, results of the performance evaluation
should be considered together with the results of the technical evaluation
to determine the acceptability.
[Figure 1 appears here as a decision flow diagram. Its boxes include: Analyze the
Intended Application (2.1); Write Technical Description of Proposed Model (2.3);
Technical Comparison of Models (2.4); Reject Model; Write Performance Evaluation
Protocol (3.2 or 3.3); Collect Performance Evaluation Data (4.0); and Conduct
Performance Evaluation (5.1).]

Figure 1. Decision Flow Diagram for Evaluating a Proposed Air Quality Model.
(Applicable Sections of the Document are indicated in Parentheses.)
INTERIM PROCEDURES FOR
EVALUATING AIR QUALITY MODELS
1.0 INTRODUCTION
This document describes interim procedures that can be used in judging
whether a model, not specifically recommended for use in the Guideline on
Air Quality Models, is acceptable for a given regulatory action. It identifies
the documentation, model evaluation and data analyses desirable
for establishing the appropriateness of a proposed model.
This document is only intended to assist in determining the accepta-
bility of a proposed model for a specific application (on a case-by-case
basis). It is not for use in determining whether a new model could be
acceptable for general use and/or should be included in the Guideline on
Air Quality Models. This document also does not address criteria for
determining the adequacy of alternative data bases to be used in models,
except in the case where a nonguideline model requires the use of a unique
data base. The criteria or procedures generally applicable to the review
of fluid modeling procedures are contained elsewhere (References 3, 4, 5).
The remaining sections provide the following. Section 1.1 describes
the history and the need for a consistent set of evaluation procedures,
Section 1.2 provides the basis for performing the evaluation, and Section
1.3 suggests how the task of model evaluation should be coordinated be-
tween the applicant and the control agency. Section 2 describes the tech-
nical information needed to define the regulatory problem and the choice
of the reference and proposed models. Section 2 also contains a suggest-
ed method of analysis to determine the applicability of the proposed
model to the situation. Section 3 discusses the protocol to be used in
judging the performance of the proposed model. Section 4 describes the
design of the data base for the performance evaluation. Section 5 describes
the execution of the performance evaluation and provides guidance for com-
bining these results with other criteria to judge the overall acceptability
of the proposed model. Appendix A provides a reviewer's checklist which can
be used by the appropriate control agency in determining the acceptability
of the applicant's evaluation. Appendix B describes the calculation of per-
formance measures and related parameters. Appendix C is a summary of the
Woods Hole Workshop on Dispersion Model Performance, sponsored by the
American Meteorological Society (Reference 2).
1.1 Need for Model Evaluation Procedures
The Guideline on Air Quality Models makes specific recommendations
concerning air quality models and the data bases to be used with these
models. The recommended models should be used in all evaluations relative
to State Implementation Plans (SIPs) and Prevention of Significant Deteri-
oration (PSD) unless it is found that the recommended model is inappropriate
for a particular application and/or a more appropriate model or analytical
procedure is available. However, for some applications the guideline does
not recommend specific models and the appropriate model must be chosen on a
case-by-case basis. Similarly, the recommended data bases should be used
unless such data bases are unavailable or inappropriate. In these cases, the
guideline states that other models and/or data bases deemed appropriate by
the EPA Regional Administrator may be used.
Models are used to determine the air quality impact of both new and
existing sources. The majority of cases where nonguideline models have been
proposed in recent years have involved the review of new sources, especially
in connection with prevention of significant deterioration permit applications.
However, most Regional Offices have received proposals to use nonguideline
models for SIP relaxations and for general area-wide control strategies. Prior
to 1977, many large scale control strategies involved the use of models not
currently recommended in the Guideline on Air Quality Models. Such appli-
cations were frequently accepted. Nonguideline techniques have also been
applied to large-scale control strategies since 1977. In the Northeast and
North Central U. S. where there are wide areas of nonattainment or marginal
attainment of standards, nonguideline models are frequently proposed for use
which would allow increased emissions from large point sources. In "cleaner"
areas of the South and West, nonguideline models are also frequently proposed
for new or modified point sources.
Many of the proposals to use nonguideline models have involved
modeling of point sources in complex terrain and/or a shoreline environ-
ment. Other applications have included modeling point sources of photo-
chemical pollutants, modeling in extreme environments (arctic/tropics/
deserts), modeling of fugitive emissions and modeling of open burning/
field burning where smoke management (a form of intermittent control) is
practiced. For these applications a refined approach is not recommended in
the Guideline on Air Quality Models. Also a relatively small number of
proposals involved applications where a recommended model was appropriate,
but another model was judged preferable.
The types of nonguideline models proposed have included: (1)
minor modification of computer codes to allow a different configuration/number
of sources and receptors that essentially do not change the estimates from
those of the basic model; (2) modifications of basic components in recom-
mended models, e.g., different dispersion coefficients (measured or estimated),
wind profiles, averaging times, etc; and (3) completely new models that fre-
quently involve non-Gaussian approaches and/or phenomenological modeling
(temporal/spatial modeling of the wind flow field or other meteorological
inputs).
The Guideline on Air Quality Models, while allowing for the use
of alternative models in specific situations, does not provide a technical
basis for deciding on the acceptability of such techniques. To assure a
more equitable approach in dealing with sources of pollution in all sections
of the country, it is important that both the regulatory agencies and the
entire modeling community strive toward a consistent approach in judging
the adequacy of techniques used to estimate concentrations in the ambient air.
The Clean Air Act recognizes this goal and states that the "Administrator
shall specify with reasonable particularity each air quality model or models
to be used under specified sets of conditions ..."
The use of a consistent set of procedures to determine the accep-
tability of nonguideline models should also serve to better ensure that
the state-of-the-science is reflected. A properly constructed set of
evaluation criteria should not only serve to promote consistency, but
should better serve to ensure that the best technique is applied. It
should be noted that a proposed model cannot be proprietary since it may
be subject to public examination and could be the focus of a public hear-
ing or other legal proceeding.
1.2 Basis for Evaluation of Models
The primary basis for accepting a proposed model for a specific
application, as described in this document, involves a technical comparison
and a comparison of performance between the proposed model and an applicable
reference model. Under this scheme the greatest emphasis is placed on the
performance evaluation. The proposed model would be acceptable for regulatory
application if its performance is clearly better than that of the reference
model. It should not be applied to the problem if its performance is clearly
inferior to that of the reference model. When the performance evaluation is
inconclusive or marginal, one could decide in favor of the proposed model
if it were found to be technically better than the reference model.
A secondary basis for accepting or rejecting a proposed model could
involve the use of performance criteria written specifically for the in-
tended application. While this procedure is not encouraged because of lack
of experience in writing such criteria and the necessity of considerable
subjectivity, it is recognized that in some situations it may not be possible
to specify an appropriate reference model. Such a scheme would ensure that
the proposed model is technically sound and applicable to the problem, or at
least marginally so, and that it pass certain performance requirements that
are acceptable to all parties involved. Marginal performance together with
a marginal determination on technical acceptability would suggest that the
model should not be used.
At the present time one cannot set down a complete set of objective
evaluation criteria and standards for acceptance of models using these con-
cepts. Bases for such objective criteria are lacking in a number of areas,
including a consistent set of standards for model performance, scientific
consensus on the nature of certain flow phenomena such as interactions
with complex terrain, etc. However, this document provides the framework
for inclusion of future technical criteria as well as specifying currently
available criteria.
1.3 Coordination with Control Agency
The general philosophy of this document is that the applicant
or the developer of the model should perform the analysis. Depending
on the complexity/sensitivity of the application and the level of un-
certainty in the applicant's analysis, the reviewing agency should review
this analysis and make a judgment on the findings, perform independent checks
on certain aspects of the analysis, and/or perform an independent analysis.
The reviewing agency must have access to all of the basic information that went
into the analysis.
To avoid costly and time-consuming delays in execution of the model
evaluation, the applicant is strongly urged to maintain close liaison with
the reviewing agency(s) throughout the project. It is important that
agreement be reached up-front on the choice of a reference model. It is
especially important that meetings be held at the completion of the
technical evaluation and before the initiation of the field data collection
phase. At that time the reviewing agency can make a determination on the
applicability of the proposed model (See Section 2) and the design of (or
choice of) the data base network to be used in the performance evaluation.
It is also important at that time to agree on the protocol and criteria for
comparing the proposed and the reference models, including precise measures
of model performance such as bias, precision, statistical significance
levels, etc.
2.0 TECHNICAL EVALUATION
The technical evaluation consists of a determination of the appro-
priateness of the proposed model for the intended application, exclusive of
the performance evaluation. To adequately address the technical evaluation
requires a thorough understanding of the source-receptor relationships which
must be addressed by the proposed model in the intended application, selection
of an appropriate reference model and a technical comparison of the proposed
model with the reference model. If no appropriate reference model can be
identified, an in-depth technical investigation of the theory, operating
characteristics and applicability of the proposed model should be undertaken.
The following subsections describe these needs in more detail.
2.1 Intended Application
Information that needs to be assembled on the intended appli-
cation includes a complete description of the source or sources to be
modeled, e.g., the configuration of the sources, location and heights of
stacks, stack parameters (flow rates and gas temperature) and location of
any fugitive sources to be included. Appropriate* emission rates for
each averaging time corresponding to ambient air quality standards for
each pollutant should be used. In the case of complex industrial sources
it is also generally necessary to obtain a plant layout including dimensions
of plant buildings and other nearby buildings/obstacles. Mobile and area
source emissions should be assembled in the format (i.e., line source segments,
grid squares, etc.) to be used in the model.
* Section 4.1 in the Guideline on Air Quality Models discusses emission rates
appropriate for use in regulatory modeling.
It is also generally necessary to have a topographic map or
maps which cover the modeling area. If the topographic maps do not
include the location of emission sources, monitors, instrumented towers,
etc., a separate map with this information should be supplied. The areal
coverage is sometimes predetermined by political jurisdiction boundaries,
i.e., an air quality control region. More often, however, modeling is
confined to the region where any significant threat to the standards or
PSD increments is likely to exist. In these cases it is desirable to
make crude determinations of the area to be considered and at the same
time to tentatively determine the location of critical receptors for each
pollutant where standards/increments are most likely to be threatened.
The recommended approach for making these determinations is to make pre-
liminary estimates of the concentration field using available models and
available data. A preliminary estimate would utilize the appropriate
emission rates for the regulatory problem and whatever representative
meteorological data are available before the evaluation*.
It is recommended that two or three separate preliminary estimates
of the concentration field be made. The first set of estimates could be
made with the screening techniques mentioned or referenced in the Guide-
line on Air Quality Models. The second set of estimates would be done with
the proposed model and the third set with the reference model (Section 2.2).
Estimates for all averaging times should be calculated.
* A final set of model estimates, to be used in decision making, could
utilize additional data collected during the performance evaluation as
input to the appropriate model.
The three sets of estimates not only serve to define the modeling
domain and critical receptors but also aid in determining the applicability
of the proposed model (Sections 2.4 and 2.5) and the design of the performance
evaluation data network (Section 4.0).
2.2 Reference Model
The primary approach used in this document to judge the accept-
ability of a proposed model relies on the philosophy that if the model is
technically better and performs better than the recommended model or the
model that has historically been applied to the situation, then the pro-
posed model should be considered for use. In Section 2.4 procedures con-
tained in the Workbook for Comparison of Air Quality Models are used, to
the maximum extent possible, to make the technical comparison. Sections 3
and 4 describe procedures for comparing the performance of the "reference"
model with that of the proposed model.
The first choice for a reference model should be the refined models
recommended in the Guideline on Air Quality Models and listed in Appendix
A to that Guideline. However, not all modeling situations are covered by
recommended models. For example, models for point sources of reactive
pollutants or shoreline fumigation problems are not included. In these cases
the applicant and the reviewing agency should attempt to agree on an appropriate
and technically defensible reference model, based on the current technical
literature and on past experience. Major considerations in the selection
of the reference model under these circumstances are that it is applicable
to the type of problem in question, has been described in published reports
or the open literature, and is capable of producing concentration estimates
for all averaging times for which a performance measure statistic must
be calculated (usually one hour and the averaging times associated with
the standards/increments). This latter requirement precludes the use of
screening techniques which rely on assumed meteorological conditions for
a worst case.
Where it is clearly not possible to specify a reference model,
the proposed model must "stand alone" in the evaluation. In such cases
the technical justification and the performance evaluation necessary to
determine acceptability would have to be more substantial. Section 2.5
discusses a suggested rationale for determining if the model is techni-
cally justified for use in the application. Section 3.3 discusses some
considerations in designing the performance evaluation protocol when no
reference model comparison is involved.
2.3 Proposed Model
The model proposed for use in the intended application must be
capable of estimating concentrations corresponding to the regulatory re-
quirements of the problem as identified in Section 2.1. In order to conduct
the performance evaluation the model should be capable of sequentially esti-
mating hourly concentrations, and concentrations for all averaging times within
the area of interest based on meteorological and emission inputs.
A complete technical description of the model is needed for the
analysis in Section 2.4 or Section 2.5. This technical description should
include a discussion of the features of the proposed model, the types of
modeling problems for which the model would be applicable, the mathematical
relationships involved and their bases, and the assumptions and limitations
of the model. The model description should take the form of a report or
user manual that completely describes its operation. Published articles
which describe the model are useful. If the model has been applied to
other problems, a review of these applications should also be undertaken.
For models designed to handle complex terrain, land/water interfaces and/or
other special situations, the technical description should focus on how
the model treats these special factors. To the maximum extent possible,
evidence for the validity of the methodologies should be included.
2.4 Comparison with the Reference Model
When an appropriate reference model can be identified, it should
be determined whether the proposed model is better to use than the reference
model. The goal is to determine if the model can be expected to more accu-
rately reproduce the actual concentrations caused by the subject source(s),
with emphasis on dispersion conditions and subareas of the modeling domain
that are most germane to the regulatory aspects of the problem (Section 2.1).
The procedures described in the Workbook for Comparison of Air Quality Models
are appropriate for this determination. This Workbook contains a procedure
whereby a proposed model is qualitatively compared, on technical grounds, to
the reference model, taking into account the intended use of the two models
and the specific application.
The Workbook procedure is application-specific; that is, the
results depend upon the specific situation to be modeled. The reference
model serves as a standard of comparison against which the user gauges the
proposed model being evaluated. The way in which the proposed model treats
twelve aspects of atmospheric dispersion called "application elements," is
determined. These application elements represent physical and chemical
phenomena that govern atmospheric pollutant concentrations and include such
aspects as horizontal and vertical dispersion, emission rate, and chemical
reactions. The importance of each element to the application is defined
in terms of an "importance rating." Tables giving the importance ratings for
each element are provided in the Workbook, although they may be modified
under some circumstances. The heart of the procedure involves an element-by-
element comparison of the way in which each element is treated by the two
models. These individual comparisons, together with the importance ratings
for each element in the given application, form the basis upon which the
final comparative evaluation of the two models is made.
It is especially important that the user understand the physical
phenomena involved, because the comparison of two models with respect to
the way that they treat these phenomena is basic to the procedure. Suf-
ficient information is provided in the Workbook to permit these comparisons.
Expert advice may be required in some circumstances. If alternate procedures
are used to complete the technical comparison of models, they should be
negotiated with the reviewing agency.
The results of the comparison of the proposed model with the
reference model should indicate whether the proposed model is better, com-
parable or worse than the reference model. This information is used in the
overall model evaluation in Section 5.
2.5 Technical Evaluation When No Reference Model Is Used
If it is not possible to identify an appropriate reference model
(Section 2.2), then the procedures of Section 2.4 cannot be used and the
proposed model must be technically evaluated on its own merits. The tech-
nical analysis of the proposed model should attempt to qualitatively answer
the following questions:
1. Are the formulations and internal constructs of the model well
founded in theory?
2. Does the theory fit the practical aspects and constraints of
the problem?
To determine whether or not the underlying assumptions have been
correctly and completely stated requires an examination of the basic theory
employed by the model. The technical description of the model discussed in
Section 2.3 should provide the primary basis for this examination. The
examination of the model should be divided into several subparts that
address various aspects of the formulation. For example, for some models
it might be logical to separately examine the methodologies used to characterize
the emissions, the transport, the diffusion, the plume rise, and the chemistry.
For each of these model elements it should be determined whether the formulations
are based on sound scientific, engineering and meteorological principles and
whether all aspects of each element are considered. Unsound or incomplete
specification of assumptions should be flagged for consideration of their im-
portance to the actual modeling problem.
For some models, e.g., those that entail a modification to a model
recommended in the Guideline on Air Quality Models or to the reference model,
the entire model would not need to be examined for scientific credibility.
In such cases only the submodel or modification should be examined. Where
the phenomenological formulations are familiar and have been used before,
support for their scientific credibility can be cited from the literature.
For models that are relatively new or utilize a novel approach to
some of the phenomenological formulations, an in-depth examination of the
theory should be undertaken. The scientific support for such models should
be established and reviewed by those individuals who have broad expertise
in the modeling science and who have some familiarity with the approach and
phenomena to be modeled.
To determine how well the model fits the specific application, the
model assumptions should be compared to the reality of the application. The
assumptions involved in the methodologies proposed to handle each phenomenon
should be examined to see if they are reasonable for the given situation.
Particular attention should be paid to flagged assumptions which may either
be only marginally valid from a basic standpoint or be implicit and unstated,
to determine whether such assumptions are germane to the situation. For
assumptions that are not met, it should be established that these deficien-
cies will not cause any significant differences in the estimated concentra-
tions. The most desirable approach takes the form of sensitivity testing by
the applicant where variations are made on the questionable assumptions with-
in the model to determine whether or not these assumptions are indeed criti-
cal. Such an exercise should be conducted if possible and would involve
obtaining model estimates before and after modification of formulas or data
to reflect alternate assumptions. However, in many cases this exercise may
be too resource-consumptive and the proof of model validity should still rest
with the performance evaluation described in Section 4.
Execution of the procedures in this section should lead to a judgment
on whether the proposed model is applicable to the problem and can be scien-
tifically supported. If these criteria are met, the model can be designated
as appropriate and should be applied if its field performance (Section 4) is
acceptable. When a model cannot be supported for use based on this technical
evaluation, it should be rejected. When it is found that the model could be
appropriate, but there are questionable assumptions, then the model can be
designated as marginal and carried forward through the performance evaluation.
3.0 PROTOCOL FOR PERFORMANCE EVALUATION
The results of air quality simulation models are used in the process of
setting emission limits, determining the suitability of proposed new source
sites, etc. The goal of model performance evaluation is to determine the
degree of confidence which should be placed in these results. To achieve
this goal, model concentration estimates are compared with observed concen-
trations in a variety of ways. The primary methods of comparison produce
statistical information and constitute statistical performance evaluation.
However, statistical performance evaluation should be supplemented by addi-
tional qualitative analysis (case studies) and interpretation to ensure that
the model realistically simulates the physical processes for which it was
designed.
This section describes a process for evaluating the performance of the
proposed model and determining whether that performance is adequate for the
specific application. It describes specific statistical measures which should
be used to characterize the performance of the model. The process requires
that a protocol be prepared for comparing the performance of the reference
and proposed models and describes a scheme to weigh the relative performance
of each model according to the significance with which one model outperforms
the other and in terms of the importance of each performance category. Some
guidance is provided on how to evaluate model performance when comparison with
a reference model is not possible.
Model performance should be evaluated for each of the averaging times
specified in the appropriate regulations. In addition, performance for models
whose basic averaging time is shorter than the regulatory averaging time must
also be evaluated for that shorter period. Thus, for example, a model may
calculate one-hour concentrations for S02 and determine concentrations for
longer averaging periods from these one-hour averages. Performance of this
model would then be evaluated separately for one, three, and 24-hour averages
and, if appropriate, for the annual mean.
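As a purely illustrative sketch (Python; the file names and array handling are assumptions, since the report itself prescribes no software), the longer averaging periods can be formed from the one-hour values before any residuals are computed:

    import numpy as np

    def block_average(hourly, hours):
        # average consecutive, non-overlapping blocks of `hours` one-hour values (e.g., 3 or 24)
        hourly = np.asarray(hourly, dtype=float)
        n = (len(hourly) // hours) * hours      # drop any incomplete trailing block
        return hourly[:n].reshape(-1, hours).mean(axis=1)

    # observed and estimated one-hour concentrations at one monitor (hypothetical files)
    c_obs_1h = np.loadtxt("observed_1h.txt")
    c_est_1h = np.loadtxt("estimated_1h.txt")

    for hours in (1, 3, 24):
        obs = c_obs_1h if hours == 1 else block_average(c_obs_1h, hours)
        est = c_est_1h if hours == 1 else block_average(c_est_1h, hours)
        # the Section 3.1 performance measures would then be computed separately
        # for each averaging time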
The performance evaluation measures and procedures result in part from
the recommendations of the AMS Workshop on Dispersion Model Performance.
Appendix C presents a summary of the Workshop recommendations.
3.1 Performance Measures
Performance measures may be classified as magnitude of difference
measures and correlation or association measures. Magnitude of difference
measures present a quantitative estimate of the discrepancy between measured
concentrations and concentrations estimated by a model at the monitoring sites.
Correlation measures quantitatively delineate the degree of association between
estimations and observations. The quantitative measures should be supplemented
by informative graphical techniques and interpretations such as histograms,
isopleth analyses, scatter diagrams and the like. This subsection discusses
the recommended performance measures and analyses.
Magnitude of difference performance measures compare estimated and
observed concentrations through analysis of the model residual, d, defined as
the difference between observed and estimated concentrations. (See Appendix B
for more complete discussion of the performance measures.) The model residual
measures the amount of model underestimation. The relative residual, i.e., the
percent underestimation by the model, should be calculated as supplementary
information. The relative residual provides information more readily commu-
nicated to those with nontechnical backgrounds.
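A minimal sketch of these two quantities (Python; the function name and the guard against zero observed values are assumptions, not part of the report):

    import numpy as np

    def residuals(c_obs, c_est):
        # model residual d = observed - estimated; positive d indicates underestimation
        c_obs = np.asarray(c_obs, dtype=float)
        c_est = np.asarray(c_est, dtype=float)
        d = c_obs - c_est
        # relative residual: percent underestimation relative to the observed value
        rel = np.where(c_obs != 0, 100.0 * d / c_obs, np.nan)
        return d, rel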
The model residuals are analyzed to provide values for the following
aspects of model performance: accuracy of the prediction of peak concentrations,
average model bias, model precision and model gross variability.
3.1.1 Accuracy of Peak Prediction
The accuracy of the peak predictions should be evaluated
and reported to conform with the somewhat conflicting requirements of
evaluations responsible to regulatory standards or increments and those
responsible to the needs of statistical reliability. Therefore, the perform-
ance measures to evaluate the accuracy of the peak predictions consist of
the set of residuals Dn, paired in various combinations of space and time,
which measure the amount of underestimation of the nth highest estimation,
and the more complete analysis of the set of residuals for the highest 5%
of the observations or for the highest 25 observations, whichever is greater.
Observed and estimated peak concentrations can be paired in
space and time in the four ways listed in Table 3.1. Each measure in Table 3.1
should be calculated for each short-term averaging period specified in regula-
tions in addition to the one-hour averaging time. (The appropriate relative
residual set should also be calculated.) Thus, for example, the residuals D2
may be required for a problem involving possible violations of the three-hour
NAAQS for SO2 where the highest, second-highest concentrations are at issue.
The accuracy of the highest or second-highest estimate is,
however, difficult to evaluate statistically. Statistical evaluations have
greater meaning when applied to a larger number of values than to one or two
extremes. Therefore, the set of residuals Dn, where n extends over the top
5% of the observed concentrations, is evaluated for the properties of model
bias, and model precision as discussed in Sections 3.2.1 and 3.1.3. If there
are fewer than 500 observations, then n extends over the top 25 observations.
Table 3.1 Residuals to Measure Accuracy of Peak Prediction

    Paired in          Residual Set
    Space & Time       Dn(Ln, Tn) = Co(Ln, Tn) - Cp(Ln, Tn)
    Space not Time     Dn(Ln, T)  = Co(Ln, Tn) - Cp(Ln, Tj)
    Time not Space     Dn(L, Tn)  = Co(Ln, Tn) - Cp(Lj, Tn)
    Unpaired           Dn(L, T)   = Co(Ln, Tn) - Cp(Lm, Tm)

    Ln = monitor site of the nth highest observed concentration.
    Tn = time of the nth highest observed concentration.
    Tj = time of the nth highest estimated concentration at site Ln.
    Lj = site of the nth highest estimated concentration during time Tn.
    Co(Ln, Tn) = nth highest observed concentration.
    Cp(Lm, Tm) = nth highest estimated concentration, where (Lm, Tm) are the site
                 and time of that estimate (generally, Lm ≠ Ln and Tm ≠ Tn).
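For illustration only, a Python sketch of the four Table 3.1 pairings, assuming the observed and estimated concentrations are held in site-by-time arrays (the array layout, function name and rank argument are assumptions):

    import numpy as np

    def peak_residuals(c_obs, c_est, n=1):
        # c_obs, c_est: arrays of shape (n_sites, n_times); n = rank of the peak (1 = highest)
        c_obs = np.asarray(c_obs, dtype=float)
        c_est = np.asarray(c_est, dtype=float)
        # site Ln and time Tn of the nth highest observed concentration
        flat_rank = np.argsort(c_obs, axis=None)[::-1][n - 1]
        Ln, Tn = np.unravel_index(flat_rank, c_obs.shape)
        co = c_obs[Ln, Tn]                                   # Co(Ln, Tn)
        Tj = np.argsort(c_est[Ln, :])[::-1][n - 1]           # time of nth highest estimate at Ln
        Lj = np.argsort(c_est[:, Tn])[::-1][n - 1]           # site of nth highest estimate at Tn
        cp_nth = np.sort(c_est, axis=None)[::-1][n - 1]      # nth highest estimate, any site/time
        return {
            "paired in space and time": co - c_est[Ln, Tn],  # Dn(Ln, Tn)
            "paired in space, not time": co - c_est[Ln, Tj], # Dn(Ln, T)
            "paired in time, not space": co - c_est[Lj, Tn], # Dn(L, Tn)
            "unpaired": co - cp_nth,                         # Dn(L, T)
        }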
Since statistical analysis cannot supply all desired infor-
mation concerning model performance, supplementary case studies should be
included which examine whether the model is able to replicate a number of
the peak concentrations. The following analyses are suggested:
(1) Measured and calculated concentrations and patterns
are compared for those periods corresponding to the highest 25 observed
values. The- case study should include consideration of the meteorological
conditions associated with the events and should consider averaging times
of one hour as well as for averaging times important to regulatory standards.
(2) For any critical monitoring location, compare the
meteorological conditions such as stability and wind speed class, producing
the highest 25 measured and calculated concentrations. The number and type
of meteorological conditions will be determined by the model input parameters.
The case study approach can identify problems with a model
which might not be so readily apparent from a statistical performance meas-
ure. For example, if high measured concentrations occur during a period
for which the model estimated zero values everywhere, then the treatment of
mixing height penetration by a plume or the value of σ may be wrong.
Similarly, if most of the highest concentration measurements occur with
slightly unstable conditions, while the highest concentration estimates
occur with very unstable conditions then either the method used for assign-
ing stability or the choice of dispersion curves associated with different
stabilities may be in error. The results of these case studies may indicate
the physical reasons for poor performance values of any of the measures
listed in Table 3.1. The degree of interpretation and conclusions to be
derived from these analyses depend on the confidence placed on the accuracy
and representativeness of the model input data. If data from tracer net-
works are available, the case studies should include analysis of those
periods with meteorological conditions of poor dispersion.
3.1.2 Average Model Bias
Model bias is measured by the value of the model residual
averaged over an appropriate range of values. Large over- and underestimations
may cancel in computing this average. Supplementary information concerning
the distribution of residuals should therefore be supplied. This supple-
mentary information consists of confidence intervals about the mean value,
calculated according to the methods presented in Appendix B and histograms or
frequency distributions of model residuals.
For certain applications, especially cases in which the
candidate model is designed to simulate concentrations occurring during
important meteorological processes, it can be important to estimate model
bias under different meteorological conditions. Data disaggregation must
compromise between the desired goals of defining a large enough number of
meteorological categories to cover a wide range of conditions and having
a sufficient number of observations in each category to calculate statisti-
cally meaningful values. For example, it may be appropriate to stratify
data by lumped stability classes, unstable (A-C), neutral (D) and stable
(E-F) rather than by individual classes A, B, C, D, E, and F.
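A minimal sketch of the bias calculation (Python; the t-based confidence interval shown is one common choice and the category labels are hypothetical, since Appendix B gives the report's own formulas):

    import numpy as np
    from scipy import stats

    def bias_with_ci(d, conf=0.95):
        # average model bias (mean residual) with a two-sided t confidence interval
        d = np.asarray(d, dtype=float)
        mean = d.mean()
        half = stats.t.ppf(0.5 + conf / 2.0, len(d) - 1) * d.std(ddof=1) / np.sqrt(len(d))
        return mean, (mean - half, mean + half)

    def bias_by_category(d, categories):
        # bias stratified by lumped stability class, e.g., "A-C", "D", "E-F" (hypothetical labels)
        d = np.asarray(d, dtype=float)
        categories = np.asarray(categories)
        return {c: d[categories == c].mean() for c in np.unique(categories)}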
3.1.3 Model Precision
Model precision refers to the average amount by which
estimated and observed concentrations differ as measured by residuals with
no algebraic sign. While large positive and negative residuals can cancel
when model bias is calculated, the unsigned residuals comprising the precision
measures do not cancel and thus provide an estimate of the error scatter
about some reference point. This reference point can be the mean error or
the desired value of zero. Two types of precision measure are the noise,
which delineates the error scatter about the mean error, and the gross
variability, which delineates the error scatter about zero error.
The performance measure for noise is either the variance
of the residuals, Sd^2, or the standard deviation of the residuals, Sd.
The performance measure for gross variability is the mean square error,
or the root mean square error. An alternate performance measure for the
gross variability is the mean absolute residual, |d|. The mean absolute
residual is statistically more robust than the root mean square error; that
is, it is less affected by removal of a few extreme values.
Supplementary analyses for model precision should include
tables or histograms of the distribution of performance measures and com-
putation of these measures for the same meteorological categories discussed
in Section 3.1.2.
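A sketch of these precision measures (Python; the function and key names are assumptions):

    import numpy as np

    def precision_measures(d):
        d = np.asarray(d, dtype=float)
        return {
            "noise_variance": d.var(ddof=1),              # Sd^2: scatter about the mean residual
            "noise_std": d.std(ddof=1),                   # Sd
            "mean_square_error": np.mean(d ** 2),         # gross variability about zero error
            "root_mean_square_error": np.sqrt(np.mean(d ** 2)),
            "mean_absolute_residual": np.abs(d).mean(),   # robust alternative to the RMSE
        }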
3.1.4 Correlation Analyses
Correlation analyses involve calculating parameters re-
sulting from linear least squares regression and presenting associated
graphical analyses and their interpretation. The numerical results con-
stitute quantitative measures of the association between estimated and
observed concentrations. The graphical analyses constitute supplementary
qualitative measures of the same information. There are three types of
correlation analysis: coupled space-time analysis, spatial analysis and temporal analysis.
Coupled space-time correlation analysis involves com-
puting the Pearson's correlation coefficient, r, and parameters, a and b,
of the linear least squares regression equation. A scattergram of the
Co(L, T), Cp(L, T) data pairs is supplementary information which should
be presented.
Spatial correlation analysis involves calculating the spatial
correlation coefficient and presenting isopleth analyses of the estimated and
observed concentrations for particular periods of interest. The spatial
coefficient measures the degree of spatial alignment between the estimated
and observed concentrations. The method of calculation involves computing
the Pearson's correlation coefficient for each time period and determining
an average over all time periods. Specifics are discussed in Appendix B.
Estimates of the spatial correlation coefficient for single
source models are most reliable for calculations based on data intensive
tracer networks. Isopleths of the distributions of estimated and observed
concentrations for periods of interest should be presented and discussed.
Temporal correlation analysis involves calculating the temporal
correlation coefficient and presenting time series of observed and estimated
concentrations or of the model residual for each monitoring location. The
temporal correlation coefficient measures the degree of temporal alignment
between observed and estimated concentrations. The method of calculation
is similar to that for the spatial correlation coefficient. Time series of
Co and Cp or of model residuals should be presented and discussed for each
monitoring location.
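A sketch of the spatial and temporal coefficients (Python; the simple arithmetic average over periods or sites used here is an assumption, since Appendix B specifies the report's own averaging method):

    import numpy as np

    def pearson_r(x, y):
        return np.corrcoef(x, y)[0, 1]

    def spatial_correlation(c_obs, c_est):
        # per-period Pearson r computed across monitoring sites, then averaged over periods
        # c_obs, c_est: arrays of shape (n_sites, n_times)
        r_by_period = [pearson_r(c_obs[:, t], c_est[:, t]) for t in range(c_obs.shape[1])]
        return float(np.nanmean(r_by_period))

    def temporal_correlation(c_obs, c_est):
        # per-site Pearson r computed across time, then averaged over monitoring sites
        r_by_site = [pearson_r(c_obs[s, :], c_est[s, :]) for s in range(c_obs.shape[0])]
        return float(np.nanmean(r_by_site))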
3.2 Protocol for Model Comparison
The model performance measures described in Section 3.1 are
appropriate for most regulatory applications where the relative performance
of two competing air quality models is to be evaluated. Each performance
measure, when calculated for the proposed model and the reference model,
provides certain statistics, or in some cases somewhat more qualitative
measures, which can be used to discriminate between the capabilities of the
two models to reproduce the measured concentration.
The objective scheme for considering the relative importance of
each performance measure and significance of the difference in performance
of the two models is called the model comparison protocol. This section
discusses the factors to be considered in establishing such a protocol for
an individual performance evaluation. Lack of experience with performance
evaluations prevents writing sets of objective protocols to cover all types
of problems. Rather, a specific protocol needs to be written for each per-
formance evaluation. The objective of the protocol is to establish objective
weights for each performance measure and for the degree of intermodel dif-
ference. It is very important that such a protocol be written before the
data base is selected or collected and before any performance measures are
calculated so as not to bias the final outcome.
The model comparison protocol basically addresses two questions:
(1) What relative importance should each performance measure hold in the final
decision scheme? For example, would model bias be a more important factor than
gross variability or good spatial correlation? Or, for example, is accurate
prediction of the magnitude of the peak concentration more important than
accurate prediction of the location of that peak? Answers to these questions
may vary according to the application. (2) What consideration should be
given to the degree of difference in performance between the two models?
It seems apparent that the more confidence one has that one model is performing
better than the other, the more weight that result would carry in the
final decision on the appropriateness of using that model. Clearly this
is important when at least one of the models is performing moderately well.
For example, if only one model appears to be unbiased, the degree to which
the other is more biased can be a factor in weighing the relative ad-
vantage of the apparently unbiased model.
Section 3.2.1 discusses criteria to be considered in determining the
relative importance of performance measures. Section 3.2.2 covers techniques
for establishing relative confidence in the ability to discriminate between
the performance of the two models. Section 3.2.3 provides a rationale for
combining these two schemes and a suggested format for the protocol.
3.2.1 Relative Importance of Performance Measures
This subsection discusses factors to be considered when
determining what relative weights the various performance measures should
carry in the overall evaluation of model performance. The assumption is that
the performance results may suggest that the proposed model performs better
for some aspects and the reference model for others. Those measures of
performance which best characterize the ability of either model to more
accurately estimate the concentrations that are critical to decision making
should carry the most weight. For example, the reference model may exhibit
better performance in estimating the overall concentration field but
perform poorer in estimating the concentrations in the vicinity of the
maximum concentration. If the estimated maximum concentration controls the
emission limit on the source(s) then more weight should to given to per-
formance measures that assess the models' capability to accurately estimate
the maximum. In this example, however, some weight should still be given
to the relative model performance over the entire domain since this is a
measure of the models' capabilities to correctly account for atmospheric
processes that influence ambient concentrations and thus adds to (subtracts
from in this example) the credibility of the conclusion that the proposed
model more accurately predicts the maximum.
A suggested scheme for determining the relative importance
of the performance measures is to: (1) define a set of "modeling objectives"
or desirable attributes of model performance appropriate to the regulatory
problem (the intended application); (2) rank these objectives in order
of importance; and (3) assign a maximum possible numerical score that each
objective should carry in the overall performance evaluation. Each perform-
ance measure and analysis which supports the objective is listed under that
objective and, perhaps, each numerically weighted according to how well it
supports the relative capability of the models to meet that objective.
The scheme is best illustrated by an example. Assume that
for a given application accurate prediction of the maximum concentration
is the most important modeling objective and that it should carry a weight
of 50 out of a total of 100 possible points. (The other modeling objectives
would encompass the remaining 50 points.) If a proposed model is clearly
better than the reference model, i.e., is unequivocally supported by the
performance measure statistics and analyses that characterize that objective,
then a score of 50 would be assigned to the comparison between the two models.
Conversely, if the reference model is clearly supported then the score of -50
would be assigned to the comparison. A score of zero would indicate that the
performance of the two models is the same.
The performance measures that support the determination as to
which model better meets the objective of accurate prediction of the maximum
concentration are: (1) Dn(Ln,Tn) and Dn(L,T), where n might be the second-
highest concentration; (2) the bias, noise and gross variability of Dn,
where n extends over the upper end of the frequency distribution of the
observed data; and (3) the case studies described in Section 3.1.1. Of
the total possible 50 points for this objective, performance measure
(1) might carry a weight of 20; (2) 15 points; and (3) 15 points.
rationale for assigning these weights (recall that this is done before any
data are available) is that the proposed model might do poorly on the
second-highest concentration but that if it performs better over the upper
end of the frequency distribution and accounts for the meteorological
variables correctly (the case studies), the comparison score could still be
positive. This rationale also assumes that the peak concentration statistics
are usually non-robust, i.e., only minimal confidence can be placed in
single values of Dn unless they are supported by other statistical data.
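To make the arithmetic of this example concrete, a sketch of the tally (Python; every weight and comparison score below is hypothetical and would be fixed in the protocol before any data are examined):

    # each entry: (performance measure, weight, comparison score)
    # the score runs from -1 (reference model clearly better) through 0 (no
    # meaningful difference) to +1 (proposed model clearly better)
    protocol = {
        "accurate prediction of the maximum (50 of 100 points)": [
            ("Dn(Ln,Tn) and Dn(L,T)",                    20, +0.5),
            ("bias, noise, gross variability of top 5%", 15, +1.0),
            ("case studies, Section 3.1.1",              15,  0.0),
        ],
        # the remaining modeling objectives would account for the other 50 points
    }

    total = sum(weight * score
                for measures in protocol.values()
                for _, weight, score in measures)
    print(f"net comparison score: {total:+.1f} (positive favors the proposed model)")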
To generalize the scheme on how to consider the relative
importance of modeling objectives and their supporting performance measures,
it is suggested that modeling objectives be ranked in order of importance
as first, second and third order objectives.
First order objectives might be: Those concentrations essential
to the decision in question are accurately estimated. The essential con-
centrations are defined by appropriate regulations and are usually given
in terms of some peak concentration such as the second-high concentration.
As noted in Section 3.1.1 the performance measures, Dn, for the peak esti-
mations are not statistically very meaningful since their values could
change significantly when using equivalent data from another time period.
Therefore, additional statistical and qualitative analyses must be pre-
sented to lend confidence to the residuals for the estimation of the peak.
The performance measures and analysis which delineate the
extent to which a model meets the first-order objectives are, therefore:
- The appropriate residuals from Table 3.1.
- Accuracy and precision measures for the top 5% (or top
25) of the observed concentrations.
- Results of case studies described in Section 3.1.1.
These are summarized for the major model type-task categories in Table 3.2.
Second-order objectives might be: Pollutant concentrations
are modeled accurately and precisely over an extended range of concentrations.
This range of concentrations should be determined on a case-by-case basis.
The performance measures which quantify the degree to which a model meets
the second-order objectives are summarized in Table 3.3.
Third-order objectives might be: Concentration patterns are
modeled realistically over the range of meteorological or other conditions
of interest. Demonstration that a model meets these goals involves cor-
relation analyses summarized in Table 3.4.
TABLE 3.2 Summary of Performance Measures for First Order Objectives (Example)

Type of model/source: Single or multiple source; stable pollutant; short term
  Task: Compliance with NAAQS; site of lesser importance
    Performance measures:
      1. Dn(Ln,Tn), Dn(L,T)
      2. Bias, noise and gross variability of Dn(Ln,Tn) and Dn(L,T)
      3. Case studies as in Section 3.1.1
    Remarks:
      1. n specified by the regulations
      2. n extends over the upper 5% of the observations or the top 25
         observations, whichever is greater
  Task: Site critical, e.g., PSD Class I
    Performance measures:
      1. Dn(Ln,Tn), Dn(L,T)
      2. Bias, noise and gross variability
      3. Spatial correlation of tracer network data
      4. Case studies described in Section 3.1.1
    Remarks:
      1. n specified by the regulation
      2. n extends over the top 5% or top 25 observations, whichever is greater
      3. Supplement with isopleths of Co and Cp for high Co periods

Type of model/source: Single source or multiple source; stable pollutant; long term
  Task: Compliance with NAAQS; site of lesser importance
    Performance measures:
      1. Bias of Dn(L,T)
      2. Bias by meteorological category as discussed in Section 3.1.2
    Remarks:
      1. n extends over all observations above a small cutoff value
  Task: Site critical
    Performance measures:
      1. Bias of Dn(L,T)
      2. Bias at critical receptor by meteorological category as discussed
         in Section 3.1.2
    Remarks:
      1. L = critical site(s); n extends over all observations above a small cutoff

Type of model/source: Multiple source; short term
  Task: Compliance with NAAQS
    Performance measures:
      1. Dn(L,T)
    Remarks:
      1. Simulations for a few days with an urban airshed model
TABLE 3.3 Summary of Performance Measures for Second Order Objectives (Example)

Type of model/source: Single or multiple source; stable pollutant; short term
  Task: Compliance with NAAQS; site of lesser importance
    Performance measures:
      1. Bias, noise and gross variability of Dn(L,T) over all sites
      2. Bias, noise and gross variability by meteorological category as in
         Section 3.1.2
      3. Comparison of cumulative frequency distributions of Co and Cp
    Remarks:
      1. n extends over all observations above a small cutoff. Also supply
         distributions of parameters
      3. Tests for goodness of fit
  Task: Site critical, e.g., PSD Class I
    Performance measures:
      1. Bias, noise and gross variability of Dn(L,T) at critical sites
      2. Bias, noise and gross variability at critical sites by meteorological
         category as in Section 3.1.2
      3. Comparison of cumulative frequency distributions at critical sites
    Remarks:
      1. See Remark 1 above
      3. Tests for goodness of fit

Type of model/source: Single or multiple source; stable pollutant; long term
  Task: Compliance with NAAQS
    Performance measures:
      1. Noise and gross variability of Dn(L,T)
      2. Noise and gross variability of Dn(Ln,Tn) as in Section 3.1.2
    Remarks:
      1. See Remark 1 above
  Task: Site critical
    Performance measures:
      1. Noise and gross variability at critical sites
      2. Noise and gross variability at critical sites by meteorological category
    Remarks:
      1. See Remark 1 above

Type of model/source: Multiple source; photochemical; short term
  Task: Compliance with NAAQS
    Performance measures:
      1. Bias, noise and gross variability of Dn(L,T)
    Remarks:
      1. See Remark 1 above
TABLE 3.4 Summary of Performance Measures for Third Order Objectives (Example)

Type of model/source: Single source and multiple source; stable pollutant; short term
  Task: Compliance with NAAQS; site of lesser importance
    Performance measures:
      1. Space-time correlation, all data
      2. Spatial correlation
      3. Temporal correlation
    Remarks:
      1. Supply scattergrams
      2. Isopleths of Co and Cp for important categories
      3. Time series of Co and Cp at each site
  Task: Site critical
    Performance measures:
      1. Space-time correlation
      2. Spatial correlation
      3. Temporal correlation
    Remarks:
      1. Remark 1 above
      2. Remark 2 above
      3. Time series of Co and Cp at critical sites

Type of model/source: Single source and multiple source; stable pollutant; long term
  Task: All
    Performance measures:
      1. Space-time correlation
    Remarks:
      1. Remark 1 above

Type of model/source: Multiple source; photochemical; short term
  Task: Compliance with NAAQS
    Performance measures:
      1. Space-time correlation
      2. Spatial correlation
      3. Temporal correlation
    Remarks:
      1. Remark 1 above
      2. Remark 2 above
      3. Time series of Co and Cp at each monitor site
3.2.2 Comparison of the Performance Measure Statistics for the
Proposed and Reference Models
Once the relative importance of the modeling objectives and
the performance measures that support each objective have been established, it
is necessary to define the rationale to be used in determining the degree
to which each pair of performance measure statistics (or analysis) supports
the advantage of one model over the other. Stated differently, it is
necessary to have a measure of the degree to which better performance of
one model over the other can be established for each performance measure.
While confidence levels are useful for displaying and com-
paring model performance, they provide no direct statistical measure of the
significance associated with comparative performance of the two models. By
selecting a predetermined statistical level of significance, a reasonably
objective scheme can be established for displaying and weighting the relative
performance of each model. For example, it may be desirable to select the
rejection probability at 5% for comparisons of the model bias or for com-
parison of model noise. This figure (5%) can be interpreted as the probabil-
ity that the statistical test will suggest better performance by one model
when, in fact, neither model is performing better. Procedures for establish-
ing confidence limits on each model's performance and for testing the advantage
of one model are described in Appendix B.
The concept is easily applied to the performance measures
of precision, which measure the scatter of residuals. The most appropriate
statistic, e.g., the ratio of model noise, is selected using Appendix B and
used to determine the statistical significance of the comparison. Higher
significance levels, say 5%, would be associated with a high level of
confidence that the model with the smaller noise (scatter of residuals) is better. In the
protocol an attempt would be made to incorporate the range of possible levels
into an objective scheme. For example, if the maximum possible score
associated with precision is 10 (positive indicates that the proposed model
is better), a score of 10 would be assigned to the proposed model if comparative
statistics were of the 5% significance level or less. A zero score would
be associated with the 50% level, and intermediate scores would be assigned
to significance levels between 5% and 50% in some supportable fashion. Simi-
larly, if the statistics suggested that the reference model had better pre-
cision, then analogous significance levels could be determined and used to
assign negative scores to the comparison.
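One possible reading of this scheme is sketched below. It assumes that the noise comparison is made with a variance-ratio (F) test on the two sets of residuals and that significance levels are mapped linearly onto the score between the 5% and 50% levels; both choices are illustrative (the report leaves the choice of statistic to Appendix B and the mapping to the protocol writers), and the independence assumption behind the F test is subject to the caution given at the end of this subsection.

# Sketch: score the precision (noise) comparison from the significance of a variance-ratio test.
# Assumptions: residuals are treated as independent (see the caution in the text), the proposed
# model's noise is compared to the reference model's with an F test, and significance levels are
# mapped linearly to scores between the 5% level (full points) and the 50% level (zero points).
import numpy as np
from scipy import stats

def precision_score(d_prop, d_ref, max_score=10.0):
    d_prop, d_ref = np.asarray(d_prop, float), np.asarray(d_ref, float)
    var_p, var_r = d_prop.var(ddof=1), d_ref.var(ddof=1)
    # One-sided F test that the apparently noisier model really is noisier.
    if var_p <= var_r:                        # proposed model has less scatter
        f, sign = var_r / var_p, +1.0
        dfn, dfd = len(d_ref) - 1, len(d_prop) - 1
    else:                                     # reference model has less scatter
        f, sign = var_p / var_r, -1.0
        dfn, dfd = len(d_prop) - 1, len(d_ref) - 1
    p = stats.f.sf(f, dfn, dfd)               # significance level of the observed ratio
    # Linear mapping: p <= 0.05 -> full score, p >= 0.50 -> zero, linear in between.
    weight = float(np.clip((0.50 - p) / (0.50 - 0.05), 0.0, 1.0))
    return sign * max_score * weight, p

rng = np.random.default_rng(0)
resid_proposed = rng.normal(0.0, 0.08, 200)   # hypothetical residuals (ppm)
resid_reference = rng.normal(0.0, 0.12, 200)
score, p = precision_score(resid_proposed, resid_reference)
print(f"noise-ratio significance level p = {p:.3f}, precision score = {score:+.1f} of +/-10")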
The concept can be easily extended to determine the relative
performance of each model with respect to accuracy. The question is
whether one model has less (more) bias than the other. In this case, it's
unimportant whether one model tends to overestimate or underestimate,-only
whether one model tends to be more biased than the other.
The significance level of the difference in bias between the
two models is the indicator used in assigning the relative performance
score for accuracy. For example, if the reference model has a bias (either
too high or too low) which is significantly smaller than that of the proposed model
at the 10% significance level, then a score of -10 out of a possible -15
might be assigned to the comparison.
For single-valued residuals, Dn(Ln,Tn), objective tests for
determining the significance associated with observed differences between
residuals are not well developed. For this reason, a simple scheme seems
to be a reasonable alternative to significance testing, such as one which
assigns the maximum permissible score to the smallest absolute residual.
For a specific case study it may not be possible to form a
totally objective basis for comparing the two models. However, it is still
important to clearly define in the protocol the methodology to be used so as
not to compromise the decision on these performance measures once the results
are known.
For the performance measures that involve correlation coefficients,
the rationale is analogous to that for the unsigned residuals. The model
with the higher correlation coefficient is better. The degree of advantage
is based, in an objective and supportable manner, on the significance level
associated with the appropriate statistic (see Appendix B) that compares the
relative magnitude of the two correlation coefficients.
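As an illustration only, such a comparison might be made with a Fisher z transformation of the two correlation coefficients, as sketched below. This particular test treats the two coefficients as coming from independent samples, which is not strictly true here since both models are scored against the same observations; the caution in the next paragraph applies, and Appendix B governs the actual choice of statistic.

# Sketch: compare two correlation coefficients (observed vs. proposed-model estimates, and
# observed vs. reference-model estimates) with a Fisher z test.  Assumption: the two samples
# are treated as independent, which the surrounding text notes is only approximately true.
import math

def compare_correlations(r_proposed, r_reference, n_proposed, n_reference):
    """Return the z statistic and two-sided significance level for r_proposed vs. r_reference."""
    z1 = math.atanh(r_proposed)               # Fisher transformation
    z2 = math.atanh(r_reference)
    se = math.sqrt(1.0 / (n_proposed - 3) + 1.0 / (n_reference - 3))
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))    # two-sided p-value from the standard normal
    return z, p

z, p = compare_correlations(r_proposed=0.78, r_reference=0.65, n_proposed=400, n_reference=400)
print(f"z = {z:.2f}, significance level = {p:.3f}")  # small p favors the model with the higher r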
Caution should be applied in interpreting the statistical
significance associated with comparisons of model performance for each of
the various performance measures. This is especially crucial since each of
the various statistical tests is based to a varying degree on the assump-
tion that model residuals are independent of one another, an assumption that
is clearly not true. For example, model residuals from adjacent time periods
(e.g., hour to hour) are known to be positively correlated. Also, the pro-
posed and reference model residuals for a given time period are related
since each residual is calculated by subtracting the same observed concen-
tration from the estimates of the two models. For these reasons, classical
tests described in Appendix B should be viewed as practical interim guides
until more rigorous statistical tests for comparing model performance can be
evaluated.
3.2.3 Format for the Model Comparison Protocol
The specification of an objective technique for considering
the relative importance of the various attributes of good model performance
(Section 3.2.1) and the rationale for deciding how well each attribute is
supported by one model or the other (Section 3.2.2) constitutes the overall
scheme for judging model superiority in the performance evaluation. A
suggested format for the model comparison protocol, based on the scoring
scheme discussed above, is provided as Table 3.5.
In the first column of the table the modeling objectives
relevant to the regulatory problem are listed. The second column lists the
performance measures that support that objective. The third column lists
the maximum scores (±) that could be attained for each objective and for
each of its supporting performance measures. A maximum positive score could
be obtained if the proposed model is unequivocally supported; a maximum
negative score if the reference model is unequivocally better. In the
fourth column the reasons supporting the distribution of maximum weights
among the various objectives and performance measures should be listed. The
last column should describe in objective terms the rationale to be used for
scoring each performance measure. (In Section 3.2.2 a rationale tied to the
confidence levels was suggested for most measures.)
In the middle of the table space is left for any proposed adjustments
to the total score that are not adequately represented by the performance
statistics. It might be agreed initially that for a particular attribute
either the proposed model or reference model is not adequately characterized
by the performance measure statistics and should be accounted for in the ob-
jective sense as described under "basis."
TABLE 3.5 Suggested Format for the Model Comparison Protocol

Modeling     Supporting             Maximum        Basis for        Rationale for Scoring
Objective    Performance Measures   Score          Maximum Score    (Significance Criteria)

1.           a.                     1. a.          1. a.            1. a.
             b.                        b.             b.               b.
             c.                        c.             c.               c.
2.           a.                     2. a.          2. a.            2. a.
             b.                        b.             b.               b.
3.           ...                    ...            ...              ...
                                    Total = 100

Adjustments to Score                Basis                Rationale
1.                                  1.                   1.
2.                                  2.                   2.

Decision Rationale
   Better:  Score > ____
   Same:    ____ > Score > ____
   Worse:   Score < ____

Absolute Criteria                   Basis
1.                                  1.
2.                                  2.
Below this, the total scores to be used in judging the overall
model performance are defined. The positive score, above which the proposed
model would be judged to perform significantly better than the reference model,
is listed on the bottom line. Marginal scores would form an interval (presumably
symmetric) about zero and would be associated with the conclusion that one can-
not really discriminate between the performance of the two models.
A number of factors should be considered in the rationale that
supports the width of the marginal interval. Some of these factors are related
to the representativeness and the amount of data. For example, if off-site
data were used, it might be decided to reflect the uncertainty in the
representativeness of the data by having a rather broad band of marginal
model performance.
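A trivial sketch of how the decision lines of the protocol might be executed is given below; the ±15 point half-width of the marginal band is purely illustrative and must be justified in the protocol, for example by widening it when off-site or otherwise less representative data are used, as discussed above.

# Sketch: translate the total comparison score into the better/same/worse decision.
# The band half-width (15 points here) is a placeholder; the protocol must justify it.
def performance_decision(total_score, marginal_half_width=15.0):
    if total_score > marginal_half_width:
        return "better"     # proposed model performs significantly better than the reference model
    if total_score < -marginal_half_width:
        return "worse"      # reference model performs significantly better
    return "same"           # performance of the two models cannot really be distinguished

for s in (+32.5, +8.0, -21.0):
    print(f"total score {s:+6.1f} -> {performance_decision(s)}")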
Finally, at the bottom of the table, space is left for any
"absolute" requirements on model performance. These criteria would allow the
setting of any a priori standards of performance. For example, the initial
decision may be that if the proposed model is found to be grossly inaccurate
or grossly biased (gross must be defined), it would not be acceptable for the
application even though it performs better, overall, than the reference
model.
3.3 Protocol When No Reference Model Is Available
When a reference model is not available, it is necessary to write a
different type of protocol based on case-specific criteria for the model perform-
ance. However, at the present time, there is a lack of scientific understand-
ing and consensus of experts necessary to provide a supportable basis for
establishing such criteria for models. Thus the guidance provided in this
subsection is quite general in nature. It is based primarily on the pre-
sumption that the applicant and the regulatory agency can agree to certain
performance attributes which, if met, would indicate within an acceptable level
of uncertainty that the model predictions could be used in decision-making.
A-set of procedures should be established based on
objective criteria that, when executed, will result in a decision on the
acceptability of the model from a performance standpoint. As was the case
for the model comparison protocol, it is suggested that the relative im-
portance of the various performance measures be established. Tables 3.2,
3.3, and 3.4 serve as a guide. However, the performance score for each
measure should be based on statistics of d, the deviation of the model
estimates from the true concentration as indicated by the measured con-
centrations. For each performance measure, criteria should be written in terms
of a statistical test. For example, it might be stated that the average
model bias should not be greater than ± X at the Y% significance level. Some
considerations in writing such criteria are:
1. Conservatism—This involves the introduction of a pur-
poseful bias that is protective of the ambient standards or increments, i.e.,
overprediction may be more desirable than underprediction.
2. Risk—It might be useful to establish maximum or average
deviation from the measured concentrations that could be allowed.
3. Case Studies—As mentioned in Section 2.5 there may be
certain model assumptions or model features that are critical to the intended
application. Minimum acceptable performance of the model in certain case
studies designed to focus on these critical situations could be established.
4. Experience in the Performance of Models—Several
references in the literature (8, 9, 10, 11) describe the performance of various
models. These references can serve as a guide in determining the per-
formance that can be expected from the proposed model, given that an anal-
ogy with the proposed model and application can be drawn.
As was the case for the model comparison protocol, a
decision format or table analogous to Table 3.5 should be established.
Execution of the procedures in the table should lead to a conclusion that
the performance is acceptable, unacceptable or marginal.
4.0 DATA BASES FOR THE PERFORMANCE EVALUATION
This section describes interim procedures for choosing, collecting
and analyzing field data to be used in the performance evaluation. In general
there must be sufficient accurate field test data available to adequately
judge the performance of the model in estimating all the concentrations
of interest for the given application.
Three types of data can be used to evaluate the performance of a pro-
posed model. The preferred approach is to utilize meteorological and air
quality data from a specially designed network of monitors and instruments
in the vicinity of the source(s) to be modeled (on-site data). In some
cases, especially for new sources, it is advantageous to use on-site tracer
data from a specifically designed experiment to augment or be used in lieu
of long-term continuous data. In infrequent cases where an appropriate
analogy to the modeling problem can be identified, it may be possible to
utilize off-site data to evaluate the performance of the model.
As a general reference for this section, the criteria and requirements
contained in the Ambient Monitoring Guideline for Prevention of Signif-
icant Deterioration (PSD) (12) should be used. Much of the information con-
tained in the PSD monitoring guideline deals with acquiring information
on ambient conditions in the vicinity of a proposed source, but such data
may not entirely fulfill the input needs for model evaluation.
All data used as input to the air quality model and its evaluation
should meet standard requirements or commonly accepted criteria for
quality assurance. New site-specific data should be subjected to a quality
assurance program. Quality assurance requirements for criteria pollutant
measurements are given in Section 4 of the PSD monitoring guideline. Sec-
tion 7 of the PSD monitoring guideline describes quality assurance require-
ments for meteorological data.
The procedures to be used in the performance evaluation described
below in Section 4 involve a comparison of the performance of the proposed
model with that of the reference model. Thus it is necessary to provide
model estimates for both models for each receptor where measured data are
available. Usually concentration estimates and measurements are for a one-hour
period but may be for a shorter or longer period depending on the charac-
teristics of the model or the sampling method used. All valid data and the
corresponding concentration estimates from both models are needed in the per-
formance evaluation. Circumstances bearing on the representativeness of any
of the data or concentration estimates should be fully explained for consid-
eration in weighting the results of the performance statistics.
It is also necessary to sum/average estimates and data such that the
relative performance of the models can be compared for averaging times
corresponding to increments/standards or other decision criteria germane to
the problem. For example, SO2 increments and standards are written in terms
of 3-hour, 24-hour and annual averages. Concentration data and model esti-
mates for these averaging times would be used in the performance evaluation
discussed in Section 3.
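As an illustration of this aggregation step, hourly observed and estimated concentrations might be averaged to the 3-hour, 24-hour and annual averaging times as sketched below; the column names, the synthetic data, the 75 percent completeness rule, and the use of the pandas library are assumptions of convenience, not requirements of this report.

# Sketch: aggregate paired hourly observations and model estimates to the SO2 averaging times.
# Column names, the 75% completeness rule, and the synthetic data are illustrative assumptions.
import numpy as np
import pandas as pd

hours = pd.date_range("1980-01-01", periods=24 * 365, freq="h")
rng = np.random.default_rng(1)
df = pd.DataFrame({"Co": rng.gamma(2.0, 15.0, len(hours)),        # observed, ug/m3 (synthetic)
                   "Cp": rng.gamma(2.0, 14.0, len(hours))},       # estimated, ug/m3 (synthetic)
                  index=hours)

def block_average(frame, rule, min_fraction=0.75):
    """Block averages for one averaging time, keeping only reasonably complete periods."""
    counts = frame["Co"].resample(rule).count()
    means = frame.resample(rule).mean()
    full = counts.max()
    return means[counts >= min_fraction * full]

avg_3h = block_average(df, "3h")
avg_24h = block_average(df, "24h")
annual = df.mean()

print("second-highest 3-hour observed: ", avg_3h["Co"].nlargest(2).iloc[-1].round(1))
print("second-highest 24-hour observed:", avg_24h["Co"].nlargest(2).iloc[-1].round(1))
print("annual mean Co, Cp:", annual.round(1).to_dict())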
Finally, it should be noted that, when the model is used to make
estimates for comparison with standards/increments it is necessary to
include a longer period of record of model input data than that collected
for the performance evaluation. This is to ensure that the long-term
temporal variations of critical meteorological conditions will be adequately
accounted for. The Guideline on Air Quality Models provides some guidance
on the length of record needed for regulatory modeling.
4.1 On-Site Data
The preferable approach to performance evaluation is to
collect an on-site data base consisting of concurrent measurements of
emissions, meteorological data and air quality data. Given an adequate
sample of these data, an on-site data base designed to evaluate the proposed
model relevant to its intended application should lead to a definitive
conclusion on its applicability. The most important goal of the data col-
lection network is to ensure adequate spatial and temporal coverage of model
input and air quality data.
In general the spatial and temporal coverage of emissions,
meteorological and air quality data used in the performance evaluation should
be adequate to show with some confidence how well each model is performing at
all points and times for meteorological conditions of interest. Enough data
should be collected to allow the calculation of each applicable performance
measure discussed in Section 3.1. The data collection should emphasize the
area around receptors where high concentrations are expected under critical
meteorological conditions. Concurrent emissions data and meteorological
data should be representative of the critical conditions for the site. The
definition of receptors and meteorological conditions is best obtained from
the screening analysis and model estimates described in Section 2.1.
The number of monitors needed to adequately conduct a per-
formance evaluation is often the subject of considerable controversy. It can
be argued that one monitor located at the point of maximum concentration for
each averaging time corresponding to the standards or increments should be
sufficient. However, the points of maximum concentration are not known
but are estimated using the model or models that are themselves the subject
of the performance evaluation, which of course unacceptably compromises
the evaluation. It is possible that the use of data from one or two
monitors in a performance evaluation may actually be worse than no eval-
uation at all since no meaningful statistics can be generated and attempts
to rationalize this problem may lead to erroneous conclusions on the suit-
ability of the models.
At the other extreme is a large number of monitors, perhaps 40 or
more, that cover the entire modeling domain or area where significant concen-
trations above a small cutoff can reasonably be expected, and with enough
density such that the entire concentration field (isopleths) can be established.
Such a concentration field will allow the calculation of the needed performance
statistics and, given adequate temporal coverage as discussed below, would
likely result in narrow confidence bands on the model residuals, as dis-
cussed in Section 3.1. With these narrow confidence bands it is easier to
distinguish between the relative capabilities of the proposed model vs.
the reference model to more accurately estimate observed concentrations.
When the data field is more sparse, the confidence bands on the residuals
for the two models will be broader. As a consequence, the probability of
statistically distinguishing the difference between the performance of the
two models will be lower.
Thus, the number of monitors needed to conduct a statistically
meaningful performance evaluation should be judged in advance. Some other
factors that should be considered are:
1. The more accurate the emissions data are, the less noise
in the model residuals.
2. Similarly, the more accurately one can pinpoint the location
of the plume(s), the less noise that will occur in the model residuals. This
can be done by increasing the spatial density and degree of sophistication in
meteorological input data, for models that are capable of accepting such
data.
3. Models or submodels that are designed to handle special phenomena
would logically only be evaluated over the spatial domain where those phenomena
would result in significant concentrations. Thus, the monitoring network
should be concentrated in that area, perhaps with a few outlying monitors for
a safety factor.
In the temporal sense some of the above rationale is also appropriate.
A short-term study will lead to low or no confidence in the ability of the
models (proposed and reference) to reproduce reality. A multi-year effort
will yield several samples and model estimates of the second-highest short-
term concentrations, thus providing some basis for statistically significant
comparison of models for this frequently critical estimate. Realistically,
multi-year efforts are usually prohibitive and one has to rely on somewhat
circumstantial evidence, the upper end of the frequency distribution, to
establish confidence in the models' capabilities to reproduce the second-
highest concentration.
In general, the data collected should cover a period of record
that is truly representative of the site in question, taking into account
variations in meteorological conditions, variations in emissions and expected
frequency of phenomena leading to high concentrations. One year of data
is normally the minimum, although short-term studies are sometimes acceptable
if the results are representative and the appropriate critical concentrations
can be determined from the data base. Thus short-term studies are adequate
if it can be shown that "worst case conditions" are limited to a specific
period of the year and that the study covers that period. Examples might
be ozone problems (summer months), shoreline fumigation (summer months) and
certain episode phenomena.
Other considerations on the length of record for the per-
formance evaluation are analogous to the considerations for spatial coverage:
1. Accurate emissions data over the period of record diminish
the noise in the temporal statistics. Although data contained in a
standard emissions inventory can sometimes be used, it is generally necessary
to obtain and explicitly model real-time (concurrent with the air quality
data used in the performance evaluation) emissions data from significant sources.
"In-stack" monitoring is highly recommended to ensure the use of emission
rates comparable in time to the measured and estimated ground-level concentrations.
2. Continuous (minimum of missing data) collection of repre-
sentative meteorological input data is important.
3. Models designed to handle special phenomena need only have
enough temporal coverage to provide an adequate (produce significant stat-
istical results) sample of those phenomena. For example, a downwash algo-
rithm might be evaluated on the basis of 50 or so observations in the critical
wind speed range.
It is important that the data used in model development be
independent of those data used in the performance evaluation. In most
cases, this is not a problem because the model is either based on general
scientific principles or is based on air quality data from an analogous
situation. However, in some semi-empirical approaches where site-specific
levels of pollutants are an integral part of the model, an independent
set of data must be used for performance evaluation. The most common ex-
amples of these models are statistical approaches, where concentrations for
various averaging times utilize probability curves derived from site-
specific data, and approaches requiring calibration.
When actual air quality data are used in the performance
evaluation, it is necessary to distinguish between the contribution to the
measured concentration from sources that are included in the model and the
contribution attributable to background (or baseline levels). Section 5.4
of the Guideline on Air Quality Models discusses some methods for estimat-
ing background. Considerable care should be taken in estimating background
so as not to bias the performance evaluation. Incorporation of background
data consistently in the proposed model and the reference model is necessary
to ensure that no artificial differences in the performance statistics are
generated. For example, a "calibrated" model may implicitly include back-
ground, and if it were compared to a model where background is accounted
for differently, some biases may be introduced.
4.2 Tracer Studies
The use of on-site tracer material to simulate transport
and dispersion in the vicinity of a point or line source has received increas-
ing attention in recent years as a methodology for evaluating the performance
of air quality simulation models. This technique is attractive from a number
of standpoints.
1. It allows the impacts from an individual source to
be isolated from those of other nearby sources which may be emitting the same
pollutants.
2. It allows a precise definition of the emission rate.
3. It is generally possible to have a reasonably dense
network of receptors in areas not easily accessible for placement of a per-
manent monitor.
4. It allows for the emissions from a proposed source
to be simulated.
There are some serious difficulties in using tracers
to demonstrate the validity of a proposed model application. The execution
of the field study is quite resource intensive, especially in terms of manpower.
Samplers need to be manually placed and retrieved after each test and the
samples need to be analyzed in a laboratory. In many cases an aircraft is
required to dispense the tracer material. Careful attention must be placed
on quality control of data and documentation of meteorological conditions. As
a result most tracer studies are conducted as a short term (a few days to
a few weeks) intensive campaign where large amounts of data are collected.
If conducted carefully, such studies provide a considerable amount of useful
data for evaluating the performance of the model. However, the performance
evaluation is limited to those meteorological conditions that occur during
the campaign. Thus, while a tracer study allows for excellent spatial
coverage of pollutant concentrations, it provides a limited sample, biased
in the temporal sense, and leaves an unanswered question as to the validity
of the model for all points on the annual frequency distribution of pollutants
at each receptor.
Another problem with tracer studies is that the plume rise
phenomena may not be properly simulated unless the tracer material can be
injected into the gas stream from an existing stack. Thus, for new sources
where the material is released from some kind of platform, the effects of
any plume rise submodel cannot be evaluated.
Given these problems, the following criteria should be
considered in determining the acceptability of tracer tests:
1. The tracer samples should be easily related to the
averaging time of the standards in question;
2. The tracer data should be representative of "worst
case meteorological conditions";
3. The number and location of the samplers should be
sufficient to ensure measurement of maximum concentrations;
4. Tracer releases should represent plume rise under
varying meteorological conditions;
5. Quality assurance procedures should be in accordance
with those specified or referenced in the PSD monitoring guideline as well
as other commonly accepted procedures for tracer data;
6. The on-site meteorological data base should be adequate;
7. All sampling and meteorological instruments should be
adequately maintained;
8. Provisions should be made for analyzing tracer samples at
remote locations and for maintaining continuous operations during adverse
weather conditions where necessary.
Of these criteria, items 1 and 2 are the most difficult to
satisfy because the cost of the study precludes collection of data over
an annual period. Because of this problem it is generally necessary to
augment the tracer study by collecting data from strategically placed
monitors that are operated over a full year. The data are used to establish
the validity of the model in estimating the second-highest short term and
the annual mean concentration. Although it is preferable to collect these
data "on site," this is usually not possible where a new plant is proposed.
It may be possible to use data collected at a similar site in a model
evaluation, as discussed in the next subsection.
As is the case for a performance evaluation that uses routine
air quality data, sufficient and relevant meteorological data must be col-
lected in conjunction with the tracer study to characterize transport and
dispersion and to characterize the model input requirements. Since tracer
study data are difficult to interpret, it is suggested that the data and
methodologies used to collect the data be reviewed by individuals who have
experience with such studies.
4.3 Off-Site Data
Data collected in another location may be sufficiently rep-
resentative of a new site so that additional meteorological and air quality
data need not be collected. The acceptability of such data rests on a
demonstration of the similarity of the two sites. The existing monitoring
network should meet minimum requirements for a network required at the
new site. The source parameters at the two sites should be similar. The
source variables that should be considered are stack height, stack gas charac-
teristics and the correlation between load and climatological conditions.
A comparison should be made of the terrain surrounding
each source. The following factors should be considered:
1. The two sites fall into the same generic category of
terrain:
a. flat terrain
b. shoreline conditions
c. complex terrain:
(1) three-dimensional terrain elements, e.g.,
isolated hill
(2) simple valley
(3) two dimensional terrain elements, e.g., ridge
(4) complex valley
2. In complex terrain the following factors assist in
determining the similarity of the two sites:
a. aspect ratio of terrain, i.e., ratio of:
(1) height of valley walls to width of valley
(2) height of ridge to length of ridge
(3) height of isolated hill to width of hill base
b. slope of terrain
c. ratio of terrain height to stack/plume height
d. distance of source from terrain, i.e., how close to
valley wall, ridge, isolated hill
e. correlation of terrain feature with prevailing winds
It is very difficult to secure data sets with the above
emission configuration/terrain similarities. Nevertheless, such similarities
are of considerable importance in establishing confidence in the represent-
ativeness of the performance statistics. The degree to which the sites and
emission configuration are dissimilar is a measure of the degree to which
the performance evaluation is compromised.
More confidence can be placed in a performance evaluation
which uses data collected off-site if such data are augmented by an on-site
tracer study (See Section 4.2). In this case the considerations for terrain
similarities still hold, but more weight is given to the comparability of the
two sets of the observed concentrations. On-site tracer data can be used
to test the ability of the model to spatially define the concentration
pattern if a variety of meteorological conditions were observed during the
tracer tests. Off-site data must be adequate to test the validity of the
model in estimating maximum concentrations.
5.0 MODEL ACCEPTANCE
This section describes interim criteria which can be used to judge the
acceptability of the proposed model for the specific regulatory application.
This involves execution of the performance protocol which will lead to a
determination that the model performs better, about the same as, or worse
than the reference model or performs acceptably, marginally, or unacceptably
in relation to established site-specific criteria. Depending on the results
of the performance evaluation, the overall decision on the acceptability of
the model might also consider the results of the technical evaluation of
Section 2. Finally, because the procedures proposed in this document are
relatively new and untested, it is advisable to reexamine the conclusion
reached to see if it makes good common sense.
5.1 Execution of the Model Performance Protocol
Execution of the model performance protocol involves: (1) col-
lecting the performance data to be used (Section 4.0); (2) calculation and/
or analysis of the model performance measures (Section 3.1); and (3) combin-
ing the results in the objective manner described in the protocol (Section 3.2
or Section 3.3) to arrive at a decision on the relative performance of the
two models.
Table 5.1 shows a format which can be used to accommodate the
results of the model comparison protocol described in Section 3.2.3. If a
different protocol format is prepared, it should have the same goal, i.e., to
arrive at a decision on whether the proposed model is performing better, about
the same, or worse than the reference model.
TABLE 5.1 Suggested Format for Scoring the Model Comparison

Modeling     Supporting               Score            Statistics, Analyses and Calculations
Objective    Performance Measures                      that Support the Score

1.           a.                       1. a.            1. a.
             b.                          b.               b.
             c.                          c.               c.
2.           a.                       2. a.            2. a.
             b.                          b.               b.
             c.                          c.               c.
3.           ...                      3. ...           3. ...

Preliminary Score:
Adjustments to Score:
Final Score:

Decision on Performance Evaluation (Better, Same, Worse):
Absolute Requirements Satisfied?
The first two columns in the upper half of Table 5.1 are analogous
to those in Table 3.5. The third column contains the actual score for each
modeling objective as well as the sub-scores for each supporting performance
measure. The scores in this column cannot exceed the maximum scores allowed
in the protocol. The last column is for the statistics, graphs, analyses
and calculations that determine the score for each performance measure, al-
though most of this information would probably be in the form of attachments.
The bottom part of the table is for the preliminary score (obtained
from totaling the scores from each objective), adjustments to the score with
supporting data, analysis, etc. and the final score. The final score would
determine whether the proposed model is performing better, marginally or worse
in comparison to the reference model. This result is used in Section 5.2 to
determine overall acceptability of the model.
At the bottom of the table, space is available to include the results
(yes or no) of any absolute requirements that may be specified in the protocol.
Failure to meet these requirements presumably means the model is unacceptable.
If the decision scheme is based on performance criteria alone, a
scoring table based on the procedures contained in Section 3.3 should be
employed. The resulting conclusion of acceptable, marginal or unacceptable
is used in Section 5.2 to determine the overall acceptability of the model.
5.2 Overall Acceptability of a Proposed Model
Until more objective techniques are recommended, it is suggested
that the final decision on the acceptability of the proposed model be based
primarily on the results of the performance evaluation. The rationale is that
the overall state of the modeling science has many uncertainties in the
basic theory regardless of what model is used, and that the most weight
should be given to actual proven performance. Thus when a proposed model
is found to perform better than the reference model, it should be accepted
for use in the regulatory application. If the model performance is clearly
worse than that of the reference model, it should not be used. Similarly,
if the performance evaluation is not based on comparison with a reference
model, acceptable performance should imply that the model be accepted, while
unacceptable performance would indicate that it is inappropriate.
When the results of the performance evaluation are marginal or
inconclusive, then the results of the technical evaluation discussed in
Section 2 should be used as an aid to deciding on the overall acceptability.
In this case, a favorable (better than the reference model) technical review
would suggest that the model be used, while a marginal or worse determination
would indicate that the model offers no improvement over existing techniques.
If Section 2.5 were used to determine technical acceptability, a marginal or
inconclusive determination on scientific supportability combined with a
marginal performance evaluation would suggest that the model not be applied
to the regulatory problem.
5.3 Common Sense Perspective
One objective of this document is to provide a framework for orga-
nizing the procedures and criteria for model evaluation such that the eval-
uation can be conducted in as consistent and objective a manner as possible.
However, this framework must of necessity be flexible to allow for incorporating
additional knowledge about model evaluation, performance measures and criteria.
For example, truly objective criteria for evaluating the technical aspects of
models and scientifically acceptable model performance standards are not yet
available.
The user should realize that there are many unresolved items at this
time. Especially lacking is a totally scientific methodology for combining:
(1) various performance statistics, (2) confidence in the scientific basis
for the model, (3) data accuracies, and (4) other positive and negative
attributes of the model into a single overall determination of the applica-
bility and validity of the model. Given these concerns, two caveats are:
1. The procedures proposed are relatively untested. Although
they are based on inputs and comments from a number of scientists in the
field, it remains to be seen what problems may turn up in a real situation.
Thus, when an evaluation is completed, it seems only prudent to look back
over the analyses and the results to see if they really make sense.
2. The assumption has been made that a given regulatory problem
requires a model estimate and that the best way to determine the appropriate
technique is to evaluate the relative applicability of available models. No
determination is made on whether the models are "accurate enough" to be
accepted. This is the realm of performance standards, which are not addressed.
However, the analysis will produce performance statistics which could be com-
pared to standards, if they existed. If the statistics suggest gross in-
accuracies or biases, even in the better of the models, it might be prudent
to advise the decision maker that other modeling or monitoring information
should be used to resolve the regulatory problem.
6.0 REFERENCES
1. Environmental Protection Agency. "Guideline on Air Quality Models,"
EPA-450/2-78-027, Office of Air Quality Planning and Standards, Research
Triangle Park, N. C., April 1980.
2. American Meteorological Society. "Judging Air Quality Model Performance,"
Draft Report from Workshop on Dispersion Model Performance held at Woods
Hole, Mass., September 1980.
3. Environmental Protection Agency. "Guideline for Use of Fluid Modeling
to Determine Good Engineering Practice Stack Height," Draft for public
comment, EPA 450/4-81-003, Office of Air Quality Planning and Standards,
Research Triangle Park, N. C., June 1981.
4. Environmental Protection Agency. "Guideline for Fluid Modeling of
Atmospheric Diffusion," EPA 600/8-81-008, Environmental Sciences Research
Laboratory, Research Triangle Park, N. C., April 1981.
5. Environmental Protection Agency. "Guideline for Determination of
Good Engineering Practice Stack Height (Technical Support Document for
Stack Height Regulations)," EPA 450/4-80-023, Office of Air Quality Planning
and Standards, Research Triangle Park, N. C., July 1981.
6. U. S. Congress. "Clean Air Act Amendments of 1977," Public Law 95-95,
Government Printing Office, Washington, D. C., August 1977.
7. Environmental Protection Agency. "Workbook for Comparison of Air
Quality Models," EPA 450/2-78-028a, EPA 450/2-78-028b, Office of Air
Quality Planning and Standards, Research Triangle Park, N. C., May 1978.
8. Bowne, N. E. Preliminary Results from the EPRI Plume Model Validation
Project—Plains Site. EPRI EA-1788-SY, Project 1616 Summary Report, TRC
Environmental Consultants, Inc., Wethersfield, Connecticut, April 1981.
9. Lee, R. F., et al. Validation of a Single Source Dispersion Model,
Proceedings of the Sixth International Technical Meeting on Air Pollution
Modeling and Its Application, NATO/CCMS, September 1975.
10. Mills, M. T., et al. Evaluation of Point Source Dispersion Models,
Draft Report Submitted by Teknekron Research, Inc. to U. S. EPA, January
1981.
11. Londergan, R. J., et al. Study Performed for the American Petroleum
Institute—An Evaluation of Short-Term Air Quality Models Using Tracer
Study Data, Submitted by TRC Environmental Consultants, Inc. to API,
October 1980.
12. Environmental Protection Agency. "Ambient Monitoring Guideline for
Prevention of Significant Deterioration (PSD)," EPA 450/4-80-012, Office
of Air Quality Planning and Standards, Research Triangle Park, N. C.,
November 1980.
APPENDIX A
REVIEWER'S CHECKLIST
Each proposal to apply a nonguideline model to a specific situation needs
to be reviewed by the appropriate control agency which has jurisdiction in the
matter. The reviewing agency must make a judgment on whether the proposed
model is appropriate to use and should justify this judgment with a critique
of the applicant's analysis or with an independent analysis. This critique
or analysis would normally become part of the record in the case. It should
be made available to the public hearing process, used to justify SIP
revisions or used in support of other proceedings.
The following checklist serves as a guide for writing this critique or
analysis. It essentially follows the rationale in this document and is
designed to ensure that all of the required elements in the analysis are
addressed. Although it is not necessary that the review follow the format
of the checklist, it is important that each item be addressed and that the
basis or rationale for the determination on each item is indicated.
CHECKLIST FOR REVIEW OF MODEL EVALUATIONS
I. Technical Evaluation
A. Is all of the information necessary to understand the intended
application available?
1. Complete listing of sources to be modeled including source
parameters and locations?
2. Maps showing the physiography of the surrounding area?
3. Preliminary meteorological and climatological data?
4. Preliminary estimates of air quality sufficient to (a) determine
the areas of probable maximum concentrations, (b) identify the probable issues
regarding the proposed model's estimates of ambient concentrations, and (c)
form a partial basis for design of the performance evaluation data base?
B. Is the reference model appropriate?
C. Is enough information available on the proposed model to understand its
structure and assumptions?
D. Are the results of the technical comparison of the proposed and ref-
erence models supportable?
1. Were procedures contained in the Workbook for Comparison of Air
Quality Models followed? Are deviations from these procedures supportable
or desirable?
2. Are the comparisons for each application element complete and
supportable?
3. Do the results of the comparison for each application element support
the overall determination of better, same or worse?
E. For cases where a reference model is not used, is the proposed model
shown to be applicable and scientifically supportable?
II. Model Performance Protocol
A. Are all the performance measures recommended in the document to be
used? For those performance measures that are not to be used, are valid
reasons provided?
B. Is the relative importance of performance measures stated?
1. Have modeling objectives that best characterize the regulatory
problem been properly chosen and objectively ranked?
2. Are the performance measures that characterize each objective
appropriate? Is the relative weighting among the performance measures
supportable?
C. How are the Performance Measure Statistics for the Proposed and
the Reference Model to be Compared?
1. Are significance criteria used to discriminate between the
performance of the two models established for each performance measure?
2. Is the rationale to be used in scoring the significance criteria
supportable?
3. Is the proposed "scoreband" associated with marginal model
performance supported?
4. Are there appropriate performance limits or absolute criteria
which must be met before the model could be accepted?
D. How is Performance to be Judged When No Reference Model is Used?
1. Has an objective performance protocol been written?
2. Does this protocol establish appropriate site-specific performance
criteria and objective techniques for determining model performance relative
to these criteria?
3. Are the performance criteria in keeping with experience, with the
expectations of the model and with the acceptable levels of uncertainty for
application of the model?
III. Data Bases
A. Are monitors located in areas of expected maximum concentration
and other critical receptor sites?
B. Is there a long enough period of record in the field data to judge
the performance of the model under transport/dispersion conditions associated
with the maximum or critical concentrations?
C. Are the field data completely independent of the model development
data?
D. Where off-site data are used, is the situation sufficiently analogous
to the application to justify the use of the data in the model performance
evaluation?
E. Will enough data be available to allow calculation of the various
performance measures defined in the protocol? Will sufficient data be avail-
able to reasonably expect that the performance of the model relative to the
reference model or to site-specific criteria can be established?
IV. Is the Model Acceptable
A. Was execution of the performance protocol carried out as planned?
B. Is the model acceptable considering the results of the performance
evaluation and the technical evaluation?
C. Does the result of the model evaluation make good common sense?
APPENDIX B
CALCULATION OF PERFORMANCE MEASURES AND RELATED PARAMETERS
This appendix presents methods of calculation of performance
measures and related parameters and procedures for applying and inter-
preting statistical tests of model performance. The parameters and
tests recommended follow the results of the AMS Workshop on Dispersion
Model Performance (Fox 1980). A summary of this Workshop appears as
Appendix C.
Two concerns of Workshop participants were that air quality data
are often not normally distributed and that sequential values of meteoro-
logical and air quality parameters are not independent of one another.
This latter concern results from persistence of meteorological events.
These two concerns are not directly addressed in this appendix
since both have been identified as areas needing research. In the
majority of calculations and procedures discussed in this appendix,
methods are given both for situations in which data follow a normal
distribution and for situations for which the normal distribution does
not apply.
In evaluation studies using large data sets, some randomized data
selection to form a subset of independent data can answer the second
concern.
B.1 Definition of Residuals
This subsection discusses the calculation of residuals described in the
report. The first type of residual discussed in the report was the difference
between observed and estimated concentrations paired in time and space
and covering a range of observed concentrations from some small cutoff
value to the highest observed. The second type discussed included
differences between observed and estimated concentrations paired in
various ways in time and space. The data set for this second type of
residual includes at most the upper five percent or upper 25 observed
concentrations, whichever is greater.
B.1.1 Residuals Covering a Wide Range of Observed Values
Air quality model performance evaluation is primarily based
on analysis of the differences between observed and estimated concen-
trations. The primary parameter for this analysis is the model residual,
d, defined as

    d(l,t) = C_o(l,t) - C_p(l,t)                                          (B.1)

where d(l,t) is the model residual at location l and time t, Co is the
observed concentration, and Cp is the predicted concentration.
To avoid possible misinterpretation one should note that the
residual, d, measures the amount of under-estimation by the model.
Although subsequent statistical analysis of residuals is
most valid when performed on d (l,t) as defined in Equation (B.I), there
are situations when source strengths may vary significantly over the
period of record for which the model will be evaluated. In these cases
it may be more meaningful to define a model residual prorated to the
source strength. In this case the prorated residual, dq (l,t), is
defined as
    d_q(l,t) = d(l,t) \cdot Q_o / Q(t)                                    (B.2)

where Qo is the nominal constant source strength used as the base
for prorating the residuals and Q(t) is the actual source strength during
the period t. It is difficult to specify how much variation in source
strength may be significant, but a variation of ±25% about the mean is
suggested. The
information derived from model performance evaluation must often be
communicated to persons with nontechnical backgrounds. Such persons may
not know if a 10 ppm average underestimation of carbon monoxide is
better or worse than a 0.10 ppm average error in SO2. Therefore, for
ease of communication, we suggest that analysis results include the
analysis of the relative residual, dr(l,t), defined as

    d_r(l,t) = \frac{d(l,t)}{C_o(l,t)} \times 100                         (B.3)

The behavior of the relative residual causes some statistical problems
and, therefore, it should not be used as a basis for making decisions, but
only for communication of results.
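The three residual definitions might be computed as in the sketch below. The observed and estimated values are those of Station 1 in Table B.2; the source strengths are invented for illustration.

# Sketch of the residual definitions (B.1)-(B.3).  Arrays are aligned so that element i holds
# the observed value, model estimate and source strength for the same location and time period.
import numpy as np

Co = np.array([1.02, 1.14, 1.25, 1.36])        # observed concentrations (ppm), Table B.2, Station 1
Cp = np.array([1.07, 0.85, 1.05, 1.11])        # model estimates (ppm), Table B.2, Station 1
Q  = np.array([95.0, 100.0, 110.0, 105.0])     # actual source strength each period (illustrative)
Qo = 100.0                                      # nominal constant source strength (illustrative)

d  = Co - Cp                                    # residual, equation (B.1)
dq = d * Qo / Q                                 # residual prorated to source strength, (B.2)
dr = 100.0 * d / Co                             # relative residual in percent, (B.3)

print("d :", d.round(3))
print("dq:", dq.round(3))
print("dr:", dr.round(1), "%")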
B.1.2 Residuals for Peak Concentration
Important peak concentrations are specified in regulations. The
residuals which measure the accuracy of prediction of these peak concentra-
tions are determined by comparing observed and predicted concentrations paired
in various ways in space and time. Table B.1 shows the residuals which
measure the accuracy of estimation of the peak. The symbols are interpreted
Table B.1 Residuals to Measure Accuracy of Peak Prediction

Paired in                   Residual Set
Space and Time              Dn(Ln,Tn) = Co(Ln,Tn) - Cp(Ln,Tn)
Space, not Time             Dn(Ln,T)  = Co(Ln,Tn) - Cp(Ln,Tj)
Time, not Space             Dn(L,Tn)  = Co(Ln,Tn) - Cp(Lj,Tn)
Unpaired                    Dn(L,T)   = Co(Ln,Tn) - Cp(Lp,Tm)

Ln = monitor site of the nth highest observed concentration.
Tn = time of the nth highest observed concentration.
Tj = time of the nth highest estimated concentration at site Ln.
Lj = site of the nth highest estimated concentration during time Tn.
Co(Ln,Tn) = nth highest observed concentration.
Cp(Lp,Tm) = nth highest estimated concentration.
(Lp,Tm) = site and time of the nth highest estimated concentration
          (generally, Lp ≠ Ln and Tm ≠ Tn).
as follows: The subscript, n, on D indicates the rank order of the
observation, i.e., D2 is the residual for the second highest prediction
and D19 that for the 19th highest. The parameters in parentheses indicate
the degree of pairing in time and space. L indicates a station location;
the subscript n indicates the station of the nth highest observation.
T indicates an observation period; the subscript n indicates the time
period of the nth highest observation.
Table B.2 presents example values of observed and estimated
concentrations for three stations and four measurement periods. We want
to determine the accuracy of prediction of the second-highest concen-
tration (n = 2). The second-highest observed concentration (Co = 1.25)
occurs at Station 1 (L2 = Station 1) during time period three (T2 = Period 3).
Table B.2 Example Observed and Estimated Concentrations of SO2 (ppm)

Period      Station 1         Station 2         Station 3
            Co      Cp        Co      Cp        Co      Cp
  1         1.02    1.07      1.01    0.82      0.96    0.87
  2         1.14    0.85      1.03    0.96      1.02    0.94
  3         1.25    1.05      1.07    1.03      1.13    1.04
  4         1.36    1.11      1.23    1.10      1.22    1.09
Then D2 (L2, T2) = 1.25-1.05 = 0.20 ppm is the residual for
prediction of the second-high concentration paired in space and time.
The unpaired concentrations produce the residual
D2(L,T) = 1.25 - 1.10 = 0.15 ppm.
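The example can be reproduced directly, as in the following sketch; the tabular layout is an assumption of convenience and the numbers are those of Table B.2.

# Sketch: residuals for the second-highest concentration (n = 2) from Table B.2.
import pandas as pd

co = pd.DataFrame({"Station 1": [1.02, 1.14, 1.25, 1.36],
                   "Station 2": [1.01, 1.03, 1.07, 1.23],
                   "Station 3": [0.96, 1.02, 1.13, 1.22]}, index=[1, 2, 3, 4])
cp = pd.DataFrame({"Station 1": [1.07, 0.85, 1.05, 1.11],
                   "Station 2": [0.82, 0.96, 1.03, 1.10],
                   "Station 3": [0.87, 0.94, 1.04, 1.09]}, index=[1, 2, 3, 4])

n = 2
# Site and time of the nth highest observed concentration.
co_sorted = co.stack().sort_values(ascending=False)
(period_n, station_n) = co_sorted.index[n - 1]
co_n = co_sorted.iloc[n - 1]

# Paired in space and time: compare against the estimate at the same site and period.
D_paired = co_n - cp.loc[period_n, station_n]
# Unpaired: compare against the nth highest estimate at any site and any period.
D_unpaired = co_n - cp.stack().sort_values(ascending=False).iloc[n - 1]

print(f"D2 paired in space and time = {D_paired:.2f} ppm")   # 1.25 - 1.05 = 0.20
print(f"D2 unpaired                 = {D_unpaired:.2f} ppm") # 1.25 - 1.10 = 0.15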
B.2 Analysis of Bias and Gross Error
Determining the bias and gross error of model predictions involves
analyzing the distribution of the residual, d, and/or simple functions
of d such as the absolute value or the square.
B.2.1 Bias of Model Predictions
The bias of model predictions is measured by the mean value,

    \bar{d} = \frac{1}{N} \sum_{i=1}^{N} d_i                              (B.4)
and is the statistical first moment of the distribution of the residual.
The number of observations, N, extends over the range of concentrations
or over the meteorological conditions of interest. For first order
modeling objectives, N can extend over the upper 5% or upper 25 obser-
vations.
The 95% confidence limits about the mean are calculated
using Student's "t" distribution. The value of t(0.025, N-1) is found
from any standard table of Student's "t" (e.g., Selby, 1972, p. 617).
The value 0.025 is the probability that a value will be greater than
the upper limit of the 95% confidence bound. The value N-1 is the number
of degrees of freedom for the variable "t." The true value of the
mean, μd, is then bounded by

    d̄ - t(0.025, N-1) Sd/√N  <  μd  <  d̄ + t(0.025, N-1) Sd/√N,

where Sd is the sample standard deviation discussed in the next section.
If N is sufficiently large (> 100 or so) the value for t is 1.96.
-------
This method of calculating the confidence limits assumes
that the distribution of the values of d is normal. This assumption is
more nearly satisfied with a large number of observations, N.
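As a sketch of the calculation (assuming scipy is available to supply the
Student's "t" table), the confidence limits on the bias might be computed as
follows; the function and variable names are illustrative only.

    import numpy as np
    from scipy import stats

    def bias_confidence_limits(d, alpha=0.05):
        """95% confidence limits on the mean residual (Equation B.4 and text)."""
        d = np.asarray(d, dtype=float)
        n = d.size
        d_bar = d.mean()
        s_d = d.std(ddof=1)                                # sample standard deviation
        t_crit = stats.t.ppf(1.0 - alpha / 2.0, df=n - 1)  # t(0.025, N-1) for alpha = 0.05
        half_width = t_crit * s_d / np.sqrt(n)
        return d_bar - half_width, d_bar + half_width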
B.2.1.1 Confidence limits for nonnormally distributed variables.
Javitz and Ruff (1979) discuss a procedure for calcu-
lating the confidence limits about the mean value of a nonnormally
distributed variable. The method uses the results of the central limit
theorem, which states that the sample means of any variable are normally
distributed if the size of the sample is large enough. The method is:
Step 1: Subdivide the data set into five data subsets
so that each subset contains data from every fifth time period. The
first subset would contain data from time periods 1, 6, 11, etc. The
second subset would contain data from time periods 2, 7, 12, etc. If
the full data set has a periodicity of five time periods (e.g., average
daily values for weekdays only), then the data should be divided into a
different number of subsets.
Step 2: Compute the average value of the desired
parameter, e.g., d, for each subset. These are labeled d̄1, d̄2, ..., d̄k,
where k is the number of subsets. The value d̄1 is then the average
value of d for subset 1.
Step 3: Compute the sample standard deviation of the
subset means:

    S = { [1/(k-1)] Σ (d̄i - d̄)² }^(1/2),   i = 1, ..., k,

where d̄ is the mean value for the whole data set (Equation B.4).
-------
Step 4: The 95% confidence interval for the true
value of the mean, d̄, is given by:

    d̄ ± t(0.975, k-1) S/√k,

where t(0.975, k-1) is the upper 97.5 percent critical value of the Student's
"t" distribution with k-1 (e.g., four) degrees of freedom.
B.2.2 Model Precision
Model precision, also known as gross error, refers to the
average amount by which estimated and observed concentrations differ as
measured by residuals with no algebraic sign. While large positive and
negative residuals can cancel when measuring model bias, the unsigned
residuals comprising the precision measures do not cancel and thus
provide estimation of the error scatter about some reference point.
This reference point can be the mean error or the desired value of zero.
Two types of precision measure are the noise, which delineates the error
scatter about the mean error, and the gross variability, which delineates
the error scatter about zero error.
The performance measure for noise is either the variance of
the residuals or the standard deviation of the residuals. The standard
deviation, Sd, is the square root of the variance, where

    Sd² = [1/(N-1)] Σ (d(i) - d̄)²                                        (B.5)

is the variance of the sample of the residuals and N the number of
observations.
-------
The performance measure for gross variability is the mean
square error, or the root-mean-square-error. The mean square error is
defined by

    MSEd = (1/N) Σ d(i)²                                                 (B.6)

The bias, noise and gross variability are related by

    MSEd = [(N-1)/N] Sd² + (d̄)²                                          (B.7)

An alternate performance measure for the gross variability is the mean
absolute residual, defined by

    (1/N) Σ |d(i)|                                                       (B.8)
The mean absolute residual is statistically more robust than the root-
mean-square-error; that is, it is less affected by removal of a few
extreme values. The confidence limits on the variance are calculated
using the chi-squared distribution in the following manner.
1. Choose values of χ²(0.025, N-1) and χ²(0.975, N-1) from
standard χ² tables (e.g., CRC Standard Mathematical Tables, 20th Edition,
p. 619), where N-1 is the number of degrees of freedom.
2. The confidence limits are given by

    (N-1)Sd²/χ²(0.975, N-1)  <  σ²  <  (N-1)Sd²/χ²(0.025, N-1),

where σ² is the true variance.
-------
The confidence limits on the average absolute residual are
calculated as outlined in Section B.2.1.
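The precision measures and the chi-squared limits on the variance can be
sketched as follows (scipy's chi-squared table replaces the printed one;
names are illustrative).

    import numpy as np
    from scipy import stats

    def precision_measures(d, alpha=0.05):
        """Noise, gross variability, and chi-squared confidence limits on the
        true variance of the residuals (Section B.2.2)."""
        d = np.asarray(d, dtype=float)
        n = d.size
        var_d = d.var(ddof=1)                       # Sd^2, Equation B.5
        mse = np.mean(d ** 2)                       # Equation B.6
        mean_abs = np.mean(np.abs(d))               # Equation B.8
        lower = (n - 1) * var_d / stats.chi2.ppf(1.0 - alpha / 2.0, df=n - 1)
        upper = (n - 1) * var_d / stats.chi2.ppf(alpha / 2.0, df=n - 1)
        return var_d, mse, mean_abs, (lower, upper)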
B.3 Correlation Analysis
Correlation analysis consists of coupled space-time analysis,
spatial analysis and temporal analysis.
B.3.1 Space-time Analysis
Coupled space-time correlation analysis involves computing
the Pearson's correlation coefficient and parameters of the linear least
squares regression equation. For space-time analysis, observed and
estimated concentrations from all stations and time periods are used in
the calculations. The overall Pearson's correlation coefficient, r, is
defined by

    r = Σ(Co - C̄o)(Cp - C̄p) / [Σ(Co - C̄o)² Σ(Cp - C̄p)²]^(1/2),           (B.9)

where the sums extend over all station-period pairs and the overbars denote
the mean observed and estimated concentrations.
-------
B.3.2 Spatial Correlation Analysis
Spatial correlation analysis involves calculating the spatial
correlation coefficient and presenting isopleth analyses of the estimated
and observed concentrations for particular periods of interest. The
spatial coefficient measures the degree of spatial alignment between the
estimated and observed concentrations. The method of calculation
essentially involves calculating the Pearson's correlation coefficient
for each time period and determining an average over all time periods.
The specifics of the method are:
Calculate the Pearson's correlation coefficient, rt, for
each averaging period, t, from Equation B.9. Change the variable to φt
for each time period from

    φt = (1/2) ln[(1 + rt)/(1 - rt)]                                     (B.12)

Calculate the mean value, φ̄t, by averaging over the number of time periods.
Estimate the average spatial correlation coefficient from

    r̄t = [exp(2φ̄t) - 1]/[exp(2φ̄t) + 1]                                   (B.13)

The 95% confidence limits about the estimate of the spatial
correlation coefficient are calculated from

    Limits = tanh( φ̄t ± 1.96/√[Np(Ns - 3)] )                             (B.14)

where Np is the number of averaging periods and Ns the number of stations
(Co, Cp pairs) in each period.
-------
Estimates of the spatial correlation coefficient for single source
models are most reliable for calculations based on data intensive tracer
networks. Isopleths of the distributions of estimated and observed
concentrations for periods of interest should be presented and discussed.
B.3.3 Temporal Analysis
Temporal correlation analysis involves calculating the
temporal correlation coefficient and presenting time series of observed
and estimated concentrations or of the model residual for each monitor-
ing location. The temporal correlation coefficient measures the degree
of temporal alignment between observed and estimated concentrations.
The method of calculation is similar to that for the spatial correlation
coefficient. Calculate the Pearson's correlation coefficient, rl, for
each monitoring location, l, from Equation B.9. Change the variable to φl
for each monitor location using

    φl = (1/2) ln[(1 + rl)/(1 - rl)]                                     (B.15)

Average over the number of monitor locations to produce the value φ̄l.
Estimate the average temporal correlation coefficient r̄l from

    r̄l = [exp(2φ̄l) - 1]/[exp(2φ̄l) + 1]                                   (B.16)
-------
The 95% confidence limits about the mean temporal correlation coefficient
are calculated from

    Limits = tanh( φ̄l ± 1.96/√[Nl(M - 3)] )                              (B.17)

where Nl = the number of monitoring locations and
      M  = the number of Co, Cp data pairs for each monitoring location
           (the number of time periods).
Time series of Co and Cp or of model residuals should be presented and
discussed for each monitoring location.
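A sketch of the temporal calculation in Python; the spatial coefficient is
obtained the same way with the roles of monitoring locations and time periods
exchanged. The array layout and names are assumptions for the example.

    import numpy as np

    def temporal_correlation(co, cp):
        """Average temporal correlation coefficient and 95% limits (B.15-B.17).
        co, cp: arrays of shape (n_periods, n_stations)."""
        n_periods, n_stations = co.shape
        # Pearson's r at each monitoring location (Equation B.9).
        r = np.array([np.corrcoef(co[:, l], cp[:, l])[0, 1]
                      for l in range(n_stations)])
        phi = 0.5 * np.log((1.0 + r) / (1.0 - r))                    # Equation B.15
        phi_bar = phi.mean()
        r_bar = np.tanh(phi_bar)                                     # Equation B.16 (tanh form)
        half_width = 1.96 / np.sqrt(n_stations * (n_periods - 3))    # Equation B.17
        return r_bar, (np.tanh(phi_bar - half_width), np.tanh(phi_bar + half_width))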
B.4 Statistical Tests
This section discusses the use of the statistical tests of hypotheses
mentioned in the body of the report. A general discussion of the concept
of statistical hypothesis testing can be found in any statistical text
(e.g., Panofsky and Brier, 1965, Chapter III).
B.4.1 Comparison of Cumulative Distribution Functions
Comparison of the cumulative distribution functions involves
constructing quantile-quantile (Q-Q) plots and testing for statistically
significant differences between the distributions. Karl (1978) presents
examples of the technique applied to ozone measurements in St. Louis.
The techniques discussed here can be used to analyze differences between
-------
distributions of Co and Cp at a given station or distributions of the
residual, d, for reference and candidate models. Ogives of the cumu-
lative distributions of the two parameters to be compared are first
plotted as in the example in Figure B.1. The Q-Q plot for data such as
those shown in the example simply consists of plotting values of one
parameter at given cumulative frequency percentages against the values
of the second parameter for the same cumulative frequency percentages, as
shown in the example in Figure B.2. If the two distributions are
identical, then the points will fall along the straight line with slope
equal to one.
Q-Q plots are useful in detecting differences in distri-
butions in data sets. The plots do not require any assumptions regard-
ing the form of the distributions of the two data sets, and the statisti-
cal significance of any differences can be determined by non-parametric
methods. The Wilcoxon matched-pair, signed-rank test is used to test
the null hypothesis that there is no significant difference between the
two distributions. (See Panofsky and Brier, 1965, pp. 64-66 or Siegel,
1956 for a more complete discussion of the test.)
Step 1. Form the differences

    Δ(q) = X(q) - Y(q),

where Δ(q) = the difference between values at cumulative frequency
quantile, q,
-------
Figure B.1. Ogives for 16-h average O3 concentrations on
Sundays and workdays for the inner sites. (Karl, 1978)

Figure B.2. Quantile-quantile plots of day-of-the-week variations
of pollutant concentrations. (Karl, 1978)
-------
X(q) = value of parameter X at q
Y(q) = value of parameter Y at q
Step 2. The absolute values |Δ(q)| are ranked from
lowest to highest; the smallest value is given Rank = 1, the second
smallest Rank = 2, etc. The same average rank value is given to a number of
identical differences. Zero differences are excluded before the ranking.
Step 3. The algebraic sign of each Δ(q) is assigned to its
corresponding rank value.
Step 4. The test statistic, R, is calculated by adding the
rank values for the fewest cases of the same sign.
Step 5. Determine the critical value of the test statistic
by entering Table B.3 with the number of non-zero differences.
Step 6. If the absolute value of R is less than the critical
value of the test statistic, reject the null hypothesis and conclude
that the two distributions are significantly different.
If the absolute value of R is greater than the critical
value of the test statistic, do not reject the null hypothesis and
conclude that there is no significant difference between the distri-
butions.
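The six steps can be sketched in Python as below; scipy's built-in
stats.wilcoxon computes the closely related conventional statistic, so the
explicit version is shown only to mirror the steps. Input names are
illustrative.

    import numpy as np
    from scipy import stats

    def signed_rank_statistic(x_q, y_q):
        """Wilcoxon matched-pair, signed-rank statistic R of Section B.4.1:
        the sum of rank values for the less frequent sign of the differences."""
        delta = np.asarray(x_q, dtype=float) - np.asarray(y_q, dtype=float)  # Step 1
        delta = delta[delta != 0.0]                  # zero differences are excluded
        ranks = stats.rankdata(np.abs(delta))        # Step 2 (ties get the average rank)
        n_pos = np.count_nonzero(delta > 0)
        minority = (delta > 0) if n_pos <= delta.size - n_pos else (delta < 0)
        R = ranks[minority].sum()                    # Steps 3 and 4
        return R, delta.size                         # Steps 5-6: compare R with Table B.3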
B.4.2 Comparison of Means
Tests for comparison of two mean values test the null
hypothesis that the means are equal. The alternate hypothesis is that
-------
Table B.3. Critical values for 5% significance,
Wilcoxon matched-pair, signed-rank test.

Number of Non-Zero          Critical Value of Test
Differences                 Statistic, 5% Level
       6                              0
       7                              2
       8                              4
       9                              6
      10                              8
      11                             11
      12                             14
      13                             17
      14                             21
      15                             25
      16                             30
      17                             35
      18                             40
      19                             46
      20                             52
      25                             89
    > 25          M(M+1)/4 - 1.96 sqrt[M(M+1)(2M+1)/24]

Kreyszig (1970, p. 461)
-------
one mean is greater than the other. The two tests discussed in this
subsection are the Student's "t" test, a parametric test used when data
approximate a normal distribution, and the Wilcoxon-Mann-Whitney test, a
nonparametric test which does not assume any form for the distribution.
B.4.2.1 The Student's "t" Test
The Student's "t" test should be used to test the equality
of two sample means when the distributions approximate a normal distri-
bution. When the distributions are nonnormal, the test might still be
used, but will be less powerful (Till, 1974, p. 61). If the distri-
butions are known to be much different from normal, a nonparametric test
should be used (see Section B.4.2.2).
The procedure tests the null hypothesis that the two means
are equal. The alternate hypothesis is that one mean is greater than
the other. The alternate hypothesis results from inspection of the
sample values of the two means. For example, if we are testing for
differences between the mean residual of the candidate model, d̄_can, and
the mean residual of the reference model, d̄_ref, and inspection of the
values shows d̄_can > d̄_ref, then the alternate hypothesis would be:

    μ_d,can > μ_d,ref.
There are two possible cases for this test. Case A where
the variances are equal but unknown and Case B where the variances are
unequal and unknown.
-------
Case A: Variances unknown but equal. The F-test for
equality of variances is discussed in Section B.5. As stated above the
null hypothesis is:

    H0: μx = μy,

and the alternate hypothesis is:

    H1: μx > μy,

where μx and μy are determined by inspection of the two sample means.
Step 1: The critical value of "t" is determined from any
standard Student's "t" table (e.g., Selby, 1972, p. 617) at the 95%
confidence level and n_x + n_y - 2 degrees of freedom. The values n_x and
n_y are the number of observations for parameters X and Y respectively.
Step 2: Calculate the test statistic

    T = (X̄ - Ȳ) { n_x n_y (n_x + n_y - 2) / [(n_x + n_y)(n_x S_x² + n_y S_y²)] }^(1/2)   (B.18)

where X̄, Ȳ = means of parameters X and Y, and
      S_x², S_y² = variances of parameters X and Y.
Step 3: If the value of the test statistic T (step 2) is
less than the critical value of "t" (step 1), do not reject the null
hypothesis. If the value of T is greater than the critical value of
"t", reject the null hypothesis.
For further discussion see any standard statistics text
(e.g., Panofsky and Brier, 1965, pp. 58-64; Till, 1974, Section 4.3).
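A sketch of Case A; scipy's ttest_ind with equal variances gives the same
statistic, and the explicit form of Equation B.18 is included for comparison.
Names are illustrative.

    import numpy as np
    from scipy import stats

    def pooled_t(x, y):
        """Two-sample Student's t statistic for Case A (variances unknown but equal)."""
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        nx, ny = x.size, y.size
        sx2, sy2 = x.var(), y.var()                 # variances with 1/n, as in Equation B.18
        t = (x.mean() - y.mean()) * np.sqrt(
            nx * ny * (nx + ny - 2) / ((nx + ny) * (nx * sx2 + ny * sy2)))
        t_crit = stats.t.ppf(0.95, df=nx + ny - 2)  # one-sided critical value, Step 1
        return t, t_crit

    # Equivalent library call: stats.ttest_ind(x, y, equal_var=True)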
-------
Case B: Variances unknown and unequal. If the variances
can not be assumed equal, an approximate Student's "t" test is given by
Hoel (1971, p. 265). The null hypothesis and alternate hypothesis are
the same as in Case A.
Step 1: Calculate the sample variances S_x² and S_y² (Equation B.5).
Step 2: Calculate the number of degrees of freedom for "t":

    d.f. = (S_x²/n_x + S_y²/n_y)² / [ (S_x²/n_x)²/(n_x + 1) + (S_y²/n_y)²/(n_y + 1) ] - 2   (B.19)

If the number of degrees of freedom calculated in (B.19) is not an integer,
round to the nearest integer.
Step 3: Determine the critical value of "t" as in Case A,
Step 1.
Step 4: Calculate the value of the test statistic

    T = (X̄ - Ȳ) / (S_x²/n_x + S_y²/n_y)^(1/2)                            (B.20)
Step 5: Reject or do not reject the null hypothesis as in
Case A, Step 3.
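A sketch of Case B; note that scipy's unequal-variance option uses the
Welch-Satterthwaite degrees of freedom, which differ slightly from the
approximation of Equation B.19 reproduced here. Names are illustrative.

    import numpy as np
    from scipy import stats

    def approximate_t(x, y):
        """Approximate Student's t test for unknown, unequal variances (Case B)."""
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        nx, ny = x.size, y.size
        vx, vy = x.var(ddof=1), y.var(ddof=1)            # sample variances, Equation B.5
        t = (x.mean() - y.mean()) / np.sqrt(vx / nx + vy / ny)           # Equation B.20
        df = (vx / nx + vy / ny) ** 2 / (
            (vx / nx) ** 2 / (nx + 1) + (vy / ny) ** 2 / (ny + 1)) - 2   # Equation B.19
        t_crit = stats.t.ppf(0.95, df=round(df))         # Step 3: one-sided critical value
        return t, t_crit

    # Library alternative (Welch's form): stats.ttest_ind(x, y, equal_var=False)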
-------
B.4.2.2 The Wilcoxon-Mann-Whitney Test.
If the distribution of variables is known to be far from
normal, the Wilcoxon-Mann-Whitney test should be used. This test is
discussed in Section B.4.1.
B.5 Tests for the Equality of Variances
Tests for the comparison of two variances test the null hypothesis
that σ_x² = σ_y². The alternate hypothesis is that σ_x² > σ_y². The two
tests discussed in this subsection are the F-test, a parametric test
used when data closely follow a normal distribution, and a variation of
Student's "t" test used when data deviate from normality.
B.5.1 The F-test for Normally Distributed Variables
The F-test should be used to test the equality of two sample
variances when the distributions closely approximate normal distributions.
The procedure tests the null hypothesis that the two sample variances
are equal. The alternate hypothesis is that one variance is greater
'than the other. The alternate hypothesis results from inspection of the
sample values of the two variances. For example, if we are testing for
differences between the variance of the residual of the candidate model,
S²_d,can, and of the reference model, S²_d,ref, then the alternate
hypothesis would be

    σ²_d,can > σ²_d,ref.

For a general discussion of the test see Kreyszig (1970, Sec. 13.6) or
Hoel (1970, pp. 271-273).
-------
Step 1: From the sample results determine the value of the
larger sample variance, S_x², and the smaller sample variance, S_y².
Step 2: Determine the critical value of the parameter, F,
from any standard table of the F distribution (e.g., Selby, 1972, p.
620). The parameters for the table are: 95% confidence level, n_x - 1 =
degrees of freedom for the numerator of F (greater mean square) and n_y -
1 = degrees of freedom for the denominator (lesser mean square).
Step 3: Calculate the value of the test statistic

    F = S_x² / S_y².
Step 4. If the calculated value of the test statistic, F,
(Step 3) is greater than the critical value (Step 2), reject the null
hypothesis. If the calculated value of F is less than the critical
value, do not reject the null hypothesis.
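A sketch of the F-test; the convention of placing the larger variance in the
numerator follows Step 1, and the names are illustrative.

    import numpy as np
    from scipy import stats

    def variance_f_test(x, y, alpha=0.05):
        """F-test for the equality of two variances (Section B.5.1)."""
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        pairs = sorted([(x.var(ddof=1), x.size), (y.var(ddof=1), y.size)], reverse=True)
        (v_big, n_big), (v_small, n_small) = pairs           # Step 1
        f = v_big / v_small                                  # Step 3
        f_crit = stats.f.ppf(1.0 - alpha, dfn=n_big - 1, dfd=n_small - 1)   # Step 2
        return f, f_crit, f > f_crit                         # Step 4: True means reject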
B.5.2 Tests for Nonnormally Distributed Errors
The F-test can be shown to be sensitive to deviations from
normality. Kreyszig (1970, pp. 217-218) suggests the following:
Step 1: Compute the means of the new random variables

    |X_i - X̄|  and  |Y_i - Ȳ|,

that is, the mean absolute deviations of X and Y from their sample means.
-------
It can be shown that these mean absolute deviations are proportional to
σ_x and σ_y respectively.
Step 2: Test for the difference between the two mean
absolute deviations using the Student's "t" test as in Section B.4.2.1.
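A sketch of this procedure, reusing a two-sample "t" test on the absolute
deviations; names are illustrative.

    import numpy as np
    from scipy import stats

    def robust_variance_comparison(x, y):
        """Compare the variability of two nonnormal samples by applying Student's
        t test to the absolute deviations from the sample means (Section B.5.2)."""
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        abs_dev_x = np.abs(x - x.mean())      # Step 1: new random variables
        abs_dev_y = np.abs(y - y.mean())
        t_stat, p_value = stats.ttest_ind(abs_dev_x, abs_dev_y, equal_var=True)  # Step 2
        return t_stat, p_value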
-------
REFERENCES
Fox, D. G., 1980. Judging Air Quality Model Performance. American
Meteorological Society, 60 pp., draft.
Javitz, H. S. and R. E. Ruff, 1979. Evaluation of the Real-Time Air-
Quality Model Using the RAPS Data Base; Vol. II: Statistical Pro-
cedures, EPA-600/4-81-013b, U. S. Environmental Protection Agency,
Research Triangle Park, N. C. 27711.
Karl, T. R., 1978. Day of the Week Variations of Photochemical Pollutants
in the St. Louis Area, Atmos. Environ. 12, 1657-1667.
Kreyszig, E., 1970. Introductory Mathematical Statistics, Principles and
Methods, 468 pp, John Wiley & Sons, Inc., New York.
Panofsky, H. A. and G. W. Brier, 1965. Some Applications of Statistics
to Meteorology. 223 pp, Pennsylvania State University, University
Park, Pennsylvania.
Siegel, S., 1956. Nonparametric Statistics for the Behavioral Sciences,
McGraw-Hill, New York.
Selby, S. M., ed., 1972. CRC Standard Mathematical Tables, 20th edition,
Chemical Rubber Co., Cleveland, Ohio.
Till, R., 1974. Statistical Methods for the Earth Scientist, John Wiley
& Sons, New York.
-------
APPENDIX C
JUDGING AIR QUALITY MODEL PERFORMANCE
-REVIEW OF THE WOODS HOLE WORKSHOP-
Douglas G. Fox*
1. INTRODUCTION
Atmospheric dispersion models are
used to support laws and regulations aimed at
protecting the nation's air resources. For this
reason, models have become something more than
approximations of nature designed to provide a
scientifically reasonable connection between a
source and receptor of air pollutants. Courts
have interpreted them to be legally binding
mechanisms for negotiating levels of emission
control from sources. In view of this expanded
role, it has become particularly critical that
models be correct and be correctly applied. As
a result of this need, the U.S. Environmental
Protection Agency (EPA) has entered into a
cooperative agreement with the American Meteorol-
ogical Society to aid in the scientific and pro-
fessional development and application of atmos-
pheric dispersion models. The AMS is not in-
volved with the regulatory process; rather, our
actions are motivated by a desire to advance the
use of scientifically valid models.
*This DRAFT SUMMARY is prepared from a workshop
report currently under review. It is presented
in order to provide wide distribution of model
evaluation ideas in the hope of focusing dis-
cussion and encouraging work. Participants in
the workshop included: D. G. Fox†, USDA Forest
Service; Robert Bornstein, San Jose State U.;
Norman Bowne, TRC Env. Consultants; R. L. Dennis,
NCAR; Bruce Egan†, ERT; Steven Hanna†, NOAA;
Glenn Hilst, EPRI; Stuart Hunter, Princeton U.;
Michael Hills, Teknekron Research Inc.; Larry
Niemeyer, EPA; Hans Panofsky, Pennsylvania
State U.; Darryl Randerson†, NOAA; Philip Roth,
Systems Applications, Inc.; Ronald Ruff, SRI
International; Lloyd Schulman, ERT; Jack
Shreffler, EPA; Herschel K. Slater, Consultant;
Joseph Tikvart, EPA; A. Venkatram, Ontario Min.
of the Environ., Canada; Jeffrey C. Weil,
Martin Marietta Corp.; and Fred D. White†, Chair-
man, AMS Steering Committee. Dr. Fox, represent-
ing the AMS/EPA Steering Group, was Chairman of
the Woods Hole Dispersion Model Workshop and is
Chief Meteorologist, USDA Forest Service, Rocky
Mountain Forest and Range Experiment Station,
240 West Prospect Street, Fort Collins, Colorado
80526.
†NOTE: AMS Steering Committee members
The problem of evaluating the per-
formance of models is among the most important
facing the modeling community. To this end,
the American Meteorological Society (AMS) con-
vened a small expert working group in September
1980 to discuss current practices in model eva-
luation, recommend model performance evaluation
measures and methods and, if possible, set stand-
ards for model performance. An additional task
was to discuss the need for further work in this
area.
2. PERFORMANCE MEASURES
Common ideas which evolved from the
workshop are listed below.
2.1 Using Models for Regulatory Decisions
Models are the most rational and
equitable means available to support the nation's
air quality goals. It is, however, the modeler's
responsibility to provide "decision makers" with
an estimate of the significance of model output.
Where possible some statement of confidence in
model results is recommended. For this to be
done in a scientific and professionally accept-
able manner it is necessary for the air quality
modeling community to implement statistical
performance evaluation.
2.2 Statistical Aspects of Air Quality Goals
Current law and regulations require
that models simulate the second highest 1-, 3-,
or 24-hour concentration likely to occur in a
year to evaluate short term (1-hour, 3-hour, or
24-hour) standards. Since there are 8760 hours
in a year, for a 1-hour standard this is equiva-
lent to predicting the 0.02 percentile (0.3 per-
centile for 24-hour averages). It is difficult
to predict such a rare event with any degree of
confidence. Participants recommended that models
should be compared against a more robust statis-
tic, such as the upper 2 to 5 percentile of
values (something like 10-50 values). The work-
shop recognized that political and technical
problems will need to be resolved before this
recommendation can be realized.
2.3 Scientific Evaluation
Statistical performance evaluation
cannot be used exclusively for determining the
acceptability or unacceptability of a model.
There are many scientific considerations which
can provide critical input to model evaluation.
-------
Not the least of these is the recognition that
the atmosphere is a stochastic system and as
such there are limits to its predictability.
Air quality models operating within this system
are limited in the degree of predictability they
can attain. This, in effect, provides a scienti-
fic limit to model accuracy. More effort should
be expended in the determination and communication
of such scientific limitations for particular
problems.
2.4 Data for Model Evaluation
Available data bases are not equal
to the task of model evaluation. Efforts such
as the EPRI plume model validation and the EPA
Complex Terrain studies receive a strong commen-
dation from the participants. However, since
good data are not usually available for evalu-
ating a model, procedures to utilize existing
data are needed. The value of statistical
methodologies depends upon such characteristics
of the data as independence and normality.
Recognizing that meteorological data in general
are not independent, time series analysis is an
appropriate tool which should be utilized.
Transformation of data to approach normality
should also be considered.
2.5 Specific Performance Measures
Performance can be measured in two
general ways, namely by comparing the magnitude
of differences between observations and pre-
dictions and comparing the correlation or asso-
ciation between observations and predictions.
The magnitude of differences can be ex-
pressed in terms of differences or discrepancies,
defined as

    d = Co - Cp,

where Co is an observed concentration and Cp is
a predicted concentration. Three measures of
magnitude difference are of importance:
(1) estimated bias (average) of the differences,

    d̄ = (1/N) Σ d,

where N is the number of observations;
(2) noise (variance) of the differences,

    S² = [1/(N-1)] Σ (d - d̄)²;

(3) gross variability of the differences, either
as the average absolute gross error, (1/N) Σ |d|,
or as the RMS error, [(1/N) Σ d²]^(1/2).
The RMS error is related to the variance and
bias since

    RMS² = [(N-1)/N] S² + (d̄)².
Workshop participants recommended that these
measures be applied to
(A) Total fields of differences, for all N
observation-prediction pairs;
(B) Selectively paired maximum values of the
differences; for example, where L(n), T(n) are
the coordinates for the highest (n=1), second high-
est (n=2), etc. concentrations, then

    D(n) = Co(L(n), T(n)) - Cp(L(n), T(n))

paired in both space and time;

    D(n) = Co(L(n), T(n)) - Cp(Lj, T(n))

paired in time, but Lj is the location of maximum
predicted concentration at T(n);

    D(n) = Co(L(n), T(n)) - Cp(L(n), Tj)

paired in space, but Tj is the time of maximum
predicted concentration at location L(n); and

    D(n) = Co - Cp

unpaired, since Co and Cp are simply the maximum
observed and predicted values without regard to
time or space.
(C) Totally unpaired comparisons of frequency
distributions of observations with frequency
distributions of predictions can be conducted
by using statistical methods of comparison for
the bias (t, z, or Wilcoxon/Mann-Whitney statistics),
for the variance (F or χ² statistics) and for
the gross variability (χ² or Kolmogorov-Smirnov
statistics).
Correlation can be measured by con-
sidering the data paired for the entire field
as discussed in (A) above and selectively paired
for maximum values as discussed in (B) above.
Correlation is measured by the correlation co-
efficient r, defined as

    r = Σ(Co - C̄o)(Cp - C̄p) / [Σ(Co - C̄o)² Σ(Cp - C̄p)²]^(1/2),

where the overbar is as defined above. Three
specific ways of considering correlation were
suggested:
(1) Temporal: r(ΔT), the cross corre-
lation coefficient, where the C's are paired
at a particular location in space as in (A)
above and separately as in (B) above. All time
lags ΔT between observations and predictions
can be considered.
(2) Spatial: the spatial corre-
lation coefficient, using C's paired at parti-
cular times for the entire field as in (A) above.
(3) Coupled: the correlation co-
efficient using C's for the entire field.
-------
Various other statistics are sug-
gested to compare, for peak values, the dis-
placement in time and space of observations from
predictions.
2.6 Qualifications on Performance Measures
The measures suggested will likely
prove most meaningful when applied to large data
sets more typical of urban problems and tracer
studies conducted for limited time periods.
Difficulties in measuring the performance of
point source models exist because the concentra-
tion pattern resulting from such a source has
very sharp gradients. Generally, the peak con-
centration which is routinely calculated (center
line value) is not measured. Special considera-
tions recommended for the point source problem,
therefore, include data preprocessing for wind
direction and possibly stability. Further, it
should be realized that as a community we have
only very limited experience with many of the
measures of performance. It will, therefore,
require some time to realize the significance of
them.
3. PERFORMANCE STANDARDS
Participants agreed that it was
unrealistic to attempt to establish standards
at this time. There is an overriding concern
that criteria for setting standards are not
available. Just how accurate must a model be
for the various regulatory applications?
Secondly, there is a conspicuous lack of ex-
perience with existing models tested against
performance measures such as we recommend.
Finally, data bases of high enough quality to be
capable of discriminating between performance
of various models are not abundant. More good
data must be collected.
In spite of these concerns the
participants did develop two recommendations
related to judging air quality models.
3.1 Statistical Inference Testing
Statistical tests such as the
Student's t for means and the χ² for variances
can be utilized to establish confidence intervals
about the calculated values of performance
measures. This allows a quantitative indication
of a model's validity. It was, however, recog-
nized that often such statistical testing is of
very limited value because it is based upon
close adherence of the data to theoretical dis-
tributions. The point is that such tests may
suggest that model results are less believable
than in fact they should be.
3.2 Develop Performance Profiles Referenced
Against EPA Guideline Models
The performance of models recom-
mended in the Guideline could provide a refer-
ence value for comparison of other models. The
reference concept, however, to some participants,
implied that the Guideline Models are "good
enough" while in fact they may not be. For this
reason it was suggested that the reference be
considered like 0°F, namely an arbitrary number
representing nothing physically significant,
but one against which other temperatures can be
quantified. At any rate it seems appropriate to
develop profiles of performance for models by
comparing performance against what are currently
accepted regulatory procedures.
4. RESEARCH NEEDS
The workshop participants recom-
mended five specific areas in which research is
needed. They are (1) development and refinement
of performance measures; (2) application of per-
formance measures to (especially) point source
models; (3) analysis of the characteristics of
meteorological data; (4) analysis of the char-
acteristics of air quality data; and (5) the
evaluation of diffusion models. In addition to
these specific tasks it was reiterated that
much better data are needed. Data collection
with special field programs, for example, can be
quite expensive. The cost, however, is small
compared to the amount of money expended on
pollution emission controls and, therefore, on
implications resulting from the applications of
models. It is possible that this data will show
how poorly we are able to predict concentrations.
They may result in a major new round of re-
search into the fundamental physics and chemistry
of the atmosphere.
5. CONCLUSIONS
How shall we judge the performance
of air quality simulation models? The AMS/EPA
Woods Hole Workshop was convened in part to fo-
cus the attention of the professional air quality
modeling community on this important task. Al-
though the workshop may raise more questions than
it answers, a few ideas have emerged which are
described in this short summary.
A set of statistics which can pro-
vide a rational framework for quantitatively
evaluating the nature of differences between ob-
servations and predictions by models is pro-
posed. Statistics are suggested as a tool to
provide confidence in model predictions as well
as to compare new models against those models re-
commended by EPA in their "Guidelines". The
task is not complete. We have only extremely
limited experience with these measures of perfor-
mance. A recommendation is, therefore, to test
models using the framework suggested in this
paper and detailed in our forthcoming report.
Only through such experiences will it be possible
to learn the most appropriate procedures for eva-
luating models.
It was a strong feeling of the parti-
cipants that statistical measures alone may not
be sufficient to judge between models. Scientif-
ic evaluations based upon accepted laws of
physics will always provide a good basis for
critiquing models.
Finally, it was unanimously agreed
that data on which to evaluate models are lacking.
The collection of good data must remain a high
priority activity for air pollution modelers.
-------
UNITED STATES ENVIRONMENTAL PROTECTION AGENCY
Office of Air Quality Planning and Standards
Research Triangle Park, North Carolina 27711
DATE: 7/30/81
SUBJECT: Role of Models in Regulatory Decision-Making
FROM: Joseph A. Tikvart, Chief
Source Receptor Analysis Branch (MD-14)
TO: Chief, Air Programs Branch, Regions I - X
As you are aware, OAQPS sponsored a workshop on the role of atmos-
pheric models in regulatory decision-making. The workshop was held in
May 1981 at Airlie House. A summary report has been prepared and dis-
tributed which will serve as the focal point for the modeling conference.
We have previously communicated with you concerning both the report and
the conference.
Section 4 of the summary report (see attachment) provides recom-
mendations on actions that EPA can take to better reflect the uncer-
tainties of air quality model estimates in its regulatory decisions.
Many of these recommendations require further study, technical develop-
ment, coordination, and review of current policies. Some of the recom-
mendations, if implemented, could have a direct effect on Regional
Office and State procedures for SIP revisions and the review of new
sources, as well as resources required for these programs.
We anticipate that the summary report will be well received and
widely endorsed at the modeling conference. It suggests a flexibility
that many in the industrial and regulatory communities believe is neces-
sary to relieve the current regulatory climate w_hich is perceived to be
overly stringent. We want to seriously explore these recommendations'
and their ramifications.
The purpose of this memo is to solicit your views on the summary
report recommendations, in particular those that could directly affect
Regional Office and State programs. To this end, several subsections of
the attachment are marked for your attention. These subsections deal
with (1) planning meetings and criteria for model selection; (2) devel-
opment of protocol documents; (3) use of "arbitration panels," and (4)
more explicit consideration of model uncertainty in decisions. To what
extent are these issues factored into current Regional Office and State
programs? How would you implement the recommendations? What modifi-
cations to current programs would be required? What problems and
benefits would be created? What would be the effect on resources and
the timeliness of reviews?
Other portions of Section 4 provide observations and recommen-
dations concerning (1) screening, long-range transport and complex
terrain models; (2) performance evaluation of models; (3) design con-
centrations; (4) modifications to modeling guidelines; (5) a modeling
-------
"center" and (6) a quality assurance program. We also solicit any
comments you might have on these issues.
It would be appreciated if I could have your views on the workshop
recommendations, either verbally or in writing, by the end of August.
Please contact me if you have any questions.
Attachment
cc: R. Campbell
T. Helms
C. Hopper
R. Rhoads
R. Smith
B. Steigerwald
Modeling Contacts, Regions I - X
-------
Workshop Summary Report
ROLE OF ATMOSPHERIC MODELS IN REGULATORY
DECISION-MAKING
EPA Contract No. 68-01-5845
July 1981
Prepared for
Charlotte Hopper
Source Receptor Analysis Branch
Office of Air Quality Planning and Standards
U.S. Environmental .Protection Agency
Research Triangle Park, North Carolina 27711
Prepared by
C. Shepherd Burton
Systems Applications, Inc.
101 Lucas Valley Road
San Rafael, California 94903
-------
SUMMARY OF WORKSHOP FINDINGS AND RECOMMENDATIONS
4.1 OVERVIEW
In the course of independently addressing the four overall
questions, each workgroup was required to develop an approach
to its specific problem and the criteria by which issues and
needs could be identified and recommendations made. Their ef-
forts were separately documented through on-site reports writ-
ten and edited by the workgroup participants. These three
documents, which represent three separate reports, are provided
as appendixes to the final report. Workgroup I addressed the
four questions using the PSD permitting problem as a vehicle
for examining alternative answers; Workgroup II's problem fo-
cused on SIP revisions; and Workgroup III explored the workshop
questions in light of concerns over the transport of pollutants
across political boundaries.
The principal findings and recommendations of the individ-
ual work group reports are integrated and presented in this
closing section of the workshop summary report. Attention is
given to the needs identified by the workgroups and to the pro-
cedural or process changes recommended by the groups for the con-
sideration or implementation of the EPA's Office of Air Quality
-------
Planning and Standards. Particular emphasis is given to needs
and recommendations for
(1) Assuring the wide acceptance by interested parties
and by the public of modeling in air quality management.
(2) Identifying the factors affecting the needed balance
between standardization, consistency, and flexibility in model
selection and application and in the interpretation and presen-
tation of model results in AQM decisions.
In keeping with the basic workshop structure and the
structure of the reports, the summary findings and recommenda-
tions are organized according to the four general workshop
questions. At the end of this section, some additional propos-
als are advanced and some concepts that were recommended by the
workgroups are extended. Although some of these proposals were
not explicitly mentioned by any workgroup, they appear to be
consistent with the needs and recommendations provided in the
workgroup reports.
In light of the broad cross section of skills and inter-
ests of the participants, it is worth noting the harmony
within, and among, the groups concerning the needs identified
and the recommendations presented. This harmony is reflected
in both the specifics and the spirit of workgroup findings and
recommendations. Although at the outset of the workshop, par-
ticipants were informed that a consensus view was not sought,
it appears that by at least one measure, the level of
harmony, consensus was achieved.
4.2 CRITERIA PERTINENT TO THE APPROPRIATE SELECTION OF AN AIR
QUALITY MODEL
Whether the model is intended for use in a PSD permitting
effort, a revision of a SIP, or in policy setting, all work-
groups endorsed the concept of early, open, and cooperative
participation in model selection by all affected and interested
parties. A model selected in this manner, using the additional
criteria presented next, is likely to be supported and accepted
by not only the regulatory agencies and industry being
regulated, but also by interested labor, civic, and environmen-
tal groups.
Other selection criteria recommended by the workgroups in-
cluded
-------
(1) A suitable match between (a) the technical attributes
and capabilities of the selected model, (b) the operational re-
quirements of the selected model, and (c) the air quality
issue(s) of concern.
(2) A suitable means for estimating, evaluating, or exam-
ining the uncertainty associated with model predictions.
(3) A means for addressing and satisfying consistency re-
quirements and concerns of equity with respect to prior use of
the same or similar models and similar air quality issues.
Throughout the model selection process, resource con-
straints were also acknowledged by the workgroups as warranting
consideration. However, the groups recommended that such con-
siderations be given after one or more modeling alternatives
has met the other (technical) criteria. That is, apparently
the resource constraint criteria should principally serve to
distinguish between technically acceptable alternatives.
Within each of these criteria, the workgroups identified
many detailed criteria for consideration in model selection,
including (1) the spatial and temporal scales of the problem,
(2) defensible treatments of recognized important physical and
chemical atmospheric processes, (3) the spatial and temporal
representativeness of the meteorological data and record, (4)
emissions from individual and interacting sources, including
variability, (5) the match between the data needs of the model
and the availability of input data, (6) the compatibility of
the model's output(s) with the requirements of the ambient
standard, increment, or air quality goal, (7) documentation of
the model's algorithms, computing requirements, computer
software, test/example cases, and prior evaluation activities
and regulatory applications, and (8) simplicity, adaptability,
and flexibility in its transferability to different geograph-
ical settings, emission source configurations, and (possibly)
political boundaries. Appendixes to the final report provide
additional selection criteria and corresponding discussions
about each.
A special mention is in order regarding the selection of
screening models and long-range transport models in PSD new
source reviews. It was noted by Workgroup I that though
screening models may originally appear to ease and simplify the
permitting process, they subsequently could cause complications
involving equity issues, including the following:
(1) Premature determination of increment consumption,
which in turn can elicit from potential industrial developers a
-------
variety of responses that may complicate subsequent permitting
actions.
(2) Predatory actions by industrial developers, including
attempts to bank the increment and tactics to discourage acqui-
sition of adjacent development sites, and so on.
(3) Distortions in the time phasing of industrial develop-
ment to ensure being one of the first developers of a region.
(4) Inequities in Best Available Control Technology
(BACT) determinations.
These concerns will be of particular importance in areas of
concentrated development, such as the oil shale area. In one
respect, the use of screening models in areas of potential con-
centrated development can confuse air quality management deci-
sions and planning, since such models do not and cannot
(because they are not intended to) provide a reliable measure
of the consumption of the air resource. Answers to questions
about the ultimate potential development cannot be addressed by
either industrial developers or government policy-makers using
these models. Furthermore, the use of a multiplicity of
models (screening, guideline, and nonguideline) can cause even
greater complications. The requirement for consistency and
standardization in such situations appears paramount.
The requirement for consistency was noted to be important
in another instance. This circumstance involves the use of
long-range transport models, also in PSD new source review but
also, possibly, in SIP revision actions, assuming the transport
of pollutants across state boundaries is of concern. In se-
lecting such models, the principal consideration must be given
to the soundness of the scientific principles upon which they
are based and the implementation of those principles in the
model, since empirically based model evaluations are probably
several years away.
The balance between flexibility, standardization, and con-
sistency is particularly vexing in regions of complex terrain.
The likely wide variations in meteorological, topographical,
and source configurations, when combined with the absence of
generally accepted modeling approaches, suggests a strong need
for flexibility in selecting a modeling approach, especially in
PSD and SIP revision actions. However, in regions of concen-
trated development, standardization and consistency in model
selection are also required to reduce the potential for ine-
quities between sources and to reduce administrative burdens.
Workgroup I suggested that for a particular geographical
region, a requirement for performance evaluation of a proposed
-------
nonguideline model, using an available data base having suit-
able similarities to the impact assessment of interest, would
likely impose the necessary (regional) consistency and also al-
low the desired flexibility. Questions involving the charac-
terization and specification of such similarity criteria were
not addressed by Workgroup I (e.g., what these criteria
could/should be and who should identify and specify them).
4.3 CRITERIA PERTINENT TO THE APPROPRIATE USE OF AN AIR
QUALITY MODEL
The previous subsection addressed the criteria that
decision-makers and modelers should adopt in selecting from a
set of available models the single model or subset of models to
be used in a particular situation. This subsection focuses on
the workgroup-recommended principles that should structure the
process by which the model is agreed upon and set up, the input
data prepared, the runs made, the output formulated, and the
entire process documented.
All workgroups recommended that all models used in air
quality regulation undergo standardized performance evaluation
according to the Woods Hole recommendations. In those in-
stances where a bona fide dispute exists concerning the suit-
ability of a particular application of an EPA-recommended
model, an application-specific performance evaluation was re-
commended by Workgroup II as the preferred means of resolution.
Workgroup II recommended that efforts be made to develop mini-
mum acceptable levels of model performance (i.e., standards).
Furthermore, Workgroup II recommended that models be required
to meet some minimum level of performance prior to acceptance
for regulatory use; however, the workgroup also recognized that
an explicit level of performance cannot currently be specified.
All the workgroups recommended and endorsed the concept of
instituting a protocol concept in the use of models. Such a
protocol would be developed through open, cooperative meetings
between the regulator and other interested parties prior to the
use and application of a particular model or set of models.
The workgroups also noted that efforts should be made to iden-
tify in advance, to the greatest extent possible, the specifics
regarding modeling procedure, including as needed (a) model
performance evaluation methods (e.g., measures, standards, and
so on), (b) data sources, uses, and adjustments, (c) model
computation and parameter selection options, (d) receptor
selection, (e) model output formats, (f) interpretation and,
if necessary, adjustment of model results, (g) identification
of model limitations and biases, and (h) the examination of the
likely effects of model limitations and uncertainties on the
estimated impacts (e.g., through sensitivity analysis or Monte
-------
Carlo simulation). Much of the recommended information would
be available from user manuals and guidelines for the model so
that a protocol document need not be an extensive volume;
rather, it should be a substantive one.
The workgroups also recommended that during these meetings
every reasonable effort be made by all parties to identify po-
tential uncertainties and conflicts and to identify the means
to resolve them. Among the approaches suggested for resolving
conflicts were the establishment of technical review committees
composed of interested parties, or the designation of project
arbitrators. In both cases their judgment would be final. These and
other possible approaches to reconciling disputes or conflicts
are recommended for inclusion in the protocol document. The
information that can be used during the resolution of a
dispute, as well as limits governing its use, should also be
identified and specified in the protocol.
It appeared to the workgroups that a natural balance can
be found between flexibility and consistency with the institu-
tion and practice of a protocol concept. It was noted by
Workgroup II that when a high degree of flexibility is being
sought, all procedures should be agreed upon a priori by all
interested parties and that a suitable forum (ut sup.) should
be identified and established to resolve anticipated or unanti-
cipated issues.* Furthermore, this practice would facilitate
the recommendations of Workgroup III that any party or
decision-maker who uses a model and seeks to base a regulatory
decision on the output of that model is obligated to (a) pub-
licly document the input data and actual model used, and (b)
reproduce the methods/techniques employed in preparing the input
data and exercising the model.
4.4 INCORPORATING AIR QUALITY MODELING UNCERTAINTIES INTO THE
DECISION-MAKING PROCESS
All workgroups endorsed the need, and recommended that ap-
proaches be sought, to identify, quantify, reduce (if
possible), and incorporate the uncertainties associated with
air quality modeling into the regulatory and decision-making
process. Furthermore, all the workgroups recommended that
priority be given to (a) the identification and quantification
* A priori means here that procedures should be specified prior
to obtaining, or being able to infer, the final result of the
impact assessment.
-------
of uncertainties and (b) the incorporation of such uncertain-
ties into the regulatory process. The workgroups recognised
that the rate of reduction of modeling uncertainties,• through
model and data input improvements, generally occurs more slowly
than the rate at which such information is needed in the regu-
latory setting. In addition', such a prioritisation presumes
that technically sound modeling is being practiced (i.e., a
best available modeling approach is selected and properly
used). The workgroups also recommended that model research and
development be continued.
«
In recommending that modeling uncertainties be reflected
in regulatory decisions and in the exploration of alternative
control strategies, Workgroup III noted that it should be the
modeler's responsibility to the decision-maker to identify, de-
scribe (when possible), and quantify the sources of uncertainty
in each air quality analysis. In addition, it should be the
modeler's responsibility to express modeling results in a
manner that clearly communicates the uncertainty in an under-
standable and utilizable form to the decision-maker and the
decision-making process. Workgroup III also noted that it
should be the decision-maker's responsibility to become knowl-
edgeable concerning, and conversant with, results that express
and contain uncertainty. Determining how best to utilize the
results presented with their attendant uncertainty was noted by
Workgroup III to be the responsibility of the decision-maker.
Furthermore, for AQM and the regulatory process to ignore
modeling uncertainty and to continue to base decisions on best
estimate single-value measures, such as the high, second-high
concentrations, places an unduly heavy burden on modelers, who
essentially are being required to make, or are implicitly
making, policy decisions when they select models and choose
model inputs.
The workgroups recognized that modeling uncertainty can be
incorporated into the decision-making process by
(1) Developing procedures for quantifying uncertainty.
(2) Giving attention to the strengths and weaknesses of
modeling in fashioning the measures of achievement for air pol-
lution control programs.
(3) Explicitly describing the uncertainties (and their
likely implications, if known) that cannot be eliminated.
With respect to item (2), Workgroup I noted that uncer-
tainty could be reduced substantially and rather quickly if
measures other than the high, second-high were employed, or if
the high, second-high measure were augmented with additional
information available from a model. Another measure Workgroup
I identified was the 95th percentile value of the distribution
of ground-level concentrations. This workgroup noted, however,
that the selection of such a concentration value, if chosen to
be consistent with current practice, could raise equity issues
vis-a-vis individual or groups of sources, with the former
possibly leading to more favorable outcomes than the latter.
Workgroup II noted that an alternative approach to using the
high, second-high concentration value would be to calculate the
95th percentile concentration and then extrapolate the re-
sulting value to the percentile corresponding to the high,
second-high value.
Workgroup I also identified additional information that
currently available models could readily provide to decision-
makers, including the
(1) Number of times concentration values exceed 80 or 90
percent of the standard/increment.
(2) Average of the 10 highest concentration values at the
worst receptor.
(3) Episodic character of the highest concentration val-
ues (i.e., the extent to which such values are uniformly dis-
tributed throughout the year or are grouped together).
(4) Location and extent of the geographic area where
standards/increments are most likely threatened.
(5) Exposure or dosage estimates. Workgroup I, in sug-
gesting the use of such information by decision-makers, recog-
nized that significant changes in both the current values em-
bodied in the clean air legislation and the regulatory process
are necessary.
All workgroups recommended that the strengths and limita-
tions of models be examined in light of the need to incorporate
the resulting understanding into the design of a decision-
making process that
(1) Reduces the sensitivity of decisions to model
uncertainties.
(2) Seeks to manage the risk of incorrect decisions.
-------
One or more of the workgroups provided the following re-
commendations regarding the explicit incorporation of uncer-
tainties into decision making:
(1) Make uncertainties explicit, through the best avail-
able means, in all modeling-related decisions. As appropriate,
use data from site-specific performance evaluation studies, use
the understanding of departures from underlying model
assumptions, and use the results of sensitivity analyses. For
the immediate future, sensitivity analyses and Monte Carlo si-
mulations probably represent the only available approaches for
providing uncertainty estimates for medium- and long-range
transport models. In effect, the workgroups recommend provid-
ing decision-makers with estimates of error bars on model
estimates. It was noted that sensitivity analyses are likely
to provide lower estimates of uncertainty.
(2) Use confidence bounds (i.e., error limits), or prefer-
ably, probability distributions to express uncertainties.
(3) Continue the process already started with the ExEx
and Multi-Point Rollback (MPR) methods of incorporating proba-
bilistic concepts into the modeling framework and into conven-
ient and understandable formats for use by decision-makers.
Examples noted included using expected exceedance, violation
probability, and Type I and Type II error approaches.
(4) Develop structures and models for the decision pro-
cess itself to provide a basis for accommodating and analyzing
model uncertainty in the overall process, with its attendant
uncertainties. In effect, develop a mathematical framework for
decision analysis in air quality management.
Workgroup I also noted that additional flexibility is
probably needed in the decision process, especially with re-
spect to PSD permitting issues, to reflect the various
purposes/goals of the PSD provisions regarding air-quality-
related values.
4.5 INCORPORATING IMPROVEMENTS IN AIR QUALITY METHODS INTO
THE REGULATORY PROCESS
The previous subsections have focused on the selection and
use of models and the interpretation of model results and their
attendant uncertainties in a somewhat static regulatory
environment, one in which no explicit recognition is given to
either the evolutionary (e.g., through the introduction of new
dispersion coefficients) or revolutionary (e.g., through the
-------
introduction of visibility impairment models) nature of deve-
lopments in modeling methods.
The fourth question explored by the workshop recognized
that air quality modeling is a rapidly expanding, evolving, and
advancing field whose raison d'etre is to identify, address,
meet, and serve the needs of air quality managers, policy-
makers, and decision-makers. The growth of this field and its
potential contributions can be estimated by considering the in-
creased number of technical conferences, publications, organ-
izations that sponsor and perform research, and organizations
(including state and local agencies) that provide services in
air quality modeling and related activities. The approaches to
encouraging, controlling, and facilitating the embodiment of
the most suitable methods and data bases in the regulatory pro-
cess are still to be developed. The publication of air quality
modeling guidelines by the EPA represents an initial effort to-
ward attaining this goal. In this subsection, the recommenda-
tions of the workshop vis-a-vis the introduction of modeling
improvements into the regulatory process are provided.
All workgroups noted that the issue related to
consistency, standardization, and flexibility lay at the core
of this overall problem. All workgroups either explicitly or
implicitly recommended that consistency should be achieved by
selecting and using a new or modified approach rather than by
insisting that the same guideline or nonguideline model be used
for all circumstances.
Although each workgroup emphasized somewhat different
elements of the process for achieving this goal, all the groups
recommended that
(1) Improvements be made in the methods used to convey
changes in models, methodology, and processes to interested
participants.
(2) Consideration be given to the establishment of a
concept to provide for the centralization of certain modeling
activities and to provide some insulation of the technical
modeling tasks from the political decision-making process.
In addition, Workgroups I and III noted that such a center
would require extensive peer review and technical oversight, a
recommendation also made implicitly by Workgroup II.
The remainder of this subsection elaborates on the nature of
these recommendations.
Workgroup II recommended that the regular updating of
modeling guidelines constitutes the most reasonable means of
conveying changes. Recommendations were not advanced regarding
the frequency of updates, though the criteria recommended for
their selection were the significance and acceptability of the
change to the technical community. Workgroup II also recom-
mended that procedures be instituted to establish criteria for
change and to communicate methodologies, practices, and so on,
to the community of practitioners and other interested parties.
In addition, Workgroup II recommended considering the
possibility of adopting a regular schedule for revisions, even
if the announcement at the scheduled time were only that no
significant revisions were expected during the subsequent interval.
In making its recommendations for guideline revisions,
Workgroup II acknowledged the effect of proposed model changes
on past regulatory actions, along with the implications of
proposed model changes for future regulatory actions. Thus,
this workgroup recommended the establishment of a function
within the EPA for dealing with the implications of new methods
or practices. This function would address, in advance, the
policy, legal, and regulatory issues raised by any proposed
changes and would recommend methods for resolving such issues.
Workgroups II and III recommended that strong consideration be
given to grandfathering affected facilities, provided that past
modeling efforts had been carried out in good faith.
Recommendations varied regarding the scope and function of
the modeling center concept. The responsibilities of such a
function identified by one or more workgroups included
(1) Maintenance and updating of model codes.
(2) Maintenance of test data bases.
(3) Undertaking model performance evaluation studies and
archiving their results.
(4) Maintenance of a repository for all actions involving
nonguideline models.
(5) Maintenance of information concerning model application
results.
(6) Provision of certain defined services for selected
modeling studies, including third-party model exercise in some
cases and the exercise of models whose costs or technical
requirements, either in the form of expertise or hardware
demands, are extensive.
As noted, the modeling center would of necessity require
extensive peer review and technical oversight and, thus, the
workgroup recommended an advisory or review committee as the
preferred means of reaching consensus and according legitimacy
to proposed changes in modeling practices. Such a body would
be composed of both government and nongovernment representa-
tives having backgrounds in policy and technical areas. This
committee would periodically review proposed revisions to the
guidelines originating from, say, the modeling center. The
committee would also review and comment on the suitability of
new modeling techniques and advances in modeling practice.
4.6 SOME LOGICAL EXTENSIONS OF WORKGROUP REPORTS, AND FURTHER
RECOMMENDATIONS
The previous sections have attempted to report with
acceptable fidelity the recommendations of the workgroups. The
similarities and the lack of dissimilarities among the
recommendations, and their possible implications for additional
recommendations, can be clearly noted. This subsection provides
comments and recommendations resulting from the efforts to
integrate the conclusions of all the workgroups.
First, dissimilarities among the workgroups' recommendations,
either in specifics or in spirit, despite the disparity among
workshop participants, were not noticeable. This does not mean
that areas of disagreement do not exist or that disagreements
did not occur. It does appear to mean, however, that in areas
involving the practice of air quality modeling there is much
room for agreement. Further, it may also mean that the majority
of participants see the practice of air quality modeling and
the AQM approach as the preferred way to accomplish clean air
goals.
Second, the nexus for efficiently achieving clean air
objectives through the AQM approach and impact assessment lies
in establishing and preserving a balance among flexibility,
standardization, and consistency. More effort needs to be
devoted to defining the dimensions of this issue and the
parameters that will secure and assure the continuance of that
balance in the regulatory setting. It appears that many of the
essential elements for dealing with this issue were identified
by the workshop participants:
(1) Utilization of cooperative processes, whenever
possible, that provide for early and substantive involvement of
interested parties and that encourage the anticipation,
definition, and resolution of potential areas of conflict.
Such processes can be expected to accord broad acceptance and
legitimacy to the results.
(2) Utilization of air quality (or increment
consumption) assessment plans or protocols to identify and de-
fine methods, tasks, analysis steps, data bases, potential dis-
putes (and the means for resolving them), and the schedule
(including the times for periodic meetings) for accomplishing
the impact assessment.
(3) Utilization of advisory groups to provide oversight,
guidance, and peer review.
It is recommended that, whenever appropriate, the EPA
provide guidance to the regions and states regarding the purpose
and role of the foregoing elements and that, in the case of item
(2), the agency provide guidance, by way of examples, regarding
the use of such plans or protocols.
An element not identified by the workgroups, but which
appears to be implicit in, and consistent with, their
recommendations involves the establishment of quality assurance
(QA) in the AQM process. Offered here as an additional
recommendation, this function would attempt to reduce doubts and
risks concerning the modeling methods employed in, and
conclusions derived from, air quality impact assessments. At
least two basic activities are recommended for a QA activity:
(1) certification, and (2) evaluation.
Certification would be primarily directed at verifying,
among other possibilities, the correctness of the impact
analysis against established accepted practice, model design,
and user manual specifications. Certification of an organization's
capability to offer impact assessment services could also be
provided. The certification would follow a rigorous test plan
established in advance. The result of this activity would be
either acceptance (certification) or rejection (no
certification); the basis for the rejection and the deficien-
cies would be noted. Further attention is needed to designate
the entity responsible for setting standards, defining the test
plan, and other related activities. A quality assurance board
composed of professional organization, government, and nongo-
vernment membership that encompasses a broad range of skills
and interests could be a part of this function.
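As one reading of the certification activity described above,
the sketch below (again in Python, and again purely illustrative)
shows the general shape of an acceptance test run against
archived reference cases. The candidate model, the reference
cases, and the ten percent tolerance are hypothetical; an actual
test plan, including the standards and tolerances, would be
established in advance by the responsible QA body.

# Illustrative sketch only.  The candidate model, the reference cases,
# and the tolerance are hypothetical placeholders for whatever the
# established test plan would specify.


def candidate_model(emission_gs, wind_ms, distance_m):
    # Stand-in for the model implementation submitted for certification.
    return 1.0e6 * emission_gs / (wind_ms * distance_m ** 1.5)


# Archived reference cases: (inputs, concentration in ug/m3 that accepted
# practice and the user manual say the model should reproduce).
REFERENCE_CASES = [
    ((100.0, 5.0, 1000.0), 632.5),
    ((50.0, 2.0, 2000.0), 279.5),
    ((200.0, 8.0, 500.0), 2236.1),
]


def certify(model, cases, rel_tolerance=0.10):
    # Accept only if every reference case is reproduced within the stated
    # relative tolerance; otherwise report the deficiencies found.
    deficiencies = []
    for inputs, expected in cases:
        computed = model(*inputs)
        if abs(computed - expected) > rel_tolerance * expected:
            deficiencies.append((inputs, expected, computed))
    return len(deficiencies) == 0, deficiencies


if __name__ == "__main__":
    accepted, problems = certify(candidate_model, REFERENCE_CASES)
    print("certified" if accepted else "not certified")
    for inputs, expected, computed in problems:
        print("  case %s: expected %.1f, computed %.1f" % (inputs, expected, computed))

As in the text above, the result is either acceptance or
rejection, with any deficiencies explicitly noted.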
The evaluation activity would be mainly directed at
examining items of concern that have been identified at some
point during the impact assessment. This activity follows an
investigative approach; problems or issues that are raised are
examined for their importance to, and effect on, a particular
outcome. The correctness of the method or methods in question
would be evaluated, and the effect of using the method(s) on a
particular outcome would be assessed. The result of the
evaluation would, at least, be brought to the attention of the
decision-maker or other interested participants in the impact
assessment.
It is recommended that further consideration be given to
the scope and use of a QA activity, especially in relation to
the
(1) Function(s).
(2) Elements of impact assessments to be included.
(3) Establishment of standards, and their relationship to
acceptance tests and independent verification and validation.
(4) Need for oversight.
(5) Roles and types of audits.
(6) Documentation requirements.
(7) Requirement for reducing administrative and resource
burdens on all parties and for preserving cost-effectiveness.