United States Environmental Protection Agency
Office of Air Quality Planning and Standards
Research Triangle Park NC 27711
EPA-450/4-85-006
July 1985

EPA
Interim Procedures For Evaluating Air Quality Models:
Experience with Implementation
-------
EPA-450/4-85-006
Interim Procedures for Evaluating Air
Quality Models: Experience with Implementation
U.S. ENVIRONMENTAL PROTECTION AGENCY
Monitoring and Data Analysis Division
Office of Air Quality Planning and Standards
Research Triangle Park, North Carolina 27711
July 1985
-------
Disclaimer
This report has been reviewed by the Office of Air Quality Planning and Standards, U.S. Environmental Protection
Agency, and has been approved for publication. Mention of trade names or commercial products is not intended to
constitute endorsement or recommendation for use.
-------
Preface
In August 1981, EPA developed and distributed to its Regional Offices an
in-house document "Interim Procedures for Evaluating Air Quality Models."
The Regional Offices were encouraged to use the guidance contained in the
document as an aid to determining whether a proposed model, not recommended in
the Guideline on Air Quality Models [1], could be applied to a specific regulatory
situation. Subsequently, as a result of experience gained in several applica-
tions of these procedures, EPA revised and published the "Interim Procedures for
Evaluating Air Quality Models (Revised)" [2] in September 1984.
The material contained in this report summarizes the experience gained
from the first several applications of the original guidance. Potential
users of the revised Interim Procedures are encouraged to read this report
so that they might benefit from the experience of others and thus be able to
better design their own application. The user should pay particular attention
to the Findings and Recommendations (Section 4) so as to know and better
understand the particular aspects of the revised procedures on which EPA will
place emphasis in future applications.
iii
-------
Acknowledgements
This report was prepared by Dean Wilson with contributions from
Joseph Tikvart, James Dicke and William Cox, all of the Source Receptor
Analysis Branch, Monitoring and Data Analysis Division.
Appreciation is extended to Michael Koerber, Region V, Alan Cimorelli,
Region III and Francis Gombar, Region II for their helpful comments during
the review process. The patience of Linda Johnson as she typed this report
is appreciated.
-------
Table of Contents
Page
Preface iii
Acknowledgements iv
Table of Contents v
List of Tables vii
List of Figures ix
List of Symbols xi
Summary xiii
1.0 INTRODUCTION 1
1.1 Scope and Contents 2
1.2 Basic Principles Employed in the Interim Procedures 2
1.3 Summary of the Interim Procedures 3
2.0 APPLICATIONS OF THE INTERIM PROCEDURES TO REGULATORY PROBLEMS . 7
2.1 Baldwin Power Plant 8
2.1.1 Background 8
2.1.2 Preliminary Analysis 10
2.1.3 Protocol for the Performance Evaluation 11
2.1.4 Data Bases for the Performance Evaluation 12
2.1.5 Results of the Performance Evaluation and Model
Acceptance 13
2.2 Westvaco Luke Mill 13
2.2.1 Background 13
2.2.2 Preliminary Analysis 14
2.2.3 Protocol for the Performance Evaluation 15
2.2.4 Data Bases for the Performance Evaluation 17
2.2.5 Results of the Performance Evaluation and Model
Acceptance 18
2.3 Warren Power Plant 19
2.3.1 Background 19
2.3.2 Preliminary Analysis 21
2.3.3 Protocol for the Performance Evaluation 22
2.3.4 Data Bases for the Performance Evaluation 25
2.4 Lovett Power Plant 25
-------
2.4.1 Background 26
2.4.2 Preliminary Analysis 26
2.4.3 Protocol for the Performance Evaluation 28
2.4.4 Data Bases for the Performance Evaluation 30
2.5 Guayanilla Basin 30
2.5.1 Background 32
2.5.2 Preliminary Analysis 33
2.5.3 Protocol for the Performance Evaluation 34
2.5.4 Data Bases for the Performance Evaluation 37
2.6 Other Protocols 38
2.6.1 Example Problem 38
2.6.2 Gibson Power Plant 39
2.6.3 Homer City Area 40
3.0 INTERCOMPARISON OF APPLICATIONS 43
3.1 Preliminary Analysis 43
3.1.1 Regulatory Aspects 44
3.1.2 Source Characteristics and Source Environment 44
3.1.3 Proposed and Reference Models 46
3.1.4 Preliminary Concentration Estimates 48
3.2 Protocol for the Performance Evaluation 48
3.2.1 Performance Evaluation Objectives 49
3.2.2 Data Sets, Averaging Times and Pairing 50
3.2.3 Performance Measures 53
3.2.4 Model Performance Scoring 55
3.3 Data Bases for the Performance Evaluation 57
3.4 Negotiation of the Procedures to be Followed 60
4.0 FINDINGS AND CONCLUSIONS 63
5.0 REFERENCES 69
Appendix A. Protocol and Performance Evaluation Results for
Baldwin Power Plant A-1
Appendix B. Protocol and Performance Evaluation Results for
Westvaco Luke Mill B-1
Appendix C. Protocol for Warren Power Plant C-1
Appendix D. Protocol for Lovett Power Plant D-1
Appendix E. Protocol for Guayanilla Basin E-1
vi
-------
List of Tables
Number Title Page
3-1 Source and Source Environment 45
3-2 Proposed and Reference Models 47
3-3 Weighting of Maximum Possible Points by
Data Set, Averaging Time and Degree of
Pairing 51
3-4 Performance Measures Used in the Protocols 54
3-5 Data Bases for Performance Evaluations 58
3-6 Issues Involved in Negotiations 62
vii
-------
viii
-------
List of Figures
Number Title
1-1 Decision flow diagram for evaluating a
proposed air quality model
2-1 Map of air quality monitoring stations and the
meteorological tower in the vicinity of the Baldwin
power plant, April 1982-March 1983
2-2 Topographic map of the area surrounding the
Westvaco Luke Mill 14
2-3 Map of seven air quality monitoring stations and
the meteorological station in the Warren area 20
2-4 Map of air quality monitoring stations and the
primary meteorological tower in the vicinity of
the Lovett power plant 26
2-5 Map of existing air quality monitoring network
and expanded air quality monitoring network in the
Guayanilla area 31
ix
-------
List of Symbols
Co = Observed Concentration
Cp = Predicted Concentration
d = Residual = Co - Cp
Mc = Number of Observed/Predicted Meteorological Events in Common
R = Pearson's Correlation Coefficient
RMSEd = Root-mean-square Error of Residual
Sd = Standard Deviation of Residual
So² = Variance of Observed Concentration
Sp² = Variance of Predicted Concentration
xi
-------
Summary
This report summarizes and intercompares the details of five major
regulatory cases for which the guidance provided in the "Interim Procedures for
Evaluating Air Quality Models"* was implemented in evaluating candidate models.
In two of the cases the evaluations have been completed and the appropriate
model has been determined. In three cases the data base collection and/or
the final analysis has not yet been completed.
Due to the unique source-receptor relationships in each case, however,
the procedures, data bases and number of monitors here are not necessarily
applicable to other situations. These cases are presented only as examples of
how the 1981 Interim Procedures document has been applied to some real world
situations.
Each of the five cases involves major point sources of SO2. In all
cases the major regulatory concern is to determine the emission limit that
would result in attainment of the National Ambient Air Quality Standards
(NAAQS) within a few kilometers of the plants. Most of the cases involve
power plants and/or industrial facilities located in complex terrain where
short-term impact on nearby terrain is the critical source-receptor
relationship.
Although the scope of model problems is limited, it seems clear that the
basic principles and framework underlying this guidance are sound and workable
in application. For example, the concept of using the results from a pre-
negotiated protocol for the performance evaluation has been shown to be an
appropriate and workable primary basis for objectively deciding on the best
model. Similarly, "up-front" negotiation on what constitutes an acceptable
data base network, while often difficult to accomplish because of conflicting
*1981 EPA internal document
xiii
-------
viewpoints, has been established as an acceptable way of promoting objectivity
in the evaluation.
In earlier evaluations there was some laxity on the part of the reviewing
agencies in requiring a detailed preliminary evaluation/documentation of the
critical source-receptor relationships. In more recent evaluations fulfilling
the requirement for preliminary estimates has led to better understanding of
the source-receptor relationships and provided a better linkage between these
relationships and the contents of the performance evaluation protocol. These
preliminary estimates also seem to better define the requisite data base net-
work. As a consequence of this experience, it is recommended that in future
protocols more emphasis be placed on the preliminary analysis; the results
of this analysis should be linked to the protocol and the requisite data
base through the development of detailed performance evaluation objectives.
Experience has also pointed up the need to build in some "safeguards" in
the application of the chosen model, should that model be shown to underpredict
concentrations. This is particularly a problem if an emission limit derived
from the model application might result in violations of the NAAQS. The methods
used in more recent regulatory cases generally involve the use of "adjustment
factors" to correct for possible underprediction. This technique is not
particularly appealing and the development of more innovative and scientifi-
cally defensible schemes is recommended.
Finally, based on this experience, it should be emphasized that the
credibility of the performance evaluation is greatly enhanced by the availability
of continuous on-site measurements of the requisite model input data. This
includes the measurement of meteorological parameters, as well as pre-specified
backup data sources for missing data periods. Also included is the need for
continuous in-stack measurement of emissions and accurate stack parameter data.
xiv
-------
1.0 INTRODUCTION
In 1981 a document "Interim Procedures for Evaluating Air Quality
Models" was prepared in-house by EPA and distributed to the ten
Regional Offices. This document identified the documentation, model
evaluation and data analyses desirable for establishing the appropriateness
of a proposed model. The Regional Offices were encouraged to use the
procedures when judging whether a model not specifically recommended for
use in the "Guideline on Air Quality Models,"^ was acceptable for a given
regulatory action. These procedures, which involved the quantitative
evaluation and comparison of models for application to specific air
pollution problems, addressed a relatively new problem area for the
modeling community. It was recognized that experience with their use would
provide better insight into the model evaluation problem and its limitations.
During the 1981-1984 time period, several projects which entailed the use
of the procedures were undertaken. Based on this experience, the procedures
were revised and published as "Interim Procedures for Evaluating Air
Quality Models (Revised)" [2].
It was clear from the experience gained in application of these 1981
procedures that the basic principles contained therein were sound
and appropriate to apply to regulatory model evaluation problems. However,
the state of the science did not suggest a single prescription detailing
their application. In fact, each application of the procedures differed
considerably in detail. However, while the individual merits of each
application could be scientifically debated, each case reflected an
acceptable interpretation of the interim guidance.
-------
1.1 Scope and Contents
The purpose of this document is to provide potential users of the
revised Interim Procedures with a description and analysis of several
applications that have taken place. With this information in mind the user
should be able to: (1) more effectively implement the procedures since
some of the pitfalls experienced by the early users can now be
avoided; and (2) design innovative technical criteria and statistical
techniques that will advance the state of the science of model evaluation.
Remaining sections of this report are as follows. Section 1.2
reviews the basic principles underlying the Interim Procedures. Section
1.3 is a summary of the Interim Procedures, to be used as a point of
reference in reading this report. Section 2 contains summaries of each
of five major regulatory cases where the Interim Procedures were applied,
as well as brief summaries of three other incomplete cases. Section 3
intercompares the technical details of each of the five cases. Section 4
lists the findings and recommendations resulting from the analyses in
Sections 2 and 3. Appendices A-E contain details of the protocols for
each of the five cases. Appendices A and B also contain the final scores
for two of the performance evaluations.
1.2 Basic Principles Employed in the Interim Procedures
The Interim Procedures for Evaluating Air Quality Models is built
around a framework of basic principles whereby the details of the decision
process to be used in the model evaluation should be established and documented
up-front. The performance evaluation protocol should be established before
data are available that would allow either the applicant or the control
agency(s) to determine, in advance, the outcome of the evaluation. These
principles are:
-------
° Up-front negotiations/agreements between the user and the
regulatory agencies are vital;
° All relevant technical data/analyses and regulatory constraints
are documented;
° A protocol for performance evaluation is written before any
data bases are in hand;
° A data base network is established that will meet the needs of
both the technical/regulatory requirements and the performance evaluation
protocol;
° The performance evaluation is carried out and the decision on
the appropriate model must be made as prescribed in the protocol.
The material in Sections 2 and 3 is an analysis, among other things,
of how well these principles were adhered to for five cases. The findings
in Section 4 include specific statements to this effect.
1.3 Summary of the Interim Procedures
The document Interim Procedures for Evaluating Air Quality
Models (Revised) describes procedures for use in accepting, for a specific
application, a model that is not recommended in the Guideline on Air Quality
Models. One requirement is for an evaluation of model performance. The
primary basis for the model evaluation assumes the existence of a reference
model which has some pre-existing status and to which the proposed nonguideline
model can be compared from a number of perspectives. However for some appli-
cations it may not be possible to identify an appropriate reference model,
in which case specific requirements for model acceptance must be identified.
Figure 1-1 provides an outline of the procedures described in the document.
After analysis of the intended application, or the problem to be
modeled, a decision must be made on the reference model to which the proposed
3
-------
[Flow diagram (not reproduced): write technical description of the proposed model; technical comparison of models; write performance evaluation protocol; collect performance evaluation data; conduct performance evaluation; apply the protocol criteria to determine whether the model is acceptable or the exercise is terminated.]
Figure 1-1. Decision flow diagram for evaluating a proposed air quality model
-------
model can be compared. If an appropriate reference model can be identified,
then the relative acceptability of the two models is determined as follows.
The model is first compared on a technical basis to the reference model
to determine if it can be expected to more accurately estimate the true
concentrations. This technical comparison should include preliminary con-
centration estimates with both models for the intended application. Next
a protocol for model performance comparison is written and agreed to by
the applicant and the appropriate regulatory agency. This protocol
describes how an appropriate set of field data will be used to judge the
relative performance of the proposed and the reference model. Performance
measures recommended by the American Meteorological Society are used in
describing the comparative performance of the two models in an objective
scheme. That scheme should consider the relative importance to the problem
of various modeling objectives and the degree to which the individual per-
formance measures support those objectives. Once the plan for performance
evaluation is written and the data to be used are collected/assembled,
the performance measure statistics are calculated and the weighting scheme
described in the protocol is executed. Execution of the decision scheme
will lead to a determination that the proposed model performs better,
worse or about the same as the reference model for the given application.
The final determination of the acceptability of the proposed model should
be based primarily on the outcome of the comparative performance evaluation.
However, if so specified in the protocol, the decision may also be based
on results of the technical evaluation, the ability of the proposed model
to meet minimum standards of performance, and/or other specified criteria.
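The following sketch (in Python, for illustration only) shows how such a pre-negotiated weighting scheme might be executed. The statistic names, weights and fractional scores below are invented and are not taken from any of the protocols described in Section 2.

```python
# Hypothetical sketch of executing a performance evaluation protocol:
# each performance statistic carries a pre-negotiated maximum number of
# points, each model earns a fraction of those points, and the model with
# the higher total is judged the better performer for the application.

# (statistic name, maximum possible points) -- illustrative values only
weights = [
    ("second_high_unpaired_residual", 40.0),
    ("high25_bias_paired_in_space",   40.0),
    ("frequency_distribution_match",  20.0),
]

# Fraction of maximum points earned by each model for each statistic;
# in practice these come from the scoring formulae written into the protocol.
fraction_earned = {
    "proposed":  {"second_high_unpaired_residual": 0.70,
                  "high25_bias_paired_in_space":   0.55,
                  "frequency_distribution_match":  0.50},
    "reference": {"second_high_unpaired_residual": 0.45,
                  "high25_bias_paired_in_space":   0.60,
                  "frequency_distribution_match":  0.40},
}

def total_score(model):
    """Sum of (fraction earned x maximum points) over all statistics."""
    return sum(fraction_earned[model][name] * max_pts for name, max_pts in weights)

scores = {m: total_score(m) for m in fraction_earned}
better = max(scores, key=scores.get)
print(scores, "-> better-performing model:", better)
```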
-------
If no appropriate reference model is identified, the proposed model
is evaluated as follows. First the proposed model is evaluated from a
technical standpoint to determine if it is well founded in theory, and is
applicable to the situation. Preliminary concentration estimates for the
proposed application should be included. This involves a careful analysis
of the model features and use in comparison with the source configuration,
terrain and other aspects of the intended application. Secondly, if the
model is considered applicable to the problem, it is examined to see if the
basic formulations and assumptions are sound and appropriate to the problem.
(If the model is clearly not applicable or cannot be technically supported,
it is recommended that no further evaluation of the model be conducted and
that the exercise be terminated.) Next, a performance evaluation protocol
is prepared that specifies what data collection and performance criteria
will be used in determining whether the model is acceptable or unacceptable.
Finally, results from the performance evaluation should be considered
together with the results of the technical evaluation to determine accepta-
bility.
-------
2.0 APPLICATIONS OF THE INTERIM PROCEDURES TO REGULATORY PROBLEMS
This section describes five major regulatory cases, covering the
period 1982-1984, where the techniques described in the Interim Procedures
are being applied to establish the appropriate model for setting emission
limits. Although protocols for the comparative performance evaluation of
competing models have been prepared for all five cases, in only two cases
has the execution of the protocol been completed; these results are pre-
sented.
Sections 2.1 through 2.5 are arranged roughly chronologically, i.e.,
in the order in which a final performance evaluation protocol was
established. Section 2.6 contains brief summaries for other applications
of the Interim Procedures of which EPA is aware; however, for a variety of
reasons, the chosen models have not been used in regulatory decision-making.
The history of negotiation over appropriate models, data bases, emission
limits, etc., for the sources included in these specific applications dates
back several years. The development and execution of an agreed upon procedure
for the comparative performance evaluation of competing models is, or is
designed to be, the basis for resolution of these issues. No attempt is
made in the following subsections to describe the complete history of issues/
negotiations. Instead, only a brief definition of the issues to be resolved
by the performance evaluation is provided.
Each of the Sections 2.1 through 2.5 contains separate subsections dealing
with the background (history), the preliminary analysis, the protocol for
the performance evaluation and the data bases to be used in the performance
evaluation. In addition, Sections 2.1 and 2.2 include a subsection which
summarizes the results of the performance evaluation.
-------
2.1 Baldwin Power Plant
The Baldwin power plant, located in Randolph County, Illinois,
about 60 km southeast of St. Louis, Missouri, is composed of three steam/
electric generating units with a combined design generating capacity of
1,826 megawatts. Each of the boilers is vented through an individual 605-
foot (184m) stack. A map of the area is provided in Figure 2-1.
2.1.1 Background
In late 1981 the State-approved SO2 emission rate was
101,588 lb/hour. Illinois Power Company (IP) requested that this rate be
established as the EPA-approved SIP limit adequate to protect both primary
and secondary National Ambient Air Quality Standards (NAAQS). The basis
for this proposal was estimates by the MPSDM model indicating compliance
with the standards. The company claimed that the use of MPSDM was supported
by data from an 11-station monitoring network in the vicinity of the plant.
Estimates using the EPA CRSTER model indicated compliance with the primary
NAAQS but violations of the 3-hour secondary NAAQS.
Potential problems were:
1. locations of the monitors were not adequate to conduct
a performance evaluation for MPSDM and CRSTER;
2. adequacy of the IP model performance evaluation was in
question since the available monitoring data were used to select a "best
fit" option of MPSDM, i.e., an independent performance evaluation was not
conducted; and
3. available monitoring data (summarized as block data)
indicated exceedances (no violations) of both the 3-hour and 24-hour standards
at a previously operated monitor not included in the 11-station network.
-------
Figure 2-1. Map of air quality monitoring stations and the meteorological tower in the vicinity of the Baldwin power plant, April 1982-March 1983.
-------
Based on this information EPA decided that the proposed
emission limit was adequate to attain the primary SO2 NAAQS; however, the
secondary NAAQS demonstration should be re-evaluated by the State of Illinois.
Guidance contained in the Interim Procedures for Evaluating Air Quality
Models should be used in the re-evaluation.
In response to this suggestion, IP, in February 1982, prepared
the "Proposed Procedures for Model Evaluation and Emission Limit Determina-
tion for the Baldwin Power Plant." Negotiations then took place between the
Illinois Environmental Protection Agency (IEPA) and IP on the contents of the
document. The end result of these negotiations was a final protocol issued
by IEPA in June 1982. The four major differences between the IEPA document
and the IP protocol were: (1) IEPA eliminated one performance measure that
involved case studies of the 10 episodes with highest measured concentrations,
(2) more weight was given to the comparison of the second-high, single-valued
residuals in the IEPA protocol (and less weight for some of the other mea-
sures); (3) IEPA eliminated the use of 1-hour statistics; and (4) IEPA
eliminated performance tests involving comparison of monitored data with
predictions for a 180 receptor grid. (Instead, only predictions at the
monitor sites were to be used.)
2.1.2 Preliminary Analysis
The preliminary analysis of the proposed application,
submitted by IP to IEPA, included a definition of the regulatory aspects
of the problem and a description of the source and its surroundings. The
analysis established that only the 3-hour concentration estimates were at
issue. IP proposed to use MPSDM in lieu of CRSTER to estimate 3-hour
concentrations, pending the outcome of a comparative performance evaluation.
A technical description of MPSDM and a user's manual for the model were
10
-------
provided to IEPA. IP also provided a technical comparison between MPSDM
and CRSTER following the procedures outlined in the "Workbook for Comparison
of Air Quality Models"^. IP's "workbook" comparison concluded that MPSDM
was technically comparable to CRSTER for most application elements but
was technically better for two of the elements; thus MPSDM was judged by
IP to be technically superior to CRSTER for the proposed application.
Preliminary concentration estimates were made with both
CRSTER and MPSDM although the details of these estimates were not documented.
From other information available it was evident that MPSDM would yield lower
3-hour estimates than CRSTER at locations within 2 km under very unstable
meteorological conditions (A-stability). These estimates would be controlling,
i.e. the estimates that would be used to set the emission limit for the power
plant.
2.1.3 Protocol for the Performance Evaluation
The IEPA protocol for the comparative performance evaluation
of MPSDM and CRSTER, which is detailed in Appendix A, strongly emphasized
accurate prediction of the peak (highest-second-highest) estimate. Fifty-five
(55) percent of the weighting in the protocol involved the calculation of
performance statistics that characterize each model's ability to reproduce
the measured second-high concentrations at the various monitors. Thirty-
five (35) percent of the weighting was assigned to performance statistics
that characterize the models' ability to reproduce the measured concentration
in the upper end of the observed frequency distribution, namely the high-25
observed and predicted concentrations. In addition, the protocol included
performance measures designed to determine how well the models perform for
specific meteorological conditions (5%) and performance statistics that compare
the upper end of the frequency distribution of measured/predicted values (5%).
11
-------
The primary performance measures used in the evaluation were
the residual (observed minus predicted concentration) and the bias (average
residual for the high-25 data set). Performance measures were calculated
from data paired in space and time and completely unpaired with the major
weighting on the unpaired data. Other performance measures in the protocol
included the standard deviation of the residual and the root-mean-square-error
of the residual.
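For illustration, the basic statistics named above can be computed from a paired set of observed and predicted concentrations as in the following sketch (Python; the concentration values are invented and the definitions follow the List of Symbols).

```python
import numpy as np

# Illustrative paired observed (Co) and predicted (Cp) concentrations;
# in a protocol like Baldwin's these might be, e.g., high-25 3-hour values.
Co = np.array([1250.0, 1180.0, 1100.0, 1040.0, 990.0])   # ug/m3 (invented)
Cp = np.array([1100.0, 1210.0, 1020.0,  980.0, 1050.0])  # ug/m3 (invented)

d = Co - Cp                        # residuals (observed minus predicted)
bias = d.mean()                    # average residual over the data set
S_d = d.std(ddof=1)                # sample standard deviation of the residual
RMSE_d = np.sqrt((d ** 2).mean())  # root-mean-square error of the residual

print(f"bias = {bias:.1f}, S_d = {S_d:.1f}, RMSE_d = {RMSE_d:.1f}")
```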
The scoring scheme used for most performance statistics
consisted of a percentage of maximum possible points within specified cutoff
values. If the performance statistic fell outside of the cutoff values,
no points were awarded to the model. Within the acceptable range, the per-
cent of possible points was linearly related to the value of the performance
statistic. The sign (+ or -) of the residual and bias statistics was not
considered in the scoring process, i.e. overprediction and underprediction
were weighted equally. The scoring schemes for the meteorological cases and
for the frequency distributions were more complicated; refer to Appendix A
for details.
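A minimal sketch of such a linear scoring rule is given below; the cutoff and maximum-point values are invented for illustration and are not those negotiated in the Baldwin protocol (see Appendix A for the actual values).

```python
def score_statistic(value, cutoff, max_points):
    """
    Award a share of max_points that falls off linearly as the (absolute)
    performance statistic grows, and award zero points once the statistic
    falls outside the cutoff.  The sign is ignored, so overprediction and
    underprediction are weighted equally.
    """
    magnitude = abs(value)
    if magnitude >= cutoff:
        return 0.0
    return max_points * (1.0 - magnitude / cutoff)

# Example: a residual of -120 ug/m3 with an (invented) cutoff of 400 ug/m3
print(score_statistic(-120.0, cutoff=400.0, max_points=10.0))  # 7.0
```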
The decision criterion by which the better model was chosen
was simply which model attained the best score.
2.1.4 Data Bases for the Performance Evaluation
As mentioned earlier, the data base for the performance
evaluation ultimately consisted of a network of monitors and a meteorological
station specifically designed to fit the needs of the application (See Figure
2-1). Data obtained from previously operated networks were used in designing
this data base network. This data base consisted of 10 SO2 monitors and a
single meteorological tower instrumented to collect wind speed, wind direction
and turbulence intensity (for use in MPSDM) data. Off-site meteorological
12
-------
data used in the evaluation consisted of mixing height data derived from
National Weather Service (NWS) soundings from Salem, IL and Pasquill-Gifford
stability data derived from surface observations at Scott Air Force Base,
IL (CRSTER only). Hourly emission data and stack gas parameters were
derived from records of plant load level and daily coal samples.
2.1.5 Results of the Performance Evaluation and Model Acceptance
The data base for this evaluation has been collected and the
performance evaluation has been carried out according to terms specified in
the protocol. The overall result was that MPSDM scored 51.3 points and
CRSTER scored 41.7 points out of a possible 100 points. Thus MPSDM was
selected as the appropriate model to be used to determine the emission limit
necessary to attain the secondary 3-hour NAAQS. Details of the performance
evaluation results are provided in Appendix A.
2.2 Westvaco Luke Mill
The Westvaco Luke mill in the town of Luke in Allegany County,
Maryland, is located 970 feet (296m) above mean sea level (msl) in a deep
valley on the north branch of the Potomac River. The region surrounding
the mill is mountainous and generally forested. Figure 2-2 is a topographic
map of the area surrounding the Westvaco Luke Mill. The © symbol shows the
location of the 623-foot (190m) main stack which serves the facility.
2.2.1 Background
In response to a consent decree, the company operated an
ambient monitoring and meteorological data collection network from December
1979 through November 1981. The symbols in Figure 2-2 show the locations
of the continuous SO2 monitors and the ▲ symbols show the locations of the
13
-------
Figure 2-2. Topographic map of the area surrounding the Westvaco Luke Mill.
Elevations are in feet above mean sea level and the contour interval is 500 feet.
The symbols represent SO2 monitoring sites. The ▲ symbols represent meteorological monitoring sites. Sites 1 and 2 are also SO2 monitoring sites.
14
-------
100-meter Meteorological Tower No. 1, the 30-meter Meteorological Tower No. 2
(the Luke Hill Tower) and the 100-meter Beryl Meteorological Tower. Continuous
SO2 monitors were collocated with Tower No. 1 and Tower No. 2 and an acoustic
sounder was collocated with Tower No. 2. As shown in Figure 2-2, there were
eleven SO2 monitors, of which eight were located on a ridge southeast of the
Main Stack. SO2 emissions during the two-year monitoring period were limited
to 49 tons per day.
The company developed a site-specific dispersion model,
LUMM, which they claimed was applicable to the problem and should be accepted
as the basis for setting a new emission limit of 89 tons per day. The company's
basis for this claim was described in a March 1982 report [7] in which estimates
from the LUMM model were compared to ambient measurements from the 11-station
network. EPA reviewed the report and found a number of technical problems
with the model, including the use of ambient data to "tune" the model, i.e.
no independent performance evaluation was undertaken.
In order to resolve these problems, EPA developed, under
contract in mid-1982, a protocol for conducting a performance evaluation of
models applicable to the Westvaco site. The company was then asked to compare
their model with the SHORTZ model, using procedures like those suggested in
this protocol. As a result of these negotiations a final protocol [8] was agreed
upon in late 1982 and subsequently executed by the company, utilizing the
second year of the two-year data base.
2.2.2 Preliminary Analysis
There is little written material on the Westvaco case which
would suggest that an up-front, in-depth preliminary analysis of regulatory
and technical aspects of the problem was undertaken. However, based on the
15
-------
above two references, various Federal Register actions and numerous meetings,
both the source and the control agencies apparently had at least tacit under-
standings of the regulatory and technical issues involved. For example, the
regulatory agencies were concerned about attainment of the short-term ambient
standards at elevated receptors near (within a few kilometers of) the source.
It was also apparent that SHORTZ would yield higher concentration estimates,
and thus a tighter emission limit, than LUMM.
References 7 and 8 contain technical descriptions of the two
competing models but no user's manuals. The SHORTZ model was modified for
use at Westvaco and no user's manual exists for this version. The references
do not describe any preliminary estimates using the two models nor do they
contain an in-depth technical comparison of the two models. No analysis
using the Workbook for Comparison of Air Quality Models was undertaken.
2.2.3 Protocol for the Performance Evaluation
The final agreed upon protocol for the comparative performance
evaluation of LUMM and SHORTZ, which is detailed in Appendix B, emphasized
accurate estimates of the peak concentrations and the upper end of the fre-
quency distributions. Forty-three (43) percent of the weighting in the
protocol involved the calculation of performance statistics that characterize
each model's ability to reproduce the measured maximum and second-high
concentrations at the various monitors. Fifty-seven (57) percent of the
weighting was assigned to performance statistics that characterize the
models' ability to reproduce measured concentrations in the upper end of
the observed frequency distribution, namely the high-25 observed and predicted
concentrations. No "all data" statistics were calculated, i.e. the protocol
assumed that the only relevant data were the top-25 estimated and observed
concentrations.
16
-------
The protocol specified three basic performance measures to be
used in the evaluation, the absolute residual for single-valued comparisons,
the bias for the top-25 concentrations and the ratios of the observed and
predicted variances for the top-25 concentrations. Various time and
space pairings were specified with most of the weighting (61 percent) on
data paired in space but not time.
The scoring scheme used for each performance statistic was
specified by somewhat complicated formulae and the reader is referred to
Appendix B for details. Basically, the scheme involved computing ratios
of performance measures between the two competing models and bias ratios
or variance ratios for each model. These ratios were then combined in
various ways to produce a percentage of maximum possible points for each
performance statistic. This result was then multiplied by the maximum
possible points for that performance statistic to yield a subscore.
Subscores were then totalled for each model to yield a composite score.
The model with the highest total score was deemed to be most appropriate
to apply to the source.
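The following sketch illustrates the general shape of such a ratio-based subscoring step; the combining formula and the numbers are invented for illustration and are not the formulae negotiated in the Westvaco protocol (see Appendix B for the actual formulae).

```python
# Loose sketch of one ratio-based subscoring step of the kind described above.
# The combination used here is illustrative only.

def subscore(abs_residual_model, abs_residual_other, bias_ratio, max_points):
    """
    Combine (1) the ratio of this model's absolute residual to the competing
    model's residual and (2) this model's own observed/predicted bias ratio
    (assumed positive) into a fraction of the maximum possible points,
    then scale by the maximum points to obtain the subscore.
    """
    relative_skill = abs_residual_other / (abs_residual_model + abs_residual_other)
    closeness_to_unity = 1.0 / max(bias_ratio, 1.0 / bias_ratio)  # 1.0 is perfect
    return max_points * relative_skill * closeness_to_unity

# Hypothetical numbers for a single performance statistic
print(subscore(abs_residual_model=150.0, abs_residual_other=450.0,
               bias_ratio=1.2, max_points=20.0))  # 12.5
```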
2.2.4 Data Bases for the Performance Evaluation
The data base used in the performance evaluation was the
second year of the historical two-year data base described above. The
locations of the ten monitors and the two meteorological towers for use in
the evaluation are shown in Figure 2-2. Data from the Beryl tower and the
Bloomington monitor were not to be used, although Bloomington data were
used to help establish background values. Each tower was instrumented at a
number of levels; thus there were often a number of possible values for the
meteorological inputs to each model to choose from. To promote objectivity
in the evaluation, the primary source of data for each meteorological
17
-------
parameter, as well as ranked "default" data sources to be used in the event
of missing data, were specified in a protocol. No off-site meteorological
data were used in the models; however, default values for mixing height
some turbulence intensities were specified. Hourly emission data and stack
gas parameters were derived from continuous in-stack measurements.
The data base for the model evaluation already existed.
The network was designed in 1978-1979, to determine if there were any NAAQS
violations in the vicinity of the plant and possibly for use in conducting
a performance evaluation. Most of the monitors were densely clustered on
the hillside south of the plant, the area where maximum concentrations were
expected. However, a decision was made that the definition of ambient air
did not apply to this property, i.e., the NAAQS did not apply there. This
fact, together with an opinion of the control agencies that the LUMM model
was partially based on the same data that would be used in the performance
evaluation, raised many questions on the objectivity of the evaluation.
Detailed records on the negotiations between the company and control agencies
to resolve this concern are lacking. In the end it was apparently decided
that the objectivity in the performance evaluation was not sufficiently
compromised to warrant the redesign of the network and collection of an
additional year's data. The performance evaluation protocol contained one
mitigating measure in this regard. In an apparent attempt to compensate
for the lack of sufficient "offsite" monitors, several of the performance
statistics for the single offsite monitor (No. 10 in Figure 2-2) were weighted
by a factor of four over those same statistics for the other eight monitors.
2.2.5 Results of the Performance Evaluation and Model Acceptance
The data base for this evaluation has been collected and
the performance evaluation has been carried out according to terms specified
18
-------
in the protocol. The overall result was that LUMM scored 363 points and
SHORTZ scored 168 points out of a possible 602 points. Thus, LUMM was
selected as the appropriate model to be used to determine the emission
limit necessary to attain the NAAQS. Details of the performance evaluation
results are provided in Appendix B.
2.3 Warren Power Plant
The 90 megawatt Warren power plant, operated by the Pennsylvania
Electric Company (Penelec), is located in Warren County in northern Pennsyl-
vania, about 80 km southeast of Erie. The plant has a single 200-foot (61m)
stack which emits about 2420 lb/hour of SO2 at maximum capacity. The model-
ing region near Warren is characterized by irregular mountainous terrain,
with peak terrain elevations substantially above the top of the power plant
stack (See Figure 2-3).
2.3.1 Background
As a result of earlier modeling, the area was designated
as nonattainment in the late 1970's. Penelec was directed by the State of
Pennsylvania to establish, through monitoring and modeling, an emission
limit that would ensure attainment of the NAAQS.
Penelec believed that the LAPPES model was appropriate to
use for purposes of setting the emission limits. In March 1984 Penelec
proposed to the State of Pennsylvania Department of Environmental Resources
(DER) an analysis and a performance evaluation protocol, patterned after
the Interim Procedures, to establish whether LAPPES would be more appro-
priate than EPA's Complex I model. A series of negotiations between DER,
EPA and Penelec followed. A number of additions and changes were made to
19
-------
[Map annotations: Warren Power Plant; Starbrick. Contour interval 20 feet; datum is mean sea level.]
Figure 2-3. Map of seven air quality monitoring stations and the meteorological station in the Warren area.
20
-------
the analysis and the protocol and a final agreed upon analysis and protocol
was written in November 1984. Data collection requisite to executing the
protocol is currently underway.
2.3.2 Preliminary Analysis
The protocol document contains a definition of the regulatory
aspects of the problem and a description of the source and surroundings.
The analysis establishes that the 3-hour and the 24-hour concentration esti-
mates are at issue. Penelec proposes to use LAPPES in lieu of Complex I to
estimate concentrations for all averaging times pending the outcome of a com-
parative performance evaluation. Penelec has also submitted a technical
description of LAPPES. Although a user's manual for LAPPES exists, it is
not clear that the manual is "current" with the version of LAPPES used in
this application. Penelec has not provided a rigorous technical comparison
of LAPPES and Complex I following the procedures outlined in the Workbook
for Comparison of Air Quality Models.
Based on one year of meteorological data, preliminary concen-
tration estimates have been made with both LAPPES and Complex I and the details
of these estimates, including isopleth maps, are provided in the protocol
document. These estimates show that maximum concentrations for all averaging
times occur on elevated terrain to the north of the plant. The preliminary
analysis also identifies another significant S02 source located approximately
4 km east of the Warren power plant. This source is close enough that
short-term impacts could overlap. Since monitoring data would not always
distinguish between these sources, both sources are included in the model
comparison study. The hourly average background SO2 concentration is to be
the lowest concentration observed by any station in the monitoring network.
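As a simple illustration of that convention (the observed values below are invented), the hourly background would be computed as follows.

```python
import numpy as np

# Hourly SO2 observations, one column per monitoring station (invented values);
# the hourly background is taken as the lowest concentration observed by any
# station in the network for that hour, as specified above.
hourly_obs = np.array([
    [34.0, 12.0, 55.0, 20.0],   # hour 1, ug/m3
    [28.0, 15.0, 60.0, 18.0],   # hour 2
    [40.0,  9.0, 72.0, 25.0],   # hour 3
])

background = hourly_obs.min(axis=1)   # per-hour background concentration
print(background)                     # [12. 15.  9.]
```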
21
-------
2.3.3 Protocol for the Performance Evaluation
The protocol for the comparative performance evaluation of
LAPPES and Complex I, which is detailed in Appendix C, emphasizes accurate
prediction of the peak concentration. Forty-three (43) percent of the weight-
ing in the protocol involves the calculation of performance statistics that
characterize each model's ability to reproduce the measured high and second-
high concentrations at the various monitors. An additional forty-three (43)
percent of the weighting is assigned to performance statistics that character-
ize the models' ability to reproduce the measured concentration in the upper
end of the observed frequency distribution, namely the high-25 concentrations.
These analyses of the high-25 data set include certain statistics that break
out performance by stability category. In addition, the protocol assigns a
weight of fourteen (14) percent to performance statistics on the entire
range (all data) of measured/predicted values.
A variety of performance measures are used in the Warren
protocol; see Appendix C. Although the bias is weighted heavily in all of
the data sets, the specific performance measures used to characterize bias
vary. For the maximum single-valued comparisons, the average residual and
the ratio of the absolute residual to the observed concentration (both
paired in space but not time) are used to characterize the bias. For other
data sets, including the second high single-valued comparisons, extensive
use is made of the ratio of the predicted to observed concentrations as a
measure of bias. Other performance measures used in the protocol include
correlation measures and ratios of predicted to observed variances. Perfor-
mance statistics are to be calculated for 1-hour, 3-hour, 24-hour and annual
averaging times. Each averaging time carries considerable weighting. Sixty
(60) percent of the weighting is assigned to unpaired data comparisons and
forty (40) percent to data paired in space but not time.
22
-------
The scoring scheme used for most performance statistics
consists of a percentage of maximum possible points within specified cutoff
values. If the performance statistics fall outside of the cutoff values,
no points are to be awarded to the model. Within the acceptable range, the
percent of possible points is specified in tabular form (discrete values
for specified ranges of performance). The tabular values for the bias
statistics slightly favor the model that overpredicts, if one model overpre-
dicts to the same extent that the other model underpredicts. The scoring
schemes for other performance measures are more complicated; refer to
Appendix C for details. Subscores for each performance statistic are
totaled to obtain a final score for each model.
Initially, the model with the highest score is deemed to
be most appropriate to apply for regulatory purposes. However, the protocol
contains some additional procedures to be employed if the LAPPES model
attains the highest score but is shown to underpredict the highest concentra-
tions. For the 3-hour and 24-hour averaging periods, the average of the 10
highest concentrations predicted by LAPPES will be compared with the average
of the 10 highest observed values. If the ratio of the observed to predicted
average is greater than 1, then this ratio will be used to adjust LAPPES
model predictions for the regulatory analyses. This "safety factor" is
intended to compensate for any systematic model underpredictions. If the
ratio is less than 1, no adjustment will be made. Note that a different
ratio will be used for each averaging time. For annual average concentra-
tions, the averages of observed and predicted values at the seven monitoring
stations will be compared. If the average of observed annual values is
larger than predicted, then model predictions will be adjusted by the ratio
of the observed to predicted average.
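A sketch of how such an adjustment factor might be computed for one averaging time is given below; the function name and the concentration values are invented for illustration.

```python
import numpy as np

def adjustment_factor(observed, predicted, n=10):
    """
    Ratio of the average of the n highest observed concentrations to the
    average of the n highest predicted concentrations for one averaging
    time (3-hour or 24-hour).  Applied only if it exceeds 1, i.e. only
    when the model underpredicts the top of the distribution.
    """
    obs_top = np.sort(np.asarray(observed, dtype=float))[-n:].mean()
    pred_top = np.sort(np.asarray(predicted, dtype=float))[-n:].mean()
    ratio = obs_top / pred_top
    return ratio if ratio > 1.0 else 1.0

# Invented 24-hour values: the model underpredicts the highest concentrations,
# so its regulatory predictions would be scaled up by roughly 1.13.
obs  = [380, 360, 350, 340, 330, 320, 310, 300, 290, 280, 150, 90]
pred = [330, 320, 300, 295, 290, 285, 280, 270, 260, 255, 160, 95]
print(adjustment_factor(obs, pred))   # ~1.13
```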
23
-------
2.3.4 Data Bases for the Performance Evaluation
The data base for the performance evaluation consists of a
network of monitors and meteorological stations specifically designed to cover
the area of maximum predicted concentration and to fit the needs of the pro-
tocol. This data base consists of seven monitors, six of which are in the
area north of the plant, where preliminary estimates indicated that high
concentrations would occur (See Figure 2-3). The seventh monitor, located
south of the plant, would most often be used to determine background. Two
meteorological towers are included in the network but data from the Starbrick
tower would be used exclusively unless such data are missing. For missing
data periods, a hierarchy of default data sources is specified in the
protocol, including data from the Preston tower and off-site data. Wind
fluctuation (sigma theta) data are used to determine stability in accordance
with the scheme defined in the "Regional Workshops on Air Quality Modeling:
A Summary Report" H. Morning and afternoon mixing heights are primarily from
Pittsburgh National Weather Service data. Hourly emission data and stack gas
parameters are to be derived from records of plant load level and coal
sample data.
2.4 Lovett Power Plant
The Lovett power plant is located in the Hudson River Valley of
New York State and is owned by Orange and Rockland Utilities, Inc. The
plant generates 495 megawatts of electricity and is currently burning 0.37
percent sulfur oil. Major terrain features in the vicinity of the plant
include the Hudson River Valley, which generally runs from north to south,
and several nearby mountains. Dunderberg Mountain, with a maximum elevation
of approximately 1100 feet (335m), is located 1-2 km to the north. Other signi-
ficant topographic elevations include Buckberg Mountain, about 1.3 km to the
24
-------
west, with a peak of 787 feet (240 m). An area of high terrain extends
from west-northwest through north within 5 km of the plant. A map of the
region is presented in Figure 2-4.
2.4.1 Background
The company requested to convert the plant to low sulfur
(0.6-0.7 percent) coal with a new emission limit of 1.0 lb SO2 per million Btu. An
actual increase in SO2 emissions of approximately 12,000 tons per year
would result.
In April 1984, the EPA Administrator agreed, in principle, to
allow the company to construct a new 475-foot (145m) stack and convert the
plant to coal. One provision of the agreement was that the company develop
a protocol for a performance evaluation which was acceptable to EPA and execute
this protocol once the new stack was erected and the conversion to coal was
completed. The company drafted a protocol for the comparative performance
evaluation of three models: the NYSDEC model, a modified version of the NYSDEC
model (the company's model of choice) and EPA's Complex I model. A series of
negotiations then took place between the company, the State of New York and
EPA where the details of the protocol and the proposed monitoring network
were changed several times. A final protocol [11, 12] was agreed upon by all
parties in September 1984.* The data base collection phase is not yet under-
way. It should be completed by 1988.
2.4.2 Preliminary Analysis
The preliminary analysis of the proposed application, contained
in the protocol documents, provides a complete description of the existing
*Although an appropriate protocol was agreed upon by the source and the
control agencies, the construction of the 475-foot stack and conversion of
the plant to coal have not yet begun, pending the outcome (final Federal
Register approval or disapproval) of the proposed SIP revision.
25
-------
Figure 2-4. Map of air quality monitoring stations and the primary meteorological tower in the vicinity of the Lovett power plant. Ten-meter meteorological towers are also located at Sites 75, 100, 119 and 6.
26
-------
and proposed source and the surroundings. The regulatory constraints had
been established earlier, namely attainment of the SO2 NAAQS, primarily the
short-term NAAQS, on nearby elevated terrain above stack height. The proto-
col document identifies Complex I as the reference model and the two proposed
models, NYSDEC and Modified NYSDEC model. The technical features of the
two proposed models are described but no user's manuals are provided. The
preliminary analysis does not include a formal technical comparison of the
proposed and reference models following the procedures outlined in the
Workbook for Comparison of Air Quality Models.
Preliminary estimates of 3- and 24-hour SO2 concentrations
have been made with Complex I and the Modified NYSDEC model, using one year of
meteorological data from a tower located at the nearby Bowline power plant.
Modeling has been performed for both maximum and average load conditions.
The protocol document contains a fairly comprehensive analysis of the
results including isopleth maps of maximum short-term concentrations and
tables listing the magnitude and locations of the "high-50" estimates. The
analysis shows that maximum concentrations for both models would be expected
on Dunderberg Mountain to the north of the plant. Complex I estimates are
as much as an order of magnitude higher than the Modified NYSDEC model
estimates. Secondary maxima are estimated to occur on other more distant
terrain features in several directions but these estimates are much lower
than those on Dunderberg Mountain.
The protocol document identifies the Bowline power plant, 6 km
to the south, as another significant source of SO2, the plume from which could
simultaneously (with Lovett) impact Dunderberg Mountain. The contribution from
this plant will be quantified, as a function of meteorological conditions, through
27
-------
utilization of data from the monitoring network obtained prior to the
Lovett plant conversion.
2.4.3 Protocol for the Performance Evaluation
The protocol for the comparative performance evaluation of
the three competing models, which is detailed in Appendix D, emphasizes
accurate prediction of the peak concentrations and the upper end of the
frequency distribution. Twenty (20) percent of the weighting in the pro-
tocol involves the calculation of performance statistics that characterize
each model's ability to reproduce the measured second-high concentrations
at the various monitors. Fifty-eight (58) percent of the weighting is
assigned to performance statistics that characterize the models' ability
to reproduce the measured concentration in the upper end of the observed
frequency distribution, namely the high-25 concentrations. In addition,
the protocol assigns twenty-two (22) percent of the weighting to performance
measures designed to determine how well the models perform for the entire
range (all data) of measured/predicted values, broken out into stable and
unstable conditions.
The primary performance evaluation measures are the ratios
of observed to predicted concentrations, ratios of the observed to predicted
variances and the inverse of these ratios. Seventy-eight (78) percent of
the weighting is associated with statistics based on the values of these
ratios. These statistics are to be calculated for all combinations of data
pairings but most often the unpaired data sets and the data sets paired in
space only are used. The analysis of the "all data" data set includes
statistics that break out performance by stability category. The other
twenty-two (22) percent of the weighting is associated with performance
28
-------
measures designed to characterize correlation, gross variability and the
ability of the models to accurately predict observed concentrations during
observed meteorological conditions.
The scoring scheme used for most performance statistics is
specified by somewhat complicated formulae and the reader is referred to
Appendix D for details. The scheme is similar to that used in the Westvaco
protocol. Basically, it involves computing ratios of performance measures
between the three competing models and bias ratios or variance ratios for
each model. These ratios are then combined in various ways to produce a
percentage of maximum possible points for each performance statistic.
This result is then multiplied by the maximum possible points for that
performance statistic to yield a subscore. Subscores are then summed
for each model to yield a total score.
Initially, the model with the highest score is deemed to be
most appropriate to apply for regulatory purposes. However, the protocol
contains some additional procedures to be employed if the chosen model is
shown to underpredict the highest concentrations. The procedure, which is
based on the unpaired in time and space comparisons, is as follows:
(1) If the average of the highest ten predicted 3- or 24-hour
average concentrations is less than the average of the highest ten observed
3- or 24-hour average concentrations, or
(2) If the highest, second-highest predicted 3- or 24-hour
average concentration is less than ninety (90) percent of the highest,
second-highest observed 3-or 24-hour average concentration,
then the model predictions will be linearly adjusted to correct this regula-
tory problem. The adjustment factors will be calculated as the minimum
needed to eliminate the two conditions of underprediction listed above.
29
-------
2.4.4 Data Bases for the Performance Evaluation
The data base for the performance evaluation will consist
of a network of monitors/meteorological stations specially designed to cover
the area of maximum predicted concentration and to fit the needs of the
protocol. This data base will consist of eleven monitors, nine of which
are to be in the area north of the plant where preliminary estimates indicate
that high concentrations would occur (See Figure 2-4). Monitor #38, located
south of the plant, will most often be used to determine background. A
100-meter meteorological tower, instrumented at three levels, will be
located at the plant site. Ten-meter meteorological towers are included in
the network at sites 6, 119, 100 and 75 but data from the 100-meter tower
will be used exclusively. For missing data periods, a hierarchy of default
data sources is specified in the protocol. These primarily consist of data
from other levels on the 100-meter tower. Wind fluctuation (sigma theta)
data from the 10-meter height are used to determine stability inputs to
Complex I and the NYSDEC model in accordance with the scheme defined in the
Regional Workshops on Air Quality Modeling: A Summary Report. Sigma theta
data from the 100-meter level are used as direct input to the Modified NYSDEC
model. Morning and afternoon mixing heights will be primarily derived from
the Albany National Weather Service data. Hourly emission data and stack
gas parameters will be derived from continuous in-stack measurements.
2.5 Guayanilla Basin
The Guayanilla Basin is located on the southern coast of the island
of Puerto Rico. The area is characterized by coastal plains with hills
rising abruptly from the plains (See Figure 2-5). Historically, several
industrial sources of SO2 have operated in the area but most have shut down.
The only currently operating sources, which are relevant to this analysis,
30
-------
Figure 2-5. Map of the existing air quality monitoring network and the expanded air quality monitoring network in the Guayanilla area.
-------
are the Puerto Rico Electric Power Authority (PREPA) power plant and the
Union Carbide (UCCI) facility, both located near Tallaboa Poniente. The oil
fired PREPA plant has stacks ranging in height from 23 feet (7m) to 250
feet (76m) and a combined nominal SO2 emission rate of 16,545 lb/hour. The
UCCI plant has five stacks ranging in height from 38 feet (12m) to 160 feet
(49m) with a combined nominal SO2 emission rate of 1568 lb/hour. Nominal
plant grade for both facilities is ten feet (3m) above mean sea level.
2.5.1 Background
The major regulatory concern with these plants has been the
attainment of the short-term S02 NAAQS on elevated terrain to the north and
northwest of the sources. Modeling with EPA's Complex I model indicated that
there would be NAAQS violations on the terrain. Industrial interests and
the Puerto Rico Environmental Quality Board (PREQB) maintained for several
years that emission limits should be based on estimates from the Puerto Rico
Air Quality Model (PRAQM), which generally predicts lower concentrations
than Complex I. In 1979 Environmental Research and Technology, Inc. (ERT)
prepared a report for PREQB entitled "Validation of the Puerto Rico Air
Quality Model for the Guayanilla Basin."14 The report compared, in various
ways, model estimates with historical ambient air quality data from eight
monitors (four on elevated terrain) in the area. EPA expressed concerns
about the technical aspects of the model and the underestimation of the
observed concentrations at some monitors.
In response to these concerns, it was decided in early 1984
that a comparative performance evaluation between the PRAQM and Complex I
should be undertaken. Hence EPA developed an analysis and draft protocol for
this performance evaluation. The protocol and design of the monitoring
32
-------
network were then negotiated with PREQB and the industrial interests. A
final agreed-upon protocol was issued in December 1984^.
2.5.2 Preliminary Analysis
The protocol document contains a definition of the regulatory
aspects of the problem and a description of the sources and their surroundings.
The document states that only the short-term S02 concentration estimates are
at issue and that the PREQB proposes to use PRAQM in lieu of Complex I to
estimate these concentrations, pending the outcome of a comparative performance
evaluation. A technical description of PRAQM is contained in the protocol.
Apparently no user's manual for PRAQM exists. No formal technical comparison
of PRAQM and Complex I, following the procedures outlined in the Workbook
for Comparison of Air Quality Models, was performed.
Some preliminary concentration estimates have been made with
the PRAQM, Complex I and also the SHORTZ model. The details of these
estimates are not provided in the protocol document; however, all parties
are privy to the results. The results indicate the following:
1. Maximum concentration estimates occur on elevated terrain
to the north and northwest of the plants; however Complex I, PRAQM and
SHORTZ all produce different results in terms of magnitude, specific location,
and time of the maximum concentrations.
2. Maximum 3-hour and 24-hour concentrations frequently
occur both at the monitored locations and in areas that are not monitored.
3. In terms of magnitude, SHORTZ seems to yield the highest
concentrations, significantly higher than either Complex I or PRAQM. The
PRAQM yields the lowest concentration estimates.
33
-------
4. The meteorological data indicate a predominance of
neutral/unstable conditions associated with the daytime southeast winds.
Such conditions generally carry the plumes over the terrain to the northwest
of the sources. However, there are occasional hours, during periods of wind
shifts, when stable plumes traveling over terrain could have a significant
short-term air quality impact.
Based on these results, it has been decided that Complex I is
the appropriate reference model and PRAQM the proposed model for the performance
evaluation; SHORTZ has been dropped from further consideration. It has also
been established that while the existing monitoring network is acceptable
for a preliminary performance evaluation, some data from a more detailed
network will be necessary to confirm/refute the results of this evaluation.
The specifics on how to use the existing network data as well as the design
and use of the augmented network and data are discussed below.
2.5.3 Protocol for the Performance Evaluation
The protocol for the comparative performance evaluation of
PRAQM and Complex I, which is detailed in Appendix E, specifies that the
performance evaluation will be divided into two phases. Phase I is an evalua-
tion for the period January 1983 through December 1984 using monitored data
collected at the four existing monitoring sites. Using the selection criteria
contained in the protocol, a model of choice will be selected in this phase.
Phase II of the evaluation is designed to confirm the conclusions reached
as a result of the Phase I evaluation. Phase II will be based on six months
of air quality data from all eight sites (beginning around September 1984).
The specifics of the protocol for each phase are identical, except for a
minor stipulation involving the weighting of performance statistics by
monitor.
34
-------
The protocol emphasizes accurate prediction of the peak
concentration. Thirty-two (32) percent of the weighting in the protocol
involves the calculation of performance statistics that characterize each
model's ability to reproduce the measured maximum and second-high concentra-
tions at the various monitors. Sixty-eight (68) percent of the weighting
is assigned to performance statistics that characterize the models' ability
to reproduce measured concentration in the upper end of the observed frequency
distribution, namely the high-25 observed and predicted concentrations.
The primary performance measures are the ratio of the predicted
to the observed concentration (average predicted to average observed for the
high-25 data set) and the ratio of the variance of predicted concentrations
to variance of observed concentrations. Seventy-seven (77) percent of the
weighting is on data paired in space but not in time and twenty-three (23)
percent on unpaired data. Most performance statistics are to be calculated
for 1-, 3-, and 24-hour averaging times. The ratio measures are supplemented
by case study statistics, based on the number of cases in common between
predicted and observed concentrations (stratified by stability class, for
the upper five percent of the 1-hour values).
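The case study (Mc) statistic is defined differently in each protocol; one plausible
reading of the description above is a count, by stability class, of the hours that fall
in the upper five percent of both the predicted and the observed 1-hour values. A sketch
of that reading (Python; the names are hypothetical):

    import numpy as np

    def cases_in_common(pred_hourly, obs_hourly, stability, top_fraction=0.05):
        """Count, by stability class, the hours falling in the upper five percent
        of both the predicted and the observed 1-hour concentrations.  This is
        only one plausible reading of the 'cases in common' statistic; the exact
        definition varies among the protocols."""
        pred = np.asarray(pred_hourly, float)
        obs = np.asarray(obs_hourly, float)
        stability = np.asarray(stability)

        pred_cut = np.quantile(pred, 1.0 - top_fraction)
        obs_cut = np.quantile(obs, 1.0 - top_fraction)
        common = (pred >= pred_cut) & (obs >= obs_cut)

        return {cls: int(np.count_nonzero(common & (stability == cls)))
                for cls in np.unique(stability)}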
The Guayanilla protocol specifies that certain performance
measures are weighted according to the magnitude of the observed concentra-
tions. The performance statistics for the monitor with a higher observed
concentration are given proportionally more weight than those of the next
lower-ranked monitor. The monitor with the lowest reading receives the
least weight.
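The exact weighting formula is given in the protocol (Appendix E) and is not reproduced
here; as one plausible reading of "proportionally more weight," the sketch below sets
each monitor's weight proportional to its observed concentration:

    def monitor_weights(observed_peaks):
        """Weights proportional to each monitor's observed concentration,
        normalized to sum to one.  Illustrative only; the protocol's actual
        formula may differ."""
        total = sum(observed_peaks.values())
        return {site: peak / total for site, peak in observed_peaks.items()}

    # e.g. monitor_weights({"M1": 480.0, "M2": 350.0, "M3": 210.0}) gives M1 the
    # largest and M3 the smallest share of the weighting.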
The scoring scheme used for most performance statistics
consists of a percentage of maximum possible points within specified cutoff
values. If the performance statistic falls outside of the cutoff values,
35
-------
no points would be awarded to the model. Within the acceptable range, the
percent of possible points is specified in tabular form (discrete values for
specified ranges of performance). The tabular values for the bias statistics
favor the model that overpredicts, if one model overpredicts to the same
extent that the other model underpredicts.
Scores for each model for Phase I and Phase II are determined
by totalling the subscores for each performance statistic. For each Phase,
the PRAQM is deemed to be the better performer if its score exceeds the score
obtained for Complex I by 10 percent. If Phase II leads to a selection of
the same model as Phase I, this will be the model for future regulatory use
in Guayanilla. If Phase II leads to a selection of a different model, air
quality data will be collected for an additional six month period at the
eight monitoring sites.
If for both Phases I and II the PRAQM model has a point score
at least 10 percent higher than Complex I, it will be considered the preferred
model for use in the Guayanilla Basin.
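The selection logic described in the last three paragraphs can be summarized in a short
sketch (Python; the score variables are hypothetical):

    def phase_choice(score_praqm, score_complex1, margin=0.10):
        """Model of choice for one phase: PRAQM only if its score exceeds the
        Complex I score by more than the ten percent margin."""
        return "PRAQM" if score_praqm > (1.0 + margin) * score_complex1 else "Complex I"

    def guayanilla_decision(phase1_scores, phase2_scores):
        """Two-phase selection rule; each argument is a (PRAQM, Complex I) score pair."""
        choice1 = phase_choice(*phase1_scores)
        choice2 = phase_choice(*phase2_scores)
        if choice1 == choice2:
            return choice1    # confirmed: the model for future regulatory use
        return "collect an additional six months of data at the eight sites"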
Concentration estimates from the model with the highest score
are to be adjusted upward if the highest observed concentrations are signifi-
cantly underpredicted. The procedure, which is based on the unpaired in
time and space comparisons, is as follows:
(1) If the average of the highest ten predicted 3- and
24-hour average concentrations is less than the average of the highest ten
observed 3- and 24-hour average concentrations, or
(2) If the highest, second-highest predicted 3- or 24-hour
average concentration is less than ninety (90) percent of the highest, second-
highest observed 3- or 24-hour average concentration,
36
-------
then the model predictions will be linearly adjusted to correct for this
regulatory problem. The adjustment factors will be calculated as the
minimum needed to eliminate the two conditions of underprediction listed
above. If Phase II of the evaluation confirms the selection of the model
determined by Phase I, but there is a difference in terms of whether an
adjustment is warranted or different adjustments are indicated, the adjust-
ment that is most conservative (leads to the most stringent emission limit)
will be selected.
2.5.4 Data Bases for the Performance Evaluation
The data base for the Phase I performance evaluation consists
of two years of data from an existing 4-station monitoring network and an
on-site meteorological tower. The data base for Phase II consists of six
months of data from an 8-station network including the original four monitors
plus four additional monitors situated to better cover the area of predicted
maximum concentration and to fit the requirements of the protocol. The
locations of the monitors are indicated in Figure 2-5. Data from the same
meteorological tower are used in Phase II.
Sensors are mounted on a meteorological tower, located
near the PREPA plant, to collect wind speed, wind direction and temperature
data at 10 and 76 meters. Wind data from 76 meters will be scaled to plume
height with the 10-meter data used as backup. Wind fluctuation (sigma
theta) data collected at 10 meters will be used to determine Pasquill-Gifford
stability class for both models according to the scheme described in the
Regional Workshop on Air Quality Modeling: A Summary Report.
Periods of missing data will be eliminated from the performance evaluation.
Climatological average daily maximum and minimum mixing heights will be
37
-------
used. Hourly emission data and stack gas parameters are to be generated from
load levels, fuel consumption rates, fuel sampling and other surrogate
parameters that are technically defensible.
At the present time data collection from Phase II is still
underway and no results from either the Phase I or Phase II performance
evaluation are available.
2.6 Other Protocols
In addition to the five major performance evaluation analyses and
protocols discussed above, EPA is aware of three other analyses/protocols
written to assess the acceptability of proposed models for specific sources.
For one reason or another these efforts never reached fruition, i.e. no
decisions were made, or are intended to be made, on emission limits based on
the chosen model. Brief descriptions of these three efforts are provided
below.
2.6.1 Example Problem
One such effort is the example problem which illustrates
the use of the Interim Procedures for Evaluating Air Quality Models (Revised)
and is included as Appendix B to that document. This narrative example was
based on 1976 emissions data from the Clifty Creek power plant in Indiana
and 1976 S02 ambient data from a 7-station network in the vicinity of the
plant.
The narrative example was specifically designed to illustrate
in a very general way the components of the decision making process and the
protocol for performance evaluation. As such, the preliminary technical/
regulatory analysis of the intended model application, while included in
the example, was significantly fore-shortened from that which would normally
be needed for an actual case. Also, since the evaluation was carried out
38
-------
on an existing data base, the example did not illustrate the design of the
field measurement program required to obtain model evaluation data.
The example problem protocol incorporated a broad spectrum
of performance statistics with associated weights. The number of statistics
contained in the example was broader than needed for most performance evaluations,
and perhaps even for the problem illustrated. Thus the example was not intended
to be a "model" for actual situations. For an individual performance evalua-
tion it was recommended that a subset of statistics be used, tailored to the
performance evaluation objectives of the problem. Similarly, the method
used to assign scores to each performance statistic (non-overlapping confi-
dence intervals) was not intended to be a rigid "model" but only an illustration
of one of several possible techniques to accomplish the goal.
2.6.2 Gibson Power Plant
In May 1981, Public Service Company of Indiana (PSI)
submitted to the Indiana Air Pollution Control Division (IAPCD) a report16
which outlined proposed procedures for conducting a comparative performance
evaluation of models applicable to setting the SO2 emissions limit for the
Gibson power plant. PSI proposed to establish a monitoring network (actually
augment an existing network), the data from which would be used to establish
whether either of two versions of the MPSDM model would be more appropriate
to apply to the plant than EPA's CRSTER model. The report contained an
incomplete performance evaluation protocol that would be used in the evaluation.
Following submittal of this report, a series of negotiations
on the protocol and the monitoring network took place between PSI and IAPCD.
Some of these negotiations involved EPA. In July 1981, IAPCD accepted the
PSI plan, but EPA continued to express major concerns about the technical
aspects of the proposed models, the monitoring network and the
39
-------
protocol. These concerns were not resolved and in June and August 1982
EPA sent letters to PSI17,18 cautioning the company that the Agency could not accept
the results of the performance evaluation, if the company chose to proceed.
Apparently PSI proceeded with the evaluation and collected
the one year of data from the network. The outcome of the evaluation is
unknown at the present time.
2.6.3 Homer City Area
In November 1982, the Pennsylvania Electric Company (Penelec)
submitted to the State of Pennsylvania Department of Environmental Resources
(DER) a report, "Protocol for the Comparative Performance Evaluation of the
LAPPES and Complex I Dispersion Models Using the Penelec Data Set."19 The
company's intent was to execute the protocol and demonstrate the acceptabil-
ity of the LAPPES model in the Homer City, Pennsylvania area so that this
model could be used to revise SC>2 regulations for four area power plants.
The plants, which have varying stack heights, are located in moderately
complex terrain with receptors of concern located both above and below the
heights of the stacks.
The protocol was reviewed by DER and by EPA and a number of
comments/suggestions were provided to Penelec. The most significant comment
involved the choice of Complex I as the only reference model. An examination
of the topography in relationship to stack heights in the area revealed
that many of the monitors (and most of the terrain) were below most of the
physical stack heights. In fact, when expected plume rise was considered,
only the Seward plant, because of its relatively short stack, exhibited a
real risk of direct, stable plume impaction on terrain; the Conemaugh plant
was somewhat marginal in this regard. From an overall performance evaluation
standpoint, this resulted in a dilemma. Some of the monitors were considered
40
-------
"flat terrain" receptors for which CRSTER was the appropriate reference
model while some were complex terrain sites where Complex I might be appro-
priate. An added complexity was that, because of varying stack heights,
some monitors might be both flat terrain and complex terrain receptors
depending on which power plant was being modeled. Thus, Complex I was not
the appropriate model for all monitors, as proposed in the protocol, and it
would likely underestimate concentrations at receptors that are below stack
height.
Although the protocol has been executed,20 the issue regarding
the choice of an appropriate reference model(s) has apparently never been
resolved.
41
-------
42
-------
3.0 INTERCOMPARISON OF APPLICATIONS
In this section, the details of the five major applications of the
Interim Procedures are intercompared. Each subsection below corresponds
roughly, and in the same order, to Sections 2, 3 and 4 (and their sub-
sections) of the Interim Procedures for Evaluating Air Quality Models
(Revised). It is also possible to identify the subsections below with
sequential blocks in the flow chart for the Interim Procedures provided
in Figure 1-1 above. In this way it is possible to analyze the five
applications according to subject matter as it appears in the Interim
Procedures.
In the subsections below the common and differing features among the
five major applications are described. Where appropriate, these features
are compared to recommendations/suggestions contained in the corresponding
section/subsection of the Interim Procedures and similarities/differences
are noted.
The material contained in this section is intended to be factual,
i.e. additional interpretation or opinion is generally avoided. Any inter-
pretations and/or opinions that are provided are only intended to reflect the
views, or apparent views, contained in the individual protocols and related
documents.
3.1 Preliminary Analysis
The Interim Procedures recommend that before any performance
evaluation protocol is written or any performance data are identified/
collected, the applicant should conduct a thorough preliminary analysis
of the situation. This analysis serves to describe the source and its
environment, the regulatory constraints, the proposed and reference
43
-------
models, the relative technical superiority of one model over the other,
and the ambient consequences of applying each model to the
regulatory problems.
In each of the five applications such a preliminary analysis was
conducted, although the level of detail and the degree of adherence to
the recommended procedures varied considerably, as discussed in Sections
3.1.1 - 3.1.4 below.
3.1.1 Regulatory Aspects
The Interim Procedures suggest that the applicant first
identify the regulatory aspects of the problem, i.e. the pollutants,
averaging times and applicable regulations.
In each of the five applications this portion of the preliminary
analysis was quite thoroughly covered. In all cases the sources were large
SO2 emitters and the NAAQS were identified as the controlling regulations,
i.e. PSD increments or other requirements were not at issue. Compliance
with the annual standard was generally not a concern. An exception of sorts is the Baldwin
power plant, where it was established that the proposed model, if found
to be acceptable, would only be used to address the 3-hour averaging time.
3.1.2 Source Characteristics
The Interim Procedures recommend that the preliminary analysis
be accompanied by a complete description of the source and its environment.
Table 3-1 compares the source characteristics described in the five
protocols. The Table shows that power plants account for the SO2
-------
[Table 3-1, which compares the source characteristics described in the five
protocols, is not legible in this copy of the report.]
-------
emissions with the exception of Westvaco. Most evaluations involve,
effectively, a single tall stack. The Guayanilla evaluation is the only
evaluation involving a true multiple stack situation. In all cases except
Baldwin, complex terrain is a major consideration with nearby terrain well
above stack height(s). All of the sources are in a rural environment and
most are isolated from any neighbors, i.e. the contribution from nearby
sources is not considered to be significant. The exceptions are Warren,
where a nearby plant is to be explicitly modeled, and Lovett, where the con-
tribution from the nearby Bowline power plant will be determined from
monitoring data.
3.1.3 Proposed and Reference Models
The Interim Procedures state that for each evaluation it
is highly desirable to choose a proposed and a reference model applicable
to the situation. (For cases where no reference model can be identified,
the Interim Procedures suggest an alternative approach that can be used to
determine acceptability of the proposed model.) It is further recommended
that each model be well documented, by a user's manual if possible. The
technical features of the competing models should be intercompared, preferably
using techniques described in the Workbook for Comparison of Air Quality Models.
Table 3-2 lists the proposed and reference models for each
of the five evaluations and the degree to which these models are documented
and intercompared. The Table shows that each evaluation involves a different
proposed model. In the case of Lovett there are two proposed models. The
Complex I model is most often used as the reference model. All of the
preliminary analyses contain technical descriptions of the models to be
evaluated as well as technical/descriptive comparisons of the relevant
-------
[Table 3-2, which lists the proposed and reference models for each of the five
evaluations and the degree to which those models are documented and
intercompared, is not legible in this copy of the report.]
47
-------
features of the models. In only one case was the "Workbook" comparison
rigorously applied. Explicit, up to date, user's manuals were most often
not available. In some cases such manuals did exist but were not up-to-date
with the version of the models to be used in the evaluation.
3.1.4 Preliminary Concentration Estimates
The Interim Procedures suggest that preliminary concentration
estimates be obtained from both the proposed and the reference models, as
an aid to writing the protocol and designing the requisite data bases.
In the three most recent protocols (Warren, Lovett and
Guayanilla) such estimates were made, although they are not well documented
for Guayanilla. In the Baldwin and Westvaco evaluations, it is not evident
that any formal estimates were made although both the source and the control
agencies had a good idea of the consequences (location and magnitude of
high estimates) of applying the models.
3.2 Protocol for the Performance Evaluation
The Interim Procedures require that a protocol be prepared for
comparing the performance of the reference model and proposed model. The
protocol must be agreed upon by the applicant and the appropriate regulatory
agencies prior to collection of the requisite data bases.
In each of the five cases such a protocol was written, negotiated
with the control agencies, and a final protocol to be used in the evaluation
was established. The relative details of the various protocols are compared
in the following subsections, 3.2.1 - 3.2.5. The degree to which the
negotiating parties were in full agreement that the final established
protocol was optimum is discussed in Section 3.4.
-------
3.2.1 Performance Evaluation Objectives
The Interim Procedures suggest that the first step to
developing a model performance protocol is to translate the regulatory pur-
poses associated with the intended model application into performance evalua-
tion objectives which, in turn can be linked to specific data sets and
performance measures. Ranked-order performance objectives are suggested with
the primary objective focussing on what is perceived to be (from the preliminary
analysis) the critical source-receptor relationship, i.e. the averaging time,
the receptor locations, the set(s) of meteorological conditions and the source
configuration that are most likely associated with the design concentration.
Lower-order objectives, e.g. second-order, third-order, etc., would focus on
other source-receptor relationships which must be addressed when the chosen
model is ultimately applied to the situation, but are not perceived to be of
prime importance (not as likely to be associated with a design concentration)
when the chosen model is applied.
In the five protocols, specific sets of ranked-order
objectives were not stated, at least in the sense described above. However,
it is apparent from the choices of data sets and performance measures, the
weighting of data sets/performance measures, the sometimes-used differential
weighting of individual monitor data, and the scoring schemes employed in
the protocols, that the writers had such ranked-order objectives implicitly
in mind. Most of the protocols explicitly stated a single broad objective
which focuses on an accurate prediction of peak short-term concentrations.
These statements were generally not narrowed down to include specific recep-
tor locations, the importance of time pairing or critical meteorological
conditions. However, as mentioned above, it is evident from the protocols'
contents that these single broad performance objective statements really did
implicitly contain sets of ranked-order specific objectives.
49
-------
3.2.2 Data Sets, Averaging Times and Pairing
The Interim Procedures mention a number of possible data
sets which can be considered but make no specific recommendation as to the
choice of data sets for an individual situation.
Table 3-3 compares the data sets contained in each of the
five major protocols and the weighting (percent of maximum possible points)
of each data set. The protocols are arranged roughly chronologically
across the top of the table, in the order in time when each was finalized,
to see if there are any trends in the choices of data sets or weighting.
No obvious pattern is apparent. It is clear from Table 3-3 that each of
the protocols focuses on the common broad performance evaluation objective
of accurate prediction of peak short-term concentration. However, it is
obvious from the choice and weighting of data sets that the protocol
writers had different ideas on how to best meet that objective. Three of
the five protocols examined the highest observed/predicted concentration
data set as well as the second-highest data set. All of the protocols tested
the competing models against the second-highs and the high-25 set, although
there were considerable differences in the weighting among the protocols.
Two protocols specify that some performance statistics will be calculated
for all data but the weighting of this data set is lower than the peak/high-25
data sets.
The Interim Procedures suggest that performance of models whose
basic averaging time is shorter than the regulatory averaging time should
be evaluated for that shorter period as well as averaging times corresponding
to the regulatory constraints.* Since all five cases involved SO2 models
*Most models compute sequential concentrations at each receptor over a short
time average, e.g., 1-hour. Average concentrations for longer periods, e.g.,
3-hour, 24-hour, are arrived at by averaging the sequential short-term values.
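As an illustration of the footnote above, the longer averaging times are obtained by
averaging non-overlapping blocks of the sequential 1-hour values (a minimal sketch;
the hourly array is hypothetical):

    import numpy as np

    def block_averages(hourly, block_hours):
        """Form longer-period averages (e.g. 3-hour, 24-hour) from a sequential
        series of 1-hour concentrations by averaging non-overlapping blocks."""
        hourly = np.asarray(hourly, dtype=float)
        usable = len(hourly) - len(hourly) % block_hours   # drop a trailing partial block
        return hourly[:usable].reshape(-1, block_hours).mean(axis=1)

    # e.g. block_averages(hourly_so2, 3) and block_averages(hourly_so2, 24) give the
    # 3-hour and 24-hour series evaluated in most of the protocols.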
-------
Table 3-3 Weighting (%) of Maximum Possible Points by Data Set, Averaging
Time and Degree of Pairing

                      BALDWIN   WESTVACO   WARREN   LOVETT   GUAYANILLA
DATA SET
  MAXIMUM                  0         21        15        0          12
  SECOND HIGH             55         22        28       20          15
  HIGH 25                 45         57        43       58          74
  ALL DATA                 0          0        14       22           0
AVERAGING TIME
  1-HOUR                   0         19        20       36          34
  3-HOUR                 100         37        30       36          27
  24-HOUR                  0         37        36       26          39
  ANNUAL                   0          7        14        2           0
PAIRING
  UNPAIRED                70         32        60       44          23
  SPACE ONLY               5         62        40       34          77
  TIME ONLY                0          3         0       21           0
  SPACE AND TIME          25          3         0        1           0
51
-------
whose basic averaging time is one hour, this would suggest that 1-hour
statistics be calculated as well as 3-hour, 24-hour, etc. Table 3-3 shows
that all of the protocols except Baldwin specify that performance statistics
should be calculated for 1-, 3- and 24-hour averaging times. For Baldwin
it was established up-front that the proposed model, if selected, would only
be applied for the 3-hour averaging time. This may be the reason why statis-
tics are not to be calculated for other averaging times, including 1-hour.
Computation of the annual concentration is not a significant issue in any
of the cases. This is apparently the reason that low or no weight is given
to statistics for that averaging time.
Weighting may also be distributed according to performance
statistics calculated for data paired in space, time, both space and time
or completely unpaired. The Interim Procedures discuss the various possible
degrees of pairing associated with each data set but make no specific recom-
mendation as to which to choose or how to distribute the weighting. Instead, the
Interim Procedures suggest that through the development of performance evalua-
tion objectives, pairing can be identified.
Table 3-3 also shows the weighting of maximum possible points
according to the degree of pairing specified in each of the five protocols.
Since detailed performance evaluation objectives are generally lacking for
these protocols, it is difficult to establish a rationale for the seemingly
significant variation of weighting among the protocols. In each evaluation
a relatively isolated point source of SO2 controlled the short-term ambient
SO2 levels in its vicinity. Thus it is not very important that the models
predict the concentration in time and space accurately; only the magnitude
is of importance. This suggests that completely unpaired performance statis-
tics would be of prime importance. Table 3-3 shows that unpaired statistics
52
-------
were important in all five protocols but the weighting and degree of importance
vary significantly. In fact, in the Westvaco and Guayanilla protocols, data
paired in space only seem to be regarded as the most pertinent. Although
specific rationales are generally lacking, it appears that the protocol writers
were concerned with model credibility. Credibility in model performance can
be linked to the ability of the models to reproduce measured concentrations
at the right place, right time and perhaps both. This explains (perhaps)
the varying degree of pairing.
3.2.3 Performance Measures
The Interim Procedures state that the basic tools used
in determining how well a model performs in a given situation are performance
measures. These performance measures are viewed as surrogate quantities
whose values/statistics serve to characterize the discrepancy between
predictions and observations.
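For reference, the most commonly used of these measures (residual statistics,
predicted-to-observed ratios, variance ratio and correlation; see the legend to
Table 3-4) can be computed from paired arrays of predictions and observations as in
the sketch below. The residual is taken here as observed minus predicted, following
the Baldwin scoring scheme in Appendix A; this is for illustration only.

    import numpy as np

    def performance_measures(pred, obs):
        """Basic performance measures of the kind listed in Table 3-4,
        computed from paired predicted and observed concentrations."""
        pred, obs = np.asarray(pred, float), np.asarray(obs, float)
        d = obs - pred                       # residuals (observed minus predicted)
        return {
            "mean_residual": d.mean(),                   # bias
            "sd_residual": d.std(ddof=1),                # Sd, noise
            "rmse_residual": np.sqrt((d ** 2).mean()),   # RMSEd, gross variability
            "ratio_of_means": pred.mean() / obs.mean(),  # Cp/Co
            "variance_ratio": pred.var(ddof=1) / obs.var(ddof=1),  # Sp2/So2
            "correlation": np.corrcoef(pred, obs)[0, 1],            # R
        }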
Table 3-4 lists, by data set, the various performance
measures used in the protocols for characterizing performance for that data
set. From an overall perspective the Table seems to indicate that, while
there are some similarities, there are also a wide variety and combinations
of performance measures used among the protocols. Each protocol seems to
contain a more or less unique combination of measures used to characterize
performance and this combination often differs from those suggested in the
Interim Procedures. Some of the protocols contain certain performance
measures not mentioned in the Interim Procedures. For example, three of
the protocols contain a performance measure, Mc*, designed to test the models'
*MC is not a unique performance measure but refers to schemes for quantifying
this type of performance which differ among the various protocols. See
Appendices B, D and E for details.
53
-------
Table 3-4 Performance Measures Used in the Protocols

[The body of this table, which lists by data set (peak values - high and second
high, high-25, and all data) the performance measures used in each of the five
protocols (Baldwin, Westvaco, Warren, Lovett, Guayanilla), is only partially
legible in this copy of the report.]

d = Residual
Sd = Standard deviation of residual
Sp2 = Variance of predicted concentrations
So2 = Variance of observed concentrations
F = Frequency distribution
R = Correlation coefficient
Mc = Meteorological cases in common
RMSEd = Root-mean-square error of the residual
54
-------
capability to reproduce observed concentrations during observed meteorological
conditions. Note also that certain performance measures such as the correla-
tion coefficient, the variance of the residuals and statistics on the frequency
distribution were not widely used in the protocols. Where they were used,
they were not weighted heavily.
One specific point revealed by Table 3-4 concerns the use
of performance measures that characterize the model bias. The Interim
Procedures suggest that model bias is an important quantity in performance
evaluations and that the model residual is an appropriate measure to charac-
terize the bias. In the earlier protocols, Baldwin and Westvaco, the
model residual was used exclusively in this regard. However, in time, as
indicated by the more recent protocols on the right side of Table 3-4,
the residual is used less frequently or not at all. Instead, various
combinations of the ratio of the predicted to observed concentration are
used to characterize the bias. No clues/rationale are contained in these
recent protocols that suggest a reason for using ratios instead of
residuals. At the time that these protocols were negotiated with the
control agency, no significant objections were apparently raised over the
use of ratios in lieu of residuals.
3.2.4 Model Performance Scoring
One of the more difficult aspects of writing a performance
evaluation protocol is devising a scheme which, for each performance measure
or other surrogate measure, objectively quantifies the degree to which the
model reproduces measured concentrations. The specification of the details
of this concept, which is called "scoring" in the Interim Procedures, lacks
a clear technical basis or a basis in past experience. The Interim Procedures
55
-------
recognize this lack of guidance and invite the use of innovative schemes,
although the use of confidence intervals is mentioned as one such possible
scheme.
The lack of guidance in this area is well reflected by the
wide variety of scoring schemes that are specified in the various protocols.
In fact, each protocol generally contains several different schemes in
itself. No attempt is made here to intercompare the details of the various
scoring schemes employed; the reader is referred to Appendices A-E and the
specific protocols in the Section 5.0 References for these details.
In general, most of the schemes ultimately generate what
might be termed a performance factor. The performance factor is either a
measure or an indicator of how well the model performs in relation to
measured data. The method (usually formulae) used to arrive at the perfor-
mance factor depends on the specific measure of performance (residual, ratio,
correlation coefficient, etc.) and varies widely among the protocols. The
performance factor, once obtained, is either multiplied by the maximum
possible points to obtain a subscore for that performance measure, or a
table is entered which provides point subscores for specific ranges of the
performance factor. The tables most often have "cutoff values" above or
below which a zero subscore is specified. For measures of the bias, the
table is sometimes skewed in favor of overprediction, i.e. a given amount
of overprediction is awarded more points than the same degree of underpre-
diction.
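A hypothetical example of such a table-lookup scheme is sketched below; the ranges,
point fractions and cutoff values are illustrative only and are not taken from any of
the protocols. It shows how the table can be skewed so that overprediction is
penalized less than the same degree of underprediction:

    def table_subscore(bias_ratio, max_points):
        """Illustrative table-lookup scoring for a bias measure expressed as the
        ratio of predicted to observed concentration.  All numbers are hypothetical."""
        table = [
            (0.95, 1.20, 1.00),   # near-perfect to modest overprediction: full points
            (1.20, 1.50, 0.75),   # larger overprediction: partial credit
            (0.80, 0.95, 0.50),   # comparable underprediction: less credit
            (0.60, 0.80, 0.25),
        ]
        for low, high, fraction in table:
            if low <= bias_ratio < high:
                return fraction * max_points
        return 0.0                # outside the cutoff values: no points

    # e.g. table_subscore(1.3, 20) -> 15.0, while table_subscore(0.77, 20) -> 5.0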
Once the various subscores are obtained, they are, in each
protocol, summed to obtain a total score for each model. In some cases,
the performance factor mentioned above involves performance statistics from
both models. Thus in these cases the scores obtained for each model are
-------
not truly independent indicators of how well the model performs relative to
measured data but contain some elements of relative performance between
models. In any event, the scores for each model are then compared to obtain
a preliminary indication of which model is the better performer.
At this point the Interim Procedures suggest that, among
other things, it might be desirable to define a "window" of marginal perfor-
mance. If the apparently better performer falls in the window then the
results of the technical evaluation could be used to arrive at a final
decision. In only one of the protocols, Guayanilla, is the window concept
used and in that case it is merely stated that the proposed model, if it
receives a higher score, will not be chosen unless that score exceeds that
of the reference model by ten percent.
The Interim Procedures also suggest that it might be
undesirable to apply the chosen model should it be shown to underpredict
critical high concentrations. In this case it is suggested that the chosen
model be "corrected" or "adjusted" to the degree which it apparently under-
predicts. In the three most recent protocols (Warren, Lovett, and Guayanilla)
this concept is employed, although the details on the criteria for and the
method of correcting the model estimates vary.
3.3 Data Bases for Performance Evaluations
The Interim Procedures suggest that three types of data bases can
be used for performance evaluation purposes, data from an on-site specially
designed network, data from an on-site tracer experiment and, rarely, data
from an off-site network. The five performance evaluations utilize data
from an on-site network of SO2 monitors and other instruments. Table 3-5
shows that three of those networks were specially designed for the performance
57
-------
Table 3-5 Data Bases For Performance Evaluations

DATA                   BALDWIN        WESTVACO     WARREN          LOVETT       GUAYANILLA
TYPE OF NETWORK        SPECIAL        EXISTING     SPECIAL         SPECIAL      PHASE I - EXISTING
                       DESIGN                      DESIGN          DESIGN       PHASE II - SPECIAL
                                                                                DESIGN
NO. OF MONITORS        10             9            7               11           PHASE I - 4
                                                                                PHASE II - 8
LENGTH OF DATA         1 YEAR         1 YEAR       1 YEAR          1 YEAR       PHASE I - 2 YEARS
RECORD                                                                          PHASE II - 6 MONTHS
NO. OF ON-SITE         1              2            2               5            1
MET. TOWERS*
ON-SITE MET. DATA      WD,WS,WF,T     WD,WS,WF,T   WD,WS,WF,T      WD,WS,WF,T   WD,WS,WF,T
OFF-SITE MET. DATA     NWS: MXHT &    NONE         NWS: MXHT &     NWS: MXHT    CLIMATOLOGICAL
                       STABILITY                   MISSING DATA                 MXHT
                                                   PERIODS
EMISSIONS DATA         LOAD LEVEL/    IN-STACK     LOAD LEVEL/     IN-STACK     LOAD LEVEL/
                       FUEL SAMPLES   DATA         FUEL SAMPLES    DATA         FUEL SAMPLES

WD = Wind Direction
WS = Wind Speed
WF = Wind Fluctuation or Turbulence Intensity
T = Temperature
NWS = National Weather Service
MXHT = Mixing Height
* Data from only one primary tower are used, except for data substitutions when the
  primary data source is not operating
-------
evaluation. The Westvaco protocol utilizes data from a network that was
originally designed to monitor compliance of the source with the NAAQS. As
pointed out in Section 2.2, this network was judged to be acceptable for
performance evaluation purposes. In the Guayanilla protocol, the existing
network of four monitors was judged to be only marginally adequate for
performance evaluation purposes. Thus there will be a Phase II performance
evaluation, where six months of data from an augmented, specially designed
network are to be utilized.
The Interim Procedures recognize that the number and spatial coverage
of monitors involves a tradeoff between the scientific desire for wide coverage
with a dense array and the practical constraints of cost and logistics. In
any event the requisite network must have sufficient spatial coverage/density
to address important source-receptor relationships identified in the preliminary
analysis and to meet the needs of the protocol. Table 3-5 shows that the
networks to be utilized for performance evaluations contain about the same
number of ambient monitors, i.e. ranging from 7 to 11. Further investigation
of these networks reveals that in each of them nearly all of the monitors
are fairly densely clustered in the area of expected maximum concentration
with one or two monitors, generally to be used for assessing background,
located well outside of this area.
The Interim Procedures suggest that a 1-year data collection period
is normally the minimum in order to calculate performance statistics that
are related to the NAAQS, i.e. the high, second-high concentration. Table
3-5 shows that lengths of record ranging from one to 2-1/2 years will be used
for performance evaluation purposes.
In all of the performance evaluations the primary source of meteoro-
logical data is from an on-site network. Although some of the networks con-
59
-------
tain multiple towers (See Table 3-5), none of the models to be considered
in the evaluations is capable of utilizing spatially divergent meteorolo-
gical data inputs. Thus, meteorological data inputs to the models are
pre-specified to be from a single tower, with other stations used as backup
for missing data periods. In most cases, on-site wind fluctuation data
(sigma-theta) are to be used either as direct input to the models or as a
means to categorize stability. Mixing heights are usually derived from
off-site National Weather Service temperature sounding data. On-site
temperature data are sometimes used to interpolate hourly values of the
mixing height.
The Interim Procedures recommend in-stack instrumentation as
the preferred data source for deriving hourly emissions and
values of stack gas parameters. Table 3-5 shows that such in-stack instru-
mentation is or will be in place at Westvaco and Lovett. The other three
performance evaluations derive emissions and stack data from surrogate
measures such as fuel analyses and documented load level information.
3.4 Negotiation of the Procedures to be Followed
The Interim Procedures strongly recommend that the applicant (source)
maintain close liaison with the reviewing agency at the beginning and through-
out the project. It is especially important that the protocol and design
of the monitoring network be negotiated and agreed upon before any data are
in-hand. In each of the five cases, such negotiations took place. These
negotiations generally took place at two points in time: (1) in advance
of any work on the project itself where the need to do a comparative model
evaluation was identified as an acceptable way to resolve differences of
opinion on model acceptability and (2) after a draft protocol was written
and the proposed network was designed or identified.
60
-------
Table 3-6 lists the major components of the model evaluation process
as identified in the Interim Procedures and as discussed in Sections 3.1-3.3
above. For each of the five cases the Table indicates whether that compo-
nent was a significant or minor issue in the negotiation process. A signi-
ficant issue is defined as one where there was a significant difference of
opinion between the source and the control agency or, in some cases, between
control agencies. A minor issue is one where the source did not strongly
object to the control agency's request for changes or additions to the
analyses, protocol or data base collection plans. (A minor issue may have
resulted in a significant amount of additional analysis). If no entry is
made in the Table it indicates that there was no issue or that the compo-
nent was apparently not discussed in the review process.
The Table shows that regulatory aspects and the design of the data
base network were significant issues common to all of the projects. The
resolution of these issues was, in fact, the decision to go ahead with the
model evaluation. The network design issues generally reflect Agency concerns
that monitors be located in areas of expected maximum concentration. It is
interesting to note that, in spite of the wide variation in the details of
the protocols, discussed in Section 3.2, there was apparently not much debate
on these details.
61
-------
Table 3-6 Issues Involved in Negotiations

[The individual entries of this table, which indicate for each of the five cases
(Baldwin, Westvaco, Warren, Lovett, Guayanilla) whether each component listed
below was a significant (S) or minor (M) issue in the negotiations, are not
fully legible in this copy of the report.]

PRELIMINARY ANALYSIS
  REGULATORY ASPECTS
  SOURCE & SOURCE ENVIRONMENT
  CHOICE OF PROPOSED MODEL
  DOCUMENTATION OF PROPOSED MODEL
  CHOICE OF REFERENCE MODEL
  PRELIMINARY ESTIMATES
PROTOCOL
  PERFORMANCE EVALUATION OBJECTIVES
  CHOICE OF DATA SETS
  CHOICE OF AVERAGING TIME
  DEGREE OF PAIRING
  CHOICE OF PERFORMANCE MEASURES
  WEIGHTING (DISTRIBUTION)
  WEIGHTING OF MONITORS
  SCORING
  ADDITIONAL CRITERIA (1)
DATA BASES
  NETWORK DESIGN
  NUMBER OF MONITORS
  CHOICE OF METEOROLOGICAL INPUTS

M = Minor difference of opinion
S = Significant difference of opinion
- = No difference of opinion stated

Footnote
1. Includes criteria to guard against underprediction of critical concentrations.
-------
4.0 FINDINGS AND RECOMMENDATIONS
The summaries and analyses of several major cases, which utilize
guidance contained in the Interim Procedures for Evaluating Air Quality Models,
lead to the following general findings. These findings parallel the basic
principles of the Interim Procedures listed in Section 1.2.
Finding 1. Up-front negotiations between the applicant and the
regulatory agencies on the nature of the protocol and design/utilization of
the data base network took place in each case. Up-front discussions on the
preliminary analysis did not always take place. This lack of early communi-
cation sometimes led to backtracking to fill in needed analyses.
Recommendation
In the interest of expediency, the applicant should initiate
frequent discussions with all of the control agencies that will be ultimately
involved in reviewing/approving the evaluation. Based on experience, it is
especially important that discussions take place before the preliminary
analysis is conducted, so that the applicant can provide all of the relevant
information required for the case.
Finding 2. For each case a detailed protocol for performance evaluation
was written.
Recommendation
Establishing an up-front protocol has worked very well as the central
mechanism for decision-making on the appropriate model. This should be
continued.
Finding 3. For each case an on-site data base network was established or
identified as meeting the needs of the protocol and the technical/regulatory
requirements. In three of the evaluations a network was specially designed
to meet these needs. In one evaluation the existing network was augmented
63
-------
to meet these needs. In one evaluation the existing network was judged
to be adequate without any modification.
Recommendation
It is clear from experience that it is highly important to establish
the design of the data base network before any data are collected or at
least before any data are available to the user. This practice should be
continued.
Finding 4. Details of the protocol and network design were well
documented in each case. However, details of the preliminary analysis
and the negotiation process were not always well documented.
Recommendation
It has become increasingly obvious that the preliminary analysis,
especially the preliminary concentration estimates, plays an important
role in the design of the protocol and the data base network. In the
interest of avoiding misunderstanding, complete documentation of this
preliminary analysis is strongly recommended.
Finding 5. For the two cases where the evaluations have been completed,
the decision on the appropriate model was made as prescribed in the
protocol.
Recommendation
The execution of an established protocol leading to a rationalized
decision on the more appropriate regulatory model is a basic premise in
the Interim Procedures. This practice should be continued.
Other more specific findings and recommendations are:
Finding 6. Each of the five protocols involved large point sources
of SO2 where attainment of the short-term NAAQS was at issue. Four of
the sources were located in complex terrain.
-------
Recommendation
The use of the Interim Procedures over a broader range of model
evaluation problems is encouraged.
Finding 7. In each of the five cases a technical description of the
proposed model was provided. However, a rigorous technical comparison of
the proposed and reference models, according to procedures outlined in
the Workbook for Comparison of Air Quality Models, was not generally
performed. Also, user's manuals on both the proposed and reference models
were generally not available.
Recommendation
The results of rigorous application of the "workbook" procedures
have not been used as decision criteria for any of the cases covered in
this report. However, it is important that the technical features of the
competing models be compared and the workbook provides a good framework
for making such comparisons. Thus its continued use, at least in the
latter regard, is encouraged.
Either a self-documenting code or a user's manual should be provided
for each model under consideration. All data bases used in the evaluation
should be provided.
Finding 8. Preliminary estimates of expected concentration levels
were made in some cases; these results were not always well documented.
Recommendation
Preliminary estimates should be submitted in all future applications
of the Interim Procedures and the results of these estimates should be
documented in the form of isopleth maps and tables as well as descriptive
material that interprets the results.
65
-------
Finding 9. Detailed performance evaluation objectives were generally
not established before writing the protocols.
Recommendation
It is believed that the development and submission of detailed
performance evaluation objectives should lead to logical and perhaps more
uniform choices of performance measures, averaging times, pairing and weight-
ing in the protocol. Then the rationale for the choices will be explicit
to the reviewer, and this should facilitate any negotiation. Thus the
use of detailed performance evaluation objectives is encouraged.
Finding 10. A wide variation in the choice of data sets, averaging
times, pairing, performance measures, and weighting is evident among the
protocols.
Recommendation
While EPA is not necessarily concerned about these wide variations
at this time, it is important that the rationale for the choices be
documented; the recommendation regarding performance evaluation objectives,
above, is one way to establish this rationale.
Finding 11. Similarly, a wide variety in the schemes used for
objectively determining the degree to which the models reproduce the
measured concentration (scoring) is evident.
Recommendation
Same as Item (10) above.
Finding 12. More recent protocols contain stipulations for adjusting
estimates from the chosen model, should that model be shown to underestimate
critical concentrations.
66
-------
Recommendation
The use of model "adjustment factors" to take care of model
underestimates was a result of EPA's concerns. While the "adjustment
factor" approach is acceptable for the time being, the development of
more innovative and more scientifically defensible schemes to address
underestimates is encouraged.
Finding 13. The data bases to be used in the performance evaluations
consist of networks of 7 to 11 monitors, primarily clustered in the area
of expected maximum concentration. Meteorological data from on-site towers
are generally to be used in the evaluations.
Recommendation
These limited monitoring networks were acceptable because the areal
and temporal extent of the critical source-receptor relationships in the
five protocols was very limited. In many cases it may not be possible to
establish a priori these critical source-receptor relationships. In such
cases more monitors might be required.
The need for representative meteorological data is critical to the
performance of the models. To ensure that this need is met, the practice
of collecting on-site meteorological data, commensurate with the model's
input requirements, is encouraged.
Finding 14. Emissions data are either derived from in-stack instrumentation
(two cases) or from surrogate measures such as fuel samples, load levels, etc.
(three cases).
Recommendation
The use of surrogate data such as fuel sampling, load levels, etc.
leads to considerable uncertainty in emissions especially when coal fired
boilers or industrial process emissions are involved. The use of continuous
in-stack instrumentation is encouraged.
67
-------
5.0 REFERENCES
1. Environmental Protection Agency. "Guideline on Air Quality Models,"
EPA-450/2-78-027, Office of Air Quality Planning and Standards, Research
Triangle Park, NC 27711, April 1978.
2. Environmental Protection Agency. "Interim Procedures for Evaluating
Air Quality Models (Revised)," EPA-450/4-84-023, Office of Air Quality
Planning and Standards, Research Triangle Park, NC 27711, September 1984.
3. Fox, D. G. "Judging Air Quality Model Performance," Bull. Am. Meteor.
Soc. 62, 599-609, May 1981.
4. Illinois Power and the Illinois Environmental Protection Agency.
"Procedures for Model Evaluation and Emission Limit Determination for the
Baldwin Power Plant," June 1982.
5. Environmental Protection Agency. "Workbook for Comparison of Air
Quality Models," EPA 450/2-78-028a,b, Office of Air Quality Planning and
Standards, Research Triangle Park, NC 27711, May 1978.
6. Environmental Research & Technology, Inc. "Evaluation of MPSDM and CRSTER
using the Illinois EPA-approved Protocol and the Subsequent Emission Lim-
itation Study for the Baldwin Power Plant," Documents P-B881-100, P-B881-200,
Prepared for Illinois Power Company, July 1983 and July 1984.
7. Hanna, S., C. Vaudo, A. Curreri, J. Beebe, B. Egan, and J. Mahoney.
"Diffusion Model Development and Evaluation and Emission Limitations at
the Westvaco Luke Mill," Document PA439, Prepared for the Westvaco
Corporation by Environmental Research and Technology Inc., 696 Virginia Road,
Concord, MA 01742, March 1982.
8. Bowers, J. F., H. E. Cramer, W. R. Hargraves and A. J. Anderson. "Westvaco
Luke, Maryland Monitoring Program: Data Analysis and Dispersion Model Val-
idation," Final Report prepared for U.S. Environmental Protection Agency,
Region III by H.E. Cramer Company Inc., University of Utah Research Park,
Post Office Box 8049, Salt Lake City, UT 84108, June 1983.
9. Hanna, Steven B., Bruce A. Egan, Cosmos J. Vaudo and Anthony J.
Curreri. "An Evaluation of the LUMM and SHORTZ Dispersion Models Using
the Westvaco Data Set," Document PA-439, Prepared for the Westvaco
Corporation by Environmental Research & Technology, Inc., 696 Virginia
Road, Concord, MA 01742, November 1982.
10. Londergan, Richard J. "Protocol for the Comparative Performance
Evaluation of the LAPPES and Complex I Dispersion Models Using the Warren
Data Set," TRC Environmental Consultants, Inc., 800 Connecticut Blvd.,
East Hartford, CT 06108, November 1984.
11. Environmental Protection Agency. "Regional Workshop on Air Quality
Modeling: A Summary Report," EPA 450/4-82-015, Office of Air Quality Planning
and Standards, Research Triangle Park, NC 27711, April 1981.
69
-------
12. Environmental Research & Technology, Inc. "Protocol for the Evaluation
and Comparison of Air Quality Models for Lovett Generating Station," Docu-
ment P-B636-100, Prepared for Orange & Rockland Utilities, Inc., July 1984.
13. Environmental Protection Agency. Letter to Mr. Frank E. Fischer, Vice
President, Engineering, Orange & Rockland Utilities, Inc., EPA, Region II,
26 Federal Plaza, New York, NY 10278, August 30, 1984.
14. Environmental Research & Technology, Inc. "Validation of the Puerto
Rico Air Quality Model for the Guayanilla Basin," Document P-9050, Pre-
pared for Environmental Quality Board of Puerto Rico, November 1979.
15. Environmental Protection Agency. "Protocol for the Comparative Performance
Evaluation of the PRAQM and Complex I Dispersion Models in the Guayanilla
Basin," EPA Region II, 26 Federal Plaza, New York, NY 10278, August 1984.
16. Public Service Company of Indiana. "Plan for Field Study Leading to
Model Evaluation and Emission Limit Determination for the Gibson Generating
Station," Document P-A892, Environmental Research & Technology, Inc., 696
Virginia Road, Concord, MA 01742, May 1981.
17. Environmental Protection Agency. Letter to Mr. S. A. Ali, Public Service
Indiana from Environmental Protection Agency, Region V, 230 South Dearborn
Street, Chicago, IL 60604, August 10, 1982.
18. Environmental Protection Agency. Letter to Mr. S. A. Ali, Public Service
Indiana from Environmental Protection Agency, Region V, 230 South Dearborn
Street, Chicago, IL 60604, June 10, 1982.
19. Burkhart, Richard P. "Protocol for the Comparative Performance Eval-
uation of the LAPPES and Complex I Dispersion Models Using the Penelec
Data Set," Pennsylvania Electric Company, 1001 Broad Street, Johnstown, PA
15907, November 15, 1982.
20. Burkhart, Richard P., Richard J. Londergan, Richard A. Rothstein and
Herbert S. Borenstein. "Comparative Performance Evaluation of Two Complex
Terrain Dispersion Models," Preprint Paper No. 83-47.4, 76th Annual Meeting
of the Air Pollution Control Association, June 19-24, 1983.
70
-------
APPENDIX A
Protocol and Performance Evaluation Results
for
Baldwin Power Plant
A-l
-------
PERFORMANCE EVALUATION PROTOCOL AND FINAL SCORES FOR BALDWIN POWER PLANT

            PAIRING        PERFORMANCE      AVERAGING  MAXIMUM  SCORING        WEIGHTING             SCORES
DATA SET    SPACE  TIME    MEASURES         TIMES      POINTS   SCHEME (CODE)* INDIV.  DATA SET      MPSDM   CRSTER
Second-     Yes    Yes     d                3-hour       15        a            15       55            0.0     0.0
Highest     No     No      d                3-hour       40        a            40                    17.7    14.0
25-         Yes    Yes     d (bias)         3-hour        5        b             5       45            0.1     0.2
Highest     No     No      d (bias)         3-hour       15        b            15                    13.5     9.0
            Yes    Yes     Sd               3-hour        2.5      c             2.5                   1.7     1.8
            No     No      Sd               3-hour        5        c             5                     4.4     4.0
            Yes    Yes     RMSE             3-hour        2.5      c             2.5                   1.0     1.1
            No     No      RMSE             3-hour        5        c             5                     4.4     3.6
            No     No      No. of cases
                             in common      3-hour        5        d             5                     4.0     4.0
            Yes    No      Cumulative
                             frequency
                             distribution   3-hour        5        e             5                     4.5     4.0
                                            TOTAL       100                    100      100           51.3    41.7

*Letters in this column refer to the specific scoring scheme to be used.
See subsequent page(s).
A-3
-------
SCORING SCHEME
a. Second-highest data set: Single-valued residuals (d), paired and
unpaired
A match between observed and predicted concentration is awarded a
maximum skill score, while a residual (observed minus predicted concentration)
that is more than 1/2 the observed highest, second-highest concentration
in magnitude is assigned a score of zero. Regardless of the sign of the
residual, the points awarded vary linearly between 0 and 100% of the maximum
possible as the model error varies in magnitude between 1/2 the observed
highest, second-highest 3-hour average and zero.
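As a rough illustration of this linear scoring rule, the sketch below awards full
points for a perfect match and zero points once the residual reaches half of the
observed value. It is not part of the protocol; the function name and the numbers
in the example are invented.

    # Hypothetical sketch of the linear residual scoring (scheme a); illustrative only.
    def residual_score(observed, predicted, max_points):
        """Full points for a perfect match, zero once |residual| >= 1/2 the observed value."""
        threshold = 0.5 * observed
        residual = abs(observed - predicted)
        if residual >= threshold:
            return 0.0
        return max_points * (1.0 - residual / threshold)

    # Example: observed highest, second-highest 3-hour value of 800 ug/m3,
    # prediction of 700 ug/m3, 40 points available for the unpaired test.
    print(residual_score(800.0, 700.0, 40.0))  # 30.0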
b. 25-highest data set: Bias (d), paired and unpaired
A scoring scheme for the bias that is the same as that used for
the second-high values is used. A zero skill level is assigned to a bias
equal to 1/2 of the average observed value for the highest-25 3-hour SO2
concentrations. The total number of points awarded to a model varies between
0 and the maximum value as the magnitude of the average residual varies
between 1/2 the average observed 3-hour concentration and zero.
c. 25-highest data set: Noise and gross variability (S^, RMSE^),
paired and unpaired
The scoring scheme for the noise and gross variability tests
involves the ratio of the model precision measure to the average value
about which it is being computed. For the noise test, this ratio is the
standard deviation divided by the average modeled value. For the gross
variability test, the ratio is the root-mean-square error divided by the
average observed value. Each ratio is analogous to the coefficient of variation
(standard deviation divided by the mean) often used in statistical testing.
A score of 0 points is suggested for
a ratio of 1.0, linearly increasing to the maximum score as the ratio
goes to zero. That is, the score will be:
SCORE = (1.0 - computed ratio) x the maximum possible points
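The ratio-based scoring for the noise and gross variability tests might be
sketched as below, with S_d computed as the standard deviation of the residuals
and RMSE_d as the root-mean-square error. The data and function names are
illustrative assumptions, not values from the Baldwin evaluation.

    import statistics

    def ratio_score(precision_measure, reference_average, max_points):
        # score = (1 - ratio) x max points, floored at zero for ratios above 1
        ratio = precision_measure / reference_average
        return max(0.0, 1.0 - ratio) * max_points

    observed  = [410.0, 395.0, 380.0, 360.0, 355.0]
    predicted = [430.0, 350.0, 400.0, 340.0, 370.0]
    residuals = [o - p for o, p in zip(observed, predicted)]

    s_d  = statistics.stdev(residuals)
    rmse = (sum(r * r for r in residuals) / len(residuals)) ** 0.5

    print(ratio_score(s_d,  statistics.mean(predicted), 2.5))  # noise test
    print(ratio_score(rmse, statistics.mean(observed),  2.5))  # gross variability test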
d. 25-highest data set: Meteorological cases in common, unpaired
For the meteorological conditions comparison, four general weather
categories are used:
1. Unstable (Classes A-C), with the 10-meter wind speed less
than 5 m/sec;
2. Neutral (Class D), with the 10-meter wind speed less than
5 m/sec;
A-4
-------
3. Stable (Classes E-G), with the 10-meter wind speed less than
5 m/sec;
4. Any case with the 10-meter wind speed greater than 5 m/sec.
The number of cases for each weather category is totaled for the
top 25 observed and modeled 3-hour cases. The number of unpaired cases
"in common" between observed and predicted 3-hour events is totaled to
determine the score for this test for each model:
Score = [No. of Cases in Common / 25] x [Maximum Points]
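A hypothetical sketch of this counting rule follows; the category labels and the
example counts are invented and stand in for the four weather categories above.

    from collections import Counter

    def cases_in_common_score(obs_categories, mod_categories, max_points):
        """Count unpaired cases 'in common' between the top-25 observed and modeled events."""
        obs_counts, mod_counts = Counter(obs_categories), Counter(mod_categories)
        in_common = sum(min(obs_counts[c], mod_counts[c]) for c in obs_counts)
        return (in_common / 25.0) * max_points

    obs = ["unstable"] * 10 + ["neutral"] * 8 + ["stable"] * 5 + ["high-wind"] * 2
    mod = ["unstable"] * 7 + ["neutral"] * 12 + ["stable"] * 4 + ["high-wind"] * 2
    print(cases_in_common_score(obs, mod, 5.0))  # 21 cases in common -> 4.2 points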
e. 25-highest data set: Cumulative frequency distribution, paired in
space
For each individual monitor, the Kolmogorov-Smirnov (K-S) test
is used to determine whether the cumulative frequency distributions
between the top 25 observed and predicted 3-hour values are significantly
different (at the 5% significance level). Points are awarded for each
monitor for which there is not a significant difference in a cumulative
frequency distribution:
Score = [No. of monitors where frequency distributions are not
         significantly different / 25] x [Maximum Points]
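One way to sketch the monitor-by-monitor K-S comparison is shown below, assuming
scipy's two-sample test is an acceptable stand-in for the K-S test described above
and that the score is prorated over the number of monitors evaluated. The monitor
names and the (shortened) concentration lists are invented.

    from scipy import stats

    def frequency_distribution_score(monitors, max_points, alpha=0.05):
        """monitors: dict of monitor id -> (top observed values, top predicted values)."""
        passing = 0
        for obs, pred in monitors.values():
            result = stats.ks_2samp(obs, pred)
            if result.pvalue > alpha:   # distributions not significantly different
                passing += 1
        return (passing / len(monitors)) * max_points

    monitors = {
        "monitor_1": ([410, 395, 380, 377, 360], [400, 390, 350, 340, 330]),
        "monitor_2": ([290, 280, 275, 260, 255], [150, 145, 140, 120, 110]),
    }
    print(frequency_distribution_score(monitors, max_points=5.0))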
-------
APPENDIX B
Protocol and Performance Evaluation Results
for
Westvaco Luke Mill
B-l
-------
-------
PERFORMANCE EVALUATION PROTOCOL AND FINAL SCORES FOR WESTVACO LUKE MILL

            PAIRING       PERFORMANCE                         AVERAGING  MAXIMUM  SCORING        WEIGHTING            SCORES
DATA SET    SPACE  TIME   MEASURES                            TIMES      POINTS   SCHEME (CODE)* INDIV.  DATA SET     LUMM  SHORTZ
Maximum     No     No     |d|                                 3-hour       20        a            3.3     21.2          12     0
            No     No     |d|                                 24-hour      20        a            3.3                   19     0
            Yes    No     |d| for 8 monitors                  3-hour       16(1)     a            2.7                   10     4
            Yes    No     |d| for 8 monitors                  24-hour      16(1)     a            2.7                   13     3
            Yes    No     |d| for monitor #10                 3-hour        8        a            1.3                    0     8
            Yes    No     |d| for monitor #10                 24-hour       8        a            1.3                    1     8
            No     No     |d|                                 Annual       20        a            3.3                   13     1
            Yes    No     |d|                                 Annual       20        a            3.3                    9     8
Second-     No     No     |d|                                 3-hour       30        a            5.0     22.0          27     0
Highest     No     No     |d|                                 24-hour      30        a            5.0                   26     0
            Yes    No     |d| for 8 monitors                  3-hour       24(2)     a            4.0                   17     5
            Yes    No     |d| for 8 monitors                  24-hour      24(2)     a            4.0                    4    10
            Yes    No     |d| for monitor #10                 3-hour       12        a            2.0                   16     6
            Yes    No     |d| for monitor #10                 24-hour      12        a            2.0                    3    11
25-         No     No     |d|                                 1-hour       25        b            4.2     56.7          17     0
Highest     No     No     |d|                                 3-hour       25        b            4.2                   23     0
            No     No     |d|                                 24-hour      25        b            4.2                   23     0
            No     No     Sp/So, So/Sp                        1-hour        5        c            0.8                    0     0
            No     No     Sp/So, So/Sp                        3-hour        5        c            0.8                    1     0
            No     No     Sp/So, So/Sp                        24-hour       5        c            0.8                    4     0
            Yes    No     |d| for 8 monitors                  1-hour       40(3)     b            6.6                   30     5
            Yes    No     |d| for 8 monitors                  3-hour       40(3)     b            6.6                   30     8
            Yes    No     |d| for 8 monitors                  24-hour      40(3)     b            6.6                   26     9
            Yes    No     Sp^2/So^2, So^2/Sp^2 (8 monitors)   1-hour       16(4)     c            2.7                    6     2
            Yes    No     Sp^2/So^2, So^2/Sp^2 (8 monitors)   3-hour       16(4)     c            2.7                    5     4
            Yes    No     Sp^2/So^2, So^2/Sp^2 (8 monitors)   24-hour      16(4)     c            2.7                    7     4
            Yes    No     |d| for monitor #10                 1-hour       20        b            3.3                    1    19
            Yes    No     |d| for monitor #10                 3-hour       20        b            3.3                    9    16
            Yes    No     |d| for monitor #10                 24-hour      20        b            3.3                    4    19
            Yes    No     Sp^2/So^2, So^2/Sp^2 (monitor #10)  1-hour        8        c            1.3                    2     2
            Yes    No     Sp^2/So^2, So^2/Sp^2 (monitor #10)  3-hour        8        c            1.3                    2     8
            Yes    No     Sp^2/So^2, So^2/Sp^2 (monitor #10)  24-hour       8        c            1.3                    3     8
                                                              TOTAL       602                   99.9(5)  99.9(5)       363   168

*Letters in this column refer to the specific scoring scheme used.
See subsequent page(s).
Footnotes:
(1) 2 points per monitor
(2) 3 points per monitor
(3) 5 points per monitor
(4) 2 points per monitor
(5) Does not add to 100% because of rounding
B-3
-------
SCORING SCHEME
a. Maximum and second-highest data sets: Residual (|d|), various pairings
Score = [|d|min/|d|i] [min(Cp,i/Co, Co/Cp,i)] [max points]
where i = 1,2 = Model 1 or Model 2
b. 25-highest data set: Bias (|d|), unpaired, paired in space
Score = [|d|min/|d|i] [min(Cp,i/Co, Co/Cp,i)] [max points]
where i = 1,2 = Model 1 or Model 2
c. 25-highest data set: Variance (Sp^2/So^2, So^2/Sp^2), unpaired, paired in space
Score = [min(Sp,i^2/So^2, So^2/Sp,i^2)] [max points]
where i = 1,2 = Model 1 or Model 2
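A minimal sketch of this comparative scoring is given below, assuming |d|min
denotes the smaller of the two models' average absolute residuals; the numbers
and names are invented for illustration.

    # Illustrative sketch of the two-model residual scoring above (schemes a and b).
    def residual_scores(d1, d2, ratio1, ratio2, max_points):
        """d1, d2: average absolute residuals for Model 1 and Model 2.
        ratio1, ratio2: min(Cp/Co, Co/Cp) for each model."""
        d_min = min(d1, d2)
        score1 = (d_min / d1) * ratio1 * max_points
        score2 = (d_min / d2) * ratio2 * max_points
        return score1, score2

    print(residual_scores(d1=35.0, d2=50.0, ratio1=0.9, ratio2=0.8, max_points=20.0))
    # Model 1: 18.0   Model 2: 11.2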
-------
APPENDIX C
Protocol
for
Warren Power Plant
C-1
-------
-------
PERFORMANCE EVALUATION PROTOCOL FOR WARREN POWER PLANT

            PAIRING       PERFORMANCE       AVERAGING  MAXIMUM  SCORING        WEIGHTING
DATA SET    SPACE  TIME   MEASURES          TIMES      POINTS   SCHEME (CODE)* INDIV.  DATA SET
Maximum     Yes    No     d                 1-hour       2.0       a            1.4     14.9
            Yes    No     d                 3-hour       2.7       a            1.9
            Yes    No     d                 24-hour      3.6       a            2.6
            Yes    No     |d|/Co            1-hour       2.0       b            1.4
            Yes    No     |d|/Co            3-hour       2.7       b            1.9
            Yes    No     |d|/Co            24-hour      3.6       b            2.6
            Yes    No     R                 1-hour       1.0       c            0.7
            Yes    No     R                 3-hour       1.6       c            1.1
            Yes    No     R                 24-hour      1.8       c            1.3
Second-     No     No     Cp/Co             3-hour       7.0       d            5.0     28.4
Highest     No     No     Cp/Co             24-hour      9.0       d            6.4
            Yes    No     Cp/Co (1)         3-hour      12.0       e            8.5
            Yes    No     Cp/Co (1)         24-hour     12.0       e            8.5
25-         No     No     Cp/Co             1-hour       8.0       f            5.7     42.6
Highest     No     No     Cp/Co             3-hour      10.0       f            7.1
            No     No     Cp/Co             24-hour     13.0       f            9.2
            No     No     Sp^2/So^2         1-hour       4.0       g            2.8
            No     No     Sp^2/So^2         3-hour       6.0       g            4.3
            No     No     Sp^2/So^2         24-hour      7.0       g            5.0
            No     No     Cp/Co (2)         1-hour       8.0       h            5.7
            No     No     Sp^2/So^2 (2)     1-hour       4.0       i            2.8
All         No     No     Cp/Co             Annual       8.0       j            5.7     14.1
Data        Yes    No     d                 Annual       4.0       k            2.8
            Yes    No     |d|/Co            Annual       4.0       l            2.8
            Yes    No     R                 Annual       4.0       m            2.8
                                            TOTAL      141                    100.0   100.0

*Letters refer to the specific scoring scheme to be used. See subsequent page(s).
Footnotes:
(1) For stations with the 3 highest observed and 3 highest estimated values; see
    scoring scheme below.
(2) Stratified by stability; see scoring scheme below.
-------
SCORING SCHEME
a. Maximum data set: Average difference (d), paired in space

   Confidence intervals for the 50 percent, 80 percent and 95 percent confidence
   levels are computed from the t-test.

                                                           Point Score
                                                 (1-Hour)   (3-Hour)   (24-Hour)
   50 percent confidence interval (C.I.)
     contains zero (observed = predicted)          2.0        2.7        3.6
   80 percent C.I. contains zero
     (but 50 percent does not)                     1.33       1.8        2.4
   95 percent C.I. contains zero
     (but 80 percent does not)                     0.67       0.9        1.2
   95 percent C.I. does not contain zero           0          0          0
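The confidence-interval test in scheme a could be sketched as below, assuming a
two-sided t interval on the mean residual. The 1-hour point values follow the
table above; the residual data are invented.

    import math, statistics
    from scipy import stats

    def ci_contains_zero(residuals, confidence):
        """True if the two-sided t confidence interval on the mean residual spans zero."""
        n = len(residuals)
        mean = statistics.mean(residuals)
        sem = statistics.stdev(residuals) / math.sqrt(n)
        t_crit = stats.t.ppf(0.5 + confidence / 2.0, df=n - 1)
        return abs(mean) <= t_crit * sem

    def scheme_a_score(residuals, points=(2.0, 1.33, 0.67, 0.0)):
        for level, pts in zip((0.50, 0.80, 0.95), points):
            if ci_contains_zero(residuals, level):
                return pts
        return points[3]

    print(scheme_a_score([5.0, -12.0, 8.0, -3.0, 6.0, -4.0]))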
b. Maximum data set: Average absolute difference (AAD), paired in space

   Compute the ratio of the AAD to the average observed value.

                                      Point Score
                            (1-Hour)   (3-Hour)   (24-Hour)
         ratio < 0.2          2.0        2.7        3.6
   0.2 < ratio < 0.4          1.33       1.8        2.4
   0.4 < ratio < 0.8          0.67       0.9        1.2
   0.8 < ratio                0          0          0
c. Maximum data set: Pearson's correlation coefficient (R), paired in space

                                  Point Score
                        (1-Hour)   (3-Hour)   (24-Hour)
          R > 0.8         1.0        1.6        1.8
   0.8 >= R > 0.6         0.5        0.8        0.9
   0.6 >= R               0          0          0
-------
d. Second-highest data set: Highest, second-highest value, unpaired

   Cp/Co = the ratio of the predicted to the observed highest, second-highest value.

                                  Point Score
                                (3-Hour)   (24-Hour)
   0.50 > Cp/Co                   0          0
   0.67 > Cp/Co >= 0.50           2.6        3.4
   0.83 > Cp/Co >= 0.67           4.4        5.6
   1.2  > Cp/Co >= 0.83           7          9
   1.5  > Cp/Co >= 1.2            4.4        5.6
   2.0  > Cp/Co >= 1.5            2.6        3.4
          Cp/Co >= 2.0            0          0
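The ratio bins in this and the following schemes can be treated as a simple table
lookup. The sketch below uses the 3-hour column of scheme d; the bisect-based
helper is an assumed convenience, not part of the protocol.

    import bisect

    CUTS = [0.5, 0.67, 0.83, 1.2, 1.5, 2.0]
    POINTS_3HR = [0.0, 2.6, 4.4, 7.0, 4.4, 2.6, 0.0]

    def ratio_bin_score(cp_over_co, cuts=CUTS, points=POINTS_3HR):
        # find which bin the ratio falls in and return its point value
        return points[bisect.bisect_right(cuts, cp_over_co)]

    print(ratio_bin_score(0.9))   # 0.83 <= ratio < 1.2  -> 7.0
    print(ratio_bin_score(1.7))   # 1.5  <= ratio < 2.0  -> 2.6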
e. Second-highest data set: Second-highest observed and predicted value (by
   stations with the highest, second-highest and third-highest values; 12
   points possible), paired in space

   Cp/Co = ratio of the predicted to the observed second-highest value at the
   same station.

                                            Point Score
                            Station w/     Second-highest   Third-highest
                            highest value  station          station
   0.50 > Cp/Co                  0               0                0
   0.67 > Cp/Co >= 0.50          1               1                0
   0.83 > Cp/Co >= 0.67          2               1                1
   1.2  > Cp/Co >= 0.83          3               2                1
   1.5  > Cp/Co >= 1.2           2               1                1
   2.0  > Cp/Co >= 1.5           1               1                0
          Cp/Co >= 2.0           0               0                0
-------
f. 25-highest data set: Bias (Cp/Co), unpaired

   Cp/Co = ratio of the predicted to the observed average value.

                                     Point Score
                           (1-Hour)   (3-Hour)   (24-Hour)
   0.67 > Cp/Co               0          0           0
   0.83 > Cp/Co >= 0.67       2.5        3           4
   0.91 > Cp/Co >= 0.83       5          6           8
   1.1  > Cp/Co >= 0.91       8         10          13
   1.2  > Cp/Co >= 1.1        5          6           8
   1.5  > Cp/Co >= 1.2        2.5        3           4
          Cp/Co >= 1.5        0          0           0
g. 25-highest data set: Variance ratio (Sp^2/So^2), unpaired

   Sp^2/So^2 = ratio of the predicted to the observed variance.

                                          Point Score
                                (1-Hour)   (3-Hour)   (24-Hour)
   0.25 > Sp^2/So^2                0          0           0
   0.50 > Sp^2/So^2 >= 0.25        1.33       2           2.4
   0.75 > Sp^2/So^2 >= 0.50        2.67       4           4.8
   1.33 > Sp^2/So^2 >= 0.75        4          6           7
   2.0  > Sp^2/So^2 >= 1.33        2.67       4           4.8
   4.0  > Sp^2/So^2 >= 2.0         1.33       2           2.4
          Sp^2/So^2 >= 4.0         0          0           0
-------
h. 25-highest data set: Bias (Cp/Co), by stability category, unpaired

   For the stability category with the highest observed concentrations,
   compare the 25 highest observed and 25 highest predicted values
   (unpaired in time or location). Repeat for the stability category
   with the highest predicted concentrations. (1-hour average only.)

   Cp/Co = ratio of the predicted to the observed average value.

                               Point Score
   0.67 > Cp/Co                   0
   0.83 > Cp/Co >= 0.67           1.25
   0.91 > Cp/Co >= 0.83           2.5
   1.1  > Cp/Co >= 0.91           4
   1.2  > Cp/Co >= 1.1            2.5
   1.5  > Cp/Co >= 1.2            1.25
          Cp/Co >= 1.5            0
i. 25-highest data set: Variance ratio (Sp^2/So^2), by stability category,
   unpaired

   For the stability category with the highest observed concentrations,
   compare the 25 highest observed and 25 highest predicted values (unpaired
   in time or location). Repeat for the stability category with the highest
   predicted concentrations. (1-hour average only.)

   Sp^2/So^2 = ratio of the predicted to the observed variance.

                                    Point Score
   0.25 > Sp^2/So^2                    0
   0.50 > Sp^2/So^2 >= 0.25            0.67
   0.75 > Sp^2/So^2 >= 0.50            1.33
   1.33 > Sp^2/So^2 >= 0.75            2
   2.0  > Sp^2/So^2 >= 1.33            1.33
   4.0  > Sp^2/So^2 >= 2.0             0.67
          Sp^2/So^2 >= 4.0             0
-------
j. All data set: Bias (Cp/Co), unpaired

   Cp/Co = ratio of the predicted to the observed highest value.

                               Point Score
   0.75 > Cp/Co                   0
   0.83 > Cp/Co >= 0.75           2
   0.91 > Cp/Co >= 0.83           4
   0.95 > Cp/Co >= 0.91           6
   1.05 > Cp/Co >= 0.95           8
   1.1  > Cp/Co >= 1.05           6
   1.2  > Cp/Co >= 1.1            4
   1.33 > Cp/Co >= 1.2            2
          Cp/Co >= 1.33           0
k. All data set: Average residual (d), paired in space

   Use confidence intervals as in scheme a.

                                                               Point Score
   50 percent confidence interval contains zero                    4
   80 percent C.I. contains zero (but 50 percent does not)         2
   95 percent C.I. contains zero (but 80 percent does not)         1
   95 percent C.I. does not contain zero                           0
l. All data set: Ratio of the average absolute difference to the average
   observed value, paired in space

                                Point Score
         |d|/Co < 0.1               4
   0.1 < |d|/Co < 0.2               2
   0.2 < |d|/Co < 0.3               1
   0.3 < |d|/Co                     0
m. All data set: Pearson's correlation coefficient (R), paired in space

                        Point Score
         R > 0.9            4
   0.9 > R > 0.8            3
   0.8 > R > 0.7            2
   0.7 > R > 0.6            1
   0.6 > R                  0
-------
APPENDIX D
Protocol
for
Lovett Power Plant
D-l
-------
D-2
-------
PERFORMANCE EVALUATION PROTOCOL FOR LOVETT POWER PLANT

             PAIRING       PERFORMANCE                   AVERAGING  MAXIMUM  SCORING        WEIGHTING
DATA SET     SPACE  TIME   MEASURES                      TIMES      POINTS   SCHEME (CODE)* INDIV.  DATA SET
Second-      No     No     Cp/Co, Co/Cp                  3-hour       5.0       a             5.0     20.0
Highest      No     No     Cp/Co, Co/Cp                  24-hour      5.0       a             5.0
             Yes    No     Cp/Co, Co/Cp                  3-hour       5.0       b             5.0
             Yes    No     Cp/Co, Co/Cp                  24-hour      5.0       b             5.0
25-Highest   No     No     Cp/Co, Co/Cp                  1-hour       4.0       c             4.0     58.0
             No     No     Cp/Co, Co/Cp                  3-hour       4.0       c             4.0
             No     No     Cp/Co, Co/Cp                  24-hour      4.0       c             4.0
             Yes    No     Cp/Co, Co/Cp                  1-hour       4.0       d             4.0
             Yes    No     Cp/Co, Co/Cp                  3-hour       4.0       d             4.0
             Yes    No     Cp/Co, Co/Cp                  24-hour      4.0       d             4.0
             No     No     Sp^2/So^2, So^2/Sp^2          1-hour       4.0       e             4.0
             No     No     Sp^2/So^2, So^2/Sp^2          3-hour       4.0       e             4.0
             No     No     Sp^2/So^2, So^2/Sp^2          24-hour      4.0       e             4.0
             Yes    No     Sp^2/So^2, So^2/Sp^2          1-hour       4.0       f             4.0
             Yes    No     Sp^2/So^2, So^2/Sp^2          3-hour       4.0       f             4.0
             Yes    No     Sp^2/So^2, So^2/Sp^2          24-hour      4.0       f             4.0
             No     No     No. of cases in common        1-hour      10.0       g            10.0
All          No     Yes    Cp/Co, Co/Cp                  Annual       1.0       h             1.0     22.0
Data         Yes    Yes    Cp/Co, Co/Cp                  Annual       1.0       i             1.0
             No     Yes    R (1)                         1-hour       1.0       j             1.0
             No     Yes    R (1)                         3-hour       1.0       j             1.0
             No     Yes    Cp/Co, Co/Cp (1)              1-hour       1.0       k             1.0
             No     Yes    Cp/Co, Co/Cp (1)              3-hour       1.0       k             1.0
             No     Yes    Sp^2/So^2, So^2/Sp^2 (1)      1-hour       1.0       l             1.0
             No     Yes    Sp^2/So^2, So^2/Sp^2 (1)      3-hour       1.0       l             1.0
             No     Yes    d^2 (1)                       1-hour       2.0       m             2.0
             No     Yes    d^2 (1)                       3-hour       2.0       m             2.0
             No     Yes    R (2)                         1-hour       1.0       n             1.0
             No     Yes    R (2)                         3-hour       1.0       n             1.0
             No     Yes    Cp/Co, Co/Cp (2)              1-hour       1.0       o             1.0
             No     Yes    Cp/Co, Co/Cp (2)              3-hour       1.0       o             1.0
             No     Yes    Sp^2/So^2, So^2/Sp^2 (2)      1-hour       1.0       p             1.0
             No     Yes    Sp^2/So^2, So^2/Sp^2 (2)      3-hour       1.0       p             1.0
             No     Yes    d^2 (2)                       1-hour       2.0       q             2.0
             No     Yes    d^2 (2)                       3-hour       2.0       q             2.0
                                                         TOTAL      100.0                   100.0   100.0

*Letters refer to specific scoring scheme to be used. See subsequent page(s).
Footnotes: (1) Stable conditions
           (2) Nonstable conditions
D-3
-------
SCORING SCHEME
a. Second-highest data set: Ratios of concentrations (Cp/Co, Co/Cp), unpaired
Score = [min(Cp/Co, Co/Cp)] [max points]
b. Second-highest data set: Ratios of concentrations (Cp/Co, Co/Cp), paired in
space
Score = [min(Cp/Co, Co/Cp)] [max points]
c. 25-highest data set: Bias (Cp/Co, Co/Cp), unpaired
Score = [min (Cp/Co, Co/Cp)] [max points]
d. 25-highest data set: Bias (Cp/Co, Co/Cp), paired in space
Score = [min(Cp/Co, Co/Cp)] [max points]
i. All data set: Bias (Cp/Co, Co/Cp), paired in space and time
Score = [min(Cp/Co, Co/Cp)] [max points]
j. All data set: Pearson's correlation coefficient (R), stable conditions
only, paired in time
Score = [R^2] [max points]
k. All data set: Bias (Cp/Co, Co/Cp), stable conditions only, paired in time
Score = [min(Cp/Co, Co/Cp)] [max points]
D-4
-------
l. All data set: Variance ratios (Sp^2/So^2, So^2/Sp^2), stable conditions only,
paired in time
Score = [min(Sp^2/So^2, So^2/Sp^2)] [max points]
m. All data set: Gross variability (d^2), stable conditions only, paired
in time
Score = [(Σd^2)min/(Σd^2)] [max points]
where (Σd^2)min is the value for the best performing model
n. All data set: Pearson's correlation coefficient (R), nonstable conditions,
paired in time
Score = [R^2] [max points]
o. All data set: Bias (Cp/Co, Co/Cp), nonstable conditions, paired in time
Score = [min(Cp/Co, Co/Cp)] [max points]
p. All data set: Variance ratios (Sp^2/So^2, So^2/Sp^2), nonstable conditions,
paired in time
Score = [min(Sp^2/So^2, So^2/Sp^2)] [max points]
q. All data set: Gross variability (d^2), nonstable conditions, paired in time
Score = [(Σd^2)min/(Σd^2)] [max points]
where (Σd^2)min is the value for the best performing model
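A minimal sketch of the gross-variability scoring in schemes m and q is given
below, assuming the "best performing model" is the one with the smallest sum of
squared residuals; the model names and residual values are invented.

    def gross_variability_scores(residuals_by_model, max_points):
        """Return each model's score: (sum of squared residuals of best model / its own sum) x max points."""
        sums = {name: sum(r * r for r in res) for name, res in residuals_by_model.items()}
        best = min(sums.values())
        return {name: (best / total) * max_points for name, total in sums.items()}

    example = {
        "Model 1": [4.0, -6.0, 3.0, -2.0],
        "Model 2": [9.0, -11.0, 7.0, -5.0],
    }
    print(gross_variability_scores(example, 2.0))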
D-5
-------
D-6
-------
APPENDIX E
Protocol
for
Guayanilla Basin
E-l
-------
E-2
-------
PERFORMANCE EVALUATION PROTOCOL FOR GUAYANILLA BASIN

              PAIRING       PERFORMANCE       AVERAGING  MAXIMUM  SCORING        WEIGHTING
DATA SET      SPACE  TIME   MEASURES          TIMES      POINTS   SCHEME (CODE)* INDIV.  DATA SET
Maximum       No     No     Cp/Co             3-hour        4        a             1.6     11.8
              No     No     Cp/Co             24-hour       5        a             2.0
              Yes    No     Cp/Co             3-hour        9        b             3.5
              Yes    No     Cp/Co             24-hour      12        b             4.7
Second-       No     No     Cp/Co             3-hour        5        c             2.0     14.9
Highest       No     No     Cp/Co             24-hour       6        c             2.3
              Yes    No     Cp/Co             3-hour       12        d             4.7
              Yes    No     Cp/Co             24-hour      15        d             5.9
25-Highest    No     No     Cp/Co             1-hour        6        e             2.3     50.2
              No     No     Cp/Co             3-hour        8        e             3.1
              No     No     Cp/Co             24-hour      12        e             4.7
              No     No     Sp^2/So^2         1-hour        3        f             1.2
              No     No     Sp^2/So^2         3-hour        4        f             1.6
              No     No     Sp^2/So^2         24-hour       6        f             2.3
              Yes    No     Cp/Co             1-hour       12        g             4.7
              Yes    No     Cp/Co             3-hour       18        g             7.0
              Yes    No     Cp/Co             24-hour      30        g            11.7
              Yes    No     Sp^2/So^2         1-hour        6        h             2.3
              Yes    No     Sp^2/So^2         3-hour        9        h             3.5
              Yes    No     Sp^2/So^2         24-hour      15        h             5.8
Upper 5% of   Yes    No     No. of Cases
Observed &                    in Common       1-hour       60        i            23.4     23.4
Predicted
                                              TOTAL       257                    100.3(1) 100.3(1)

*Letters refer to specific scoring scheme to be used. See subsequent page(s).
Footnote:
(1) Does not add to 100% because of rounding.
-------
SCORING SCHEME

a. Maximum data set: Concentration ratio (Cp/Co), unpaired

                                   3-hr    24-hr
   0.67 > Cp/Co                     0.0     0.0
   0.80 > Cp/Co >= 0.67             0.5     1.0
   0.91 > Cp/Co >= 0.80             2.0     2.5
   1.20 > Cp/Co >= 0.91             4.0     5.0
   1.50 > Cp/Co >= 1.20             2.5     3.5
   2.50 > Cp/Co >= 1.50             1.5     2.0
          Cp/Co >= 2.50             0.0     0.0
b. Maximum data set: Concentration ratio (Cp/Co), paired in space

   A weighting factor is to be applied to the scores for the tests at
   each monitor. The weighting factor will be based on the relative rank
   of the observed data for each averaging period to be examined. The
   following weights will be assigned and should be applied to the table
   below.

        Phase I               Phase II
   Rank      Weight       Rank      Weight
     1        1.0          1,2       0.50
     2        0.8          3,4       0.40
     3        0.7          5,6       0.35
     4        0.5          7,8       0.25

                                   3-hr    24-hr
   0.67 > Cp/Co                     0.0     0.0
   0.80 > Cp/Co >= 0.67             0.0     0.5
   0.91 > Cp/Co >= 0.80             1.0     2.0
   1.20 > Cp/Co >= 0.91             3.0     4.0
   1.50 > Cp/Co >= 1.20             2.0     2.5
   2.50 > Cp/Co >= 1.50             1.0     1.5
          Cp/Co >= 2.50             0.0     0.0
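The rank-weighted, paired-in-space scoring of scheme b might be sketched as
follows, using the Phase I weights and the 24-hour point column above; the
bin-lookup helper and the monitor ratios are illustrative assumptions.

    import bisect

    PHASE_I_WEIGHTS = {1: 1.0, 2: 0.8, 3: 0.7, 4: 0.5}
    CUTS = [0.67, 0.80, 0.91, 1.20, 1.50, 2.50]
    POINTS_24HR = [0.0, 0.5, 2.0, 4.0, 2.5, 1.5, 0.0]

    def monitor_score(rank, cp_over_co):
        # look up the point value for the ratio bin, then apply the rank weight
        points = POINTS_24HR[bisect.bisect_right(CUTS, cp_over_co)]
        return PHASE_I_WEIGHTS[rank] * points

    # Four Phase I monitors, ranked by their observed 24-hour maxima
    ratios_by_rank = {1: 1.05, 2: 0.85, 3: 1.6, 4: 0.7}
    print(sum(monitor_score(r, c) for r, c in ratios_by_rank.items()))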
-------
c. Second-highest data set: Concentration ratio (Cp/Co), unpaired

                                   3-hr    24-hr
   0.67 > Cp/Co                     0.0     0.0
   0.80 > Cp/Co >= 0.67             1.0     1.5
   0.91 > Cp/Co >= 0.80             2.5     3.0
   1.20 > Cp/Co >= 0.91             5.0     6.0
   1.50 > Cp/Co >= 1.20             3.5     4.0
   2.50 > Cp/Co >= 1.50             2.0     2.5
          Cp/Co >= 2.50             0.0     0.0
d. Second-highest data set: Concentration ratio (Cp/Co), paired in space

   A weighting factor is to be applied to the scores for the tests at each
   monitor. The weighting factor will be based on the relative rank of the
   observed data for each averaging period to be examined. The following
   weights will be assigned and should be applied to the table below.

        Phase I               Phase II
   Rank      Weight       Rank      Weight
     1        1.0          1,2       0.50
     2        0.8          3,4       0.40
     3        0.7          5,6       0.35
     4        0.5          7,8       0.25

                                   3-hr    24-hr
   0.67 > Cp/Co                     0.0     0.0
   0.80 > Cp/Co >= 0.67             0.5     1.0
   0.91 > Cp/Co >= 0.80             2.0     2.5
   1.20 > Cp/Co >= 0.91             4.0     5.0
   1.50 > Cp/Co >= 1.20             2.5     3.5
   2.50 > Cp/Co >= 1.50             1.5     2.0
          Cp/Co >= 2.50             0.0     0.0
E-5
-------
e. 25-highest data set: Bias (Cp/Co), unpaired

                                   1-hr    3-hr    24-hr
   0.67 > Cp/Co                     0.0     0.0      0.0
   0.80 > Cp/Co >= 0.67             1.5     2.0      3.0
   0.91 > Cp/Co >= 0.80             3.0     4.0      6.0
   1.20 > Cp/Co >= 0.91             6.0     8.0     12.0
   1.50 > Cp/Co >= 1.20             4.0     5.5      8.0
   2.50 > Cp/Co >= 1.50             2.5     3.0      4.0
          Cp/Co >= 2.50             0.0     0.0      0.0
f. 25-highest data set: Variance ratio (Sp^2/So^2), unpaired

                                         1-hr    3-hr    24-hr
          Sp^2/So^2 <= 0.25               0.0     0.0      0.0
   0.25 < Sp^2/So^2 <= 0.50               1.0     1.5      2.0
   0.50 < Sp^2/So^2 <= 0.75               2.0     3.0      4.0
   0.75 < Sp^2/So^2 <= 1.33               3.0     4.0      6.0
   1.33 < Sp^2/So^2 <= 2.00               2.0     3.0      4.0
   2.00 < Sp^2/So^2 <= 4.00               1.0     1.5      2.0
   4.00 < Sp^2/So^2                       0.0     0.0      0.0
-------
g. 25-highest data set: Bias (Cp/Co), paired in space

   A weighting factor is to be applied to the scores for the tests at each
   monitor. The weighting factor will be based on the relative rank of the
   observed data for each averaging period to be examined. The following
   weights will be assigned and should be applied to the table below.

        Phase I               Phase II
   Rank      Weight       Rank      Weight
     1        1.0          1,2       0.50
     2        0.8          3,4       0.40
     3        0.7          5,6       0.35
     4        0.5          7,8       0.25

                                   1-hr    3-hr    24-hr
   0.67 > Cp/Co                     0.0     0.0      0.0
   0.80 > Cp/Co >= 0.67             0.5     1.5      2.5
   0.91 > Cp/Co >= 0.80             2.0     3.0      5.0
   1.20 > Cp/Co >= 0.91             4.0     6.0     10.0
   1.50 > Cp/Co >= 1.20             2.5     4.0      6.5
   2.50 > Cp/Co >= 1.50             1.5     2.5      3.5
          Cp/Co >= 2.50             0.0     0.0      0.0
h. 25-highest data set: Variance ratio (Sp^2/So^2), paired in space

   A weighting factor is to be applied to the scores for the tests at each
   monitor. The weighting factor will be based on the relative rank of the
   observed data for each averaging period to be examined. The following
   weights will be assigned and should be applied to the table below.

        Phase I               Phase II
   Rank      Weight       Rank      Weight
     1        1.0          1,2       0.50
     2        0.8          3,4       0.40
     3        0.7          5,6       0.35
     4        0.5          7,8       0.25

                                         1-hr    3-hr    24-hr
          Sp^2/So^2 <= 0.25               0.0     0.0      0.0
   0.25 < Sp^2/So^2 <= 0.50               0.5     1.0      2.0
   0.50 < Sp^2/So^2 <= 0.75               1.0     2.0      3.5
   0.75 < Sp^2/So^2 <= 1.33               2.0     3.0      5.0
   1.33 < Sp^2/So^2 <= 2.00               1.0     2.0      3.5
   2.00 < Sp^2/So^2 <= 4.00               0.5     1.0      2.0
   4.00 < Sp^2/So^2                       0.0     0.0      0.0
I
i. Upper 5% of frequency distribution data set: Number of cases in common

   At each monitor, unpaired in time, stratify the upper 5% of the 1-hour
   predicted and observed concentrations according to the following categories:

   I.   Unstable (Classes A, B, C)
   II.  Neutral (Class D)
   III. Stable (Classes E, F)

   The number of unpaired cases "in common" between observed and predicted 1-hour
   events will be used to determine a skill factor for each category, defined as:

      Rsf = 2 x (Number in Common) / (Number Predicted + Number Observed)

   The total number of points for each Phase I monitor is 15 points and the total
   points for each Phase II monitor is 7.5 points, apportioned as follows, with
   category predominance determined from the highest 25 observed concentrations:

                                   Phase I     Phase II
   Most predominant category:       8 pts.      4 pts.
   Next predominant category:       4 pts.      2 pts.
   Least predominant category:      3 pts.      1.5 pts.

   The total score is given by

      Score = Σ (Rsf)(max points), summed over the three categories
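A sketch of the skill-factor calculation for a single Phase I monitor, using the
8/4/3 point split above; the category counts and function names are invented for
illustration.

    def skill_factor(n_common, n_predicted, n_observed):
        # Rsf = 2 x (number in common) / (number predicted + number observed)
        return 2.0 * n_common / (n_predicted + n_observed)

    def monitor_score(categories, points_by_rank=(8.0, 4.0, 3.0)):
        """categories: list of (n_common, n_predicted, n_observed) tuples,
        ordered from most to least predominant observed category."""
        return sum(pts * skill_factor(*cat)
                   for pts, cat in zip(points_by_rank, categories))

    # Phase I monitor: unstable most predominant, then neutral, then stable
    print(monitor_score([(30, 40, 38), (10, 15, 18), (4, 8, 7)]))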
-------
TECHNICAL REPORT DATA
(Please read instructions on the reverse before completing)

1. REPORT NO.                    EPA 450/4-85-006
3. RECIPIENT'S ACCESSION NO.
4. TITLE AND SUBTITLE            Interim Procedures for Evaluating Air Quality Models:
                                 Experience with Implementation
5. REPORT DATE                   July 1985
6. PERFORMING ORGANIZATION CODE
7. AUTHOR(S)
8. PERFORMING ORGANIZATION REPORT NO.
9. PERFORMING ORGANIZATION NAME AND ADDRESS
                                 Monitoring and Data Analysis Division
                                 Office of Air Quality Planning and Standards
                                 U. S. Environmental Protection Agency
                                 Research Triangle Park, N.C. 27711
10. PROGRAM ELEMENT NO.
11. CONTRACT/GRANT NO.
12. SPONSORING AGENCY NAME AND ADDRESS
13. TYPE OF REPORT AND PERIOD COVERED
14. SPONSORING AGENCY CODE
15. SUPPLEMENTARY NOTES
16. ABSTRACT
    This report summarizes and intercompares the details of five major regulatory
cases for which guidance provided in the "Interim Procedures for Evaluating Air Quality
Models" was implemented in evaluating candidate models. In two of the cases the evalua-
tions have been completed and the appropriate model has been determined. In three cases
the data base collection and/or the final analysis has not yet been completed. The pur-
pose of the report is to provide potential users of the Interim Procedures with a des-
cription and analysis of several applications that have taken place. With this informa-
tion in mind the user should be able to: (1) more effectively implement the procedures
since some of the pitfalls experienced by the initial pioneers can now be avoided; and
(2) design innovative technical criteria and statistical techniques that will advance
the state of the science of model evaluation.
    The analyses show that the basic principles or framework underlying the Interim
Procedures is sound and workable in application. The concept of using the results
from a prenegotiated protocol for the performance evaluation has been shown to be an
appropriate and workable primary basis for objectively deciding on the best model. Sim-
ilarly, "up-front" negotiation on what constitutes an acceptable data base network has
been established as an acceptable way of promoting objectivity in the evaluation. Pre-
liminary concentration estimates and the need for accurate continuous on-site measure-
ments of the requisite model input data are also important.
17. KEY WORDS AND DOCUMENT ANALYSIS
    a. DESCRIPTORS: Air Pollution, Meteorology, Mathematical Models,
       Performance Evaluation, Statistics, Performance Measures, Technical Evaluation
    b. IDENTIFIERS/OPEN ENDED TERMS
    c. COSATI Field/Group: 4B, 12A
18. DISTRIBUTION STATEMENT       Unlimited
19. SECURITY CLASS (This Report)   Unclassified
20. SECURITY CLASS (This Page)     Unclassified
21. NO. OF PAGES
22. PRICE
-------
------- |