EPA/600/A-97/098

1.1 ADVANCED TECHNIQUES FOR EVALUATING EULERIAN AIR QUALITY MODELS: BACKGROUND AND METHODOLOGY

J.R. Arnold*, Robin L. Dennis†, and Gail S. Tonnesen
Atmospheric Sciences Modeling Division
National Oceanic and Atmospheric Administration
Research Triangle Park, NC

Model Evaluation Goals and Objectives

The goal of model evaluation within the framework of applications is to determine an air quality model's (AQM's) degree of acceptability and usefulness for a specified application or task. That goal is met by satisfying three objectives:

• indicate the validity of the model's scientific formulations
• evaluate the realism of the model's simulations
• characterize the credibility of the model's realism relative to its intended application.

Satisfying these objectives, however, presents two fundamental problems for providing a meaningful interpretation of model behavior to guide decision-making:

• how to tell whether the model is producing seemingly appropriate results from incorrect formulations or input data; i.e., how model performance can appear right for the wrong reasons
• how to establish the model's acceptability when the evaluation tests appear inadequate for making a determination of acceptability.

Most applications of the model concern predicting pollutant concentrations for future atmospheric conditions. The degree of a model's acceptability, therefore, will depend on the validity of the model's science and the appropriateness of its implementation in the model's component processes. Testing only a model's state variable resultants against limited sets of historical data provides only weak, inductive justification of the model's acceptability since such tests are not directed at a model's scientific formulation. Thus, the component processes that comprise a model's scientific formulation should be the focus of an evaluation.

Current Efforts Are Inadequate

Evaluation efforts to date have been largely inadequate for addressing the fundamental problems and meeting the objectives of AQM evaluation. The inadequacy comes from the practice of basing the evaluation of three-dimensional Eulerian process models on tests that use simple statistical measures on residuals (Concentration_predicted - Concentration_observed) for O3 and, less often, for NOx. However, since these state variable resultants are estimates derived from a highly nonlinear, buffered system, simple comparison statistics cannot be a sound basis for model evaluation because they provide no information for understanding the interaction of model processes responsible for producing the state variable predictions. The current recommended statistical measures for AQM evaluation (USEPA 1991) date from 1981 (Fox 1981) and were adapted from long-established regression statistics for empirically fitted models where model parameters can be tested against observational data. But that regression paradigm is inappropriate as the sole test for complex Eulerian process models where the underlying science in the model must be evaluated to ensure credibility and confidence in the model's simulations of atmospheric photochemistry and fluid flow. And since there are no strong ambient signals of change, there can be no direct tests of any outcome variable which would be meaningfully measured with resultant comparisons.

*Corresponding author address: J.R. Arnold, PO Box 23193, Seattle, WA 98102-0493; email: jra@unc.edu
†On assignment to NERL, U.S. Environmental Protection Agency.
Moreover, because the atmosphere is a system of strongly nonlinear processes, it would not be possible to interpret such a signal unambiguously if one were present. For these reasons, successful model evaluation for judging a model's potential behavior in future applications requires tests that are diagnostic of a model's processes and the science represented in them.

The need for diagnostic testing of AQMs has previously been recognized: Fox's report from an early model evaluation workshop (Fox 1984) noted explicitly that improvement in model performance is tied to understanding the scientific basis for model behavior in well-defined physical situations, a point repeated in Seinfeld (1988). More recently, Tesche, Reynolds, and Roth have described the need for "stressful" diagnostic testing in a better, more comprehensive model evaluation methodology (Tesche et al. 1992; Reynolds et al. 1994). However, incorporating process-oriented diagnostic tests in actual model evaluations has been slow in coming. Previous evaluations have focused on failure analysis of module components, or sensitivity testing of the full model by examining resultant concentrations, rather than on true model process diagnostics.

Implications of Inadequate Evaluation

Inadequate evaluations allow the possibility of finding a model acceptable for an application when in fact it is not. The lack of diagnostic tests for a model and its component modules in most current evaluations has allowed models to be judged acceptable for several applications when large and significant errors remained. For example, using the USEPA recommended performance statistics of ±5 to 15% bias, ±30 to 35% gross error, and ±15 to 20% unpaired peak prediction accuracy in [O3], urban applications of UAM-IV have performed acceptably even when errors in the meteorological model produced wind speeds of zero in all grid layers above the surface (Tesche and McNally 1995). In other cases (described in Tesche et al. 1992), data in VOC inventories used to set up a model for applications evaluations were later shown to be underestimated by a factor of 2 or more; yet at the time, model applications were found to be acceptable using the recommended performance statistics.
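For concreteness, the sketch below shows how the resultant statistics referred to above (normalized bias, normalized gross error, and unpaired peak prediction accuracy) are typically computed from paired hourly predictions and observations. It is a minimal illustration only: the function name, the use of NumPy arrays, and the 60 ppb observation cutoff are assumptions of this sketch, not text taken from the USEPA (1991) guidance.

```python
# Minimal sketch of the resultant statistics discussed above.
# Assumes paired hourly arrays of predicted and observed [O3]; the 60 ppb
# cutoff and the array names are illustrative assumptions only.
import numpy as np

def resultant_statistics(pred, obs, cutoff=60.0):
    """Return normalized bias, normalized gross error, and unpaired
    peak prediction accuracy (all as fractions) for paired values."""
    pred = np.asarray(pred, dtype=float)
    obs = np.asarray(obs, dtype=float)
    mask = obs >= cutoff                    # ignore very low observed values
    residual = pred[mask] - obs[mask]       # Concentration_predicted - Concentration_observed
    norm_bias = np.mean(residual / obs[mask])
    gross_error = np.mean(np.abs(residual) / obs[mask])
    peak_accuracy = (pred.max() - obs.max()) / obs.max()   # unpaired in space and time
    return norm_bias, gross_error, peak_accuracy

# Synthetic example only, to show how the acceptance thresholds would be applied.
bias, err, peak = resultant_statistics([80, 95, 120], [75, 100, 110])
print(f"bias {bias:+.1%}, gross error {err:.1%}, peak accuracy {peak:+.1%}")
```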
Furthermore, because model evaluation for air quality applications is carried out to assist in control strategy selection, an inadequate model evaluation for these applications leaves the potential for undocumented bias in estimating the effect of control strategies. Given an undocumented bias in a model, policy-makers could select an incorrect level of future reductions, or even the wrong type of control (NOx or VOC). This point has been demonstrated in two recent studies using two models to simulate the New York domain for July 1988. A series of sensitivity/uncertainty tests performed with UAM-IV (Sistla et al. 1996) and with RADM (Li et al. 1998) has shown that emissions and meteorology uncertainties in the model setup affect the final predicted O3 concentration ([O3]) to the extent that preferences for control strategies can shift. The high risk of selecting the wrong control strategy has costly economic and social disbenefits, which will be increasingly important for the proposed new multi-pollutant standards and combined control strategies. Analyses of [O3] time-series plots and residual statistics cannot reveal such potential biases in the use of a model; thus, undiscovered biases present a substantial negative implication for any evaluation procedure which uses residual statistics and other outcome measures to the exclusion of diagnostic tests.

Revised Model Evaluation Methodology

These implications of inadequate model evaluations motivate development of a revised evaluation methodology. Importantly, the examples described above are failures of the model evaluation more than of the model. Better evaluation procedures, including enhanced diagnostic testing, might have detected the flaws in the model and the compensating errors in its setup for these applications. The methodology proposed here is described with a matrix of evaluation tests grouped in several testing categories, and incorporates both new techniques and redefined techniques from existing model evaluations. A brief outline of the methodology's key elements is given here; fuller descriptions and a diagram of candidate tests and interpretations for an example model evaluation appear in the oral presentation.

This methodology proposes two top-level evaluation types: Integrated/Diagnostic, and Applications. Integrated/Diagnostic evaluation is an assessment of the model's fitness for use in its intended application or task based on a comparison of processes and output from the model judged against relevant observations, observational models, and other numerical model results. An Applications evaluation then typically follows an Integrated/Diagnostic evaluation and characterizes use of the model in a specific application study for the purpose of ensuring against degradation of model performance through unintended or unjustified changes in the model or its setup. The focus of this paper is Integrated/Diagnostic evaluation, since it precedes and supports an Applications evaluation, and because an Applications evaluation makes use of elements of the Integrated/Diagnostic evaluation, but for the purpose of testing the model setup for the particular application of the study. While most of the tests described below are specific to the chemistry of AQMs, some suggestions for tests of other model components are also included.

An Integrated/Diagnostic evaluation consists of a series of test results and interpretations designed using elements of these four categories:

(1) Component and Composition assessment
(2) Resultants comparisons
(3) Diagnostic testing
(4) Model Systems analysis.

Category (1), Component and Composition assessment, examines the structure and scientific basis of model processes, and the instantiation of those processes in the model's equations and numerical routines. Assessment of the model structure includes describing assumptions behind formulations of the model's components, and testing whether different formulations produce different results; i.e., providing a sensitivity analysis on model structure. Specific modules such as the chemical mechanism and the deposition algorithms are assessed independently using direct comparisons in cross-model testing like that performed for the RADM and ADOM chemistry modules (Dennis et al. 1990), and directly with Mechanistic tests of the modules using specially collected data - e.g., smog chamber runs on specific compounds, or special laboratory and field data - to test specific parameterizations in the model for well-defined conditions.
There is substantial experience using Mechanistic tests on chemical mechanisms of AQMs: a hierarchy of tests was developed and has generally been followed (Whitten 1983; Atkinson et al. 1987). In addition, several species ratios have been proposed to help evaluators judge whether the chemistry is correctly predicting the processing and product formation (Dennis et al. 1990). While ratios of species are not always less sensitive to non-chemistry effects such as transport or dispersion, they can indicate areas of concern. Thus, where the chemical mechanism correctly predicts the NO:NOz ratio, for example, but the full model does not, this may indicate a problem with transport. Mechanistic testing of other modules is less advanced than it is for chemistry, but an extensive series of Mechanistic tests for AQM meteorological modules has recently been proposed (Tesche and McNally 1996).

Category (2), Resultants comparisons, describes how well the model predicts key state variables using Outcome Variable matching. Outcome Variable matching is the direct comparison of model-predicted state variable concentrations against ambient observations of the same state variables. These comparisons are made with statistical measures, qualitative and quantitative pattern analysis, and correlation statistics for geographical sites or time series data like the ones recommended by Tesche et al. (1990, 1992) to supplement the standard USEPA measures of normalized bias and gross error, and average unpaired peak accuracy (USEPA 1991). While O3 is a key variable for AQM evaluation because it is of central regulatory concern, it is not the only variable of interest. Consequently, Resultants comparisons should be made for precursor compounds such as NOx and VOCs, for conserved species such as CO and NOy, and for other products such as HNO3 and total particulate NO3-, H2O2 and total peroxides, and PAN (peroxyacetic nitric anhydride), MPAN (peroxy-methacrylic nitric anhydride), and PPN (peroxypropionic nitric anhydride) where data are available.

Category (3), Diagnostic testing, assesses a model's predictive performance and reveals why the model responds in the way that it does by providing candidate explanations, in terms of processes and pathways, for the performance summarized in Resultants comparisons. Diagnostic testing is in situ testing of a model's processes and can be performed either internally with one model, or across several models, or in comparisons using specially collected aerometric data which emphasize atmospheric processes (see for example Parrish et al. 1993; Trainer et al. 1993). Two types of Diagnostic tests are proposed: Process Diagnostics and Response Surface Diagnostics. Process Diagnostics are specific tests of key reaction pathways and interactions between components in an AQM. They assess a model's ability to represent actual atmospheric interactions by examining pathways and processes in the model, often using special measurements designed to indicate the activity of such processes.
For the chemistry module, for example, Process Diagnostics would test:

• radical initiation pathways using comparisons of O3, HCHO, HONO, peroxides, and spectral irradiance
• radical termination pathways using comparisons of HNO3, PAN, organic nitrates, and peroxides
• the balance between radical initiation and termination
• competition between radical termination pathways by comparing production of HNO3 and other nitrates to H2O2 and other peroxides
• calculations of radical propagation efficiency and OH chain length using NO, NO2, VOC, and RO2
• estimates of OH chain length using ratios of [O3] to radical termination products
• estimates of O3 production efficiency using ratios of [O3] to NOx termination products; i.e., NOz
• speciation of NOz to compare competition between NOx termination pathways
• airmass aging pathways using relative fractions of NOx vs. NOy, and PAN and other nitrates vs. NOy.

Some Process Diagnostic tests will involve measures that reflect both local photochemistry and photochemistry over the history of an air parcel. For example, O3 production rates and OH chain length might be estimated for instantaneous, local photochemistry using measured RO2, NO, NO2, and VOC concentrations, while the cumulative O3 production and average OH chain length over an air parcel's history can be approximated using [O3] and the ratio of [O3] to radical termination products. However, local photochemistry and the history of an airmass should be carefully distinguished, and it will be necessary to develop diagnostic tests to evaluate model representations of each. Furthermore, ratios of species can be calculated in the model for comparisons at the surface and aloft to provide additional diagnostic information on a model's treatment of intermediate and product species and radicals. For example, chemical dynamics measures, calculated using partitioned NOx and NOy species for aloft and surface concentrations, could provide useful information about airmass aging pathways since the species aloft would be more susceptible to transport and dispersion effects, and the chemistry can proceed further to completion there. Also, RO2-RO2 reactions should be tested in the surface layer of urban areas, where the reaction has only little significance, and again aloft at low [NOx], where the reaction is significant for [O3] and [H2O2]. Additional Process Diagnostic tests would use CO data with other ratios of species having varying lifetimes to derive additional model-estimated chemical budgets and processing rates. These comparisons would serve as an aid to interpreting results from the Mechanistic testing of the chemical mechanism, helping to separate the influence of chemistry from other model processes.

The second type of Diagnostic test, Response Surface Diagnostics, tests a model's ability to track systemic modulation and generally will involve the use of indicator species and ratios of species thought to correlate consistently with VOC-sensitive and NOx-sensitive O3 production in the model. Several indicators of O3 production sensitivity and airmass aging are currently under development (Sillman 1995; Sillman et al. 1997; Tonnesen and Dennis 1998a, 1998b), and appear promising as diagnostic tests for model evaluations. For example, the indicators [O3]/[HNO3] and [O3]/[NOz] show strong correlations to [O3] sensitivity, and correctly predict conditions that are either strongly NOx- or strongly VOC-limited (Tonnesen and Dennis 1998b).
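As an illustration of how such indicator ratios might be screened in practice, the sketch below computes [O3]/[HNO3] and [O3]/[NOz] from hypothetical gridded model fields and applies a simple NOx-limited versus VOC-limited classification. The array names, the working definition NOz = NOy - NOx, and the numerical threshold are assumptions made for this example; an actual evaluation would derive the transition value from the model's own sensitivity runs rather than use a fixed number.

```python
# Illustrative sketch only: indicator ratios for Response Surface Diagnostics.
# Field names, NOz definition, and the classification threshold are assumptions.
import numpy as np

def indicator_ratios(o3, hno3, noy, nox):
    """Return [O3]/[HNO3] and [O3]/[NOz] fields, taking NOz = NOy - NOx
    (i.e., the pool of NOx termination/oxidation products)."""
    o3, hno3, noy, nox = (np.asarray(a, dtype=float) for a in (o3, hno3, noy, nox))
    noz = noy - nox
    # Leave cells with non-positive denominators as NaN rather than dividing.
    ratio_hno3 = np.divide(o3, hno3, out=np.full_like(o3, np.nan), where=hno3 > 0)
    ratio_noz = np.divide(o3, noz, out=np.full_like(o3, np.nan), where=noz > 0)
    return ratio_hno3, ratio_noz

def classify_sensitivity(ratio_noz, threshold=7.0):
    """Label cells NOx- or VOC-limited from [O3]/[NOz]; the threshold is
    purely illustrative, and NaN handling is omitted for brevity."""
    return np.where(ratio_noz > threshold, "NOx-limited", "VOC-limited")

# Example with a few hypothetical surface cells (ppb).
r_hno3, r_noz = indicator_ratios(o3=[90, 120], hno3=[4, 12], noy=[20, 40], nox=[8, 30])
print(r_hno3, r_noz, classify_sensitivity(r_noz))
```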
Category (4), Model Systems analysis, assesses the full model as a system, and is intended to provide insight about the model's behavior over a wide range of simulations. These tests will involve both model-to-model comparisons and internal comparisons of one model's structure and response characteristics. Three types of Model Systems analyses are proposed: sensitivity analysis and uncertainty analysis, both including principal components analysis and supported by additional response surface testing and process analysis for these model-only tests; and, where necessary, a bounding analysis. These types are briefly described below.

Sensitivity analysis, testing effects of particular parameterizations in a model or analyzing the response of a model to changes in input variables, is a common evaluation element for model development purposes, and should be more effectively included in Integrated/Diagnostic evaluations as well. The chemical mechanism, for example, has frequently been tested with sensitivity analyses using the change in predicted [O3] as the endpoint for variations in VOC speciation (Harley et al. 1993), and for photolysis rates and reaction yields in the mechanism (Gao et al. 1995, 1996).

Uncertainty analysis, too, has frequently been used in research-level evaluations of a model. One example (described above) is the work of Li et al. (1998) using RADM to analyze the effect of meteorology and emissions uncertainties on O3 production and control strategy selection. Hanna et al. (1998) have recently extended the use of uncertainty analysis by addressing multivariable input uncertainties using Monte Carlo techniques with the full model. These are early results, though, and have not been worked up with quantitative likelihoods or fully evaluated against existing single-variable uncertainty studies, but they are promising as potential indicators of the uncertainty range of Model Systems functioning.
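A minimal sketch of the Monte Carlo approach to multivariable input uncertainty is given below. The model driver is represented by a toy stand-in so that the example runs end to end; the choice of sampled inputs, the lognormal perturbation factors, and the wrapper function name are assumptions of this illustration, not the design used by Hanna et al. (1998).

```python
# Hedged sketch of Monte Carlo input-uncertainty analysis for a full AQM.
# run_aqm() is a toy stand-in for a full model run; perturbation ranges are
# illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(seed=0)

def run_aqm(nox_scale, voc_scale, wind_scale):
    """Stand-in returning a peak [O3] value (ppb) for scaled inputs.
    A real study would call the actual model driver here."""
    return 120.0 * (voc_scale ** 0.4) * (nox_scale ** 0.2) / (wind_scale ** 0.3)

def monte_carlo_peak_o3(n_samples=100):
    """Sample multiplicative perturbations of uncertain inputs and
    summarize the resulting spread in predicted peak [O3]."""
    peaks = []
    for _ in range(n_samples):
        nox = rng.lognormal(mean=0.0, sigma=0.3)    # NOx emissions factor
        voc = rng.lognormal(mean=0.0, sigma=0.5)    # VOC inventory factor (large uncertainty)
        wind = rng.lognormal(mean=0.0, sigma=0.2)   # wind speed factor
        peaks.append(run_aqm(nox, voc, wind))
    return np.percentile(peaks, [5, 50, 95])        # uncertainty range of peak [O3]

print(monte_carlo_peak_o3())
```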
Tests for the integrated/Diagnostic evaluations of AQMs will require special observational data that are not now routinely available, it is hoped that the description in this paper of how those data might be used in a more advanced evaluation of a model indicates the importance of such data, and demonstrates the need for continued research and development on accurate and reliable measurement techniques. Active cooperation between model evaluators and measurement developers is crucial to the success of diagnostic evaluation, and for the continued improvement of model performance. References Atkinson, R„ H. E. Jeffries, G. 2, Whitten, and F. L. Lurmann, 1987: Proceedings of the Workshop on Evaluation/Documentation of Chemical Mechanisms. USEPA, Research Triangle Park, NC. Dennis, R. L., W. R. Barchet, T. L. Clark, S. K. Seilkop, and P. M. Roth, 1990: Evaluation of Regional Acidic Deposition Models (Part I), And Selected Applications of RADM (Part II). National Acid Precipitation Assessment Program, Washington, DC. Fox, D. G., 1981: Judging air quality model performance. Bulletin of the American Meteorological Society, 62, 599-609. , 1984: Uncertainty in air quality modeling: A summary of the AMS workshop (September 1982, Woods Hole, MA) on quantifying and communicating model uncertainty. Bulletin of the American Meteorological Society, 65, 27-36. Gao, D., W. R. Stockwell, and J. B. Milford, 1995: First- order sensitivity and uncertainty analysis for a regional-scale gas-phase chemical mechanism. Journal of Geophysical Research ,100, 23153-66. , 1996: Global uncertainty analysis of a regional-scale gas-phase chemical mechanism. Journal of Geophysical Research, 101, 9107-19. Hanna, S. R, J. C. Chang, and M. E. Fernau, 1998: Monte Carlo estimates of uncertainties in predictions by a photochemical grid model (UAM-IV) due to uncertainties in input variables. Atmospheric Environment [submitted February 1997]. Harley, R. A., A. G. Russell, and G. R. Cass, 1993: Mathematical modeling of the concentrations of volatile organic compounds: Model performance using ------- a lumped chemical mechanism. Environmental Science and Technology, 27,1638-49. Li, V., R. L. Dennis, G. S. Tonnesen, and J. E. Pleim, 1998: Regional ozone concentrations and production efficiency as affected by meteorological parameters in the Regional Acid Deposition Modeling system. Preprints from the 10th Joint Conference on the Applications of Air Pollution Meteorology with the AWMA (Phoenix, AZ, January 1998), AMS. Parrish, D. D., M. P. Buhr, M. Trainer, R. B. Norton, J. P. Shimshock, F. C. Fehsenfeld, A. G. Aniauf, J. W. Bottenheim, Y. Z. Tang, H. A. Wiebe, J. M. Roberts, R. L. Tanner, L. Newman, V. C. Bowersox, K. J. Oiszyna, E. M. Bailey, M. O. Rodgers, T. Wang, H. Berresheim, U. K. Roychowdhury, K. Demerjian, 1993: The total reactive oxidized nitrogen levels and the partitioning between the individual species at six rural sites in Eastern North America. Journal of Geophysical Research, 98, 2927-39. Reynolds, S. D., P. M. Roth, and T. W. Tesche, 1994: A Process for the Stressful Evaluation of Photochemical Model Performance. Western States Petroleum Association, Giendale, CA. Seinfeld, J. H., 1988: Ozone air quality models: A critical review. Journal of the Air Pollution Control Association ,38, 616-45. Sillman, S., 1995: The use of NOY, Hs02, and HN03 as indicators for Os-NOx-ROG sensitivity in urban locations. Journal of Geophysical Research, 100, 14175-88. Sillman, S., D. Y. He, M. R. Pippin, P. H. Daum, J. H. Lee, L. I. 
Sistla, G., N. Zhou, W. Hao, J. Ku, S. T. Rao, R. Bornstein, F. Freedman, and P. Thunis, 1996: Effects of uncertainties in meteorological inputs on urban airshed model predictions and ozone control strategies. Atmospheric Environment, 30, 2011-55.

Tesche, T. W., F. L. Lurmann, P. M. Roth, P. Georgopoulos, J. H. Seinfeld, and G. R. Cass, 1990: Improvement of Procedures for Evaluating Photochemical Models. California Air Resources Board, Sacramento, CA.

Tesche, T. W., and D. E. McNally, 1995: Assessment of UAM-IV Model Performance for Three St. Louis Ozone Episodes. Alpine Geophysics, LLC, Covington, KY.

Tesche, T. W., and D. E. McNally, 1996: Evaluation of the MM5 Model for Three 1995 Regional Ozone Episodes over the Northeast United States. Alpine Geophysics, LLC, Covington, KY.

Tesche, T. W., P. M. Roth, S. D. Reynolds, and F. W. Lurmann, 1992: Scientific Assessment of the Urban Airshed Model (UAM-IV). Alpine Geophysics, Crested Butte, CO.

Tonnesen, G. S., and R. L. Dennis, 1998a: Analysis of radical propagation efficiency to assess ozone sensitivity to hydrocarbons and NOx. Part 1: Local indicators of instantaneous odd oxygen production sensitivity. Journal of Geophysical Research [submitted July 1997].

Tonnesen, G. S., and R. L. Dennis, 1998b: Analysis of radical propagation efficiency to assess ozone sensitivity to hydrocarbons and NOx. Part 2: Long-lived species as indicators of ozone concentration sensitivity. Journal of Geophysical Research [submitted August 1997].

Trainer, M., D. D. Parrish, M. P. Buhr, R. B. Norton, F. C. Fehsenfeld, K. G. Anlauf, J. W. Bottenheim, Y. Z. Tang, H. A. Wiebe, J. M. Roberts, R. L. Tanner, L. Newman, V. C. Bowersox, J. F. Meagher, K. J. Olszyna, M. O. Rodgers, T. Wang, H. Berresheim, K. L. Demerjian, and U. K. Roychowdhury, 1993: Correlation of ozone with NOy in photochemically aged air. Journal of Geophysical Research, 98, 2917-25.

USEPA, 1991: Guideline for Regulatory Application of the Urban Airshed Model. USEPA OAQPS, Research Triangle Park, NC.

Whitten, G. Z., 1983: The chemistry of smog formation: A review of current knowledge. Environment International, 9, 447-63.

Acknowledgements: J. R. Arnold's support is provided by the NOAA/EPA Postdoctoral Program administered by the University Corporation for Atmospheric Research. Gail S. Tonnesen is a National Research Council Postdoctoral Fellow. This paper has been reviewed in accordance with the U.S. Environmental Protection Agency's peer review policies and approved for presentation and publication. Mention of trade names or commercial products does not constitute endorsement or recommendation for use.