White Paper on the Nature and Scope of Issues on Adoption of Model Use Acceptability Guidance Prepared by The Science Policy Council Model Acceptance Criteria and Peer Review White Paper Working Group May 4, 1999 ------- TABLE OF CONTENTS EXECUTIVE SUMMARY 4 1. INTRODUCTION 7 1.1 Context 7 1.2 Role of the Science Policy Council 8 1.3 Purpose of the White Paper 10 2. BACKGROUND 11 2.1 Current Practices in Model Evaluation 11 2.1.1 Office of Air and Radiation (OAR) 11 2.1.2 Office of Solid Waste and Emergency Response (OSWER) 16 2.1.3 Office of Water (OW) 17 2.1.4 Office of Prevention, Pesticides, and Toxic Substances (OPPTS) 17 2.1.5 Office of Research and Development (ORD) 18 2.2 Summary 19 3. OPTIONS FOR SPC INTERACTION WITH CREM 20 3.1 Options Considered by the Task Group 20 3.2 Task Group Recommendations 21 4. PROPOSED SCOPE, APPROACH, AND SUPPORTING ANALYSIS FOR THE GENERAL GUIDANCE 22 4.1 Scope - Handling Qualitative Uncertainty Issues in Peer Review 22 4.2 Approach 23 4.2.1 Strategy for Model Evaluation 23 4.2.2 Strategy for Defining Uncertainty in Model Elements 24 4.3 Supporting Analysis 24 4.3.1 Part 1 - Defining the Objectives for the Model 26 4.3.2 Part II - Analysis of Model Uncertainty 26 4.3.3 Part in - The Overall Assessment 28 5. POSSIBLE FOLLOW-UP ACTIONS 29 5.1 Additional Support Work Needed 29 5.2 Suggested Follow-up Actions 30 6. QUESTIONS POSED BY THE COORDINATING COMMITTEE 32 2 ------- 7. COST AND BENEFITS DISCUSSION 36 APPENDIX A- SCIENCE POLICY COUNCIL MODEL ACCEPTANCE CRITERIA WHITE PAPER GROUP MEMBERS 37 APPENDIX B- MODELS EVALUATION CASE HISTORIES 38 APPENDIX C- MODEL VALIDATION PROTOCOL 48 APPENDIX D - REQUIREMENTS FOR A "STATE OF THE ART" EVALUATION PROCESS 56 APPENDIX E- TYPES OF MODELS USED BY EPA 57 APPENDIX F- NONPOINT SOURCE MODEL REVIEW EXAMPLE 58 APPENDIX G - COST ESTIMATES 65 APPENDIX H - REFERENCES 66 3 ------- EXECUTIVE SUMMARY 1. What is the purpose of this white paper? The initial Coordinating Committee to the Science Policy Council (SPC) proposed an examination of options for implementing Agency Task Force on Regulatory Environmental Modeling (ATFERM) recommendations on model acceptance criteria (MAC) and peer review. The Science Policy Council Steering Committee (SPC-SC) accepted this proposal as a means of providing background for decisions on SPC direct involvement in the Models 2000 efforts and the proposed Committee on Regulatory Environmental Modeling (CREM). An Agency-wide task group was formed to consider the options for implementing the ATFERM proposal, and to publish the findings in this white paper. 2. How are the options for interpreting and executing ATFERM recommendations affected by current developments in modeling? In reviewing current Agency practices in model evaluation and peer review, including evaluating several case histories, the task group observed that models are more diverse and complex than in 1993 when the ATFERM final report was written. Therefore, the task group determined that the adequacy of the ATFERM criteria should be reexamined. In the ATFERM report, environmental models were defined in terms of fate and transport, estimation of contaminant concentrations in soil, groundwater, surface water, and ambient air in exposure assessment. Current models used by the Agency range from site-specific to regional in scale; from single pathway and contaminant to multi- pathway and multi-contaminant in operational scope; and from estimating simple exposure results to providing input to complex risk assessments or comparison of management options in function (e.g., "model systems" with component modules and even algorithms uniquely assembled only at the time of application). In 1993, model evaluation and selection were largely Agency functions, whereas inter- agency efforts to develop and use shared models are now more common. 3. What implementation option is recommended to the SPC? The SPC should engage in direct interaction with the CREM to provide updated general guidelines on MAC to maintain consistency across the Agency (see Section 3.1 for a discussion of other potential options). Guidelines were recommended as a substitute for criteria since guidelines would not seem overly bureaucratic to Agency staff or expose the Agency to unnecessary legal challenges regarding model use, but would promote consistency in model evaluation and selection across the Agency. Choices of models to use for environmental decision-making would be left to the program managers, recognizing that model acceptability is related to the specific use of the model and the acceptability of the risk in decision-making due to uncertainty and variability in model inputs. Program managers would be responsible for providing accessible documentation evaluating any model they use. It is anticipated that eventually the CREM would set up a process for periodic review of selected models to provide feedback to Agency senior management on overall consistency of response to the general guidelines. 4 ------- 4. How should the general guidance be developed? Guidelines should be developed for the various types of peer review that will be used by the Agency in its three-part assessment of models: (1) definition of the objectives, (2) analysis of model uncertainty, and (3) overall assessment. The MAC need to reflect the "state of the art" and be incorporated into an Agency-wide model evaluation strategy that can accommodate different model types and their uses. Heretofore, evaluation criteria did not set explicit specifications that a model must achieve to be suitable for an application. In an integrated assessment of model uncertainty, it is important that explicit specifications be set by program managers for each element of uncertainty (both qualitative and quantitative specifications). The development of element specifications may be influenced by the need to allow for variations in the overall approach, complexity, and purpose of models used by EPA. Using standard model evaluation elements to determine how a model should be assessed could provide a comprehensive integration of the specific model evaluation components into a framework forjudging what constitutes a valid model. 5. What follow-up activities should be pursued? Additional support work is needed in the following areas: 1) analysis of uncertainty, 2) model inventory, 3) multi-media and multi-contaminant model evaluation, 4) comparability of evaluation standards between models. Suggested follow-up actions include: 1) determining the form, resources needed, and housing of CREM, 2) directing CREM's work toward issuing guidance on "how" to evaluate and characterize models to support the strategy for model evaluation, as opposed to only listing "what" to do, 3) developing and utilizing a model clearinghouse with information on model evaluation results, availability and applications experience, 4) integrating peer review guidance and supporting aspects of QA/QC, 5) preparing case studies to serve as examples of how models used in regulatory decision-making can be evaluated and the value added by such evaluations, and 6) producing a glossary for "state of the art" general guidance to clarify model terminology. 6. What are the roles of QA and peer review, and will a "clearinghouse" be developed? QA and peer review requirements are imposed to avoid modeling errors that could result in costly mistakes. According to the SPC Peer Review Handbook (EPA 100-B-98-001), models generally should be peer reviewed. Peer review provides an expert and independent third party evaluation that cannot be provided by stakeholder or public comment. SPC interpretation of ATFERM recommendations would help to clarify what model evaluation records are needed (e.g., code verification, testing results, model selection and the application process). The model evaluation strategy proposed in this document could provide a process tailored to the nature of the predictive task and the magnitude of the risk of making a wrong decision. The proposed strategy could also clarify the complementary roles of QA and peer review tasks in model evaluation and the basis for guidance on QA Project Plans for model development and application. 5 ------- It is also recommended that the creation of an Agency-wide clearinghouse for models be investigated, since it would provide a means to access model evaluation information while leveraging resources of single organizations and avoiding duplication of effort. Responding to a need perceived by the participants of the 1997 Models 2000 Conference, an action team was formed to consider the development of a Modeling Clearinghouse. To meet the Paperwork Reduction Act of 1980 and OMB Circular A-130 requirements, the Office of Information Resources Management (OIRM) is proposing to develop an Application Systems Inventory (ASI) as a repository of information about Agency software. Another effort, by ORD's National Center for Environmental Assessment, is defining metadata for models that can be stored in its relational database (the Environmental Information Management System) with input through the internet and retrieval through a search engine using specific queries. In addition, a strategy for communication needs to be developed for the public and others, like model users, to provide feedback to the EPA, possibly through the Internet at sites providing information on models and their evaluation for EPA use. 7. What are the benefits and related costs involved in model evaluation? Evaluation of models during development and peer review would incur costs, but would result in better products. In broader terms, evaluation would also promote systematic management of model development and use within EPA by providing a basis for consistent model evaluation and peer review. The proposed model evaluation strategy would encourage sensitivity and uncertainty analyses of environmental models and their predictions as well as clarify peer review requirements. Access to evaluation results would offer opportunities for improved model selection, elimination of redundant model development and evaluation, and enhanced Agency credibility with external stakeholders. Evaluation and peer review of model application would offer feedback to model developers, hopefully resulting in improved model performance. 6 ------- 1. INTRODUCTION 1.1 Context For many years, the Science Advisory Board (SAB) has been actively advising the Agency on the use of computer models in environmental protection. In 1989, after reviewing several models, SAB offered general advice in its first commentary or resolution (EPA-SAB-EEC-89-012), recommending that "EPA establish a general model validation protocol and provide sufficient resources to test and confirm models with appropriate field and laboratory data" and that "an Agency-wide task group to assess and guide model use by EPA should be formed." In response, the Assistant Administrator for Research and Development (ORD) and the Assistant Administrator for Solid Waste and Emergency Response (OSWER) jointly requested the Deputy Administrator, as the chair of the former Risk Assessment Council (RAC), to establish a task force to examine the issues. The voluntary Agency Task Force on Environmental Regulatory Modeling (ATFERM) was created in March 1992 and completed a report in October 1993. In its report (EPA 500-R-94-001), the ATFERM noted that the Agency has no formal mechanism to evaluate model acceptability, which causes redundant inconsistent evaluations as well as uncertainty about acceptability of models being applied and the results that the Agency obtains with the models. The ATFERM report recommended establishment of acceptability criteria because "a comprehensive set of criteria for model selection could reduce inconsistency in model selection" and "ease the burden on the Regions and States applying the models in their programs". For Section 11 of their report, they drafted a set of "acceptability criteria." In Section 111, they provided the "Agency Guidance for Conducting External Peer Review of Environmental Regulatory Modeling," which was later issued in July 1994 by EPA's Deputy Administrator on behalf of the Science Policy Council (SPC) as EPA 100-B-94-001. ATFERM also proposed a charter for a Committee on Regulatory Environmental Modeling (CREM) to be created by the Deputy Administrator or the new SPC to carry on work begun by the ATFERM and to provide technical support for model users. This proposal was based on SAB recommendations (SAB-EC-88-040, SAB-EC-88-040A, and SAB-EEC-89-012). In its peer review of the "Agency Guidance for Conducting External Peer Review of Environmental Regulatory Modeling," SAB heartily endorsed the Agency's general approach to conducting peer review of environmental regulatory modeling (EPA-SAB-EEC-LTR-93-008). The Environmental Engineering Committee (EEC) noted the "most important element to the review process is the verification of the model against available data in the range of conditions of interest" with a discussion of compensating errors and suggested "some guidance needs to be provided as to what constitutes adequate model performance." The report also included specific recommendations on organizational and peer review processes. Later SAB asked to work with the Agency on clarifying its own role, along with other peer review mechanisms, to cover substantially new models, significant adaptations of existing models, controversial applications of existing models, and applications with significant impacts on regulatory decisions (EPA-SAB-EEC-COM-95-005). When implementation of these plans had faltered after several years, SAB urged the Agency to move forward in consolidating its gains in modeling. Their recommendation was echoed in a 1997 external review of "Plans to Address 7 ------- Limitations of EPA's Motor Vehicle Emissions Model" (GAO/RCED-97-210 p. 10). In part, as a result of SAB's urging, the Agency conducted the Models 2000 Conference in Athens, GA, in December 1997. Dr. Ishwar Murarka represented the SAB and made a presentation noting the increasing complexity of models. He also stressed the importance of verification and validation issues, sensitivity and uncertainty analyses, intra- and inter-Agency coordination, and the need for a peer review mechanism. Dr. Murarka's bottomline was that new approaches are needed to insure that models are developed, used, and implemented appropriately. The Models 2000 Steering/Implementation Team (SIT) is engaged in an on-going SAB consultation with the Environmental Models Subcommittee on the Agency's modeling efforts that began in May 1998. Recent discussions have revealed that 1. the Agency wouldbenefitfrom an integrated strategy mechanismfor dealingwith computer models development and use within the Agency or across agencies. 2. the Agency is developing multi-media, multi-pathway models indifferent program offices for different purposes and SAB has initiated an Advisory on a module of one of the models (i.e., TRIM.ball1]). 3. a SAB consultation was requested on the follow-up activities of the Models 2000 workshop, on establishment of the ("REM, and on the Agency's goals and objectives in establishing the model acceptability criteria and peer review guidelines. 1.2 Role of the Science Policy Council The Science Policy Council (SPC), including its associated Steering Committee (SPC-SC), was established by the Administrator as a mechanism1 for addressing EPA's many significant policy issues that go beyond regional and program boundaries. They noted that the development and application of environmental regulatory models (ERMs) must be viewed within the larger framework of the risk assessment-risk management (environmental decision-making) paradigm currently utilized by the Agency. Ultimately models can be seen as tools with which risk assessments are performed to support risk management decisions. Therefore, it is critical that the purposes, limitations, and uncertainties inherent in an environmental model be understood by the risk assessor applying the model to a risk concern and the risk manager who depends upon the outputs from a model in decision- making. They also need assurance that the model is being utilized consistently across the Agency. Further, it is vital that the process by which a model is developed, the criteria for evaluating its credibility (mathematical validity, approximation to field results, application to different scenarios, etc.) be accessible to the outside world for objective analysis (e.g., external peer review), and to assessment As such, its goal is the integration of policies that guide Agency decision-makers in their use of scientific and technical information. The SPC works to implement and ensure the success of initiatives recommended by external advisory bodies such as the National Research Council (NRC) and SAB, as well as others such as the Congress, industry and environmental groups, and Agency staff 8 ------- by the public2 at large. Also as modeling becomes more sophisticated and bridges multi-media, multi- pathways, multi-endpoints, it demands a higher degree of technical expertise and training from the EPA. The recently issued SPC Peer Review Handbook (USEPA, 1998) incorporates the earlier ATFERM guidance on the peer review of ERMs (on the EPA website http://www.epa. gov/ORD/spc). The ATFERM guidance states that "...environmental models...that may form part of the scientific basis for regulatory decision-making at EPA are subject to the peer review policy...and...this guidance is provided as an aid in evaluating the need and, where appropriate, conducting external peer review related to the development and/or application of environmental regulatory modeling." The guidance further defines what is encompassed by peer review of model development ' and applications.4 The guidance describes the steps in the external peer review process, the mechanisms, general criteria, and documentation for conducting external peer review and the specific elements of peer review. The elements include: 1) model purpose, 2) major defining and limiting considerations, 3) theoretical basis for the model, 4) parameter estimation, 5) data quality/quantity, 6) key assumptions, 7) model performance measures, 8) model documentation including users' guide, and 9) a retrospective analysis. The guidance does not specifically address externally funded model peer review, but Agency policy5 being developed for implementation of the Cancer Guidelines should provide a useful precedent. The Risk Assessment Forum (RAF) was established to promote scientific consensus on risk assessment issues and to ensure that this consensus is incorporated into appropriate risk assessment guidance. RAF recently convened a workshop on Monte Carlo analysis (EPA/630/R-96/010), and acting upon the recommendations from the workshop, developed a set of guiding principles to guide agency risk assessors in the use of probabilistic analysis tools. The tools were also provided to support adequate characterization of variability and uncertainty in risk assessment results (e.g., sensitivity analyses). Policy on acceptance of risk assessments was also developed. It requires that the methods used be documented sufficiently (including all models used and all data upon which the assessment is based and all assumptions impacting the results) to allow the results to be independently reproduced. Stakeholder involvement (i.e., involvement by those interested or affected entities) in the development of ERMs within the environmental decision-making framework is both desirable and necessary. It is desirable because often the regulated industries or other affected groups have special insight or expertise into parameters, e.g., the industrial process, exposure issues, place or media- based concern, which must be integrated into the EPA model. Their participation provides a value-added dimension to the development process and enhances the chances of model acceptance and/or public credibility. It is necessary for the obvious reason of foregoing possible lawsuits and because the public is insisting upon a greater involvement in the decision-making process (Presidential Commission, 1997; NRC, 1996). Models developed to support regulatory decision-making or research models expanded to develop scientific information for Agency decision-making would be subject to the peer review guidance. 4 Normally, the first application of a model should undergo peer review. For subsequent applications, a program manager should consider the scientific/technical complexity and/or novelty of the particular circumstances as compared to prior applications. Peer review of all similar applications should be avoided because this would likely waste precious time and monetary resources while failing to provide the decision-maker with any new relevant scientific information. Nevertheless, a program manager may consider conducting peer review of applications upon which costly decisions are based or applications which are likely to end up in litigation. ^ The specific details are not yet available, but external stakeholder groups funding peer reviews of ERMs for Agency use will be expected to generally adhere to the same procedures that the EPA is following. 9 ------- 1.3 Purpose of the White Paper In follow-up to the Models 2000 Conference, the initial Coordinating Committee to the SPC proposed that an examination of options for implementing the ATFERM recommendations on model acceptance criteria and peer review, and to publish the findings in a white paper. The SPC-SC accepted the proposal as a means of providing background for decisions on SPC direct involvement in the Models 2000 effort and the proposed CREM. An Agency-wide task group representing the National Program offices, Regions, ORD and EPA's Quality Assurance Division was assembled (see Appendix A) to consider whether the ATFERM proposal should be carried out and, if so, with what criteria. For further clarification, the following questions raised by the ad hoc Coordination Committee are answered (Section 6): 1. How do the issues of Peer Review (external/internal) and QA/QC evaluation relate to acceptability determination? 2. What is a consensus definition of model use acceptability criteria? 3. Does acceptability correspond to a particular model, or specific applications of a model? 4. Does acceptability cover only models developed by EPA or can it cover externally developed models? 5. Does acceptability mean the agency will develop a "clearinghouse" of models that meet EPA's definition of acceptable? 6. Would each program/region develop their own system for evaluating acceptability? 7. Should EPA apply a generic set of criteria across the board to all categories of environmental regulatory models (ERMs) or should acceptability criteria differ depending on the complexity and use (e.g., screening vs. detailed assessment) of a model? 10 ------- 2. BACKGROUND 2.1 Current Practices in Model Evaluation The task group first wanted to establish a background in order to address whether or not model acceptance criteria (MAC) should be adopted, and if so, how to define them. They wanted to know if the context for the MAC had changed. Available information focusing on the last five years since the ATFERM report was written, along with case histories in model evaluations (Appendix B), are summarized in the following sections: 2.1.1 Office of Air and Radiation (OAR) 1) The Office of Air Quality Planning and Standards (OAQPS) OAQPS supports implementing the Clean Air Act (CAA) air quality modeling requirements, which includes several mechanisms to assist the Regional Offices and state and local air pollution control agencies in approving and/or developing models and modeling techniques for air quality dispersion applications. It has weathered the test of time and continues to meet the Regions' needs, therefore recommendations to change its current emphasis or mode of operation are not needed. The implementation process includes the following: a) Appendix W to Part 51: Guideline on Air Quality Models (Guideline) of 40 Code of Federal Regulations The Guideline promotes consistency in the use of air quality dispersion modeling within the air management programs. It also recommends air quality modeling techniques that should be applied to State Implementation Plan (SIP) revisions and new source reviews, including prevention of significant deterioration (PSD) permits. A compendium of models and modeling techniques acceptable to EPA is provided. The recommended models are subjected to peer scientific review and/or a public comment and review process. The Guideline specifically addresses the use of alternative models or techniques, if an EPA preferred model or procedure is not appropriate or available. Several revisions to the guideline have occurred over the years. New modeling paradigms or models proposed for the Guideline are required to be technically and scientifically sound; undergo beta testing, model performance evaluation against applicable EPA preferred 11 ------- models, and use of field studies or fluid modeling evaluation; documented in a user's guide within the public domain; and undergo some level of peer review. The conferences and modeling revisions are announced and proposed through the Federal Register. Modeling revisions then become a part of the regulatory process after publication of a final notice that addresses public comments and EPA's responses. Many new modeling techniques have successfully made the transition to some level of Guideline acceptance. Model Clearinghouse Section 301 of the CAA requires a mechanism for identifying and standardizing inconsistent or varying criteria, procedures, and policies being employed in implementing and enforcing the CAA. The Regions are responsible for ensuring that fairness and uniformity are practiced. The Model Clearinghouse was created many years ago to support these requirements. It is a Regional resource for discussing and resolving modeling issues and for obtaining modeling guidance and modeling tools. A primary purpose of the Model Clearinghouse is to review the Regional Offices' positions on non-guideline models or alternative techniques. The Clearinghouse reviews each referral for national consistency before final approval by the Regional Administrator. This requires an historical cognizance of the Clearinghouse on usage of non-guideline techniques by the Regional Offices and the circumstances involved in each application. In FY-1981, the Clearinghouse began to maintain paper files of referrals from the Regional Offices. These files document the usage of non-guideline models and alternative techniques. The information in the files is summarized and communicated to the Regional Offices periodically to increase awareness of any precedents when reviewing state or industry proposals to apply non-guideline models or alternative techniques. After a few years, the Model Clearinghouse Information Storage and Retrieval System (MCHISRS) was designed. This is a database system to manage information about Regional inquiries involving the interpretation of modeling guidance for specific regulatory applications. The computer database was recently placed on the SCRAM BBS for wider dissemination and access. The MCHISRS includes key information involving States, pollutants, models, terrain type, and so on, plus a narrative summary. The summary includes a statement of the modeling or modeling related issue involved and the Clearinghouse position on the issue. Any users can now examine the 12 ------- historical records to determine impact on their particular application. The Clearinghouse is always accessible to the Regions. A mechanism for timely review of unique modeling applications or nonguideline modeling techniques with a view toward nationwide precedent and consistency is professionally achieved with the Clearinghouse. Support Center for Regulatory Air Modeling Internet Web site, (SCRAM Internet Web site) An important by-product of the Model Clearinghouse is the development and maintenance of the SCRAM Internet website. This bulletin board has been in existence and accessible to the public for at least a decade. The SCRAM Internet website represents an electronic clearinghouse for the Agency and provides a critical linkage to the Regions and the public. It maintains a historical record of guidance on generic issues and other policy memoranda and is updated as needed. All of the Agency's models and user's guides that are recommended for regulatory use are maintained through OAQPS and available at all times through the SCRAM Internet website. In addition to the Agency preferred models, a variety of alternative models and support computer programs are available through SCRAM Internet website. This Internet website also provides complete and timely documentation of not only the revisions to these models but documentation on why they were needed and their effects on model performance. Although the basic format of the Internet website has not changed significantly, changes are made to better meet the needs of the customers and the ever broadening scope of air dispersion modeling. The recent move of the bulletin board to the Internet is just one example of how OAQPS works to improve accessibility of this system. The SCRAM Internet website is one of the most user-friendly bulletin boards on the INTERNET. It appears that the majority of the Regions' needs that are related to the successful implementation of the CAA air quality dispersion modeling regulations and requirements are met by the Clearinghouse and the SCRAM Internet website. Conferences on Air Quality Modeling Section 320 of the CAA requires EPA to conduct a conference 13 ------- on air quality modeling at least every three years. The Seventh Modeling Conference on Air Quality Modeling should be held this year. The conference provides a forum for public review and comment on proposed revisions to the Guideline. e) Periodic Modeling Workshops Finally, an annual workshop is held with the EPA Regional Meteorologists and state and local agencies to ensure consistency and to promote the use of more accurate air quality models and databases for PSD and SIP-related applications. Office of Radiation and Indoor Air (ORIA) Intra- and Inter-Agency cooperative efforts developed technical guidance on model selection and evaluation through a joint Interagency Environmental Pathway Modeling Working Group. The group was established by the EPA Offices of Radiation and Indoor Air (ORIA) and Solid Waste and Emergency Response (OSWER), the Department of Energy (DOE) Office of Environmental Restoration and Waste Management, and the Nuclear Regulatory Commission (NRC) Office of Nuclear Material Safety and Safeguards. Their purpose was to promote more appropriate and consistent use of mathematical environmental models in the remediation and restoration of sites contaminated by radioactive substances. First, the EPA, DOE, and NRC working group sponsored a mail survey in 1990 and 1991 to identify radiologic and non-radiologic environmental transfer or pathway computer models that have been used or are being used to support cleanup of hazardous and radioactive waste sites. The intent of the survey was to gather basic administrative and technical information on the extent and type of modeling efforts being conducted by EPA, DOE, and NRC at hazardous and radioactive waste sites, and to identify a point of contact for further follow-up. A report, Computer Models Used to Support Cleanup Decision-Making at Hazardous and Radioactive Waste Sites, was published (EPA 402-R-93-005, March 1993) to provide a description of the survey and model classification scheme, survey results, conclusions, and an appendix containing descriptions and references for the models reported in the survey. Later, reports resulting from the working group's efforts were published (described below) to be used by technical staff responsible for identifying and implementing flow and transport models to support cleanup decisions at hazardous and radioactive waste sites. One report, Environmental Pathway 14 ------- Models: Ground-Water Modeling in Support of Remedial Decision-Making at Sites Contaminated with Radioactive Material, (EPA 402-R-93-009, March 1993) identified the role of, and need for, modeling in support of remedial decision making at sites contaminated with radioactive materials. It addresses all exposure pathways, but emphasizes ground-water modeling at EPA National Priority List and NRC Site Decommissioning Management Program sites. Its primary objective was to describe when modeling is needed and the various processes that need to be modeled. In addition, the report describes when simple versus more complex models may be needed to support remedial decision making. A Technical Guide to Ground-Water Model Selection at Sites Contaminated with Radioactive Substances (EPA 402-R-94-012, September 1994) was prepared to describe methods for selecting ground-water flow and contaminate transport models. The selection process is described in terms of matching the various site characteristics and processes requiring modeling and the availability, reliability, validity, and costs of the computer codes that meet the modeling needs. Another report, Documenting Ground-Water Modeling at Sites Contaminated with Radioactive Substances, (EPA 540-R-96-003, January 1996) provided a guide to determining whether proper modeling protocols were followed, and, therefore, common modeling pitfalls avoided. The problems were noted in a review of 20 site-specific modeling studies at hazardous-waste remediation sites (Lee et al., 1995). The review cited problems in 1) misunderstanding of the selected model, 2) improper application of boundary conditions and/or initial conditions, 3) misconceptualization, 4) improper or unjustifiable estimation of input data, 5) lack of or improper calibration and verification, 6) omission of or insufficient sensitivity and uncertainty analysis, and 7) misinterpretation of simulation results. Any of these errors could impact remedial and risk decisions. As a guide to modelers, this report demonstrates a thorough approach to documenting model applications in a consistent manner. A proper documentation of modeling results was found to answer the following questions: Do the objectives of the simulation correspond to the decision-making needs? Are there sufficient data to characterize the site? Is the modeler's conceptual approach consistent with the site's physical and chemical processes? Can the model satisfy all the components in the conceptual model, and will it provide the results necessary to satisfy the study's objectives? Are the model's data, initial conditions, and boundary conditions identified and consistent with geology and hydrology? 15 ------- Are the conclusions consistent with the degree of uncertainty or sensitivity ascribed to the model study, and do these conclusions satisfy the modeler's original objectives? The approach recommended for evaluating models consists of three steps: (1) determining one's objectives and data requirements for the project; (2) properly developing a conceptual model for the site, which describes the physical and chemical system that must be simulated; and (3) selecting and applying the model in a manner consistent with the objectives and the site's known physical characteristics and input variables. 3) The Office of Atmospheric Programs See RADM case history in Appendix B. 4) The Office of Mobile Sources (OMS) OMS's model evaluation includes extensive stakeholder review, increased amounts of external peer review, and alpha- and beta- testing of models. Recent efforts have included using the ATFERM guidance and the Agency's peer review policies for conducting more extensive peer review and model evaluation. In addition, efforts are underway to determine the best and most efficient way possible to determine uncertainties in models. 2.1.2 Office of Solid Waste and Emergency Response (OSWER) In 1989, OSWER undertook a study to examine its modeling environment. OSWER staff found more than 310 models in use in the hazardous waste and Superfund programs. Many of the earlier models were written in Fortran. The newer models, many written to run on microcomputers, used a variety of languages and tools. These models varied in their applications and design. Efforts to verify, validate, and select models were inconsistent with little overall guidance and user support. The report concluded with three recommendations: Task Area 1: Initiation, Additional Study, and Preparation of a Management Plan. Task Area 2: Development of Guidance for Modeling. Task Area 3: Establishment of User Support Network for HW/SF Modeling. This study prompted OSWER's leadership in the development of the 16 ------- subsequent Agency Report of the Agency Task Force on Environmental Regulatory Modeling (EPA 500-R-94-001) and the Guidance for Conducting External Peer Review of Environmental Regulatory Models (EPA 100-B-94-001). The situation today has become even more complex with the advent of microcomputers and fourth generation languages that facilitate rapid development of computer programs. However, most of the challenges that faced EPA when the OSWER modeling study was undertaken still exist today. For example, the threat of legal challenge to the use of models for regulatory applications continues. Recently, a Validation Strategy has been developed for the IEUBK model (EPA/540/R.-94-039). The approach emphasizes verification, validation and calibration as previously established through the ATFERM report for environmental exposure models even though the model is for blood lead levels in children rather than exposure assessment. It uses 4 components: 1. Scientific foundations of the model structure. Does the model adequately represent the biological and physical mechanisms of the modeled system? Are these mechanisms understood sufficiently to support modeling? 2. Adequacy of parameter estimates. How extensive and robust are the data used to estimate model parameters? Does the parameter estimation process require additional assumptions and approximations? 3. Verification. Are the mathematical relationships posited by the model correctly translated into computer code? Are model inputs free from numerical errors? 4. Empirical comparisons. What are the opportunities for comparison between model predictions and data, particularly under conditions under which the model will be applied in assessments? Are model predictions in reasonable agreement with relevant experimental and observational data? OSWER believes that at least some of these principles would also work for model applications. 2.1.3. Office of Water (OW) The Office of Science and Technology (OST) in OW is not currently evaluating any models along the lines of the ATFERM acceptance criteria. However, there are two models that will probably be put through the peer review process in the future. Aquatox, is being internally evaluated and later the model will be peer reviewed using 17 ------- criteria like those in the ATFERM report. After it has completed development, CORMK will also be peer reviewed. Another effort is the review of the Basins version 2, an integrated package of a geographic information system, spatial and environmental data, customized analytical tools, watershed and receiving water quality models, and model post processing to be used in analysis of watersheds and preparation of reports and records for Total Maximum Daily Loads. The review objectives do not address evaluation of the models themselves. Long established models like HSPF and QUAL2E have been tested and peer reviewed in the past. However, past evaluation information may not be accessible (e.g., 1980 tests of HSPF that had 10 year record retention schedules). 2.1.4 Office of Prevention, Pesticides and Toxic Substances (OPPTS) OPPTS does not have a standard approach to model evaluation. Models have been largely developed by consultants with variable evaluation practices. Also a score of different models are used in OPPTS; they range from the trivial to the very complex and in each case the model evaluation depends on its complexity. For example, recently a large consulting firm developed a model to be used at OPPTS for the Acute Dietary Exposure Analyses and Risk Assessment. This model produces a realistic calculation of dietary exposure and includes a Monte Carlo analysis to estimate dietary exposure from potential residues of the total chemical residues in food. It also uses a huge data base that conveys the food consumption habits in the USA. The primary evaluation of the model was done following the Quality Assurance and Quality Control procedures of the vendor. A second in-house evaluation of the model has been conducted through peer review. Statisticians, managers, scientists, computer programmers, and outside consultants evaluated the algorithms of the model to reach a consensus that the model is correct and closely represents reality. However, no formal structured form of model validation (i.e., a mathematical validation) has been used on this particular model. A field verification of this model is not possible because of lack of data. The model validation process rests heavily on a balance reached through a consensus among the parties involved and a constant flow of information between the vendors, the reviewers, and the users. Ultimately, the Scientific Advisory Panel, an external peer review group mandated by the Federal Insecticide, Fungicide and Rodenticide Act (FIFRA) is responsible for reviewing major new models used by OPPTS. 2.1.5. Office of Research and Development (ORD) ORD has been involved in model evaluation as well as development in its support for the National Program Offices and Regions. Case histories in model evaluation (Appendix B) demonstrate a wide range of approaches from traditional calibration and validation (SHADE-HSPF and RADM) to benchmarking with other Federal Agencies (MMSOILS). Model systems are also being developed similar to 18 ------- that planned by OAQPS (TRIM.FaTE). Peer review mechanisms used in the case histories include internal review, review by EPA and non-EPA (e.g., DOE) advisory committees, and journal peer review of articles. In 1994, a protocol for model validation (in the Draft white paper "Model Validation for Predictive Exposure Assessments" see Appendix C) was prepared at the request of the Risk Assessment Forum. The protocol was developed from a design perspective to provide a consistent basis for evaluation of a model in performing its designated task reliably. A wide variety of evidence was proposed to inform the decision rather than just the conventional test with matching the model output to historical data (history matching). Therefore, the protocol could cover the case where predictions make extrapolations from past observations into substantially altered future situations. The protocol addressed three aspects: 1. The intrinsic properties of the model; 2. The nature of the predictive task; and 3. The magnitude of the risk of making a wrong decision. It was noted that models would differ in the level of evaluation possible. If the prediction task was routine where quantitative results of model performance could be evaluated, the risk of making a wrong decision would be low. If the prediction task was novel, where little previous experience existed on the model's performance, evidence would be more qualitative (e.g., peer reviewer's opinions on the composition of the model), the risk would be higher. The types of evidence supporting model evaluation were outlined and included: A) Structure - the conceptual basis (easier for established theories but hard to quantify uncertainty) and the way in which the constituent model mechanisms (hypotheses) are expressed mathematically and connected to one another; B) Complexity in number of its state variables and parameters (e.g., ecological models of environmental systems would have more hypotheses and it would be more difficult to detect a failure in any one hypothesis or to predict its impact on the prediction); C) Values of the parameters and the bands of uncertainty around them (related to the data quality) and extent of the observations; D) Sensitivity of the outputs of the model to changes in assigned values of each parameter; and E) History matching with field data which can include quantitative evaluation of model performance if criteria for success are provided. In modeling supporting regulatory uses, the evaluation process results in a choice to use a model as a tool for prediction. This emphasizes the perspective of the quality of the decision and tests the null hypothesis that the model adequately represented the process modeled until shown to be otherwise. The sum of the evidence 19 ------- would be used, however, methods for weighting types of evidence are needed. Unfortunately the term of the most knowledgeable advocate of the Risk Assessment Forum ended before action was taken and the other members did not pursue the protocol. Summary The various National Program Offices and the Office of Research and Development vary in their approach and practices in model evaluation as well as in the types of models they use. In our review of program information and model evaluation case histories (Appendix B), we noted that models are becoming more diverse ranging from site-specific to regional in scale; from single contaminant and pathway to multi- pathway and multi-contaminant in operational scope; and from estimating simple exposure results to providing complex risk assessments or comparisons of management options in function. They are also more complex as "model systems" are being developed with component modules and even algorithms that will be uniquely assembled only at the time of application. Inter-agency efforts are also more involved in evaluation and selection of models in shared projects. This situation varies from 1993 when the ATFERM final report was written which defined environmental models in terms of fate and transport, estimation of contaminant concentrations in soil, groundwater, surface water, and ambient air in exposure assessment (page III-l). 20 ------- 3. OPTIONS FOR SPC INTERACTION WITH CREM 3.1 Options Considered by the Task Group 1. Do nothing. This option implies that the current peer review policy, i. e., the SPC (Section III, ATFERM report) referenced in the SPC Peer Review Handbook (IJ SI-J'A, 1998), would remain the basis for decisions on mode! evaluation. This guidance recommends external peer review, which remains to be defined as to its precise nature, but would have to have some objective standards/criteria; furthermore, the ATFERM MAC would have to be sorted out as to their appropriateness present utility for external peer re vie vr. '/ his leads us back to the need for generally acceptable MAC for the Agency and to the repeated recommendations of the SABfor a mechanism such as ("REM to address Agency models. 2. Establish the CREMand have them responsiblefor reviewing "environmental"models for acceptability in regulatory use (e.g., policy, regulatory assessment, environmental decision-making) and list acceptable models as proposed in the A I I I: RMreport using criteria as listed in the A TFERM report or a revision of the criteria. The CREM would implement the model e valuation process and upgrade the model information system. This could be accomplished through either the models website or a clearinghouse library that provides information on how models satisfy the model acceptability criteria and access to the models. Model use acceptability criteria as listed in the A TFERM report or a revision of the criteria addressing changes in model evaluation practices and the increased complexity and range of models in use across the Agency, would be used as the standards for information reporting. Generic criteria with specific programmatic refinements (e.g., quantitative elements) could be envisioned. 3. Leave decisions on regulatory use to program managers (who provide specifications like quantitative criteria) and their technical staff but require accessible information responding to the model acceptance criteria. Again, this could be accomplished through either the models website or a clearinghouse library that provides information and access to the models. ModeI acceptability criteria as listed in the ATFERM report, or a revision of the criteria, addressing changes in model evaluation practices and the increased complexity and range of models in use across the Agency, would be used as the standards for information reporting. Revision of the A TFERM model acceptance criteria would be addressed through a phased evaluation (e.g., development evaluation with qualitative criteria then application with quantitative criteria related to regulatory application) and analysis of the most appropriate kinds ofpeer review to apply to model development and use. 21 ------- 22 ------- 3.2 Task Group Recommendations The task group recommends a combination of options 2 and 3. Decisions on regulatory use of models would be left to program managers recognizing model acceptability is related to its specific use, however, the Science Policy Council should engage in direct interaction with the CREM to provide updated general guidelines on model acceptance criteria to maintain consistency across the Agency. The program managers should respond to the updated general guidelines on model acceptance by developing specifications related to the model types and use in their programs and assuring model information responding to the criteria is accessible. Model acceptance criteria will help define general acceptability for model developers as well as assist users to select and apply models appropriately. The CREM should provide feedback to Agency senior management on consistency in response to the general guidance after periodic review of selected models. 23 ------- 4. PROPOSED SCOPE, APPROACH, AND SUPPORTING ANALYSIS FOR THE GENERAL GUIDANCE 4.1 Scope - Handling Uncertainties in Model Evaluation Model evaluation must address both qualitative and quantitative uncertainties (Beck et al. 1997). Qualitative uncertainty arises in the comparative analysis of the model's structure to the environmental component addressed in the regulatory task. Structure is the way the constituent components or hypotheses are expressed mathematically and connected. Each hypothesis can be judged untenable if model predictions are found to be incompatible with observations of reality. Finding invalid components or connections is more difficult in complex models. Evaluation of the key model components and their redundancy can help us discriminate between a match and a mis-match in qualitative observed behavior and the model structure. However, it is difficult to quantify the impact on results in predictions from structural errors. Quantitative uncertainty occurs in values assigned to model coefficients (parameterization) and is related to the amount and quality of relevant data (e.g., variation of contaminant concentration in time and space) available. Matching of the model's predictions (performance) to past observations, even when reasonable agreement is found, can mask real uncertainty in the model's approximation of the environment's behavior. Mis-characterization of one parameter can mask, or compensate for mis- characterization of one or more other parameters. So evaluation of the uncertainty in calibration of the parameters should be quantified (e.g., variances and covariances of parameter estimates or bands of uncertainty) to support model selection (e.g., match the regulatory task alternatives to a model only in areas where they are maximally insensitive to the known uncertainties). Thus a strategy can be developed identifying the objective evidence to be considered in model evaluation and how to represent the weight of evidence in the model's success or failure to perform its designated task. The overview of model evaluation above, however, does not address all problems that occur in the site specific application phase. Agency program managers need to need to be aware that other factors may also affect the viability of model predictions. Model outputs or predictions must be evaluated to identify erroneous and uncertain results from improper input data, improper boundary condition specification, unjustified adjustment of model inputs as well as violations of model assumptions and exercising of the model outside its proper domain and application niche (EPA-SAB-EEC-LTR- 93-008). Although peer review is mentioned as a touchstone in the model evaluation process by both the ATFERM report and Beck et al. (1997), difficulties in applying effective peer review should not be underestimated. First, model evaluation must be conducted for the range of processes from model construction through to the regulatory use of its outputs. For each process different peers may be appropriate. Because of the technical nature of model construction, there is a tendency to focus on model constructors as the principal, if not the only peer reviewers. This can emphasize a journal style 24 ------- peer review approach, which may be necessary, but is not sufficient according to the Peer Review Handbook. Second, peer review can rarely make a comprehensive analysis of a model including the technical aspects of its implementation. For example, an essential part of a model code verification includes a line-by-line analysis of computer code, which is not a task that a peer reviewer, in the traditional sense, is able to complete. These particular difficulties, when combined with the general difficulties known to exist with peer review such as apparent subjectivity in qualitative areas (e.g., van Vallen and Pitelka 1974, Peters and Ceci 1982, Cichetti 1991), mean that we can not rest on peer review as though it were a gold standard. Peer review can be useful for some functions but is inadequate for others, such as model code development and the analysis of practical questions about the use of models in a regulatory framework unless the term, "peers", is defined more broadly than ever before. For example, the questions in the peer review charter could address specific model evaluation records such as code verification (if general guidance on the necessary records was provided) and ask reviewers if the evaluation was adequate. The peer review panel would need to be constructed to contain appropriate expertise in this area. For it to be an effective part of model evaluation, guidelines need to be developed for the types of peer review that should be applied to the different components of uncertainty and precisely how these should be used by the Agency in its three-part assessment. Specifications are needed when peer review is used in the analysis of particular elements. The type of peer review used might range from letter reviews by individual researchers, internal or external to the agency, providing individual opinions to panels of scientists, managers and end users of model outputs making a comprehensive analysis of model development and use. Supplying the technical specifications to reviewers for them to use in assessing particular elements of uncertainty would be an innovation for review of models but it is similar to the technical standards used in manuscript review. 4.2 Approach 4.2.1 Strategy for Model Evaluation Considering that the Agency's regulatory actions are often challenged, the model acceptance criteria need to reflect the "state of the art" in model evaluation (see Appendix D). The criteria also need to be incorporated into an Agency-wide strategy for model evaluation that can accommodate differences between model types and their uses. As discussed in Section 3.1.5, a protocol was developed for the RAF to provide a consistent basis for evaluation of a model's ability to perform its designated task reliably (see Appendix C). The protocol offers flexibility by proposing a wide variety of evidence to support the decision and covered even situations lacking field and laboratory data. The protocol also includes the nature of the predictive task and the magnitude of the risk of making a wrong decision as context for evaluating the intrinsic properties of the model. Because the evaluation process results in a choice of whether or not to use a model as a tool for prediction, the perspective of the quality of the decision is emphasized. Also, the protocol accepts the null hypothesis that the model 25 ------- adequately represents the modeled process until shown to be otherwise which is consistent with current thinking. This protocol is a good place to start, however, its terminology needs to be updated. 4.2.2 Strategy for Defining Uncertainty in Model Elements While there is no clear and universally applied definition of the model evaluation process, there is considerable overlap between suggested strategies and their working definitions. For example, there is repeated use of model evaluation elements despite considerable differences in model types and applications of available evaluation techniques. This informal consensus provides the potential for characterizing model evaluation by analyzing and defining the following elements: • Uncertainty in the theory on which the model is based. • Uncertainty in translating theory into mathematical representation. • Uncertainty in transcription into computer code. • Uncertainty in assigning parameter values and model calibration. • Uncertainty in model tests. This approach could be used to identify those elements that must rely on peer review; those that should use quantitative measures (e.g., decision maker specifies the acceptable agreement in accuracies and precisions between distributions of model outputs and data from field sampling); and those that should be assessed by an agency panel (e.g., users, stakeholders, and program staff to address effectiveness and accessibility of computer code). Whereas modelers and scientists familiar with a model tend to focus on improving particular elements of uncertainty, the EPA requires a comprehensive and consistent basis for evaluating if a model performs its designated task reliably. This approach also offers the potential of a comprehensive integration of the different model evaluation components into a framework forjudging what constitutes a valid model in a specific situation rather than the judgement itself (left to the program managers). A wide body of evidence needs to be included in such a process because no evidence is above dispute, even "objective" measures of performance depend on some subjectivity in the chosen level of an acceptable difference between a pair of numbers (quantitative criteria relevant to the program). 4.3 Supporting Analysis Our synthesis (see Table 4.1) focuses on the ASTM D978-92 process description of model evaluation (but not definition), the RAF protocol discussed as a starting point (similar to discussions in 26 ------- Beck et al. 1997 and Rykiel 1996), and the ATFERM model acceptance criteria questions. ASTM describes model evaluation in terms of seven processes: conceptualization, program examination, algorithm examination, data evaluation, sensitivity analysis, validation, and code comparison as well as documentation and availability. In the academic literature some authors have focused on different statistics that might be used in model evaluation such as calculations for particular types of variables and for use at different stages in the model evaluation process. However, no definitive protocols have emerged on when or how these statistics should be used. The RAF protocol considers: structure, complexity in number of its state variables and parameters, values of the parameters and the bands of uncertainty around them, sensitivity of the outputs of the model to changes in assigned values of each parameter, and history matching with field data which can include quantitative evaluation if criteria for success are provided to evaluate model performance. The ATFERM report suggests four ways to evaluate and rank reliability of models: (1) appropriateness, (2) accessibility, (3) reliability and (4) usability and lists questions to be asked under each heading but provides no specifications for answering them. TABLE 4.1 The ASTM E978-92, RAF Protocol, and ATFERM report each identify some of the same issues in model evaluation giving them different names. Nomenclature used in ASTM E978-92, left column, is compared with the other two reports in the next two columns, respectively. ASTM E 978-92 RAF Protocol ATFERM (A3) Model Conceptualization Structure & Composition (5.1.li)1 (3 a)2 Theoretical basis peer reviewed? (A, B) Program Examination Task Specification (5.1.3) Decision Risk (5. liii) (la) Application niche? Questions answered? (B,C) Algorithm Examination Mathematical Expression of hypotheses & Complexity (5.1. Hi) (3b) - algorithm & code peer review; (4bvii) code verification (C, D) Data Evaluation Parameterization, number & uncertainty (5.1.1 iii, iv, v & vi) (4d) adequate data availability? (D) Sensitivity Analysis Parameter Sensitivities (5.1. lvii) (E) Validation Performance - sequence of errors; paired/unpaired tests; calibration; prediction uncertainty (5.1.2 i-v) (lb) Strengths, weaknesses, & applicability relative to application niche? (3defg) testing against field data? user acceptability? Accuracy, precision & bias? Code performance? 27 ------- (E) Code Comparison with similar models (3 c) Verification testing against accepted models? Model Documentation (4b) Full documentation? Model Availability (2a) Accessibility? Cost? C (4a) code structure and internal documentation? C (4c) user support? c (4egh) pre- & post- processors? required resources? J'2 Numbers correspond to those in the Appendix C protocol and 1994 ATFERM Report EPA 500-R-94-001 questions, respectively. The letters correspond to the steps in Part II of the strategy described below. 28 ------- The generic guidelines for model evaluation (e.g., ATFERM and ASTM guidelines above) are constructed as a series of questions (e.g., "How well does the model code perform in terms of accuracy, bias and precision?"). They do not set explicit specifications that a model must reach before it is suitable for application (e.g., two or more simulations with acceptable quantitative results over a defined range of scenarios). In an integrated assessment of model uncertainty, it is important that explicit specifications should be set by program managers for each element of uncertainty (both qualitative and quantitative specifications). This would allow flexibility to cover variation in the overall approach, complexity, and the purpose of models used by EPA that may influence the development of such specifications. The task group's recommendation is to be consistent and use the SPC definition of uncertainty in the sense of "lack of knowledge about specific factors (coefficients), parameters (data), or models (structure)" (EPA/630/R-97/001, 1997 page 8) in the following categories and specify how it is to be assessed. It is recommended that the strategy include three parts: 4.3.1 Part I - Defining the Objectives for the Model This part would describe the required task for the model, its application niche (ATFERM l.a), and the precise type of use that is required, (e.g., exactly how it is envisaged that the model would be used in management and/or regulatory tasks and decision making). This part is essential to set the stage for providing detailed answers in Part II which in turn should lead to Part 111, a comprehensive evaluation of model uncertainty. In some instances complex conceptual models have been built over years of investigative science that have been combined with predictive modeling. In other cases new conceptual models must be developed as very new problems arise and both researchers and program managers may use a simple approach in prediction. Neither complexity nor simplicity can be judged as "correct," both may have their place, and we require standards for evaluating both. Also, different types of models have very different types of output. A particularly important distinction is between models that have one or more of their major outputs continuously (or at least regularly) measured contrasted with models that do not. In the first case, there may be a useful measure against which to evaluate the model, while in the later case, evaluation may have to be indirect and using a range of different measurements. 4.3.2 Part II - Analysis of Model Uncertainty The following elements are suggested for evaluating model uncertainty. A. Uncertainty in the theory on which the model is based An appropriate theory must underlie the model. Alternative theories must have been considered and rejected on scientific grounds, and a procedure must be specified for the conditions when new findings will either influence 29 ------- model structure or cause the development of a new model. Assessing this element of model uncertainty seems most likely to involve peer review and it should be specified how this can be done, e.g., individual reviewers, panels, workshops, and the charters specifying the output from peer review must be explicit. Peer review must also take into account the nature of the task, i.e., that the theory used is relevant to the task. It is quite likely that different programs of the Agency would place different emphasis on this aspect of uncertainty. For some ecological and environmental systems there is little difference between scientists in views about underlying theory (though there may be substantial differences in how to measure them), in others theoretical differences are important. This corresponds to 3a of ATFERM MAC and Protocol 5.1.1 (i) Structure. B. Uncertainty in translating theory into mathematical representation. Alternative mathematical representations of particular functions must be discussed and quantitative evidence given to back particular decisions. Important choices are sometimes influenced by the desire to make a model run more quickly - these must be specified. Error propagation associated with the numerical procedures used to approximate mathematical expressions must be defined. The origin of values for model parameters must be defined and how they were obtained must be described. Assessment guidelines should specify which tasks should be undertaken by peer review and which require a more detailed approach than peer review can provide, and how that will be achieved. This corresponds to Protocol 5.1.1 (ii), (iii),and (iv), and the first part of 3b of ATFERM MAC although not with the same emphasis on peer review. C. Uncertainty in transcription into computer code. This stage is required to verify that there are no programming errors (e.g., that the scientific algorithms are properly embedded), that code meets QA/QC specifications and adequate documentation is provided. Programmers are frequently required to make choices in implementing the algorithms that can lead to uncertainty. These need to be documented. Also the code must pass performance tests (e.g. stress tests - NIST Special Publication 500-234). Guidelines must specify how, and by whom, this would be done and if not done by Agency personnel, provide for acquiring the test and documentation records. It may also be necessary to specify programming languages. This corresponds to the second part of 3b of ATFERM MAC and 4a through 4h. D. Uncertainty in model calibration. There are three aspects: 30 ------- (a) uncertainty in the data used in calibration. This should cover everything from numbers of variables to measurements and sampling intensity. (b) uncertainty in the techniques employed in the calibration procedure. This should cover the use of different optimization techniques and requirements for their implementation. (c) uncertainty in the parameters obtained. Of course these uncertainties are the result of many of the previous uncertainties. But there should be explicit assessment of parameter ranges. Guidelines for assessment of these uncertainties should specify what are satisfactory sources of data and sets of calibration procedures. There are likely to be substantial differences between areas of modeling in the assessment of calibration uncertainty. But this type of uncertainty is likely to be an important focus of any external critique and so should be addressed specifically. ATFERM MAC 3c through 3g refer to some of these points but does not make the distinction between calibration and testing. Sensitivity Analysis should be contained under this type of uncertainty since it shows the range of potential outputs from the model under different types and/or levels of assumptions, e.g., Protocol 5.1.1 (vii). E. Uncertainty in model tests. As with uncertainty in model calibration there are four aspects: (a) quantity of data available to make test(s), (b) uncertainty in the data used in making the test(s) (c) the range of statistics (number and types of tests) to use in any assessment, and (d) uncertainty in how a test will be actually made (e.g., how is a difference between a calibration and a test to be assessed?). These points influence the power of any tests and therefore the effectiveness of the assessment that can be made. Protocol 5.1.2 gives examples of particular types of tests, (e.g., unpaired and paired tests). However, the types of tests that can and should be used are perhaps the most variable thing between different types of models, at least at present and they are likely to remain items for considerable research for example, the value of using multiple assessment criteria rather than just one, and how a multi-criteria assessment can be represented. 4.3.3 Part III - The Overall Assessment 31 ------- This part should refer back to the purpose of the model development. Does the model do the required task? The issue is raised in the Protocol of the magnitude of the risk in making a wrong decision, see Protocol 5.1.3. Applications of the model provide "tests" of its strengths and weaknesses. If adequate documentation is recorded, programmatic evaluations can be conducted to identify areas needing corrective action (e.g., Lee et al, 1995 discussed on p. 14). Some of the National Program Offices have developed guidelines for model evaluation for their regulatory applications and these provide examples of the types of specifications that should be expected for each element of uncertainty. For example, OAQPS has developed a consistent practice for air quality models to assure the best model is used correctly for each regulatory application (Guideline on Air Quality Models, 40 CFR, Ch. 1, pt. 51, App. W, 7-1-97). In Section 3 criteria are presented for EPA's evaluation of new models including quantitative specifications for equivalency of new models (e.g., demonstrated by producing maximum concentration predictions within 2 percent of estimates from preferred models). Currently "best estimates" provided by modelers are used (Section 10) noting that errors in the magnitude of the highest estimated concentration occurring sometime within an area of 10 to 40 percent are typically found but acceptable because they are within the factor-of-two accuracy. 5. POSSIBLE FOLLOW-UP ACTIONS 5.1 Additional Support Work Needed Research is increasing in the development of techniques appropriate for analysis of the different elements of uncertainty. These techniques could address specific types of model development and application. Some research and development work is needed to support these recommendations: a) Analysis of Uncertainty The analysis of uncertainty provides a unifying strategy for model evaluation across the different National Program Offices and Regions of the Agency. However, uncertainty is used in many groups of scientists, e.g., fuzzy set theorists, statisticians, resource managers and risk analysts. A review needs to be made of these uses — if only to document where we may stand relative to others. Methods of quantifying and combining measures of model uncertainties (e.g., to quantify the results in predictions from structural errors) need to be developed along with modes of presentation to assist decision makers in interpreting uncertainty in the context of regulatory use. b) Model Inventory 32 ------- A more comprehensive inventory of the types of models actually used in EPA is needed. Such an inventory might be undertaken by program offices and ORD and incorporated into the Application Systems Inventory proposed by the Office of Information Resources Management's Enterprise Information Management Division to comply with A130. From the last inventory in 1995 and later work by ORD the types of models to be covered are presented in Appendix E. c) Multi-media and Multi-contaminant Model Evaluation Multi-media and multi-contaminant models composed of modules grouped together in different ways to answer specific questions pose a complex problem for model evaluation. Have the modules been evaluated for all possible conditions under which they may be used? Should they be? d) Comparability of Evaluation Standards Between Models The issue of tailoring the specifications for evaluation to the model's phase of development, complexity, and the types of model structure, needs to be analyzed. Frequently an environmental problem can be pressing, yet there is little information (data) about it. Models may be one of the few ways of analyzing and estimating effects. Obtaining an approximate estimate of an environmental process with an initial model may be an important contribution (see EPAMMM case history in Appendix B). However, there may not be a clearly defined assessment procedure for the model, particularly in comparison to more well established models. How can differences in the specifications of model evaluation be justified and how can an effective model development and its evaluation procedure, be charted through such stages? Suggested Follow-up Actions 1. Determining the form, resources needed and appropriate housing of the CREM. The overall recommendation of the SPC Steering Committee was that the proposed CREM provide updated guidance for the Agency. Thus CREM might be viewed as analogous to the EMMC as a cross-Agency effort to coordinate and promote consistency in model evaluation and use. This presumes that the goal of developing a consistent approach for modeling issues is desirable, if not essential, to the EPA's modeling programs. Although beyond the scope of this paper, it is anticipated that the Models 2000 SIT will present to the SPC a proposal for a charter specifying the CREM's general function, projected resource requirements and structural placement within the Agency in conjunction with the white paper recommendations. In the future, a careful economic assessment by a contractor of the needs of each Program office and the Regions would be valuable since only limited information on model assessment is currently available. In addition, it has been suggested that the CREM might be housed under the auspices of the Risk Assessment Forum. 33 ------- 2. Directing the CREM's work toward issuance of peer-reviewed guidances on "how" to evaluate and characterize models to support the strategy for model evaluation (Section 5) rather than only listing "what" should be done. A rough five year time frame for these guidances is estimated. Examples of guidance subjects needed are: • appropriate methods for peer review of models to address uncertainty in model theory; • mathematical approaches to code verification to address uncertainty in transcription into model code; • the appropriate use of sensitivity analyses in modeling to address uncertainty in model calibration; • characterizing applicability of a particular model (needed data, driving parameters, responsiveness to data, etc.) to address uncertainty in model tests and the overall assessment with respect to the stated use of the model results; • how to use information (from evaluations covered by above guidances) as guidance for the technical user to construct a plain English characterization for the non-technical, risk manager (i.e., "model characterization language" similar to the risk characterization paradigm) to address the overall assessment with respect to the stated use of the model results (Part 1). Research efforts could be under the auspices of the Risk Assessment Forum in consultation with the SPC/CREM and could be funded through mechanisms such as the ORD's STAR program. 3. Developing and utilizing a model clearinghouse to inform internal and external stakeholders on model evaluation results, availability and application experience in concert with the Program Offices and Regions. A possible solution for a centralized location for models is the Applications Systems Inventory (AS1) proposed by the Office of Information Resources Management (OIRM). OIRM is required to maintain an Information Systems Inventory of metadata related to EPA's computer application systems, models, modules and databases. This would require agreement by the various program offices and Regions to support such a system. At this stage, the clearinghouse is not envisioned as being resource-intensive in terms of providing technical assistance. The user of the clearinghouse would be referred to individual programs/offices for model support assistance. 4. Integration of developed peer review guidance and the supporting aspects of QA/QC for environmental models. Once the CREM has developed peer review guidance, the supporting aspects of QA/QC for environmental regulatory models (and model systems) will need to be clarified. Some of aspects of evaluation process appear 34 ------- more feasible as peer review than others, i.e., evaluation of the scientific theory underlying a model versus a line by line verification of the computer code incorporating the mathematical equations or assessment of input data quality. Thus, internal Agency discussions, in consultation with the SAB, would be helpful in identifying the most appropriate areas of model development and application for peer review and those evaluations best conducted and documented as part of supporting QA/QC. 5. Preparing case studies (see prototypes in Appendix B) that would serve as examples of how models used in regulatory decision-making can be evaluated and the value added by such evaluations (e.g., Ozone Transport Assessment Group and Chesapeake Bay modeling experience like testing for the submerged aquatic vegetation ecosystem modeling framework). 6. Clarifying model terminology used by EPA in producing a Glossary for the "state of the art" General Guidance. For example, applicable current definitions for "validation," "uncertainty" for various areas of model evaluation, and "modeling error." 35 ------- 6. QUESTIONS POSED BY THE COORDINATING COMMITTEE 1. How do the issues of Peer Review (external/internal) and QA/QC evaluation relate to acceptability determination? Models are important tools supporting EPA's efforts in Risk Assessment and Management. To the extent that models erroneously estimate conditions, EPA could make costly mistakes in related decision-making. Therefore, models are covered by both QA and peer review requirements. According to the Science Policy Council Peer Review Handbook, (EPA 100-B-98- 001) models generally should be peer reviewed, and the ATFERM Guidance for Conducting External Peer Review of Environmental Regulatory Models has been incorporated in the handbook (EPA 100-B-94-001). Peer review provides an expert and independent third party review that cannot be provided by stakeholder or peer involvement and public comment. However, peer and stakeholder involvement provide valuable advice and feedback to model developers to assure a usable product (e.g., advisory panels and workshops). In 1997, EPA's Quality Assurance Division conducted an evaluation of implementation of the EPA's peer review procedures and found that few of the over 300 interviewees used the guidance on model peer review. Most, of the few who were aware of the guidance, were unclear about what was expected in implementing it and apparently had no incentive to do so because it was "only guidance." The American National Standard "Specifications and Guidelines for Quality Systems for Environmental Data Collection and Environmental Technology Programs" (ANS1/ASQ, 1994) cited in contract and assistance agreement regulations and incorporated in the revised EPA Order 5360.1 CHG1 and QA manual 5360.1, specifies QA requirements applicable to models. The scope of the order's applicability includes "the use of environmental data collected for other purposes or from other sources (also termed secondary data), including ... from computerized data bases and information systems, results from computerized or mathematical models ..." Project level planning, implementation and assessment are addressed to assure data, whether collected or existing, "are of sufficient quantity and adequate quality for their intended use." Implementation requirements include data processing to be performed in accordance with approved instructions, methods and procedures. Also required are evaluation of new or revised software including that used for "modeling of environmental processes" and documentation of limitations on use of data (e.g., model output data). Implementation by the SPC of the ATFERM's recommendations for the Agency to provide guidance on model selection using model acceptance criteria and information from a "Model Information System" would help to clarify what model evaluation records are needed (e.g., code verification, testing results, model selection and the application process). The model evaluation strategy proposed above could provide a process tailored to the nature of the 36 ------- predictive task to be performed and the magnitude of the risk of making a wrong decision consistent with existing QA guidance. It could also clarify the complementary roles of QA and peer review tasks in model evaluation and the basis for guidance on QA Project Plans for model development and application. The Agency's Peer Review Handbook includes under materials for peer reviewers: the charge, the work product, "associated background material" and "what is needed to complete their task." Other useful material can include "a bibliography and/or any particular relevant scientific articles from the literature." This leaves unclear what specific records are needed to adequately answer questions on model elements (EPA 100-B-94-001 Section VI) like: "What criteria were used to assess the model performance?," "What databases were used to provide an adequate test?," "What were key assumptions and the model's sensitivity to them?," "How the model performs relative to other models?," "Adequacy of documentation of model code and verification testing?," and "How well does the model report variability and uncertainty in its output?" A number of the requirements in the Peer Review Handbook also need to be clarified: a) What documentation is needed for peer review files of externally developed models to show "the model was independently peer reviewed with the intent of the Agency's Peer review policy for models and that EPA's proposed use" was evaluated? b) What are the criteria needed to identify "models supporting regulatory decision making or policy/guidance of major impacts such as those having applicability to a broad spectrum of regulated entities and other stakeholders', or that will "have a narrower applicability, but with significant consequences on smaller geographic or practical scale" needing peer review? c) What are the criteria by which decision makers judge when "a model application situation departs significantly from the situation covered in a previous peer review" so that it needs another peer review? d) What are the criteria by which decision makers judge when "a modification of an existing adequately peer reviewed model departs significantly from its original approach" so that it needs another peer review? e) What is the relationship of the peer review of models in the development stage often reported in journal articles (where peer review is usually performed for specific reasons for that journal and does not substitute for peer review of the Agency work product or provide accessible peer review records) and peer review addressing model use to support an Agency action? f) What questions need to be asked in peer review of model applications supporting site specific decisions where the underlying model is "adapted to the site specific circumstances? 2. What is a consensus definition of model use acceptability criteria? 37 ------- After reviewing the diversity of models and their uses across the Agency, the task group has proposed a model evaluation strategy rather than a "one size fits all" set of criteria. 38 ------- 3. Does acceptability correspond to a particular model, or specific applications of a model? Specific models and applications could be accommodated in specific criteria developed by the programs. 4. Does acceptability cover only models developed by EPA or can it cover externally developed models? Acceptability covers all models used in Agency regulatory decision making (see Appendix F). 5. Does acceptability mean the agency will develop a "clearinghouse" of models that meet EPA's definition of acceptable? As discussed above it is recommended that program managers would be responsible for acceptance of models for use in their program activities. Some means of providing model evaluation status and information to potential users to be used in model selection is needed Agency-wide. The task group further recommends that a mechanism be developed for providing information responding to the model acceptance criteria to potential users to support model selection and avoid redundancy in model evaluation efforts. A number of Agency efforts might be evaluated to determine how best to leverage their resources to achieve Agency-wide goals. Some model clearinghouses already exist but often lack support. One exception is the Guideline on Air quality Models Appendices (http://www.epa.gov/ttn/scrarn) that provides preferred models as well as information on alternative models (e.g., regulatory use, data input, output format and options, accuracy and evaluation studies) supporting selection. As a result of the December 1997 Models 2000 Conference, an action team was formed for a Modeling Clearinghouse responding to a need perceived by the participants. OIRM is proposing to develop an Application Systems Inventory (ASI) as a repository of information about Agency software and clearinghouse to meet the Paperwork Reduction Act of 1980 and OMB Circular A-130 requirements. The ASI would integrate metadata collection requirements across the Agency and could be modified to meet specific model metadata requirements providing information on model evaluation and use. Another effort, by ORD's National Center for Environmental Assessment, is defining metadata for models that can be stored in its relational database, the Environmental Information Management System, with input through the internet and retrieval through a search engine using specific queries. Information found useful for model selection such as those listed in the Nonpoint Source Model Review example (Appendix F) is being considered for data elements. In addition, a strategy for communication needs to be developed for the public and others, like model users, to provide feedback to the EPA, possibly through the Internet at sites providing information on models and their evaluation for EPA use. 6. Would each program/region develop their own system for evaluating acceptability? 39 ------- Yes, in terms of program specifications (both qualitative and quantitative). 7. Should EPA apply a generic set of criteria across the board to all categories of ERMs or should acceptability criteria differ depending on the complexity and use (e.g., screening vs. detailed assessment) of a model? Both, generic criteria on model evaluation developed by direct interaction of the SPC and CREM with tailoring of specifications (both qualitative and quantitative) done by the programs. 40 ------- 7. COST AND BENEFITS DISCUSSION The task group requested information on costs of model evaluation activities from its members, those involved in the case histories, and the Models 2000 Steering and Implementation Team. The limited interim responses (Appendix G) were distributed for comment. Evaluation activities vary in the resources required depending on their complexity, availability of in-house expertise, and whether or not costs can be leveraged with other organizations. On the low end, the EPAMMM evaluation of a screening model evaluation without site data cost about $60,000 (for an in-house expert's 0.5 full-time equivalent). On the high end, the RADM Evaluation Study cost about $17 million for field studies, $.5 million for NOAA FTEs, and $1 million for contractors to run the model in tests in their 2.5 year effort. The Air Modeling regulatory program has used 20 to 25 staff personnel over the past 20 years with extramural support of $1.5 to 2 million per year. Model performance evaluation and peer review has cost about $150 to 200,000 per model category (2 to 10 models), although the AERMOD evaluation cost somewhat less than $50,000. In the interagency MMSOILS benchmarking evaluation, EPA's portion of the cost involved scenario development of about $50,000, execution of 4 models at about $100,000 and output comparison and reporting at about $ 150,000. Total coding costs estimated for the IEUBK model were about $200,000 and separate test costs were not available from the contractor. The cost would depend on the language used and when the programming documentation was done, costs being higher if documentation was done late in the process (could equal half the total coding cost). At the Models 2000 conference, Bob Carsel estimated that for a ground water model, software evaluation and documentation cost about 30% of the project cost. The costs for the AIR Dispersion Model Clearinghouse and SCRAM were about 2 FTEs/GS-13 for in-house personnel for maintaining the clearinghouse and providing regional support and regional workshops. The database, MCHISRS, cost about $50,000 for contract support over few years with little upkeep for SCRAM. Expenditure by the Agency of the costs summarized above need to be considered in light of the benefits of better documentation and communication of the strengths and weaknesses of models. If carried out the task group's recommendations would promote systematic management of model development and use within EPA by providing a basis for consistent model evaluation and peer review. The proposed model evaluation strategy would encourage sensitivity and uncertainty analyses of environmental models and their predictions as well as clarify peer review requirements. Access to the evaluation results would improve model selection and avoid redundancy in model development and evaluation. Although these benefits would involve costs in model development for evaluation, peer review and access to evaluation results but would result in better products. Likewise, the additional cost incurred by evaluation of model application would provide feedback to developers that would improve model performance. 41 ------- APPENDIX A SCIENCE POLICY COUNCIL MODEL ACCEPTANCE CRITERIA WHITE PAPER TASK GROUP MEMBERS Linda Kirkland, Ph.D., Chair, Quality Assurance Division, NCERQA/ORD Brenda Johnson, Region 4 Dale Hoffmeyer, OAR Hans Allender, Ph.D., OPPT Larry Zaragoza, OSWER Jerry LaVeck, OW Thomas Barnwell, Ph.D., ORD/NERL, Athens, GA John Fowle HI, Ph.D., SAB James Rowe, Ph.D., SPC David Ford, Ph.D., National Center for Research in Statistics and the Environment, University of Washington 42 ------- APPENDIX B MODELS EVALUATION CASE HISTORIES 1. SHADE-HSPF Case Study (Chen et al., 1996,1998a &b) Regulatory Niche & Purpose: A watershed temperature simulation model was needed for targeting critical reach locations for riparian restoration and forestry best management practices development. Evaluation of attainment of stream temperature goals (water quality criteria) was emphasized. Model Selection: Functional selection criteria (e.g., watershed scale and continuous-based representation, stream temperature simulation module) were used to survey and evaluate existing models resulting in the Hydrologic Simulation Program-Fortran (HSPF) model being selected as the only model meeting the requirements. Further evaluation of HSPF identified limitations in two important heat budget terms. Data Source/Development: A stand-alone computer program, SHADE, was developed for generating solar radiation data sets with dynamic riparian shading characteristics for use as input for water temperature simulation by HSPF after it was enhanced for computing the heat conduction between water and the stream bed for complete heat balance analysis. The case study involved generating water balance information using hydrologic simulation and then computing the effective solar radiation for stream heating to simulate stream temperature dynamics with HSPF. Existing data sources were reviewed and appropriate meteorological, stream flow, and hourly stream temperature data for model calibration and validation were located from a fish habitat restoration project in the Upper Grande Ronde (UGR) in Oregon. Most process-oriented parameters were evaluated from known watershed attributes. Other parameters were evaluated through model calibration with recorded stream flow and temperature data based upon an understanding of the study site, HSPF application guidelines, and previous HSPF studies. Topographic, hydrographic and riparian vegetation data sets for SHADE were developed with ARC/INFO GIS for 28 fully mapped segments (the mainstem river and four major tributaries) and 9 partially mapped segments (portions of nine other tributaries). Sensitivity Analysis: Sensitivity analysis was noted as an important technique used extensively in designing and testing hydrologic and water quality models. To evaluate the sensitivities of simulated stream temperatures to HSPF heat balance parameters (HTRCH) and SHADE parameters, one factor (model variable or parameter) was changed at a time while holding the rest of the factors constant. Absolute sensitivity coefficients representing the change in stream temperature as a result of a unit changes in each of the two groups of model parameters were calculated by the conventional factor perturbation method. Temporal (both diurnal and seasonal) and longitudinal 43 ------- variations in sensitivity were noted. Riparian shading parameters in SHADE were evaluated for stream temperature calibration to verify accuracy and reliability of SHADE computations. The solar radiation factors or SRF (deemed the most critical parameter by sensitivity analysis) as well as the diurnal, seasonal, and longitudinal variations were evaluated to verify the accuracy and reliability of SHADE computations. Significant improvement between the maximum values of SRF and the measured stream values suggested a better representation of local shading conditions by the segment based SHADE computations. Calibration/V alidation/Testing: The model sensitivities to each parameter, as well as the diurnal, seasonal, and longitudinal variations noted, provided the basis for the stream temperature calibration. To test and demonstrate the utility and accuracy of SHADE-HSPF modeling system, hydrologic and stream temperature simulations of the watershed were conducted and visually compared to plotted measured data for two summers at 27 sites. The stream temperature calibration for 1991 and validation for 1992 generally confirmed the accuracy and robustness of SHADE- HSPF modeling system. The simulated results matched the observed points reasonably well for the majority of sites (19/27). In addition, three statistical tests were run to provide coefficients of determination and efficiency and the standard error of estimate for evaluation of critical model capability. Evaluation focused on stream temperature goals for the UGR basin (e.g., summer maximum temperature and average 7-day maximum temperature) most of the absolute errors were less than 2.5. Simulated maximum values of stream temperature, on which the riparian restoration forecasts are based, are accurate to 2.6-3.0" C. Hourly simulations have approximately the same accuracy and precision. The phase, diurnal fluctuations, and day-to-day trends in stream tp.-npprat.irpg wpm (jpnprally inHirwtly mnfirmina that riparian charting ran hp PgtimatpH reasonably by SHADE. Compared to the 8-10"C exceedances of the temperature goals under present conditions, the model accuracy of approximately 3.0"C should be adequate to assess riparian restoration scenarios. This case history shows positive elements: description of regulatory use and focused on criteria for model performance evaluated existing models based upon specific criteria to select the model for further development and testing • evaluation and selection of existing data for use in development, calibration, validation and testing • sensitivity analysis and discussion of uncertainties good discussion of data and model limitations results peer reviewed by ?internal ORD review and the Journal of Environmental Engineering process as published 44 ------- Concerns: If used by EPA, the peer review for a journal (while strengthening the scientific and technical credibility of any work product) is not a substitute for Agency work product peer review as it may not cover issues and concerns the Agency would want peer reviewed to support an Agency action. 45 ------- 2. TRIM.FaTE Case Study This case study is based upon the summary (EPA, 1998 EPA-452/D-98-001) of the review of current models, the conceptual approach to the Total Risk Integrated Methodology (TRIM) framework and the first TRIM module, TRIM.FaTE, and the evaluations of TRIM.FaTE prototypes (e.g., mathematical structure and algorithms) presented to the Science Advisory Board (SAB) for an advisory review. Regulatory Niche & Purpose: OAR needed a multi-media, time series simulation modeling system to estimate multi-media impacts of both toxic and criteria air pollutants in support of Clean Air Act requirements (e.g., residual risk program, delisting petitions, urban area source program, special studies, trends, setting NAAQS, and input to regulatory impact analyses). Model Selection: Four multimedia, multipathway models and approaches were evaluated on the criteria that the tools have capabilities of 1) multimedia assessment; 2) ecosystem risk and exposure modeling; 3) multi-pollutant assessment; 4) addressing uncertainty and variability; and 5) accessibility and usability for EPA, states, local agencies and other stakeholders. Hazardous air pollutant (HAP) exposure and risk models also needed to adequately estimate temporal and spatial patterns of exposures while maintaining mass balance. Current criteria air pollutant models have this capability for the inhalation exposure pathway. The importance of capabilities to model pollutant uptakes, biokinetics, and dose-response for HAPs and criteria pollutants was also considered. It was found that risk and exposure assessment models, or a set of models, with the capabilities needed to address the broad range of pollutants and environmental fate and transport processes for OAQPS risk evaluations do not exist. Therefore, development of the modular TRIM framework to have varying time steps and sufficient spatial detail at varying scales, true "mass-conserving" results, transparency to support use in a regulatory context, and a truly coupled multimedia structure was begun. Data Source/Development: An object-oriented architecture using Visual Basic 5 application environment imbedded within Excel 97 to model the hierarchy of components of TREVLFaTE, with a preliminary algorithm library utilizing this coding architecture, was implemented for the TRIM.FaTE prototype. The final TRIM computer framework is being designed. Where possible, existing models, tools, and databases will be adopted, necessitating their evaluation. TRIM is planned to be a dynamic modeling system that tracks the movement of pollutant mass through a comprehensive system of compartments (e.g., physical and biological), providing an inventory of a pollutant throughout the entire system. The TRIM design is modular and, depending on the user's need for a particular assessment, one or more of six planned modules may be employed (e.g., exposure event as well as pollutant movement, uptake, biokinetics, dose response, and risk characterization). Receptors move through the compartments for 46 ------- estimation of exposure. Uptake, biokinetics, and dose response models may be used to determine dose and health impacts. Also, the TRIM.FaTE module allows flexibility to provide simulations needed for a broad range of risk evaluations because it can be formulated at different spatial and temporal scales through user selections from an algorithm library or added algorithms. The unified approach to mass transfer allows the user to change mass transfer relationships among compartments without creating a new program. This scenario differs significantly from routine application of stable single medium model programs. The mathematical linking enables a degree of precision not achieved by other models, while providing full accounting for all of the chemical mass that enters or leaves the environmental system. An outline was provided for the future user's manual for SAB's review. Sensitivity Analysis: Tiered sensitivity and uncertainty analyses will be integrated into the TRIM framework. All inputs to TRIM will be designed such that parameter inputs can be entered in parameter tables as default values or value distributions. Capability to estimate variability and uncertainty will be an integral part of TRIMFaTE. Currently, only simplified sensitivity analyses have been conducted by considering the range of uncertainty in the parameter value and the linear elasticity of predicted organism concentration with respect to each input parameter. Sensitivity scores were calculated for all inputs and the sensitivity to change determined for chemical concentrations in a carnivorous fish, macrophytes, a vole, a chickadee, and a hawk with B(a)P in a steady state. Parameters with both relatively high sensitivity and a large range of uncertainty were identified and efforts focused on decreasing uncertainty that would produce the largest improvement in decreasing output uncertainty. Limitations in reliability were noted to come from relevance and availability of data to address uncertainties (e.g., about soil partition processes). Calibration/Validation/Testing: Four prototypes of increasing complexity were developed and evaluated. Algorithm generalizations, conceptualizations of domains (e.g., soil, groundwater, air, plant, terrestrial food chain transport), and code and data structures were provided for evaluation by the SAB panel. Also, software, routines, the databases consulted, and the data tables sources were documented and the quality of the data (e.g., distributional data for terrestrial wildlife) was discussed in comments. The TRIM.FaTE prototypes were applied to the simulation of B(a)P and phenanthrene releases in a realistic aluminum smelter test case and evaluated by comparison of the distribution of mass in multiple environmental media with results from two widely used multimedia models. TRIM.FaTE yielded results similar to the other models for some media but different results for others based upon different algorithms. Without actual measured concentrations in a controlled system, it cannot be determined which model accurately reflects reality. Limited model verification has been performed to date and more is needed. 47 ------- This is an example of peer review done early and, as planned, often. The elements in the ATFERM "Guidance for Conducting External Peer Review of Environmental Regulatory Models" were addressed in the information provided to the SAB panel even though this was an evaluation done when only about half of the first module was completed. The user's choice of algorithms needs to be an informed choice based upon evaluation of information provided in the algorithm library (e.g., requirements, assumptions). Also, documentation ot tne rationale tor tne cnoices, tne cnoices (e.g., prooiem cietinition, specifications of links between data and chosen algorithms, run set up and performance), and the results need to be self-documenting to provide defensibility of the model output. Acquisition of data and testing of the future system needs to be carefully planned and the results similarly documented for peer review and users (e.g., limitations on use). 3. MM SOILS Case Study (Laniak, et al., 1997) Regulatory Niche & Purpose: EPA develops, implements, and enforces regulations that protect human and ecological health from both chemical and non-chemical stressors. EPA (like DOE) has a need to understand environmental processes that collectively release, transform and transport contaminants resulting in exposure and the probability of deleterious health effects and uses simulation models to assess exposures and risks at facilities in support of its decision making processes. MMSOILS is a multimedia model used by EPA for assessing human exposure and risk resulting from release of hazardous chemicals and radionuclides. It provides screening level analyses of potential exposure and risks and site-specific predictive assessments of exposures and risks. Model Selection: EPA and DOE developed a technical approach, benchmarking, to provide a comprehensive and quantitative comparison of the technical formulation and performance characteristics of three similar analytical multimedia models: EPA's MMSOILS and DOE's RESRAD and MEPAS. Calibration/Validation/Testing: Model design, formulation and function were examined by applying the models to a series of hypothetical problems. In comparing structure and performance of the three models, the individual model components were first isolated (e.g., fate and transport for each environmental medium, surface hydrology, and exposure/risk) and compared for similar problem scenarios including objectives, inputs, contaminants, and model endpoints. Also the integrated multimedia release, fate, transport, exposure, and risk assessment capabilities were compared. 48 ------- For example, the fate and transport evaluation used a series of direct release scenarios including a specific time-series flux of contaminant from the source to the 1) atmosphere, 2) vadose zone, 3) saturated zone, and 4) surface water (river). Model estimates of contaminant concentrations at specific receptor locations were compared. To compare the performance of all components functioning simultaneously, a hypothetical problem involving multimedia release of contaminants from a landfill was simulated. The manner and degree that individual differences in model formulation propagate through the sequence of steps in estimating exposure and risk were evaluated by comparing endpoints of concentration-based model outputs (e.g., contaminant concentrations and fluxes for each medium, time to peak concentration) as well as medium-specific and cumulative dose/risk estimates. The results discussed differences in 1) capabilities (e.g., RESRAD and MEPAS simulate formation, decay, and transport of radionuclide decay products but MMSOILS does not); 2) constructs with respect to simulating direct releases to the various media (e.g., MMSOILS allows for varying source release constructs but does not allow for specific media to be isolated per simulation because all model modules must be executed in a simulation); 3) direct releases to the atmosphere, vadose zone, saturated zone, and surface water (e.g., all models use nearly identical formulations for transport and dispersion, resulting in close agreement with respect to airborne concentration predictions at distances greater than one kilometer from the source); 4) how surface hydrology is handled (e.g., MMSOILS does not distinguish between forms of precipitation like rainfall and snow fall); 5) direct biosphere exposure and risk (e.g., models are in complete agreement for the vast majority of methods to calculate exposure and risk with differences occurring in test scenarios for irrigation, external radiation ad dermal adsorption in contact with waterborne contamination); and 6) the multimedia scenario (e.g., predictions of total methylene chloride mass that volatilizes differ by a factor of 10 between the models). Results showed that the models differ with respect to 1) environmental processes included, and 2) the mathematical formulation and assumptions related to the implementation of solutions. Peer Review of the benchmarking process and results was carried out externally by the DOE Science Advisory Board in a review of the DOE Programmatic Environmental Impact Statement and the journal, "Risk Analysis." This case history shows positive elements: Results provide comparable information on model design, formulation and function that can support informed selection decisions between these models. Concerns: Objectives for the study did not address how well the models predicted exposures and risks relative to actual monitored releases, environmental concentrations, mass exposures or health effects. Also there are no test results of the models in applications supporting specific 49 ------- regulatory and remedial action assessment needs because the study was based upon the premise that the models were applicable to the types of problems for which they were typically used. Additional information would be needed for selection decisions for model application in EPA. Cost: Scenario Development about $50,000 Execution of the Models $100,000 Output comparison and write-up (Journal articles) $150.000 Total $300,000 4. RADM Case Study (NAPAP Report 5, 1990) Regulatory Niche & Purpose: The National Acid Precipitation Assessment Program (NAPAP) needed models to assess changes in sulfate in response to proposed changes in emissions of sulfur dioxide, Model Selection: Regional models (linear chemistry, statistical and Lagrangian) were evaluated with new NAPAP data. An operational baseline for RADM performance was compiled based upon previous operational evaluations. Data Source/Development: RADM was developed to assess changes in sulfate in response to proposed changes in emissions of sulfur dioxide including addressing nonlinearity between changes in emissions and changes in deposition. Earlier simpler models were regarded as unreliable because they did not capture complex photochemistry and nonlinearities inherent in the natural system (e.g., the role of oxidants and aqueous-phase transformations). Nonlinearity (fractional changes in primary pollutant emissions are not matched by proportional changes in wet deposition of its secondary product) was of concern because at the levels of control addressed (10 million metric tons) reduction of emissions by another 20% to overcome nonproportionality and achieve a target could double control costs. Sensitivity Analysis: Sensitivity studies were done to determine which parameterization worked best for the RADM's meteorology and transport module, its chemistry module and the integrated model (e.g., meteorology, chemistry, emissions, and boundary conditions to the relative uncertainty in 6 key species concentrations). Calibration/V alidation/Testing: 50 ------- An Eulerian Model Evaluation Program (EMEP) was carried out to established the acceptability and usefulness of RADM for the 1990 NAPAP Integrated Assessment. Key to evaluation were performance evaluations including diagnostic assessments and sensitivity analyses leading to user's confidence in the model. Guidelines and procedures were incorporated into protocols focused upon the key scientific questions supporting the application (e.g., ability to replicate spatial patterns of seasonal and annual wet deposition). Data were collected to provide robust testing of the models noting that the confidence in the model would be related to the variety of situations in the model's domain tested with observational data to show a "lack of inaccuracy" in performance. Previous model testing was limited by availability of data. Therefore, data were collected for chemical species concentrations and deposition at the surface to improve "representativeness" as well as from aircraft and special chemistry sites to support diagnostic tests for model components in a 2-year field study. Comparisons against field data were viewed as more important in identifying weaknesses than verification (the determination of consistency, completeness, and correctness of computer code) and adequacy of the model design. However, in regional modeling the disparity in scale between site measurements and the volume-averaged prediction is a source of uncertainty. It was noted that inadequate spatial resolution in the data could produce observations that did not represent spatial patterns actually present. Such difficulties in interpretation led to linking of model uncertainty with model evaluation. Model predictions were compared to uncertainty limits of interpolated (by "kriging") observations and the observed statistically significant differences were used in evaluation (e.g., bias estimates). Kriging produced an estimate of the uncertainty (expected squared error) of the interpolated points that could be used as confidence bands for the spatial patterns. Uncertainty in the observation data came from spatial resolution in the observations, small-scale variability in the air concentration fields and wet deposition and measurement error. The model simulated the patterns (from spatially averaged fields within a grid cell) which were expected to lie within the uncertainty bounds of the corresponding pattern obtained from the observations. Two cycles of model development, evaluation, refinement and reevaluation were carried out. The evaluation process was looked upon as iterative as the model progressed through its stages of development: 1) informal testing by the developer, 2) testing by the user with diagnostic evaluation by the developer, and 3) performance evaluation in user application. The first year's data was used in the second phase and it was planned the second's data would be used in the third. Comparative evaluation was performed where model predictions from several versions of RADM and ADOM (developed for Canada) were evaluated against field observations for a 33-day period in the fall of 1988 to see if persistent and systematic biases occurred in predictions estimating the deposition from sources at receptor areas, to estimate the change in deposition resulting from a change in emissions, and to capture nonproportionality in deposition change. Because RADM simulates very complex processes, any errors in the model's representation of physical and chemical processes could bias the model's predictions. Predictions from process modules were compared for the gas-phase chemistry, the cloud scavenging, and transport modules when shortcomings in representations of individual components in the sulfur system were noted. Capabilities of simulating annual and seasonal averages with RADM and several linear models (for wet and dry deposition and on ambient 51 ------- concentrations) were evaluated. At this early stage of regional model evaluation, no viable quantitative performance standards existed (e.g., how "inaccurate" it could be). The inferred performance of the models regarding source attribution, deposition change, and air concentration change was examined based upon the evaluations and bounding analysis results, and risk of that KAUM could give "misguidance tor the 1990 NAPAP Integrated Assessment was assessed. To obtain a reliable estimate of how broadly RADM's predictions could range as a function of possible errors, a "bounding" technique was developed. RADM2.1 was suggested to be used for 1990's NAPAP Integrated Assessment because it did not exhibit any biases extreme enough to preclude use if the bounding technique was used and a cautious approach was taken to use of the predictions. EMEP data were used to address issues of acceptance and standards of performance, but Phase 2 was not covered in the report. Performance over a large number and range of tests were stated as necessary to acquire the weight-of-evidence needed to interpret the results. A multidisciplinary panel provided peer involvement. Cost: It was estimated that EPA provided about 18.5 Million for the 2.5 year evaluation effort (17M for field studies, ,5M for NOAA FTEs, and 1M for contractors to run the model in tests). 5. EPAMMM Case Study (J. Chen and M. B. Beck, 1998. EPA/600/R-98/106.) Regulatory Niche & Purpose: A model was needed to screen a large number of hazardous waste facility sites with potential contamination of groundwater by leachates. The objective was to rank the sites according to their risk of exposure in the absence of in situ field observations. Those predicted to be of highest risk would have priority for remediation. Model Selection: The EPA Multi-Media Model (EPAMMM) was evaluated as a tool for predicting the transport and fate of contaminants released from waste a disposal facility into an environment in several media (e.g., air or subsurface environment). The model contains 7 modules: the landfill unit, the unsaturated flow field, transport of solutes in the unsaturated zone, transport of solutes in the saturated zone, transport of solutes in the surface waters, an air emissions module, and an advective transport and dispersion of the contaminant in the atmosphere (Salhotra et al. 1990 and Sharp-Hansen et al. 1990). The application evaluated was the characterization of a Subtitle D facility using 3 of the modules: flow in the unsaturated zone, transport of solutes in the unsaturated zone, and transport of the solutes in the saturated zone. Analytical and semi- analytical techniques were used to solve the basic partial differential equations of fluid flow and solute transport. 52 ------- Testing: A protocol (Beck et al. 1995 and 1997) developed for evaluation of predictive exposure models was performed in a test case when no historical data were available to be matched to the simulated responses (traditional validation). Quantitative measures of model reliability were provided and summarized in a statistic that could augment more qualitative peer review. Three groups of tests were formulated to determine model reliability. One test assessed the uncertainties surrounding the parameterization of the model that could affect its ability to distinguish between two sites under expected siting conditions. The output uncertainty, as a function of different site characteristics, was investigated to determine if a reasonable range of model parameter uncertainty would render the power of the model to discriminate between performance of containment facilities ineffective. A generic situation was simulated under different subsurface soil, hydrological and contaminant-degradation regimes and the power of the model to distinguish between the site's containment effectiveness was tested. The probability of identical values of the residual contaminant concentration (y) at the respective receptor sites for two sites with different soil and hydrological parameterizations was evaluated to see if it was less than some quantitative threshold, such as 0.01. 0.05, or 0.10. Another test analyzed regionalized sensitivity (Spear et al. 1994) to determine which of the model's parameters were critical to the task of predicting the contaminant's concentration exceeding the action level. The model parameters were evaluated to determine which ones were key to discriminating among the predictions of (y) in various ranges of exposures. This identified the parameters that needed the best information to determine a particular percentile of the contaminant distribution concentration at the receptor site (y). The results also provided information on the redundancy of parameters in achieving the target performance of predicting a percentile concentration. The number of key and redundant parameters can indicate model quality for the screening application. The third test provided a more global sensitivity analysis investigating the dependence of selected statistical properties of the distributions of predicted concentrations on specific parameters. That proportion of uncertainty attached to the output (y) that derives from the uncertainty in the knowledge of a given parameter was quantified. For each individual parameter, the extent of the statistical properties of the predicted distribution (mean, variance, and 95lh percentile) of (y) was determined varying as a function of the point estimates assumed for the parameter. The other parameters were treated as random variables within the framework of a Monte Carlo simulation. The results of the tests show a novel form of statistic (the Quality Index of the model's design) forjudging the reliability of a candidate model for performing predictive exposure assessments. This case history shows positive elements: Quantitative indicators of model performance provided without in situ field data. Concerns: 53 ------- Detailed knowledge of the mathematical model's function, the details of the conditions assumed for the tests and the acceptable risks in answering the questions associated with application niche are required for this type of evaluation. Cost: About '/2 an FTE (estimated about $60,000) 54 ------- APPENDIX C - MODEL VALIDATION PROTOCOL FROM: DRAFT July 4, 1994 MODEL VALIDATION FOR PREDICTIVE EXPOSURE ASSESSMENTS M B Beck * Lee A. Mulkey ** Thomas O. Barnwell ** * Warnell School of Forest Resources University of Georgia Athens, Georgia 30602-2152 and Department of Civil Engineering Imperial College London SW7 2BU, UK **U.S. Environmental Protection Agency Environmental Research Laboratory Athens, Georgia The beginning of the White Paper was published as: Beck, M. B., J. R. Ravetz, LA. Mulkey, and T.O. Barnwell. 1997. On the Problem of Model Validation for Predictive Exposure Assessments. Stochastic Hydrology and Hydraulics 11:229-254. Springer-Verlag. Part 5 CONCLUSIONS It is not reasonable to equate the validity of a model with its ability to correctly predict the future "true" behavior of the system. A judgement about the validity of a model is a judgement on whether the model can perform its designated task reliably, i.e., at minimum risk of an undesirable outcome. It follows that whomsoever requires such a judgement must be in a position to define — in sufficient detail — both the task and the undesirable outcome. 55 ------- However desirable might be the application of "objective" tests of the correspondence between the behavior of the model and the observed behavior of the system, their results establish the reliability of the model only inasmuch as the "past observations" can be equated with the "current task specification." No-one, to the best of our knowledge, has yet developed a quantitative method of adjusting the resulting test statistics to compensate for the degree to which the "current task specification" is believed to diverge from the "past observations." This in no way denies, however, the value of these quantitative, objective tests wherever they are applicable, i.e., in what might be called "data-rich" problem situations. Indeed, there is the prospect that in due course comparable, quantitative measures of performance validity can be developed for the substantially more difficult (and arguably more critical) "data-poor" situations, in which predictions of behavior under quite novel conditions are required by the task specification. In this concluding section, the purpose of the protocol for model validation set out below is to provide a consistent basis on which to conduct the debate, where necessary, on the validity of the model in performing its designated task reliably. It seeks not to define what will constitute a valid model in any given situation, but to establish the framework within which the process of arriving at such a judgement can be conducted. It acknowledges that no evidence in such matters is above dispute, not even the evidence of "objective" measures of performance validity, which themselves must depend on some subjectively chosen level of an acceptable (unacceptable) difference between a pair of numbers. 5.1 The Protocol There are three aspects to forming a judgement on the validity, or otherwise, of a model for predictive exposure assessments: (i) the nature of the predictive task to be performed; (ii) the properties of the model; and (iii) the magnitude of the risk of making a wrong decision. For example, if the task is identical to one already studied with the same model as proposed for the present task and the risk of making a wrong decision is low, the process of coming to a judgement on the validity of the model should be relatively straightforward and brief. Ideally, it would be facilitated by readily available, quantitative evidence of model performance validity. At the other extreme, if the task is an entirely novel one, for which a novel form of model has been proposed, and the risk of making a wrong decision is high, it would be much more difficult to come to a judgement on the validity of the model. Evidence on which to base this judgement would tend to be primarily that of an expert opinion, and therefore largely of a qualitative nature. 56 ------- While the depth of the enquiry and length of the process in coming to a judgement would differ in these two examples, much the same forms of evidence would need to be gathered and presented. It is important, however, to establish responsibilities for the gathering of such evidence, for only a part of it rests with the agency charged with the development of a model. In the following it has been assumed that a second, independent agency would be responsible for specification of the task and evaluation of the risk of making a wrong decision. The focus of the protocol will accordingly be on the forms of evidence required for evaluation of the model. 57 ------- 5.1.1 Examination of the Model's Composition The composition of a model embraces several attributes on which evidence will need to be presented. These are as follows: (1) Structure. The structure of the model is expressed by the assembly of constituent process mechanisms (or hypotheses) incorporated in the model. A constituent mechanism might be defined as "dispersion," for example, or as "predation of one species of organism by another." The need is to know the extent to which each such constituent mechanism has been used before in any previous (other) model or previous version of the given model. There might also be a need to know the relative distribution of physical, chemical and biological mechanisms so incorporated; many scientists would attach the greatest probability of universal applicability to a physical mechanism, and the smallest such probability to a biological mechanism. (li) Mathematical expression of constituent hypotheses. This is a more refined aspect of model structure. The mechanism of "bacterial degradation of a pollutant" can be represented mathematically in a variety of ways: as a first- order chemical kinetic expression, in which the rate of degradation is proportional to the concentration of the pollutant; or as, for instance, a function of the metabolism of bacteria growing according to a Monod kinetic expression. (iii) Number of state variables. In most models of predictive exposure assessments the state variables will be defined as the concentrations of contaminants or biomass of organisms at various locations across the system of interest. The greater the number of state variables included in the model the less will be the degree of aggregation and approximation in simulating both the spatial and microbial (ecological) variability in the system's behavior. In the preceding example of "bacterial degradation of a pollutant," only a single state variable would be needed to characterize the approximation of first-order chemical kinetics; two — one each for the concentrations of both the pollutant and the (assumed) single biomass of bacteria — would be required for the constituent hypothesis of Monod kinetics. Similarly, a lake characterized as a single, homogeneous volume of water will require just one state variable for the description of pollutant concentration within such a system. Were the lake to be characterized as two sub-volumes (a hypolimnion and an epilimnion), however, two state variables would be needed to represent the resulting spatial variability of pollutant concentration. (iv) Number of parameters. The model's parameters are the coefficients that appear in the mathematical expressions representing the constituent mechanisms as a function of the values of the state variables (and/or input variables). They 58 ------- are quantities such as a dispersion coefficient, a first-order decay-rate constant, or a maximum specific growth-rate constant. In an ideal world all the model's parameters could be assumed to be invariant with space and time. Yet they are in truth aggregate approximations of quantities that will vary at some finer scale of resolution than catered for by the given model. For instance, the first-order decay-rate constant of pollutant degradation subsumes the behavior of a population of bacteria; a Monod half-saturation concentration may subsume the more refined mechanism of substrate inhibition of metabolism, and so on. In problems of groundwater contamination the volumes (areas) over which the parameters of the soil properties are assumed to be uniform are intertwined with this same problem of aggregation versus refinement. There is immense difficulty, however (as already noted in discussion of the concept of articulation), in establishing whether a model has the correct degree of complexity for its intended task. Values of parameters. Again, in an ideal world the values to be assigned to the model's parameters would be invariant and universally applicable to whatever the specific sector of the environment for which a predictive exposure assessment is required. In practice there will merely be successively less good approximations to this ideal, roughly in the following descending order: (a) The parameter is associated with an (essentially) immutable law of physics and can accordingly be assigned a single, equally immutable, value; (b) The parameter has been determined from a laboratory experiment designed to assess a single constituent mechanism, such as pollutant biodegradation, under the assumption that no other mechanisms are acting upon the destruction, transformation, or redistribution of the pollutant within the experiment; (c) The parameter has been determined by calibration of the model with a set of observations of the field system; (d) A value has been assigned to the parameter on the basis of values quoted in the literature from the application of models incorporating the same mathematical expression of the same constituent process mechanism. It is misleading to suppose that the result of (b) will be independent of an assumed model of the behavior observed in the laboratory experiment. The coefficient itself is not observed. Instead, for example, the concentration of pollutant remaining undegraded in the laboratory beaker or chemostat is 59 ------- observed. Once a mathematical description of the mechanism assumed to be operative in the experiment is postulated, then the value of the parameter can be inferred from matching the performance of this model with the observations (which in effect is the same procedure as that of (c). (vi) Parameter uncertainty. Evidence should be presented on the range of values assigned to a particular parameter in past studies and/or on the magnitude and (where available) statistical properties of the estimation errors associated with these values. In many cases it might be sufficient to assume that such ranges of values and distributions of errors are statistically independent of each other, but this can be misleading. Supplementary evidence of the absence/presence of correlation among the parameter estimates and errors could be both desirable and material to the judgement on model validity. For example, unless determined strictly independently ~ and it is not easy to see how that might be achieved — the values quoted for a bacterial growth-rate constant and death- rate constant are likely to be correlated. A pair of low values for both parameters can give the same net rate of growth as a pair of high values, and knowledge of such correlation can influence both the computation of, and assessment of, the uncertainty attaching to a prediction of future behavior. (vii) Analysis of parameter sensitivity. The extent to which the predictions of the model will change as a result of alternative assumptions about the values of the constituent parameters can be established from an analysis of parameter sensitivity. On its own such information provides only a weak index of model validity. It may be used, nevertheless, to supplement a judgement on the model's compositional validity based on the foregoing categories of evidence. In the absence of any knowledge of parameter uncertainty an analysis of sensitivity may yield insight into the validity of the model's composition through the identification, in extreme cases, of those "infeasible" values of the parameters that lead to unstable or absurd predictions. It could be used thus to establish in crude terms the domain of applicability of the model, i.e., ranges of values for the model's parameters for which "sensible" behavior of the model is guaranteed. In the presence of information on parameter uncertainty an analysis of sensitivity may enable rather more refined conclusions about the validity of the model. In particular, a highly sensitive, but highly uncertain, parameter is suggestive of an ill-composed model. It is clearly impossible to divorce an assessment of the evidence on the model's compositional validity — its intrinsic properties and attributes — from the current task specification. In particular, the less immutable the hypothesis (law) incorporating a given parameter is believed to be, the more relevant will become a judgement about the degree to which the current task specification deviates from those under which the values previously quoted for this parameter were derived. Such judgement will be especially difficult to make in the case of quantifying the correspondence (or divergence) between the laboratory conditions used to determine a rate constant and 60 ------- the field conditions for which a predictive exposure assessment is required. The judgement, nevertheless, is directed at the internal composition of the model, albeit conditioned upon the degree of similarity between the current and previous task definitions. 5.1.2 Examination of the Model's Performance Evidence must also be assembled from the results of tests of a model's performance against an external reference definition of the prototype (field) system's behavior. This will have various levels of refinement, approximately in the following ascending order. (1) Unpaired tests. In these the coincidence between values for the model's state variables and values observed for corresponding variables of the prototype system at identical points in time and space is of no consequence. It is sufficient merely for certain aggregate measures of the collection of model predictions and the collection of field data to be judged to be coincident. For example, it might be required that the mean of the computed concentrations of a contaminant in a representative (model) pond over an annual cycle is the same as the mean of a set of observed values sampled on a casual, irregular basis from several ponds in a geologically homogeneous region. Within such unpaired tests, there are further, subsidiary levels of refinement. A match of mean values alone is less reassuring than a match of both the means and variances, which is itself a less incisive test than establishing the similarity between the two entire distributions. (ii) Paired tests. For these it is of central concern that the predictions from the model match the observed values at the same points in time and space. Again, as with the unpaired tests, subsidiary levels of refinement are possible, in providing an increasingly comprehensive collection of statistical properties for the errors of mismatch so determined. (iii) Sequence of errors. A paired-sample test, as defined above, makes no reference to the pattern of the errors of mismatch as they occur in sequence from one point in time (or space) to the next. When sufficient observations are available a test of the temporal (or spatial) correlations in the error sequences may yield strong evidence with which to establish the performance validity of the model. In this case a "sufficiency" of data implies observations of the contaminant concentration at frequent, regular intervals over relatively long, unbroken periods. 61 ------- In much the same way as it is not possible to divorce an assessment of the compositional validity of a model from its current and past task specifications, so it is not possible to divorce an assessment of performance validity from the composition of the model. Thus a further two categories of evidence are relevant. (iv) Calibration. The task of model calibration necessarily involves adjustment and adaptation of the model's composition. The extent to which the values of the model's parameters have thereby been altered in order for the model to fit the calibration data set may render inadmissible the use of any associated error statistics for the purposes of judging model validity. It is therefore especially relevant for evidence of this form to be declared. (v) Prediction uncertainty. All models may be subjected to an analysis of the uncertainty attaching to their predictions. Such an analysis will depend on the composition of the model — through the quantification of parameter uncertainty; and it will depend upon the task specification, through a statement of the scenarios for the input disturbances and initial state of the system, i.e., the boundary and initial conditions for the solution of the model equations. The fact that the ambient concentration of the contaminant cannot be predicted with sufficient confidence does not necessarily signify an invalid model, however. For there are three sources of uncertainty in the predictions, two of which (the initial and boundary conditions) are independent of the model. Good practice in the analysis of prediction uncertainty (if a judgement on model validity is the objective) should therefore include some form of ranking of the contributions each source of uncertainty makes to the overall uncertainty of the prediction. Where Monte Carlo simulation is used to compute the distributions of the uncertain predictions, some — perhaps many — runs of the model may fail to be completed because of combinations of the model's parameter values leading to unstable or absurd output responses. As with an analysis of sensitivity, this provides useful information about the robustness of the model and restrictions on its domain of applicability. The less the model is found to be restricted, so the greater is the belief in its validity. In some cases, it may be feasible and desirable to state the output responses expected of the model in order for the task specification to be met, thus enabling a more refined assessment of the domain of applicability of the model (as in discussion of the concept of relevance). The use of combinations of parameter values leading to unacceptable deviations from the behavior of the task specification can be placed under restrictions. 5.1.3 Task specification Judgements on both the compositional and performance validity of the model are inextricably linked with an assessment of the extent to which the current task 62 ------- specification diverges from the task specifications of previous applications of the model. Categories of evidence relating to the fundamental properties of the task specification must therefore be defined, in a manner similar to those assembled in order to conduct an assessment of the model. For example, a model used previously for prediction of a chronic exposure at a single site with homogeneous environmental properties may well not be valid — in terms of performing its task reliably — for the prediction of an acute exposure at several sites with highly heterogeneous properties. It is not that the model is inherently incapable of making such predictions, but that there is an element of extrapolation into novel conditions implied by the different task specification. It is not the purpose of this document, however, to provide anything other than a very preliminary indication of the categories of evidence required to assess the degree of difference between current and past task specifications, as follows. (1) The contaminants. The class(es) of chemicals into which the contaminant would most probably fall, such as chlorinated hydrocarbon, or aromatic compound, for example, must be specified. The number of such chemicals to be released, and their interactions (synergism, antagonism, and so on) vis a vis the state variables of interest in the environment, must also be specified. (li) The environment. Several attributes can be employed to characterize the similarities and differences among the environments into which the contaminant is to be released. These include, inter alia, the geological, hydrological, and ecological properties of the sites of interest, together with statements of the homogeneity, or heterogeneity, of the site according to these attributes. (iii) Target organism, or organ. (iv) Nature of exposure. The obvious distinction to be made in this case is between acute and chronic exposures of the target organism to the contaminant. 63 ------- APPENDIX D REQUIREMENTS FOR A "STATE OF THE ART" EVALUATION PROCESS Changes in the Concept of Model Validation Over the past 10 years previous model validation approaches have been recognized as unsatisfactory. It is no longer accepted that models can be validated as defined by ASTM standard E 978-84 (e.g., comparison of model results with numerical data independently derived from experience or observation of the environment) and then considered to be "true". The discussions supporting the idea that models could not be validated (Konikow and Bredehoeft 1992, Oreskes et al. 1994), focused on hydrological models. Typically these models are first calibrated, i.e., parameter values are estimated using one data set, and then the effectiveness of that parameterization is examined using a second, independent data set. However, one, or even more, successful tests using particular data sets does not mean that a model is valid in the sense of being Hue and able to make reliable predictions for unknown future conditions. The realization of this problem has led ASTM to update its definition of model validation to: "a test of the model with known input and output information that is used to assess that the calibration parameters are accurate without further change" (ASTM E 978 - 92). Practical approaches to validation have varied between environmental and ecological modelers. In ecology both the objectives of modeling and the data available for calibration and testing have frequently differed from those used in environmental modeling, such as hydrology. The objective of ecosystem modeling, as a particular example, has been to synthesize information about an ecosystem from a range of sources and no integrated calibration for the whole model may be possible. In this way an ecosystem model represents a complex ecological theory and there may be no independent data set to provide a test of the complete model. Consequently, the inability to validate models, in the sense of them being considered as absolutely true, has been at least tacitly accepted in ecosystem modeling for some time. Recent developments in environmental models, such as TRIM.FaTE and other multi-media type models, are similar in their methods of construction to ecosystem models. An approach for such models is to replace validation, as though it were an endpoint that a model could achieve, with model evaluation as a process that examines each of the different elements of theory, mathematical construction, software construction, calibration and testing with data. 64 ------- APPENDIX E TYPES OF MODELS USED BY EPA Physical Models: Atmospheric Emissions Models (e.g., GloED) Water Treatment/Distribution Models (e.g., EPANET) Emissions Control Models (e.g., IAPCS) Stream Flow Model (e.g., PC-DFLOW) Atmospheric Models (e.g., Models-3, ISC) Chemical Estimation Models (e.g., AMEM) Subsurface (e.g., SESOIL/AT123D) Surface Water (e.g., SED3D,) Biological Models: Atmospheric Models (e.g., BEIS-2) Chemical Estimation Models (e.g., BIOWIN) Ecological Exposure Models (e.g., ReachScan, FGETS) Human Health Models (e.g., TherdbASE, DEEM) Subsurface (e.g., PRZM2) Multimedia Models: (MULTIMED, MULTIMDP, GEMS, PCGEMS, M M SOIL, SEAS, TRIM, HWIR, MIMS) 65 ------- APPENDIX F NONPOINT SOURCE MODEL REVIEW EXAMPLE This is an example report showing the type of model information used in the 1991 review (Donigian and Huber, 1991). 1. Name of the Method Hydrological Simulation Program—Fortran (HSPF) Stream Transport and Agricultural Runoff of Pesticides for Exposure Assessment (STREAM) 2. Type of Method Surface Water Model: Simple Approach xxx Surface W ater Model: Refined Approach Air Model: Simple Approach Air Model: Refined Approach Soil (Groundwater) Model: Simple Approach xxx Soil (Groundwater) Model: Refined Approach Multi-media Model: Simple Approach Multi-media Model: Refined Approach 3. Purpose/Scope Purpose: Predict concentrations of contaminants in xxx Runoff Waters xxx Surface Waters xxx Ground Waters Source/Release Types xxx Continuous xxx Single Level of Application: 66 xxx Intermittent xxx Multiple xxx Diffuse ------- xxx Screening xxx Intermediate xxx Detailed 67 ------- Type of Chemicals: xxx Conventional xxx Organic Metals Unique Features: xxx Addresses Degradation Products xxx Integral Database/Database Manager Integral Uncertainty Analysis Capabilities xxx Interactive Tnput/Execution Manager 4. Level of Effort System setup: xx mandays xx manweeks manmonths manyear Assessments: mandays xx manweeks xx manmonths manyear (Estimates reflect order-of-magnitude values and depend heavily on the experience and ability of the assessor.) 5. Description of the Method/Techniques Hycfrological Simulation Program h'ORTRAN (HSPF) is a comprehensive package for simulation of watershed hydrology and water quality for both conventional and toxic organic pollutants. HSPF incorporates the watershed scale ARM and NPS models into a basic-scale analysis framework that includes fate and transport in one-dimensional stream channels. It is the only comprehensive model of watershed hydrology and water quality that allows the integrated simulation of land and soil contaminant runoff processes with instream hydraulic and sediment-chemical i nteracti ons. The result of this simulation is a time history of the runoff flow rate, sediment load, and nutrient and pesticide concentrations, along with a time history of water quantity and quality at any point in a watershed. HSPF simulates three sediment types (sand, silt, and clay) in addition to a single organic chemical and transformation products of that chemical. The transfer and reaction processes included are hydrolysis, oxidation, photolysis, biodegradation, a volatilization, and sorption. Sorption is modeled as a first-order kinetic process in which the user must specify a desorption rate and an equilibrium partition coefficient for each of the three solid types. Resuspension and settling of silts and clays (cohesive solids) are defined in terms of shear stress at the sediment-water interface. For sands, the capacity of the system to transport sand at a particular flow is calculated and resuspension or settling is defined by the difference between 68 ------- the sand in suspension and the capacity. Calibration of the model requires data for each of the three solids types. Benthic exchange is modeled as sorption/desorption and desorption/scour with surficial benthic sediments. Underlying sediment and pore water are not modeled. 6. Data Needs/Availability Data needs for HSPF are extensive. HSPF is a continuous simulation program and require s continu ous data to drive the simulati ons. As a minimu m, continu ous rainfall records are require d to drive the runoff model and additio nal records of evapotr anspira tion, temper ature, and solar intensit y are 69 ------- desirab le. A large number of model parame ters can also be specifie d althoug h default values are provide d where reason- able values are availabl e. HSPF is a genera 1- purpos e pro- gram and special attentio n has been paid to cases where input parame ters are omitted ------- Option flags allow bypassi ng of whole section s of the progra m where data are not availabl e. 7. Output of the Assessment HSPF produces a time history of the runoff flow rate, sediment load, and nutrient and pesticide concentrations, along with a time history of water quantity and quality at any point in a watershed. Simulation results can be processed through a frequency and duration analysis routine that produces output compatible with conventional toxicological measures (e.g., 96- hour LC50). 8. Limitations HSPF assumes that the "Stanford Watershed Model" hydrologic model is appropriate for the area being modeled. Further, the instream model assumes the receiving water body model is well-mixed with width and depth and is thus limited to well-mixed rivers and reservoirs. Application of this methodology generally requires a team effort because of its comprehensive nature. 9. Hardware/Software Requirements The program is written in standard FORTRAN 77 and has been installed on systems as small as IBM PC/AT-compatibles. A hard disk is required for operation of the program and a math co-processor is highly recommended. No special peripherals other than a printer are required. The program is maintained for both the IBM PC-compatible and the DEC/VAX with VMS operating system. Executable code prepared with the Ryan-McFarland FORTRAN compiler and PLINK86 linkage editor is available for the MS/DOS environment. Source code only is available for the VAX environment. The program can be obtained in either floppy disk format for MS/DOS operation systems or on a 9-TRK magnetic tape with installation instructions for the DEC VAX VMS environment. This program has been installed on a wide range of computers world-wide with no or minor modifications. 71 ------- 10. Experience HSPF and the earlier models from which it was developed have been extensively applied in a wide variety of hydrologic and water quality studies (Barnwell and Johanson, 1981; Barnwell and Kittle, 1984) including pesticide runoff testing (Lorber and Mulkey, 1981), aquatic fate and transport model testing (Mulkey et al., 1986; Schnoor et al., 1987) analyses of agricultural best management practices (Donigian et al., 1983a; 1983b; Imhoff et al., 1983) and as part of pesti- cide exposure assessments in surface waters (Mulkey and Donigian, 1984). An application of HSPF to five agricultural watersheds in a screening methodology for pesticide review is given in Donigian (1986). The Stream Transport and Agricultural Runoff for Exposure Assessment (STREAM) Methodology applies the HSPF program to various test watersheds for five major crops in four agricultural regions in the U.S., defines a "representative watershed" based on regional conditions and an extrapolation of the calibration for the test watershed, and performs a sensitivity analysis on key pesticide parameters to generate cumulative frequency distributions of pesticide loads and concentrations in each region. The resulting methodology requires the user to evaluate only the crops and regions of interest, the pesticide application rate, and three pesticide parameters—the partition coefficient, the soil/ sediment decay rate, and the solution decay rate. 11. Validation/Review The program has been validated with both field data and model experiments and has been reviewed by independent experts. Numerous citations for model applications are included in the References below. Recently, model refinements for instream algorithms related to pH and sediment-nutrient interactions have been sponsored by the USGS and the EPA Chesapeake Bay Program, respectively. 12. Contact The model is available from the Center for Exposure Assessment Modeling at no charge. Mainframe versions of the programs compatible with the DEC VAX systems are available on standard one-half inch, 9-track magnetic tape. When ordering tapes, please specify the type of computer system that the model will be installed on (VAX, PRIME, HP, Cyber, IBM, etc.), whether the tape should be non-labeled (if non-labeled specify the storage format, EBCDIC or ASCII), or if the tape should be formatted as a VAX files-11, labeled (ASCII) tape for DEC systems. Model distributions tapes contain documentation covering installation instructions on DEC systems, FORTRAN source code files, and test input data sets and output files that may be used to test and confirm the installation of the model on your system. Users are responsible for installing programs. Requests for PC versions of the models should be accompanied by 8 formatted double-sided, double-density (DS/DD), error-free diskettes. Please do not send high-density (DD/HD) diskettes. Model distribution diskettes contain documentation covering installation instructions on PC systems, DOS batch files for compiling, linking, and executing the model, executable 72 ------- task image(s) ready for execution of the model(s), all associated runtime files, and test input data sets and corresponding output files that may be used to test and confirm the installation of the model on your PC or compatible system. To obtain copies of the models, please send 9-track specifications or the appropriate number of formatted diskettes to the attention of David Disney at the following address: Center for Exposure Assessment Modeling U.S. Environmental Protection Agency Environmental Research Laboratory Athens, Georgia 30613 (404) 546-3123 USA Program and/or user documentation, or instructions on how to order documentation, will accompany each response. References Barnwell, T.O. 1980. An Overview of the Hydrologic Simulation Program—FORTRAN, a Simulation Model for Chemical Transport and Aquatic Risk Assessment. Aquatic Toxicology and Hazard Assessment: Proceedings of the Fifth Annual Symposium on Aquatic Toxicology, ASTM Special Tech. Pub. 766, ASTM, 1916 Race Street, Philadelphia, PA 19103. Barnwell, T.O. and R. Johanson. 1981. HSPF: A Comprehensive Package for Simulation of Watershed Hydrology and Water Quality. Nonpoint Pollution Control: Tools and Techniques for the Future. Interstate Commission on the Potomac River Basin, 1055 First Street, Rockville, MD 20850. Barnwell, T.O. and J.L. Kittle. 1984. Hydrologic Simulation Program—FORTRAN: Development, Maintenance and Applications. Proceedings Third International Con- ference on Urban Storm Drainage. Chalmers Institute of Technology, Goteborg, Sweden. Bicknell, B.R., AS. Donigian Jr. and T.O. Barnwell. 1984. Modeling Water Quality and the Effects of Best Management Practices in the Iowa River Basin. J. Wat. Sci. Tech. 17:1141-1153. 73 ------- Chew, Y.C., L.W. Moore, and R.H Smith. 1991. Hydrologic SIMULATION of Tennessee's North Reelfoot Creek Watershed. J. Water Pollution Control Federation 63(1): 10-16. Donigian, A.S., Jr., J.C. Imhoff and B.R. Bicknell. 1983. Modeling Water Quality and the Effects of Best Management Practices in Four Mile Creek, Iowa. EPA Contract No. 68-03-2895, Environmental Research Laboratory, U.S. EPA, Athens, GA 30613. Donigian, A.S., Jr., J.C. Imhoff, B.R. Bicknell and J.L. Kittle, Jr. 1984. Application (Inide for the Hydrological Simulation Program—FORTRAN EPA. 600/3-84-066, Envi- ronmental Research Laboratory, U.S. EPA, Athens, GA. 30613. Donigian, A.S., Jr., D.W. Meier and P.P. Jowise. 1986. Stream Transport and Agricultural Runoff for Exposure Assessment: A Methodology. EPA/600/3-86- 011, Environmental Research Laboratory, U.S. EPA, Athens, GA 30613. Donigian, A.S., Jr., B.R. Bicknell, L.C. Linker, J. Hannawald, C. Chang, and R. Reynolds. 1990. Chesapeake Bay Program Watershed Mode! Application to Calculate Bay Nutrient Loadings: Preliminary Phase I Findings and Recommendations. Prepared by AQUA TERRA Consultants for U.S. EPA Chesapeake Bay Program, Annapolis, MD. Hicks, C.N., W.C. Huber and J.P. Heaney. 1985. Simulation of Possible Effects of Deep Pumping on Surface Hydrology Using HSPF. Proceedings of Stormwater and Water Quality Model User Group Meeting. January 3 1-February 1, 1985. T.O. Barnwell, Jr., ed. EPA-600/9-85/016. Environmental Research Laboratory, Athens, GA. Johanson, R.C., J.C. Imhoff, J.L. Kittle, Jr. and A.S. Donigian. 1984. Hydrological Simulation Program FORTRAN (HSPF): Users Manual for Release 8.0. EPA- 600/3-84-066, Environmental Research Laboratory, U.S. EPA, Athens, GA. 30613. Johanson, R.C. 1989. Application of the HSPF Model to Water Management in Africa. Proceedings of Stormwater and Water Quality Model Users Group Meeting. October 3-4, 1988. Guo, et al., eds. EPA-600/9-89/001. Environmental Research Laboratory, Athens, GA. Lorber, M.N. and L.A. Mulkey. 1982. An Evaluation of Three Pesticide Runoff Loading Models. J. Environ. Qual. 11:519-529. 74 ------- Moore, L.W., H. Matheny, T. Tyree, D. Sabatini and S.J. Klaine. 1988. Agricultural Runoff Modeling in a Small West Tennessee Watershed. J. Water Pollution Control Federation 60(2):242-249. Motta, D.J. and M.S. Cheng. 1987. The Henson Creek Watershed Study. Proceedings of Stormwater and Water Quality Users Group Meeting. October 15-16, 1987. H.C. Torno, ed. Charles Howard and Assoc., Victoria, BC, Canada. Mulkey, L.A., R.B. Ambrose, and T.O. Barnwell. 1986. Aquatic Fate and Transport Modeling Techniques for Predicting Environmental Exposure to Organic Pesticides and Other Toxicants—A Comparative Study. Urban Runoff Pollution, Springer-Verlag, New York, NY. Nichols, J.C. andM.P. Timpe. 1985. Use ofHSPF to Simulate Dynamics of Phosphorus in Floodplain Wetlands over a Wide Range of Hydrologic Regimes. Proceedings of Stormwater and Water Quality Mode! IIsers Group Meeting. January 31-February I, 1985. T.O. Barnwell, Jr., ed. EPA-600/9-85/016, Environmental Research Laboratory, Athens, GA. Schnoor, J.L., C. Sato, D. McKetchnie, and D. Sahoo. 1987. Processes, Coefficients, and Models for Simulating Toxic Organics and Heavy Metals in Surface Waters. EPA/600/3-87/015. U.S. Environmental Protection Agency, Athens, GA 30613. Schueler, T.R. 1983. Seneca Creek Watershed Management Study, Final Report, Volumes I and II. Metropolitan Washington Council of Governments, Washington, DC. Song, J.A., G.F. Rawl, and W.R. Howard. 1983. Lake Manatee Watershed Water Resources Evaluation using Hydrologic Simulation Program—FORTRAN (HSPF). (\>Uoque sur la Modelisation desEaux Pluviales. Septembre 8-9, 1983. P. Beron, et al., T. Barnwell, editeurs. GREMU—83/03 Ecole Polytechnique de Montreal, Quebec, Canada. Sullivan, M.P. and T.R. Schueler. 1982. The Piscataway Creek Watershed Model: A Storm water and Nonpoint Source Management Tool. Proceedings Stormwater and Water Quality Management Modeling and SWMM Users Group Meeting. October 18-19, 1982. Paul E. Wisner, ed. Univ. of Ottawa, Dept. Civil Engr., Ottawa, Ont., Canada. Weatherbe, D.G. and Z. Novak. 1985. Development of Water Management Strategy for the Humber River. Proceedings Conference on Stormwater and Water Quality Management Modeling. September 6-7, 1984. E.M. and W. James, ed. Computational Hydraulics Group, McMaster University, Hamilton, Ont., Canada. 75 ------- Woodruff, D.A., D.R. Gaboury, R.J. Hughto and G.K. Young. 1981. Calibration of Pesticide Behavior on a Georgia Agricultural Watershed Using HSP-F. Proceedings Stormwater and Water Quality Model Users Group Meeting. September 28-29, 1981. W. James, ed. Computational Hydraulics Group, McMaster University, Hamilton, Ont., Canada. Udhiri, S., M-S Cheng and R.L. Powell. 1985. The Impact of Snow Addition on Watershed Analysis Using HSPF. Proceedings of Stormwater and Water Quality Model Users Group Meeting. January 31-February 1, 1985. T.O. Barnwell, Jr., ed. EPA-600/9- 85/016, Environmental Research Laboratory, Athens, GA. 76 ------- APPENDIX G - COST ESTIMATES The following are the cost estimates received in response to our inquiries: MMSOILS Benchmarking Evaluation (EPA's portion, DOE unknown) Cost: Scenario Development about $50,000; Execution of 4 models about $100,000; Output comparison and write-up (Journal articles) $150,000; Total = $300,000. RADM Evaluation Study Cost: It was estimated that EPA provided about 18.5 Million for the 2.5 year evaluation effort (17 M for field studies, ,5M for NOAA FTEs, and 1M for contractors to run the model in tests). [files for contract support have been disposed of] AIR Dispersion Model Clearinghouse and SCRAM: 2 FTEs/GS-13 - clearinghouse, regional support, and support of regional workshops MCHISRS $50,000 contractor over few years with little upkeep for SCRAM Air Modeling Regulatory program - 20 to 25 staff for 20 years with extramural support $1.5 to 2 M per year Model performance evaluation and peer review about $150 to 200 K per model category (2- 10 models) AERMOD over 6 years exceeded $500K the evaluation portion is less than 10% of the total. EPAMMM Evaluation (Screening model evaluation without site data) V2 FTE (about $60,000) Software Evaluation and Documentation Costs: Checked with Margarette Shovlin who said costs are not broken out to the level of model code testing and verification or documentation on a level of effort contract. Larry Zaragoza attempted to get estimates for the IEUBK model and found the same thing getting the information would be tedious- a special request to the ESDS's SAIC management would have to be made by OARM. Larry estimated the IEUBK coding cost about $200K but its hard to separate out test costs and it depends on the language used and how close to the actual programming documentation is done. He estimated that if documentation was done late in the process the cost could equal half 77 ------- the total project cost. At the Models 2000 conference Bob Carsel estimated that for a ground water model, software evaluation and documentation cost about 30 % of the project cost. APPENDIX H - REFERENCES American Society for Testing and Materials. 1992. Standard Practice for Evaluating Environmental Fate Models of Chemicals. Standard 978-92. Philadelphia: American Society for Testing and Materials. Beck, M. B., L.A. Mulkey, and T.O. Barnwell. 1994. Draft Model Validation for Predictive Exposure Assessments. Presented to the Risk Assessment Forum, July 1994. Beck, M. B., J. R. Ravetz, L.A. Mulkey, and T.O. Barnwell. 1997. On the Problem of Model Validation for Predictive Exposure Assessments. Stochastic Hydrology- and Hydraulics 11: 229-254. Springer-Verlag. Beck, M. B., L.A. Mulkey, T.O. Barnwell, and J. R. Ravetz. 1997. Model Validation for Predicative Exposure Assessments. 1995 International Environmental Conference Proceedings, p. 973-980. TAPPI Proceedings. Chen, J. and M. B. Beck. 1998. Quality Assurance of Multi-Media Model for Predictive Screening Tasks. (EPA/600/R-98/106). August 1998. Chen, Y. D., S.C. McCutcheon, R.F. Carsel, D.J. Norton, and J.P. Craig. 1996. Enhancement and Application of HSPF for Stream Temperature Simulation in Upper Grande Ronde Watershed, Oregon. Watershed V6 p. 312-315. Chen, Y. D., R.F. Carsel, S.C. McCutcheon, and W.L. Nutter. 1998. Stream Temperature Simulation of Forested Riparian Areas: I. Watershed-scale Model Development. Journal of Environmental Engineering p. 304-315. April 1998. Chen, Y. D., S.C. McCutcheon, D.J. Norton, and W.L. Nutter. 1998. Stream Temperature Simulation of Forested Riparian Areas: n. Model Application. Journal of Environmental Engineeringp. 316-328. April 1998. 78 ------- Cicchetti, D.V. 1991. The Reliability of Peer Review for Manuscript and Grant Submissions: A Cross-Disciplinary Investigation. Behavioral and Brain Sciences 14:119-186. Donigian, A.S., Jr. and W.C. Huber. 1991. Modeling of Nonpoint Source Water Quality on Urban and Non-urban Areas. EPA/600/3-91/039 (NTIS PB92-109115) U.S. Environmental Protection Agency, Athens, GA. Doucet, P. and P.B. Sloep. 199?. Mathematical Modeling in the Life Sciences. Ellis Horwood, New York. p. 280-281. Gillies, D. 1993. Philosophy of Science in the Twentieth Century. Blackwell, Oxford, U.K. Konikow, L.F. and J.D. Bredehoeft. 1992. Ground-Water Models Cannot Be Validated. Adv. Water Resources 15:75-83. Oreskes, N., K. Shrader-Frechette, and K. Belitz. 1994. Verification, Validation, and Confirmation of Numerical Models in the Earth Sciences. Science 263:641 -646. Laniak, G.F., J. G. Droppo, E. R. Faillace, E. K. Gnanapragasam, W.B. Mills, D.L. Strenge, G. Whelan, and C. Yu. 1997. An Overview of a Multimedia Benchmarking Analysis for Three Risk Assessment Models: RESRAD, MMSOILS, and MEPAS. Risk Analysis 17(2): 1-23. April 1997. Lee, S.B., V. Ravi, J. R. Williams, and D.S. Burden. 1996. Subsurface Fluid Flow (Groundwater and Vadose Zone) Modeling: Application of Subsurface Modeling Application. ASTMSpecial Technical Publication 12SS p. 3-13. National Acid Precipitation Assessment Program. 1990. Acidic Deposition: State of Science and Technology: Report 5. Evaluation of Regional Acidic Deposition Models (Part I) and Selected Applications of 'RADM (Part II). September 1990. National Institute of Standards and Technology. 1996. NIST Special Publication 500-234: Reference Information for the Software Verification and Validation Process. (http:Whissa.ncsl.nist.gov/HHRFdata/Artifacts/ITLdoc/234/val-proc.html) Peters, D.P. and S.J. Ceci. 1982. Peer-Review Practices of Psychology Journals: The Fate of Published Articles Submitted Again. Behavioral and Brain Sciences 5:187-255. U SEP A. 1993. Computer Models Used to Support Cleanup Decision-Making at Hazardous and Radioactive Waste Sites. EPA 402-R-93-005, March 1993. (NTIS, PB93-183333/XAB). 79 ------- USEPA. 1993. Environmental Pathway Models-Ground-Water Modeling in Support of Remedial Decision-Making at Sites Contaminated with Radioactive Material. EPA 402- R-93-009, March 1993. (NTIS, PB93-196657/XAB). USEPA. 1994. Technical Guide to Ground-Water Model Selection at Sites Contaminated with Radioactive Substances. EPA 402-R-94-012, September 1994. (NTIS, PB94- 205804/XAB). USEPA. 1996. Documenting Ground-Water Modeling at Sites Contaminated with Radioactive Substances. EPA 540-R-96-003, January 1996. (NTIS, PB96-963302/XAB). USEPA Office of Air Quality Planning & Standards. 1998. The iota! Risk Integrated Methodology - Technical Support Document for the TRIM.Fa'lE Module Draft. EPA-452/D-98-001. March 1998. USEPA Office of the Administrator. 1994. Guidance for (\mducting External Peer Review of Environmental Peer Review of Environmental Regulatory Models. EPA 100-B-94-001. July 1994. USEPA Office of Research and Development. 1997. Guiding Principles for Monte Carlo Analysis. EPA/630/R-97/001. March 1997. USEPA Office of Science Policy, Office of Research and Development. 1998. Science Policy Council Handbook: Peer Review. EPA 100-B-98-001. January 1998. USEPA Solid Waste and Emergency Response. 1994. Report of the Agency Task Force on Environmental Regulatory Modeling: Guidance, Support Needs, Draft Criteria and Charter. EPA 500-R-94-001. March 1994. US General Accounting Office. 1997. Report to the Chairman, Subcommittee on Oversight and Investigations, Committee on Commerce, House of Representatives: Air Pollution - Limitations of l<'JyA 's Motor Vehicle Emissions Model and Plans to Address Them. GAO/RCED-97-21 September 1997. van Vallen, L. and F.A. Pitelka. 1974. Commentary - Intellectual Censorship in Ecology. Ecology 55:925-926. 80 ------- |