EPA Document # 740-P1-8001
United States Environmental Protection Agency
Office of Chemical Safety and Pollution Prevention

APPLICATION OF SYSTEMATIC REVIEW IN TSCA RISK EVALUATIONS
MAY 2018

TABLE OF CONTENTS

TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGEMENTS
1 PURPOSE OF THE DOCUMENT
2 SCOPING AND PROBLEM FORMULATION: ANALYTICAL FRAMEWORK GUIDING SYSTEMATIC REVIEW IN TSCA RISK EVALUATIONS
3 INTEGRATION OF SYSTEMATIC REVIEW PRINCIPLES INTO TSCA RISK EVALUATIONS
3.1 Protocol Development
3.2 Data Collection
3.2.1 Data Search
3.2.1.1 Summary of the Literature Search Strategy for the First Ten TSCA Risk Evaluations
3.2.2 Data Screening
3.2.2.1 Title/Abstract Screening
3.2.2.1.1 Summary of the Title/Abstract Screening Conducted for the First Ten TSCA Risk Evaluations
3.2.2.2 Full Text Screening
3.2.2.2.1 Summary of the Full Text Screening Conducted for the First Ten TSCA Risk Evaluations
3.2.2.3 Data Extraction
3.3 Data Evaluation
3.4 Data Integration and Summary of Findings
4 UPDATES TO THE DATA SEARCH AND SCREENING RESULTS FOR THE FIRST TEN RISK EVALUATIONS
4.1 Initial Data Search
4.2 Initial Title/Abstract Screening
5 REFERENCES
APPENDIX A: STRATEGY FOR ASSESSING THE QUALITY OF DATA/INFORMATION SUPPORTING TSCA RISK EVALUATIONS
A.1 Evaluation Method
A.2 Documentation and Instructions for Reviewers
A.3 Important Caveats
A.4 References
APPENDIX B: DATA QUALITY CRITERIA FOR PHYSICAL/CHEMICAL PROPERTY DATA
APPENDIX C: DATA QUALITY CRITERIA FOR FATE DATA
C.1 Types of Fate Data Sources
C.2 Data Quality Evaluation Domains
C.3 Data Quality Evaluation Metrics
C.4 Scoring Method and Determination of Overall Data Quality Level
C.4.1 Weighting Factors
C.4.2 Calculation of Overall Study Score
C.5 Data Quality Criteria
C.6 References
APPENDIX D: DATA QUALITY CRITERIA FOR OCCUPATIONAL EXPOSURE AND RELEASE DATA
D.1 Types of Environmental Release and Occupational Exposure Data Sources
D.2 Data Quality Evaluation Domains
D.3 Data Quality Evaluation Metrics
D.4 Scoring Method and Determination of Overall Data Quality Level
D.4.1 Weighting Factors
D.4.2 Calculation of Overall Study Score
D.5 Data Sources Frequently Used in Occupational Exposure and Release Assessments
D.6 Data Extraction Templates to Assist the Data Quality Evaluation
D.7 Data Quality Criteria
D.7.1 Monitoring Data
D.7.2 Environmental Release Data
D.7.3 Published Models for Environmental Releases or Occupational Exposures
D.7.4 Data/Information from Completed Exposure or Risk Assessments
D.7.5 Data/Information from Reports Containing Other than Exposure or Release Data
D.8 References
APPENDIX E: DATA QUALITY CRITERIA FOR STUDIES ON CONSUMER, GENERAL POPULATION AND ENVIRONMENTAL EXPOSURE
E.1 Types of Consumer, General Population and Environmental Exposure Data Sources
E.2 Data Quality Evaluation Domains
E.3 Data Quality Evaluation Metrics
E.4 Scoring Method and Determination of Overall Data Quality Level
E.4.1 Weighting Factors
E.4.2 Calculation of Overall Study Score
E.5 Data Sources Frequently Used in Consumer, General Population and Environmental Exposure Assessments
E.6 Data Quality Criteria
E.6.1 Monitoring Data
E.6.2 Modeling Data
E.6.3 Survey Data
E.6.4 Epidemiology Data to Support Exposure Assessment
E.6.5 Experimental Data
E.6.6 Database Data
E.6.7 Completed Exposure Assessments and Risk Characterizations
E.7 References
APPENDIX F: DATA QUALITY CRITERIA FOR ECOLOGICAL HAZARD STUDIES
F.1 Types of Data Sources
F.2 Data Quality Evaluation Domains
F.3 Data Quality Evaluation Metrics
F.4 Scoring Method and Determination of Overall Data Quality Level
F.4.1 Weighting Factors
F.4.2 Calculation of Overall Study Score
F.5 Data Quality Criteria
F.6 References
APPENDIX G: DATA QUALITY CRITERIA FOR STUDIES ON ANIMAL AND IN VITRO TOXICITY
G.1 Types of Data Sources
G.2 Data Quality Evaluation Domains
G.3 Data Quality Evaluation Metrics
G.4 Scoring Method and Determination of Overall Data Quality Level
G.4.1 Weighting Factors
G.4.2 Calculation of Overall Study Score
G.5 Data Quality Criteria
G.5.1 Animal Toxicity Studies
G.5.2 In Vitro Toxicity Studies
G.6 References
APPENDIX H: DATA QUALITY CRITERIA FOR EPIDEMIOLOGICAL STUDIES
H.1 Types of Data Sources
H.2 Data Quality Evaluation Domains
H.3 Data Quality Evaluation Metrics
H.4 Scoring Method and Determination of Overall Data Quality Level
H.4.1 Weighting Factors
H.4.2 Calculation of Overall Study Score
H.5 Data Quality Criteria
H.6 References

LIST OF TABLES

Table A-1. Definition of Overall Quality Levels and Corresponding Quality Scores
Table A-2. Documentation Template for Reviewer and Data/Information Source
Table B-1. Evaluation Metrics and Ratings for Physical-Chemical Property Data
Table C-1. Types of Fate Data
Table C-2. Data Evaluation Domains and Definitions for Fate Data
Table C-3. Summary of Metrics for the Fate Data Evaluation Domains
Table C-4. Fate Metrics with Greater Importance in the Evaluation and Rationale for Selection
Table C-6. Scoring Example for Abiotic Fate Data (i.e., hydrolysis data) with All Applicable Metrics Scored
Table C-7. Scoring Example for Abiotic Fate Data (i.e., hydrolysis data) with Some Metrics Not Rated/Not Applicable
Table C-8. Scoring Example for QSAR Data
Table C-9. Serious Flaws that Would Make Fate Data Unacceptable for Use in the Fate Assessment
Table C-10. Data Quality Criteria for Fate Data
Table D-1. Types of Occupational Exposure and Environmental Release Data Sources
Table D-2. Data Evaluation Domains and Definitions
Table D-3. Summary of Quality Metrics for the Five Types of Data Sources
Table D-4. Metric Weighting Factors and Range of Weighted Metric Scores for Scoring the Quality of Environmental Release and Occupational Data
Table D-5. Scoring Example for Published Models where Sample Size is Not Applicable
Table D-6. Examples of Data Sources Frequently Used in Occupational Exposure and Release Data
Table D-7. Data Extraction and Evaluation Template for General Life Cycle and Facility Data
Table D-8. Data Extraction and Evaluation Template for Occupational Exposure Data
Table D-9. Data Extraction and Evaluation Template for Environmental Release Data
Table D-10. Serious Flaws that Would Make Monitoring Data Unacceptable for Use in the Environmental Release and Occupational Exposure Assessment
Table D-11. Evaluation Criteria for Monitoring Data
Table D-12. Serious Flaws that Would Make Environmental Release Data Unacceptable for Use in the Environmental Release Assessment
Table D-13. Evaluation Criteria for Environmental Release Data
Table D-14. Serious Flaws that Would Make Published Models Unacceptable for Use in the Environmental Release and Occupational Exposure Assessment
Table D-15. Evaluation Criteria for Published Models
Table D-16. Serious Flaws that Would Make Data/Information from Completed Exposure or Risk Assessments Unacceptable for Use in the Environmental Release and Occupational Exposure Assessment
Table D-17. Evaluation Criteria for Data/Information from Completed Exposure or Risk Assessments
Table D-18. Serious Flaws that Would Make Data/Information from Reports Containing Other than Exposure or Release Data Unacceptable for Use in the Environmental Release and Occupational Exposure Assessment
Table D-19. Evaluation Criteria for Data/Information from Reports Containing Other than Exposure or Release Data
Table E-1. Types of Exposure Data Sources
Table E-2. Data Evaluation Domains and Definitions
Table E-3. Summary of Metrics for the Seven Data Types
Table E-4. Scoring Example for Monitoring Data
Table E-5. Examples of Data Sources Frequently Used for Consumer, General Population and Environmental Exposure Assessments
Table E-6. Serious Flaws that Would Make Sources of Monitoring Data Unacceptable for Use in the Exposure Assessment
Table E-7. Evaluation Criteria for Sources of Monitoring Data
Table E-8. Serious Flaws that Would Make Sources of Modeling Data Unacceptable for Use in the Exposure Assessment
Table E-9. Evaluation Criteria for Sources of Modeling Data
Table E-10. Serious Flaws that Would Make Sources of Survey Data Unacceptable for Use in the Exposure Assessment
Table E-11. Evaluation Criteria for Sources of Survey Data
Table E-12. Serious Flaws that Would Make Sources of Epidemiology Data Unacceptable for Use in the Exposure Assessment
Table E-13. Evaluation Criteria for Sources of Epidemiology Data to Support the Exposure Assessment
Table E-14. Serious Flaws that Would Make Sources of Experimental Data Unacceptable for Use in the Exposure Assessment
Table E-15. Evaluation Criteria for Sources of Experimental Data
Table E-16. List of Serious Flaws that Would Make Completed Exposure Assessments and Risk Characterizations Unacceptable for Use in the Exposure Assessment
Table E-17. Evaluation Criteria for Completed Exposure Assessments and Risk Characterizations
Table E-18. Serious Flaws that Would Make Sources of Database Data Unacceptable for Use in the Exposure Assessment
Table E-19. Evaluation Criteria for Sources of Database Data
Table F-1. Study Types that Provide Ecological Hazard Data
Table F-2. Data Evaluation Domains and Definitions
Table F-3. Data Evaluation Domains and Metrics for Ecological Hazard Studies
Table F-4. Ecological Hazard Metrics with Greater Importance in the Evaluation and Rationale for Selection
Table F-5. Metric Weighting Factors and Range of Weighted Metric Scores for Ecological Hazard Studies
Table F-6. Scoring Example for an Ecological Hazard Study with All Metrics Scored
Table F-7. Scoring Example for an Ecological Hazard Study with Some Metrics Not Rated/Not Applicable
Table F-8. Serious Flaws that Would Make Ecological Hazard Studies Unacceptable
Table F-9. Data Quality Criteria for Ecological Hazard Studies
Table G-1. Types of Animal and In Vitro Toxicity Data
Table G-2. Data Evaluation Domains and Definitions
Table G-3. Data Evaluation Domains and Metrics for Animal Toxicity Studies
Table G-4. Data Evaluation Domains and Metrics for In Vitro Toxicity Studies
Table G-5. Animal Toxicity Metrics with Greater Importance in the Evaluation and Rationale for Selection
Table G-6. In Vitro Toxicity Metrics with Greater Importance in the Evaluation and Rationale for Selection
Table G-7. Metric Weighting Factors and Range of Weighted Metric Scores for Animal Toxicity Studies
Table G-8. Metric Weighting Factors and Range of Weighted Metric Scores for In Vitro Toxicity Studies
Table G-9. Scoring Example for Animal Toxicity Study with All Metrics Scored
Table G-10. Scoring Example for Animal Toxicity Study with Some Metrics Not Rated/Not Applicable
Table G-11. Scoring Example for In Vitro Study with All Metrics Scored
Table G-12. Scoring Example for In Vitro Study with Some Metrics Not Rated/Not Applicable
Table G-13. Serious Flaws that Would Make Animal Toxicity Studies Unacceptable
Table G-14. Data Quality Criteria for Animal Toxicity Studies
Table G-15. Serious Flaws that Would Make In Vitro Toxicity Studies Unacceptable
Table G-16. Data Quality Criteria for In Vitro Toxicity Studies
Table H-1. Types of Epidemiological Studies
Table H-2. Data Evaluation Domains and Definitions
Table H-3. Summary of Metrics for the Seven Data Types
Table H-4. Epidemiology Metrics with Greater Importance in the Evaluation and Rationale for Selection
Table H-5. Summary of Domain, Metrics, and Weighting Approach for Studies with Biomarkers
Table H-6. Summary of Domain, Metrics, and Weighting Approach for Studies without Biomarkers
Table H-7. Example of Scoring for Epidemiologic Studies where Sample Size is Not Applicable
Table H-8. Serious Flaws that Would Make Epidemiological Studies Unacceptable for Use in the Hazard Assessment
Table H-9. Evaluation Criteria for Epidemiological Studies

LIST OF FIGURES

Figure 1-1. Road Map for Implementing Systematic Review for the First Ten TSCA Risk Evaluations
Figure 3-1. TSCA Systematic Review Process

ACKNOWLEDGEMENTS

This document was developed by the United States Environmental Protection Agency (U.S. EPA), Office of Chemical Safety and Pollution Prevention (OCSPP), Office of Pollution Prevention and Toxics (OPPT). The OPPT Assessment Team gratefully acknowledges participation and/or input from intra-agency reviewers in multiple offices within EPA and inter-agency reviewers from multiple Federal agencies, as well as assistance from EPA contractors GDIT (Contract No. CIO-SP3, HHSN316201200013W), ERG (Contract No. EP-W-12-006), ICF (Contract No. EP-C-14-001), SRC (Contract No. EP-W-12-003), and Versar (Contract No. EP-W-17-006).

Docket

This document can be found in EPA docket number EPA-HQ-OPPT-2018-0210.
A copy of the document is also placed in the following dockets:

Chemical Substance | Docket Number
Asbestos | EPA-HQ-OPPT-2016-0736
1-Bromopropane (1-BP) | EPA-HQ-OPPT-2016-0741
Carbon Tetrachloride (CCl4) | EPA-HQ-OPPT-2016-0733
1,4-Dioxane | EPA-HQ-OPPT-2016-0723
Cyclic Aliphatic Bromide Cluster (HBCD) | EPA-HQ-OPPT-2016-0735
Methylene Chloride | EPA-HQ-OPPT-2016-0742
N-Methylpyrrolidone (NMP) | EPA-HQ-OPPT-2016-0743
Perchloroethylene (PERC) | EPA-HQ-OPPT-2016-0732
Pigment Violet 29 (Anthra[2,1,9-def:6,5,10-d'e'f']diisoquinoline-1,3,8,10(2H,9H)-tetrone; PV29) | EPA-HQ-OPPT-2016-0725
Trichloroethylene (TCE) | EPA-HQ-OPPT-2016-0737

1 PURPOSE OF THE DOCUMENT

The U.S. EPA's Office of Pollution Prevention and Toxics (EPA/OPPT) generally intends to apply systematic review principles1 in the development of risk evaluations under the amended Toxic Substances Control Act (TSCA). This internal guidance sets out general principles to guide EPA's application of systematic review in the risk evaluation process for the first ten chemicals (Table 3-2), which EPA/OPPT initiated on December 19, 2016, as well as in future evaluations.

Integrating systematic review principles into the TSCA risk evaluation process is critical to developing transparent, reproducible and scientifically credible risk evaluations. EPA/OPPT plans to implement a structured process for identifying, evaluating and integrating evidence for both the hazard and exposure assessments developed during the TSCA risk evaluation process. New approaches and/or methods are expected to be developed to address specific assessment needs for the relatively large and diverse chemical space under TSCA. EPA/OPPT therefore expects to document the progress of implementing systematic review in the draft risk evaluations, through revisions of this document, and through publication of supplemental documents.

EPA invites the public to provide input on this document at www.regulations.gov, docket # EPA-HQ-OPPT-2018-0210. The public can also contact EPA with questions about this document at TSCA-systematicreview@epa.gov.

Supplemental documents released in June 2017 already document the data collection and screening activities for the first ten chemicals (Table 3-2). This document is the next supplemental publication; it contains the general principles that will guide EPA/OPPT in carrying out the systematic review process, along with the strategy for assessing data quality that EPA/OPPT generally plans to use for the TSCA risk evaluations. This document provides only the general expectations for evidence synthesis and integration; additional details on the approach for evidence synthesis and integration will be included with the publication of the draft TSCA risk evaluations. Figure 1-1 displays a general roadmap for implementing systematic review in the TSCA risk evaluation process for the first ten chemicals.

Ultimately, the goal is to establish an efficient systematic review process that generates high-quality, fit-for-purpose risk evaluations that rely on the best available science and the weight of the scientific evidence within the context of TSCA.

The information and procedures set forth in this document are intended as a technical resource for those conducting TSCA risk evaluations for existing chemicals. This internal guidance does not constitute rulemaking by the U.S. EPA and cannot be relied on to create a substantive or procedural right enforceable by any party in litigation with the United States.
Non-mandatory language such as "should" provides recommendations and does not impose any legally binding requirements. Similarly, statements about what EPA expects or intends to do reflect general principles that guide EPA's activities, not judgments or determinations as to what EPA will do in any particular case. This document is not necessarily applicable to risk assessments developed to support other EPA statutes or programs. This is a living document; EPA may revise it periodically and welcomes public input on it at any time. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government.

1 This document uses "principle" to mean a key concept or element guiding the series of steps (or processes) needed to incorporate systematic review approaches and/or methods in TSCA risk evaluations.

Figure 1-1. Road Map for Implementing Systematic Review for the First Ten TSCA Risk Evaluations

(Figure: timeline of milestones and related activities)
• Opening of public comment period for milestone #1 (Dec. 9, 2016); closing (March 15, 2017)
• Initiation of the first ten TSCA risk evaluations (Dec. 16, 2016); start of systematic data collection and screening
• Publication of TSCA Scopes, Literature Search Strategy and Bibliography documents (June 22, 2017)
• Opening of public comment period for milestone #2 (June 19, 2017); closing (September 19, 2017)
• Publication of TSCA Problem Formulation documents, systematic review process document and evaluation strategies (Spring 2018)
• Opening of public comment period for milestone #3 (Spring 2018); closing (Summer 2018)
• Analysis phase (data extraction, development of evidence integration strategy and draft TSCA risk evaluations)
• Publication of draft TSCA risk evaluations and evidence integration strategy (around December 2018), followed by a public comment period
• Peer review of TSCA risk evaluations, including systematic review approaches/methods (early 2019)
• Final risk evaluations (late 2019)

Notes for Figure 1-1:
• Important milestones are numbered in the figure. Although dates would differ, the milestones are also applicable to future TSCA risk evaluations.
• Star symbols mark those activities or technical documents that are related to the implementation of systematic review.
• Activities between milestones #3 and #6 show estimated timelines that are subject to change.
• There are multiple points in the process for public input.

2 SCOPING AND PROBLEM FORMULATION: ANALYTICAL FRAMEWORK GUIDING SYSTEMATIC REVIEW IN TSCA RISK EVALUATIONS

Scoping and problem formulation are important steps in providing the analytical framework for the systematic review efforts supporting the TSCA risk evaluations. They are the first stages of the TSCA risk evaluation process and are intended to convey EPA/OPPT's expectations regarding the overall scope, level of detail, and approach for the risk evaluation. This initial planning effort is critical to developing clear objectives and assessment questions to support quantitative risk analyses, and to defining the steps that EPA/OPPT expects to take to conduct the different components of the risk evaluation.
Scoping and problem formulation help shape the systematic review approaches and/or methods that will be used to identify, evaluate, analyze, and integrate evidence. For example, the outcomes of scoping and problem formulation are used to tailor a data search and screening strategy (including eligibility criteria) to identify relevant data and information while winnowing out those that are irrelevant for the risk evaluation.

TSCA requires EPA to publish the scope for any risk evaluation it will conduct. Further, TSCA requires the scope to include the hazards, exposures, conditions of use, and the potentially exposed or susceptible subpopulations2 that EPA expects to consider. To communicate and visually convey the relationships between these components, the final rule Procedures for Chemical Risk Evaluation Under the Amended Toxic Substances Control Act (40 CFR Part 702) requires a conceptual model and an analysis plan for each risk evaluation. Under EPA's risk assessment guidance, the conceptual model and the analysis plan are the outcomes of conducting problem formulation (U.S. EPA, 2014, 1998, 1992). Through the conceptual model and the analysis plan, problem formulation describes the exposure pathways, receptors and health endpoints that EPA/OPPT expects to consider in the risk evaluations (U.S. EPA, 2014, 1998, 1992). The conceptual model(s) illustrate the exposure pathways, receptor populations and effects that EPA expects to consider in the risk evaluation. The analysis plan presents the proposed approach for the risk evaluation. Hence, problem formulation has essentially the same function as scoping under the amended TSCA, thereby aligning the requirements of the scope for a TSCA risk evaluation with the components of a problem formulation in EPA guidance (U.S. EPA, 2014, 1998, 1992).

2 Potentially exposed or susceptible subpopulation means a group of individuals within the general population identified by the Agency who, due to either greater susceptibility or greater exposure, may be at greater risk than the general population of adverse health effects from exposure to a chemical substance or mixture, such as infants, children, pregnant women, workers, or the elderly (15 U.S.C. 2602 or 40 CFR Part 702.33).

With this context in mind, the systematic review activities for the TSCA risk evaluations will be guided by the results of problem formulation, as documented in the TSCA scope documents3. The systematic review principles and general processes are expected to remain essentially the same across risk evaluations. However, systematic review methods and/or approaches, including criteria, will be customized, as necessary, to meet the assessment needs of each risk evaluation. Details about the fit-for-purpose systematic review methods and/or approaches will be provided in each draft risk evaluation and its supporting documents.

3 TSCA problem formulation documents were developed for the first ten chemicals undergoing risk evaluation and refine the scope of the initial TSCA scope documents. They were published as an additional interim step prior to publication of the draft risk evaluations for the first ten chemicals.

EPA/OPPT is currently implementing systematic review methods and/or approaches in a step-wise fashion in parallel with conducting the phases of the risk evaluation. The phased approach is necessary given the statutory timeframes imposed on EPA. Each of the steps of systematic review is being published in parallel, as supplemental documents, along with steps in the risk evaluation. EPA/OPPT may consolidate the information made available through the various supplemental documents in the future.
3 INTEGRATION OF SYSTEMATIC REVIEW PRINCIPLES INTO TSCA RISK EVALUATIONS

The Agency described systematic review in the preamble to the final rule Procedures for Chemical Risk Evaluation Under the Amended Toxic Substances Control Act, 82 FR 33726 (July 20, 2017), and in the preamble to the proposed rule, 82 FR 7562 (Jan. 19, 2017). The following two paragraphs are an excerpt from the final rule.

As defined by the Institute of Medicine, systematic review "is a scientific investigation that focuses on a specific question and uses explicit, pre-specified scientific methods to identify, select, assess, and summarize the findings of similar but separate studies" (National Academy of Sciences, 2017). The goal of systematic review methods is to ensure that the review is complete, unbiased, reproducible, and transparent (Bilotta et al., 2014).

The principles of systematic review have been well developed in the context of evidence-based medicine (e.g., evaluating efficacy in clinical trials) (Higgins and Green, 2011) and are being adapted for use across a more diverse array of systematic review questions through the use of a variety of computational tools. For instance, the National Academies' National Research Council (NRC) has encouraged EPA to move towards systematic review processes to enhance the transparency of the scientific literature reviews that support chemical-specific risk assessments informing regulatory decision making (NRC, 2014).

Key elements of systematic review include:
• A clearly stated set of objectives (defining the question)
• Developing a protocol that describes the specific criteria and approaches that will be used throughout the process
• Applying the search strategy in a literature search
• Selecting the relevant papers using predefined criteria
• Assessing the quality of the studies using predefined criteria
• Analyzing and synthesizing the data using the predefined methodology
• Interpreting the results and presenting a summary of findings

TSCA requires that EPA use data and/or information (hereinafter referred to as data/information) in a manner consistent with the best available science and that EPA base decisions on the weight of the scientific evidence. To meet the TSCA science standards, EPA/OPPT will be guided by the systematic review process described in Figure 3-1. This process complements the risk evaluation process in that the data collection, data evaluation and data integration stages of the systematic review process are used to develop the exposure and hazard assessments. As risk is a function of exposure and hazard, the exposure and hazard assessments are combined to support the integrative risk characterization, which ultimately supports the risk determination.

Although not shown in Figure 3-1, iteration is a natural component of the systematic review and risk evaluation processes. Different circumstances can trigger iteration, such as failure to retrieve relevant data and information after the initial search and screening activities, which would require repeating the data collection stage of the systematic review process, or refinements to the initial search, screening and extraction strategies.
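The stages named in Figure 3-1 can be read as a simple pipeline with explicit hand-offs from data collection to data evaluation to data integration. The sketch below is an illustrative Python representation of that flow only; the stage names come from Figure 3-1, but the data structure, field names, and placeholder logic are assumptions made for illustration and are not part of EPA's protocol or tools.

from dataclasses import dataclass, field

@dataclass
class Reference:
    """One citation moving through the stages in Figure 3-1 (hypothetical structure)."""
    ref_id: str
    title: str
    on_topic: bool = False                         # set during data screening
    extracted: dict = field(default_factory=dict)  # filled during data extraction
    quality: str = "not yet rated"                 # set during data evaluation

def screen(references):
    """Data screening: carry forward only on-topic references."""
    return [r for r in references if r.on_topic]

def evaluate(references):
    """Data evaluation: each study receives an overall quality level (placeholder logic)."""
    for r in references:
        r.quality = r.quality or "not yet rated"
    return references

def integrate(references):
    """Data integration: summarize what is carried into the weight-of-evidence analysis."""
    rated = [r for r in references if r.quality != "not yet rated"]
    return f"{len(rated)} evaluated studies feed the exposure and hazard assessments"

The point of the sketch is the ordering and the hand-offs, not the internals of any stage; each stage is described in Sections 3.1 through 3.4 and in Table 3-1.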
A short description of each stage of the systematic review process is provided in Sections 3.1 through 3.4. Table 3-1 describes EPA's general expectations for the planning, execution and assessment activities related to each stage of the systematic review process. The activities are general enough to be applied to multiple data/information streams supporting the TSCA risk evaluations.

Figure 3-1. TSCA Systematic Review Process4

(Figure: stages of the TSCA risk evaluation.) Protocol Development takes place in the scoping/problem formulation phase of the TSCA risk evaluationa. It is followed, in the analysis phasea, by Data Collection (Data Search, Data Screening, Data Extractionb), Data Evaluationc, and Data Integration - Summary of Findings (Exposure and Hazard Assessments), which support the Risk Characterization.

TSCA Science Standards

Best Available Science (BAS): Science that is reliable and unbiased. Use of best available science involves the use of supporting studies conducted in accordance with sound and objective science practices, including, when available, peer reviewed science and supporting studies and data collected by accepted methods or best available methods (if the reliability of the method and the nature of the decision justifies use of the data). Additionally, EPA will consider as applicable:
• The extent to which the scientific information, technical procedures, measures, methods, protocols, methodologies, or models employed to generate the information are reasonable for, and consistent with, the intended use of the information [TSCA Section 26(h)(1)]
• The extent to which the information is relevant for the Agency's use in making a decision about a chemical substance or mixture [TSCA Section 26(h)(2)]d
• The degree of clarity and completeness with which the data, assumptions, methods, quality assurance, and analyses employed to generate the information are documented [TSCA Section 26(h)(3)]
• The extent to which the variability and uncertainty in the information or in the procedures, measures, methods, protocols, methodologies, or models are evaluated and characterized [TSCA Section 26(h)(4)]
• The extent of independent verification or peer review of the information or of the procedures, measures, methods, protocols, methodologies, or models [TSCA Section 26(h)(5)]e

Weight of the Scientific Evidence (WOE): A systematic review method, applied in a manner suited to the nature of the evidence or decision, that uses a pre-established protocol to comprehensively, objectively, transparently, and consistently identify and evaluate each stream of evidence, including strengths, limitations, and relevance of each study, and to integrate evidence as necessary and appropriate based upon strengths, limitations, and relevance.

BAS and WOE definitions can be found at 40 CFR 702.33.

Footnotes for Figure 3-1:
a TSCA requires EPA to conduct risk evaluations to determine whether a chemical substance presents an unreasonable risk of injury to health or the environment, without consideration of costs or other non-risk factors, including an unreasonable risk to a potentially exposed or susceptible subpopulation identified as relevant to the risk evaluation, under the conditions of use.
b Data extraction may occur before or after data evaluation.
c Evaluation may occur during the scoping/problem formulation phase and/or during the analysis phase of the risk evaluation.
d Data relevancy issues are considered during the Data Screening, Data Evaluation and Data Integration phases.
e Literature screening partially assesses the TSCA 26(h)(5) standard by identifying peer-reviewed publications. Most of the independent verification of study results (i.e., study replicability) will be assessed during the Data Integration step.

4 The diagram depicts the systematic review process guiding the first ten TSCA risk evaluations. It is anticipated that the same basic process will be used to guide future risk evaluations, with some potential refinements reflecting efficiencies and other adjustments adopted as EPA/OPPT gains experience in implementing systematic review methods and/or approaches to support risk evaluations within statutory deadlines (e.g., aspects of protocol development would be better defined prior to starting scoping/problem formulation).

Table 3-1. Planning, Execution and Assessment Activities Supporting the Systematic Review Process of TSCA Risk Evaluations

Data Searcha

Planning phase
• Define specific objectives for the searches.
• Develop search strategies. This includes describing all information sources to be searched, specification of search strings for each data/information source, search instructions, date range, filters, limits or other details to ensure reproducibility of the search by an independent party.

Execution phase
• Execute the search based on the approach described in the Literature Search Strategy documents.
• Store search results.
• Document the date(s) the searches were conducted.
• Document refinements to the protocol as part of the iterative process of improving the literature search strategy.
• Finalize files using a bibliographic management tool and other documentation related to the literature search protocol.

Assessment phase (Quality Assurance (QA)/Quality Control (QC))
• Describe the mechanisms for QA, including management review processes.
• Describe the mechanisms for QC, including data quality testing procedures; for example, demonstration that the search strategy retrieves a set of known relevant records.

Data Screening (Title/Abstract)a

Planning phase
• Develop/refine inclusion/exclusion criteria for the title/abstract screening.
• Develop/refine screening categories ("tags") to categorize information.
• Develop a pilot plan to test criteria for the title/abstract screening and tagging.
• Describe the strategy used to identify and resolve screening conflicts.
• If natural language processing or other electronic processing is used, describe the methodology and specify the terms to be used for electronic screening and how groups of references will be reviewed.

Execution phase
• Conduct a pilot study to test the criteria for title/abstract screening and tagging and the conflict resolution strategy. Unless major changes are made, piloting may only need to be conducted once and not after each update.
• Refine the screening and tagging criteria before application.
• Conduct title/abstract screening and tagging for the remaining references.
• Document the date(s) the screening was conducted and who conducted the screening.

Assessment phase (QA/QC)
• Describe the mechanisms for QA, including management review processes.
• Describe the mechanisms for QC, including the following:
  - Number of screeners and their technical skill background
  - Process for pilot testing the clarity of inclusion and exclusion criteria on a set of studies
  - Process for comparing results and resolving screening conflicts between screeners
Data Screening (Full Text)a

Planning phase
• Develop/refine inclusion/exclusion criteria for the full text screening.
• Develop/refine screening categories ("tags") to categorize information.
• Develop a pilot plan to test criteria for the full text screening and tagging.
• Describe the strategy used to identify and resolve screening conflicts.
• If natural language processing or other electronic processing is used, describe the methodology and specify the terms to be used for electronic screening and how groups of references will be reviewed.

Execution phase
• Conduct a pilot study to test the criteria for full text screening and tagging and the conflict resolution strategy. Unless major changes are made, piloting may only need to be conducted once and not after each update.
• Refine the screening and tagging criteria before application.
• Conduct full text screening and tagging for the remaining references.
• Document the date(s) the screening was conducted and who conducted the screening.

Assessment phase (QA/QC)
• Describe the mechanisms for QA, including management review processes.
• Describe the mechanisms for QC, including the following:
  - Number of screeners and their technical skill background
  - Process for pilot testing the clarity of inclusion and exclusion criteria on a set of studies
  - Process for comparing results and resolving screening conflicts between screeners

Data Extractiona

Planning phase
• Develop extraction templates, preferably from existing examples (e.g., graphical or tabular displays), that capture specific attributes or data elements relevant for disciplines within the risk assessment. Templates should be designed to facilitate evaluation of the data and their synthesis with minimal reference to the original reference. Data/information will need to be tracked with unique identifiers.
• Use an extraction process that ensures access to the extracted information by EPA and the public.
• Develop instructions and decision rules (e.g., what to extract/not extract under certain conditions) to be included in the template form to facilitate data extraction.
• Specify the number and expertise of reviewers involved in the data extraction process.
• Select an initial set of citations for training to promote data extraction in a consistent manner across reviewers.
• Identify tool(s) for managing extracted data and decisions (e.g., spreadsheet, database).

Execution phase
• Conduct a pilot study to test the extraction process and conflict resolution strategy. Unless major changes are made, piloting may only need to be conducted once and not after each update.
• Extract data/information using pre-defined templates.

Assessment phase (QA/QC)
• Describe the mechanisms for QA for the data extraction process, including management review processes.
• Describe the mechanisms for QC, including the following:
  - Number of data extraction staff and their technical skill background
  - Process for pilot testing the data extraction and conflict resolution

Data Evaluation

Planning phase
• Develop/refine the evaluation strategy to assess the quality of studies.
• For large databases, develop a prioritization strategy for how studies will be reviewed.
• Develop instructions and decision rules for the evaluation process.
• Specify the number and expertise of reviewers involved in the data evaluation.
• Select an initial set of citations for training to promote data evaluation in a consistent manner across reviewers.
• Identify tool(s) for managing evaluated data and decisions (e.g., spreadsheet, database). Ideally, these tools should be designed to facilitate the synthesis and integration of data in the subsequent phases of systematic review.

Execution phase
• Conduct a pilot study to test the evaluation criteria and conflict resolution strategy. Unless major changes are made, piloting may only need to be conducted once and not after each update.
• Evaluate and document the quality of each study based on the pre-defined criteria documented in the protocol.

Assessment phase (QA/QC)
• Describe the mechanisms for QA, including management review processes.
• Describe the mechanisms for QC, including the following:
  - Number of staff evaluating data/information sources and their technical skill background
  - Process for pilot testing the data evaluation process
  - Process for conflict resolution

Data Integration Using the Weight of the Scientific Evidence

Planning phase
• Develop and document the strategy for analyzing and summarizing data/information across studies within each evidence stream, including strengths, limitations and relevance of the evidence.
• Develop and document the strategy for weighing and integrating evidence across evidence streams, including strengths, limitations and relevance of the evidence.

Execution phase
• Conduct and document the analysis and synthesis of the evidence.
• Document the conclusions within each evidence stream.
• Weigh and document results across evidence streams to develop weight of evidence conclusions.
• Document any professional judgment, including underlying assumptions, used to support the risk evaluation.

Assessment phase (QA/QC)
• Specify the process for assuring the quality of the data being analyzed, synthesized and integrated.

Notes:
a EPA/OPPT uses the ECOTOX infrastructure for the data searching, screening and extraction of ecological effects data to support the TSCA risk evaluations. The planning, execution and assessment phases for the data search, screening and extraction steps are comparable to those outlined in Table 3-1 for the other data/information streams (i.e., exposure, fate, animal toxicology, in vitro, and epidemiological data).

Abbreviations: TSCA = Toxic Substances Control Act; ECOTOX = ECOTOXicology knowledgebase; EPA/OPPT = Environmental Protection Agency, Office of Pollution Prevention and Toxics; QA/QC = Quality Assurance/Quality Control; HERO = Health and Environmental Research Online

3.1 Protocol Development

Protocol development is intended to pre-specify the criteria, approaches and/or methods for data collection, data evaluation and data integration. It is important to plan the systematic review approaches and methods in advance to reduce the risk of introducing bias into the risk evaluation process.

TSCA requirements and the results of scoping/problem formulation (i.e., conceptual model(s), analysis plan) frame the specific scientific risk assessment questions to be addressed in each TSCA risk evaluation. Likewise, the statutory requirements and scoping/problem formulation inform how the data are searched, evaluated and integrated in the assessment.
The TSCA Scope and Problem Formulation documents for the first ten risk evaluations contain the analytical framework guiding the systematic review process and should be consulted to understand the context of this document.

The timeframe for development of the TSCA Scope documents has been very compressed. The first ten chemical substances were not subject to prioritization, the process through which EPA expects to collect and screen much of the relevant information about chemical substances that will be subject to the risk evaluation process. As a result, EPA had limited ability to develop a protocol document detailing the systematic review approaches and/or methods prior to the initiation of the risk evaluation process for the first ten chemical substances. For these reasons, protocol development is staged in phases while the assessment work is conducted. Figure 1-1 and Table 3-2 provide information about those components of the systematic review process released to the public and those that are in the pipeline for development (e.g., data integration). Data integration activities for the first ten TSCA risk evaluations are anticipated to occur after the TSCA Problem Formulation documents are released (Figure 1-1). EPA/OPPT will provide further details about the data integration strategy along with the publication of the draft TSCA risk evaluations.

3.2 Data Collection

3.2.1 Data Search

Data are collected under a defined literature search strategy that is developed to fit the needs of the different disciplines supporting the risk evaluation (e.g., physical/chemical properties, environmental fate, engineering processes across the full life cycle of the chemical substance, exposure, human health hazard, environmental hazard). This step includes developing strategies for searching and identifying relevant data that are published in public databases (e.g., PubMed) and other sources containing unpublished or published data. The process steps are generally described in Table 3-1, which lists the planning, execution and assessment activities supporting the data search activities for the TSCA risk evaluation process.

Table 3-2 provides web links to the Strategy for Conducting Literature Searches and Bibliography documents published in June 2017 along with each of the first ten TSCA Scope documents. EPA/OPPT's initial methods for identifying, compiling, and screening publicly available information are described in the Strategy for Conducting Literature Searches supporting each of the TSCA Scope documents for the first ten chemicals. The literature search and screening strategy already published will be used for future risk evaluations.
Table 3-2. Supplemental Documents on Systematic Review Activities Published with the TSCA Scope Documents on June 22, 2017

Chemical Name | CASRN | Docket Number | Web Link to TSCA Scope, Literature Search Strategy and Bibliography Documents
Asbestos | 1332-21-4 | EPA-HQ-OPPT-2016-0736 | Link
1-Bromopropane (1-BP) | 106-94-5 | EPA-HQ-OPPT-2016-0741 | Link
Carbon Tetrachloride (CCl4) | 56-23-5 | EPA-HQ-OPPT-2016-0733 | Link
1,4-Dioxane | 123-91-1 | EPA-HQ-OPPT-2016-0723 | Link
Cyclic Aliphatic Bromide Cluster (HBCD) | 25637-99-4; 3194-55-6; and 3194-57-8 | EPA-HQ-OPPT-2016-0735 | Link
Methylene Chloride | 75-09-2 | EPA-HQ-OPPT-2016-0742 | Link
N-Methylpyrrolidone (NMP) | 872-50-4 | EPA-HQ-OPPT-2016-0743 | Link
Perchloroethylene (PERC) | 127-18-4 | EPA-HQ-OPPT-2016-0732 | Link
Pigment Violet 29 (Anthra[2,1,9-def:6,5,10-d'e'f']diisoquinoline-1,3,8,10(2H,9H)-tetrone; PV29) | 81-33-4 | EPA-HQ-OPPT-2016-0725 | Link
Trichloroethylene (TCE) | 79-01-6 | EPA-HQ-OPPT-2016-0737 | Link

EPA/OPPT uses the infrastructure of the ECOTOXicology knowledgebase (U.S. EPA, 2018a) to identify single chemical toxicity data for aquatic and terrestrial life. It uses a comprehensive chemical-specific search of the open literature that is conducted according to Standard Operating Procedures (SOPs)5, including specific SOPs developed to fit the needs of the TSCA risk evaluations6. The search strategy is revised on a regular basis to ensure that high quality ecological effects data are retrieved to support the risk assessment needs of various EPA programs. Due to its well-established methods for gathering high quality data, ECOTOX processes and data are widely accepted and used by a variety of domestic and international organizations and researchers. The ECOTOX literature search strategy is documented in the Strategy for Conducting Literature Searches documents for each of the ten TSCA risk evaluations (Table 3-2).

5 The ECOTOX SOPs can be found at https://cfpub.epa.gov/ecotox/help.cfm?helptabs=tab4.
6 The ECOTOX SOPs for TSCA work can be found at https://cfpub.epa.gov/ecotox/blackbox/help/OPPTRADCodingGuidelinesSOP.pdf and https://cfpub.epa.gov/ecotox/blackbox/help/OPPTRADReportsSOP.pdf.

EPA/OPPT also plans to search its internal databases for data and information submitted under TSCA (e.g., unpublished industry data). EPA will consider these data in the risk evaluations where relevant, whether or not they are claimed as confidential business information (CBI). If data/information are CBI, EPA/OPPT plans to use them in a manner that protects the confidentiality of the information from public disclosure.

The results of the literature search are entered into EPA's Health and Environmental Research Online (HERO) database7, where the literature results are stored in chemical-specific pages. HERO also allows categorizing and sorting references by pre-defined topic areas. EPA/OPPT anticipates that the HERO project pages will be accessible to the public by the publication date of the draft risk evaluations.

7 HERO = Health and Environmental Research Online, https://hero.epa.gov/hero/index.cfm/content/home

EPA/OPPT plans to consider relevant data/information submitted by the public or peer reviewers. EPA/OPPT may conduct targeted supplemental searches to support the analytical approaches and/or methods in the TSCA risk evaluation (e.g., to locate specific information for exposure modeling) or to identify new data/information published after the date limits of the initial search. In addition, retracted studies may be identified during the process of developing the risk evaluations. EPA/OPPT does not plan to use retracted studies in the TSCA risk evaluations.
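Table 3-1 calls for documenting each search in enough detail that an independent party can reproduce it (information sources, search strings, date range, run dates, filters) and for a QC demonstration that the strategy retrieves a set of known relevant records. The sketch below illustrates, in Python, one way such a search record and seed-record check could be captured; the field names and all example values are assumptions made for illustration and do not represent EPA's actual templates or the HERO and ECOTOX interfaces.

from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class SearchRecord:
    """Documentation needed to reproduce one literature search (illustrative fields)."""
    source: str          # e.g., "PubMed"
    search_string: str   # the exact query submitted to the source
    date_range: str      # publication date limits applied to the search
    date_run: date       # date the search was executed
    limits: str = ""     # any filters or other constraints applied

def missing_seed_records(retrieved_ids, seed_ids):
    """QC check: known relevant ('seed') records that the search failed to retrieve."""
    return set(seed_ids) - set(retrieved_ids)

# Placeholder values for illustration only
record = SearchRecord(
    source="PubMed",
    search_string='"example chemical" AND (exposure OR toxicity)',
    date_range="through 2017-03-31",
    date_run=date(2017, 4, 3),
)
gaps = missing_seed_records({"ref001", "ref002"}, {"ref001", "ref003"})
# A non-empty 'gaps' set would prompt refinement of the search strategy.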
3.2.1.1 Summary of the Literature Search Strategy for the First Ten TSCA Risk Evaluations

EPA/OPPT conducted chemical-specific searches for data and information on: physical and chemical properties; environmental fate and transport; conditions of use; environmental and human exposures, including potentially exposed or susceptible subpopulations; and ecological and human health hazards, including potentially exposed or susceptible subpopulations.

EPA/OPPT designed its initial data search to be broad enough to capture a comprehensive set of sources containing data/information potentially relevant to the risk evaluation process. Generally, the search covered a wide range of data/information sources, including but not limited to peer-reviewed and grey literature8. When available, EPA/OPPT relied on the search strategies from recent assessments (e.g., EPA Integrated Risk Information System (IRIS) assessments) as a starting point to identify relevant references and supplemented these searches to identify relevant information published after the end date of the previous search, thereby capturing more recent literature. For human health hazards, the literature search strategy was designed to identify relevant data/information in favor of (e.g., a positive study) or against (e.g., a negative study) a given hypothesis within the context of the assessment question(s) being evaluated in the risk evaluation.

Following the initial search of data for the first ten risk evaluations, EPA/OPPT searched for data submitted to EPA under TSCA sections 4, 5, 8(e), and 8(d), as well as "for your information" (FYI) submissions, to find additional data relevant to human health and environmental hazard, exposure, fate, engineering, physical-chemical properties, and TSCA conditions of use. Searches were conducted of CBI and non-CBI databases, followed by a duplicate identification step. Many of the non-CBI data submissions were captured in the initial search published on June 22, 2017, but some were newly found and added to the pool of new references to undergo data screening.

8 Grey literature refers to sources of scientific information that are not formally published and distributed in peer-reviewed journal articles. These references are still valuable and are consulted in the TSCA risk evaluation process. Examples of grey literature are theses and dissertations, technical reports, guideline studies, conference proceedings, publicly-available industry reports, unpublished industry data, trade association resources, and government reports.

3.2.2 Data Screening

EPA/OPPT develops and applies inclusion and exclusion criteria during title/abstract and full text screening to identify information potentially relevant to the risk evaluation process. This step also classifies the references into useful categories (e.g., on-topic versus off-topic, human versus animal hazard) to facilitate the sorting of information through the systematic review process.

Below are examples of data characteristics, generally chemical-specific, that are used as indicators of relevance based on the scope of the assessments. These data characteristics are the basis for the development of inclusion and exclusion criteria for the title/abstract and full text screening.

• Data on environmental fate, transport, partitioning and degradation behavior across environmental media of interest.
• Data on environmental exposure of ecological receptors (i.e., aquatic and terrestrial organisms) to the chemical substance of interest and/or its degradation products and metabolites.
• Data on environmental exposure of human receptors (general population, consumers), including any potentially exposed or susceptible subpopulations, to the substance of interest and/or its degradation products and metabolites.
• Data on any setting or scenario resulting in releases of the chemical substance of interest into the natural or built environment (e.g., buildings, including homes or workplaces) that would expose ecological receptors (i.e., aquatic and terrestrial organisms) or human receptors (i.e., general population and potentially exposed or susceptible subpopulations).
• Quantitative estimates of worker exposures and of environmental releases from occupational settings for the chemical of interest.
• Data on human health and environmental hazards that meet minimum reporting elements (i.e., test chemical, species/organisms, effect(s), dose(s) or concentration(s), and duration).
• Data on human health hazards for potentially exposed or susceptible subpopulations.

3.2.2.1 Title/Abstract Screening

Titles and abstracts of the retrieved literature are reviewed for relevance according to inclusion and exclusion criteria. Table 3-1 describes the planning, execution and assessment activities supporting the title/abstract screening activities for the TSCA risk evaluation process. These activities are consistent with those conducted and described in the Strategy for Conducting Literature Searches documents (Table 3-2).

Systematic reviews typically describe the study eligibility criteria in the form of PECO statements or a modified framework. PECO stands for Population, Exposure, Comparator and Outcome. The approach is used to formulate explicit and detailed criteria about the characteristics a publication should have in order to be eligible for inclusion in the review (e.g., inclusion of studies reporting on the effects of chemical exposure on potentially exposed or susceptible subpopulations); a simple illustration of how such criteria drive screening decisions appears at the end of this subsection.

Each article is generally screened by two independent reviewers using specialized web-based software (i.e., DistillerSR)9. Screeners are assigned batches of references after conducting pilot testing. Screening forms are typically used to facilitate the screening process by asking a series of questions based on pre-determined inclusion and exclusion criteria. The screeners resolve conflicts by consensus or by consultation with an independent individual(s).

9 In addition to using DistillerSR, EPA/OPPT is exploring automation and machine learning tools for data screening and prioritization activities (e.g., SWIFT-Review, SWIFT-Active Screener, Dragon, DocTER). SWIFT is an acronym for "Sciome Workbench for Interactive Computer-Facilitated Text-mining".

Ecological hazard references undergo a similar screening process following the ECOTOX SOPs. Search results, screening decisions and respective tags are stored electronically in the ECOTOX Knowledgebase. Please also refer to the ECOTOX SOPs10 and the Strategy for Conducting Literature Searches documents (Table 3-2) to understand the screening process and criteria applied to the ecological hazard literature.

10 See footnote 5.
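PECO statements translate naturally into a structured screening form: each question maps to one element of the statement, and a reference is excluded as soon as it clearly fails an element. The Python sketch below illustrates that idea only; the criteria, keywords, and decision logic are placeholders invented for illustration, and actual screening is performed by human reviewers in DistillerSR against the chemical-specific criteria in the Strategy for Conducting Literature Searches and Problem Formulation documents.

from dataclasses import dataclass

@dataclass
class PECO:
    """Eligibility criteria: Population, Exposure, Comparator, Outcome (placeholder terms)."""
    populations: set
    exposures: set
    comparators: set
    outcomes: set

def screen_title_abstract(text, peco):
    """Return (include, tag) from simple keyword checks against the PECO elements.

    Keyword matching here only illustrates how pre-defined criteria drive an
    include/exclude decision and an on-topic/off-topic tag.
    """
    lowered = text.lower()
    if not any(term in lowered for term in peco.exposures):
        return False, "off-topic: chemical/exposure of interest not mentioned"
    if not any(term in lowered for term in peco.outcomes):
        return False, "off-topic: no relevant outcome reported"
    return True, "on-topic"

# Placeholder criteria for illustration only
criteria = PECO(
    populations={"humans", "rats", "mice"},
    exposures={"example chemical"},
    comparators={"unexposed", "vehicle control"},
    outcomes={"liver", "cancer", "reproductive"},
)
print(screen_title_abstract("Liver effects of example chemical in rats", criteria))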
SWIFT is an acronym for "Sciome Workbench for Interactive Computer-Facilitated Text-mining". 10 See footnote 3. 23 ------- 3.2.2.1.1 Summary of the Title/Abstract Screening Conducted for the First Ten TSCA Risk Evaluations One screener11 conducted the screening and categorization of titles and abstracts. Relevant studies were identified according to inclusion and exclusion criteria as described in the Strategy for Conducting Literature Searches documents (Table 3-2). The categorization scheme (or tagging structure) varied by scientific discipline (i.e., physical and chemical properties; environmental fate and transport; chemical use/conditions of use information; environmental exposures; human exposures, including potentially exposed or susceptible subpopulations identified by virtue of greater exposure; human health hazard, including potentially exposed or susceptible subpopulations identified by virtue of greater susceptibility; and ecological hazard). Within each data set, there were two broad categories or data tags: (1) on-topic references or (2) off-topic references. On-topic references are those that may contain data/information relevant to the risk evaluation. Off-topic references are those that do not appear to contain data or information relevant to the risk evaluation. Additional sub-categories (or sub-tags) were performed to facilitate further sorting of data/information - for example, identifying references by source type (e.g., published peer- reviewed journal article, government report); data type (e.g., primary data, review article); human health hazard (e.g., liver toxicity, cancer, reproductive toxicity); or chemical-specific and use-specific data or information. The ECOTOX process and methodologies were used to screen the ecological hazard references. The ECOTOX literature screening strategy is discussed in the Strategy for Conducting Literature Searches documents for each of the ten TSCA risk evaluations (Table 3-2). Search results, screening decisions and respective tags were stored electronically in the ECOTOX Knowledgebase. 3.2.2,2 Full Text Screening The references identified during title/abstract screening are checked for relevance at the full- text level against specific eligibility criteria (e.g., PECO statements). Since EPA/OPPT is implementing systematic review methods and/or approaches in phases, the PECO approach was adopted during full text screening for the first ten TSCA risk evaluation. Future assessments will use PECOs from the start of the screening process (i.e., title/abstract screening). The number of screeners, the process of reference assignment and conflict resolution are similar to those used for title/abstract screening. Table 3-1 describes the planning, execution and assessment activities supporting the full text screening activities for TSCA risk evaluations. 11 Systematic review guidelines typically recommend at least two screeners to review each article to minimize bias. EPA had less than 6 months to conduct data collection and screening activities for 10 chemical substances; thus, one screener was used for the title/abstract screening to meet the statutory deadline in June 2017. However, full text screening generally used two independent screeners (see Section 3.2.2.2). 24 ------- Like the title/abstract screening, the ECOTOX SOPs guide the title/abstract and full text screening of ecological hazard references. 
Please refer to the ECOTOX SOPs12 to understand the screening process and criteria that are applied for the ecological hazard literature. 3.2.2.2.1 Summary of the Full Text Screening Conducted for the First Ten TSCA Risk Evaluations The full text screening was conducted while EPA/OPPT refined the scope of the TSCA risk evaluations during problem formulation for the first ten chemical substances. PECO statements or a modified framework were used to describe the full-text inclusion and exclusion criteria for selecting relevant references. These criteria have been placed in each of the TSCA Problem Formulation documents as some criteria reflect chemical-specific issues that are better discussed in each chemical assessment. Refinements to the criteria may occur as EPA/OPPT delves into the analysis of relevant information. Each article was generally screened by two independent reviewers using specialized web-based software (i.e., DistillerSR)13. Screeners were assigned batches of references after conducing pilot testing. Screening forms facilitated the reference review process by asking a series of questions based on pre-determined eligibility criteria. DistillerSR was used to manage the work flow of the screening process and document the eligibility decisions for each reference. The screeners resolved conflicts by consensus, or consultation with an independent individual(s). As indicated in section 3.2.2.1, ecological hazard references underwent a similar screening process using the ECOTOX SOPs. 3.2.2.3 Data Extraction Data extraction is the process in which quantitative and qualitative data/information are identified from each relevant data/information source and extracted using structured forms or templates. Table 3-1 describes the planning, execution and assessment activities supporting the data extraction activities for TSCA risk evaluations. When possible, the same reviewers used for the full-text screening will be used for data extraction, as these reviewers are already familiar with the references. EPA/OPPT will use various extraction tools to meet the needs of each chemical assessment. These may include specialized web-based software (e.g., DistillerSR, HAWC14). Irrespective of whether data/information are extracted before or after evaluation, the general principle is that the extraction will occur for those sources containing relevant data/information 12 See footnote 3. 13 In addition to using DistillerSR, EPA/OPPT is exploring automation and machine learning tools for data screening and prioritization activities (e.g., SWIFT-Review, SWIFT-Active Screener, Dragon, DocTER). SWIFT is an acronym for "Sciome Workbench for Interactive computer-Facilitated Text-mining" [this is the same as footnote 6 above], 14 EPA/OPPT is exploring HAWC for extracting data supporting TSCA risk evaluations. HAWC stands for Health Assessment Workspace Collaborative. 25 ------- for the risk evaluation. EPA/OPPT is not planning to extract data/information from sources that exhibit serious flaws that would make the data unacceptable for use in the risk evaluation. When applicable and feasible, EPA/OPPT will reach out to the authors of the data/information source to obtain raw data or missing elements that would be important to support the data evaluation and data integration steps. In such cases, the request(s) for additional data/information, number of contact attempts, and responses from the authors will be documented. 
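For illustration only, the sketch below (in Python) shows one way a structured extraction record could be organized; the class and field names are hypothetical and do not represent EPA/OPPT's actual DistillerSR or HAWC extraction forms. The fields mirror the minimum reporting elements listed in the screening criteria above (test chemical, species/organism, effect, dose or concentration, and duration), plus fields for documenting author outreach as described in this section.

# Hypothetical sketch of a structured data extraction record (illustrative only).
from dataclasses import dataclass
from typing import Optional

@dataclass
class HazardExtractionRecord:
    hero_id: str                                # HERO identifier for the reference
    citation: str                               # full reference citation
    test_chemical: str                          # substance as identified in the study
    species_or_organism: str                    # test species or organism
    effect: str                                 # reported effect or endpoint
    dose_or_concentration: str                  # dose(s) or concentration(s), with units
    duration: str                               # exposure duration
    data_requested_from_authors: bool = False   # whether raw or missing data were requested
    author_contact_notes: Optional[str] = None  # contact attempts and responses, if any

# Example with invented values:
record = HazardExtractionRecord(
    hero_id="0000000",
    citation="Author A; Author B (Year). Title. Journal.",
    test_chemical="test chemical X",
    species_or_organism="Rattus norvegicus",
    effect="liver toxicity",
    dose_or_concentration="0, 10, 50, 250 mg/kg-day",
    duration="90 days",
)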
Data extraction activities for the first ten TSCA risk evaluations are anticipated to occur after the TSCA Problem Formulation documents are released (Figure 1-1).

3.3 Data Evaluation

Data evaluation is the stage where the quality of individual studies is assessed. Table 3-1 describes the planning, execution and assessment activities supporting the data evaluation activities for TSCA risk evaluations. EPA/OPPT will use the evaluation strategies, including pre-determined criteria, documented in Appendices A through H. Refinements to the evaluation strategies are likely to occur and, in such cases, any adjustments will be documented. Ideally, each data/information source will be evaluated by two reviewers, but one reviewer may be used. The reviewers will resolve conflicts by consensus or by consultation with an independent individual(s). Data evaluation activities for the first ten TSCA risk evaluations are anticipated to occur after the TSCA Problem Formulation documents are released in March 2018 (Figure 1-1).

3.4 Data Integration and Summary of Findings

Data integration is the stage where the analysis, synthesis and integration of data/information takes place by considering quality, consistency, relevancy, coherence and biological plausibility. It is in this stage that the weight of the scientific evidence approach is applied to evaluate and synthesize multiple evidence streams in order to support the chemical risk evaluation. EPA/OPPT is required by TSCA to use the weight of the scientific evidence in TSCA risk evaluations. Application of a weight of evidence analysis is an integrative and interpretive process that considers data/information both in favor of (e.g., positive study) and against (e.g., negative study) a given hypothesis within the context of the assessment question(s) being evaluated in the risk evaluation. Table 3-1 describes the planning, execution and assessment activities supporting the data integration for TSCA risk evaluations.

Within the TSCA context, the weight of the scientific evidence is defined as "a systematic review method, applied in a manner suited to the nature of the evidence or decision, that uses a pre-established protocol to comprehensively, objectively, transparently, and consistently identify and evaluate each stream of evidence, including strengths, limitations, and relevance of each 26 ------- study and to integrate evidence as necessary and appropriate based upon strengths, limitations, and relevance". 40 C.F.R. 702.33. In other words, it will involve assembling the relevant data and evaluating the data for quality and relevance, followed by synthesis and integration of the evidence to support conclusions (U.S. EPA. 2016). The significant issues, strengths, and limitations of the data and the uncertainties that require consideration will be presented, and the major points of interpretation will be highlighted. Professional judgment will be used at every step of the process and will be applied transparently, clearly documented, and, to the extent possible, will follow principles and procedures that are articulated prior to conducting the assessment (U.S. EPA. 2016).

The last step of the systematic review process is the summary of findings, in which the evidence is summarized, the approaches or methods used to weigh the evidence are discussed, and the basis for the conclusion(s), recommendation(s), and any uncertainties are fully described.
This step occurs in each of the components of the risk assessment (i.e., exposure assessment and hazard assessment) and is summarized in the risk characterization section of the TSCA risk evaluation. Data integration activities for the first ten TSCA risk evaluation are anticipated to occur after the TSCA Problem Formulation documents are released (Figure 1-1). EPA/OPPT will provide further details about the data integration strategy along with the publication of the draft TSCA risk evaluations. 4 UPDATES TO THE DATA SEARCH AND SCREENING RESULTS FOR THE FIRST TEN RISK EVALUATIONS 4.1 Initial Data Search EPA/OPPT identified additional environmental fate and exposure references that were not captured in the initial categorization of the on-topic references for the first ten risk evaluations published on June 22, 2017. Specifically, assessors identified references by checking the list of references of data sources frequently used to support EPA/OPPT's risk assessments (e.g., previous assessments cited in Table 1-1 of the TSCA Scope documents). This method, called backward reference searching (or snowballing), was not part of the initial literature search strategy. The inclusion of these additional on-topic references is not expected to change the information presented in the TSCA Scope and Problem Formulation documents. Also, EPA/OPPT anticipates targeted supplemental searches during the analysis phase (e.g., to locate specific information for exposure modeling). Backward reference searching will be included in the literature search strategy for supplemental searches. Since the gathering of the initial literature search results, EPA/OPPT identified a list of on-topic and off-topic references that have been retracted from the scientific literature. Retracted references will not be considered in the development of TSCA risk evaluations. These references are listed in the pertinent TSCA Problem Formulation documents. 27 ------- 4.2 Initial Title/Abstract Screening During the problem formulation phase, EPA/OPPT evaluated the performance of the initial title/abstract screening and tagging for the first ten risk evaluations to identify potentially misclassified on-topic and off-topic references. Misclassification was generally assessed by reviewing a small subset of references in the engineering/occupational exposure, exposure (e.g., general population, consumer exposure), environmental fate and human health hazard peer-reviewed literature. Once a misclassification was identified, EPA/OPPT initiated the process of updating the tags of the reference in HERO. There were many on-topic references identified without readily available full text through the EPA library subscriptions or open sources. EPA/OPPT conducted a second title/abstract screening to confirm relevance of the data source and prioritize the decision of purchasing the full text in the case that the data source remained relevant after making refinements to the TSCA scope as the result from problem formulation. This ensured that EPA/OPPT would purchase the most relevant references for the risk evaluations. Also, assessors questioned the usefulness of some on-topic references after closer inspection of the bibliographic citations. For instance, EPA/OPPT initially included a small subset of references reporting on the therapeutic or ameliorative properties of different drugs in carbon tetrachloride-treated animals. 
The references were re-classified as off-topic after updating the eligibility criteria and conducting a second title/abstract screening with the assistance of machine learning for literature prioritization (i.e., DocTER). An exploratory exercise was conducted to identify on-topic references that were mischaracterized as off-topic references within the peer-reviewed human health hazard literature. Some on-topic references were identified using SWIFT-Review, but additional work is needed to further optimize the method. The second title/abstract screening for some of the references (see paragraph above) helped identify additional off-topic references that were originally tagged as on-topic. Based on performance checks, it is anticipated that very few on- topic references were misclassified as off-topic. 28 ------- 5 REFERENCES Note: This list contains the references cited in sections 1 through 3. References supporting the various evaluation strategies are listed in their respective appendices. 1. Bilotta, GSM, A. M. Boyd, I.,an. (2014). On the use of systematic reviews to inform environmental policies. Environ Sci Pol. 42: 67-77. http://dx.doi.Org/10.1016/i.envsci.2014.05.010 https://www.sciencedirect.com/science/article/pii/S14629011140011427via%3Dihub. 2. Council, CtRtlPBoESTDoELSNR. (2014). Review of EPA's integrated risk information system (IRIS) process. Washington, D.C.: National Academies Press (US), http://dx.doi.org/10.17226/18764. 3. Higgins, JG, S. (2011). Cochrane handbook for systematic reviews of interventions. Version 5.1.0: The Cochrane Collaboration, 2011. http://handbook.cochrane.org. 4. National Academy of Sciences, National Academy of Engineering,, Institute of Medicine,. (2017). Application of systematic review methods in an overall strategy for evaluating low-dose toxicity from endocrine active chemicals. In Consensus Study Report. Washington, D.C.: The National Academies Press, http://dx.doi.org/10.17226/24758 https://www.nap.edu/catalog/24758/application-of-svstematic-review-methods-in-an-overall- strategy-for-evaluating-low-dose-toxicitv-from-endocrine-active-chemicals. 5. U.S. EPA (U.S. Environmental Protection Agency). (1992). Guidelines for exposure assessment. (EPA/600/Z-92/001). Washington, DC: U.S. Environmental Protection Agency, Risk Assessment Forum. http://cfpub.epa.gov/ncea/cfm/recordisplay.cfm?deid=15263. 6. U.S. EPA. (1998). Guidelines for neurotoxicity risk assessment [EPA Report] (pp. 1-89). (EPA/630/R- 95/001F). Washington, DC: U.S. Environmental Protection Agency, Risk Assessment Forum. http://www.epa.gov/risk/guidelines-neurotoxicity-risk-assessment. 7. U.S. EPA. (2014). Framework for human health risk assessment to inform decision making. Final [EPA Report], (EPA/100/R-14/001). Washington, DC: U.S. Environmental Protection, Risk Assessment Forum, https://www.epa.gov/risk/framework-human-health-risk-assessment-inform-decision- making. 8. U.S. EPA. (2016). Weight of evidence in ecological assessment [EPA Report], (EPA100R16001). Washington, DC: Office of the Science Advisor. https://cfpub.epa.gov/si/si public record report.cfm?dirEntryld=335523. 9. U.S. EPA. (2018). ECOTOX Knowledgebase. https://hero.epa.gov/heronet/index.cfm/reference/download/reference id/4263024. 29 ------- APPENDIX A: STRATEGY FOR ASSESSING THE QUALITY OF DATA/INFORMATION SUPPORTING TSCA RISK EVALUATIONS The strategies for assessing the quality of data/information sources15 use a structured framework with predefined criteria for each type of data/information source. 
EPA/OPPT developed a numerical scoring system to inform the characterization of the data/information sources during the data integration phase. The goal is to provide transparency and consistency to the evaluation process along with creating evaluation strategies that meet the TSCA science standards for various data/information streams. Further details about the data integration strategy will be provided with the publication of the draft TSCA risk evaluations, including how the scores will be considered. In this document, the term data/information source is used in a broad way to capture the heterogeneity of data/information sources that are used in the TSCA risk evaluations. The data/information are intended to understand the hazards, exposures, conditions of use, and the potentially exposed or susceptible subpopulations as required by the amended TSCA. Thus, EPA/OPPT has developed evaluation strategies for various data/information streams: • Physical-chemical properties (Appendix B); • Environmental fate (Appendix C); • Occupational exposure and release data (Appendix D) • Exposures to general population and consumers as well as environmental exposures (Appendix E); • Ecological hazard studies (Appendix F); • Animal toxicity and in vitro toxicity (Appendix G); • Epidemiological studies (Appendix H) The process of developing the strategies involved reviewing various evaluation tools/frameworks and documents as well as getting input from scientists based on their expert knowledge about evaluating various data/information sources for risk assessment purposes. Criteria and/or evaluation tools/frameworks that were consulted during the development phase of the evaluation strategies were the following: • Biomonitoring, Environmental Epidemiology, and Short-lived Chemicals (BEES-C) instrument (Lakind et al.. 2014) • Criteria used in EPA's ECOTOXicology knowledgebase (U.S. EPA. 2018a) • Criteria for reporting and evaluating ecotoxicity data(CRED) (Moermond et al.. 2016b) • Systematic review practices in EPA's Integrated Risk Information System (IRIS) (U.S. EPA. 2018b) • EPA's Guidelines for Exposure Assessment (U.S. EPA. 1992) 15 The term data/information source is used in this document in a broad way to capture the heterogeneity of data/information in TSCA risk evaluations (e.g., experimental studies, data sets, published models, completed assessments, release data). 30 ------- • EPA's Summary of General Assessment Factors for Evaluating the Quality of Scientific and technical information (U.S. EPA. 2003b) • EPA's Exposure Factors Handbook (U.S. EPA. 2011b) • Handbook for Conducting a Literature-based Health Assessment Using OHAT Approach for Systematic Review and Evidence Integration (NTP. 2015a) • NAS report on Human Biomonitoring for Environmental Chemicals (NRC. 2006) • Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement (Von Elm et al.. 2008) • ToxRTool (Toxicological data Reliability Assessment Tool) developed by the European Commission (EC. 2018) • Various OECD guidance document on exposure, environmental fate and modeling data (see appendices more information) (EC. 2018; OECD. 2017; Cooper et al.. 2016; ECHA. 2016; Lynch et al.. 2016; Moermond et al.. 2016a; Moermond et al.. 2016b; Samuel et al.. 2016; NTP. 2015a. b; Hooiimans et al.. 2014; Koustas et al.. 2014; Lakind et al.. 2014; NRC. 2014; OECD. 2014; Kushman et al.. 2013; Hartling etal.. 2012; ECHA. 2011a. c; U.S. EPA. 2011a. b; Hooiimans et al.. 2010; U.S. EPA. 2009; Von Elmetal.. 2008; OECD. 
2007; Barret al.. 2006; FTC. 2006; NRC. 2006; U.S. EPA. 2006; ATSDR. 2005; OECD. 2004. 2003; U.S. EPA. 2003a. b, c; Bower. 1999; OECD. 1998. 1997. 1995; U.S. EPA. 1992; NRC. 1991) The general structure of the TSCA evaluation strategies is composed of evaluation domains, metrics and criteria. Evaluation domains represent general categories of attributes that are evaluated in each data/information source (e.g., test substance, test conditions, reliability, representativeness). Each domain contains a unique set of metrics, or sub-categories of attributes, intended to assess an aspect of the methodological conduct of the data/information source. Each metric specifies criteria expressing the relevant elements or conditions for assessing confidence that, along with professional judgement, will guide the identification of study strengths and limitations/deficiencies. EPA/OPPT plans to pilot the evaluation strategies for optimization purposes. Reporting quality is an important aspect of a study that needs to be considered in the evaluation process. The challenge, in many cases, is to distinguish a deficit in reporting from a problem in the underlying methodological quality of the data/information source. The TSCA evaluation strategies incorporate reporting criteria within the existing domains rather than adding a separate reporting domain as recommended in some evaluation tools/frameworks. Since reporting contributes to the evaluation of each facet of the data source, EPA/OPPT assesses reporting and methodological quality simultaneously with the idea of untangling reporting from study conduct while the reviewer is assessing a particular metric for each domain. Developing a reporting checklist, guidance document or a separate reporting quality domain may be possible in the near future as EPA/OPPT uses and optimizes the evaluation strategies. Data/information sources should also be evaluated for their relevance or appropriateness to support the risk evaluation. Specifically, data/information sources should support the 31 ------- assessment questions, analytical approaches, methods, models and considerations that are laid out in the analysis plan of the TSCA Scope documents16. EPA/OPPT uses a tiered approach to check for relevance starting at the data search stage and continuing during the title/abstract and full text screening and evaluation and integration stages. By design, the TSCA systematic review process uses a fit-for-purpose literature search and relevance-driven eligibility criteria to end up evaluating the most relevant data/information sources for the TSCA risk evaluation. The reviewers also check for relevance while assessing the quality of the data/information source and are asked to document17 any relevancy issues during the evaluation process. Refer to section 3.2.2 for data attributes that are included in the eligibility criteria to check for relevance. The TSCA evaluation strategies in some cases refer to study guidelines along with professional judgement as a helpful guidance in determining the adequacy or appropriateness of certain study designs or analytical methods. This should not be construed to imply that non-guideline studies have lower confidence than guideline or Good Laboratory Practice (GLP) studies. 
EPA/OPPT will consider any and all available, relevant data and information that conform to the TSCA science standards when developing the risk evaluations irrespective of whether they were conducted in accordance with standardized methods (e.g., OECD test guidelines or GLP standards). Some data sources may be evaluated under different evaluation strategies. For instance, exposure assessors may evaluate an epidemiological study for estimating exposure via direct measurements or modeling. In addition, a human health hazard assessor may evaluate the same study for hazards and effects in the human population related to the exposure of a particular chemical substance. Although this may be cumbersome, EPA/OPPT's approach is justifiable since the data source is supporting different assessment questions. EPA/OPPT recognizes that this approach may be refined in the future to adopt efficiencies, if lessons learned indicate that it needs to be changed. EPA/OPPT will consider data and information from alternative test methods and strategies (or new approach methodologies or NAMs), as applicable and available, to support TSCA risk evaluations. This is consistent with EPA/OPPT's Strategic Plan to Promote the Development and Implementation of Alternative Test Methods (Draft) to reduce, refine or replace vertebrate animal testing (U.S. EPA. 2018c). Since these NAMs may support the analyses for the exposure and hazard assessments, the data/information quality criteria may need to be optimized or new criteria may need to be developed as part of evaluating and integrating NAMs in the TSCA risk evaluation process. 16 Refer to the TSCA Problem Formulation documents to obtain refined analysis plans for the first ten chemical assessments. 17 Relevancy issues will be documented in the reviewer's comments. 32 ------- A.l Evaluation Method Based on the strengths, limitations, and deficiencies of each data/information source, the reviewer assigns a confidence level score of 1 (high confidence), 2 (medium confidence), 3 (low confidence) or 4 (unacceptable) for each individual metric that is evaluating a particular aspect of the methodological conduct of the data/information source. Although many metrics have criteria for all four bins (i.e., High, Medium, Low, and Unacceptable), there are some metrics with dichotomous or trichotomous criteria to fit better the nature of the criteria. The confidence levels and corresponding scores at the metric level are defined as follows: • High: No notable deficiencies or concerns are identified in the domain metric that are likely to influence results [score of 1]. • Medium: Minor uncertainties or limitations are noted in the domain metric that are unlikely to have a substantial impact on results [score of 2]. • Low: Deficiencies or concerns are noted in the domain metric that are likely to have a substantial impact on results [score of 3]. • Unacceptable: Serious flaws are noted in the domain metric that consequently make the data/information source unusable, [score of 4]. • Not rated/applicable: Rating of this metric is not applicable to the data/information source being evaluated [no score]. Not rated/applicable will also be used in cases in which studies cite a literature source for their test methodology instead of providing detailed descriptions. In these circumstances, EPA will score the metric as Not rated/not applicable and capture it in the reviewer's notes. 
If the data/information source is not classified as "unacceptable" in the initial review, the cited literature source will be reviewed during a subsequent evaluation step and the metric will be rated at that time.

A numerical scoring method is used to convert the confidence level for each metric into the overall quality level for the data/information source. The overall study score is equated to an overall quality level (High, Medium, or Low) using the level definitions and scoring scale shown in Table A-1. The scoring scale was obtained by calculating the difference between the highest possible score of 3 and the lowest possible score of 1 (i.e., 3 - 1 = 2) and dividing it into three equal parts (2 ÷ 3 ≈ 0.67). This results in a range of approximately 0.7 for each overall data quality level, which was used to estimate the transition points (cut-off values) in the scale between High and Medium scores, and between Medium and Low scores. These transition points between 1 and 3 were calculated as follows:

• Cut-off value between High and Medium: 1 + 0.67 = 1.67, rounded to 1.7 (scores lower than 1.7 will be assigned an overall quality level of High)
• Cut-off value between Medium and Low: 1.67 + 0.67 = 2.34, rounded to 2.3 (scores of 1.7 or higher and lower than 2.3 will be assigned an overall quality level of Medium)

A study is disqualified from further consideration if the confidence level of one or more metrics is rated as Unacceptable [score of 4]. EPA/OPPT plans to use data with an overall quality level of High, Medium, or Low confidence to quantitatively or qualitatively support the risk evaluations, but does not plan to use data rated as Unacceptable. Data or information from Unacceptable 33 ------- studies might still be useful qualitatively, and any such use will be decided on a case-by-case basis.

Table A-1. Definition of Overall Quality Levels and Corresponding Quality Scores
Overall Quality Level | Definition | Overall Quality Score
High | No notable deficiencies or concerns are identified and the data therefore could be used in the assessment with a high degree of confidence. | ≥ 1 and < 1.7
Medium | Possible deficiencies or concerns are noted and the data therefore could be used in the assessment with a medium degree of confidence. | ≥ 1.7 and < 2.3
Low | Deficiencies or concerns are noted and the data therefore could be used in the assessment with a low degree of confidence. | ≥ 2.3 and ≤ 3
Unacceptable | Serious flaw(s) are identified and therefore the data cannot be used for the assessment. | 4

After the overall score is used to determine an overall quality level, professional judgment may be used to adjust the quality level obtained by the weighted score calculation. The reviewer must have a compelling reason to invoke the adjustment of the overall score, and written justification must be provided. This approach has been used in other established tools such as the ToxRTool (Toxicological data Reliability Assessment Tool) developed by the European Commission (https://eurl-ecvam.jrc.ec.europa.eu/about-ecvam/archive-publications/toxrtool). Domain definitions, evaluation metrics, and details about the numerical scoring method can be found in the appendices for each data/information stream (Appendices B to H).

A.2 Documentation and Instructions for Reviewers

Data evaluation is conducted in a tool (e.g., Excel, DistillerSR) that tracks and records the evaluation for each data/information source.
The following basic information will be generally recorded for each data/information source that is reviewed. Table A-2. Documentation Template for Reviewer and Data/Information Source Reviewer Information: Name: Affiliation: Qualifications (area of expertise): Date of Review: Data/Information Source: Reference citation: HERO ID: HERO Link: Study or Data Type (if publication reports multiple studies or data types): 34 ------- A confidence level is assigned for each relevant metric within each domain by following the confidence level specifications provided in section A.l, along with professional judgment, to identify study strengths and limitations. The assigned confidence level is indicated by placing a score between 1 and 4 in the column labeled Selected Score. In some cases, reference to study guidelines (in addition to professional judgement) may be helpful in determining the adequacy or appropriateness of certain study designs or analytical methods. This should not be construed to imply that non-guideline studies necessarily have lower confidence than guideline studies. If a publication reports more than one study or endpoint, each study and, as needed, each endpoint will be evaluated separately. Some metrics may not be applicable to all study types. If a metric is not applicable to the study under review, NR (not rated) will be placed in the Selected Score column for this metric. After scoring of the individual metrics within each domain, the overall study score is calculated and assigned to the corresponding bin (High, Medium, Low, or Unacceptable). In the Reviewer's Comments field, the reviewer documents concerns, uncertainties, strengths, limitations, deficiencies and any additional comments observed for each metric, when necessary. For instance, EPA may not always provide a comment for a metric that has been categorized as High. However, a reviewer is strongly encouraged to provide a comment for metrics categorized as Medium or Low to improve transparency. The reviewer also records any relevance issues with the data/information source (e.g., study is not useful to answer assessment questions). A.3 Important Caveats The following is a discussion of important caveats for the data quality evaluation method that EPA/OPPT intends to use in the TSCA risk evaluations: • Although specifications for the data quality evaluation metrics have been developed, professional judgment is required to assess the metrics. • Data evaluation is a qualitative assessment of confidence in a study or data set. A scoring system is being applied to ascertain a qualitative rating in order to provide consistency and transparency to the evaluation process. Scores will be used for the purpose of assigning the confidence level rating of High, Medium, Low, or Unacceptable, and inform the characterization of data/information sources during the data integration phase. The system is not intended to imply precision and/or accuracy of the scoring results. • Every study or data set is unique and therefore the individual metrics and domains may have various degrees of importance (e.g., more or less important). The weighting approach for some of the strategies may need to be adjusted as EPA/OPPT tests the evaluation method with different types of studies. • The metrics developed are intended to be indicators of data quality. They were selected because they are generally considered common and important for a broad range of 35 ------- studies. Other metrics not listed may also be important and added if necessary. 
Also, there is the possibility of deviating from the calculated overall confidence level score in case the metric criteria are unable to capture professional judgement. A reviewer must provide a justification for the score adjustment to ensure transparency for the decision. A.4 References 1. ATSDR. (2005). Public health assessment guidance manual (Update). Atlanta, GA: U.S. Department of Health and Human Services, Public Health Service. http://www.atsdr.cdc.gov/hac/PHAManual/toc.html. 2. Barr, DBT, K. Curwin, B. Landsittel, D. Raymer, J. Lu, C. Donnelly, K. C. Acquavella, J. (2006). Biomonitoring of exposure in farmworker studies [Review], Environ Health Perspect. 114(6): 936- 942. 3. Bower. NW. (1999). Environmental Chemical Analysis (Kebbekus, B. B.; Mitra, S.). J Chem Educ. 76(11): 1489. 4. Cooper. GL. R. Agerstrand. M. Glenn. B. Kraft. A. Luke. A. Ratcliffe. J. (2016). Study sensitivity: Evaluating the ability to detect effects in systematic reviews of chemical exposures. Environ Int. 92- 93: 605-610. http://dx.doi.Org/10.1016/i.envint.2016.03.017. 5. EC (2018). ToxRTool - Toxicological data Reliability assessment Tool. https://hero.epa.gov/heronet/index.cfm/reference/download/reference id/4262819. 6. ECHA. (2011a). Guidance on information requirements and chemical safety assessment. (ECHA- 2011-G-13-EN). https://hero.epa.gov/heronet/index.cfm/reference/download/reference id/4262842. 7. ECHA. (2011b). Guidance on information requirements and chemical safety assessment. Chapter R.4: Evaluation of available information. (ECHA-2011-G-13-EN). Helsinki, Finland. https://echa.europa.eu/documents/10162/13643/information requirements r4 en.pdf. 8. ECHA. (2016). Practical guide. How to use and report (Q)SARs. Version 3.1. July 2016. https://hero.epa.gov/heronet/index.cfm/reference/download/reference id/4262860. 9. FTC. (2006). Standards and Guidelines for Statistical Surveys. Washington, DC: Federal Trade Commission, Office of Management and Budget. https://www.ftc.gov/system/files/attachments/data-qualitv- act/standards and guidelines for statistical surveys - omb - sept 2006.pdf. 10. Hartling, LH, M. Milne, A. Vandermeer, B. Santaguida, P. L. Ansari, M. Tsertsvadze, A. Hempel, S. Shekelle. P. Drvden. D. M. (2012). Validity and inter-rater reliability testing of quality assessment instrumentsalidity and inter-rater reliability testing of quality assessment instruments. (AHRQ Publication No. 12-EHC039-EF). Rockville, MD: Agency for Healthcare Research and Quality. https://hero.epa.gov/heronet/index.cfm/reference/download/reference id/4262864. 11. Hooijmans, CDV, R. Leenaars, M. Ritskes-Hoitinga, M. (2010). The Gold Standard Publication Checklist (GSPC) for improved design, reporting and scientific quality of animal studies GSPC versus ARRIVE guidelines. http://dx.doi.org/10.1258/la.2010.01013Q. 12. Hooijmans, CRR, M. M. De Vries, R. B. M. Leenaars, M. Ritskes-Hoitinga, M. Langendam, M. W. (2014). SYRCLE's risk of bias tool for animal studies. BMC Medical Research Methodology. 14(1): 43. http://dx.doi.org/10.1186/1471-2288-14-43. 13. Koustas, EL, J. Sutton, P. Johnson, P. I. Atchley, D. S. Sen, S. Robinson, K. A. Axelrad, D. A. Woodruff, T. J. (2014). The Navigation Guide - Evidence-based medicine meets environmental health: Systematic review of nonhuman evidence for PFOA effects on fetal growth [Review], Environ Health Perspect. 122(10): 1015-1027. http://dx.doi.org/10.1289/ehp.1307177: 36 ------- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4181920/pdf/ehp.1307177.pdf. 14. Kushman, MEK, A. D. 
Guyton, K. Z. Chiu, W. A. Makris, S. L. Rusyn, I. (2013). A systematic approach for identifying and presenting mechanistic evidence in human health assessments. Regul Toxicol Pharmacol. 67(2): 266-277. http://dx.doi.Org/10.1016/i.vrtph.2013.08.005: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3818152/pdf/nihms516764.pdf. 15. Lakind, JSS, J. Goodman, M. Barr, D. B. Fuerst, P. Albertini, R. J. Arbuckle, T. Schoeters, G. Tan, Y. Teeguarden, J. Tornero-Velez, R. Weisel, C. P. (2014). A proposal for assessing study quality: Biomonitoring, Environmental Epidemiology, and Short-lived Chemicals (BEES-C) instrument. Environ Int. 73: 195-207. http://dx.doi.Org/10.1016/i.envint.2014.07.011: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4310547/pdf/nihms-656623.pdf. 16. Lynch, HNG, J. E. Tabony, J. A. Rhomberg, L. R. (2016). Systematic comparison of study quality criteria. Regul Toxicol Pharmacol. 76: 187-198. https://hero.epa.gov/heronet/index.cfm/reference/download/reference id/4262904. 17. Moermond, CB, A. Breton, R. Junghans, M. Laskowski, R. Solomon, K. Zahner, H. (2016a). Assessing the reliability of ecotoxicological studies: An overview of current needs and approaches. Integr Environ Assess Manag. 13: 1-12. http://dx.doi.org/10.1002/ieam.1870: http://onlinelibrarv.wilev.com/store/10.1002/ieam.l870/asset/ieaml870.pdf?v=l&t=ierdoypz&s=e e96db9e589f470debl0651cdbl460d9ada93486. 18. Moermond, CTK, R. Korkaric, M. Agerstrand, M. (2016b). CRED: Criteria for reporting and evaluating ecotoxicity data. Environ Toxicol Chem. 35(5): 1297-1309. http://dx.doi.org/10.1002/etc.3259. 19. NRC. (1991). Environmental Epidemiology, Volume 1: Public Health and Hazardous Wastes. Washington, DC: The National Academies Press. https://hero.epa.gov/heronet/index.cfm/reference/download/reference id/4262908. 20. NRC. (2006). Human biomonitoring for environmental chemicals. Washington, D.C.: The National Academies Press, http://www.nap.edu/catalog.php7record id=11700. 21. NRC. (2014). Review of EPA's Integrated Risk Information System (IRIS) process. Washington, DC: The National Academies Press, http://www.nap.edu/catalog.php7record id=18764. 22. NTP. (2015a). Handbook for conducting a literature-based health assessment using OHAT approach for systematic review and evidence integration. U.S. Dept. of Health and Human Services, National Toxicology Program, http://ntp.niehs.nih.gov/pubhealth/hat/noms/index-2.html. 23. NTP. (2015b). OHAT risk of bias rating tool for human and animal studies. U.S. Dept. of Health and Human Services, National Toxicology Program. https://ntp.niehs.nih.gov/ntp/ohat/pubs/riskofbiastool 508.pdf. 24. OECD. (1995). Detailed review paper on biodegradability testing . Environment monograph No 98. OECD series on the Test Guidelines Programme. Number 2. (OCDE/GD(95)43). Paris, France: OECD Publishing, https://www.oecd-ilibrary.org/docserver/9789264078529-en.pdf. 25. OECD. (1997). Guidance document on direct phototransformation of chemical in water. OECD Environmental Health and Safety Publications Series on Testing and Assessment. No. 7. (OCDE/GD(97)21). Paris, France: OECD Publishing, https://www.oecd- ilibrarv.org/docserver/978926407800Q-en.pdf. 26. OECD. (1998). Detailed review paper on aquatic testing methods for pesticides and industrial chemicals. Part 1: Report. OECD Series on testing and assessment. No. 11. (ENV/MC/CHEM(98)19/PART1). Paris, France: OECD Publishing, https://www.oecd- ilibrary.org/docserver/9789264078291-en.pdf. 27. OECD. (2003). 
Guidance document on reporting summary information on environmental, occupational and consumer exposure: OECD Environment, Health and Safety Publications Series on Testing and Assessment no. 42. (ENV/JM/MONO(2003)16). France: Environment Directorate; Joint Meeting of the Chemicals Committee and the Working Party on Chemicals, Pesticides and 37 ------- Biotechnology, http://www.oecd- ilibrarv.org/docserver/download/9750421e. pdf?expires=1511217696&id=id&accname=guest&chec ksum=F6F9CD530DBACFlFA06C5A627E00177C. 28. OECD. (2004). Guidance document on the use of multimedia models for estimating overall environmental persistance and long-range transport. OECD series on testing and assessment No. 45. (ENV/JM/MONO(2004)5). Joint meeting of the chemicals committee and the working party on chemicals, pesticides and biotechnology, https://www.oecd-ilibrary.org/docserver/9789264079137- en.pdf. 29. OECD. (2007). Guidance Document on the Validation of (Quantitative) Structure-Activity Relationship [(Q)SAR] Models. OECD Environment Health and Safety Publications. Series on Testing and Assessment No. 69. (ENV/JM/MONO(2007)2). Paris, France: OECD Publishing. https://www.oecd-ilibrary.org/docserver/9789264085442- en.pdf?expires=1525456995&id=id&accname=guest&checksum=75D4C7E1434FB7B79201CB055DD 772FE. 30. OECD. (2014). Guidance Document for Describing Non-Guideline In Vitro Test Methods. In OECD Series on Testing and Assessment. (No. 211). http://www.oecd.org/officialdocuments/publicdisplavdocumentpdf/?cote=ENV/JM/MONO(2Q14)35 &doclanguage=en. 31. OECD. (2017). Guidance on Grouping of Chemicals, Second Edition: OECD Publishing. http://dx.doi.org/10.1787/9789264274679-en. 32. Samuel. GOH. S. Wright. R. A. Lalu. M. M. Patlewicz. G. Becker. R. A. Degeorge. G. L. Fergusson. D. Hartung. T. Lewis. R. J. Stephens. M. L. (2016). Guidance on assessing the methodological and reporting quality of toxicologically relevant studies: A scoping review. Environ Int. 92-93: 630-646. https://hero.epa.gov/heronet/index.cfm/reference/download/reference id/4262966. 33. U.S. EPA (U.S. Environmental Protection Agency). (1992). Guidelines for exposure assessment. (EPA/600/Z-92/001). Washington, DC: U.S. Environmental Protection Agency, Risk Assessment Forum. http://cfpub.epa.gov/ncea/cfm/recordisplay.cfm?deid=15263. 34. U.S. EPA. (2003a). Occurrence estimation methodology and occurrence findings report of the six- year review of existing national primary drinking water regulations [EPA Report], (EPA-815/R-03- 006). Washington, DC. http://water.epa.gov/lawsregs/rulesregs/regulatingcontaminants/sixyearreview/first review/uploa d/support 6yr occurancemethods final.pdf. 35. U.S. EPA. (2003b). A summary of general assessment factors for evaluating the quality of scientific and technical information [EPA Report], (EPA/100/B-03/001). Washington, DC: U.S. Environmental Protection Agency, Office of Research and Development, http://www2.epa.gov/osa/summary- general-assessment-factors-evaluating-quality-scientific-and-technical-information. 36. U.S. EPA. (2003c). Survey Management Handbook. (EPA 260-B-03-003). Washington, DC: Office of Information Analysis and Access, U.S. EPA. https://nepis.epa.gov/Exe/tiff2png.cgi/P1005GNB.PNG?- r+75+- g+7+D%3A%5CZYFILES%5CINDEX%20DATA%5C00THRU05%5CTIFF%5C00001406%5CP1005GNB.TIF. 37. U.S. EPA. (2006). Approaches for the application of physiologically based pharmacokinetic (PBPK) models and supporting data in risk assessment (Final Report) [EPA Report] (pp. 1-123). (EPA/600/R- 05/043F). Washington, DC: U.S. 
Environmental Protection Agency, Office of Research and Development, National Center for Environmental Assessment. http://cfpub.epa.gov/ncea/cfm/recordisplay.cfm?deid=157668. 38. U.S. EPA. (2009). Guidance on the Development, Evaluation, and Application of Environmental Models. (EPA/100/K-09/003). Washington, DC: Office of the Science Advisor. https://hero.epa.gov/heronet/index.cfm/reference/download/reference id/4262976. 38 ------- 39. U.S. EPA. (2011a). Exposure Factors Handbook. (EPA/600R-090052F). Washington, DC: U.S. Environmental Protection Agency, National Center for Environmental Assessment, Office of Research and Development. http://cfpub.epa.gov/ncea/risk/recordisplay.cfm?deid=236252. 40. U.S. EPA. (2011b). Exposure factors handbook: 2011 edition (final) [EPA Report], (EPA/600/R- 090/052F). Washington, DC: U.S. Environmental Protection Agency, Office of Research and Development, National Center for Environmental Assessment. http://cfpub.epa.gov/ncea/cfm/recordisplay.cfm?deid=236252. 41. U.S. EPA. (2018a). ECOTOX Knowledgebase. https://hero.epa.gov/heronet/index.cfm/reference/download/reference id/4263024. 42. U.S. EPA. (2018b). Integrated risk information system (IRIS) [Database], Washington, DC: U.S. Environmental Protection Agency, Integrated Risk Information System. Retrieved from http://www.epa.gov/iris/ 43. U.S. EPA. (2018c). Strategic plan to promote the development and implementation of alternative test methods (Draft). Washington, D.C.: Office of Chemical Safety and Pollution Prevention. https://www.regulations.gov/document?D=EPA-HQ-QPPT-2017-0559-0584. 44. Von Elm, EA, D. G. Egger, M. Pocock, S. J. Ggtzsche, P. C. Vandenbroucke, J. P. (2008). The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. J Clin Epidemiol. 61(4): 344-349. https://hero.epa.gov/heronet/index.cfm/reference/download/reference id/4263036. 39 ------- APPENDIX B: DATA QUALITY CRITERIA FOR PHYSICAL/CHEMICAL PROPERTY DATA Table B-l describes the general approach that EPA/OPPT uses to assess the quality of physical- chemical property data. Table B-l. Evaluation Metrics and Ratings for Physical-Chemical Property Data Domain/Metric Description/ Definition Ratings and Criteria Representativeness The information or data reflects the data and chemical substance type. High: Data are measured for the subject chemical substance. Medium: Data are measured for a structural analog of the subject chemical substance. Low: Data are estimated (modeled) for the subject chemical substance. Not rated: Rating of this factor is not applicable to this kind of information. Appropriateness The information or data reflects anticipated results based on chemical structural features or behaviors. High: Measured data are consistent with the subject chemical substance structural features (e.g., presence of certain functional groups). Medium: Data measured for a structural analog of the subject chemical substance or estimated (modeled) for the subject chemical substance are consistent with what is expected for the subject chemical substance structural features or behaviors. Low: Data measured for a structural analog of the subject chemical substance or estimated (modeled) for the subject chemical substance are not consistent with the subject chemical substance structural features or behaviors, or the structural features or behaviors of the subject chemical substance are uncertain. 
Unacceptable: Measured data for a structural analog of the subject chemical substance are not appropriate because the analog is not appropriate (e.g., analog is a neutral molecule and the subject chemical substance is a salt). Estimated (modeled) data for the subject chemical substance are not appropriate because the estimation tool is not appropriate (e.g., estimation tool is not able to estimate class 2 and polymeric substances). Not rated: Rating of this factor is not applicable to this kind of information. 40 ------- Domain/Metric Description/Definition Ratings and Criteria Evaluation/Review The information or data reported has reliable review. High: The information or data is from a recognized data collection/repository where data are peer-reviewed by experts in the field, are broadly available to the public for review and use, and include references to the original sources. Medium: From a source that is not described as High above but is known. Low: From a source that is uncertain (unknown primary source). Not rated: Rating of this factor is not applicable to this kind of information. Reliability/Unbiased (Method Objectivity) The method for producing the data/information is not biased towards a particular product or outcome. High: Methodology for producing the information is designed to answer a specific question, and the methodology's objective is clear. Medium: Method bias appears unlikely. Low: Method bias appears likely or is highly uncertain. Unacceptable: Method bias is so severe as to be unacceptable. Not rated: Rating of this factor is not applicable to this kind of information. Reliability/Analytic Method The information or data reported is from a reliable method. High: Data are obtained by accepted standard analytic methods. Medium: Analytic method is non-standard but is expected to be appropriate. Low: From a source that is uncertain. Analytic method is not known. Unacceptable: Analytic method is not appropriate. Not rated: Rating of this factor is not applicable to this kind of information. 41 -------

APPENDIX C: DATA QUALITY CRITERIA FOR FATE DATA

C.1 Types of Fate Data Sources

The quality of fate data, which includes mass transport, chemical partitioning, and chemical or biological transformations in soil, surface waters, groundwater, and air (e.g., biodegradation, hydrolysis, photolysis), will be evaluated for four different data sources: experimental data, field studies, modeling data, and monitoring data. Generally, experimental fate data is preferred over modeled data; however, fate data from all data sources will be evaluated using the data criteria in this section. Definitions for these data types are shown in Table C-1. Since the availability of information varies considerably for different chemicals, it is anticipated that some study types will not be available while others may be identified beyond those listed in Table C-1. Table C-1. Types of Fate Data Type of Data Source Definition Experimental Data Data obtained from experimental studies conducted in a controlled environment with pre-defined testing conditions. Examples include data from laboratory tests such as those conducted for ready biodegradation (e.g., MITI test) or hydrolysis (i.e., following OECD TG 111), among others. Field Studies Data collected from incidental sampling of environmental media, especially to provide information on partitioning, bioconcentration, or long-term environmental fate.
Modeling Data Calculated values derived from computational models for estimating environmental fate and property data including degradation, bioconcentration, and partitioning. Monitoring Data Measured chemical concentration(s) obtained from systematic sampling of environmental media (e.g., air, water, soil, and biota) to observe and study the effect of environment conditions on the fate of chemicals. Monitoring data may include studies of chemical(s) after a known exposure/release of test substance as well as measured chemical concentrations over a period of time to provide direct evidence about fate in environment. Notes: MITI = Ministry of International Trade and Industry OECD TG = Organisation for Economic Co-operation and Development (OECD) Testing Guideline (TG) C.2 Data Quality Evaluation Domains The quality of fate data sources will be evaluated against metrics and criteria grouped into eight evaluation domains: Test Substance; Test Design; Test Conditions; Test Organisms (does not apply to abiotic studies); Outcome Assessment; Confounding/Variable Control; Data Presentation and Analysis; and Other. These domains, as defined in Table C-2, address elements of the TSCA Science Standards 26(h)(1) through 26(h)(5). The evaluation strategies are intended to apply to all fate data, although certain domains, metrics, and criteria may not apply to all studies. For example, there are evaluation strategy considerations for organisms in biodegradation, bioconcentration, or bioaccumulation studies that do not apply to abiotic studies. 42 ------- Table C-2. Data Evaluation Domains and Definitions for Fate Data Evaluation Domain Definition Test Substance Metrics in this domain evaluate whether the information provided in the study provides a reliable18 confirmation that the test substance used in a study has the same (or sufficiently similar) identity, purity, and properties as the test substance of interest. Test Design Metrics in this domain evaluate whether the experimental design enables the study to distinguish the behavior of the test substance from other factors. This domain includes metrics related to the use of control groups. Test Conditions Metrics in this domain assess the reliability of methods used to measure or characterize test substance behavior. These metrics evaluate whether presence of the test substance was characterized using method(s) that provide reliable results over the duration of the experiment. Test Organisms Metrics in this domain pertain to some fate studies19. These metrics assess the appropriateness of the population or organism(s) to assess the outcome of interest. Outcome Assessment Metrics in this domain assess the reliability of methods, including sensitivity, that are used to measure or otherwise characterize outcomes. Outcomes may include physical/chemical properties or fate parameters. Confounding/ Variable Control Metrics in this domain assess the potential impact of factors other than presence of test substance that may affect the risk of outcome. The metrics evaluate whether studies identify and account for factors that are related to presence of the test substance and independently related to outcome (confounding factors) and whether appropriate experimental or analytical (statistical) methods are used to control for factors unrelated to the presence of test substance that may affect the risk of outcome (variable control). 
Data Presentation and Analysis Metrics in this domain assess whether appropriate experimental or analytical methods were used and if all outcomes are presented. Other Metrics in this domain are added as needed to incorporate chemical- or study- specific evaluations (i.e., QSAR models). C.3 Data Quality Evaluation Metrics Table C-3 lists the data evaluation domains and metrics for fate studies. Each domain has between two and four metrics; however, some metrics may not apply to all fate data. A general domain for other considerations is available for metrics that are specific to a given test substance or study type (i.e., QSAR models). As with all evaluation criteria, EPA may modify the metrics used for fate data as more experience is acquired with the evaluation tools, to support fit-for-purpose TSCA risk evaluations. Any modifications will be documented. 18 Reliability is defined as "the inherent property of a study or data, which includes the use of well-founded scientific approaches, the avoidance of bias within the study or data collection design and faithful study or data collection conduct and documentation" (ECHA. 2011b). 19 This domain does not apply to abiotic studies. 43 ------- Table C-3. Summary of Metrics for the Fate Data Evaluation Domains Evaluation Domain Number of Metrics Overall Metrics (Metric Number and Description) Test Substance 2 • Metric 1: Test Substance Identity • Metric 2: Test Substance Purity Test Design 2 • Metric 3: Study Controls • Metric 4: Test Substance Stability Test Conditions 4 • Metric 5: Test Method Suitability • Metric 6: Testing Conditions • Metric 7: Testing Consistency • Metric 8: System Type and Design Test Organisms20 2 • Metric 9: Test Organism - Degradation • Metric 10: Test Organism - Partitioning Outcome Assessment 2 • Metric 11: Outcome Assessment Methodology • Metric 12: Sampling Methods Confounding/ Variable Control 2 • Metric 13: Confounding Variables • Metric 14: Outcomes Unrelated to Exposure Data Presentation and Analysis 2 • Metric 15: Data Presentation • Metric 16: Statistical Methods & Kinetic Calculations Other 2 • Metric 17: Verification or Plausibility of Results • Metric 18: QSAR Models C.4 Scoring Method and Determination of Overall Data Quality Level Appendix A provides information about the evaluation method that will be applied across the various data/information sources being assessed to support TSCA risk evaluations. This section provides details about the scoring system that will be applied to fate data/information, including the weighting factors assigned to each metric score of each domain. Some metrics may be given greater weights than others, if they are regarded as key or critical metrics based on expert judgment (Moermond et al.. 2016a). Thus, EPA will use a weighting approach to reflect that some metrics are more important than others when assessing the overall quality of the data. 20 This domain does not apply to abiotic studies. 44 ------- C.4.1 Weighting Factors Each metric was assigned a weighting factor of 1 or 2, with the higher weighting factor (2) given to metrics deemed critical for the evaluation. The critical metrics were identified based on factors that are most frequently included in other study quality and/or risk of bias tools (reviewed by (Lynch et al.. 2016); (Samuel et al.. 2016)). In selecting critical metrics, EPA recognized that the relevance of an individual fate study to the risk analysis for a given substance is determined by its ability to inform hazard identification and/or exposure. 
Thus, the critical metrics are those that determine how well a study supports the risk analysis. The rationale for selection of the critical metrics for fate studies is presented in Table C-4.

Table C-4. Fate Metrics with Greater Importance in the Evaluation and Rationale for Selection
Domain | Critical Metrics with Weighting Factor of 2 (Metric Number) [a] | Rationale
Test Substance | Test Substance Identity (Metric 1) | The test substance must be identified and characterized definitively to ensure that the study is relevant to the substance of interest.
Test Design | Study Controls (Metric 3) | Controls, with all conditions equal excluding exposure to the degradation pathway (e.g., sunlight, test organism, reductant, etc.) or partitioning surface, are required to ensure that any observed effects are attributable to the outcome of interest.
Test Conditions | Testing Conditions (Metric 6) | Testing conditions must be defined without ambiguity to enable valid comparisons across studies.
Test Organisms [21] | Test Organism - Degradation (Metric 9); Test Organism - Partitioning (Metric 10) | The test organism information must be reported to enable assessment of whether the organisms are suitable for the endpoint of interest and whether there are species, strain, sex, or age/life-stage differences within or between different studies.
Data Presentation and Analysis | Data Presentation (Metric 15) | Detailed reports are necessary to determine if the study authors' conclusions are valid.
Note: [a] A weighting factor of 1 is assigned for the following metrics: test substance purity (metric 2); test substance stability (metric 4); test method suitability (metric 5); testing consistency (metric 7); system type and design (metric 8); outcome assessment methodology (metric 11); sampling methods (metric 12); confounding variables (metric 13); outcomes unrelated to exposure (metric 14); statistical methods and kinetic calculations (metric 16); verification or plausibility of results (metric 17); QSAR models (metric 18).
21 This domain does not apply to abiotic studies. 45 -------

C.4.2 Calculation of Overall Study Score

To determine the overall study score, the first step is to multiply the score for each metric (1, 2, or 3 for high, medium, or low confidence, respectively) by the appropriate weighting factor, as shown in Table C-5, to obtain a weighted metric score. The weighted metric scores are then summed and divided by the sum of the weighting factors (for all metrics that are scored) to obtain an overall study score between 1 and 3. The equation for calculating the overall score is shown below:

Overall Score (range of 1 to 3) = Σ (Metric Score × Weighting Factor) / Σ (Weighting Factors)

Scoring examples for fate studies are given in Tables C-6 to C-8. Studies with any single metric scored as unacceptable (score = 4) will be automatically assigned an overall quality score of 4 (unacceptable) and further evaluation of the remaining metrics is not necessary. An unacceptable score means that serious flaws are noted in the domain metric that consequently make the data unusable (or invalid). EPA/OPPT plans to use data with an overall quality level of High, Medium, or Low confidence to quantitatively or qualitatively support the risk evaluations, but does not plan to use data rated as Unacceptable. Any metrics that are not rated/not applicable to the study under evaluation will not be considered in the numerator or in the calculation of the study's overall quality score.
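To make the weighted-average calculation and the quality-level bins concrete, the sketch below (in Python) is a minimal, hypothetical illustration of the approach described in this section and in Appendix A; the function name, data structures, and example scores are invented for illustration and are not part of EPA/OPPT's tooling. It assumes the cut-off values of 1.7 and 2.3 from Table A-1, exclusion of not-rated metrics from both the numerator and the denominator, and automatic disqualification when any metric is scored Unacceptable.

# Minimal illustrative sketch (hypothetical; not EPA software) of the weighted
# scoring described in Section C.4.2 and Appendix A. Metric scores: 1 = High,
# 2 = Medium, 3 = Low, 4 = Unacceptable; None = not rated/not applicable.

def overall_quality(metric_scores, weighting_factors):
    """Return (overall_score, quality_level) for one data/information source."""
    # Any single metric rated Unacceptable (score of 4) disqualifies the study.
    if any(score == 4 for score in metric_scores.values()):
        return 4.0, "Unacceptable"
    # Not-rated metrics (None) are excluded from numerator and denominator alike.
    rated = {m: s for m, s in metric_scores.items() if s is not None}
    numerator = sum(s * weighting_factors[m] for m, s in rated.items())
    denominator = sum(weighting_factors[m] for m in rated)
    score = numerator / denominator  # always falls between 1 and 3
    # Overall quality bins taken from Table A-1 (cut-offs of 1.7 and 2.3).
    if score < 1.7:
        level = "High"
    elif score < 2.3:
        level = "Medium"
    else:
        level = "Low"
    return score, level

# Invented abiotic (hydrolysis-type) example: metrics 9, 10, and 14 do not apply
# to abiotic data and metrics 7, 8, and 12 were not rated, so they are simply
# omitted; the weighting factors follow Table C-5.
scores = {1: 1, 2: 1, 3: 1, 4: 2, 5: 1, 6: 2, 11: 1, 13: 2, 15: 1, 16: 2, 17: 1}
weights = {1: 2, 2: 1, 3: 2, 4: 1, 5: 1, 6: 2, 11: 1, 13: 1, 15: 2, 16: 1, 17: 1}
print(overall_quality(scores, weights))  # (1.33..., 'High')

Computed this way, the routine also reproduces the bounds noted in Table C-5: if every rated metric is scored 1 the overall score is exactly 1, and if every rated metric is scored 3 the overall score is exactly 3.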
46 -------
Table C-5. Metric Weighting Factors and Range of Weighted Metric Scores for Scoring the Quality of Environmental Fate Data
Domain Number/Description | Metric Number/Description | Range of Metric Scoresa | Metric Weighting Factor | Range of Weighted Metric Scoresb
1. Test Substance | 1. Test Substance Identity | 1 to 3 | 2 | 2 to 6
1. Test Substance | 2. Test Substance Purity | 1 to 3 | 1 | 1 to 3
2. Test Design | 3. Study Controls | 1 to 3 | 2 | 2 to 6
2. Test Design | 4. Test Substance Stability | 1 to 3 | 1 | 1 to 3
3. Test Conditions | 5. Test Method Suitability | 1 to 3 | 1 | 1 to 3
3. Test Conditions | 6. Testing Conditions | 1 to 3 | 2 | 2 to 6
3. Test Conditions | 7. Testing Consistency | 1 to 2 | 1 | 1 to 3
3. Test Conditions | 8. System Type and Design | 1 to 2 | 1 | 1 to 3
4. Test Organisms22 | 9. Test Organism - Degradation | 1 to 3 | 2 | 2 to 6
4. Test Organisms22 | 10. Test Organism - Partitioning | 1 to 3 | 2 | 2 to 6
5. Outcome Assessment | 11. Outcome Assessment Methodology | 1 to 3 | 1 | 1 to 3
5. Outcome Assessment | 12. Sampling Methods | 1 to 3 | 1 | 1 to 3
6. Confounding/Variable Control | 13. Confounding Variables | 1 to 3 | 1 | 1 to 3
6. Confounding/Variable Control | 14. Outcomes Unrelated to Exposure23 | 1 to 2 | 1 | 1 to 3
7. Data Presentation and Analysis | 15. Data Reporting | 1 to 3 | 2 | 2 to 6
7. Data Presentation and Analysis | 16. Statistical Methods & Kinetic Calculations | 1 to 3 | 1 | 1 to 3
8. Other | 17. Verification or Plausibility of Results | 1 to 3 | 1 | 1 to 3
8. Other | 18. QSAR Models | 1 | 1 | 1 to 3
Sum (Metric Weighting Factors) = 24c | Sum (Weighted Metric Scores) = 24 to 72c
Overall Score = Σ (Metric Score x Metric Weighting Factor) / Σ (Metric Weighting Factors)
Range of Overall Scores after using equation: 24/24 = 1; 72/24 = 3; range of overall score = 1 to 3d
High: ≥1 and <1.7 | Medium: ≥1.7 and <2.3 | Low: ≥2.3 and ≤3
Notes:
a For the purposes of calculating an overall study score, the range of possible metric scores is 1 to 3 for each metric, corresponding to high, medium, or low confidence. No calculations will be conducted if a study receives an "unacceptable" rating (score of 4) for any metric.
b The range of weighted scores for each metric is calculated by multiplying the range of metric scores (1 to 3) by the weighting factor for that metric.
c The sum of weighting factors and the sum of the weighted scores will differ if some metrics are not scored (not applicable).
d The range of possible overall scores is 1 to 3. If a study receives a score of 1 for every metric, then the overall study score will be 1. If a study receives a score of 3 for every metric, then the overall study score will be 3.
22 This domain does not apply to abiotic studies.
23 This metric does not apply to abiotic studies.
47 -------
Table C-6. Scoring Example for Abiotic Fate Data (i.e., hydrolysis data) with All Applicable Metrics Scored
Domain Metric Metric Score Metric Weighting Factor Weighted Metric Score 1. Test Substance 1. Test Substance Identity 2. Test Substance Purity 2. Test Design 3. Study Controls 4. Test Substance Stability 3. Test Conditions 5. Test Method Suitability 6. Testing Conditions 7. Testing Consistency 8. System Type and Design 4. Test Organisms 9. Test Organism - Degradation 10. Test Organism - Partitioning N/A N/A 5. Outcome Assessment 11. Outcome Assessment Methodology 12.
Sampling Methods 6. Confounding/Variable Control 13. Confounding Variables 14. Outcomes Unrelated to Exposure 1 N/A 7. Data Presentation and Analysis 15. Data Reporting 16. Statistical Methods & Kinetic Calculations 4 1 8. Other 17. Verification or Plausibility of Results 18. QSAR Models 1 N/A
N/A = not applicable to abiotic data
Sum of Metric Weighting Factors = 18; Sum of Weighted Metric Scores = 24; Overall Study Score = 24/18 = 1.3333 (High)
Overall Score = Sum of Weighted Scores / Sum of Metric Weighting Factors
High: ≥1 and <1.7 | Medium: ≥1.7 and <2.3 | Low: ≥2.3 and ≤3
48 -------
Table C-7. Scoring Example for Abiotic Fate Data (i.e., hydrolysis data) with Some Metrics Not Rated/Not Applicable
Domain Metric Metric Score Metric Weighting Factor Weighted Metric Score 1. Test Substance 1. Test Substance Identity 2. Test Substance Purity 2. Test Design 3. Study Controls 4. Test Substance Stability 3. Test Conditions 5. Test Method Suitability 6. Testing Conditions 7. Testing Consistency 8. System Type and Design 1 1 NR NR 4. Test Organisms 9. Test Organism - Degradation 10. Test Organism - Partitioning N/A N/A 5. Outcome Assessment 11. Outcome Assessment Methodology 12. Sampling Methods 6. Confounding/Variable Control 13. Confounding Variables 14. Outcomes Unrelated to Exposure NR N/A 7. Data Presentation and Analysis 15. Data Reporting 16. Statistical Methods & Kinetic Calculations 8. Other 17. Verification or Plausibility of Results 18. QSAR Models 1 N/A
NR = not rated; N/A = not applicable to abiotic data
Sum of Metric Weighting Factors = 15; Sum of Weighted Metric Scores = 21; Overall Study Score = 21/15 = 1.4 (High)
Overall Score = Sum of Weighted Scores / Sum of Metric Weighting Factors
High: ≥1 and <1.7 | Medium: ≥1.7 and <2.3 | Low: ≥2.3 and ≤3
49 -------
Table C-8. Scoring Example for QSAR Data
Domain Number/Description | Metric Number/Description | Metric Scorea | Metric Weighting Factor | Weighted Metric Scoreb
1. Test Substance | 1. Test Substance Identity | NR | N/A | N/A
1. Test Substance | 2. Test Substance Purity | NR | N/A | N/A
2. Test Design | 3. Study Controls | NR | N/A | N/A
2. Test Design | 4. Test Substance Stability | NR | N/A | N/A
3. Test Conditions | 5. Test Method Suitability | NR | N/A | N/A
3. Test Conditions | 6. Testing Conditions | NR | N/A | N/A
3. Test Conditions | 7. Testing Consistency | NR | N/A | N/A
3. Test Conditions | 8. System Type and Design | NR | N/A | N/A
4. Test Organisms24 | 9. Test Organism - Degradation | NR | N/A | N/A
4. Test Organisms24 | 10. Test Organism - Partitioning | NR | N/A | N/A
5. Outcome Assessment | 11. Outcome Assessment Methodology | NR | N/A | N/A
5. Outcome Assessment | 12. Sampling Methods | NR | N/A | N/A
6. Confounding/Variable Control | 13. Confounding Variables | NR | N/A | N/A
6. Confounding/Variable Control | 14. Outcomes Unrelated to Exposure25 | NR | N/A | N/A
7. Data Presentation and Analysis | 15. Data Reporting | NR | N/A | N/A
7. Data Presentation and Analysis | 16. Statistical Methods & Kinetic Calculations | NR | N/A | N/A
8. Other | 17. Verification or Plausibility of Results | 2 | 1 | 2
8. Other | 18. QSAR Models | 1 | 1 | 1
Sum (of all metrics scored)b = 2 | Sum of Weighted Metric Scores = 3
Overall Score = Σ (Metric Score x Metric Weighting Factor) / Σ (Metric Weighting Factors) = 3/2 = 1.5 (High)
Range of Overall Scores after using equation: High: ≥1 and <1.7 | Medium: ≥1.7 and <2.3 | Low: ≥2.3 and ≤3
Notes:
a For the purposes of calculating an overall study score, the range of possible metric scores is 1 to 3 for each metric, corresponding to high, medium, or low confidence. No calculations will be conducted if a study receives an unacceptable rating (score of 4) for any metric.
b The sum of weighting factors and the sum of the weighted scores will differ if some metrics are not scored (not rated/not applicable).
NR: Not rated
N/A: Not applicable
24 This domain does not apply to abiotic studies.
25 This metric does not apply to abiotic studies.
50 -------
C.5 Data Quality Criteria
Table C-9. Serious Flaws that Would Make Fate Data Unacceptable for Use in the Fate Assessment
Optimization of the list of serious flaws may occur after pilot calibration exercises.
Domain Number/Description | Metric Number | Description of Serious Flaw(s) in Data Source
1. Test Substance | 1 | The test substance identity could not be determined from the information provided.
1. Test Substance | 2 | The nature and quantity of reported impurities were such that study results were unduly influenced by one or more of the impurities.
2. Test Design | 3 | The study did not include or report control groups that consequently made the study unusable (e.g., no positive control data for a non-guideline biodegradation study with a novel media and/or inoculum, reporting 0% removal) OR the vehicle (e.g., oil or carrier solvent) used in the study was likely to unduly influence the study results.
2. Test Design | 4 | There were problems with test substance stability, homogeneity, preparation, or storage conditions that had an impact on concentration or dose estimates and interfered with interpretation of study results.
3. Test Conditions | 5 | The test method was not reported or not suitable for the test substance.
3. Test Conditions | 6 | The testing conditions were not reported and sufficient data were not provided to interpret results OR testing conditions were not appropriate for the method (e.g., a biodegradation study at temperatures that inhibit the microorganisms), resulting in serious flaws that make the study unusable.
3. Test Conditions | 7 | Critical exposure details across samples or study groups were not reported and these omissions resulted in serious flaws that had a substantial impact on the overall confidence, consequently making the study unusable.
3. Test Conditions | 8 | Equilibrium was not established or reported, preventing meaningful interpretation of study results, OR the system type and design (i.e., static, semi-static, and flow-through; sealed, open) were not capable of appropriately maintaining substance concentrations, preventing meaningful interpretation of study results. These are serious flaws that make the study unusable.
4. Test Organisms | 9 | The test organism, species, or inoculum source was not reported.
4. Test Organisms | 10 | The test organism was not reported.
5. Outcome Assessment | 11 | The assessment methodology did not address or report the outcome(s) of interest.
5. Outcome Assessment | 12 | Serious uncertainties or limitations were identified in sampling methods of the outcome(s) of interest and these were likely to have a substantial impact on the results, resulting in serious flaws which make the study unusable.
6. Confounding/Variable Control | 13 | There were sources of variability and uncertainty in the measurements and statistical techniques or between study groups, resulting in serious flaws that make the study unusable.
6. Confounding/Variable Control | 14 | Attrition or health outcomes were not reported and this omission was likely to have a substantial impact on study results, OR one or more study groups experienced disproportionate organism attrition or health outcomes that influenced the outcome assessment.
51 -------
7. Data Presentation and Analysis | 15 | The analytical method used was not suitable for detection of the test substance.
7. Data Presentation and Analysis | 16 | Statistical methods or kinetic calculations used were likely to provide biased results.
8. Other | 17 | Reported value was completely inconsistent with reference substance data, related physical chemical properties, or analog data, or was otherwise implausible, suggesting that an unidentified serious study deficiency exists.
8. Other | 18 | The QSAR model did not have a defined, unambiguous endpoint OR the model performance was not known or r2 < 0.7, q2 < 0.5, or SE > 0.3 (ECHA, 2016).
Table C-10. Data Quality Criteria for Fate Data
Confidence Level (Score) Description Selected Score Domain 1. Test Substance Metric 1: Test substance identity Was the test substance identified definitively? High (score = 1) The test substance was identified definitively (i.e., established nomenclature, CASRN, or structure reported, including information on the specific form tested [particle characteristics for solid-state materials, salt or base, valence state, isomer, etc.] for materials that may vary in form, or submitting company's code name with supporting confirmatory documentation) and the specific form characterized, where applicable. Medium (score = 2) The test substance was identified by trade name or other internal designation, but characterization details were omitted that could affect interpretation of study results; however, the omission was not likely to have a substantial impact on the study results. Low (score = 3) The test substance was identified; however, it lacked specific characteristics such as stereochemistry or valence state OR there were some uncertainties or conflicting information regarding test substance identification or characterization that were likely to have a substantial impact on the study results. Unacceptable (score = 4) The test substance identity could not be determined from the information provided (e.g., nomenclature was unclear and CASRN or structure was not reported). This is a serious flaw that makes the study unusable. Not rated/ applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] 52 ------- Confidence Level (Score) Description Selected Score Metric 2: Test substance purity Was the source of the test substance reported? If the test substance was synthesized or extracted (as part of the synthesis or from a substrate), was the test substance identity verified by analytical methods? Were the purity, grade or hydration state (e.g., analytical, technical) of the test substance reported? If the test substance was tested as part of a finished or formulated product, was the full chemical composition of the formulation reported? High (score = 1) The source or purity of the test substance was reported or the test substance identity and purity were verified by analytical means (chemical analysis, etc.) OR if the test substance was tested as part of a finished or formulated product, the full chemical composition of the formulation was reported AND any observed effects were likely due to the nominal test substance itself (e.g., pure, analytical grade, technical grade test substance, or other substances in the formulation were inert, or the other components were inert under the test conditions). Medium (score = 2) The test substance source was not reported AND/OR the test substance purity was low or not reported (e.g., lack of information on hydration state of a compound introduces uncertainty into concentration calculations); however, the omissions or identified impurities were not likely to have a substantial impact on the study results.
Low (score = 3) The source and purity of the test substance were not reported or verified by analytical means OR The test substance was synthesized or extracted and its identity was not verified by analytical means (i.e., chemical analysis, etc.) OR identified impurities were likely to have a substantial impact on study results. Unacceptable (score = 4) The nature and quantity of reported impurities were such that study results were unduly influenced by one or more of the impurities. These are serious flaws that make the study unusable. Not rated/ applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] 53 ------- Confidence Level (Score) Description Selected Score Domain 2. Test Design Metric 3: Study controls Was a concurrent negative control or blank group included? Were positive and toxicity controls included? If a vehicle was used, was the control group exposed to the vehicle? Is the selected vehicle unlikely to influence the study results, stability, bioavailability or/toxicity of the test substance? High (score = 1) A concurrent negative control, or blank group, toxicity control, and positive control were included (where applicable) AND results from controls were within the ranges specified for test validity (or validity criteria for equivalent or similar tests, if not a guideline test) AND a concurrent blank with vehicle (e.g., oil or carrier solvent) was included and the vehicle was not likely to influence the study results (where applicable). Medium (score = 2) Some concurrent control group details were not included; however, the lack of data was not likely to have a substantial impact on study results AND the vehicle was not likely to influence the study results (where applicable). Low (score = 3) Reported results from control group(s) were outside the ranges specified for test validity (or validity criteria for equivalent or similar tests, if not a guideline test) OR the vehicle was likely to have a substantial impact on study results. Unacceptable (score = 4) The study did not include or report crucial control groups that consequently made the study unusable (e.g., no positive control for a biodegradation study reporting 0% removal) OR the vehicle used in the study was likely to unduly influence the study results. These are serious flaws that make the study unusable. Not rated/ applicable The study did not require concurrent control groups. Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 4: Test substance stability Did the study characterize and accommodate the test substance stability, homogeneity, preparation, and storage conditions? Were the frequency of preparation and storage conditions appropriate to the test substance stability? High (score = 1) The test substance stability, homogeneity, preparation, and storage conditions were reported (e.g., mixing temperature, stock concentration, stirring methods, centrifugation or filtration), and were appropriate for the study (e.g., a test substance known to degrade in light was stored in dark or amber bottles). 
Medium (score = 2) The test substance stability, homogeneity, preparation or storage conditions were not reported; however, these factors were not likely to influence the test substance or were not likely to have a substantial impact on study results. Low (score = 3) The test substance stability, homogeneity, preparation, and storage conditions were not reported and these factors likely influenced the test substance or are likely to have a substantial impact on the study results. Unacceptable (score = 4) There were problems with test substance stability, homogeneity, preparation, or storage conditions that had an impact on concentration or dose estimates and interfered with interpretation of study results. These are serious flaws that make the study unusable. 54 ------- Confidence Level (Score) Description Selected Score Not rated/ applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 3. Test Conditions Metric 5: Test method suitability Was the test method reported and suitable for the test material? Was the target chemical tested at concentrations below its aqueous solubility? High (score = 1) The test method was suitable for the test substance AND the target chemical was tested at concentrations below its aqueous solubility (when applicable). Medium (score = 2) The test method was suitable for the test substance with minor deviations AND/OR nominal estimates of media concentrations were provided, but, the levels were not measured or suitable to the study type or outcome(s) of interest AND these deviations or omissions were not likely to have a substantial impact on study results. Low (score = 3) Applied target chemical concentrations were greater than the aqueous solubility AND the deviations were likely to have a substantial impact on the results. Unacceptable (score = 4) The test method was not reported or not suitable for the test substance. These deviations or lack of information resulted in serious flaws that make the study unusable. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 6: Testing conditions Were the test conditions monitored, reported, and appropriate for the study method (e.g., the temperature range reported, dissolved organic matter, aeration, total organic matter, pH or water hardness reported and maintained throughout the test)? High (score = 1) Testing conditions were monitored, reported, and appropriate for the method. For example, depending on the study, the following conditions were reported: • aerobic/anaerobic conditions reported • dissolved oxygen (DO) measured • redox/electron activity (pE) parameters listed and/or anaerobic conditions otherwise identified (e.g., sulfate reducing, methanogenic, etc.) 
• pH buffer for studies on the fate of a substance that may exist in ionized form(s) in the pH range of environmental relevance • For studies in aquatic environments, conditions reported separately for both the water and sediment column • For studies in soil, soil type (location if available), moisture level, soil particle size distribution, background SOM (soil organic matter) or OC (organic carbon) content, CEC (cation exchange capacity) or soil pH, soil name (e.g., USDA series) 55 ------- Confidence Level (Score) Description Selected Score Medium (score = 2) There were reported deviations or omissions in testing conditions (e.g., temperature was not constant or was not in a standard range for the test but, results can be extrapolated to approximate appropriate temperatures); however, sufficient data were reported to determine that the deviations and omissions were not likely to have a substantial impact on study results. Low (score = 3) Inappropriate test conditions for the study method (e.g., temperature fluctuations) and the deviations were likely to have a substantial impact on the results. Unacceptable (score = 4) Testing conditions were not reported and data provided were insufficient to interpret results OR testing conditions were not appropriate for the method (e.g., a biodegradation study at temperatures that inhibit the microorganisms) resulting in serious flaws that make the study unusable. Not rated/ applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 7: Testing consistency Were test conditions established to be consistent across samples or study groups? Were multiple exposures evaluated, where applicable? High (score = 1) Test conditions were consistent across samples or study groups (i.e., same exposure method and timing, comparable particle size characteristics). The conditions of the exposure were documented. Medium (score = 2) There were minor inconsistencies in test conditions across samples or study groups OR some test conditions across samples or study groups were not reported, but these discrepancies were not likely to have a substantial impact on study results. Low (score = 3) There were inconsistencies in test conditions across samples or study groups that are likely to have a substantial impact on results. Unacceptable (score = 4) Critical exposure details across samples or study groups were not reported and these omissions resulted in serious flaws that had a substantial impact on the overall confidence, consequently making the study unusable. Not rated/ applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 8: System type and design* Was equilibrium established? Were the system type and design capable of appropriately maintaining substance concentrations for experimental studies? * For studies of partitioning High (score = 1) Equilibrium was established. The system type and design (i.e., static, semi-static, and flow-through; sealed, open) were capable of appropriately maintaining substance concentrations. 
Medium (score = 2) Equilibrium was not established or reported but this was not likely to have a substantial impact on study results OR 56 ------- Confidence Level (Score) Description Selected Score the system type and design (i.e., static, semi-static, and flow-through; sealed, open) were not capable of appropriately maintaining substance concentrations or not described but the deviation was not likely to have a substantial impact on study results. Low (score = 3) — Unacceptable (score = 4) Equilibrium was not established or reported preventing meaningful interpretation of study results OR the system type and design (i.e., static, semi-static, and flow-through; sealed, open) were not capable of appropriately maintaining substance concentrations preventing meaningful interpretation of study results. These are serious flaws that make the study unusable. Not rated/ applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 4. Test Organisms (does not apply to all fate studies) Metric 9: Test organism - degradation Was information about the test organism, species or inoculum reported? Were inoculum source, concentration or number of microorganisms, and any pre-conditioning or pre-adaptation procedures reported? Are the test organism, species or inoculum source routinely used for similar study types or outcome(s)* of interest? Were the chosen organisms or inoculum appropriate for the study method or route? * For studies of degradation High (score = 1) The test organism information or inoculum source were reported AND the test organism, species, or inoculum are routinely used for similar study types and appropriate (e.g., aerobic microorganisms used for anaerobic biodegradation study) for the study method or route. Medium (score = 2) The test organism, species, or inoculum source were reported, but are not routinely used for similar study types; however, the deviation was not likely to have a substantial impact on study results. Low (score = 3) The test organism, species, or inoculum source are not routinely used for similar study types or were not appropriate for the evaluation of the specific outcome(s) of interest or route (e.g., genetically modified strains uniquely susceptible or resistant to one or more outcome of interest). In practice, this manifests as using an inappropriate inoculum for the study method (e.g., polyseed capsules instead of activated sludge from a publicly owned treatment works (POTW) for a ready biodegradability test). OR an inoculum that was pre-adapted to the test substance was used for a biodegradation rate study AND no justification for selection of the test organism was provided. The deviation was likely to have a substantial impact on study results. Unacceptable (score = 4) The test organism, species, or inoculum source were not reported. Not rated/ 57 ------- Confidence Level (Score) Description Selected Score applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 10: Test organism - partitioning Was information about the test organism reported? Was the test organism source known? Is the test organism or species routinely used for similar study types or outcome(s)* of interest? 
* For studies of partitioning High (score = 1) Test organism information was reported, including species or sex, age, and starting body weight (where applicable) OR the test organism was obtained from a reliable or commercial source AND the test organism or species is routinely used for similar study types. Medium (score = 2) The test organism was obtained from a reliable or commercial source OR the test organism or species is routinely used for similar study types; however, one or more additional characteristics of the organisms were not reported (i.e., sex, health status, age, or starting body weight), but these omissions were not likely to have a substantial impact on study results. Low (score = 3) The test organism was not obtained from a reliable or commercial source OR the test organism or species is not routinely used for similar study types or was not appropriate (i.e., species, life-stage) for the evaluation of the specific outcome(s) of interest (e.g., genetically modified organisms, strain was uniquely susceptible or resistant to one or more outcome of interest) AND no justification for selection of the test organism was provided. The deviations were likely to have a substantial impact on study results. Unacceptable (score = 4) The test organism information was not reported. Not rated/ applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 5. Outcome Assessment Metric 11: Outcome* assessment methodology Did the outcome* assessment methodology address and report the outcome(s)* of interest? * For all fate studies (i.e., degradation, partitioning, etc.) High (score = 1) The outcome assessment methodology addressed or reported the intended outcome(s) of interest. Medium (score = 2) There were minor differences between the assessment methodology and the intended outcome assessment (i.e. biodegradation rate not reported; however, degradation products and a degradation pathway were determined) OR there was incomplete reporting of outcome assessment methods; however, such differences or absence of details were not likely to be severe or have a substantial impact on the study results. 58 ------- Confidence Level (Score) Description Selected Score Low (score = 3) Deficiencies in the outcome assessment methodology of the assessment or reporting were likely to have a substantial impact on results. Unacceptable (score = 4) The assessment methodology did not address or report the outcome(s) of interest. This is a serious flaw that makes the study unusable. Not rated/ applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 12: Sampling adequacy Were the sampling methods, including timing and frequency, adequate, for the outcome(s)* of interest? * For all fate studies (i.e., degradation, partitioning, etc.) High (score = 1) The study reported the use of sampling methods that address the outcome(s) of interest, and used widely accepted methods/approaches for the chemical and media being analyzed (e.g., sampling equipment, sample storage conditions) AND no notable uncertainties or limitations were expected to influence results. 
Medium (score = 2) Minor limitations were identified in sampling methods of the outcome(s) of interest were reported (i.e., the sampling intervals were such that a half-life or other rate could be determined and/or pathways could be defined); however, the limitations were not likely to have a substantial impact on results. Low (score = 3) Details regarding sampling methods of the outcome(s) were not fully reported, and the omissions were likely to have a substantial impact on study results AND/OR an accepted method/approach for the chemical and media being analyzed was not used (e.g., inappropriate sampling equipment, improper storage conditions). Unacceptable (score = 4) Serious uncertainties or limitations were identified in sampling methods of the outcome(s) of interest and these were likely to have a substantial impact on the results, resulting in serious flaws which make the study unusable. Not rated/ applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 6. Confounding/Variable Control Metric 13: Confounding variables Were sources of variability or uncertainty noted in the study? Did confounding differences among the study groups influence the outcome* assessment? * For all fate studies (i.e., degradation, partitioning, etc.) High (score = 1) Sources of variability and uncertainty in the measurements, and statistical techniques and between study groups (if applicable) were considered and accounted for in data evaluation AND all reported variability or uncertainty was not likely to influence the outcome assessment. Medium (score = 2) Sources of variability and uncertainty in the measurements and statistical techniques and between study groups (if applicable) were reported in the study AND 59 ------- Confidence Level (Score) Description Selected Score the differences in the measurements and statistical techniques and between study groups were considered or accounted for in data evaluation with minor deviations or omissions AND the minor deviations or omissions were not likely to have a substantial impact on study results. Low (score = 3) Sources of variability and uncertainty in the measurements and statistical techniques and between study groups (if applicable) were not considered or accounted for in data evaluation resulting in some uncertainty AND there is concern that variability or uncertainty was likely to have a substantial impact on the results. Unacceptable (score = 4) There were sources of variability and uncertainty in the measurements and statistical techniques or between study groups resulting in serious flaws that make the study unusable. Not rated/ applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 14: Outcomes unrelated to exposure Were there differences among the study groups in organism attrition or health outcomes unrelated to exposure to the test substance that influenced the outcome* assessment? * For studies of partitioning in organisms High (score = 1) There were multiple study groups, and there were no differences among the study groups in organism attrition or health outcomes (i.e., unexplained mortality) that influenced the outcome assessment. 
Medium (score = 2) Attrition or health outcomes were not reported; however, this omission was not likely to have a substantial impact on study results. Low (score = 3) — Unacceptable (score = 4) Attrition or health outcomes were not reported and this omission was likely to have a substantial impact on study results OR one or more study groups experienced disproportionate organism attrition or health outcomes that influenced the outcome assessment (e.g., pH drastically decreased for one treatment and resulted in pH effects versus effects from the chemical being tested). This is a serious flaw that makes the study unusable. Not rated/ applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] 60 ------- Confidence Level (Score) Description Selected Score Domain 7. Data Presentation and Analysis Metric 15: Data reporting Were the target chemical and transformation product(s) concentrations reported? Was the extraction efficiency, percent recovery, and/or mass balance reported? Was the analytical method used suitable for detection and capable of identifying or quantifying the parent and transformation products? Was sufficient evidence presented to confirm that the disappearance of the parent compound was not due to some other process (e.g., sorption)? High (score = 1) The target chemical and transformation product(s) concentrations (if required), extraction efficiency, percent recovery, or mass balance were reported AND analytical methods used were suitable for detection and quantification of the target chemical and transformation product(s) (if required) AND for degradation studies, sufficient evidence was presented to confirm that parent compound disappearance was not likely due to some other process AND the lipid content or the lipid-normalized bioconcentration factor (BCF) was reported for BCF studies AND detection limits were sensitive enough to follow decline of parent and formation of the metabolites; structures of metabolites were given. Volatile products were trapped and identified. Medium (score = 2) The target chemical and transformation product(s) concentrations, extraction efficiency, percent recovery, or mass balance were not reported; however, these omissions were not likely to have a substantial impact on study results OR the lipid content or lipid normalized BCF was not reported for BCF studies, but these deficiencies or omissions were not likely to have a substantial impact on study results. Low (score = 3) There was insufficient evidence presented to confirm that parent compound disappearance was not likely due to some other process OR concentrations of the target chemical or transformation product(s), extraction efficiency, percent recovery, or mass balance were not measured or reported, preventing meaningful interpretation of study results OR lipid normalized BCF and lipid content were not measured or reported, preventing meaningful interpretation of study results AND these omissions were likely to have a substantial impact on study results. Unacceptable (score = 4) The analytical method used was not suitable for detection of the test substance. Not rated/ applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] 61 ------- Confidence Level (Score) Description Selected Score Metric 16. 
Statistical methods & kinetic calculations Were statistical methods or kinetic calculations clearly described and consistent? High (score = 1) Statistical methods or kinetic calculations were clearly described and address the dataset(s). Medium (score = 2) Statistical analysis used an outdated, unusual, or non-robust method; however, the study results were likely to be similar to those obtained using a current/ more robust method OR kinetic calculations were not clearly described AND these differences were not likely to have a substantial impact on study results. OR No statistical analyses were conducted; however, sufficient data were provided to conduct an independent statistical analysis. Low (score = 3) Statistical analysis or kinetic calculations were not conducted or were not described clearly AND the lack of information was likely to have a substantial impact on study results. Unacceptable (score = 4) Statistical methods or kinetic calculations used were likely to provide biased results. These are serious flaws that make the study unusable. Not rated/ applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 8. Other Metric 17. Verification or Plausibility of Results Were the study results reasonable? Was anything not covered in the evaluation questions? High (score = 1) Reported values were within expected range as defined by reference substance(s) OR reported values were consistent with related physical chemical properties (e.g., considering K0w, pKa, vapor pressure, etc.). Medium (score = 2) The study results were reasonable AND the reported value was outside expected range, as defined by reference substance(s) or in relation to related physical chemical properties (e.g., considering K0w, vapor pressure, etc.); however, no serious study deficiencies were identified, and the value was plausible. Low (score = 3) Due to limited information, evaluation of the reasonableness of the study results was not possible (i.e., reference substance(s) not used or physical-chemical properties unknown and unable to be estimated). Unacceptable (score = 4) Reported value was completely inconsistent with reference substance data, related physical chemical properties, analog data, or otherwise implausible, suggesting that an unidentified serious study deficiency exists. These are serious flaws that make the study unusable. Not rated/ applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as 62 ------- Confidence Level (Score) Description Selected Score relevance] Metric 18. QSAR Models Did the QSAR model have a defined, unambiguous endpoint and appropriate measures of goodness-of-fit, robustness and predictivity, defined by r2 > 0.7, q2 > 0.5 and SE < 0.3, where r2 is the correlation coefficient, q2 is the cross-validated correlation coefficient and SE is the standard error (ECHA, 2016)? High (score = 1) The QSAR model had a defined, unambiguous endpoint AND the model performance was known and r2 > 0.7, q2 > 0.5, and SE < 0.3 (ECHA, 2016). Medium (score = 2) Model endpoint is broad (i.e., overall persistence) AND/OR non-transparent and difficult to reproduce methods were used to build the (Q)SAR model (e.g. artificial neural networks using many structural descriptors). 
Low (score = 3) Algorithm is not publicly available to verify or reproduce the predictions AND/OR statistics on the external validation set are unavailable. Unacceptable (score = 4) The model performance was either not known or r2 < 0.7, q2 < 0.5, or SE > 0.3 (ECHA, 2016). These are serious flaws that make the study unusable. Not rated/ applicable A QSAR model was not reported. Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] 63 -------
C.6 References
1. ECHA. (2011). Guidance on information requirements and chemical safety assessment. Chapter R.3: Information gathering. https://hero.epa.gov/heronet/index.cfm/reference/download/reference_id/4262857.
2. ECHA. (2016). Practical guide: How to use and report (Q)SARs. Version 3.1. July 2016. https://hero.epa.gov/heronet/index.cfm/reference/download/reference_id/4262860.
3. Lynch, HN; Goodman, JE; Tabony, JA; Rhomberg, LR. (2016). Systematic comparison of study quality criteria. Regul Toxicol Pharmacol 76: 187-198. https://hero.epa.gov/heronet/index.cfm/reference/download/reference_id/4262904.
4. Moermond, C; Beasley, A; Breton, R; Junghans, M; Laskowski, R; Solomon, K; Zahner, H. (2016). Assessing the reliability of ecotoxicological studies: An overview of current needs and approaches. Integr Environ Assess Manag 13: 1-12. http://dx.doi.org/10.1002/ieam.1870.
5. Samuel, GO; Hoffmann, S; Wright, RA; Lalu, MM; Patlewicz, G; Becker, RA; Degeorge, GL; Fergusson, D; Hartung, T; Lewis, RJ; Stephens, ML. (2016). Guidance on assessing the methodological and reporting quality of toxicologically relevant studies: A scoping review. Environ Int 92-93: 630-646. https://hero.epa.gov/heronet/index.cfm/reference/download/reference_id/4262966.
64 -------
APPENDIX D: DATA QUALITY CRITERIA FOR OCCUPATIONAL EXPOSURE AND RELEASE DATA
D.1 Types of Environmental Release and Occupational Exposure Data Sources
Environmental release and occupational exposure data and information may be found in a variety of sources, and most are not found in controlled studies. The evaluation of these data and information requires approaches that differ from the evaluation of controlled studies. These differences are inherently covered by the tables for the different sources (e.g., all tables in section D.7). In these tables, some metrics are shown as not applicable and will not be scored. Other metrics may have criteria that reflect differences in the documentation of background information about the data or information, especially if the data or information are not collected from a controlled study that is fully documented. The data quality will be evaluated for five different types of data sources that contain environmental release and occupational exposure data: (1) monitoring data from various sources (e.g., journal articles, government reports, public databases); (2) release data from various sources; (3) published models for exposures or releases; (4) completed exposure or risk assessments; and (5) reports for data or information other than exposure or release data. Definitions for these data types are shown below in Table D-1.
Types of Occupational Exposure and Environmental Release Data Sources Type of Data Source Definition Monitoring Data Measured occupational exposures, which include, but not limited to, personal inhalation exposure monitoring, area/stationary airborne concentration monitoring, and surface wipe sampling. Environmental Release Data Measured or calculated quantities of chemical or chemical substance released across a facility fence line into an environmental media or waste management/disposal method. Published Models for Exposures or Releases Published models used to calculate occupational exposures or environmental releases. Completed Exposure or Risk Assessments Completed exposure or risk assessments containing a broad range of data types (i.e., exposure concentrations, doses, estimated values, exposure factors). Examples: ATSDR assessments, risk assessments completed by other countries. Reports for Data or Information Other than Exposure or Release Data Data sources used for data or information other than exposure or release data, such as process description information. Example: Kirk-Othmer Encyclopedia of Chemical Technology Note: ATSDR = Agency for Toxic Substances and Disease Registry 65 ------- D.2 Data Quality Evaluation Domains The data sources will be evaluated against the following four data quality evaluation domains: (1) reliability; (2) representativeness; (3) accessibility/clarity; (4) and variability and uncertainty. These domains, as defined in Table D-2, address elements of TSCA Science Standards 26(h)(1) through 26(h)(5). Table D-2. Data Evaluation Domains and Definitions Evaluation Domain Definition Reliability The inherent property of a study or data, which includes the use of well-founded scientific approaches, the avoidance of bias within the study or data collection design and faithful studv or data collection conduct and documentation (ECHA, 2011b). Representativeness The data reported address exposure scenarios (e.g., sources, pathways, routes, receptors) that are relevant to the assessment. Accessibility/Clarity The data and supporting information are accessible and clearly documented. Variability and Uncertainty The data describe variability and uncertainty (quantitative and qualitative) or the procedures, measures, methods, or models are evaluated and characterized. D.3 Data Quality Evaluation Metrics Table D-3 provides a summary of the quality metrics for each data type. EPA may adjust these quality metrics as more experience is acquired with the evaluation tools to support fit-for- purpose TSCA risk evaluations. If this happens, EPA will document the changes to the evaluation tool. Table D-3. 
Summary of Quality Metrics for the Five Types of Data Sources Type of Data Source Overall Number of Metrics Metric Names Monitoring Data 7 Sampling and analytical methodology; Geographic Scope; Applicability; Temporal representativeness; Sample size; Metadata completeness informing the Accessibility and Clarity domain; Metadata completeness informing the Variability and Uncertainty domain Environmental Release Data 7 Methodology; Geographic Scope; Applicability; Temporal representativeness; Sample size; Metadata completeness informing the Accessibility and Clarity domain; Metadata completeness informing the Variability and Uncertainty domain Published Models for Exposures or Releases Up to 6 Methodology; Geographic Scope; Applicability; Temporal representativeness; Metadata completeness informing the Accessibility and Clarity domain; Metadata completeness informing the Variability and Uncertainty domain Completed Exposure or Risk Assessments Up to 7 Methodology; Geographic Scope; Applicability; Temporal representativeness; Sample Size; Metadata completeness informing the Accessibility and Clarity domain; Metadata completeness informing the Variability and Uncertainty domain Reports for Data or Information Other than Exposure or Release Data Up to 7 Methodology; Geographic Scope; Applicability; Temporal representativeness; Sample size; Metadata completeness informing the Accessibility and Clarity domain; Metadata completeness informing the Variability and Uncertainty domain Notes: • Number of Metrics Overall indicates the number of metrics across evaluation domains. • Metadata are data that provide descriptive information about other data. Examples include the date of the data, the author and author's affiliation of a report or study, and the type of exposure monitoring sample (e.g., personal breathing zone sample). 66 ------- D.4 Scoring Method and Determination of Overall Data Quality Level Appendix A provides information about the evaluation method that will be applied across the various data/information sources being assessed to support TSCA risk evaluations. This section provides details about the scoring system that will be applied to occupational exposure and release data/information, including the weighting factors assigned to each metric score of each domain. Some metrics may be given greater weights than others, if they are regarded as key or critical metrics, based on expert judgment (Moermond et al.. 2016a). Thus, EPA will use a weighting approach to reflect that some metrics are more important that others when assessing the overall quality of the data. D.4.1 Weighting Factors EPA developed the weighting factors by beginning with an even weight for each metric. In other words, there are seven metrics for many data types; thus, each weighting factor began with a value of 1. Then, EPA used expert judgement to determine the importance of a particular metric relative to others. Following the prioritization of criteria, each metric was assigned a weighting factor of 1 or 2, with the higher weighting factor (2) given to metrics deemed critical for the evaluation. EPA judged applicability and temporal representativeness to be the most important towards overall confidence, and these two metrics were determined to be twice as important as other metrics (weighting factors assigned a value of 2). • Applicability is one of the most important metrics for occupational data because occupational settings have a diverse set of determinants of exposure and release. 
Therefore, when evaluating occupational data, it is important for EPA's purposes that those data capture as many as possible of the determinants of exposure and release that apply to the condition of use of interest. • Representativeness of current workplace practices is the other most important metric for occupational data because industry and business practices are expected to change with time. Therefore, when evaluating occupational data, it is important for EPA's purposes that those data represent current-day practices.
Table D-4 summarizes the weighting factor for each metric, the range of possible scores for each metric, and the range of resulting weighted scores, which are the products of the weighting factor and the metric score, if all of the metrics are scored for a particular data type. 67 -------
Table D-4. Metric Weighting Factors and Range of Weighted Metric Scores for Scoring the Quality of Environmental Release and Occupational Data
Domain | Metric | Metric Weighting Factor | Metric Score (range of possible values) | Weighted Metric Score (range of possible values)
Reliability | Methodology | 1 | 1 to 3 | 1 to 3
Representativeness | Applicability | 2 | 1 to 3 | 2 to 6
Representativeness | Geographic Scope | 1 | 1 to 3 | 1 to 3
Representativeness | Temporal representativeness | 2 | 1 to 3 | 2 to 6
Representativeness | Sample Size | 1 | 1 to 3 | 1 to 3
Accessibility/Clarity | Metadata Completeness | 1 | 1 to 3 | 1 to 3
Variability and Uncertainty | Metadata Completeness | 1 | 1 to 3 | 1 to 3
Sum (if all metrics scored)a | 9 | - | 9 to 27
Range of Overall Scores, where Overall Score = Σ (Metric Score x Metric Weighting Factor) / Σ (Metric Weighting Factors): 9/9 = 1; 27/9 = 3; range of overall score = 1 to 3
High: ≥1 and <1.7 | Medium: ≥1.7 and <2.3 | Low: ≥2.3 and ≤3
Note: a The sum of weighting factors and the sum of the weighted scores will differ if some metrics are not scored (not applicable).
D.4.2 Calculation of Overall Study Score
To determine the overall study score, the first step is to multiply the score for each metric (1, 2, or 3 for high, medium, or low confidence, respectively) by the appropriate weighting factor, as shown in Table D-4, to obtain a weighted metric score. The weighted metric scores are then summed and divided by the sum of the weighting factors (for all metrics that are scored) to obtain an overall study score between 1 and 3. The equation for calculating the overall score is shown below:

Overall Score (range of 1 to 3) = Σ (Metric Score x Weighting Factor) / Σ (Weighting Factors)

EPA/OPPT plans to use data with an overall confidence rating of High, Medium, or Low to quantitatively or qualitatively support the risk evaluations, but does not plan to use data rated Unacceptable. If any single metric for a data source has a score of Unacceptable, then the overall confidence of the data is automatically rated with an overall confidence score of 4. An Unacceptable score means that serious flaws are noted in the domain metric that consequently make the data unusable (or invalid). There is no need to calculate weighted scores for the remaining metrics once a serious flaw is identified in any single metric, which receives a score of four; therefore, Table D-4 does not include metric scores of four. 68 ------- If any metric is not applicable to a data set, that metric is not rated. In that case, the metric is not included in the scoring. In the case that the source type contains more than one data set or information element, the reviewer provides an overall confidence score for each data set or information element that is found in the source. Therefore, it is possible that a source may have more than one overall quality/confidence score.
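As an illustration only (not part of the EPA evaluation tool), the short sketch below reproduces the arithmetic of the published-models example shown in Table D-5 below, where the sample size metric is not rated. The variable names are hypothetical; the weighting factors are those listed in Table D-4.

```python
# Illustrative sketch only; not part of the EPA evaluation tool.
# Weighting factors follow Table D-4; the scores mirror the Table D-5 example,
# in which sample size is not rated for a published model.

weights = {"Methodology": 1, "Applicability": 2, "Geographic Scope": 1,
           "Temporal representativeness": 2, "Sample Size": 1,
           "Metadata Completeness (Accessibility/Clarity)": 1,
           "Metadata Completeness (Variability/Uncertainty)": 1}

scores = {"Methodology": 2, "Applicability": 1, "Geographic Scope": 2,
          "Temporal representativeness": 1, "Sample Size": None,  # not rated
          "Metadata Completeness (Accessibility/Clarity)": 2,
          "Metadata Completeness (Variability/Uncertainty)": 3}

# Not-rated metrics drop out of both the numerator and the denominator.
rated = [m for m, s in scores.items() if s is not None]
overall = sum(scores[m] * weights[m] for m in rated) / sum(weights[m] for m in rated)
print(round(overall, 1))  # 13/8 = 1.6, which bins to High confidence (< 1.7)
```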
Table D-5 provides an example of scoring when a particular metric is not rated. In this example, the sample size metric under the representativeness domain is not applicable for published models. Detailed tables showing quality criteria for the metrics are provided in Tables D-10 through D-19 for each data type, including separate tables which summarize the serious flaws which would make the data unacceptable for use in the environmental release and occupational exposure assessment.
Table D-5. Scoring Example for Published Models where Sample Size is Not Applicable
Domain | Metric | Metric Score | Metric Weighting Factor | Weighted Metric Score
Reliability | Methodology | 2 | 1 | 2
Representativeness | Applicability | 1 | 2 | 2
Representativeness | Geographic Scope | 2 | 1 | 2
Representativeness | Temporal representativeness | 1 | 2 | 2
Representativeness | Sample Size | NR | N/A | N/A
Accessibility/Clarity | Metadata Completeness | 2 | 1 | 2
Variability and Uncertainty | Metadata Completeness | 3 | 1 | 3
Sum of Metric Weighting Factors = 8; Sum of Weighted Metric Scores = 13
Range of Overall Scores, where Overall Score = Σ (Metric Score x Metric Weighting Factor) / Σ (Metric Weighting Factors): 13/8 = 1.6; 1.6 (High)
High: ≥1 and <1.7 | Medium: ≥1.7 and <2.3 | Low: ≥2.3 and ≤3
Notes: N/A: Not applicable; NR: Not rated
D.5 Data Sources Frequently Used in Occupational Exposure and Release Assessments
A key component in many of the metric criteria is whether the methodology is sound and widely accepted (i.e., from a source generally using sound methods and/or approaches). Table D-6 provides examples of data sources that EPA frequently uses to support the data needs of occupational exposure and release assessments. EPA notes that some data sources may use or include data or information that are not of high quality but are still acceptable (e.g., medium or low quality) for use in risk evaluation. The methodologies in the individual studies under review will still be assessed in relation to chemical- and scenario-specific considerations. Thus, the data source may still receive quality scores ranging from Unacceptable to High even though the 69 -------
Census Bureau North American Industry Classification System (NAICS) Definitions County Business Patterns Annual Survey of Manufacturers Current Industrial Reports Economic Census Bureau of Labor Statistics (BLS) States (e.g., North Carolina Division of Pollution Prevention and Environmental Assistance) Kirk-Othmer Encyclopedia of Chemical Technology Hazardous Substances Data Bank (HSDB) National Library of Medicine's HazMap Note: The list in this table is not intended to be comprehensive but to show examples used by EPA/OPPT in the past. 70 ------- D.6 Data Extraction Templates to Assist the Data Quality Evaluation The reviewer will extract the data or information element from the source into the data extraction table. Tables D-7, D-8, and D-9 are examples of data extraction and evaluation templates. The tables consist of the key data needs elements for occupational exposures and environmental releases, which accompany the inclusion criteria for full text screening as shown in the TSCA problem formulation documents, and also the evaluation elements described above. For each data quality evaluation metric, the reviewer will document relevant metadata in the metadata column and then provide a score, or a notation of not rated or not applicable, in the scoring column based on the quality criteria of the metrics provided in Tables D-ll through D- 20. Metadata are data or information that describe the collected data and include, but are not limited to, the following: • Number of samples collected by authors in a monitoring study; • Number of sites or workers included in a survey; • Full bibliographic information of the data source; • Date of the data source; and • Date of the data within the data source (for example, an article published in 2015 may cite data from 2000). After scorings are complete, the reviewer calculates the overall confidence score and provides the corresponding bin (High, Medium, Low, or Unacceptable). If the source contains more than one data or information element, the reviewer provides an overall confidence rating for each data or information element that is found in the source. Therefore, it is possible that a source may have more than one data or information set or type and associated overall confidence scores. 71 ------- Table D-7. Data Extraction and Evaluation Template for General Life Cycle and Facility Data Data Source (HERO ID) General Life Cycle and Life Cycle Stage Facility Data (note: these apply to both occupational exposures and environmental Life Cycle Description (Subcategory of Use) Process Description Total Annual U.S. Volume (and % of PV) releases) Number of Sites Batch Size Operating Days per Year and Batches per Day Site Daily Throughput Possible Physical Form Chemical Concentration Data Quality Evaluation Domain 1: Reliability Methodology Score Associated Meta Data and Rationale for Score Domain 2: Representativeness Geographic Scope Score Associated Meta Data and Rationale for Score Applicability Score Associated Meta Data and Rationale for Score Temporal representativeness Score Associated Meta Data and Rationale for Score Sample Size Score Associated Meta Data and Rationale for Score Domain 3. Accessibility / Clarity Metadata Completeness Score Associated Meta Data and Rationale for Score Domain 4. Variability and Uncertainty Metadata Completeness Score Associated Meta Data and Rationale for Score Overall Confidence Score 72 ------- Table D-8. 
Data Extraction and Evaluation Template for Occupational Exposure Data Data Source (HERO ID) Occupational Exposure Life Cycle Stage Data Physical Form Route of Exposure Exposure Concentration (Unit) Number of Samples Number of Sites Type of Measurement (e.g., TWA, STEL) or Method (e.g., modeling) Worker Activity (or source of exposure if stationary sampling) or Job Description Number of Workers Type of Sampling (e.g., personal - pump/ passive, stationary) Sampling Location/ Key Environmental Factors (e.g., temperature, humidity) Exposure Duration Exposure Frequency Bulk and Dust Particle Size Distribution Engineering Control & % Exposure Reduction Personal Protective Equipment (PPE) Analytic Method Data Quality Evaluation Domain 1: Reliability Methodology Score Associated Meta Data and Rationale for Score Domain 2: Representativeness Geographic Scope Score Associated Meta Data and Rationale for Score Applicability Score Associated Meta Data and Rationale for Score Temporal representativeness Score Associated Meta Data and Rationale for Score Sample Size Score Associated Meta Data and Rationale for Score Domain 3. Accessibility / Clarity Metadata Completeness Score Associated Meta Data and Rationale for Score Domain 4. Variability and Uncertainty Metadata Completeness Score Associated Meta Data and Rationale for Score Overall Confidence Score 73 ------- Table D-9. Data Extraction and Evaluation Template for Environmental Release Data Data Source (HERO ID) Environmental Release Life Cycle Stage Data Release Source (at the process- or unit-level with the type of waste) Disposal /Treatment Method Environmental Media Release or Emission Factor Release Estimation Method Daily and Annual Release (kg/day) Quantity (kg/yr) Release Days per Year Number of Sites Waste Treatment Method Pollution Prevention / Control & %Efficiency Data Quality Domain 1: Reliability Evaluation Methodology Score Associated Meta Data and Rationale for Score Domain 2: Representativeness Geographic Scope Score Associated Meta Data and Rationale for Score Applicability Score Associated Meta Data and Rationale for Score Temporal representativeness Score Associated Meta Data and Rationale for Score Sample Size Score Associated Meta Data and Rationale for Score Domain 3. Accessibility / Clarity Metadata Completeness Score Associated Meta Data and Rationale for Score Domain 4. Variability and Uncertainty Metadata Completeness Score Associated Meta Data and Rationale for Score Overall Confidence Score 74 ------- D.7 Data Quality Criteria This section presents tables showing quality criteria for the metrics for each data type, including separate tables which summarize the serious flaws which would make the data unacceptable for use in the environmental release and occupational exposure assessment. The overall data confidence level is automatically rated as Unacceptable if any single metric for a data set has a score of 4, or serious flaws that would make the data unusable (or invalid) for the environmental release and occupational exposure assessment. If the source type contains more than one data set or information element, the review provides an overall confidence score for each data set or information element that is found in the source. Therefore, it is possible that a source may have more than one overall quality/ confidence score. 
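As a rough illustration of how the extraction templates in Tables D-7 through D-9 and the per-element scoring described above could be organized, the sketch below groups metric scores and their supporting metadata under each data element of a source, so that a single source can carry several overall confidence scores. The class and field names are hypothetical examples, not a prescribed schema, and the sketch reuses the overall_confidence() function from the earlier example.

```python
# Illustrative record structure mirroring the extraction templates in
# Tables D-7 to D-9; class and field names are hypothetical examples.
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class MetricEvaluation:
    metric: str                      # e.g., "Temporal representativeness"
    score: Optional[int]             # 1-4, or None if not rated/not applicable
    metadata: str = ""               # associated metadata and rationale for score

@dataclass
class DataElement:
    description: str                 # e.g., an exposure concentration data set
    life_cycle_stage: str = ""
    evaluations: List[MetricEvaluation] = field(default_factory=list)

    def overall(self) -> Tuple[Optional[float], str]:
        # Reuses overall_confidence() from the earlier sketch.
        return overall_confidence({e.metric: e.score for e in self.evaluations})

@dataclass
class DataSource:
    hero_id: str                     # HERO ID of the source document
    elements: List[DataElement] = field(default_factory=list)

    def confidence_by_element(self) -> Dict[str, Tuple[Optional[float], str]]:
        # A single source can contain several data elements and therefore
        # carry several overall quality/confidence scores.
        return {el.description: el.overall() for el in self.elements}
```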
D.7.1 Monitoring Data The general approach for setting the criteria for an unacceptable rating is to only assign an unacceptable rating when EPA can confirm that the data or information is unacceptable. If the data source lacks documentation of needed metadata, EPA will not rate the metric as unacceptable but will rate it as low. The reason for this approach is to avoid omitting potentially valid data or information since occupational exposure and release data are often sparse. EPA will not use data/information that exhibit serious flaws as described in Table D-10. Table D-10. Serious Flaws that Would Make Monitoring Data Unacceptable for Use in the Environmental Release and Occupational Exposure Assessment Optimization of the list of serious flaws may occur after pilot calibration exercises. Domain Metric Description of Serious Flaw(s) in Data Reliability Sampling and Analytical Methodology Sampling or analytical methodology is specified and EPA has information that indicates the methodology is unacceptable. Representativeness Geographic Scope This metric does not have an unacceptable criterion since no geographic location is known to have unacceptable data. Applicability The data are from an occupational or non-occupational scenario that does not apply to any occupational scenario within the scope of the risk evaluation. Temporal representativeness Known factors (e.g., new and completely different process or equipment) are so different as to make outdated information unacceptable. Sample Size This metric does not have an unacceptable criterion. Accessibility / Clarity Metadata Completeness Monitoring data do not include any needed metadata to understand what the data represent and are not usable in the risk evaluation. Variability and Uncertainty Metadata Completeness This metric does not have an unacceptable criterion. 75 ------- Table D-ll. Evaluation Criteria for Monitoring Data Confidence Level (Score) Description Selected Score Domain 1. Reliability Metric 1. Sampling and Analytical Methodology High (score = 1) Sampling or analytical methodology is an approved OSHA or NIOSH method or is well described and found to be equivalent to approved OSHA or NIOSH methods. Medium (score = 2) Sampling or analytical methodology is not equivalent to an approved OSHA or NIOSH method and EPA review of information indicates the methodology is acceptable. Differences in methods are not expected to lead to lower quality data. Low (score = 3) Sampling or analytical methodology is not specified. Unacceptable (score = 4) Sampling or analytical methodology is specified and EPA has information that indicates the methodology is unacceptable. Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 2. Representative Metric 2. Geographic Scope High (score = 1) The data are from the United States and are representative of the industry being evaluated. Medium (score = 2) The data are from an OECD country, other than the U.S., and locality-specific factors (e.g., potential differences in regulatory occupational exposure limits, industry/ process technologies) may impact exposures relative to the U.S. 
Low (score = 3) The data are from a non-OECD country, and locality-specific factors (e.g., potentially greater differences in regulatory occupational exposure limits, industry/ process technologies) may impact exposures relative to the U.S., or the country of origin is not specified. Unacceptable (score = 4) This metric does not have an unacceptable criterion since no geographic location is known to have unacceptable data. Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 3. Applicability High (score = 1) The data are for an occupational scenario within the scope of the risk evaluation. Medium (score = 2) The data are for an occupational scenario that is similar to an occupational scenario within the scope of the risk evaluation, in terms of the type of industry, operations, and work activities. Low (score = 3) The data are for a non-occupational scenario that is similar to an occupational scenario within the scope of the risk evaluation, such as a consumer DIY scenario that is similar to a worker scenario. Unacceptable (score = 4) The data are from an occupational or non-occupational scenario that does not apply to any occupational scenario within the scope of the risk evaluation. Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] 76 ------- Confidence Level (Score) Description Selected Score Metric 4. Temporal representativeness High The operations, equipment, and worker activities associated with the data are (score = 1) expected to be representative of current operations, equipment, and activities. The monitoring data were collected after the most recent permissible exposure limit (PEL) establishment or update or are generally, no more than 10 years old, whichever is shorter. If no PEL is established, the data are no more than 10 years old. Metadata on the operations, equipment, and worker activities associated with the data show that the data should be representative of current operations, equipment, and activities. Medium Operations, equipment, and worker activities are expected to be reasonably (score = 2) representative of current conditions. The monitoring data were collected after the most recent PEL establishment or update but are generally more than 10 years old. If no PEL is established, the data are more than 10 years but generally, no more than 20 years old. Low Metadata on the operations, equipment, and worker activities associated with the data (score = 3) show that the data agree representative of outdated operations, equipment, and activities rather than current operations, equipment, and worker activities. The data were collected before the most recent PEL establishment or update or are more than 20 years old if no PEL is established. Unacceptable Known factors (e.g., new and completely different process or equipment) are so (score = 4) different as to make outdated information unacceptable. Reviewer's [Document concerns, uncertainties, limitations, and deficiencies and any additional comments comments that may highlight study strengths or important elements such as relevance] Metric 5. Sample Size High Statistical distribution of samples is fully characterized. (score = 1) Medium Distribution of samples is characterized by a range with uncertain statistics. 
(score = 2) Low Distribution of samples is qualitative or characterized by no statistics. (score = 3) Unacceptable This metric does not have an unacceptable criterion. (score = 4) Reviewer's [Document concerns, uncertainties, limitations, and deficiencies and any additional comments comments that may highlight study strengths or important elements such as relevance] Domain 3. Accessibility / Clarity Metric 6. Metadata Completeness High Monitoring data include all associated metadata, including sample types, exposure (score = 1) types, sample durations, exposure durations worker activities, and exposure frequency. Medium Monitoring data include most critical metadata, such as sample type and exposure (score = 2) type, but lacks additional metadata, such as sample durations, exposure durations, exposure frequency, and/or worker activities. Low Monitoring data include sample type (e.g., personal breathing zone) but no other (score = 3) metadata. Unacceptable Monitoring data do not include any needed metadata to understand what the data (score = 4) represent and are not usable in the risk evaluation. Reviewer's [Document concerns, uncertainties, limitations, and deficiencies and any additional comments comments that may highlight study strengths or important elements such as relevance] 77 ------- Confidence Level (Score) Description Selected Score Domain 4. Variability and Uncertainty Metric 7. Variability and Uncertainty High (score = 1) The monitoring study addresses variability in the determinants of exposure for the sampled site or sector. The monitoring study addresses uncertainty in the exposure estimates or uncertainty can be determined from the sampling and analytical method. Medium (score = 2) The monitoring study provides only limited discussion of the variability in the determinants of exposure for the sampled site or sector. The monitoring study provides only limited discussion of the uncertainty in the exposure estimates. Low (score = 3) The monitoring study does not address variability or uncertainty. Unacceptable (score = 4) This metric does not have an unacceptable criterion. Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Notes: OSHA = Occupational Safety and Health Administration NIOSH = National Institute for Occupational Safety and Health OECD = Organisation for Economic Co-operation and Development PEL = Permissible exposure limit 78 ------- D.7.2 Environmental Release Data The general approach for setting the criteria for an unacceptable rating is to only assign an unacceptable rating when EPA can confirm that the data or information is unacceptable. If the data source lacks documentation of needed metadata, EPA will not rate the metric as unacceptable but will rate it as low. The reason for this approach is to avoid omitting potentially valid data or information since occupational exposure and release data are often sparse. EPA will not use data/information from data sources that exhibit serious flaws as described in Table D-12. Table D-12. Serious Flaws that Would Make Environmental Release Data Unacceptable for Use in the Environmental Release Assessment Optimization of the list of serious flaws may occur after calibrating evaluation tool during pilot exercise. 
Domain Metric Description of Serious Flaw(s) in Data Source Reliability Methodology The release data methodology is specified and EPA has information that indicates the methodology is unacceptable. Representativeness Geographic Scope This metric does not have an unacceptable criterion since no geographic location is known to have unacceptable data. Applicability The release data are from an occupational or non-occupational scenario that does not apply to any occupational scenario within the scope of the risk evaluation. Temporal representativeness Known factors (e.g., new and completely different process or equipment) are so different as to make outdated information unacceptable. Sample Size EPA has information that indicates the samples are not expected to represent the assessed release. Accessibility / Clarity Metadata Completeness Release data do not include any needed metadata to understand what the data represent and are not usable in the risk evaluation. Variability and Uncertainty Metadata Completeness This metric does not have an unacceptable criterion. 79 ------- Table D-13. Evaluation Criteria for Environmental Release Data Confidence Level (Score) Description Selected Score Domain 1. Reliability Metric 1. Methodology High (score = 1) The release data methodology is known or expected (see section D.5 and Table D-6) to be accurate and is known to cover all release sources at the site. Medium (score = 2) The release data methodology is known or expected to be accurate (e.g., see section D.5 and Table D-6) but may not cover all release sources at the site. Low (score = 3) The release data methodology is not specified. Unacceptable (score = 4) The release data methodology is specified and EPA has information that indicates the methodology is unacceptable. Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 2. Representative Metric 2. Geographic Scope High (score = 1) The data are from the United States and are representative of the industry being evaluated. Medium (score = 2) The data are from an OECD country other than the U.S., and locality-specific factors (e.g., potential differences in regulatory emission limits, industry/ process technologies) may impact releases relative to the U.S. Low (score = 3) The data are from a non-OECD country, and locality-specific factors may impact (e.g., potentially greater differences in regulatory emission limits, industry/ process technologies) releases relative to the U.S., or the country of origin is not specified. Unacceptable (score = 4) This metric does not have an unacceptable criterion since no geographic location is known to have unacceptable data. Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 3. Applicability High (score = 1) The release data are for an occupational scenario within the scope of the risk evaluation. Medium (score = 2) The release data are for an occupational scenario that is similar to an occupational scenario within the scope of the risk evaluation, in terms of the type of industry, operations, and work activities. 
Low (score = 3) The release data are for a non-occupational scenario that is similar to an occupational scenario within the scope of the risk evaluation, such as a consumer DIY scenario that is similar to a worker scenario. Unacceptable (score = 4) The release data are from an occupational or non-occupational scenario that does not apply to any occupational scenario within the scope of the risk evaluation. Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 4. Temporal representativeness High (score = 1) The operations, equipment, and worker activities associated with the data indicate that the data should to be representative of current operations, equipment, and activities. The release data were collected after the most recent federal regulatory action (e.g., NESHAP for air release or effluent limit guideline (ELG) for water release) 80 ------- Confidence Level (Score) Description Selected Score or update or are no more than 10 years old, whichever is shorter. If no federal regulation is established, the data are generally no more than 10 years old. Medium (score = 2) The release data were collected after the most recent federal regulatory action or update but are generally, more than 10 years old. If no federal regulation is established, the data are more than 10 years but no more than 20 years old. However, operations, equipment, and worker activities are expected to be reasonably representative of current conditions. Low (score = 3) The data were collected before the most recent federal regulatory action or update or are more than 20 years old if no federal regulation is established. The operations, equipment, and worker activities are not available or indicate that the associated data are expected to be outdated. Unacceptable (score = 4) Known factors (e.g., new and completely different process or equipment) are so different as to make outdated information unacceptable. Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 5. Sample Size High (score = 1) Statistical distribution of samples is fully characterized. Sample size is sufficiently representative. Medium (score = 2) Distribution of samples is characterized by a range with uncertain statistics. It is unclear if analysis is representative. Low (score = 3) Distribution of samples is qualitative or characterized by no statistics. Unacceptable (score = 4) EPA has information that indicates the samples are not expected to represent the assessed release. Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 3. Accessibility / Clarity Metric 6. Metadata Completeness High (score = 1) Release data include all associated metadata, including release media; process, unit operation, or activity that is the source of the release; and release frequency. Medium (score = 2) Release data include most critical metadata, including release media and release frequency, but lacks additional metadata, such as process, unit operation, and/or activity that is the source of the release. Low (score = 3) Release data include release media but no other metadata. 
Unacceptable (score = 4) Release data do not include any needed metadata to understand what the data represent and are not usable in the risk evaluation. Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 4. Variability and Uncertainty Metric 7. Variability and Uncertainty High (score = 1) The release data study addresses variability in the determinants of release. The release data study addresses uncertainty in the release results. Medium (score = 2) The release data study provides only limited discussion of the variability in the determinants of release. The release data study provides only limited discussion of the uncertainty in the release results. Low (score = 3) The release data study does not address variability or uncertainty. 81 ------- Confidence Level(Score) Description Selected Score Unacceptable (score = 4) This metric does not have an unacceptable criterion. Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Notes: DIY = Do it yourself ELG = Effluent limit guideline NESHAP = National Emissions Standards for Hazardous Air Pollutants OECD = Organisation for Economic Co-operation and Development 82 ------- D.7.3 Published Models for Environmental Releases or Occupational Exposures The general approach for setting the criteria for an unacceptable rating is to only assign an unacceptable rating when EPA can confirm that the data or information is unacceptable. If the data source lacks documentation of needed metadata, EPA will not rate the metric as unacceptable but will rate it as low. The reason for this approach is to avoid omitting potentially valid data or information since occupational exposure and release data are often sparse. EPA will not use data/information from data sources that exhibit serious flaws as described in Table D-14. Table D-14. Serious Flaws that Would Make Published Models Unacceptable for Use in the Environmental Release and Occupational Exposure Assessment Optimization of the list of serious flaws may occur after pilot calibration exercises. Domain Metric Description of Serious Flaw(s) in Data Source Reliability Methodology Mathematical equations of the model have significant errors, parameters use erroneous values, or the model is based on flawed logic. Representativeness Geographic Scope This metric does not have an unacceptable criterion since no geographic location is known to have unacceptable data. Applicability The model is not applicable and cannot be adapted to any occupational scenario within the scope of the risk evaluation. Temporal representativeness Known factors (e.g., new and completely different process or equipment) are so different as to make outdated information unacceptable. Accessibility / Clarity Metadata Completeness The model is a "black box" and provides no documentation or clarity of its approaches, equations, and parameter values. Variability and Uncertainty Metadata Completeness This metric does not have an unacceptable criterion. 83 ------- Table D-15. Evaluation Criteria for Published Models EPA will consult with the Guidance on the Development, Evaluation, and Application of Environmental Models (U.S. EPA. 2009) when evaluating models and modeling data types. Confidence Level (Score) Description Selected Score Domain 1. Reliability Metric 1. 
Methodology High (score = 1) The model is free of mathematical errors and is based on scientifically sound approaches or methods. Equations and choice of parameter values are appropriate for the model's application (note: peer review may address appropriate application). Medium (score = 2) The model is free of mathematical errors and is based on scientifically sound approaches or methods. However, equations and choice of parameter values are not fully described and some equations and/or parameter values may not be appropriate for the model's application. Low (score = 3) The model is free of mathematical errors. However, the model makes assumptions or uses parameter values that lead to significant uncertainties. Unacceptable (score = 4) Mathematical equations of the model have significant errors, parameters use erroneous values, or the model is based on flawed logic. Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 2. Representative Metric 2. Geographic Scope High (score = 1) The data are from the United States and are representative of the industry being evaluated. Medium (score = 2) The data are from an OECD country other than the U.S., and locality-specific factors (e.g., potential differences in regulatory occupational exposure or emission limits, industry/ process technologies) may impact exposures or releases relative to the U.S. Low (score = 3) The data are from a non-OECD country, and locality-specific factors (e.g., potentially greater differences in regulatory occupational exposure or emission limits, industry/ process technologies) may impact exposures or releases relative to the U.S., or the country of origin is not specified. Unacceptable (score = 4) This metric does not have an unacceptable criterion since no geographic location is known to have unacceptable data. Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 3. Applicability High (score = 1) The model can be appropriately applied to an occupational scenario within the scope of the risk evaluation. Medium (score = 2) Not applicable: this domain is dichotomous: applicable or not applicable. Low (score = 3) Not applicable: this domain is dichotomous: applicable or not applicable. Can a poor fit model be used? Unacceptable (score = 4) The model is not applicable and cannot be adapted to any occupational scenario within the scope of the risk evaluation. Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] 84 ------- Confidence Level (Score) Description Selected Score Metric 4. Temporal representativeness High (score = 1) The model is based on operations, equipment, and worker activities expected to be representative of current conditions. The model is based on data that are generally no more than 10 years old. Medium (score = 2) The model is based on data that are generally more than 10 years but no more than 20 years old. However, the model is based on operations, equipment, and worker activities are expected to be reasonably representative of current conditions. Low (score = 3) The model is based on data that are more than 20 years old. 
The model is based on operations, equipment, and worker activities that are expected to be outdated. Unacceptable (score = 4) Known factors (e.g., new and completely different process or equipment) are so different as to make outdated information unacceptable. Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 3. Accessibility / Clarity Metric 6. Metadata Completeness High (score = 1) Model approach, equations, and choice of parameter values are transparent and clear and can be evaluated. Rationale for selection of approach, equations, and parameter values is provided. Medium (score = 2) Model approach, equations, and choice of parameter values are transparent. However, rationale for selection of approach, equations, and parameter values is not provided. Low (score = 3) The model documentation describes the approach and parameters, but the equations and/or selection of parameter values are not provided. Rationale for modeling approach and parameter value selection is not provided. Unacceptable (score = 4) The model is a "black box" and provides no documentation or clarity of its approaches, equations, and parameter values. Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 4. Variability and Uncertainty Metric 7. Variability and Uncertainty High (score = 1) The model characterizes variability and uncertainty in the results. Medium (score = 2) The model has limited characterization of the variability of parameter values. The model has limited characterization of the uncertainty in the results. Low (score = 3) The model does not characterize variability or uncertainty. Unacceptable (score = 4) This metric does not have an unacceptable criterion. Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Note: OECD = Organisation for Economic Co-operation and Development 85 ------- D.7.4 Data/Information from Completed Exposure or Risk Assessments The general approach for setting the criteria for an unacceptable rating is to only assign an unacceptable rating when EPA can confirm that the data or information is unacceptable. If the data source lacks documentation of needed metadata, EPA will not rate the metric as unacceptable but will rate it as low. The reason for this approach is to avoid omitting potentially valid data or information since occupational exposure and release data are often sparse. EPA will not use data/information from data sources that exhibit serious flaws as described in Table D-16. Table D-16. Serious Flaws that Would Make Data/Information from Completed Exposure or Risk Assessments Unacceptable for Use in the Environmental Release and Occupational Exposure Assessment Optimization of the list of serious flaws may occur after pilot calibration exercises. Domain Metric Description of Serious Flaw(s) in Data Source Reliability Methodology The assessment or report uses data or techniques or methods that are not consistent with the best available science. Assumptions, extrapolations, measurements, and models are not appropriate. There appears to be mathematical errors or errors in logic. 
Representativeness Geographic Scope This metric does not have an unacceptable criterion since no geographic location is known to have unacceptable data. Applicability The assessment is from an occupational or non-occupational scenario that does not apply to any occupational scenario within the scope of the risk evaluation. Temporal representativeness Known factors (e.g., new and completely different process or equipment) are so different as to make outdated information unacceptable. Sample Size This metric does not have an unacceptable criterion. Accessibility / Clarity Metadata Completeness Assessment or report does not document its data sources, assessment methods, and assumptions. Variability and Uncertainty Metadata Completeness This metric does not have an unacceptable criterion. 86 ------- Table D-17. Evaluation Criteria for Data/Information from Completed Exposure or Risk Assessments Confidence Level (Score) Description Selected Score Domain 1. Reliability Metric 1. Methodology The assessment or report uses high quality data and/or techniques or sound methods High (score = 1) that are from a frequently used source (e.g., European Union or OECD reports, NIOSH HHEs, journal articles, Kirk-Othmer; see section D.5 and Table D-6) and are generally accepted by the scientific community, and associated information does not indicate flaws or quality issues. Medium (score = 2) The assessment or report uses high quality data and/or techniques or sound methods that are not from a frequently used source, and associated information does not indicate flaws or quality issues. Low The data, data sources, and/or techniques or methods used in the assessment or (score = 3) report are not specified. Unacceptable (score = 4) The assessment or report uses data or techniques or methods that are not consistent with the best available science. Assumptions, extrapolations, measurements, and models are not appropriate. There appears to be mathematical errors or errors in logic. Reviewer's [Document concerns, uncertainties, limitations, and deficiencies and any additional comments comments that may highlight study strengths or important elements such as relevance] Domain 2. Representative Metric 2. Geographic Scope High The data are from the United States and are representative of the industry being (score = 1) evaluated. Medium (score = 2) The data are from an OECD country other than the U.S., and locality-specific factors (e.g., potential differences in regulatory occupational exposure or emission limits, industry/ process technologies) may impact exposures or releases relative to the U.S. The data are from a non-OECD country, and locality-specific factors (e.g., potentially Low greater differences in regulatory occupational exposure or emission limits, industry/ (score = 3) process technologies) may impact exposures or releases relative to the U.S. or the country of origin is not specified. Unacceptable This metric does not have an unacceptable criterion since no geographic location is (score = 4) known to have unacceptable data. Reviewer's [Document concerns, uncertainties, limitations, and deficiencies and any additional comments comments that may highlight study strengths or important elements such as relevance] Metric 3. Applicability High The assessment is for an occupational scenario within the scope of the risk evaluation. 
(score = 1) Medium (score = 2) The assessment is for an occupational scenario that is similar to an occupational scenario within the scope of the risk evaluation, in terms of the type of industry, operations, and work activities. Low (score = 3) The assessment is for a non-occupational scenario that is similar to an occupational scenario within the scope of the risk evaluation, such as a consumer DIY scenario that is similar to a worker scenario. Unacceptable The assessment is from an occupational or non-occupational scenario that does not (score = 4) apply to any occupational scenario within the scope of the risk evaluation. Reviewer's [Document concerns, uncertainties, limitations, and deficiencies and any additional comments comments that may highlight study strengths or important elements such as relevance] Metric 4. Temporal representativeness High The assessment captures operations, equipment, and worker activities expected to be (score = 1) representative of current conditions. EPA has no reason to believe exposures have changed. The completed exposure or risk assessment is generally no more than 10 years old. Medium The assessment captures operations, equipment, and worker activities that are (score = 2) expected to be reasonably representative of current conditions. The completed exposure or risk assessment is generally, more than 10 years but no more than 20 87 ------- Confidence Level (Score) Description Selected Score years old. Low (score = 3) The completed exposure or risk assessment is more than 20 years old. The assessment captures operations, equipment, and worker activities that are expected to be outdated. Unacceptable (score = 4) Known factors (e.g., new and completely different process or equipment) are so different as to make outdated information unacceptable. Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 5. Sample Size High (score = 1) Statistical distribution of samples is fully characterized. Sample size is sufficiently representative. Medium (score = 2) Distribution of samples is characterized by a range with uncertain statistics. It is unclear if analysis is representative. Low (score = 3) Distribution of samples is qualitative or characterized by no statistics. Unacceptable (score = 4) This metric does not have an unacceptable criterion. Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 3. Accessibility / Clarity Metric 6. Metadata Completeness High (score = 1) Assessment or report clearly documents its data sources, assessment methods, results, and assumptions. Medium (score = 2) Assessment or report clearly documents results, methods, and assumptions. Data sources are generally described but not fully transparent. Low (score = 3) Assessment or report provides results, but the underlying methods, data sources, and assumptions are not fully transparent. Unacceptable (score = 4) Assessment or report does not document its data sources, assessment methods, and assumptions. Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 4. Variability and Uncertainty Metric 7. 
Variability and Uncertainty High (score = 1) The assessment addresses variability and uncertainty in the results. Uncertainty is well characterized. Medium (score = 2) The assessment provides only limited discussion of the variability and uncertainty in the results. Low (score = 3) The assessment does not address variability or uncertainty. Unacceptable (score = 4) This metric does not have an unacceptable criterion. Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Notes: HHE = Health Hazard Evaluations NIOSH = National Institute for Occupational Safety and Health OECD = Organisation for Economic Co-operation and Development 88 ------- D.7.5 Data/Information from Reports Containing Other than Exposure or Release Data The general approach for setting the criteria for an unacceptable rating is to only assign an unacceptable rating when EPA can confirm that the data or information is unacceptable. If the data source lacks documentation of needed metadata, EPA will not rate the metric as unacceptable but will rate it as low. The reason for this approach is to avoid omitting potentially valid data or information since occupational exposure and release data are often sparse. EPA will not use data/information from data sources that exhibit serious flaws as described in Table D-18. Table D-18. Serious Flaws that Would Make Data / Information from Reports Containing Other than Exposure or Release Data Unacceptable for Use in the Environmental Release and Occupational Exposure Assessment Optimization of the list of serious flaws may occur after pilot calibration exercises. Domain Metric Description of Serious Flaw(s) in Data Source Reliability Methodology The assessment or report uses data or techniques or methods that are not consistent with the best available science. Assumptions, extrapolations, measurements, and models are not appropriate. There appears to be mathematical errors or errors in logic. Representativeness Geographic Scope This metric does not have an unacceptable criterion since no geographic location is known to have unacceptable data. Applicability The report is from an occupational or non-occupational scenario that does not apply to any occupational scenario within the scope of the risk evaluation Temporal representativeness Known factors (e.g., new and completely different process or equipment) are so different as to make outdated information unacceptable. Sample Size This metric does not have an unacceptable criterion. Accessibility / Clarity Metadata Completeness Assessment or report does not document its data sources, assessment methods, and assumptions. Variability and Uncertainty Metadata Completeness This metric does not have an unacceptable criterion. 89 ------- Table D-19. Evaluation Criteria for Data /Information Reports Containing Other than Exposure or Release Data Confidence Level (Score) Description Selected Score Domain 1. Reliability Metric 1. Methodology The assessment or report uses high quality data and/or techniques or sound methods High (score = 1) that are from frequently used sources (e.g., European Union or OECD reports, NIOSH HHEs, journal articles, Kirk-Othmer; see section D.5 and Table D-6) and are generally accepted by the scientific community, and associated information does not indicate flaws or quality issues. 
Medium (score = 2) The assessment or report uses high quality data and/or techniques or sound methods that are not from a frequently used source and associated information does not indicate flaws or quality issues. Low The data, data sources, and/or techniques or methods used in the assessment or (score = 3) report are not specified. The assessment or report uses data or techniques or methods that are not high quality Unacceptable or not consistent with the best available science. Assumptions, extrapolations, (score = 4) measurements, and models are not appropriate. There appears to be mathematical errors or errors in logic. Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 2. Representative Metric 2. Geographic Scope High The data are from the United States and are representative of the industry being (score = 1) evaluated. Medium (score = 2) The data are from an OECD country other than the U.S., and locality-specific factors (e.g., potential differences in regulatory occupational exposure or emission limits, industry/ process technologies) may impact exposures or releases relative to the U.S. The data are from a non-OECD country, and locality-specific factors (e.g., potentially Low greater differences in regulatory occupational exposure or emission limits, industry/ (score = 3) process technologies) may impact exposures or releases relative to the U.S., or the country of origin is not specified. Unacceptable This metric does not have an unacceptable criterion since no geographic location is (score = 4) known to have unacceptable data. Reviewer's [Document concerns, uncertainties, limitations, and deficiencies and any additional comments comments that may highlight study strengths or important elements such as relevance] Metric 3. Applicability High The report is for an occupational scenario within the scope of the risk evaluation. (score = 1) Medium (score = 2) The report is for an occupational scenario that is similar to an occupational scenario within the scope of the risk evaluation, in terms of the type of industry, operations, and work activities. Low (score = 3) The report is for a non-occupational scenario that is similar to an occupational scenario within the scope of the risk evaluation, such as a consumer DIY scenario that is similar to a worker scenario. Unacceptable The report is from an occupational or non-occupational scenario that does not apply to (score = 4) any occupational scenario within the scope of the risk evaluation. Reviewer's [Document concerns, uncertainties, limitations, and deficiencies and any additional comments comments that may highlight study strengths or important elements such as relevance] Metric 4. Temporal representativeness High The report captures operations, equipment, and worker activities expected to be (score = 1) representative of current conditions. The report is generally no more than 10 years old. Medium The report captures operations, equipment, and worker activities that are expected to 90 ------- Confidence Level (Score) Description Selected Score (score = 2) be reasonably representative of current conditions. The report is generally more than 10 years but no more than 20 years old. Low (score = 3) The report is more than 20 years old. The report captures operations, equipment, and worker activities that are expected to be outdated. 
Unacceptable (score = 4) Known factors (e.g., new and completely different process or equipment) are so different as to make outdated information unacceptable. Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 5. Sample Size High (score = 1) Statistical distribution of samples is fully characterized. Sample size is sufficiently representative. Medium (score = 2) Distribution of samples is characterized by a range with uncertain statistics. It is unclear if analysis is representative. Low (score = 3) Distribution of samples is qualitative or characterized by no statistics. Unacceptable (score = 4) This metric does not have an unacceptable criterion. Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 3. Accessibility / Clarity Metric 6. Metadata Completeness High (score = 1) Assessment or report clearly documents its data sources, assessment methods, results, and assumptions. Medium (score = 2) Assessment or report clearly documents results, methods, and assumptions. Data sources are generally described but not fully transparent. Low (score = 3) Assessment or report provides results, but the underlying methods, data sources, and assumptions are not fully transparent. Unacceptable (score = 4) Assessment or report does not document its data sources, assessment methods, and assumptions. Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 4. Variability and Uncertainty Metric 7. Variability and Uncertainty High (score = 1) The report addresses variability and uncertainty in the results. Uncertainty is well characterized. Medium (score = 2) The report provides only limited discussion of the variability and uncertainty in the results. Low (score = 3) The report does not address variability or uncertainty. Unacceptable (score = 4) This metric does not have an unacceptable criterion. Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Notes: HHE = Health Hazard Evaluation NIOSH = National Institute for Occupational Safety and Health OECD = Organisation for Economic Co-operation and Development 91 ------- D.8 References 1. ECHA. (2011). Guidance on information requirements and chemical safety assessment. Chapter R.3: Information gathering. https://hero.epa.gov/heronet/index.cfm/reference/download/reference id/4262857. 2. Moermond, CB, A. Breton, R. Junghans, M. Laskowski, R. Solomon, K. Zahner, H. (2016). Assessing the reliability of ecotoxicological studies: An overview of current needs and approaches. Integr Environ Assess Manag. 13: 1-12. http://dx.doi.org/10.1002/ieam.1870: http://onlinelibrarv.wilev.com/store/10.1002/ieam.l870/asset/ieaml870.pdf?v=l&t=ierdoypz&s=e e96db9e589f470debl0651cdbl460d9ada93486. 3. U.S. EPA. (2009). Guidance on the Development, Evaluation, and Application of Environmental Models. (EPA/100/K-09/003). Washington, DC: Office of the Science Advisor. https://hero.epa.gov/heronet/index.cfm/reference/download/reference id/4262976. 
92 ------- APPENDIX E: DATA QUALITY CRITERIA FOR STUDIES ON CONSUMER, GENERAL POPULATION AND ENVIRONMENTAL EXPOSURE E.l Types of Consumer, General Population and Environmental Exposure Data Sources The data quality of consumer, general population, and environmental exposure data sources will be evaluated for seven different types of data sources: monitoring data, modeling data, survey-based data, epidemiological based data, experimental data, completed exposure assessments and risk characterizations, and database sources not unique to a chemical. Definitions for these data types are shown below in Table E-l. Table E-l. Types of Exposure Data Sources Type of Data Source Definition Monitoring Data Measured chemical concentration(s) obtained from sampling of environmental media (e.g., air, water, soil, and biota) to observe and study conditions of the environment. Monitoring data also include measured concentrations of chemicals or their metabolites in biological matrices (i.e., blood, urine, breastmilk, breath, hair, and organs) that provide direct evidence about exposure of environmental contaminants in humans and wildlife, as well as measured chemical concentrations obtained from personal exposure monitoring (i.e., breathing zone, skin patch samples). Modeling Data Calculated values derived from computational models for estimation of environmental concentrations (i.e., indoor, outdoor, microenvironments) and uptakes (e.g., ADD, LADD, Cmax, or AUC) associated with relevant exposure scenarios and routes (i.e., inhalation, oral, dermal). Survey-based Data Data collected from survey questionnaires about activity and use patterns (e.g., habits, practices, food intake) to evaluate exposure to an individual, a population segment or a population. Epidemiological Data Exposure data obtained from epidemiological studies collected as part of the examination of the association between chemical exposure and the occurrence and causes of health effects in human populations. The data may also come from case study reports which characterize exposures to one person. Experimental Data Data obtained from experimental studies conducted in a controlled environment with pre- defined testing conditions. Examples include data from laboratory/chamber tests such as those conducted for product testing, source characterization, emissions testing, and migration testing. Experimental data may also include chemical concentrations from personal exposure or biomonitoring studies conducted in laboratory/chamber test settings. Completed Exposure Assessments and Risk Characterizations Data reported in completed exposure assessments and risk characterizations containing a broad range of exposure data types (e.g., media concentrations, doses, estimated values, exposure factors). Examples: ATSDR assessments, risk assessments completed by other countries. Database Sources Not Unique to a Chemical Data obtained from large databases which collate information for a wide variety of chemicals using methods that are reasonable and consistent with sound scientific theory and/or accepted approaches, and are from sources generally using sound methods and/or approaches (e.g., state or federal governments, academia). Example databases: NHANES, STORET. 
Notes: ADD = Average daily dose LADD = Lifetime average daily dose ATSDR = Agency for Toxic Substances and Disease Registry NHANES = National Health and Nutrition Examination AUC = Area under the curve Survey Cmax = maximum concentration in plasma STORET = Storage and Retrieval for Water Quality Data database 93 ------- In general, the studies will inform the following basic data needs for exposures assessment (NRC. 1991): • measures or estimates of the chemical • the source of the chemical exposure • environmental media of exposure • specific populations exposed, including potentially exposed or susceptible subpopulations • intensity and frequency of contact • spatial and temporal concentration patterns Some data sources identified as on-topic26 for consumer, general population, and environmental exposure will also be identified as on-topic for the other disciplines (Engineering, Fate, Human Health Hazard, Environmental Health Hazard) supporting the development of the TSCA risk evaluations. In these cases, each discipline will consider different aspects of the same study. This is the case for epidemiological studies which examine disease patterns among populations during a specific duration of time. While the human health assessors are primarily interested in the hazards and effects that exposure to pollutants have on key biological, chemical, and physical processes affecting human health, exposure assessors are primarily interested in estimating exposure via direct measurements (e.g., media concentrations coupled with uptake rates, biomonitoring concentrations) or modeling. EPA anticipates that many epidemiological studies will need to be assessed by both the exposure and the human health assessors. E.2 Data Quality Evaluation Domains The data sources will be evaluated against the following four data quality evaluation domains: reliability, representativeness, accessibility/clarity, and variability and uncertainty. These domains, as defined in Table E-2, address elements of TSCA Science Standards 26(h)(1) through 26(h)(5). Table E-2. Data Evaluation Domains and Definitions Evaluation Domain Definition Reliability The inherent property of a study, which includes the use of well-founded scientific approaches, the avoidance of bias within the study design and faithful study conduct and documentation (ECHA, 2011a). Representativeness The data reported address exposure scenarios (e.g., sources, pathways, routes, receptors) that are relevant to the assessment. Accessibility/Clarity The data and supporting information are accessible and clearly documented. Variability and Uncertainty The data describe variability and uncertainty (quantitative and qualitative) or the procedures, measures, methods, or models are evaluated and characterized. 26 For the scoping phase, EPA/OPPT developed specific criteria to determine which references should be tagged as "on-topic" (inclusion criteria) and "off-topic" (exclusion criteria). Refer to the literature search strategies and bibliographies developed for each of the 10 existing chemicals under evaluation. https://www.epa.gov/assessing-and-managing-chemicals-under-tsca/risk-evaluations-existing-chemicals- under-tsca 94 ------- E.3 Data Quality Evaluation Metrics The data quality evaluation domains will be evaluated by assessing unique metrics that have been developed for each data type. A summary of the number of metrics and metric name for each data type is provided in Table E-3. 
EPA may adjust these metrics as more experience is acquired with the evaluation tools to support fit-for-purpose TSCA risk evaluations. If this happens, EPA will document the changes to the evaluation tool.

Table E-3. Summary of Metrics for the Seven Data Types
(The number of metrics listed for each data type is the count across all evaluation domains.)

Monitoring Data (10 metrics): Sampling Methodology; Analytical Methodology; Selection of Biomarker of Exposure; Geographic Area; Temporality; Spatial and Temporal Variability; Exposure Scenario; Reporting of Results; Quality Assurance; Variability and Uncertainty

Modeling Data (6 metrics): Mathematical Equations; Model Evaluation; Exposure Scenario; Model and Model Documentation Availability; Model Inputs and Defaults; Variability and Uncertainty

Survey-based Data (8 metrics): Data Collection Methodology; Data Analysis Methodology; Geographic Area; Sampling/Sampling Size; Response Rate; Reporting of Results; Quality Assurance; Variability and Uncertainty

Epidemiological Data (18 metrics): Measurement or Exposure Characterization; Reporting Bias; Exposure Variability and Misclassification; Sample Contamination; Method Requirements; Matrix Adjustment; Method Sensitivity; Stability; Use of Biomarker of Exposure; Relevance; Population; Participant Selection; Comparison Group; Attrition; Documentation; QA/QC; Variability; Uncertainties

Experimental Data (9 metrics): Sampling Methodology and Conditions; Analytical Methodology; Selection of Biomarker of Exposure; Testing Scenario; Sample Size and Variability; Temporality; Reporting of Results; Quality Assurance; Variability and Uncertainty

Completed Exposure Assessments and Risk Characterizations (4 metrics): Methodology; Exposure Scenario; Documentation of References; Variability and Uncertainty

Database Sources Not Unique to a Chemical (8 metrics): Sampling Methodology; Analytical Methodology; Geographic Area; Temporal; Exposure Scenario; Availability of Database and Supporting Documents; Reporting of Results; Variability and Uncertainty

E.4 Scoring Method and Determination of Overall Data Quality Level

A scoring system will be used to assign the overall quality of the data source, as discussed in Appendix A.

E.4.1 Weighting Factors

EPA/OPPT is not applying weighting factors to the general population, consumer, and environmental exposure data types. In practice, this is equivalent to assigning each metric a weighting factor of 1, so that every metric carries equal weight. This approach was adopted because the data sources within and across each data type exhibit a wide range of objectives and variations in their protocols, making it difficult to apply a standard weighting scheme fairly to all studies. In addition, weighting occurs inherently for most data types because more metrics are assigned to the reliability and representativeness domains (when combined) than to the accessibility/clarity and variability/uncertainty domains. This is consistent with the view that the reliability and representativeness domains are more important than the other domains because they address fundamental aspects of the study.

E.4.2 Calculation of Overall Study Score

To determine the overall study score, the first step is to multiply the score for each metric (1, 2, or 3 for high, medium, or low confidence, respectively) by the appropriate weighting factor, as shown in Table E-4, to obtain a weighted metric score.
The weighted metric scores are then summed and divided by the sum of the weighting factors (for all metrics that are scored) to obtain an overall study score between 1 and 3. The equation for calculating the overall score is shown below. Although weighting factors are not used, the equation retains the Weighting Factor term (equal to 1) to be transparent about the calculation and to provide a consistent equation among the disciplines:

Overall Score (range of 1 to 3) = Σ (Metric Score × Weighting Factor) / Σ (Weighting Factors)

Table E-4 provides an example scoring for monitoring data. Studies with any single metric scored as 4 will automatically be assigned an overall quality level of Unacceptable, and further evaluation of the remaining metrics is not necessary. An Unacceptable score means that serious flaws are noted in the domain metric that consequently make the data unusable (or invalid). EPA/OPPT plans to use data with an overall quality level of High, Medium, or Low to quantitatively or qualitatively support the risk evaluations, but does not plan to use data rated as Unacceptable.

Any metrics that are not rated/not applicable to the study under evaluation will not be considered in the calculation of the study's overall quality score. These metrics will not be included in the numerator or denominator of the overall score equation; the overall score will be calculated using only those metrics that receive a numerical score. In addition, if a publication reports more than one study or endpoint, each study and, as needed, each endpoint will be evaluated separately.

Detailed tables showing quality criteria for the metrics are provided in Tables E-6 through E-18, including tables that summarize the serious flaws that would make the data unacceptable for use in the exposure assessment.

Table E-4. Scoring Example for Monitoring Data

Metric 1 (Sampling Methodology): metric score 1 × weighting factor 1 = weighted score 1
Metric 2 (Analytical Methodology): metric score 2 × weighting factor 1 = weighted score 2
Metric 3 (Selection of Biomarker of Exposure): metric score 2 × weighting factor 1 = weighted score 2
Metric 4 (Geographic Area): metric score 1 × weighting factor 1 = weighted score 1
Metric 5 (Temporality): metric score 1 × weighting factor 1 = weighted score 1
Metric 6 (Spatial and Temporal Variability): metric score 1 × weighting factor 1 = weighted score 1
Metric 7 (Exposure Scenario): metric score 3 × weighting factor 1 = weighted score 3
Metric 8 (Reporting of Results): metric score 1 × weighting factor 1 = weighted score 1
Metric 9 (Quality Assurance): metric score 2 × weighting factor 1 = weighted score 2
Metric 10 (Variability and Uncertainty): metric score 2 × weighting factor 1 = weighted score 2

Sum of weighted metric scores = 16; sum of metric weighting factors = 10
Overall Score = Σ (Metric Score × Metric Weighting Factor) / Σ (Metric Weighting Factors) = 16/10 = 1.6
Quality levels: High (>1 and <1.7), Medium (>1.7 and <2.3), Low (>2.3 and <3)
Overall Score: 1.6 (High)
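To make the Section E.4.2 calculation concrete, the short sketch below reproduces the Table E-4 monitoring-data example: it applies the weighted-average equation, the rule that any single metric scored 4 renders the source Unacceptable, and the High/Medium/Low ranges shown above. This is only an illustration written for this document, not EPA's evaluation tool; the function and variable names, and the handling of scores that fall exactly on a range boundary, are assumptions.

```python
from typing import Dict, Optional

def overall_study_score(metric_scores: Dict[str, Optional[int]],
                        weighting_factor: float = 1.0) -> str:
    """Compute an overall data quality level per Section E.4.2.

    metric_scores maps each metric name to 1 (high), 2 (medium), 3 (low),
    4 (unacceptable), or None (not rated/not applicable).
    """
    scored = {m: s for m, s in metric_scores.items() if s is not None}

    # Any single metric scored 4 makes the whole data source Unacceptable.
    if any(s == 4 for s in scored.values()):
        return "Unacceptable"

    # Weighted average over the metrics that received a numerical score.
    weighted_sum = sum(s * weighting_factor for s in scored.values())
    total_weight = weighting_factor * len(scored)
    score = weighted_sum / total_weight

    # Quality-level cut points follow the Table E-4 example; exact boundary
    # handling here is an assumption.
    if score < 1.7:
        level = "High"
    elif score < 2.3:
        level = "Medium"
    else:
        level = "Low"
    return f"{level} (overall score = {score:.1f})"

# Reproduces the Table E-4 monitoring-data example: 16/10 = 1.6 -> High.
example = {"Sampling Methodology": 1, "Analytical Methodology": 2,
           "Selection of Biomarker of Exposure": 2, "Geographic Area": 1,
           "Temporality": 1, "Spatial and Temporal Variability": 1,
           "Exposure Scenario": 3, "Reporting of Results": 1,
           "Quality Assurance": 2, "Variability and Uncertainty": 2}
print(overall_study_score(example))  # High (overall score = 1.6)
```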
E.5 Data Sources Frequently Used in Consumer, General Population and Environmental Exposure Assessments

Many of the metric criteria definitions for the confidence levels (i.e., high, medium, low, and unacceptable) examine whether the methodology used was sound and widely accepted. Table E-5 provides examples of data sources that EPA frequently uses to support the data needs of consumer, general population and environmental exposure assessments. EPA notes that some data sources in Table E-5 may use or include data or information that are not of high quality but are still acceptable (e.g., medium or low quality) for use in risk evaluation. The methodologies in the individual studies under review will still be assessed in relation to chemical- and scenario-specific considerations; thus, a study may still receive a quality score ranging from unacceptable to high even though it used a methodology from a source commonly known to use sound methods and/or approaches. EPA may determine standard quality ratings for some of these sources as more experience is acquired with TSCA risk evaluations.

Table E-5. Examples of Data Sources Frequently Used for Consumer, General Population and Environmental Exposure Assessments

U.S. EPA: Chemical Data Reporting (CDR); High Production Volume (HPV) Challenge Submissions; Extra HPV Program Submissions; EPA Existing Chemicals Engineering Files; EPA Generic Scenarios; Toxics Release Inventory (TRI); National Emissions Inventory (NEI); Office of Water; Office of Air; Office of Enforcement and Compliance Assistance Sector Notebooks; AP-42; Other EPA Programs (e.g., Design for Environment)
Occupational Safety and Health Administration (OSHA)
National Institute of Occupational Safety and Health (NIOSH)
American Conference of Governmental Industrial Hygienists (ACGIH)
Agency for Toxic Substances and Disease Registry (ATSDR)
Organisation for Economic Co-operation and Development (OECD): Screening Information Dataset (SIDS); Emission Scenario Documents (ESDs); Other Programs
Environment Canada: Canadian Pollution Prevention Information Clearinghouse; Other Programs
U.S. Census Bureau: North American Industry Classification System (NAICS) Definitions; County Business Patterns; Annual Survey of Manufacturers; Current Industrial Reports; Economic Census
Bureau of Labor Statistics (BLS)
North Carolina Division of Pollution Prevention and Environmental Assistance
Kirk-Othmer Encyclopedia of Chemical Technology
Hazardous Substances Data Bank (HSDB)
National Library of Medicine's HazMap

E.6 Data Quality Criteria

E.6.1 Monitoring Data

Table E-6. Serious Flaws that Would Make Sources of Monitoring Data Unacceptable for Use in the Exposure Assessment
Optimization of the list of serious flaws may occur after pilot calibration exercises.

Reliability domain:
Sampling Methodology: The sampling methodology is not discussed in the data source or companion source. Sampling methodology is not scientifically sound or is not consistent with widely accepted methods/approaches for the chemical and media being analyzed (e.g., inappropriate sampling equipment, improper storage conditions). There are numerous inconsistencies in the reporting of sampling information, resulting in high uncertainty in the sampling methods used.
Analytical Methodology: Analytical methodology is not described, including analytical instrumentation (i.e., HPLC, GC). Analytical methodology is not scientifically appropriate for the chemical and media being analyzed (e.g., method not sensitive enough, not specific to the chemical, out of date). There are numerous inconsistencies in the reporting of analytical information, resulting in high uncertainty in the analytical methods used.
Selection of Biomarker of Exposure: This metric does not have an unacceptable criterion.

Representativeness domain:
Geographic Area: Geographic location is not reported, discussed, or referenced.
Temporality: Timing of sample collection for monitoring data is not reported, discussed, or referenced.
Spatial and Temporal Variability: Sample size is not reported. Single sample collected per data set. For biomonitoring studies, the timing of sample collection is not appropriate based on chemical properties (e.g., half-life), the pharmacokinetics of the chemical (e.g., rate of uptake and elimination), and when the exposure event occurred.
Exposure Scenario If reported, the exposure scenario discussed in the monitored study does not represent the exposure scenario of interest for the chemical. Accessibility / Clarity Reporting of Results There are numerous inconsistencies or errors in the calculation and/or reporting of results, resulting in highly uncertain reported results. Quality Assurance QA/QC issues have been identified which significantly interfere with the overall reliability of the study. Variability and Uncertainty Variability and Uncertainty Estimates are highly uncertain based on characterization of variability and uncertainty. Notes: GC = Gas chromatography HPLC = High pressure liquid chromatography QA/QC = Quality assurance/quality control 99 ------- Table E-7. Evaluation Criteria for Sources of Monitoring Data Confidence Level (Score) Description Selected Score Domain 1. Reliability Metric 1. Sampling Methodology High (score = 1) • Samples were collected according to publicly available SOPs that are scientifically sound and widely accepted (i.e., from a source generally using sound methods and/or approaches) for the chemical and media of interest. Example SOPs include USGS's "National Field Manual for the Collection of Water-Quality Data", EPA's "Ambient Air Sampling" (SESDPROC-303-R5), etc. OR • The sampling protocol used was not a publicly available SOP from a from a source generally using sound methods and/or approaches, but the sampling methodology is clear, appropriate (i.e., scientifically sound), and similar to widely accepted protocols for the chemical and media of interest. All pertinent sampling information is provided in the data source or companion source. Examples include: > sampling equipment > sampling procedures/regime > sample storage conditions/duration > performance/calibration of sampler > study site characteristics > matrix characteristics Medium (score = 2) • Sampling methodology is discussed in the data source or companion source and is generally appropriate (i.e., scientifically sound) for the chemical and media of interest, however, one or more pieces of sampling information is not described. The missing information is unlikely to have a substantial impact on results. OR • Standards, methods, protocols, or test guidelines may not be widely accepted, but a successful validation study for the new/unconventional procedure was conducted prior to the sampling event and is consistent with sound scientific theory and/or accepted approaches. Or a review of information indicates the methodology is acceptable and differences in methods are not expected to lead to lower quality data. Low (score = 3) • Sampling methodology is only briefly discussed; therefore, most sampling information is missing and likely to have a substantial impact on results. AND/OR • The sampling methodology does not represent best sampling methods, protocols, or guidelines for the chemical and media of interest (e.g., outdated (but still valid) sampling equipment or procedures, long storage durations). AND/OR • There are some inconsistencies in the reporting of sampling information (e.g., differences between text and tables in data source, differences between standard method and actual procedures reported to have been used, etc.) which lead to a low confidence in the sampling methodology used. Unacceptable (score = 4) • The sampling methodology is not discussed in the data source or companion source. 
AND/OR • Sampling methodology is not scientifically sound or is not consistent with widely accepted methods/approaches for the chemical and media being analyzed (e.g., inappropriate sampling equipment, improper storage conditions). 100 ------- Confidence Level (Score) Description Selected Score AND/OR • There are numerous inconsistencies in the reporting of sampling information, resulting in high uncertainty in the sampling methods used. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 2. Analytical Methodology High (score = 1) • Samples were analyzed according to publically available analytical methods that are scientifically sound and widely accepted (i.e., from a source generally using sound methods and/or approaches) and are appropriate for the chemical and media of interest. Examples include EPASW-846 Methods, NIOSH Manual of Analytical Methods 5th Edition, etc. OR • The analytical method used was not a publically available method from a source generally known to use sound methods and/or approaches, but the methodology is clear and appropriate (i.e., scientifically sound) and similar to widely accepted protocols for the chemical and media of interest. All pertinent sampling information is provided in the data source or companion source. Examples include: > extraction method > analytical instrumentation (required) > instrument calibration > LOQ, LOD, detection limits, and/or reporting limits > recovery samples > biomarker used (if applicable) > matrix-adjustment method (i.e., creatinine, lipid, moisture) Medium (score = 2) • Analytical methodology is discussed in detail and is clear and appropriate (i.e., scientifically sound) for the chemical and media of interest; however, one or more pieces of analytical information is not described. The missing information is unlikely to have a substantial impact on results. AND/OR • The analytical method may not be standard/widely accepted, but a method validation study was conducted prior to sample analysis and is expected to be consistent with sound scientific theory and/or accepted approaches. AND/OR • Samples were collected at a site and immediately analyzed using an on-site mobile laboratory, rather than shipped to a stationary laboratory. Low (score = 3) • Analytical methodology is only briefly discussed. Analytical instrumentation is provided and consistent with accepted analytical instrumentation/methods. However, most analytical information is missing and likely to have a substantial impact on results. AND/OR • Analytical method is not standard/widely accepted, and method validation is limited or not available. AND/OR • Samples were analyzed using field screening techniques. AND/OR • LOQ, LOD, detection limits, and/or reporting limits not reported. 101 ------- Confidence Level (Score) Description Selected Score AND/OR • There are some inconsistencies or possible errors in the reporting of analytical information (e.g., differences between text and tables in data source, differences between standard method and actual procedures reported to have been used, etc.) which leads to a lower confidence in the method used. Unacceptable (score = 4) • Analytical methodology is not described, including analytical instrumentation (i.e., HPLC, GC). 
AND/OR • Analytical methodology is not scientifically appropriate for the chemical and media being analyzed (e.g., method not sensitive enough, not specific to the chemical, out of date). AND/OR • There are numerous inconsistencies in the reporting of analytical information, resulting in high uncertainty in the analytical methods used. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 3. Selection of Biomarker of Exposure High (score = 1) • Biomarker in a specified matrix is known to have an accurate and precise quantitative relationship with external exposure, internal dose, or target dose (e.g., previous studies (or the current study) have indicated the biomarker of interest reflects external exposures). AND • Biomarker (parent chemical or metabolite) is derived from exposure to the chemical of interest. Medium (score = 2) • Biomarker in a specified matrix has accurate and precise quantitative relationship with external exposure, internal dose, or target dose. AND • Biomarker is derived from multiple parent chemicals, not only the chemical of interest, but there is a stated method to apportion the estimate to only the chemical of interest Low (score = 3) • Biomarker in a specified matrix has accurate and precise quantitative relationship with external exposure, internal dose, or target dose. AND • Biomarker is derived from multiple parent chemicals, not only the chemical of interest, and there is NOT an accurate method to apportion the estimate to only the chemical of interest. OR • Biomarker in a specified matrix is a poor surrogate (low accuracy and precision) for exposure/dose. Unacceptable (score = 4) • Not applicable. A study will not be deemed unacceptable based on the use of biomarker of exposure. Not rated/applicable • Metric is not applicable to the data source. Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] 102 ------- Confidence Level (Score) Description Selected Score Domain 2. Representative Metric 4. Geographic Area High (score = 1) • Geographic location(s) is reported, discussed, or referenced. Medium (score = 2) • Not applicable. This metric is dichotomous (i.e., high versus unacceptable). Low (score = 3) • Not applicable. This metric is dichotomous (i.e., high versus unacceptable). Unacceptable (score = 4) • Geographic location is not reported, discussed, or referenced. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 5. Temporality High (score = 1) • Timing of sample collection for monitoring data is consistent with current or recent exposures (within 5 years) may be expected. Medium (score = 2) • Timing of sample collection for monitoring data is less consistent with current or recent exposures (>5 to 15 years) may be expected. Low (score = 3) • Timing of sample collection for monitoring data is not consistent with when current exposures (>15 years old) may be expected and likely to have a substantial impact on results. Unacceptable (score = 4) • Timing of sample collection for monitoring data is not reported, discussed, or referenced. 
Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 6. Spatial and Temporal Variability High (score = 1) • Sampling approach accurately captures variability of environmental contamination in population/scenario/media of interest based on the heterogeneity/homogeneity and dynamic/static state of the environmental system. For example: > Large sample size (i.e., > 10 samples for a single scenario). > Use of replicate samples. > Use of systematic or continuous monitoring methods. > Sampling over a sufficient period of time to characterize trends. > For urine, 24-hr samples are collected (vs first morning voids or spot). > For biomonitoring studies, the timing of sample collected is appropriate based on chemical properties (e.g., half-life), the pharmacokinetics of the chemical (e.g., rate of uptake and elimination), and when the exposure event occurred. Medium (score = 2) • Sampling approach likely captures variability of environmental contamination in population/scenario/media of interest based on the heterogeneity/homogeneity and dynamic/static state of the environmental system. Some uncertainty may exist, but it is unlikely to have a substantial impact on results. For example: > Moderate sample size (i.e., 5-10 samples for a single scenario), or > Use of judgmental (non-statistical) sampling approach, or > No replicate samples. 103 ------- Confidence Level (Score) Description Selected Score > For urine, first morning voids or pooled spot samples. Low (score = 3) • Sampling approach poorly captures variability of environmental contamination in population/scenario/media of interest. For example: > Small sample size (i.e., <5 samples), or > Use of haphazard sampling approach, or > No replicate samples, or > Grab or spot samples in single space or time, or > Random sampling that doesn't include all periods of time or locations, or > For urine, un-pooled spot samples. Unacceptable (score = 4) • Sample size is not reported. • Single sample collected per data set. • For biomonitoring studies, the timing of sample collected is not appropriate based on chemical properties (e.g., half-life), the pharmacokinetics of the chemical (e.g., rate of uptake and elimination), and when the exposure event occurred. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 7. Exposure Scenario High (score = 1) • The data closely represent relevant exposure scenario (i.e., the population/scenario/media of interest). Examples include: > amount and type of chemical / product used > source of exposure > method of application or by-stander exposure > use of exposure controls > microenvironment (location, time, climate) Medium (score = 2) • The data likely represent the relevant exposure scenario (i.e., population/scenario/media of interest). One or more key pieces of information may not be described but the deficiencies are unlikely to have a substantial impact on the characterization of the exposure scenario. AND/OR • If surrogate data, activities seem similar to the activities within scope. Low (score = 3) • The data lack multiple key pieces of information and the deficiencies are likely to have a substantial impact on the characterization of the exposure scenario. 
AND/OR • There are some inconsistencies or possible errors in the reporting of scenario information (e.g., differences between text and tables in data source, differences between standard method and actual procedures reported to have been used, etc.) which leads to a lower confidence in the scenario assessed. AND/OR • If surrogate data, activities have lesser similarity but are still potentially applicable to the activities within scope. Unacceptable (score = 4) • If reported, the exposure scenario discussed in the monitored study does not represent the exposure scenario of interest for the chemical. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] 104 ------- Confidence Level (Score) Description Selected Score Domain 3. Accessibility / Clarity Metric 8. Reporting of Results High (score = 1) • Supplementary or raw data (i.e., individual data points) are reported, allowing summary statistics to be calculated or reproduced. AND • Summary statistics are detailed and complete. Example parameters include: > Description of data set summarized (i.e., location, population, dates, etc.) > Range of concentrations or percentiles > Number of samples in data set > Frequency of detection > Measure of variation (CV, standard deviation) > Measure of central tendency (mean, geometric mean, median) > Test for outliers (if applicable) AND • Both adjusted and unadjusted results are provided (i.e., correction for void completeness in urine biomonitoring, whole-volume or lipid adjusted for blood biomonitoring, wet or dry weight for ecological tissue samples or soil samples) [only if applicable]. Medium (score = 2) • Supplementary or raw data (i.e., individual data points) are not reported, and therefore summary statistics cannot be reproduced. AND/OR • Summary statistics are reported but are missing one or more parameters (see description for high). AND/OR • Only adjusted or unadjusted results are provided, but not both [only if applicable]. Low (score = 3) • Supplementary data are not provided, and summary statistics are missing most parameters (see description for high). AND/OR • There are some inconsistencies or errors in the results reported, resulting in low confidence in the results reported (e.g., differences between text and tables in data source, less appropriate statistical methods). Unacceptable (score = 4) • There are numerous inconsistencies or errors in the calculation and/or reporting of results, resulting in highly uncertain reported results. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 9. Quality Assurance High (score = 1) • The study applied quality assurance/quality control measures and all pertinent quality assurance information is provided in the data source or companion source. Examples include: > Field, laboratory, and/or storage recoveries. > Field and laboratory control samples. > Baseline (pre-exposure) samples. 
> Biomarker stability > Completeness of sample (i.e., creatinine, specific gravity, osmolality for urine samples) AND • No quality control issues were identified or any identified issues were minor and adequately addressed (i.e., correction for low recoveries, correction for 105 ------- Confidence Level (Score) Description Selected Score completeness). Medium (score = 2) • The study applied and documented quality assurance/quality control measures; however, one or more pieces of QA/QC information is not described. Missing information is unlikely to have a substantial impact on results. AND • No quality control issues were identified or any identified issues were minor and addressed (i.e., correction for low recoveries, correction for completeness). Low (score = 3) • Quality assurance/quality control techniques and results were not directly discussed, but can be implied through the study's use of standard field and laboratory protocols. AND/OR • Deficiencies were noted in quality assurance/quality control measures that are likely to have a substantial impact on results. AND/OR • There are some inconsistencies in the quality assurance measures reported, resulting in low confidence in the quality assurance/control measures taken and results (e.g., differences between text and tables in data source). Unacceptable (score = 4) • QA/QC issues have been identified which significantly interfere with the overall reliability of the study. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 4. Variability and Uncertainty Metric 10. Variability and Uncertainty High (score = 1) • The study characterizes variability in the population/media studied. AND • Key uncertainties, limitations, and data gaps have been identified. AND • The uncertainties are minimal and have been characterized. Medium (score = 2) • The study has limited characterization of variability in the population/media studied. AND/OR • The study has limited discussion of key uncertainties, limitations, and data gaps. AND/OR • Multiple uncertainties have been identified, but are unlikely to have a substantial impact on results. Low (score = 3) • The characterization of variability is absent. AND/OR • Key uncertainties, limitations, and data gaps are not discussed. AND/OR • Uncertainties identified may have a substantial impact on the exposure the exposure assessment Unacceptable (score = 4) • Estimates are highly uncertain based on characterization of variability and uncertainty. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] 106 ------- Confidence Level (Score) Description Selected Score Notes: ADME = Absorption, distribution, metabolism, and elimination CV = Coefficient of variation GC = Gas chromatography HPLC = High pressure liquid chromatography LOD = Limit of detection LOQ = Limit of quantitation NIOSH = National Institute for Occupational Safety and Health QA/QC = Quality assurance/quality control SOPs = Standard operating procedures USGS = U.S. Geological Survey 107 ------- E.6.2 Modeling Data27 Table E-8. Serious Flaws that Would Make Sources of Modeling Data Unacceptable for Use in the Exposure Assessment Optimization of the list of serious flaws may occur after pilot calibration exercises. 
Domain Metric Description of Serious Flaw(s) in Data Source Reliability Mathematical Equations For widely accepted models from a source generally known to use sound methods and/or approaches, the module used is not germane to the scenario being assessed. For other (non-public/non-authoritative) models, key mathematical equations and/or theory are not provided in the data source or in a companion reference. Key mathematical equations are not based on scientifically sound approaches. Key mathematical equations are incorrect. Model Evaluation The model used in the data source has not undergone evaluation. It is unknown whether the model has undergone evaluation. Evaluation efforts indicate that the model results do not correctly estimate concentrations or uptakes. Model has no acceptance among the scientific or regulatory community. Representative Exposure Scenario Model inputs do not reflect relevant conditions for the scenario of interest, or insufficient information is provided to make a determination. Accessibility / Clarity Model and Model Documentation Availability This metric does not have an unacceptable criterion. Model Inputs and Defaults There is at most a very limited description of model inputs/defaults and their associated data sources. Variability and Uncertainty Variability and Uncertainty Estimates are highly uncertain based on characterization of uncertainty. 27 Evaluation of models and modeling data types will largely follow guidance from (U.S. EPA. 2009). 108 ------- Table E-9. Evaluation Criteria for Sources of Modeling Data EPA will consult with the Guidance on the Development, Evaluation, and Application of Environmental Models (U.S. EPA. 2009) when evaluating models and modeling data types. Confidence Level (Score) Description Selected Score Domain 1. Reliability Metric 1. Mathematical Equations/Theory High (score = 1) • The model is scientifically sound and widely accepted (i.e., from a source generally using sound methods and/or approaches) for the scenario being assessed. OR • For other (non-public/non-authoritative) models, key mathematical equations to calculate concentrations or uptakes are provided in the data source or in a companion reference. Equations are described in detail and correctness can be assessed. Medium (score = 2) • For other (non-public/authoritative) models, key mathematical equations to calculate concentrations or uptakes are not available in the data source, but the scientific and mathematical theory (i.e., conceptual model) is described in detail. Low (score = 3) • For other (non-public/authoritative) models, key mathematical equations or theory to calculate concentrations or uptakes are unclear or not detailed enough to thoroughly assess. Unacceptable (score = 4) • For widely accepted models from a source generally known to use sound methods and/or approaches, the module used is not germane to the scenario being assessed. AND/OR • For other (non-public/non-authoritative) models, key mathematical equations and/or theory are not provided in the data source or in a companion reference. AND/OR • Key mathematical equations are not based on scientifically sound approaches. AND/OR • Key mathematical equations are incorrect. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 2. Model Evaluation High (score = 1) • The model used in the data source has undergone extensive evaluation. 
The evaluation methodology and results are either discussed in the data source or provided in a companion source. Example evaluation methods include: - formal peer review - quantitative corroboration of model results with monitoring data directly relevant for the scenario of interest - benchmarking against other models - quality assurance checks during model development. Medium (score = 2) • The model used in the data source has undergone only targeted/limited evaluation. For example: - informal peer review - at most limited evaluation with monitoring data - qualitative corroboration of model results through expert elicitation 109 ------- Confidence Level (Score) Description Selected Score - evaluation via other model predictions - quality assurance checks during model development. AND/OR • There is only limited discussion on the evaluation methodology and results in either the data source or other references. AND/OR • Model has wide acceptance among the scientific and regulatory community but has not have been validated for the scenario of interest, peer reviewed or well documented. Low (score = 3) • Model evaluation was conducted according to the author; however, there is no information provided regarding model peer review, corroboration, or quality assurance checks. AND/OR • Model has only limited acceptance among the scientific and regulatory community. Unacceptable (score = 4) • The model used in the data source has not undergone evaluation. AND/OR • It is unknown whether the model has undergone evaluation. AND/OR • Evaluation efforts indicate that the model results do not correctly estimate concentrations or uptakes. AND/OR • Model has no acceptance among the scientific and regulatory community. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 2. Representative Metric 3. Exposure Scenario High (score = 1) • The modeled scenario closely represents current exposures (within 5 years) and/or relevant conditions (e.g., environmental conditions, consumer products, exposure factors, geographical location). Medium (score = 2) • The modeled scenario is less representative of current exposures (>5 to 15 years) and/or relevant conditions for the scenario of interest (e.g., environmental conditions, consumer products, exposure factors, geographical location). Low (score = 3) • The modeled scenario is not consistent with when current exposures are expected (>15 years) and/or with relevant conditions (e.g., environmental conditions, consumer products, exposure factors, geographical location); inconsistencies are likely to have a substantial impact on results. Unacceptable (score = 4) • Model inputs do not reflect relevant conditions for the scenario of interest, or insufficient information is provided to make a determination. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] 110 ------- Confidence Level (Score) Description Selected Score Domain 3. Accessibility / Clarity Metric 4. Model and Model Documentation Availability High (score = 1) • The model and documentation (user guide, documentation manual) are publicly available or there is sufficient documentation in the data source or in a companion reference. Medium (score = 2) • Not applicable. 
This metric is dichotomous (i.e., high versus low). Low (score = 3) • The model and documentation (user guide, documentation manual) are not available, or there is insufficient documentation in the data source or in a companion reference. Unacceptable (score = 4) • Not applicable. This metric is dichotomous (i.e., high versus low). Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 5. Model Inputs and Defaults High (score = 1) • Key model inputs (e.g., chemical mass released, release pattern over time, receptor uptake rates and locations over time) and defaults are identified, referenced and clearly described. AND • Model inputs meet data quality acceptance criteria specified by the authors or are standard or commonly accepted inputs (e.g., from Exposure Factors Handbook). Medium (score = 2) • Key model inputs and defaults and associated data sources are generally identified, referenced and clearly described, but the descriptions are not detailed. AND/OR • Data quality acceptance criteria specified by the author are not discussed, but inputs appear appropriate. Low (score = 3) • Numerous key model inputs and defaults and associated data sources are not identified, referenced or clearly described; AND/OR • There are some inconsistencies in the reporting of inputs and defaults and their associated data sources (e.g., differences between text and tables in data source, differences between standard method and actual procedures reported to have been used) that lead to a low confidence in the inputs and defaults used. AND/OR • Data quality acceptance criteria specified by the author are not discussed and some inputs appear inappropriate. Unacceptable (score = 4) • There is at most a very limited description of model inputs/defaults and their associated data sources. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Ill ------- Confidence Level (Score) Description Selected Score Domain 4. Variability and Uncertainty Metric 6. Variability and Uncertainty High (score = 1) • The study characterizes variability in the population/media studied. AND • Key uncertainties, limitations, and data gaps have been identified. AND • The uncertainties are minimal and have been characterized. Medium (score = 2) • The study has limited characterization of variability in the population/media studied. AND/OR • The study has limited discussion of key uncertainties, limitations, and data gaps. AND/OR • Multiple uncertainties have been identified, but are unlikely to have a substantial impact on results. Low (score = 3) • The characterization of variability is absent. AND/OR • Key uncertainties, limitations, and data gaps are not discussed. AND/OR • Uncertainties identified may have a substantial impact on the exposure the exposure assessment Unacceptable (score = 4) • Estimates are highly uncertain based on characterization of variability and uncertainty. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] 112 ------- E.6.3 Survey Data Table E-10. 
Serious Flaws that Would Make Sources of Survey Data Unacceptable for Use in the Exposure Assessment Optimization of the list of serious flaws may occur after pilot calibration exercises. Domain Metric Description of Serious Flaw(s) in Data Source Reliability Data Collection Methodology Data collection methods are not described. Data collection methods used are not appropriate (i.e., scientifically sound) for the target population, the intended purpose, data requirements of the survey, or the target response rate. There are numerous inconsistencies in the reporting of data collection information resulting in high uncertainty in the data collection methods used. Data Analysis Methodology Data analysis methodology is not described. Data analysis methodology is not appropriate (i.e., scientifically sound) for the intended purpose of the survey and the data/information collected. There are numerous inconsistencies in the reporting of analytical information resulting in high uncertainty in the data analysis methods used. Representative Geographic Area Geographic location is not reported, discussed, or referenced. Sampling/ Sampling Size Sampling procedures (e.g., stratified sampling, cluster sampling, multi- stage sampling, non-probability sampling, etc.) are not documented in the data source or companion source. Sample size is not reported. Response Rate This metric does not have an unacceptable criterion.. Accessibility / Clarity Reporting of Results There are numerous inconsistencies or errors in the calculation and/or reporting of results, resulting in highly uncertain reported results. Quality Assurance QA/QC issues have been identified which significantly interfere with the overall reliability of the survey results. Variability and Uncertainty Variability and Uncertainty Estimates are highly uncertain based on characterization of variability and uncertainty. Note: QA/QC = Quality assurance/quality control 113 ------- Table E-ll. Evaluation Criteria for Source of Survey Data Confidence Level (Score) Description Selected Score Domain 1. Reliability Metric 1. Data Collection Methodology High (score = 1) • Survey data were collected using a standard or validated data collection methods (e.g., mail, phone, personal interview, online surveys, etc.) that are appropriate (i.e., scientifically sound) given the characteristics of the target population, the intended purpose, data requirements of the survey, and the target response rate. AND • All pertinent information regarding data collection methodology is provided in the data source or companion source. Examples include: > data collection instrument (e.g., questionnaire, diaries, etc.) > data collection protocols for field personnel > date of data collection > description of target population Medium (score = 2) • Survey data were collected using standard or validated data collection methods appropriate given the characteristics of the target population, the intended purpose and data requirements of the survey, and the target response rate. However, one or more pieces of pertinent information regarding data collection is not described. The missing information is unlikely to have a substantial impact on results. Low (score = 3) • Data collection methods are only briefly discussed, therefore most data collection information is missing and likely to have a substantial impact on results. 
AND/OR • There are some inconsistencies in the reporting of data collection information (e.g., differences between text and tables in data source) which lead to a low confidence in the data collection methodology used. Unacceptable (score = 4) • Data collection methods are not described. AND/OR • Data collection methods used are not appropriate (i.e., scientifically sound) for the target population, the intended purpose, data requirements of the survey, or the target response rate. AND/OR • There are numerous inconsistencies in the reporting of data collection information resulting in high uncertainty in the data collection methods used. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 2. Data Analysis Methodology High (score = 1) • Data analysis methodology is discussed in detail and is clear and appropriate (i.e., scientifically sound) for the intended purpose of the survey and the data/information collected. Methods employed are standard/widely accepted. AND • All pertinent analytical methodology information is provided in the data source or companion source. Examples include: > information on statistical and weighting methods (if applicable) > discussion regarding treatment of missing data 114 ------- Confidence Level (Score) Description Selected Score > Identification of sources of error, including coverage error, nonresponse error, measurement error, and data processing error (e.g., keying, coding, editing, and imputation error) > Methods for measuring sampling and nonsampling errors Medium (score = 2) • Data analysis methodology is discussed and is clear and appropriate for the intended purpose of the survey and the data/information collected. Methods employed are standard/widely accepted; however, one or more pieces of analytical information is not described. The missing information is unlikely to have a substantial impact on results. Low (score = 3) • Data analysis methodology is only briefly discussed in the data source or companion source, therefore most analytical information is missing and likely to have a substantial impact on results. AND/OR • Methods for data analysis are not standard/widely accepted. AND/OR • There are some inconsistencies in the reporting of analytical information which lead to a low confidence in the data analysis methodology used. Unacceptable (score = 4) • Data analysis methodology is not described in the data source or companion source. OR • Data analysis methodology is not appropriate (i.e., scientifically sound) for the intended purpose of the survey and the data/information collected. OR • There are numerous inconsistencies in the reporting of analytical information resulting in high uncertainty in the data analysis methods used. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 2. Representative Metric 3. Geographic Area High (score = 1) • Geographic location(s) is reported, discussed, or referenced. Medium (score = 2) • Not applicable. This metric is dichotomous (i.e., high versus unacceptable). Low (score = 3) • Not applicable. This metric is dichotomous (i.e., high versus unacceptable). Unacceptable (score = 4) • Geographic location is not reported, discussed, or referenced. 
Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 4. Sampling/Sampling Size High (score = 1) • Sampling procedures are documented (e.g., stratified sampling, cluster sampling, multi-stage sampling, non-probability sampling, etc.). AND 115 ------- Confidence Level (Score) Description Selected Score • Sample size and method of calculation is reported. AND • Sample size is large enough to be reasonably assured that the samples represent the population of interest. For example, sample size has a margin of error of <10% and a confidence level of >90%. Medium (score = 2) • Sampling procedures are documented (e.g., stratified sampling, cluster sampling, multi-stage sampling, non-probability sampling, etc.). AND • Sample size is reported, but the sample size calculation method is not reported. AND/OR • Sample size is small, indicating that the survey results are less likely to represent the target population. For example, sample size has a margin of error of >10% and a confidence level of <90%. Low (score = 3) • Sampling procedures are documented (e.g., stratified sampling, cluster sampling, multi-stage sampling, non-probability sampling, etc.). AND • Sample size is reported, but the sample size calculation method is not reported. AND/OR • Adequacy of sample size is not discussed or cannot be determined from information in the study. Unacceptable (score = 4) • Sampling procedures (e.g., stratified sampling, cluster sampling, multi-stage sampling, non-probability sampling, etc.) are not documented in the data source or companion source. AND/OR • Sample size is not reported. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 5. Response Rate High (score = 1) • The survey response rate is documented and is high enough (i.e., >70%) to reasonably ensure that the survey results are representative of the target population. Medium (score = 2) • The survey response rate is documented and the response rate is >40-70%, indicating that the survey results will likely represent the target population. Low (score = 3) • The survey response rate is documented and the response rate is <40%, indicating that the survey results are less likely to represent the target population. OR • The survey response rate is not documented in the data source or companion source. Unacceptable (score = 4) • This metric does not have an unacceptable criterion. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] 116 ------- Confidence Level (Score) Description Selected Score Domain 3. Accessibility / Clarity Metric 6. Reporting of Results High (score = 1) • Supplementary or raw data (i.e., individual data points) are reported, allowing summary statistics to be calculated or reproduced. AND • Summary statistics are detailed and complete. 
Example parameters include: > Description of data set summarized > Number of samples in data set > Range or percentiles > Measure of variation (coefficient of variation (CV), standard deviation) > Measure of central tendency (mean, geometric mean, median) > Test for outliers (if applicable) Medium (score = 2) • Supplementary or raw data (i.e., individual data points) are not reported, and therefore summary statistics cannot be reproduced. AND/OR • Summary statistics are reported but are missing one or more parameters (see description for high). Low (score = 3) • Supplementary data are not provided, and summary statistics are missing most parameters (see description for high). AND/OR • There are some inconsistencies or errors in the results reported, resulting in low confidence in the results reported (e.g., differences between text and tables in data source, less appropriate statistical methods). Unacceptable (score = 4) • There are numerous inconsistencies or errors in the calculation and/or reporting of results, resulting in highly uncertain reported results. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 7. Quality Assurance High (score = 1) • Survey quality assurance/control measures were employed during each phase of the survey and are documented. Examples may include: > training staff in protocols > monitoring interviewers > conducting response analysis surveys > contingencies to modify the survey procedures > monitoring of data collection activities AND • No quality control issues were identified or any identified issues were minor and were addressed. Medium (score = 2) • The study applied and documented quality assurance/quality control measures; however, one or more pieces of QA/QC information is not described. Missing information is unlikely to have a substantial impact on results. AND • No quality control issues were identified or any identified issues were minor and addressed. Low (score = 3) • Quality assurance/quality control techniques and results were not directly discussed, but can be implied through the study's use of standard survey 117 ------- Confidence Level (Score) Description Selected Score protocols. AND/OR • Deficiencies were noted in quality assurance/quality control measures that are likely to have a substantial impact on results. AND/OR • There are some inconsistencies in the quality assurance measures reported, resulting in low confidence in the quality assurance/control measures taken and results (e.g., differences between text and tables in data source). Unacceptable • QA/QC issues have been identified which significantly interfere with the overall (score = 4) reliability of the survey results. Not rated/applicable Reviewer's [Document concerns, uncertainties, limitations, and deficiencies and any additional comments comments that may highlight study strengths or important elements such as relevance] Domain 4. Variability and Uncertainty Metric 8. Variability and Uncertainty High • The variability in the population and data collected in the survey is characterized (score = 1) (e.g., sampling and non-sampling errors). AND • Key uncertainties, limitations, and data gaps have been identified. AND • The uncertainties are minimal and have been characterized. Medium • The study has limited characterization of variability in the population studied and (score = 2) data collected in the survey. 
AND/OR • The study has limited discussion of key uncertainties, limitations, and data gaps. AND/OR • Multiple uncertainties have been identified, but are unlikely to have a substantial impact on results. Low • The characterization of variability is absent. (score = 3) AND/OR • Key uncertainties, limitations, and data gaps are not discussed. AND/OR • Uncertainties identified may have a substantial impact on the exposure the exposure assessment Unacceptable • Estimates are highly uncertain based on characterization of variability and (score = 4) uncertainty. Not rated/applicable Reviewer's [Document concerns, uncertainties, limitations, and deficiencies and any additional comments comments that may highlight study strengths or important elements such as relevance] Note: QA/QC = Quality assurance/quality control 118 ------- E.6.4 Epidemiology Data to Support Exposure Assessment Table E-12. Serious Flaws that Would Make Sources of Epidemiology Data Unacceptable for Use in the Exposure Assessment EPA will not use data/information from data sources that exhibit serious flaws as described in Table E-12. Optimization of the list of serious flaws may occur after pilot calibration exercises. Domain Metric Description of Serious Flaw(s) in Data Source Reliability Measurement or Exposure Characterization Exposure misclassification (e.g., differential recall of self-reported (All Study Types) exposure) is present, but no attempt is made to address it. Reporting Bias This metric does not have an unacceptable criterion. Exposure Variability and Misclassification Exposure based on a single sample and error is known to be so large that the results are too uncertain to be useful. Reliability (Applicable to Study Sample Contamination There are known contamination issues and the issues were not addressed. Types with Direct Exposure Measurements Method Requirements The method used is known to produce unreliable or invalid results. Only) Matrix Adjustment This metric does not have an unacceptable criterion. Method Sensitivity This metric does not have an unacceptable criterion. Stability This metric does not have an unacceptable criterion. Reliability (Applicable to Study Types with Biomarker Use of Biomarker of Exposure This metric does not have an unacceptable criterion. Measurements Only) Relevance This metric does not have an unacceptable criterion. Representativeness Geographic Area Geographic location is not reported, discussed, or referenced. Participant Selection This metric does not have an unacceptable criterion. For cohort studies: The loss of subiects (i.e., incomplete exposure data) was both large and unacceptably handled (as described in the Attrition low confidence category). For case-control and cross-sectional studies: The exclusion of subjects from analyses was both large and unacceptably handled (as described in the low confidence category). Comparison Group Subjects in all groups were not similar, recruited within very different time frames, or had very different participation/ response rates. Accessibility/ Clarity Documentation There are numerous inconsistencies or errors in the calculation and/or reporting of information and results, resulting in highly 119 ------- Domain Metric Description of Serious Flaw(s) in Data Source uncertain reported results. QA/QC QA/QC issues have been identified which significantly interfere with the overall reliability of the study, and are not addressed. Variability and Uncertainty Variability This metric does not have an unacceptable criterion. 
Uncertainties This metric does not have an unacceptable criterion. Table E-13. Evaluation Criteria for Sources of Epidemiology Data to Support the Exposure Assessment Confidence Level (Score) Metric Description Selected Score Domain 1. Reliability Metrics 1-2 = Applicable to All Study Types Metric 1. Measurement or Exposure Characterization High (score = 1) • Exposure was consistently assessed (i.e., under the same method and time-frame across cases, controls or the entire cohort) using well-established methods that directly measure exposure (e.g., measurement of the chemical in air or measurement of the chemical in blood, plasma, urine, etc.). OR Exposure was consistently assessed using less-established methods that directly measure exposure and are validated against well-established methods. • Medium (score = 2) • Exposure was assessed using indirect measures (e.g., questionnaire or occupational exposure assessment by a certified industrial hygienist) that have been validated or empirically shown to be consistent with methods that directly measure exposure (i.e., inter-methods validation: one method vs. another) Low (score = 3) • Exposure was assessed using direct or indirect measures that have not been validated or have poor validity. OR If using indirect methods, they have not empirically shown to be consistent with methods that directly measure exposure (e.g., a job-exposure matrix or self- report without validation). OR There is insufficient information provided about the exposure assessment, including validity and reliability, but no evidence for concern about the method used. • • Unacceptable (score = 4) • Exposure misclassification (e.g., differential recall of self-reported exposure) is present and likely to impact results, but no attempt is made to address it. Not rated/applicable Reviewer's Comments: [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 2. Reporting Bias High (score = 1) • All of the study's measured exposures outlined in the protocol, methods, abstract, and/or introduction (that are relevant for the evaluation) are reported. Medium (score = 2) • Not applicable. This metric is dichotomous (i.e., high versus low) Low • All of the study's measured exposures outlined in the protocol, methods, 120 ------- Confidence Level (Score) Metric Description Selected Score (score = 3) abstract, and/or introduction (that are relevant for the evaluation) have not been reported. Unacceptable (score = 4) • Not applicable. This metric is dichotomous (i.e., high versus low). Not rated/applicable Reviewer's Comments: [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metrics 3-8 = Applicable Only to Study Types with Direct Exposure Measurements (i.e., Measurement of Chemical in Specific Media or Biomarker Measurement) Metric 3. Exposure Variability and Misclassification High (score = 1) • There are a sufficient number of samples per individual to estimate exposure over the appropriate duration, or through the use of adequate long-term sampling data. A "sufficient" number is dependent upon the chemical and the research question. AND • Error is considered by calculating measures of accuracy (e.g., sensitivity and specificity) and reliability (e.g., intra-class correlation coefficient (ICC)). 
Medium (score = 2) • One sample is used per individual, and there is stated evidence that errors from a single measurement are negligible. Low (score = 3) • More than one sample collected per individual, but without evaluation of error. OR • Exposure based on a single sample without consideration or recognition of error Unacceptable (score = 4) • Exposure based on a single sample and error is known to be so large that the results are too uncertain to be useful. Not rated/applicable Reviewer's Comments: [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 4. Sample Contamination High (score = 1) • Samples are contamination-free from the time of collection to the time of measurement (e.g., by use of certified analyte free collection supplies and reference materials, and appropriate use of blanks both in the field and lab). AND • Documentation of the steps taken to provide the necessary assurance that the study data are reliable is included. Medium (score = 2) • Samples are stated to be contamination-free from the time of collection to the time of measurement. AND • There is incomplete documentation of the steps taken to provide the necessary assurance that the study data are reliable. Low (score = 3) • Samples are known to have contamination issues, but steps have been taken to address and correct contamination issues. OR • Samples are stated to be contamination-free from the time of collection to the time of measurement, but there is no use or documentation of the steps taken to provide the necessary assurance that the study data are reliable. 121 ------- Confidence Level (Score) Metric Description Selected Score Unacceptable (score = 4) • There are known contamination issues and the issues were not addressed. Not rated/applicable Reviewer's Comments: [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 5. Method Requirements High (score = 1) • Study uses instrumentation that provides unambiguous identification and quantitation of the biomarker or chemical in media at the required sensitivity (e.g., gas chromatography-high-resolution mass spectrometry (GC-HRMS), gas chromatography-tandem mass spectrometry (GC-MS/MS), liquid chromatography-tandem mass spectrometry (LC-MS/MS)). Medium (score = 2) • Study uses instrumentation that allows for identification of the biomarker or chemical in media with confidence and the required sensitivity (e.g., gas chromatography-mass spectrometry (GC-MS), gas chromatography-electron capture detector (GC-ECD)). Low (score = 3) • Study uses instrumentation that only allows for possible quantification of the biomarker or chemical in media but the method has known interferants (e.g., gas chromatography-flame ionization detector (GC-FID)). OR • Study uses a semi-quantitative method to assess the biomarker or chemical in media (e.g., fluorescence). Unacceptable (score = 4) • The method used is known to produce unreliable or invalid results. Not rated/applicable Reviewer's Comments: [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 6. 
Matrix Adjustment High (score = 1) • If applicable for the biomarker under consideration, study provides results, either in the main publication or as a supplement, for adjusted and unadjusted matrix concentrations (e.g., creatinine-adjusted or SG-adjusted and non-adjusted urine concentrations) and reasons are given for adjustment approach. Medium (score = 2) • If adjustments are needed, study only provides results using one method (matrix adjusted or not). Low (score = 3) • If applicable for the biomarker under consideration, no established method for matrix adjustment was conducted. Unacceptable (score = 4) • Not applicable. A study will not be deemed unacceptable based on matrix adjustment. Not rated/applicable Reviewer's Comments: [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] 122 ------- Confidence Level (Score) Metric Description Selected Score Metric 7. Method Sensitivity High • Limits of detection/quantification are reported and low enough to detect (score = 1) chemicals in a sufficient percentage of the samples to address the research questions (e.g., 50-60% detectable values if the research hypothesis requires estimates of both central tendencies and upper tails of the population concentrations). OR • All samples are above the LOD/LOQ. Medium • Not applicable. This metric is dichotomous (i.e., high versus low). (score = 2) Low • Frequency of detection too low to address the research question (score = 3) OR • There are samples below the LOD/LOQ, and LOD/LOQ are not stated. Unacceptable • Not applicable. This metric is dichotomous (i.e., high versus low). (score = 4) Not rated/applicable Reviewer's Comments: [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 8. Stability High • Samples with a known history and documented stability data or those using real- (score = 1) time measurements. Medium • Samples have known losses during storage but the difference between low and (score = 2) high exposures can be qualitatively assessed. Low • Samples with either unknown history and/or no stability data for analytes of (score = 3) interest. Unacceptable • Not applicable. A study will not be deemed unacceptable based on stability. (score = 4) Not rated/applicable Reviewer's Comments: [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 9 = Only Applicable to Studies with Biomarker Measurements Metric 9. Use of Biomarker of Exposure High • Biomarker in a specified matrix is known to have an accurate and precise (score = 1) quantitative relationship with external exposure, internal dose, or target dose (e.g., previous studies (or the current study) have indicated the biomarker of interest reflects external exposures). AND • Biomarker (parent chemical or metabolite) is derived from exposure to the chemical of interest. Medium • Biomarker in a specified matrix has accurate and precise quantitative relationship (score = 2) with external exposure, internal dose, or target dose. AND • Biomarker is derived from multiple parent chemicals, not only the chemical of interest, but there is a stated method to apportion the estimate to only the chemical of interest. 
123 ------- Confidence Level (Score) Metric Description Selected Score Low (score = 3) • Biomarker in a specified matrix has accurate and precise quantitative relationship with external exposure, internal dose, or target dose. AND • Biomarker is derived from multiple parent chemicals, not only the chemical of interest, and there is NOT an accurate method to apportion the estimate to only the chemical of interest. OR • Biomarker in a specified matrix is a poor surrogate (low accuracy and precision) for exposure/dose. Unacceptable (score = 4) • Not applicable. A study will not be deemed unacceptable based on the use of biomarker of exposure. Not rated/applicable Reviewer's Comments: [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 2. Representativeness Metric 10. Relevance High (score = 1) • The study represents current exposures (within 5 years) and relevant conditions (e.g., environmental conditions, consumer products, exposure factors, geographical location). Medium (score = 2) • The study is less representative of current exposures (>5 to 15 years) and/or relevant conditions for the scenario of interest (e.g., environmental conditions, consumer products, exposure factors, geographical location). Low (score = 3) • The study is not consistent with current exposures (>15 years) and/or with relevant conditions (e.g., environmental conditions, consumer products, exposure factors, geographical location); inconsistencies are likely to have a substantial impact on results. OR • Insufficient information is provided to determine whether the study represents current relevant conditions for the scenario of interest. Unacceptable (score = 4) • Not applicable. A study will not be deemed unacceptable based on relevance. Not rated/applicable Reviewer's Comments: [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 11. Geographic Area High (score = 1) • Geographic location(s) is reported, discussed, or referenced. Medium (score = 2) • Not applicable. This metric is dichotomous (i.e., high versus unacceptable). Low (score = 3) • Not applicable. This metric is dichotomous (i.e., high versus unacceptable). Unacceptable (score = 4) • Geographic location is not reported, discussed, or referenced. Not rated/applicable 124 ------- Confidence Level (Score) Metric Description Selected Score Reviewer's Comments: [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 12. Participant Selection High (score = 1) • The participants selected are representative of the larger population from which they were sampled. OR • Approaches (e.g., survey weights, inverse probability weighting) were applied to ensure representativeness. Medium (score = 2) • Not applicable. This metric is dichotomous (i.e., high versus low). Low (score = 3) • The participants selected do not appear to be representative of the larger population from which they were sampled. OR • There is insufficient information to determine whether participants selected are representative of the population from which they were sampled. Unacceptable (score = 4) • Not applicable. This metric is dichotomous (i.e., high versus low). 
Not rated/applicable Reviewer's Comments: [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 13. Attrition High (score = 1) • For cohort studies: There was minimal subject attrition during the study (or exclusion from the analysis sample) and exposure data were largely complete. OR • Any loss of subjects (i.e., incomplete exposure data) was adequately* addressed (as described above) and reasons were documented when human subjects were removed from a study. OR • Missing data have been imputed using appropriate methods (e.g., random regression imputation), and characteristics of subjects lost to follow-up or with unavailable records are described in an identical way and are not significantly different from those of the study participants. • For case-control studies and cross-sectional studies: There was minimal subject withdrawal from the study (or exclusion from the analysis sample) and exposure data were largely complete. OR • Any exclusion of subjects from analyses was adequately* addressed (as described above), and reasons were documented when subjects were removed from the study or excluded from analyses. *NOTE for all study types: Adequate handling of subject attrition includes: very little missing exposure data; missing exposure data balanced in numbers across study groups, with similar reasons for missing data across groups. Medium (score = 2) • For cohort studies: There was moderate subject attrition during the study (or exclusion from the analysis sample). AND • Any loss or exclusion of subjects was adequately addressed (as described in the acceptable handling of subject attrition in the high confidence category) and reasons were documented when human subjects were removed from a study. • For case-control studies and cross-sectional studies: There was moderate subject withdrawal from the study (or exclusion from the analysis sample), but exposure data were largely complete. AND • Any exclusion of subjects from analyses was adequately addressed (as described above), and reasons were documented when subjects were removed from the study or excluded from analyses. Low (score = 3) • For cohort studies: There was large subject attrition during the study (or exclusion from the analysis sample), but it was adequately addressed (i.e., missing exposure data was balanced in numbers across groups and reasons for missing data were similar across groups). OR • Subject attrition was not large but it was inadequately addressed. Inadequate handling of subject attrition: reason for missing exposure data likely to be related to true exposure, with either imbalance in numbers or reasons for missing data across study groups; or potentially inappropriate application of imputation. OR • Numbers of individuals were not reported at each stage of study (e.g., numbers potentially eligible, examined for eligibility, confirmed eligible, included in the study or analysis sample, completing follow-up, and analyzed). Reasons were not provided for non-participation at each stage. • For case-control and cross-sectional studies: There was large subject withdrawal from the study (or exclusion from the analysis sample), but it was adequately addressed (i.e., missing exposure data was balanced in numbers across groups and reasons for missing data were similar across groups).
OR • Subject attrition was not large but it was inadequately addressed. Inadequate handling of subject attrition: reason for missing exposure data likely to be related to true exposure, with either imbalance in numbers or reasons for missing data across study groups; or potentially inappropriate application of imputation. OR Numbers of individuals were not reported at each stage of study (e.g., numbers potentially eligible, examined for eligibility, confirmed eligible, included in the study or analysis sample, and analyzed). Reasons were not provided for non- participation at each stage. Unacceptable (score = 4) • For cohort studies: The loss of subjects (i.e., incomplete exposure data) was both large and unacceptably handled (as described above in the low confidence category). • For case-control and cross-sectional studies: The exclusion of subjects from analyses was both large and unacceptably handled (as described above in the low confidence category). Not rated/applicable 126 ------- Confidence Level (Score) Metric Description Selected Score Reviewer's Comments: [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 14 = Only Applicable to Studies that Compare Exposure in Different Groups Metric 14. Comparison Group High (1) • Key elements of the study design are reported (i.e., setting, inclusion and exclusion criteria, and methods of participant selection), and indicate that subjects (in all groups) were similar (e.g., recruited with the same method of ascertainment and within the same time frame using the same inclusion and exclusion criteria, and were of similar age and health status) OR • Baseline characteristics of groups differed but these differences were considered as potential confounding or stratification variables, and were thereby controlled by statistical analysis. Medium (2) • There is indirect evidence (i.e., stated by the authors without providing a description of methods) that subjects (in all groups) were similar (as described above for the high confidence rating). AND • Baseline characteristics for subjects (in all groups) reported in the study were similar. Low (3) • There is indirect evidence (i.e., stated by the authors without providing a description of methods) that subjects (in all groups) were similar (as described above for the high confidence rating). AND • Baseline characteristics for subjects (in all groups) were not reported. Unacceptable (4) • Subjects in all groups were not similar, recruited within very different time frames, or had very different participation/ response rates. Not rated/applicable Reviewer's Comments: [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 3. Accessibility / Clarity Metric 15. Documentation High (score = 1) • Study clearly states aims, methods, assumptions and limitations. AND • Study clearly states the time frame over which exposures were estimated and what the exposure level represents (e.g., spot measurement, peak, or average over a specified time frame). AND • Discussion of sample collection requirements, relevant participant characteristics, and matrix treatment is provided. AND • Supplementary data is included, allowing summary statistics to be reproduced. Medium (score = 2) • Study clearly states aims, methods, assumptions and limitations. 
AND • Study clearly states the time frame over which exposures were estimated and what the exposure level represents (e.g., spot measurement, peak, or average over a specified time frame). 127 ------- Confidence Level (Score) Metric Description Selected Score AND • Discussion of sample collection requirements, relevant participant characteristics, and matrix treatment is provided. AND • Supplementary data is not included; summary statistics cannot be reproduced. Low (score = 3) • Aims, methods, assumptions and limitations are not clear or not completely reported. OR • The time frame over which exposures were estimated and/or what the exposure level represents (e.g., peak, average over a specified time frame) are not clear (e.g., spot measurement, peak, average over a specified time frame). OR • Discussion of sample collection requirements, relevant participant characteristics, and matrix treatment is not provided. Unacceptable (score = 4) • There are numerous inconsistencies or errors in the calculation and/or reporting of information and results, resulting in highly uncertain reported results. Not rated/applicable Reviewer's Comments: [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 16. Quality Assurance/Quality Control High (score = 1) • The study applied quality assurance/quality control measures and all pertinent quality assurance information is provided in the data source or companion source. Examples include: > Field, laboratory, and/or storage recoveries > Field and laboratory control samples > Baseline (pre-exposure) samples > Biomarker stability > Completeness of sample (i.e., creatinine, specific gravity, osmolality for urine samples) AND • No quality control issues were identified or, if they were identified, were appropriately addressed (i.e., correction for low recoveries, correction for completeness). Medium (score = 2) • It is stated that quality assurance/quality control measures were used, but no details were provided. AND • No quality control issues were identified or any identified issues were minor and addressed (i.e., correction for low recoveries, correction for completeness). Low (score = 3) • Information on quality assurance/quality control was absent. OR • Quality assurance/quality control measures were applied and documented; however, minor quality control issues have been identified but not addressed, or there may be some reporting inconsistencies. Unacceptable (score = 4) • QA/QC issues have been identified which significantly interfere with the overall reliability of the study, and are not addressed. Not rated/applicable 128 ------- Confidence Level (Score) Metric Description Selected Score Reviewer's Comments: [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 4. Variability and Uncertainty Metric 17. Variability High • Study summarizes mean and variation in exposure levels for one or more groups. (score = 1) AND • Study presents discussion of sources of variability. Medium • Not applicable. This metric is dichotomous (i.e., high versus low). (score = 2) Low • Study does not summarize mean and variation in exposure levels for any groups. (score = 3) AND/OR • Study does not present discussion of sources of variability. Unacceptable • Not applicable. This metric is dichotomous (i.e., high versus low). 
(score = 4) Not rated/applicable Reviewer's Comments: [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 18. Uncertainties High • Key uncertainties, limitations, and data gaps are recognized and discussed (e.g., (score = 1) those related to inherent variability in environmental and exposure-related parameters or possible measurement errors). AND • The uncertainties are minimal. Medium • Not applicable. This metric is dichotomous (i.e., high versus low). (score = 2) Low • Key uncertainties, limitations, or data gaps are not recognized or discussed. (score = 3) AND/OR • Estimates are highly uncertain. Unacceptable • Not applicable. This metric is dichotomous (i.e., high versus low). (score = 4) Not rated/applicable Reviewer's Comments: [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] 129 ------- E.6.5 Experimental Data Table E-14. Serious Flaws that Would Make Sources of Experimental Data Unacceptable for Use in the Exposure Assessment Optimization of the list of serious flaws may occur after pilot calibration exercises. Domain Metric Description of Serious Flaw(s) in Data Source Reliability Sampling Methodology and Conditions The sampling methodology is not discussed in the data source or companion source. Sampling methodology is not scientifically sound or is not consistent with widely accepted methods/approaches for the chemical and media being analyzed (e.g., inappropriate sampling equipment, improper storage conditions). There are numerous inconsistencies in the reporting of sampling information, resulting in high uncertainty in the sampling methods used. Analytical Methodology Analytical methodology is not described, including analytical instrumentation (i.e., HPLC, GC). Analytical methodology is not scientifically appropriate for the chemical and media being analyzed (e.g., method not sensitive enough, not specific to the chemical, out of date). There are numerous inconsistencies in the reporting of analytical information, resulting in high uncertainty in the analytical methods used. Selection of Biomarker of Exposure Biomarker in a specified matrix is a poor surrogate (low accuracy and precision) for exposure/dose. Representative Testing Scenario Testing conditions are not relevant to the exposure scenario of interest for the chemical. Sample Size and Variability Sample size is not reported. Single sample collected per data set. For biomonitoring studies, the timing of sample collected is not appropriate based on chemical properties (e.g., half-life), the pharmacokinetics of the chemical (e.g., rate of uptake and elimination), and when the exposure event occurred. Temporality Temporality of tested items is not reported, discussed, or referenced. Accessibility / Clarity Reporting of Results There are numerous inconsistencies or errors in the calculation and/or reporting of results, resulting in highly uncertain reported results. Quality Assurance QA/QC issues have been identified which significantly interfere with the overall reliability of the study. Variability and Uncertainty Variability and Uncertainty Estimates are highly uncertain based on characterization of variability and uncertainty. Notes: GC = Gas chromatography HPLC = High pressure liquid chromatography QA/QC = Quality assurance/quality control 130 ------- Table E-15. 
Evaluation Criteria for Sources of Experimental Data Confidence Level (Score) Metric Description Selected Score Domain 1. Reliability Metric 1. Sampling Methodology and Conditions High (score = 1) • Samples were collected according to publicly available SOPs, methods, protocols, or test guidelines that are scientifically sound and widely accepted from a source generally known to use sound methods and/or approaches such as EPA, NIST, ASTM, ISO, and ACGIH. OR • The sampling protocol used was not a publicly available SOP from a source generally known to use sound methods and/or approaches, but the sampling methodology is clear, appropriate (i.e., scientifically sound), and similar to widely accepted protocols for the chemical and media of interest. All pertinent sampling information is provided in the data source or companion source. Examples include: > sampling conditions (e.g., temperature, humidity) > sampling equipment and procedures > sample storage conditions/duration > performance/calibration of sampler Medium (score = 2) • Sampling methodology is discussed in the data source or companion source and is generally appropriate (i.e., scientifically sound) for the chemical and media of interest; however, one or more pieces of sampling information is not described. The missing information is unlikely to have a substantial impact on results. OR • Standards, methods, protocols, or test guidelines may not be widely accepted, but a successful validation study for the new/unconventional procedure was conducted prior to the sampling event and is consistent with sound scientific theory and/or accepted approaches. Low (score = 3) • Sampling methodology is only briefly discussed; therefore, most sampling information is missing and likely to have a substantial impact on results. AND/OR • The sampling methodology does not represent best sampling methods, protocols, or guidelines for the chemical and media of interest (e.g., outdated (but still valid) sampling equipment or procedures, long storage durations). AND/OR • There are some inconsistencies in the reporting of sampling information (e.g., differences between text and tables in data source, differences between standard method and actual procedures reported to have been used, etc.) which lead to a low confidence in the sampling methodology used. Unacceptable (score = 4) • The sampling methodology is not discussed in the data source or companion source. AND/OR • Sampling methodology is not scientifically sound or is not consistent with widely accepted methods/approaches for the chemical and media being analyzed (e.g., inappropriate sampling equipment, improper storage conditions). AND/OR • There are numerous inconsistencies in the reporting of sampling information, resulting in high uncertainty in the sampling methods used. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 2. Analytical Methodology High (score = 1) • Samples were analyzed according to publicly available analytical methods that are scientifically sound and widely accepted (i.e., from a source generally using sound methods and/or approaches) and are appropriate for the chemical and media of interest. Examples include EPA SW-846 Methods, NIOSH Manual of Analytical Methods 5th Edition, etc.
OR • The analytical method used was not a publically available method from a source generally known to use sound methods and/or approaches, but the methodology is clear and appropriate (i.e., scientifically sound) and similar to widely accepted protocols for the chemical and media of interest. All pertinent sampling information is provided in the data source or companion source. Examples include: > extraction method > analytical instrumentation (required) > instrument calibration > LOQ, LOD, detection limits, and/or reporting limits > recovery samples > biomarker used (if applicable) > matrix-adjustment method (i.e., creatinine, lipid, moisture) Medium (score = 2) • Analytical methodology is discussed in detail and is clear and appropriate (i.e., scientifically sound) for the chemical and media of interest; however, one or more pieces of analytical information is not described. The missing information is unlikely to have a substantial impact on results. AND/OR • The analytical method may not be standard/widely accepted, but a method validation study was conducted prior to sample analysis and is expected to be consistent with sound scientific theory and/or accepted approaches. AND/OR • Samples were collected at a site and immediately analyzed using an on-site mobile laboratory, rather than shipped to a stationary laboratory. Low (score = 3) • Analytical methodology is only briefly discussed. Analytical instrumentation is provided and consistent with accepted analytical instrumentation/methods. However, most analytical information is missing and likely to have a substantial impact on results. AND/OR • Analytical method is not standard/widely accepted, and method validation is limited or not available. AND/OR • Samples were analyzed using field screening techniques. AND/OR • LOQ, LOD, detection limits, and/or reporting limits not reported. AND/OR • There are some inconsistencies or possible errors in the reporting of analytical information (e.g., differences between text and tables in data source, differences between standard method and actual procedures reported to have 132 ------- Confidence Level (Score) Metric Description Selected Score been used, etc.) which leads to a lower confidence in the method used. Unacceptable (score = 4) • Analytical methodology is not described, including analytical instrumentation (i.e., HPLC, GC). AND/OR • Analytical methodology is not scientifically appropriate for the chemical and media being analyzed (e.g., method not sensitive enough, not specific to the chemical, out of date). AND/OR • There are numerous inconsistencies in the reporting of analytical information, resulting in high uncertainty in the analytical methods used. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 3. Selection of Biomarker of Exposure High (score = 1) • Biomarker in a specified matrix is known to have an accurate and precise quantitative relationship with external exposure, internal dose, or target dose (e.g., previous studies (or the current study) have indicated the biomarker of interest reflects external exposures). AND • Biomarker (parent chemical or metabolite) is derived from exposure to the chemical of interest. Medium (score = 2) • Biomarker in a specified matrix has accurate and precise quantitative relationship with external exposure, internal dose, or target dose. 
AND • Biomarker is derived from multiple parent chemicals, not only the chemical of interest, but there is a stated method to apportion the estimate to only the chemical of interest Low (score = 3) • Biomarker in a specified matrix has accurate and precise quantitative relationship with external exposure, internal dose, or target dose. AND • Biomarker is derived from multiple parent chemicals, not only the chemical of interest, and there is NOT a stated method to apportion the estimate to only the chemical of interest. Unacceptable (score = 4) • Biomarker in a specified matrix is a poor surrogate (low accuracy and precision) for exposure/dose. Not rated/applicable • Metric is not applicable to the data source. Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 2. Representative Metric 4. Testing Scenario High (score = 1) • Testing conditions closely represent relevant exposure scenarios (i.e., population/scenario/media of interest). Examples include: > amount and type of chemical / product used > source of exposure/test substance 133 ------- Confidence Level (Score) Metric Description Selected Score > method of application or by-stander exposure > use of exposure controls > microenvironment (location, time, climate, temperature, humidity, pressure, airflow) AND • Testing conducted under a broad range of conditions for factors such as temperature, humidity, pressure, airflow, and chemical mass / weight fraction (if appropriate). Medium • The data likely represent the relevant exposure scenario (i.e., (score = 2) population/scenario/media of interest). One or more key pieces of information may not be described but the deficiencies are unlikely to have a substantial impact on the characterization of the exposure scenario. AND/OR • If surrogate data, activities seem similar to the activities within scope. Low • The data lack multiple key pieces of information and the deficiencies are likely to (score = 3) have a substantial impact on the characterization of the exposure scenario. AND/OR • There are some inconsistencies or possible errors in the reporting of scenario information (e.g., differences between text and tables in data source, differences between standard method and actual procedures reported to have been used, etc.) which leads to a lower confidence in the scenario assessed. AND/OR • If surrogate data, activities have lesser similarity but are still potentially applicable to the activities within scope. AND/OR • Testing conducted under a single set of conditions. Unacceptable • Testing conditions are not relevant to the exposure scenario of interest for the (score = 4) chemical. Not rated/applicable Reviewer's [Document concerns, uncertainties, limitations, and deficiencies and any comments additional comments that may highlight study strengths or important elements such as relevance] Metric 5. Sample Size and Variability High • Sample size is reported and large enough (i.e., > 10 samples) to be reasonably (score = 1) assured that the samples represent the scenario of interest. AND • Replicate tests performed and variability across tests is characterized (if appropriate). Medium • Sample size is moderate (i.e., 5 to 10 samples), thus the data are likely to (score = 2) represent the scenario of interest. AND • Replicate tests performed and variability across tests is characterized (if appropriate). 
Low • Sample size is small (i.e., <5 samples), thus the data are likely to poorly represent (score = 3) the scenario of interest. AND/OR • Replicate tests were not performed. Unacceptable • Sample size is not reported. 134 ------- Confidence Level (Score) Metric Description Selected Score (score = 4) AND/OR • Single sample collected per data set. AND/OR • For biomonitoring studies, the timing of sample collected is not appropriate based on chemical properties (e.g., half-life), the pharmacokinetics of the chemical (e.g., rate of uptake and elimination), and when the exposure event occurred. Not rated/applicable • Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 6. Temporality High (score = 1) • Source(s) of tested items appears to be current (within 5 years). Medium (score = 2) • Source(s) of tested items is less consistent with when current or recent exposures (>5 to 15 years) are expected. Low (score = 3) • Source(s) of tested items is not consistent with when current or recent exposures (>15 years) are expected or is not identified. Unacceptable (score = 4) • Temporality of tested items is not reported, discussed, or referenced. Not rated/applicable • Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 3. Accessibility / Clarity Metric 7. Reporting of Results High (score = 1) • Supplementary or raw data (i.e., individual data points) are reported, allowing summary statistics to be calculated or reproduced. AND • Summary statistics are detailed and complete. Example parameters include: > Description of data set summarized (i.e., location, population, dates, etc.) > Range of concentrations or percentiles > Number of samples in data set > Frequency of detection > Measure of variation (CV, standard deviation) > Measure of central tendency (mean, geometric mean, median) > Test for outliers (if applicable) AND • Both adjusted and unadjusted results are provided (i.e., correction for void completeness in urine biomonitoring, whole-volume or lipid adjusted for blood biomonitoring) [only if applicable]. Medium (score = 2) • Supplementary or raw data (i.e., individual data points) are not reported, and therefore summary statistics cannot be reproduced. AND/OR • Summary statistics are reported but are missing one or more parameters (see description for high). 135 ------- Confidence Level (Score) Metric Description Selected Score AND/OR • Only adjusted or unadjusted results are provided, but not both [only if applicable]. Low (score = 3) • Supplementary data are not provided, and summary statistics are missing most parameters (see description for high). AND/OR • There are some inconsistencies or errors in the results reported, resulting in low confidence in the results reported (e.g., differences between text and tables in data source, less appropriate statistical methods). Unacceptable (score = 4) There are numerous inconsistencies or errors in the calculation and/or reporting of results, resulting in highly uncertain reported results. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 8. 
Quality Assurance High (score = 1) • The study applied quality assurance/quality control measures and all pertinent quality assurance information is provided in the data source or companion source. Examples include: > Laboratory and/or storage recoveries > Laboratory control samples > Baseline (pre-exposure) samples > Biomarker stability > Completeness of sample (i.e., creatinine, specific gravity, osmolality for urine samples) AND • No quality control issues were identified or any identified issues were minor and adequately addressed (i.e., correction for low recoveries, correction for completeness). Medium (score = 2) • The study applied and documented quality assurance/quality control measures; however, one or more pieces of QA/QC information is not described. Missing information is unlikely to have a substantial impact on results. AND • No quality control issues were identified or any identified issues were minor and addressed (i.e., correction for low recoveries, correction for completeness). Low (score = 3) • Quality assurance/quality control techniques and results were not directly discussed, but can be implied through the study's use of standard field and laboratory protocols. AND/OR • Deficiencies were noted in quality assurance/quality control measures that are likely to have a substantial impact on results. AND/OR • There are some inconsistencies in the quality assurance measures reported, resulting in low confidence in the quality assurance/control measures taken and results (e.g., differences between text and tables in data source). Unacceptable (score = 4) • QA/QC issues have been identified which significantly interfere with the overall reliability of the study. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 4. Variability and Uncertainty Metric 9. Variability and Uncertainty High (score = 1) • The study characterizes variability in the population/media studied. AND • Key uncertainties, limitations, and data gaps have been identified. AND • The uncertainties are minimal and have been characterized. Medium (score = 2) • The study has limited characterization of variability in the population/media studied. AND/OR • The study has limited discussion of key uncertainties, limitations, and data gaps. AND/OR • Multiple uncertainties have been identified, but are unlikely to have a substantial impact on results. Low (score = 3) • The characterization of variability is absent. AND/OR • Key uncertainties, limitations, and data gaps are not discussed. AND/OR • Uncertainties identified may have a substantial impact on the exposure assessment. Unacceptable (score = 4) • Estimates are highly uncertain based on characterization of variability and uncertainty.
Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Notes: ACGIH = American Conference of Governmental Industrial Hygienists ASTM = American Society for Testing and Materials CV = Coefficient of variation GC = Gas chromatography HPLC = High pressure liquid chromatography ISO = International Organization for Standardization LOD = Limit of detection LOQ = Limit of quantitation NIOSH = National Institute for Occupational Safety and Health NIST = National Institute of Standards and Technology QA/QC = Quality assurance/quality control SOPs = Standard operating procedures E.6.6 Database Data Table E-18. Serious Flaws that Would Make Sources of Database Data Unacceptable for Use in the Exposure Assessment Optimization of the list of serious flaws may occur after pilot calibration exercises. Domain Metric Description of Serious Flaw(s) in Data Source Reliability Sampling methodology The sampling methodologies used were not appropriate for the chemical/media of interest in the database (e.g., inappropriate sampling equipment, improper storage conditions). Analytical methodology The analytical methodologies used were not appropriate for the chemical/media of interest in the database (e.g., method not sensitive enough, not specific to the chemical, out of date). Representative Geographic Area Geographic location of sampling data within database is not reported, discussed, or referenced. Temporal Timing of sample data is not reported, discussed, or referenced. Exposure Scenario Data provided in the database are not representative of the media or population of interest. Accessibility / Clarity Availability of Database and Supporting Documents No information is provided on the database source or availability to the public. Reporting Results There are numerous inconsistencies or errors in the calculation and/or reporting of results, resulting in highly uncertain reported results. The information source reporting the analysis of the database data is missing key sections or lacks enough organization and clarity to locate and extract necessary information. Variability and Uncertainty Variability and Uncertainty Estimates are highly uncertain based on characterization of variability and uncertainty. Table E-19. Evaluation Criteria for Sources of Database Data Confidence Level (Score) Description Selected Score Domain 1. Reliability Metric 1. Sampling methodology High (score = 1) • Widely accepted sampling methodologies (i.e., from a source generally using sound methods and/or approaches) were used to generate the data presented in the database. Example SOPs include USGS's "National Field Manual for the Collection of Water-Quality Data", EPA's "Ambient Air Sampling" (SESDPROC-303-R5), etc. Medium (score = 2) • The sampling methodologies were consistent with sound scientific theory and/or accepted approaches based on the reported sampling information, but may not have followed published procedures from a source generally known to use sound methods and/or approaches. Low (score = 3) • The sampling methodology was not reported in data source or companion data source. Unacceptable (score = 4) • The sampling methodologies used were not appropriate for the chemical/media of interest in the database (e.g., inappropriate sampling equipment, improper storage conditions).
Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 2. Analytical methodology High (score = 1) • Widely accepted analytical methodologies (i.e., from a source generally using sound methods and/or approaches) were used to generate the data presented in the database. Example SOPs include EPASW-846 Methods, NIOSH Manual of Analytical Methods 5th Edition, etc. Medium (score = 2) • The analytical methodologies were consistent with sound scientific theory and/or accepted approaches based on the reported analytical information, but may not have followed published procedures from a source generally known to use sound methods and/or approaches. Low (score = 3) • The analytical methodology was not reported in data source or companion data source. Unacceptable (score = 4) • The analytical methodologies used were not appropriate for the chemical/media of interest in the database (e.g., method not sensitive enough, not specific to the chemical, out of date). Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 2. Representative Metric 3. Geographic Area High (score = 1) • Geographic location(s) is reported, discussed, or referenced. Medium (score = 2) • Not applicable. This metric is dichotomous (i.e., high versus unacceptable). Low • Not applicable. This metric is dichotomous (i.e., high versus unacceptable). 139 ------- Confidence Level (Score) Description Selected Score (score = 3) Unacceptable (score = 4) • Geographic location is not reported, discussed, or referenced. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 4. Temporal High (score = 1) • The data reflect current conditions (within 5 years); and/or • Database contains robust historical data for spatial and temporal analyses (if applicable). Medium (score = 2) • The data are less consistent with current or recent exposures (>5 to 15 years); and/or • Database contains sufficient historical data for spatial and temporal analyses (if applicable). Low (score = 3) • Data are not consistent with when current exposures (>15 years old) may be expected; and/or • Database does not contain enough historical data for spatial and temporal analyses (if applicable). Unacceptable (score = 4) • Timing of sample data is not reported, discussed, or referenced. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 5. Exposure Scenario High (score = 1) • The data closely represent relevant exposure scenario (i.e., the population/scenario/media of interest). Examples include: > amount and type of chemical / product used > source of exposure > method of application or by-stander exposure > use of exposure controls • microenvironment (location, time, climate) Medium (score = 2) • The data likely represent the relevant exposure scenario (i.e., population/scenario/media of interest). 
One or more key pieces of information may not be described but the deficiencies are unlikely to have a substantial impact on the characterization of the exposure scenario. AND/OR • If surrogate data, activities seem similar to the activities within scope. Low (score = 3) • The data lack multiple key pieces of information and the deficiencies are likely to have a substantial impact on the characterization of the exposure scenario. AND/OR • There are some inconsistencies or possible errors in the reporting of scenario information (e.g., differences between text and tables in data source, differences between standard method and actual procedures reported to have been used, etc.) which leads to a lower confidence in the scenario assessed. AND/OR • If surrogate data, activities have lesser similarity but are still potentially applicable to the activities within scope. Unacceptable (score = 4) • If reported, the exposure scenario discussed in the monitored study does not represent the exposure scenario of interest for the chemical. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 3. Accessibility / Clarity Metric 6. Availability of Database and Supporting Documents High (score = 1) • Database is widely accepted and/or from a source generally known to use sound methods and/or approaches (e.g., NHANES, STORET). Medium (score = 2) • The database may not be widely known or accepted (e.g., state-maintained databases), but the database is adequately documented with the following information: > Within the database, metadata is present (sample identifiers, annotations, flags, units, matrix descriptions, etc.) and data fields are generally clear and defined. > A user manual or other supporting documentation is available, or there is sufficient documentation in the data source or companion source. > Database quality assurance and data quality control measures are defined and/or a QA/QC protocol was followed. Low (score = 3) • The database may not be widely known or accepted and only limited database documentation is available (see the medium rating). Unacceptable (score = 4) • No information is provided on the database source or availability to the public. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 7. Reporting of Results High (score = 1) • The information source reporting the analysis of the database data is well organized and understandable by the target audience. AND • Summary statistics in the data source are detailed and complete. Example parameters include: > Description of data set summarized (i.e., location, population, dates, etc.) > Range of concentrations or percentiles > Number of samples in data set > Frequency of detection > Measure of variation (CV, standard deviation) > Measure of central tendency (mean, geometric mean, median) > Test for outliers (if applicable) Medium (score = 2) • The information source reporting the analysis of the database data is well organized and understandable by the target audience. AND • Summary statistics are missing one or more parameters (see description for high).
Low (score = 3) • The information source reporting the analysis of the database data is unclear or not well organized. AND/OR • Summary statistics are missing most parameters (see description for high). AND/OR • There are some inconsistencies or errors in the results reported, resulting in low confidence in the results reported (e.g., differences between text and tables in data source, less appropriate statistical methods). Unacceptable (score = 4) • There are numerous inconsistencies or errors in the calculation and/or reporting of results, resulting in highly uncertain reported results. AND/OR • The information source reporting the analysis of the database data is missing key sections or lacks enough organization and clarity to locate and extract necessary information. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 4. Variability and Uncertainty Metric 8. Variability and Uncertainty High (score = 1) • Key uncertainties, limitations, and data gaps have been identified. AND • The uncertainties are minimal and have been characterized. Medium (score = 2) • The study has limited discussion of key uncertainties, limitations, and data gaps. AND/OR • Multiple uncertainties have been identified, but are unlikely to have a substantial impact on results. Low (score = 3) • Key uncertainties, limitations, and data gaps are not discussed. AND/OR • Uncertainties identified may have a substantial impact on the exposure assessment. Unacceptable (score = 4) • Estimates are highly uncertain based on characterization of variability and uncertainty. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Notes: CV = Coefficient of variation NHANES = National Health and Nutrition Examination Survey NIOSH = National Institute for Occupational Safety and Health QA/QC = Quality assurance/quality control SOPs = Standard operating procedures STORET = Storage and Retrieval for Water Quality Data database USGS = U.S. Geological Survey E.6.7 Completed Exposure Assessments and Risk Characterizations Table E-16. List of Serious Flaws that Would Make Completed Exposure Assessments and Risk Characterizations Unacceptable for Use in the Exposure Assessment Optimization of the list of serious flaws may occur after pilot calibration exercises. Domain Metric Description of Serious Flaw(s) in Data Source Reliability Methodology The assessment uses techniques that are not appropriate (e.g., inappropriate assumptions, models not within domain of the exposure scenario, etc.). Assumptions, extrapolations, measurements, and models are not described. There appear to be mathematical errors or errors in logic which significantly interfere with the overall reliability of the study. Representative Exposure Scenario If reported, the exposure scenario discussed in the monitored study does not represent the exposure scenario of interest for the chemical. Surrogate data, if available, are not similar enough to the chemical and use of interest to be used. Accessibility / Clarity Documentation of References The reported data, inputs, and defaults are not documented or only sparsely documented.
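For orientation only: the metric ratings defined in the criteria tables of this appendix are ultimately combined into an overall data quality level for each source. The following minimal sketch illustrates one way such a roll-up could be coded; the equal weights, the 1.7/2.3 cutoff values, and the metric names in the example are assumptions made for illustration, not EPA's method. The actual weighting factors and overall quality-level definitions are given in Sections E.4.1 and E.4.2 and Table A-1.

```python
# Illustrative sketch only, with assumed equal weights and assumed cutoff values.
# The actual weighting factors and overall quality-level bands are defined in
# Sections E.4.1-E.4.2 and Table A-1 of this document, not here.


def overall_quality(metric_ratings, weights=None, cutoffs=(1.7, 2.3)):
    """Combine metric ratings (1 = High ... 4 = Unacceptable) into an overall level.

    metric_ratings: dict of metric name -> rating, or None for "Not rated/applicable".
    weights: dict of metric name -> weighting factor (defaults to equal weights).
    cutoffs: assumed boundaries between High/Medium and Medium/Low average scores.
    """
    rated = {name: r for name, r in metric_ratings.items() if r is not None}
    if not rated:
        return "Not rated"
    if any(r == 4 for r in rated.values()):
        return "Unacceptable"  # a serious flaw excludes the source from use
    if weights is None:
        weights = {name: 1.0 for name in rated}
    score = sum(weights[n] * r for n, r in rated.items()) / sum(weights[n] for n in rated)
    if score < cutoffs[0]:
        return "High"
    return "Medium" if score < cutoffs[1] else "Low"


# Hypothetical reviewer ratings for a completed exposure assessment:
print(overall_quality({
    "Methodology": 1,
    "Exposure Scenario": 2,
    "Documentation of References": 1,
    "Variability and Uncertainty": None,  # not rated
}))  # -> "High" under the assumptions above
```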
Variability and Uncertainty Variability and Uncertainty Estimates are highly uncertain based on characterization of variability and uncertainty. Table E-17. Evaluation Criteria for Completed Exposure Assessments and Risk Characterizations Confidence Level (Score) Description Selected Score Domain 1. Reliability Metric 1. Methodology High (score = 1) • The assessment uses technical approaches that are generally accepted by the scientific community. AND • Assumptions, extrapolations, measurements, and models have been documented and described. AND • There are no mathematical errors or errors in logic. Medium (score = 2) • The assessment uses techniques that are from reliable sources and are generally accepted by the scientific community; however, a discussion of assumptions, extrapolations, measurements, and models is limited. Low (score = 3) • The assessment uses techniques that may not be generally accepted by the scientific community. AND/OR 143 ------- Confidence Level (Score) Description Selected Score • There is only a brief discussion of assumptions, extrapolations, measurements, and models, or some components may be missing. AND/OR * There are some mathematical errors or errors in logic. Unacceptable (score = 4) • The assessment uses techniques that are not appropriate (e.g., inappropriate assumptions, models not within domain of the exposure scenario, etc.) AND/OR • Assumptions, extrapolations, measurements, and models are not described. AND/OR • There appears to be mathematical errors or errors in logic which significantly interfere with the overall reliability of the study. Not rated/applicable Reviewer's Comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 2. Representative Metric 2. Exposure Scenario High (score = 1) • The data (media concentrations, doses, estimated values, exposure factors) closely represent exposure scenarios of interest. Examples include: > geography > temporality > chemical/use of interest Medium (score = 2) • The exposure activity assessed likely represents the population/scenario/media of interest; however, one or more key pieces of information may not be described. OR • If surrogate data, activities seem similar to the activities within scope. Low (score = 3) • The study lacks multiple key pieces of information and the deficiencies are likely to have a substantial impact on the characterization of the exposure scenario. AND/OR • There are some inconsistencies or possible errors in the reporting of scenario information (e.g., differences between text and tables in data source, differences between standard method and actual procedures reported to have been used, etc.) which leads to a lower confidence in the scenario assessed. AND/OR • If surrogate data, activities have lesser similarity but are still potentially applicable to the activities within scope. Unacceptable (score = 4) • If reported, the exposure scenario discussed in the monitored study does not represent the exposure scenario of interest for the chemical. AND/OR • Surrogate data, if available, are not similar enough to the chemical and use of interest to be used. Not rated/applicable Reviewer's Comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] 144 ------- Confidence Level (Score) Description Selected Score Domain 3. 
Accessibility / Clarity Metric 3. Documentation of References High (score = 1) • References are available for all reported data, inputs, and defaults. AND • References generally appear to be from publicly available and peer reviewed sources. Medium (score = 2) • References are available for all reported data, inputs, and defaults; however, some references may not be publicly available or are not from peer reviewed sources (i.e., professional judgment, personal communication). Low (score = 3) • Numerous references for reported data, inputs, and defaults appear to be missing or there are discrepancies with the references. AND/OR • Numerous references may not be publicly available or are not from peer reviewed sources (i.e., professional judgment or personal communication). Unacceptable (score = 4) • The reported data, inputs, and defaults are not documented or only sparsely documented. Not rated/applicable Reviewer's Comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 4. Variability and Uncertainty Metric 4. Variability and Uncertainty High (score = 1) • The study characterizes variability in the population/media studied. AND • Key uncertainties, limitations, and data gaps have been identified. AND • The uncertainties are minimal and have been characterized. Medium (score = 2) • The study has limited characterization of variability in the population/media studied. AND/OR • The study has limited discussion of key uncertainties, limitations, and data gaps. AND/OR • Multiple uncertainties have been identified, but are unlikely to have a substantial impact on results. Low (score = 3) • The characterization of variability is absent. AND/OR • Key uncertainties, limitations, and data gaps are not discussed. AND/OR • Uncertainties identified may have a substantial impact on the exposure assessment. Unacceptable (score = 4) • Estimates are highly uncertain based on characterization of variability and uncertainty. Not rated/applicable Reviewer's Comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] 145 ------- E.7 References 1. ECHA. (2011). Guidance on information requirements and chemical safety assessment. (ECHA-2011-G-13-EN). https://hero.epa.gov/heronet/index.cfm/reference/download/reference_id/4262842. 2. NRC. (1991). Environmental Epidemiology, Volume 1: Public Health and Hazardous Wastes. Washington, DC: The National Academies Press. https://hero.epa.gov/heronet/index.cfm/reference/download/reference_id/4262908. 3. U.S. EPA. (2009). Guidance on the Development, Evaluation, and Application of Environmental Models. (EPA/100/K-09/003). Washington, DC: Office of the Science Advisor. https://hero.epa.gov/heronet/index.cfm/reference/download/reference_id/4262976. 146 ------- APPENDIX F: DATA QUALITY CRITERIA FOR ECOLOGICAL HAZARD STUDIES F.1 Types of Data Sources The data quality will be evaluated for a variety of ecological hazard studies (Table F-1). Since the availability of information varies considerably across chemicals, it is anticipated that some ecological hazard studies will not be available, while others may be identified beyond those listed in Table F-1. Table F-1.
Study Types that Provide Ecological Hazard Data Data Category: Ecological Hazard. Types of Data Sources: Acute and chronic toxicity to aquatic invertebrates and fish (e.g., freshwater, saltwater, and sediment-based exposures); toxicity to algae, cyanobacteria, and other microorganisms; toxicity to terrestrial invertebrates; acute oral toxicity to birds; toxicity to reproduction of birds; toxicity to terrestrial plants; toxicity to mammalian wildlife. F.2 Data Quality Evaluation Domains The methods for evaluation of study quality were developed after review of selected existing processes and references describing existing study quality and risk of bias evaluation tools for toxicity studies, including the Criteria for Reporting and Evaluating Ecotoxicity Data (CRED) and the ECOTOX knowledgebase (ECOTOX) (EC, 2018; Cooper et al., 2016; Lynch et al., 2016; Moermond et al., 2016b; Samuel et al., 2016; NTP, 2015a; Hooijmans et al., 2014; Koustas et al., 2014; Kushman et al., 2013; Hartling et al., 2012; Hooijmans et al., 2010). These publications, coupled with professional judgment and experience, informed the identification of domains and metrics for consideration in the evaluation and scoring of study quality. The evaluation domains and criteria were developed by harmonizing criteria across existing processes, including the CRED and ECOTOX processes. Furthermore, the evaluation tool is intended to address elements of TSCA Science Standards 26(h)(1) through 26(h)(5) that EPA must address during the development process of the risk evaluations. Ecological hazard studies will be evaluated for data quality by assessing the following seven domains: Test Substance, Test Design, Exposure Characterization, Test Organisms, Outcome Assessment, Confounding/Variable Control, and Data Presentation and Analysis. The data quality within each domain will be evaluated by assessing unique metrics that pertain to each domain. For example, the Test Substance domain will be evaluated by considering the information reported by the study on the test substance identity, purity, and source. The domains are defined in Table F-2 and further information on evaluation metrics is provided in section F.3. 147 ------- Table F-2. Data Evaluation Domains and Definitions Evaluation Domain Definition Test Substance Metrics in this domain evaluate whether the information provided in the study provides a reliable a confirmation that the test substance used in a study has the same (or sufficiently similar) identity, purity, and properties as the substance of interest. Test Design Metrics in this domain evaluate whether the experimental design enables the study to distinguish the effect of exposure from other factors. This domain includes metrics related to the use of control groups and randomization in allocation to ensure that the effect of exposure is isolated. Exposure Characterization Metrics in this domain assess the validity and reliability of methods used to measure or characterize exposure. These metrics evaluate whether exposure to the test substance was characterized using a method(s) that provides valid and reliable results, whether the exposure remained consistent over the duration of the experiment, and whether the exposure levels were appropriate to the outcome of interest. Test Organisms These metrics assess the appropriateness of the population or organism(s), the number of organisms used in the study, and the organism conditions to assess the outcome of interest associated with the exposure of interest.
Outcome Assessment Metrics in this domain assess the validity and reliability of methods, including sensitivity of methods, that are used to measure or otherwise characterize the outcome (e.g., immobilization as a measure of mortality in aquatic invertebrates). Confounding/Variable Control Metrics in this domain assess the potential impact of factors other than exposure that may affect the risk of outcome. The metrics evaluate whether studies identify and account for factors that are related to exposure and independently related to outcome (confounding factors) and whether appropriate experimental or analytical (statistical) methods are used to control for factors unrelated to exposure that may affect the risk of outcome (variable control). Data Presentation and Analysis Metrics in this domain assess whether appropriate statistical methods were used and if data for all outcomes are presented. Other Metrics in this domain are added as needed to incorporate chemical- or study-specific evaluations. Note: a Reliability is defined as "the inherent property of a study or data, which includes the use of well-founded scientific approaches, the avoidance of bias within the study or data collection design and faithful study or data collection conduct and documentation" (ECHA, 2011b). F.3 Data Quality Evaluation Metrics The data quality evaluation domains will be evaluated by assessing unique metrics that have been developed for ecological hazard studies. Each metric will be binned into a confidence level of high, medium, low, or unacceptable. Each confidence level is assigned a numerical score (i.e., 1 through 4) that is used in the method of assessing the overall quality of the study. Table F-3 lists the data evaluation domains and metrics for ecological hazard studies. Each domain has between 2 and 6 metrics; however, some metrics may not apply to all study types. 148 ------- A general domain for other considerations is available for metrics that are specific to a given test substance or study type. EPA/OPPT may modify the metrics used for ecological hazard studies as the Agency acquires experience with the evaluation tool. Any modifications will be documented. Confidence level specifications for each metric are provided in Table F-9. Table F-8 summarizes the serious flaws that would make ecological hazard studies unacceptable for use in the assessment.
Table F-3. Data Evaluation Domains and Metrics for Ecological Hazard Studies
Columns: Evaluation Domain | Overall Number of Metrics | Metrics (Metric Number and Description)
Test Substance | 3 | Metric 1: Test Substance Identity • Metric 2: Test Substance Source • Metric 3: Test Substance Purity
Test Design | 3 | Metric 4: Negative Controls • Metric 5: Negative Control Response • Metric 6: Randomized Allocation
Exposure Characterization | 6 | Metric 7: Experimental System/Test Media Preparation • Metric 8: Consistency of Exposure Administration • Metric 9: Measurement of Test Substance Concentration • Metric 10: Exposure Duration and Frequency • Metric 11: Number of Exposure Groups and Spacing of Exposure Levels • Metric 12: Testing at or Below Solubility Limit
Test Organisms | 4 | Metric 13: Test Organism Characteristics • Metric 14: Acclimatization and Pretreatment Conditions • Metric 15: Number of Organisms and Replicates per Group • Metric 16: Adequacy of Test Conditions
Outcome Assessment | 2 | Metric 17: Outcome Assessment Methodology • Metric 18: Consistency of Outcome Assessment
Confounding/Variable Control | 2 | Metric 19: Confounding Variables in Test Design and Procedures • Metric 20: Outcomes Unrelated to Exposure
Data Presentation and Analysis | 3 | Metric 21: Statistical Methods • Metric 22: Reporting of Data • Metric 23: Explanation of Unexpected Outcomes
149 ------- F.4 Scoring Method and Determination of Overall Data Quality Level Appendix A provides information about the evaluation method that will be applied across the various data/information sources being assessed to support TSCA risk evaluations. This section provides details about the scoring system that will be applied to ecological hazard studies, including the weighting factors assigned to each metric score of each domain. Some metrics will be given greater weights than others if they are regarded as key or critical metrics. Thus, EPA/OPPT will use a weighting approach to reflect that some metrics are more important than others when assessing the overall quality of the data. F.4.1 Weighting Factors Each metric was assigned a weighting factor of 1 or 2, with the higher weighting factor (2) given to metrics deemed critical for the evaluation. In selecting critical metrics, EPA recognized that the relevance of an individual study to the risk analysis for a given substance is determined by its ability to inform hazard characterization and/or exposure-response assessment. Thus, the critical metrics are those that determine how well a study answers these key questions: • Is a change in the outcome demonstrated in the study? • Is the observed change more likely than not attributable to the substance exposure? • At what test substance concentrations does the change occur? EPA/OPPT assigned a weighting factor of 2 to each metric considered critical to answering these questions. Remaining metrics were assigned a weighting factor of 1. Table F-4 identifies the critical metrics (i.e., those assigned a weighting factor of 2) for ecological hazard studies and provides a rationale for selection of each metric. Table F-5 identifies the weighting factors assigned to each metric, and the ranges of possible weighted metric scores for ecological hazard studies.
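For readers who want to work with the weighting scheme programmatically, the following sketch is illustrative only and is not part of the EPA method; the dictionary name and layout are hypothetical. It encodes the 23 metrics from Table F-3 together with the weighting factors described in this section and tabulated in Table F-5 (a factor of 2 for the critical metrics identified in Table F-4, and a factor of 1 for all others).

```python
# Illustrative only: one way to encode the 23 ecological hazard metrics (Table F-3)
# and their weighting factors (Section F.4.1 / Table F-5). The dictionary name is a
# hypothetical convention; 2 marks the critical metrics identified in Table F-4.

ECO_HAZARD_METRIC_WEIGHTS = {
    1: ("Test substance identity", 2),
    2: ("Test substance source", 1),
    3: ("Test substance purity", 1),
    4: ("Negative controls", 2),
    5: ("Negative control response", 1),
    6: ("Randomized allocation", 1),
    7: ("Experimental system/test media preparation", 2),
    8: ("Consistency of exposure administration", 1),
    9: ("Measurement of test substance concentration", 2),
    10: ("Exposure duration and frequency", 1),
    11: ("Number of exposure groups and spacing of exposure levels", 1),
    12: ("Testing at or below solubility limit", 1),
    13: ("Test organism characteristics", 2),
    14: ("Acclimatization and pretreatment conditions", 1),
    15: ("Number of organisms and replicates per group", 1),
    16: ("Adequacy of test conditions", 1),
    17: ("Outcome assessment methodology", 2),
    18: ("Consistency of outcome assessment", 1),
    19: ("Confounding variables in test design and procedures", 2),
    20: ("Outcomes unrelated to exposure", 1),
    21: ("Statistical methods", 1),
    22: ("Reporting of data", 2),
    23: ("Explanation of unexpected outcomes", 1),
}

# When every metric is scored, the weighting factors sum to 31 (8 critical metrics
# weighted 2 plus 15 metrics weighted 1), matching the sum shown in Table F-5.
assert sum(weight for _, weight in ECO_HAZARD_METRIC_WEIGHTS.values()) == 31
```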
F.4.2 Calculation of Overall Study Score A confidence level (1, 2, or 3 for High, Medium, or Low confidence, respectively) is assigned for each relevant metric within each domain. To determine the overall study score, the first step is to multiply the score for each metric (1, 2, or 3 for High, Medium, or Low confidence, respectively) by the appropriate weighting factor (as shown in Table F-5) to obtain a weighted metric score. The weighted metric scores are then summed and divided by the sum of the weighting factors (for all metrics that are scored) to obtain an overall study score between 1 and 3. The equation for calculating the overall score is shown below: Overall Score (range of 1 to 3) = Σ(Metric Score × Weighting Factor) / Σ(Weighting Factors) Some metrics may not be applicable to all study types. Any metrics that are considered to be Not rated/not applicable to the study under evaluation will not be considered in the calculation of the study's overall quality score. These metrics will not be included in the numerator or denominator of the equation above. The overall score will be calculated using only those 150 ------- metrics that receive a numerical score. Scoring examples for ecological hazard studies are given in Tables F-6 and F-7. Studies with any single metric scored as unacceptable (score = 4) will be automatically assigned an overall quality score of 4 (Unacceptable). An unacceptable score means that serious flaws are noted in the domain metric that consequently make the data unusable (or invalid). If a metric is not applicable for a study type, the serious flaws would not be applicable for that metric and would not receive a score. EPA/OPPT plans to use data with an overall quality level of High, Medium, or Low confidence to quantitatively or qualitatively support the risk evaluations, but does not plan to use data rated as Unacceptable. An overall study score will not be calculated when a serious flaw is identified for any metric. If a publication reports more than one study or endpoint, each study and, as needed, each endpoint will be evaluated separately. Detailed tables showing quality criteria for the metrics are provided in Tables F-8 and F-9, including a table that summarizes the serious flaws that would make the data unacceptable for use in the environmental hazard assessment.
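A minimal sketch of the calculation described above is shown below. It is illustrative only; the function name, score encoding, and example inputs are hypothetical and do not represent EPA's implementation. The sketch multiplies each scored metric by its weighting factor, excludes metrics marked not rated/not applicable from both the numerator and the denominator, assigns an overall rating of Unacceptable when any metric scores 4, and bins the resulting score using the cutoffs shown in Table F-5.

```python
from typing import Dict, Optional, Tuple

# Illustrative sketch of the Section F.4.2 calculation (not EPA's implementation).
# Metric scores: 1 = High, 2 = Medium, 3 = Low, 4 = Unacceptable,
# None = not rated/not applicable.

def overall_study_score(scores: Dict[int, Optional[int]],
                        weights: Dict[int, int]) -> Tuple[Optional[float], str]:
    """Return (overall score, overall quality level) for a single study."""
    # Any single metric rated Unacceptable makes the study Unacceptable;
    # no overall score is calculated in that case.
    if any(score == 4 for score in scores.values()):
        return None, "Unacceptable"

    # Metrics that are not rated/not applicable are excluded from both the
    # numerator and the denominator.
    rated = {metric: score for metric, score in scores.items() if score is not None}
    weighted_sum = sum(score * weights[metric] for metric, score in rated.items())
    weight_sum = sum(weights[metric] for metric in rated)
    overall = weighted_sum / weight_sum  # always falls between 1 and 3

    # Bin the overall score using the cutoffs listed in Table F-5.
    if overall < 1.7:
        level = "High"
    elif overall < 2.3:
        level = "Medium"
    else:
        level = "Low"
    return overall, level


# Hypothetical example covering only the Test Substance domain
# (weighting factors as in Table F-5; metric 3 not rated).
example_weights = {1: 2, 2: 1, 3: 1}
example_scores = {1: 1, 2: 2, 3: None}
print(overall_study_score(example_scores, example_weights))  # (1.33..., 'High')
```

As a check against the scoring examples that follow, the sums reported in Table F-6 (weighted metric scores of 49 against weighting factors summing to 31) give 49/31 = 1.6, which falls in the High range, and the Table F-7 example (46/26) gives 1.8, Medium.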
151 ------- Table F-4. Ecological Hazard Metrics with Greater Importance in the Evaluation and Rationale for Selection
Columns: Domain | Critical Metrics with Weighting Factor of 2 (Metric Number) a | Rationale
Test substance | Test substance identity (Metric 1) | The test substance must be identified and characterized definitively to ensure that the study is relevant to the substance of interest.
Test design | Negative controls (Metric 4) | A concurrent negative control is required to ensure that any observed effects are attributable to substance exposure.
Exposure characterization | Experimental test system/test media preparation (Metric 7) | The design of the test system and methods of test media preparation must take into account the physical-chemical properties (e.g., solubility, volatility) and reactivity of the test substance (e.g., hydrolysis, biodegradation, bioaccumulation, adsorption) to ensure confidence in test substance concentrations, which will allow for determination of a concentration-response relationship and enable valid comparisons across studies.
Exposure characterization | Measurement of test substance concentration (Metric 9) b | For test substances that have poor water solubility or are volatile or unstable in the test media, measurement of test substance concentrations is necessary for determination of a concentration-response relationship and to enable valid comparisons across studies.
Test organisms | Test organism characteristics (Metric 13) | The test organism characteristics must be reported to enable assessment of a) whether they are suitable for the endpoint of interest; and b) whether there are species, strain, sex, size, or age/lifestage differences within or between different studies.
Outcome assessment | Outcome assessment methodology (Metric 17) | The methods used for outcome assessment must be fully described, valid, and sensitive to ensure that effects are detected, that observed effects are true, and to enable valid comparisons across studies.
Confounding/variable control | Confounding variables in test design and procedures (Metric 19) | Control for confounding variables in test design and procedures is necessary to ensure that any observed effects are attributable to substance exposure and not to other factors.
Data presentation and analysis | Reporting of data (Metric 22) | Detailed results are necessary to determine if the study authors' conclusions are valid and to determine an exposure-response relationship.
Notes: a A weighting factor of 1 is assigned for the following metrics: test substance source (metric 2); test substance purity (metric 3); negative control response (metric 5); randomized allocation (metric 6); consistency of exposure administration (metric 8); exposure duration and frequency (metric 10); number of exposure groups and spacing of exposure levels (metric 11); testing at or below solubility limit (metric 12); acclimatization and pretreatment conditions (metric 14); number of organisms and replicates per group (metric 15); adequacy of test conditions (metric 16); consistency of outcome assessment (metric 18); outcomes unrelated to exposure (metric 20); statistical methods (metric 21); and explanation of unexpected outcomes (metric 23). b This metric is applicable only to test substances that have poor water solubility or are volatile or unstable in test media.
152 ------- Table F-5. Metric Weighting Factors and Range of Weighted Metric Scores for Ecological Hazard Studies
Columns: Metric Number/Description | Range of Metric Scores a | Metric Weighting Factor | Range of Weighted Metric Scores b
Domain 1: Test substance
1. Test substance identity | 1 to 3 | 2 | 2 to 6
2. Test substance source | 1 to 3 | 1 | 1 to 3
3. Test substance purity | 1 to 3 | 1 | 1 to 3
Domain 2: Test design
4. Negative controls | 1 to 3 | 2 | 2 to 6
5. Negative control response | 1 to 3 | 1 | 1 to 3
6. Randomized allocation | 1 to 3 | 1 | 1 to 3
Domain 3: Exposure characterization
7. Experimental system/test media preparation | 1 to 3 | 2 | 2 to 6
8. Consistency of exposure administration | 1 to 3 | 1 | 1 to 3
9. Measurement of test substance concentration | 1 to 3 | 2 | 2 to 6
10. Exposure duration and frequency | 1 to 3 | 1 | 1 to 3
11. Number of exposure groups and spacing of exposure levels | 1 to 3 | 1 | 1 to 3
12. Testing at or below solubility limit | 1 to 3 | 1 | 1 to 3
Domain 4: Test organisms
13. Test organism characteristics | 1 to 3 | 2 | 2 to 6
14. Acclimatization and pretreatment conditions | 1 to 3 | 1 | 1 to 3
15. Number of organisms and replicates per group | 1 to 3 | 1 | 1 to 3
16. Adequacy of test conditions | 1 to 3 | 1 | 1 to 3
Domain 5: Outcome assessment
17. Outcome assessment methodology | 1 to 3 | 2 | 2 to 6
18. Consistency of outcome assessment | 1 to 3 | 1 | 1 to 3
Domain 6: Confounding/variable control
19. Confounding variables in test design and procedures | 1 to 3 | 2 | 2 to 6
20. Outcomes unrelated to exposure | 1 to 3 | 1 | 1 to 3
Domain 7: Data presentation and analysis
21. Statistical methods | 1 to 3 | 1 | 1 to 3
22. Reporting of data | 1 to 3 | 2 | 2 to 6
23. Explanation of unexpected outcomes | 1 to 3 | 1 | 1 to 3
Sum (if all metrics scored) c: weighting factors = 31; weighted metric scores = 31 to 93
Range of Overall Scores: Overall Score = Sum of Weighted Metric Scores/Sum of Metric Weighting Factors (31/31 = 1; 93/31 = 3), so the overall score ranges from 1 to 3 d. High: ≥1 and <1.7; Medium: ≥1.7 and <2.3; Low: ≥2.3 and ≤3.
Notes: a For the purposes of calculating an overall study score, the range of possible metric scores is 1 to 3 for each metric, corresponding to high and low confidence. No calculations will be conducted if a study receives an "unacceptable" rating (score of 4) for any metric. b The range of weighted scores for each metric is calculated by multiplying the range of metric scores (1 to 3) by the weighting factor for that metric. c The sum of weighting factors and the sum of the weighted scores will differ if some metrics are not scored (not applicable). d The range of possible overall scores is 1 to 3. If a study receives a score of 1 for every metric, then the overall study score will be 1. If a study receives a score of 3 for every metric, then the overall study score will be 3.
153 ------- Table F-6. Scoring Example for an Ecological Hazard Study with All Metrics Scored (columns: Domain, Metric, Metric Score, Metric Weighting Factor)
Test substance: 1. Test substance identity; 2. Test substance source; 3. Test substance purity
Test design: 4. Negative controls; 5. Negative control response; 6. Randomized allocation
Exposure characterization: 7. Experimental system/test media preparation; 8. Consistency of exposure administration; 9. Measurement of test substance concentration; 10. Exposure duration and frequency; 11. Number of exposure groups and spacing of exposure levels; 12. Testing at or below solubility limit
Test organisms: 13. Test organism characteristics; 14. Acclimatization and pretreatment conditions; 15. Number of organisms and replicates per group; 16. Adequacy of test conditions
Outcome assessment: 17. Outcome assessment methodology; 18. Consistency of outcome assessment
Confounding/variable control: 19. Confounding variables in test design and procedures; 20. Outcomes unrelated to exposure
Data presentation and analysis: 21. Statistical methods; 22. Reporting of data; 23. Explanation of unexpected outcomes
Sum of Metric Weighting Factors = 31; Sum of Weighted Metric Scores = 49; Overall Study Score = 49/31 = 1.6 (High)
Overall Score = Sum of Weighted Scores/Sum of Metric Weighting Factors. High: ≥1 and <1.7; Medium: ≥1.7 and <2.3; Low: ≥2.3 and ≤3.
154 ------- Table F-7. Scoring Example for an Ecological Hazard Study with Some Metrics Not Rated/Not Applicable (columns: Domain, Metric, Metric Score, Metric Weighting Factor)
Test substance: 1. Test substance identity; 2. Test substance source; 3. Test substance purity
Test design: 4. Negative controls; 5. Negative control response; 6. Randomized allocation
Exposure characterization: 7. Experimental system/test media preparation; 8. Consistency of exposure administration; 9. Measurement of test substance concentration; 10. Exposure duration and frequency; 11. Number of exposure groups and spacing of exposure levels; 12. Testing at or below solubility limit (metric scores: 2, 1, 1, 1, 1, NR)
Test organisms: 13. Test organism characteristics; 14. Acclimatization and pretreatment conditions; 15. Number of organisms and replicates per group; 16. Adequacy of test conditions (metric scores: 3, 2, 1, NR)
Outcome assessment: 17. Outcome assessment methodology; 18. Consistency of outcome assessment (metric scores: 1, NR)
Confounding/variable control: 19. Confounding variables in test design and procedures; 20.
Outcomes unrelated to exposure 3 NR Data presentation and analysis 21. Statistical methods 22. Reporting of data 23. Explanation of unexpected outcomes 2 1 NR NR= not rated/not applicable Sum Overall Study Score 1.8= Medium 26 46 Overall Score = Sum of Weighted Scores/Sum of Metric Weighting Factor High Medium Low >1 and <1.7 >1.7 and <2.3 >2.3 and <3 155 ------- F.5 Data Quality Criteria Table F-8. Serious Flaws that Would Make Ecological Hazard Studies Unacceptable Optimization of the list of serious flaws may occur after pilot calibration exercises. Domain Metric Description of Serious Flaw(s) in Data Source Test substance identity The test substance identity and form (the latter if applicable) cannot be determined from the information provided (e.g., nomenclature was unclear and CASRN or structure were not reported) OR for mixtures, the components and ratios were not characterized. Test substance The test substance was not obtained from a manufacturer Test substance source OR if synthesized or extracted, analytical verification of the test substance was not conducted. Test substance purity The nature and quantity of reported impurities were such that study results were likely to be due to one or more of the impurities. Negative controls A concurrent negative control group was not included or reported OR the reported negative control group was not appropriate (e.g., age/weight of organisms differed between control and treated groups). Test design Negative control response The biological responses of the negative control groups were not reported OR there was unacceptable variation in biological responses between control replicates. Randomized allocation The study reported using a biased method to allocate organisms to study groups (e.g., each study group consists of organisms from a single brood and the broods differ among study groups). Exposure characterization Experimental system/test media preparation The physical-chemical properties of the test substance required special considerations for preparation and maintenance of test substance concentrations, but no measures were taken to appropriately prepare test concentrations and/or minimize loss of test substance before and during the exposure and/or the use of such measures was not reported. In addition, the test substance concentrations were not measured, thereby preventing characterization of a concentration-response relationship. Consistency of exposure administration Reported information indicated that critical exposure details were inconsistent across study groups and these differences are considered serious flaws that make the study unusable (e.g., for a poorly soluble mixture, a solvent was used for some study groups while a water- accommodated fraction was used for others). 156 ------- Domain Metric Description of Serious Flaw(s) in Data Source Measurement of test substance concentration For test substances that have poor water solubility or are volatile or unstable in test media: Exposure concentrations were not measured and nominal values are highly uncertain due to the nature of the test substance OR exposure concentrations were measured but analytical methods were not appropriate for the test substance resulting in serious uncertainties in measured concentrations (e.g., recovery and/or repeatability were poor). 
Exposure duration and frequency The duration of exposure and/or exposure frequency were not reported OR the reported duration of exposure and/or exposure frequency were not suited to the study type and/or outcome(s) of interest (e.g., study intended to assess effects on reproduction did not expose organisms to test substance for an acceptable period of time prior to mating). Number of exposure groups and spacing of exposure levels The number of exposure groups and spacing of exposure levels were not conducive to the purpose of the study (e.g., the range of concentrations tested was either too high or too low to observe a concentration-response relationship, a LOAEC, NOAEC, LC5o, or EC5o could not be identified) OR no information is provided on the number of exposure groups and spacing of exposure levels. Testing at or below solubility limit All exposure concentrations greatly exceeded the water solubility limit (or dispersibility limit if applicable) and the range of exposure concentrations tested was insufficient to characterize a concentration-response relationship AND/OR the solvent concentration exceeded an appropriate concentration and is likely to have influenced the biological response of the test organisms. Test organisms Test organism characteristics The test organisms were not identified sufficiently or were not appropriate for the evaluation of the specific outcome(s) of interest or were not from an appropriate source (e.g., collected from a polluted field site). Acclimatization and pretreatment conditions There were serious differences in acclimatization and/or pretreatment conditions between control and exposed groups OR organisms were previously exposed to the test substance or other unintended stressors. Number of organisms and replicates per group The number of test organisms and/or replicates was insufficient to characterize toxicological effects and/or provided insufficient power for statistical analysis (e.g., 1-2 organisms/group). 157 ------- Domain Metric Description of Serious Flaw(s) in Data Source Adequacy of test conditions Organism housing and/or environmental conditions and/or food, water, and nutrients and/or biomass loading were not conducive to maintenance of health (e.g., overt signs of handling stress are evident). Outcome assessment Outcome assessment methodology The outcome assessment methodology was not reported OR the reported outcome assessment methodology was not sensitive for the outcome(s) of interest (e.g., in the assessment of reproduction in a chronic daphnid test, offspring were not counted and removed until the end of the test, rather than daily). Consistency of outcome assessment There were large inconsistencies in the execution of study protocols for outcome assessment across study groups OR outcome assessments were not adequately reported for meaningful interpretation of results. Confounding/ variable control Confounding variables in test design and procedures The study reported significant differences among the study groups with respect to environmental conditions (e.g., differences in pH unrelated to the test substance) or other non-treatment-related factors and these prevent meaningful interpretation of the results. Outcomes unrelated to exposure One or more study groups experienced serious test organism attrition or outcomes unrelated to exposure (e.g., infection). 
Data presentation and analysis Statistical methods Statistical methods used were not appropriate (e.g., parametric test for non-normally distributed data) OR statistical analysis was not conducted AND data enabling an independent statistical analysis were not provided. Reporting of data Data presentation was inadequate (e.g., the report does not differentiate among findings in multiple treatment groups) OR major inconsistencies were present in reporting of results. Explanation of unexpected outcomes The occurrence of unexpected outcomes, including, but not limited to, within-study variability and/or variation from historical measures, are considered serious flaws that make the study unusable. 158 ------- Table F-9. Data Quality Criteria for Ecological Hazard Studies Confidence Level (Score) Description Selected Score Domain 1. Test Substance Metric 1. Test substance identity Was the test substance identified definitively (i.e., established nomenclature, CASRN, and/or structure reported, including information on the specific form tested [e.g., valence state] for substances that may vary in form)? If test substance is a mixture, were mixture components and ratios characterized? High (score = 1) The test substance was identified definitively and the specific form was characterized (where applicable). For mixtures, the components and ratios were characterized. Medium (score = 2) The test substance and form (the latter if applicable) were identified and components and ratios of mixtures were characterized, but there were minor uncertainties (e.g., minor characterization details were omitted) that are unlikely to have a substantial impact on results. Low (score = 3) The test substance and form (the latter if applicable) were identified and components and ratios of mixtures were characterized, but there were uncertainties regarding test substance identification or characterization that are likely to have a substantial impact on results. Unacceptable (score = 4) The test substance identity and form (the latter if applicable) cannot be determined from the information provided (e.g., nomenclature was unclear and CASRN or structure were not reported) OR for mixtures, the components and ratios were not characterized. These are serious flaws that make the study unusable. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 2. Test substance source Is the source of the test substance reported, including manufacturer and batch/lot number for materials that may vary in composition? If synthesized or extracted, was test substance identity verified by analytical methods? High (score = 1) The source of the test substance was reported, including manufacturer and batch/lot number for materials that may vary in composition, and its identity was certified by manufacturer and/or verified by analytical methods (e.g., melting point, chemical analysis, etc.). Medium (score = 2) The source of the test substance and/or the analytical verification of a synthesized test substance was reported incompletely, but the omitted details are unlikely to have a substantial impact on results. Low (score = 3) Omitted details on the source of the test substance and/or the analytical verification of a synthesized test substance are likely to have a substantial impact on results. 
Unacceptable (score = 4) The test substance was not obtained from a manufacturer OR if synthesized or extracted, analytical verification of the test substance was not conducted. These are serious flaws that make the study unusable. Not rated/applicable3 Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] 159 ------- Confidence Level (Score) Description Selected Score Metric 3. Test substance purity Was the purity or grade (i.e., analytical, technical) of the test substance reported and adequate to identify its toxicological effects? Were impurities identified? Were impurities present in quantities that could influence the results? High (score = 1) The test substance purity and composition were such that any observed effects were highly likely to be due to the nominal test substance itself (e.g., highly pure or analytical-grade test substance or a formulation comprising primarily inert ingredients with small amount of active ingredient). Medium (score = 2) Minor uncertainties or limitations were identified regarding the test substance purity and composition; however, the purity and composition were such that observed effects were more likely than not due to the nominal test substance, and any identified impurities are unlikely to have a substantial impact on results. Low (score = 3) Purity and/or grade of test substance were not reported or were low enough to have a substantial impact on results (i.e., observed effects may not be due to the nominal test substance). Unacceptable (score = 4) The nature and quantity of reported impurities were such that study results were likely to be due to one or more of the impurities. This is a serious flaw that makes the study unusable. Not rated/applicable3 Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 2. Test Design Metric 4. Negative controls Was an appropriate concurrent negative control group tested? If a vehicle/solvent was used, was a vehicle (solvent) control tested in parallel? High (score = 1) Study authors reported using an appropriate concurrent negative control group (i.e., all conditions equal except chemical exposure). Medium (score = 2) Study authors reported using a concurrent negative control group, but all conditions were not equal to those of treated groups (e.g., untreated control instead of a vehicle control); however, the identified differences are considered to be minor limitations that are unlikely to have a substantial impact on results. Low (score = 3) Study authors acknowledged using a concurrent negative control group, but details regarding the negative control group were not reported, and the lack of details is likely to have a substantial impact on results. Unacceptable (score = 4) A concurrent negative control group was not included or reported OR the reported negative control group was not appropriate (e.g., age/weight of organisms differed between control and treated groups). This is a serious flaw that makes the study unusable. Not rated/applicable3 Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 5. 
Negative control response Were the biological responses (e.g., survival, growth, reproduction, etc.) of the negative control group(s) adequate? High (score = 1) The biological responses (e.g., survival, growth, reproduction, etc.) of the negative control group(s) were adequate (e.g., mortality of control fish <10% in an acute test). 160 ------- Confidence Level (Score) Description Selected Score Medium (score = 2) There were minor uncertainties or limitations regarding the biological responses of the negative control group(s) (e.g., differences in outcome between untreated and solvent controls) that are unlikely to have a substantial impact on results. Low (score = 3) The biological responses of the negative control group(s) were reported, but there were deficiencies regarding the control responses that are likely to have a substantial impact on results (e.g., 30% mortality of control fish in an acute test). Unacceptable (score = 4) The biological responses of the negative control groups were not reported OR there was unacceptable variation in biological responses between control replicates. These are serious flaws that make the study unusable. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 6. Randomized allocation Did the study explicitly report randomized allocation of organisms to study groups? High (score = 1) The study reported that organisms were randomly allocated into study groups (including the control group). Medium (score = 2) The study reported methods of allocation of organisms to study groups, but there were minor limitations in the allocation method (e.g., method with a nonrandom component like assignment to minimize differences in body weight across groups) that are unlikely to have a substantial impact on results. Low (score = 3) Researchers did not report how organisms were allocated to study groups, or there were deficiencies regarding the allocation method that are likely to have a substantial impact on results (e.g., allocation by animal number). Unacceptable (score = 4) The study reported using a biased method to allocate organisms to study groups (e.g., each study group consists of organisms from a single brood and the broods differ among study groups). This is a serious flaw that makes the study unusable. Not rated/applicable3 Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 3. Exposure Characterization Was the experimental system (e.g., static, semi-static, or flow-through regime) described in adequate detail? Were methods for test media preparation appropriate for the test substance, taking into account its physical-chemical properties (e.g., solubility, volatility) and reactivity (e.g., hydrolysis, biodegradation, bioaccumulation, adsorption)? For reactive, volatile, and/or poorly soluble test substances, were adequate measures taken to prepare and maintain test substance concentrations and minimize loss of test substance before and during the exposure? (Based on professional judgment, the reviewer may consider this metric to be not rated/applicable for field and mesocosm studies.) 
High (score = 1) The experimental system and methods for preparation of test media were described in adequate detail and appropriately accounted for the physical- chemical properties of the test substance (e.g., use of closed, static systems with minimal headspace for volatile substances, use of water-accommodated fractions for multi-component substances that are only partially soluble in water, etc.). 161 ------- Confidence Level (Score) Description Selected Score Medium (score = 2) The experimental system and/or test media preparation methods were adequately reported but did not completely account for physical-chemical properties (e.g., period between renewals was greater than the half-life of a test substance that degrades in the system); however, the identified limitations are unlikely to have a substantial impact on results. Low (score = 3) The type of experimental system and/or test media preparation methods were not reported OR the study provided only limited details on the measures taken to appropriately prepare test concentrations and/or minimize loss of test substance before and during the exposure for reactive, volatile, and/or poorly soluble substances AND concentrations of test substance were not measured during the study. Therefore, the deficiencies are likely to have a substantial impact on results. Unacceptable (score = 4) The physical-chemical properties of the test substance required special considerations for preparation and maintenance of test substance concentrations, but no measures were taken to appropriately prepare test concentrations and/or minimize loss of test substance before and during the exposure and/or the use of such measures was not reported. In addition, the test substance concentrations were not measured, thereby preventing characterization of a concentration-response relationship. These are serious flaws that make the study unusable. Not rated/applicable3 Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 8. Consistency of exposure administration Were exposures administered consistently across study groups (e.g., same exposure protocol; same time of day)? High (score = 1) Details of exposure administration were reported and exposures were administered consistently across study groups. Medium (score = 2) Details of exposure administration were reported, but minor inconsistencies in administration of exposures among study groups were identified that are unlikely to have a substantial impact on results (e.g., slightly different solvent concentrations). Low (score = 3) Details of exposure administration were reported, but inconsistencies in administration of exposures among study groups are considered deficiencies that are likely to have a substantial impact on results (e.g., differing periods between renewal for an unstable test substance) OR reporting omissions are likely to have a substantial impact on results. Unacceptable (score = 4) Reported information indicated that critical exposure details were inconsistent across study groups and these differences are considered serious flaws that make the study unusable (e.g., for a poorly soluble mixture, a solvent was used for some study groups while a water-accommodated fraction was used for others). 
Not rated/applicable3 Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] 162 ------- Confidence Level (Score) Description Selected Score Metric 9. Measurement of test substance concentration If test substance has poor water solubility, is volatile or unstable in the test system (e.g., hydrolyzes or biodegrades rapidly), is bioaccumulated by biota, adsorbs to objects in the test system, or is otherwise subject to factors that are likely to cause test concentrations to change during exposure, were test substance concentrations in the exposure medium measured analytically? Were appropriate analytical methods used (i.e., recovery and repeatability were demonstrated)? This metric is not rated/applicable if the test substance does not have poor water solubility and is not subject to any factors that are likely to cause test concentrations to change during exposure. High (score = 1) Exposure concentrations were measured using appropriate analytical methods (i.e., recovery and repeatability were demonstrated). Endpoints were based on measured concentrations or analytically verified nominal concentrations. Medium (score = 2) Exposure concentrations were measured and measured concentrations were similar to nominal, but analytical methods were not reported OR exposure concentrations were not measured, but based on professional judgment of experimental design and nature of test substance, actual concentrations are likely to be similar to nominal concentrations. These minor uncertainties or limitations are unlikely to have a substantial impact on results. Low (score = 3) Exposure concentrations were not measured or measurements were not reported AND based on professional judgment of experimental design and nature of test substance, actual concentrations cannot be expected to be similar to nominal concentrations. This is likely to have a substantial impact on results. Unacceptable (score = 4) Exposure concentrations were not measured and nominal values are highly uncertain due to the nature of the test substance OR exposure concentrations were measured but analytical methods were not appropriate for the test substance resulting in serious uncertainties in measured concentrations (e.g., recovery and/or repeatability were poor). These are serious flaws that make the study unusable. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 10. Exposure duration and frequency Were the duration of exposure and/or exposure frequency reported and appropriate for the study type and/or outcome(s) of interest? High (score = 1) The duration of exposure and/or exposure frequency were reported and appropriate for the study type and/or outcome(s) of interest (e.g., acute daphnid study of 48-hour duration). Medium (score = 2) Minor limitations in exposure frequency and duration of exposure were identified (e.g., acute daphnid toxicity study of 24-hour duration) but are unlikely to have a substantial impact on results. Low (score = 3) The duration of exposure and/or exposure frequency differed significantly from typical study designs (e.g., acute daphnid toxicity study of 8-hour duration), and these deficiencies are likely to have a substantial impact on results.
Unacceptable (score = 4) The duration of exposure and/or exposure frequency were not reported OR 163 ------- Confidence Level (Score) Description Selected Score the reported duration of exposure and/or exposure frequency were not suited to the study type and/or outcome(s) of interest (e.g., study intended to assess effects on reproduction did not expose organisms to test substance for an acceptable period of time prior to mating). These are serious flaws that make the study unusable. Not rated/applicable3 Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 11. Number of exposure groups and spacing of exposure levels Were the number of exposure groups and spacing of exposure levels justified by study authors (e.g., based on range-finding studies) and adequate to address the purpose of the study? Did the range of concentrations/doses tested allow for identification of endpoint values (i.e., LOAEC and NOAEC, LC5o, or EC5o, depending upon duration of study)? High (score = 1) The number of exposure groups and spacing of exposure levels were justified by study authors, adequate to address the purpose of the study (e.g., the selected doses produce a range of responses), and allowed for identification of endpoint values. Medium (score = 2) There were minor limitations regarding the number of exposure groups and/or spacing of exposure levels (e.g., unclear if lowest concentration was low enough), but the number of exposure groups and spacing of exposure levels were adequate to show results relevant to the outcome of interest (e.g., observation of a concentration-response relationship) and the concerns are unlikely to have a substantial impact on results. Low (score = 3) There were deficiencies regarding the number of exposure groups and/or spacing of exposure levels (e.g., narrow spacing between exposure levels with similar responses across groups), which may include the omission of some important details (e.g., not all exposure levels are specified), and these are likely to have a substantial impact on results. Unacceptable (score = 4) The number of exposure groups and spacing of exposure levels were not conducive to the purpose of the study (e.g., the range of concentrations tested was either too high or too low to observe a concentration-response relationship, a LOAEC, NOAEC, LC50, or EC50 could not be identified) OR no information is provided on the number of exposure groups and spacing of exposure levels. These are serious flaws that make the study unusable. Not rated/applicable3 Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 12. Testing at or below solubility limit Were exposure concentrations at or below the limit of water solubility (or dispersibility limit if applicable)? If a solvent was used, was the solvent concentration appropriate (i.e., no effects on biological responses were observed in the solvent control and no interactions were expected between the solvent and test substance)? High (score = 1) Exposure concentrations were at or below the water solubility limit (or dispersibility limit if applicable). The solvent concentration was appropriate. 
Medium (score = 2) A subset of the exposure concentrations exceeded the water solubility limit (or dispersibility limit if applicable) but a sufficient range of exposure concentrations was tested to characterize a concentration-response relationship AND/OR 164 ------- Confidence Level (Score) Description Selected Score the solvent concentration slightly exceeded an appropriate concentration or was not reported, but the biological response of the solvent control was acceptable and no interactions are expected between the solvent and test substance. These minor uncertainties or limitations are unlikely to have a substantial impact on results. Low (score = 3) Reporting omissions prevented determination of whether exposure concentrations exceeded the water solubility limit (or dispersibility limit if applicable) AND/OR both the solvent concentration and biological response of the solvent control were not reported. These deficiencies are likely to have a substantial impact on results. Unacceptable (score = 4) All exposure concentrations greatly exceeded the water solubility limit (or dispersibility limit if applicable) and the range of exposure concentrations tested was insufficient to characterize a concentration-response relationship AND/OR the solvent concentration exceeded an appropriate concentration and is likely to have influenced the biological response of the test organisms. These are serious flaws that make the study unusable. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 4. Test Organisms Metric 13. Test organism characteristics Were the species, strain, sex, age, size, life stage, and/or embryonic stage of the test organisms reported and appropriate for the evaluation of the specific outcome(s) of interest (e.g., routinely used for similar study types or acceptable rationale provided for selection)? Were the test organisms from a reliable source? High (score = 1) The test organisms were adequately described and were obtained from a reliable source. The test organisms were appropriate for evaluation of the specific outcome(s) of interest (e.g., routinely used for similar study types or acceptable rationale provided for selection). Medium (score = 2) There are minor reservations or uncertainties about the choice of test species, source of test organisms, or characteristics of test organisms (e.g., age, size, or sex not reported for fish) that are unlikely to have a substantial impact on results. Low (score = 3) There were significant deficiencies or concerns regarding the choice of test species, source of test organisms, or characteristics of test organisms that are likely to have a substantial impact on study results. Unacceptable (score = 4) The test organisms were not identified sufficiently or were not appropriate for the evaluation of the specific outcome(s) of interest or were not from an appropriate source (e.g., collected from a polluted field site). These are serious flaws that make the study unusable. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] 165 ------- Confidence Level (Score) Description Selected Score Metric 14. Acclimatization and pretreatment conditions Were the test organisms acclimatized to test conditions? 
Were pretreatment conditions the same for control and exposed groups? High (score = 1) The test organisms were acclimatized to test conditions and all pretreatment conditions were the same for control and exposed populations, such that the only difference was exposure to test substance. Medium (score = 2) Some acclimatization and/or pretreatment conditions differed between control and exposed populations, but the differences are unlikely to have a substantial impact on results or there are minor uncertainties or limitations in the details provided. Low (score = 3) The study did not report whether test organisms were acclimatized and/or whether pretreatment conditions were the same for control and exposed groups, and this is likely to have a substantial impact on results. Unacceptable (score = 4) There were serious differences in acclimatization and/or pretreatment conditions between control and exposed groups OR organisms were previously exposed to the test substance or other unintended stressors. These are serious flaws that make the study unusable. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 15. Number of organisms and replicates per group Were the numbers of test organisms and replicates sufficient to characterize toxicological effects? High (score = 1) The numbers of test organisms and replicates were reported and sufficient to characterize toxicological effects. Medium (score = 2) The numbers of test organisms and replicates were sufficient to characterize toxicological effects, but minor uncertainties or limitations were identified regarding the number of test organisms and/or replicates that are unlikely to have a substantial impact on results. Low (score = 3) The number of test organisms and/or replicates was not reported and this is likely to have a substantial impact on results. Unacceptable (score = 4) The number of test organisms and/or replicates was insufficient to characterize toxicological effects and/or provided insufficient power for statistical analysis (e.g., 1-2 organisms/group). These are serious flaws that make the study unusable. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 16. Adequacy of test conditions Were organism housing, environmental conditions (e.g., temperature, pH, dissolved oxygen, hardness, and salinity), food, water, and nutrients conducive to maintenance of health, both before and during exposure? Was the biomass loading of the organisms in the test system appropriate? High (score = 1) Organism housing, environmental conditions, food, water, and nutrients were conducive to maintenance of health and biomass loading was appropriate. Medium (score = 2) Minor uncertainties or limitations were identified regarding organism housing, environmental conditions, food, water, nutrients, and/or biomass loading, but these are not likely to have a substantial impact on results. 166 ------- Confidence Level (Score) Description Selected Score Low (score = 3) Reporting of housing and/or environmental conditions and/or food, water, and nutrients and/or biomass loading was limited or unclear, and the omitted details are likely to have a substantial impact on results. 
Unacceptable (score = 4) Organism housing and/or environmental conditions and/or food, water, and nutrients and/or biomass loading were not conducive to maintenance of health (e.g., overt signs of handling stress are evident). These are serious flaws that make the study unusable. Not rated/applicable3 Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 5. Outcome Assessment Metric 17. Outcome assessment methodology Did the outcome assessment methodology address or report the intended outcome(s) of interest? Was the outcome assessment methodology (including endpoints assessed and timing of endpoint assessment) sensitive for the outcome(s) of interest (e.g., measured endpoints that were able to detect a true biological effect or hazard)? (Note: Outcome, as addressed in this domain, refers to biological effects measured in an ecotoxicity study; e.g., reproductive toxicity.) High (score = 1) The outcome assessment methodology addressed or reported the intended outcome(s) of interest and was sensitive for the outcomes(s) of interest. Medium (score = 2) The outcome assessment methodology partially addressed or reported the intended outcomes(s) of interest (e.g., total number of offspring per group reported in the absence of data on fecundity per individual), but minor uncertainties or limitations are unlikely to have a substantial impact on results. Low (score = 3) Significant deficiencies in the reported outcome assessment methodology were identified OR due to incomplete reporting, it was unclear whether methods were sensitive for the outcome of interest. This is likely to have a substantial impact on results. Unacceptable (score = 4) The outcome assessment methodology was not reported OR the reported outcome assessment methodology was not sensitive for the outcome(s) of interest (e.g., in the assessment of reproduction in a chronic daphnid test, offspring were not counted and removed until the end of the test, rather than daily). These are serious flaws that make the study unusable. Not rated/applicable3 Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 18. Consistency of outcome assessment Was the outcome assessment carried out consistently (i.e., using the same protocol) across study groups (e.g., assessment at the same time after initial exposure in all study groups)? High (score = 1) Details of the outcome assessment protocol were reported and outcomes were assessed consistently across study groups (e.g., at the same time after initial exposure) using the same protocol in all study groups. 167 ------- Confidence Level (Score) Description Selected Score Medium (score = 2) There were minor differences in the timing of outcome assessment across study groups, or incomplete reporting of minor details of outcome assessment protocol execution, but these uncertainties or limitations are unlikely to have substantial impact on results. Low (score = 3) Details regarding the execution of the study protocol for outcome assessment (e.g., timing of assessment across groups) were not reported, and these deficiencies are likely to have a substantial impact on results. 
Unacceptable (score = 4) There were large inconsistencies in the execution of study protocols for outcome assessment across study groups OR outcome assessments were not adequately reported for meaningful interpretation of results. These are serious flaws that make the study unusable. Not rated/applicable3 Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 6. Confounding/Variable Control Metric 19. Confounding variables in test design and procedures Were all variables consistent across experimental groups or appropriately controlled for in the analysis, including, but not limited to, size and age of test organisms, environmental conditions (e.g., temperature, pH, and dissolved oxygen), and protective or toxic factors that could mask or enhance effects? High (score = 1) There were no reported differences among the study groups in environmental conditions or other factors that could influence the outcome assessment. Medium (score = 2) The study reported minor differences among the study groups with respect to environmental conditions or other non-treatment-related factors, but these are unlikely to have a substantial impact on results. Low (score = 3) The study did not provide enough information to allow a comparison of environmental conditions or other non-treatment-related factors across study groups, and the omitted information is likely to have a substantial impact on study results. Unacceptable (score = 4) The study reported significant differences among the study groups with respect to environmental conditions (e.g., differences in pH unrelated to the test substance) or other non-treatment-related factors and these prevent meaningful interpretation of the results. These are serious flaws that make the study unusable. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 20. Outcomes unrelated to exposure Were there differences among the study groups in test organism attrition or outcomes unrelated to exposure (e.g., infection) that could influence the outcome assessment? High (score = 1) Details regarding test organism attrition and outcomes unrelated to exposure (e.g., infection) were reported for each study group and there were no differences among groups that could influence the outcome assessment. Medium (score = 2) Authors reported that one or more study groups experienced disproportionate test organism attrition or outcomes unrelated to exposure (e.g., infection), but data from the remaining exposure groups were valid and the low incidence of attrition is unlikely to have a substantial impact on 168 ------- Confidence Level (Score) Description Selected Score results OR data on attrition and/or outcomes unrelated to exposure for each study group were not reported because only substantial differences among groups were noted (as indicated by study authors). Low (score = 3) Data on attrition and/or outcomes unrelated to exposure were not reported for each study group, and this deficiency is likely to have a substantial impact on results. Unacceptable (score = 4) One or more study groups experienced serious test organism attrition or outcomes unrelated to exposure (e.g., infection). This is a serious flaw that makes the study unusable. 
Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance]
Domain 7. Data Presentation and Analysis
Metric 21. Statistical methods Were statistical methods clearly described and appropriate for dataset(s) (e.g., parametric test for normally distributed data)?
High (score = 1) Statistical methods were clearly described and appropriate for dataset(s) (e.g., parametric test for normally distributed data) OR no statistical analyses, calculation methods, and/or data manipulation were conducted but sufficient data were provided to conduct an independent statistical analysis.
Medium (score = 2) Not applicable for this metric
Low (score = 3) Statistical analysis was not described clearly, and this deficiency is likely to have a substantial impact on results.
Unacceptable (score = 4) Statistical methods used were not appropriate (e.g., parametric test for non-normally distributed data) OR statistical analysis was not conducted AND data enabling an independent statistical analysis were not provided. These are serious flaws that make the study unusable.
Not rated/applicable a
Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance]
Metric 22. Reporting of data Were the data for all outcomes presented? Were data reported for each treatment and control group? Were reported data sufficient to determine values for the endpoint(s) of interest (e.g., LOEC, NOEC, LC50, and EC50)?
High (score = 1) Data for exposure-related findings were presented for each treatment and control group and were adequate to determine values for the endpoint(s) of interest. Negative findings were reported qualitatively or quantitatively.
Medium (score = 2) Data for exposure-related findings were reported for most, but not all, outcomes by study group and/or data were not reported for outcomes with negative findings, but these minor uncertainties or limitations are unlikely to have a substantial impact on results.
Low (score = 3) Data for exposure-related findings were not shown for each study group, but results were described in the text and/or data were only reported for some outcomes. These deficiencies are likely to have a substantial impact on results.
Unacceptable (score = 4) Data presentation was inadequate (e.g., the report does not differentiate among findings in multiple treatment groups) OR major inconsistencies were present in reporting of results. These are serious flaws that make the study unusable.
Not rated/applicable
Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance]
Metric 23. Explanation of unexpected outcomes Did the author provide a suitable explanation for unexpected outcomes (including excessive within-study variability)?
High (score = 1) There were no unexpected outcomes, or unexpected outcomes were satisfactorily explained.
Medium (score = 2) Minor uncertainties or limitations were identified in how the study characterized unexpected outcomes, including within-study variability and/or variation from historical measures, but those are not likely to have a substantial impact on results.
Low (score = 3) The study did not report any measures of variability (e.g., SE, SD, confidence intervals) and/or insufficient information was provided to determine if excessive variability or unexpected outcomes occurred. This is likely to have a substantial impact on results.
Unacceptable (score = 4) The occurrence of unexpected outcomes, including, but not limited to, within-study variability and/or variation from historical measures, is considered a serious flaw that makes the study unusable.
Not rated/applicable
Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance]
Domain 8. Other (Apply as Needed)
Metric
High (score = 1)
Medium (score = 2)
Low (score = 3)
Unacceptable (score = 4)
Not rated/applicable
Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance]
Note: a These metrics should be scored as Not rated/applicable if the study cited a secondary literature source for the description of testing methodology; if the study is not classified as unacceptable in the initial review, the secondary source will be reviewed during a subsequent evaluation step and the metric will be rated at that time.
170 -------
F.6 References
1. Cooper, G; Lunn, R; Agerstrand, M; Glenn, B; Kraft, A; Luke, A; Ratcliffe, J. (2016). Study sensitivity: Evaluating the ability to detect effects in systematic reviews of chemical exposures. Environ Int. 92-93: 605-610. http://dx.doi.org/10.1016/j.envint.2016.03.017.
2. EC. (2018). ToxRTool - Toxicological data Reliability assessment Tool. https://hero.epa.gov/heronet/index.cfm/reference/download/reference_id/4262819.
3. ECHA. (2011). Guidance on information requirements and chemical safety assessment. Chapter R.3: Information gathering. https://hero.epa.gov/heronet/index.cfm/reference/download/reference_id/4262857.
4. Hartling, L; Hamm, M; Milne, A; Vandermeer, B; Santaguida, PL; Ansari, M; Tsertsvadze, A; Hempel, S; Shekelle, P; Dryden, DM. (2012). Validity and inter-rater reliability testing of quality assessment instruments. (AHRQ Publication No. 12-EHC039-EF). Rockville, MD: Agency for Healthcare Research and Quality. https://hero.epa.gov/heronet/index.cfm/reference/download/reference_id/4262864.
5. Hooijmans, C; De Vries, R; Leenaars, M; Ritskes-Hoitinga, M. (2010). The Gold Standard Publication Checklist (GSPC) for improved design, reporting and scientific quality of animal studies: GSPC versus ARRIVE guidelines. http://dx.doi.org/10.1258/la.2010.010130.
6. Hooijmans, CR; Rovers, MM; De Vries, RBM; Leenaars, M; Ritskes-Hoitinga, M; Langendam, MW. (2014). SYRCLE's risk of bias tool for animal studies. BMC Medical Research Methodology. 14(1): 43. http://dx.doi.org/10.1186/1471-2288-14-43.
7. Koustas, E; Lam, J; Sutton, P; Johnson, PI; Atchley, DS; Sen, S; Robinson, KA; Axelrad, DA; Woodruff, TJ. (2014). The Navigation Guide - Evidence-based medicine meets environmental health: Systematic review of nonhuman evidence for PFOA effects on fetal growth [Review]. Environ Health Perspect. 122(10): 1015-1027. http://dx.doi.org/10.1289/ehp.1307177; https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4181920/pdf/ehp.1307177.pdf.
8. Kushman, ME; Kraft, AD; Guyton, KZ; Chiu, WA; Makris, SL; Rusyn, I. (2013).
A systematic approach for identifying and presenting mechanistic evidence in human health assessments. Regul Toxicol Pharmacol. 67(2): 266-277. http://dx.doi.org/10.1016/j.yrtph.2013.08.005; https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3818152/pdf/nihms516764.pdf.
9. Lynch, HN; Goodman, JE; Tabony, JA; Rhomberg, LR. (2016). Systematic comparison of study quality criteria. Regul Toxicol Pharmacol. 76: 187-198. https://hero.epa.gov/heronet/index.cfm/reference/download/reference_id/4262904.
10. Moermond, CT; Kase, R; Korkaric, M; Agerstrand, M. (2016). CRED: Criteria for reporting and evaluating ecotoxicity data. Environ Toxicol Chem. 35(5): 1297-1309. http://dx.doi.org/10.1002/etc.3259.
11. NTP. (2015). Handbook for conducting a literature-based health assessment using OHAT approach for systematic review and evidence integration. U.S. Dept. of Health and Human Services, National Toxicology Program. http://ntp.niehs.nih.gov/pubhealth/hat/noms/index-2.html.
12. Samuel, GO; Hoffmann, S; Wright, RA; Lalu, MM; Patlewicz, G; Becker, RA; Degeorge, GL; Fergusson, D; Hartung, T; Lewis, RJ; Stephens, ML. (2016). Guidance on assessing the methodological and reporting quality of toxicologically relevant studies: A scoping review. Environ Int. 92-93: 630-646. https://hero.epa.gov/heronet/index.cfm/reference/download/reference_id/4262966.
171 -------
APPENDIX G: DATA QUALITY CRITERIA FOR STUDIES ON ANIMAL AND IN VITRO TOXICITY
G.1 Types of Data Sources
The data quality will be evaluated for a variety of animal and in vitro toxicity studies. Table G-1 provides examples of the types of studies falling into these two broad categories. Since the availability of information varies considerably across chemicals, it is anticipated that some study types will not be available while others may be identified beyond those listed in Table G-1.
Table G-1. Types of Animal and In Vitro Toxicity Data
Animal Toxicity: Oral, dermal, and inhalation routes: lethality, irritation, sensitization, reproduction, fertility, developmental, neurotoxicity, carcinogenicity, systemic toxicity, metabolism, pharmacokinetics, absorption, immunotoxicity, genotoxicity, mutagenicity, endocrine disruption.
In Vitro Toxicity Studies: Irritation, corrosion, sensitization, genotoxicity, dermal absorption, phototoxicity, ligand binding, steroidogenesis, developmental, organ toxicity, mechanisms, high throughput, immunotoxicity.
Mechanistic evidence is highly heterogeneous and may come from human, animal, or in vitro toxicity studies. Mechanistic evidence may provide support for biological plausibility and help explain differences in tissue sensitivity, species, gender, life stage, or other factors (U.S. EPA, 2006). Although highly preferred, a fully elucidated mode of action (MOA) or adverse outcome pathway (AOP) is not required to conduct the human health hazard assessment for a given chemical. EPA/OPPT plans to prioritize the evaluation of mechanistic evidence instead of evaluating all of the identified evidence upfront. This approach has the advantage of conducting a focused review of those mechanistic studies that are most relevant to the hazards under evaluation. The prioritization approach is generally initiated during the data screening step. For example, many of the human health PECOs for the first ten TSCA risk evaluations excluded mechanistic evidence during full text screening.
Excluding the mechanistic evidence during full text screening does not mean that the data cannot be accessed later. The assessor can eventually mine the database of mechanistic references when specific questions or hypotheses arise related to the chemical's MOA/AOP. Moreover, EPA/OPPT anticipates that some chemicals undergoing TSCA risk evaluations may have physiologically based pharmacokinetic (PBPK) models that could be used for predicting internal dose at a target site as well as interspecies, intraspecies, route-to-route extrapolations or other types of extrapolations. These models should be carefully evaluated to determine if they can be used for risk assessment purposes. Although EPA/OPPT is not including an evaluation strategy for PBPK models in this document, when necessary, it plans to document 172 ------- the model evaluation process based on the list of considerations described in U.S. EPA (2006) and IPCS (2010). EPA/OPPT plans to use the evaluation strategies for animal and in vitro toxicity data to assess the quality of mechanistic and pharmacokinetic data supporting the model. EPA/OPPT may tailor the criteria to capture the inherent characteristics of particular studies that are not captured in the current criteria (e.g., optimization of criteria to evaluate the quality of new approach methodologies or NAMs). G.2 Data Quality Evaluation Domains The methods for evaluation of study quality were developed after review of selected references describing existing study quality and risk of bias evaluation tools for toxicity studies (EC. 2018; Cooper et al.. 2016; Lynch et al.. 2016; Moermond et al.. 2016b; Samuel et al.. 2016; NTP. 2015a; Hooiimans et al.. 2014; Koustas et al.. 2014; Kushman et al.. 2013; Hartling et al.. 2012; Hooiimans et al.. 2010). These publications, coupled with professional judgment and experience, informed the identification of domains and metrics for consideration in the evaluation and scoring of study quality. Furthermore, the evaluation tool is intended to address elements of TSCA Science Standards 26(h)(1) through 26(h)(5) that EPA must address during the development process of the risk evaluations. The data quality of animal toxicity studies and in vitro toxicity studies is evaluated by assessing the following seven domains: Test Substance, Test Design, Exposure Characterization, Test Organism/Test Model, Outcome Assessment, Confounding/Variable Control, and Data Presentation and Analysis. The data quality within each domain will be evaluated by assessing unique metrics that pertain to each domain. The domains are defined in Table G-2 and further information on evaluation metrics is provided in section G.3. Relevance of the studies will also be checked in continuance with relevance identification that began during the data screening process. Table G-2. Data Evaluation Domains and Definitions Evaluation Domain Definition Test Substance Metrics in this domain evaluate whether the information provided in the study provides a reliable3 confirmation that the test substance used in a study has the same (or sufficiently similar) identity, purity, and properties as the substance of interest. Test Design Metrics in this domain evaluate whether the experimental design enables the study to distinguish the effect of exposure from other factors. This domain includes metrics related to the use of control groups and randomization in allocation to ensure that the effect of exposure is isolated. 
Exposure Characterization Metrics in this domain assess the validity and reliability of methods used to measure or characterize exposure. These metrics evaluate whether exposure to the test substance was characterized using a method(s) that provides valid and reliable results, whether the exposure remained consistent over the duration of the experiment, and whether the exposure levels were appropriate to the outcome of interest. Test Organism/Test Model These metrics assess the appropriateness of the population or organism(s), group sizes used in the study (i.e., number of organisms and/or number of replicates per exposure group), and the organism conditions to assess the outcome of interest associated with the exposure of interest. 173 ------- Evaluation Domain Definition Outcome Assessment Metrics in this domain assess the validity and reliability of methods, including sensitivity of methods, that are used to measure or otherwise characterize the outcome(s) of interest. Confounding/Variable Control Metrics in this domain assess the potential impact of factors other than exposure that may affect the risk of outcome. The metrics evaluate whether studies identify and account for factors that are related to exposure and independently related to outcome (confounding factors) and whether appropriate experimental or analytical (statistical) methods are used to control for factors unrelated to exposure that may affect the risk of outcome (variable control). Data Presentation and Analysis Metrics in this domain assess whether appropriate statistical methods were used and if data for all outcomes are presented. Other Metrics in this domain are added as needed to incorporate chemical- or study-specific evaluations. Note: a Reliability is defined as "the inherent property of a study or data, which includes the use of well-founded scientific approaches, the avoidance of bias within the study or data collection design and faithful study or data collection conduct and documentation" (ECHA. 2011a). G.3 Data Quality Evaluation Metrics The data quality evaluation domains are evaluated by assessing unique metrics that have been developed for animal and in vitro studies. Each metric is binned into a confidence level of High, Medium, Low, or Unacceptable. Each confidence level is assigned a numerical score (i.e., 1 through 4) that is used in the method of assessing the overall quality of the study. Table G-3 lists the data evaluation domains and metrics for animal toxicity studies including metrics that inform risk of bias and types of bias, and Table G-4 lists the data evaluation domains and metrics for in vitro toxicity studies. Each domain has between 2 and 6 metrics; however, some metrics may not apply to all study types. A general domain for other considerations is available for metrics that are specific to a given test substance or study type. EPA may modify the metrics used for animal toxicity and in vitro toxicity studies as the Agency acquires experience with the evaluation tool. Any modifications will be documented. 174 ------- Table G-3. 
Data Evaluation Domains and Metrics for Animal Toxicity Studies
(Columns: Evaluation Domain; Number of Metrics; Overall Metrics [Metric Number and Description, Type of Bias].)
Test Substance (3 metrics): • Metric 1: Test Substance Identity • Metric 2: Test Substance Source • Metric 3: Test Substance Purity (information bias a) (*detection bias b)
Test Design (3 metrics): • Metric 4: Negative and Vehicle Controls (*performance bias b) • Metric 5: Positive Controls (information bias a) • Metric 6: Randomized Allocation (*selection bias a,b)
Exposure Characterization (6 metrics): • Metric 7: Preparation and Storage of Test Substance • Metric 8: Consistency of Exposure Administration • Metric 9: Reporting of Doses/Concentrations • Metric 10: Exposure Frequency and Duration • Metric 11: Number of Exposure Groups and Dose Spacing • Metric 12: Exposure Route and Method
Test Organism (3 metrics): • Metric 13: Test Animal Characteristics • Metric 14: Adequacy and Consistency of Animal Husbandry Conditions • Metric 15: Number per Group (*missing data bias a)
Outcome Assessment (5 metrics): • Metric 16: Outcome Assessment Methodology (information bias a) (*detection bias b) • Metric 17: Consistency of Outcome Assessment • Metric 18: Sampling Adequacy • Metric 19: Blinding of Assessors (*selection bias a) (*performance bias b) • Metric 20: Negative Control Response
Confounding/Variable Control (2 metrics): • Metric 21: Confounding Variables in Test Design and Procedures (*other bias b) • Metric 22: Health Outcomes Unrelated to Exposure (*attrition/exclusion bias b)
Data Presentation and Analysis (2 metrics): • Metric 23: Statistical Methods (information bias a) (*other bias b) • Metric 24: Reporting of Data (*selective reporting bias b)
Notes: Items marked with an asterisk (*) are examples of items that can be used to assess internal validity/risk of bias. a National Academies of Sciences, Engineering, and Medicine. 2017. Application of Systematic Review Methods in an Overall Strategy for Evaluating Low-Dose Toxicity from Endocrine Active Chemicals. Washington, DC: The National Academies Press. doi: https://doi.org/10.17226/24758. b National Toxicology Program, Office of Health Assessment and Translation (OHAT). 2015. OHAT Risk of Bias Rating Tool for Human and Animal Studies. https://ntp.niehs.nih.gov/ntp/ohat/pubs/riskofbiastool_508.pdf.
175 -------
Table G-4. Data Evaluation Domains and Metrics for In Vitro Toxicity Studies
(Columns: Evaluation Domain; Number of Metrics; Overall Metrics [Metric Number and Description].)
Test Substance (3 metrics): • Metric 1: Test Substance Identity • Metric 2: Test Substance Source • Metric 3: Test Substance Purity
Test Design (4 metrics): • Metric 4: Negative Controls a • Metric 5: Positive Controls a • Metric 6: Assay Procedures • Metric 7: Standards for Test
Exposure Characterization (6 metrics): • Metric 8: Preparation and Storage of Test Substance • Metric 9: Consistency of Exposure Administration • Metric 10: Reporting of Doses/Concentrations • Metric 11: Exposure Duration • Metric 12: Number of Exposure Groups and Dose Spacing • Metric 13: Metabolic Activation
Test Model (2 metrics): • Metric 14: Test Model • Metric 15: Number per Group
Outcome Assessment (4 metrics): • Metric 16: Outcome Assessment Methodology • Metric 17: Consistency of Outcome Assessment • Metric 18: Sampling Adequacy • Metric 19: Blinding of Assessors
Confounding/Variable Control (2 metrics): • Metric 20: Confounding Variables in Test Design and Procedures • Metric 21: Outcomes Unrelated to Exposure
Data Presentation and Analysis (4 metrics): • Metric 22: Data Analysis • Metric 23: Data Interpretation • Metric 24: Cytotoxicity Data • Metric 25: Reporting of Data
Note: a These are for the assay performance, not necessarily for the "validation" of extrapolating to a particular apical outcome (i.e., assay performance vs assay validation).
G.4 Scoring Method and Determination of Overall Data Quality Level
Appendix A provides information about the evaluation method that will be applied across the various data/information sources being assessed to support TSCA risk evaluations. This section provides details about the scoring system that will be applied to animal and in vitro toxicity studies, including the weighting factors assigned to each metric score of each domain. Some metrics will be given greater weight than others if they are regarded as key or critical metrics. Thus, EPA will use a weighting approach to reflect that some metrics are more important than others when assessing the overall quality of the data.
176 -------
G.4.1 Weighting Factors
Each metric was assigned a weighting factor of 1 or 2, with the higher weighting factor (2) given to metrics deemed critical for the evaluation. The critical metrics were identified based on professional judgment in conjunction with consideration of the factors that are most frequently included in other study quality/risk of bias tools for animal toxicity studies [reviewed by Lynch et al. (2016); Samuel et al. (2016)]. In selecting critical metrics, EPA recognized that the relevance of an individual study to the risk analysis for a given substance is determined by its ability to inform hazard identification and/or dose-response assessment. Thus, the critical metrics are those that determine how well a study answers these key questions: • Is a change in health outcome demonstrated in the study? • Is the observed change more likely than not attributable to the substance exposure? • At what substance dose(s) does the change occur? EPA/OPPT assigned a weighting factor of 2 to each metric considered critical to answering these questions. Remaining metrics were assigned a weighting factor of 1. Tables G-5 and G-6 identify the critical metrics (i.e., those assigned a weighting factor of 2) for animal toxicity and in vitro toxicity studies, respectively, and provide a rationale for selection of each metric.
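For readers who want to see the arithmetic end to end, the sketch below is a minimal, illustrative rendering of the weighting approach described here and of the overall score calculation given in Section G.4.2 (weighted average of the metric scores, exclusion of metrics that are not rated/applicable, the Unacceptable override, and the High/Medium/Low bins shown in Tables G-7 and G-8). The function name, data structures, and example values are hypothetical and are not part of the EPA/OPPT evaluation tool.

```python
# Illustrative sketch only; not EPA's implementation.

def overall_study_score(metric_scores, weights):
    """Combine per-metric confidence scores into an overall study score.

    metric_scores: dict mapping metric number -> 1 (High), 2 (Medium),
                   3 (Low), 4 (Unacceptable), or None (not rated/applicable).
    weights:       dict mapping metric number -> weighting factor (1 or 2),
                   e.g., the factors listed in Table G-7 or G-8.
    Returns a tuple (overall_score, overall_quality_level).
    """
    # A single Unacceptable metric makes the whole study Unacceptable;
    # no overall score is calculated (Section G.4.2).
    if any(score == 4 for score in metric_scores.values()):
        return None, "Unacceptable"

    # Metrics that are not rated/applicable are excluded from both the
    # numerator and the denominator.
    rated = {m: s for m, s in metric_scores.items() if s is not None}

    weighted_sum = sum(s * weights[m] for m, s in rated.items())
    weight_sum = sum(weights[m] for m in rated)
    overall = weighted_sum / weight_sum  # falls between 1 and 3

    # Bin the overall score into a quality level (Tables G-7 and G-8).
    if overall < 1.7:
        level = "High"
    elif overall < 2.3:
        level = "Medium"
    else:
        level = "Low"
    return overall, level


# Hypothetical three-metric illustration (not a real evaluation):
scores = {1: 1, 4: 2, 24: 3}   # High, Medium, Low
wts = {1: 2, 4: 2, 24: 2}      # all three treated as critical metrics
print(overall_study_score(scores, wts))  # (2.0, 'Medium')
```

The worked sums in the scoring examples later in this appendix (e.g., 59/31 = 1.9 in Table G-9 and 49/27 = 1.8 in Table G-10) follow the same arithmetic.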
Tables G-7 and G-8 identify the weighting factors assigned to each metric for animal toxicity and in vitro toxicity studies, respectively. Table G-5. Animal Toxicity Metrics with Greater Importance in the Evaluation and Rationale for Selection Domain Critical Metrics with Weighting Factor of 2 (Metric Number)a Rationale Test substance Test substance identity (Metric 1) The test substance must be identified and characterized definitively to ensure that the study is relevant to the substance of interest. Test design Negative and vehicle controls (Metric 4) A concurrent negative control and vehicle control (when indicated) are required to ensure that any observed effects are attributable to substance exposure. Note that more than one negative control may be necessary in some studies. Exposure characterization Reporting of doses/concentrations (Metric 9) Dose levels must be defined without ambiguity to allow for determination of the dose-response relationship and to enable valid comparisons across studies. Test organisms Test animal characteristics (Metric 13) The test animal characteristics must be reported to enable assessment of a) whether they are suitable for the endpoint of interest; b) whether there are species, strain, sex, or age/lifestage differences within or between different studies; and c) to enable consideration of approaches for extrapolation to humans. Outcome assessment Outcome assessment methodology (Metric 16) The methods used for outcome assessment must be fully described, valid, and sensitive to ensure that effects are detected, that observed effects are true, and to enable valid comparisons across studies. Confounding/ variable control Confounding variables in test design and procedures (Metric 21) Control for confounding variables in test design and procedures is necessary to ensure that any observed effects are attributable to substance exposure and not to other factors. Data presentation and analysis Reporting of data (Metric 24) Detailed results are necessary to determine if the study authors' conclusions are valid and to enable dose-response modeling. Note: aA weighting factor of 1 is assigned for the remaining metrics. 177 ------- Table G-6. In Vitro Toxicity Metrics with Greater Importance in the Evaluation and Rationale for Selection Domain Critical Metrics with Weighting Factor of 2 (Metric Number)a Rationale Test Substance Test Substance Identity (Metric 1) The test substance must be identified and characterized definitively to ensure that the study is relevant to the substance of interest. Test Design Negative and Vehicle Controls (Metric 4) A concurrent negative control and vehicle control (when indicated) are required for comparison of results between exposed and unexposed models to allow determination of treatment-related effects. Positive Controls (Metric 5) A concurrent positive control or proficiency control (when applicable) is required to determine if the chemical of interest produces the intended outcome for the study type. Exposure Characterization Reporting of concentrations (Metric 10) Dose levels must be defined without ambiguity to allow for determination of an accurate dose- response relationship or and to ensure valid comparisons across studies. Exposure duration (Metric 11) The exposure duration during the study must be defined to accurately assess potential risk. Test Model Test Model (Metric 14) The identity of the test model must be reported and suitable for the evaluation of outcome(s) of interest. 
Outcome Assessment Outcome assessment methodology (Metric 16) The methods used for outcome assessment must be fully described, valid, and sensitive to ensure that effects are detected and that observed effects are true. Sampling adequacy (Metric 18) The number of samples evaluated must be sufficient to allow data interpretation and analysis. Confounding/Variable Control Confounding variables in test design and procedures (Metric 20) Control for confounding variables in test design and procedures is necessary to ensure that any observed effects are attributable to substance exposure and not to other factors. Data Presentation and Analysis Data interpretation (Metric 23) The criteria for scoring and/or evaluation criteria are necessary so that the correct categorization (e.g., positive, negative, equivocal) can be determined for the chemical of interest. Reporting of data (Metric 25) Detailed results are necessary to determine if the study authors' conclusions are valid and to enable dose-response modeling. Note: a A weighting factor of 1 is assigned for the remaining metrics.
178 -------
G.4.2 Calculation of Overall Study Score
A confidence level (1, 2, or 3 for High, Medium, or Low confidence, respectively) is assigned for each relevant metric within each domain. To determine the overall study score, the first step is to multiply the score for each metric (1, 2, or 3 for High, Medium, or Low confidence, respectively) by the appropriate weighting factor (as shown in Tables G-7 and G-8 for animal toxicity and in vitro studies, respectively) to obtain a weighted metric score. The weighted metric scores are then summed and divided by the sum of the weighting factors (for all metrics that are scored) to obtain an overall study score between 1 and 3. The equation for calculating the overall score is shown below:
Overall Score (range of 1 to 3) = Σ(Metric Score × Weighting Factor) / Σ(Weighting Factors)
Some metrics may not be applicable to all study types. These metrics will not be included in the numerator or denominator of the equation above. The overall score will be calculated using only those metrics that receive a numerical score. Scoring examples for animal toxicity and in vitro toxicity studies are in Tables G-9 through G-12. Studies with any single metric scored as unacceptable (score = 4) will be automatically assigned an overall quality score of 4 (Unacceptable). An unacceptable score means that serious flaws are noted in the domain metric that consequently make the data unusable. If a metric is not applicable for a study type, the serious flaws would not be applicable for that metric and would not receive a score. EPA/OPPT plans to use data with an overall quality level of High, Medium, or Low confidence to quantitatively or qualitatively support the risk evaluations, but does not plan to use data rated as Unacceptable. An overall study score will not be calculated when a serious flaw is identified for any metric. If a publication reports more than one study or endpoint, each study and, as needed, each endpoint will be evaluated separately. Detailed tables showing quality criteria for the metrics are provided in Tables G-13 through G-16 for animal toxicity and in vitro toxicity studies, including a table that summarizes the serious flaws that would make the data unacceptable for use in the hazard assessment.
179 -------
Table G-7. Metric Weighting Factors and Range of Weighted Metric Scores for Animal Toxicity Studies
(The range of metric scores a is 1 to 3 for every metric. Each entry below gives: metric — metric weighting factor; range of weighted metric scores b.)
Domain 1. Test Substance: Metric 1. Test Substance Identity — 2; 2 to 6. Metric 2. Test Substance Source — 1; 1 to 3. Metric 3. Test Substance Purity — 1; 1 to 3.
Domain 2. Test Design: Metric 4. Negative and Vehicle Controls — 2; 2 to 6. Metric 5. Positive Controls — 1; 1 to 3. Metric 6. Randomized Allocation — 1; 1 to 3.
Domain 3. Exposure Characterization: Metric 7. Preparation and Storage of Test Substance — 1; 1 to 3. Metric 8. Consistency of Exposure Administration — 1; 1 to 3. Metric 9. Reporting of Doses/Concentrations — 2; 2 to 6. Metric 10. Exposure Frequency and Duration — 1; 1 to 3. Metric 11. Number of Exposure Groups and Dose Spacing — 1; 1 to 3. Metric 12. Exposure Route and Method — 1; 1 to 3.
Domain 4. Test Organisms: Metric 13. Test Animal Characteristics — 2; 2 to 6. Metric 14. Adequacy and Consistency of Animal Husbandry Conditions — 1; 1 to 3. Metric 15. Number per Group — 1; 1 to 3.
Domain 5. Outcome Assessment: Metric 16. Outcome Assessment Methodology — 2; 2 to 6. Metric 17. Consistency of Outcome Assessment — 1; 1 to 3. Metric 18. Sampling Adequacy — 1; 1 to 3. Metric 19. Blinding of Assessors — 1; 1 to 3. Metric 20. Negative Control Response — 1; 1 to 3.
Domain 6. Confounding/Variable Control: Metric 21. Confounding Variables in Test Design and Procedures — 2; 2 to 6. Metric 22. Health Outcomes Unrelated to Exposure — 1; 1 to 3.
Domain 7. Data Presentation and Analysis: Metric 23. Statistical Methods — 1; 1 to 3. Metric 24. Reporting of Data — 2; 2 to 6.
Sum (if all metrics scored) c: weighting factors = 31; weighted metric scores = 31 to 93.
Range of Overall Scores, where Overall Score = Sum of Weighted Scores/Sum of Metric Weighting Factors: 31/31 = 1; 93/31 = 3; range of overall score = 1 to 3 d. High: ≥1 and <1.7; Medium: ≥1.7 and <2.3; Low: ≥2.3 and ≤3.
Notes: a For the purposes of calculating an overall study score, the range of possible metric scores is 1 to 3 for each metric, corresponding to high and low confidence. No calculations will be conducted if a study receives an "unacceptable" rating (score of 4) for any metric. b The range of weighted scores for each metric is calculated by multiplying the range of metric scores (1 to 3) by the weighting factor for that metric. c The sum of weighting factors and the sum of the weighted scores will differ if some metrics are not scored (not applicable). d The range of possible overall scores is 1 to 3. If a study receives a score of 1 for every metric, then the overall study score will be 1. If a study receives a score of 3 for every metric, then the overall study score will be 3.
180 -------
Table G-8. Metric Weighting Factors and Range of Weighted Metric Scores for In Vitro Toxicity Studies
(Columns as in Table G-7; the range of metric scores a is 1 to 3 for every metric. Each entry below gives: metric — metric weighting factor; range of weighted metric scores b.)
Domain 1. Test Substance: Metric 1. Test Substance Identity — 2; 2 to 6. Metric 2. Test Substance Source — 1; 1 to 3. Metric 3. Test Substance Purity — 1; 1 to 3.
Domain 2. Test Design: Metric 4. Negative and Vehicle Controls — 2; 2 to 6. Metric 5. Positive Controls — 2; 2 to 6. Metric 6. Assay Procedures — 1; 1 to 3. Metric 7. Standards for Test — 1; 1 to 3.
Domain 3. Exposure Characterization: Metric 8. Preparation and Storage of Test Substance — 1; 1 to 3. Metric 9. Consistency of Exposure Administration — 1; 1 to 3. Metric 10. Reporting of Concentrations — 2; 2 to 6. Metric 11. Exposure Duration — 2; 2 to 6. Metric 12. Number of Exposure Groups and Dose Spacing — 1; 1 to 3. Metric 13. Metabolic Activation — 1; 1 to 3.
Domain 4. Test Model: Metric 14. Test Model — 2; 2 to 6. Metric 15. Number per Group — 1; 1 to 3.
Domain 5. Outcome Assessment: Metric 16. Outcome Assessment Methodology — 2; 2 to 6. Metric 17. Consistency of Outcome Assessment — 1; 1 to 3. Metric 18. Sampling Adequacy — 2; 2 to 6. Metric 19. Blinding of Assessors — 1; 1 to 3.
Domain 6. Confounding/Variable Control: Metric 20. Confounding Variables in Test Design and Procedures — 2; 2 to 6. Metric 21. Outcomes Unrelated to Exposure — 1; 1 to 3.
Domain 7. Data Presentation and Analysis: Metric 22. Data Analysis — 1; 1 to 3. Metric 23. Data Interpretation — 2; 2 to 6. Metric 24. Cytotoxicity Data — 1; 1 to 3. Metric 25. Reporting of Data — 2; 2 to 6.
Sum (if all metrics scored) c: weighting factors = 36; weighted metric scores = 36 to 108.
Range of Overall Scores, where Overall Score = Sum of Weighted Scores/Sum of Metric Weighting Factors: 36/36 = 1; 108/36 = 3; range of overall score = 1 to 3 d. High: ≥1 and <1.7; Medium: ≥1.7 and <2.3; Low: ≥2.3 and ≤3.
Notes: a For the purposes of calculating an overall study score, the range of possible metric scores is 1 to 3 for each metric, corresponding to high and low confidence. No calculations will be conducted if a study receives an "unacceptable" rating (score of 4) for any metric. b The range of weighted scores for each metric is calculated by multiplying the range of metric scores (1 to 3) by the weighting factor for that metric. c The sum of weighting factors and the sum of the weighted scores will differ if some metrics are not scored (not applicable). d The range of possible overall scores is 1 to 3. If a study receives a score of 1 for every metric, then the overall study score will be 1. If a study receives a score of 3 for every metric, then the overall study score will be 3.
181 -------
Table G-9. Scoring Example for Animal Toxicity Study with All Metrics Scored
Domain Metric Metric Score Metric Weighting Factor
Test substance: 1. Test substance identity 2. Test substance source 3. Test substance purity
Test design: 4. Negative and vehicle controls 5. Positive controls 6. Randomized allocation
Exposure characterization: 7. Preparation and storage of test substance 8. Consistency of exposure administration 9. Reporting of doses/concentrations 10. Exposure frequency and duration 11. Number of exposure groups and dose spacing 12. Exposure route and method
Test organisms: 13. Test animal characteristics 14. Consistency of animal conditions 15. Number per group
Outcome assessment: 16. Outcome assessment methodology 17. Consistency of outcome assessment 18. Sampling adequacy 19. Blinding of assessors 20. Negative control responses
Confounding/variable control: 21. Confounding variables in test design and procedures 22. Health outcomes unrelated to exposure
Data presentation and analysis: 23. Statistical methods 24. Reporting of data
NR = not rated/not applicable. Sum: metric weighting factors = 31; weighted scores = 59. Overall Study Score = 59/31 = 1.9 = Medium. Overall Score = Sum of Weighted Scores/Sum of Metric Weighting Factors; High: ≥1 and <1.7; Medium: ≥1.7 and <2.3; Low: ≥2.3 and ≤3.
182 -------
Table G-10. Scoring Example for Animal Toxicity Study with Some Metrics Not Rated/Not Applicable
Domain Metric Metric Score Metric Weighting Factor
Test substance: 1. Test substance identity 2. Test substance source 3. Test substance purity
Test design: 4. Negative and vehicle controls 5. Positive controls 6. Randomized allocation 1 NR 3
Exposure characterization: 7. Preparation and storage of test substance 8. Consistency of exposure administration 9. Reporting of doses/concentrations 10. Exposure frequency and duration 11. Number of exposure groups and dose spacing 12. Exposure route and method 2 NR 1 2 1 1
Test organisms: 13. Test animal characteristics 14. Consistency of animal conditions 15. Number per group
Outcome assessment: 16. Outcome assessment methodology 17. Consistency of outcome assessment 18. Sampling adequacy 19. Blinding of assessors 20. Negative control responses 2 NR 2 NR 2
Confounding/variable control: 21. Confounding variables in test design and procedures 22. Health outcomes unrelated to exposure
Data presentation and analysis: 23. Statistical methods 24. Reporting of data
NR = not rated/not applicable. Sum: metric weighting factors = 27; weighted scores = 49. Overall Study Score = 49/27 = 1.8 = Medium. Overall Score = Sum of Weighted Scores/Sum of Metric Weighting Factors; High: ≥1 and <1.7; Medium: ≥1.7 and <2.3; Low: ≥2.3 and ≤3.
183 -------
Table G-11. Scoring Example for In Vitro Study with All Metrics Scored
Domain Metric Metric Score Metric Weighting Factor
Test substance: 1. Test substance identity 2. Test substance source 3. Test substance purity
Test design: 4. Negative controls 5. Positive controls 6. Assay procedures 7. Standards for test
Exposure characterization: 8. Preparation and storage of test substance 9. Consistency of exposure administration 10. Reporting of concentrations 11. Exposure duration 12. Number of exposure groups and dose spacing 13. Metabolic activation
Test model: 14. Test model 15. Number per group
Outcome assessment: 16. Outcome assessment methodology 17. Consistency of outcome assessment 18. Sampling adequacy 19. Blinding of assessors
Confounding/variable control: 20. Confounding variables in test design and procedures 21. Outcomes unrelated to exposure
Data presentation and analysis: 22. Data analysis 23. Data interpretation 24. Cytotoxicity data 25. Reporting of data
NR = not rated/not applicable. Sum: metric weighting factors = 36; weighted scores = 66. Overall Study Score = 66/36 = 1.8 = Medium. Overall Score = Sum of Weighted Scores/Sum of Metric Weighting Factors; High: ≥1 and <1.7; Medium: ≥1.7 and <2.3; Low: ≥2.3 and ≤3.
184 -------
Table G-12. Scoring Example for In Vitro Study with Some Metrics Not Rated/Not Applicable
Domain Metric Metric Score Metric Weighting Factor
Test substance: 1. Test substance identity 2. Test substance source 3. Test substance purity
Test design: 4. Negative controls 5. Positive controls 6. Assay procedures 7. Standards for test
Exposure characterization: 8. Preparation and storage of test substance 9. Consistency of exposure administration 10. Reporting of concentrations 11. Exposure duration 12. Number of exposure groups and dose spacing 13. Metabolic activation NR 2 1 1 1 NR
Test model: 14. Test model 15. Number per group
Outcome assessment: 16. Outcome assessment methodology 17. Consistency of outcome assessment 18. Sampling adequacy 19. Blinding of assessors 3 2 1 NR
Confounding/variable control: 20. Confounding variables in test design and procedures 21. Outcomes unrelated to exposure
Data presentation and analysis: 22. Data analysis 23. Data interpretation 24. Cytotoxicity data 25. Reporting of data 1 2 NR 3
NR = not rated/not applicable. Sum: metric weighting factors = 32; weighted scores = 58. Overall Study Score = 58/32 = 1.8 = Medium. Overall Score = Sum of Weighted Scores/Sum of Metric Weighting Factors; High: ≥1 and <1.7; Medium: ≥1.7 and <2.3; Low: ≥2.3 and ≤3.
185 -------
G.5 Data Quality Criteria
G.5.1 Animal Toxicity Studies
Optimization of the list of serious flaws may occur after pilot calibration exercises.
Table G-13. Serious Flaws that Would Make Animal Toxicity Studies Unacceptable
(Columns: Domain; Metric; Description of Serious Flaw(s) in Data Source.)
Domain: Test substance.
Test substance identity: The test substance identity and form (the latter if applicable) cannot be determined from the information provided (e.g., nomenclature was unclear and CASRN or structure were not reported) OR for mixtures, the components and ratios were not characterized.
Test substance source: The test substance was not obtained from a manufacturer OR if synthesized or extracted, analytical verification of the test substance was not conducted.
Test substance purity: The nature and quantity of reported impurities were such that study results were likely to be due to one or more of the impurities.
Domain: Test design.
Negative and vehicle controls: A concurrent negative control group was not included or reported OR the reported negative control group was not appropriate (e.g., age/weight of animals differed between control and treated groups).
Positive controls: For study types that require a concurrent positive control group: when applicable, an appropriate concurrent positive control (i.e., inducing a positive response) was not used, and its omission is a serious flaw that makes the study unusable.
Randomized allocation of animals: The study reported using a biased method to allocate animals to study groups (e.g., judgement of investigator).
Domain: Exposure characterization.
Preparation and storage of test substance: Information on preparation and storage was not reported OR serious flaws reported with test substance preparation and/or storage conditions will have critical impacts on dose/concentration estimates and make the study unusable (e.g., instability of test substance in exposure medium was reported, or there was heterogeneous distribution of test substance in exposure matrix [e.g., aerosol deposition in exposure chamber, insufficient mixing of dietary matrix]). For inhalation studies, there was no mention of the method and equipment used to generate the test substance, or the method used is atypical and inappropriate.
Consistency of exposure administration: Critical exposure details (e.g., methods for generating atmosphere in inhalation studies) were not reported OR reported information indicated that exposures were not administered consistently across study groups (e.g., differing particle size), resulting in serious flaws that make the study unusable.
Reporting of doses/concentrations: The reported exposure levels could not be validated (e.g., lack of food or water intake data for dietary or water exposures in conjunction with evidence of palatability differences, lack of body weight data in conjunction with qualitative evidence for body weight differences across groups, inconsistencies in reporting, etc.). For inhalation studies, actual concentrations were not reported, along with animal responses (or lack of responses) that indicate exposure problems due to faulty test substance generation. Animals were exposed to an aerosol but no particle size data were reported.
Exposure frequency and duration: The exposure frequency or duration of exposure were not reported OR the reported exposure frequency and duration were not suited to the study type and/or outcome(s) of interest (e.g., study length inadequate to evaluate tumorigenicity).
Number of exposure groups and dose/concentration spacing: The number of exposure groups and spacing were not reported OR dose groups and spacing were not relevant for the assessment (e.g., all doses in a developmental toxicity study produced overt maternal toxicity).
Exposure route and method: The route or method of exposure was not reported OR an inappropriate route or method (e.g., administration of a volatile organic compound via the diet) was used for the test substance without taking steps to correct the problem (e.g., mixing fresh diet, replacing air in static chambers). For inhalation studies, there is no description of the inhalation chamber used, or an atypical exposure method was used, such as allowing a container of test substance to evaporate in a room.
Domain: Test organisms.
Test animal characteristics: The test animal species was not reported OR the test animal (species, strain, sex, life-stage, source) was not appropriate for the evaluation of the specific outcome(s) of interest (e.g., genetically modified animals, strain was uniquely susceptible or resistant to one or more outcomes of interest).
Adequacy and consistency of animal husbandry conditions: There were significant differences in husbandry conditions between control and exposed groups (e.g., temperature, humidity, light-dark cycle) OR animal husbandry conditions deviated from customary practices in ways likely to impact study results (e.g., injuries and stress due to cage overcrowding).
Number of animals per group: The number of animals per study group was not reported OR the number of animals per study group was insufficient to characterize toxicological effects (e.g., 1-2 animals in each group).
Domain: Outcome assessment.
Outcome assessment methodology: The outcome assessment methodology was not reported OR the reported outcome assessment methodology was not sensitive for the outcome(s) of interest (e.g., evaluation of endpoints outside the critical window of development, a systemic toxicity study that evaluated only grossly observable endpoints, such as clinical signs and mortality, etc.).
Consistency of outcome assessment: There were large inconsistencies in the execution of study protocols for outcome assessment across study groups OR outcome assessments were not adequately reported for meaningful interpretation of results.
Sampling adequacy: Sampling was not adequate for the outcome(s) of interest (e.g., histopathology was performed on exposed groups, but not controls).
Blinding of assessors: The study did not report whether assessors were blinded to treatment group for subjective outcomes, and the available information suggested that the assessment of subjective outcomes (e.g., functional observational battery, qualitative neurobehavioral endpoints, histopathological re-evaluations) was performed in a biased fashion (e.g., assessors of subjective outcomes were aware of study groups). This is a serious flaw that makes the study unusable.
Negative control responses: The biological responses of the negative control groups were not reported OR there was unacceptable variation in biological responses between control replicates.
Domain: Confounding/variable control.
Confounding variables in test design and procedures: The study reported significant differences among the study groups with respect to initial body weight, decreased drinking water/food intake due to palatability issues (>20% difference from control) that could lead to dehydration and/or malnourishment, or reflex bradypnea that could lead to decreased oxygenation of the blood.
Health outcomes unrelated to exposure: One or more study groups experienced serious animal attrition or health outcomes unrelated to exposure (e.g., infection).
Domain: Data presentation and analysis.
Statistical methods: Statistical methods used were not appropriate (e.g., parametric test for non-normally distributed data) OR statistical analysis was not conducted AND data were not provided, preventing an independent statistical analysis.
Reporting of data Data presentation was inadequate (e.g., the report does not differentiate among findings in multiple exposure groups) OR major inconsistencies were present in reporting of results. 189 ------- Table G-14. Data Quality Criteria for Animal Toxicity Studies Confidence Level (Score) Description Selected Score Domain 1. Test Substance Metric 1. Test substance identity Was the test substance identified definitively (i.e., established nomenclature, CASRN, and/or structure reported, including information on the specific form tested [particle characteristics for solid-state materials, salt or base, valence state, hydration state, isomer, radiolabel, etc.] for materials that may vary in form)? If test substance is a mixture, were mixture components and ratios characterized? High (score = 1) The test substance was identified definitively and the specific form was characterized (where applicable). For mixtures, the components and ratios were characterized. Medium (score = 2) The test substance and form (the latter if applicable) were identified and components and ratios of mixtures were characterized, but there were minor uncertainties (e.g., minor characterization details were omitted) that are unlikely to have a substantial impact on results. Low (score = 3) The test substance and form (the latter if applicable) were identified and components and ratios of mixtures were characterized, but there were uncertainties regarding test substance identification or characterization that are likely to have a substantial impact on results. Unacceptable (score = 4) The test substance identity and form (the latter if applicable) cannot be determined from the information provided (e.g., nomenclature was unclear and CASRN or structure were not reported) OR for mixtures, the components and ratios were not characterized. These are serious flaws that make the study unusable. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 2. Test substance source Was the source of the test substance reported, including manufacturer and batch/lot number for materials that may vary in composition? If synthesized or extracted, was test substance identity verified by analytical methods? High (score = 1) The source of the test substance was reported, including manufacturer and batch/lot number for materials that may vary in composition, and its identity was certified by manufacturer and/or verified by analytical methods (melting point, chemical analysis, etc.). Medium (score = 2) The source of the test substance and/or the analytical verification of a synthesized test substance was reported incompletely, but the omitted details are unlikely to have a substantial impact on results. Low (score = 3) Omitted details on the source of the test substance and/or the analytical verification of a synthesized test substance are likely to have a substantial impact on results. Unacceptable (score = 4) The test substance was not obtained from a manufacturer OR if synthesized or extracted, analytical verification of the test substance was not conducted. These are serious flaws that makes the study unusable. 
Not rated/applicable 190 ------- Confidence Level (Score) Description Selected Score Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 3. Test substance purity Was the purity or grade (i.e., analytical, technical) of the test substance reported and adequate to identify its toxicological effects? Were impurities identified? Were impurities present in quantities that could influence the results? High (score = 1) The test substance purity and composition were such that any observed effects were highly likely to be due to the nominal test substance itself (e.g., highly pure or analytical-grade test substance or a formulation comprising primarily inert ingredients with small amount of active ingredient). Medium (score = 2) Minor uncertainties or limitations were identified regarding the test substance purity and composition; however, the purity and composition were such that observed effects were more likely than not due to the nominal test substance, and any identified impurities are unlikely to have a substantial impact on results. Alternately, purity was not reported but given other information purity was not expected to be of concern. Low (score = 3) Purity and/or grade of test substance were not reported or were low enough to have a substantial impact on results (i.e., observed effects may not be due to the nominal test substance). Unacceptable (score = 4) The nature and quantity of reported impurities were such that study results were likely to be due to one or more of the impurities. This is a serious flaw that makes the study unusable. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 2. Test Design Metric 4. Negative and vehicle controls Was an appropriate concurrent negative control group included? If a vehicle was used, was the control group exposed to the vehicle? For inhalation and gavage studies, were controls sham-exposed? High (score = 1) Study authors reported using an appropriate concurrent negative control group (i.e., all conditions equal except chemical exposure). If gavage or inhalation study, a vehicle and/or sham-treated control group was included. Medium (score = 2) Study authors reported using a concurrent negative control group, but all conditions were not equal to those of treated groups; however, the identified differences are considered to be minor limitations that are unlikely to have a substantial impact on results. Low (score = 3) Study authors acknowledged using a concurrent negative control group, but details regarding the negative control group were not reported, and the lack of details is likely to have a substantial impact on results. Unacceptable (score = 4) A concurrent negative control group was not included or reported OR the reported negative control group was not appropriate (e.g., age/ weight of animals differed between control and treated groups). This is a serious flaw that makes the study unusable. Not rated/applicable 191 ------- Confidence Level (Score) Description Selected Score Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 5. 
Positive controls
Was an appropriate concurrent positive control group included if necessary based on study type (e.g., certain neurotoxicity studies)? This metric is not rated/applicable if a positive control was not indicated by the study type.
High (score = 1) When applicable, a concurrent positive control was used (if necessary for the study type) and a positive response was observed.
Medium (score = 2) When applicable, a concurrent positive control was used, but there were minor uncertainties (e.g., minor details regarding control exposure or response were omitted) that are unlikely to have a substantial impact on results.
Low (score = 3) When applicable, a concurrent positive control was used, but there were deficiencies regarding the control exposure or response that are likely to have a substantial impact on results (e.g., the control response was not described).
Unacceptable (score = 4) When applicable, an appropriate concurrent positive control (i.e., inducing a positive response) was not used and its omission is a serious flaw that makes the study unusable.
Not rated/applicable
Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance]
Metric 6. Randomized allocation of animals
Did the study explicitly report randomized allocation of animals to study groups?
High (score = 1) The study reported that animals were randomly allocated into study groups (including the control group).
Medium (score = 2) The study reported methods of allocation of animals to study groups, but there were minor limitations in the allocation method (e.g., a method with a nonrandom component, like assignment to minimize differences in body weight across groups) that are unlikely to have a substantial impact on results.
Low (score = 3) The study did not report how animals were allocated to study groups, or there were deficiencies regarding the allocation method that are likely to have a substantial impact on results (e.g., allocation by animal number).
Unacceptable (score = 4) The study reported using a biased method to allocate animals to study groups (e.g., judgement of investigator). This is a serious flaw that makes the study unusable.
Not rated/applicable
Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance]
Domain 3. Exposure Characterization
Metric 7. Preparation and storage of test substance
Did the study characterize the test substance preparation and storage conditions (e.g., test substance stability, homogeneity, mixing temperature, stock concentration, stirring methods, centrifugation/filtration)? Were the frequency of preparation and/or storage conditions appropriate to the test substance stability? For inhalation studies, was the aerosol/vapor generation method appropriate?
High (score = 1) The test substance preparation and storage conditions were reported and appropriate for the test substance (e.g., test substance well-mixed in diet). For inhalation studies, the method and equipment used to generate the test substance as a gas, vapor, or aerosol were reported and appropriate.
Medium (score = 2) The test substance preparation and storage conditions were reported, but only minor limitations (e.g., diet was not mixed fresh daily) or omissions of detail were identified that are unlikely to have a substantial impact on results. For inhalation studies, the description of the method and equipment used to generate the test substance was incomplete or confusing, but there is no reason to believe there was an impact on animal exposure.
Low (score = 3) Deficiencies in reporting of test substance preparation and/or storage conditions are likely to have a substantial impact on results (e.g., available information on physical-chemical properties suggested that stability and/or solubility of test substance in vehicle may be poor). For inhalation studies, there is reason to question the validity of the method used for generating the test substance.
Unacceptable (score = 4) Information on preparation and storage was not reported OR serious flaws reported with test substance preparation and/or storage conditions will have critical impacts on dose/concentration estimates and make the study unusable (e.g., instability of test substance in exposure medium was reported, or there was heterogeneous distribution of test substance in exposure matrix [e.g., aerosol deposition in exposure chamber, insufficient mixing of dietary matrix]). For inhalation studies, there was no mention of the method and equipment used to generate the test substance, or the method used is atypical and inappropriate.
Not rated/applicable
Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance]
Metric 8. Consistency of exposure administration
Were exposures administered consistently across study groups (e.g., same exposure frequency; same time of day; consistent gavage volumes or diet compositions in oral studies; consistent chamber designs, animals/chamber, and comparable particle size characteristics in inhalation studies; consistent application methods and volumes in dermal studies)?
High (score = 1) Details of exposure administration were reported and exposures were administered consistently across study groups in a scientifically sound manner (e.g., gavage volume was not excessive).
Medium (score = 2) Details of exposure administration were reported, but minor limitations in administration of exposures (e.g., accidental mistakes in dosing) were identified that are unlikely to have a substantial impact on results.
Low (score = 3) Details of exposure administration were reported, but deficiencies in administration of exposures (e.g., exposed at different times of day) are likely to have a substantial impact on results.
Unacceptable (score = 4) Critical exposure details (e.g., methods for generating atmosphere in inhalation studies) were not reported OR reported information indicated that exposures were not administered consistently across study groups (e.g., differing particle size), resulting in serious flaws that make the study unusable.
Not rated/applicable
Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance]
Metric 9.
Reporting of doses/concentrations Were doses/concentrations reported without ambiguity (e.g., point estimate in addition to a range)? In oral studies, if doses were not reported, was information reported that enabled dose estimation (e.g., test animal dietary intake and body weight monitoring data in dietary studies)? In inhalation studies, was test substance vapor/aerosol concentration measured analytically along with nominal and target concentrations? High (score = 1) For oral and dermal studies, administered doses/concentrations, or the information to calculate them, were reported without ambiguity. For inhalation studies, several specific considerations apply: Analytical, nominal and target chamber concentrations were all reported, with high confidence in the accuracy of the actual concentrations; the range of concentrations within a treatment group did not deviate widely (range should be within ±10% for gases and vapors and within ±20% for liquid and solid aerosols). The analytical method (HPLC, GC, IR spectrophotometry, etc.) used to measure chamber test substance and vehicle concentration was reported and appropriate. Actual chamber measurements using gravimetric filters are acceptable when testing dry aerosols and non-volatile liquid aerosols. The particle size distribution data, mass median aerodynamic diameter (MMAD), and geometric standard deviation were reported for all exposed groups (including vehicle controls, when used). Medium (score = 2) For oral and dermal studies, minor uncertainties in reporting of administered doses/concentrations occurred (e.g., dietary or air concentrations were not measured analytically) but are unlikely to have a substantial impact on results. For inhalation studies, several specific considerations apply: With gases only, actual concentrations were not reported but there is high confidence that the animals were exposed at approximately the reported target concentrations. [There is no comparable medium result for aerosols and vapors if analytical concentrations are not reported.] For inhalation studies (gas, vapor, aerosol), the analytical method used was less than ideal or subject to interference but nevertheless yielded fairly reliable measurements of chamber concentrations. 194 ------- Confidence Level (Score) Description Selected Score Particle size distribution data were not reported, but mass median aerodynamic diameter (MMAD), and geometric standard deviation values were reported for all exposed groups (including vehicle controls, when used). Low (score = 3) For oral and dermal studies, deficiencies in reporting of administered doses/concentrations occurred (e.g., no information on animal body weight or intake were provided) that are likely to have a substantial impact on results. For inhalation studies, several considerations apply: Using aerosols and vapors, a score of low is indicated if actual concentrations are not reported or the analytical method used, such as sampling tubes (e.g., Draeger tubes) provided imprecise measurements. An MMAD is reported but no geometric standard deviation or particle size distribution data were reported. Unacceptable (score = 4) The reported exposure levels could not be validated (e.g., lack of food or water intake data for dietary or water exposures in conjunction with evidence of palatability differences, lack of body weight data in conjunction with qualitative evidence for body weight differences across groups, inconsistencies in reporting, etc.). This is a serious flaw that makes the study unusable. 
For inhalation studies, actual concentrations were not reported along with animal responses (or lack of responses) that indicate exposure problems due to faulty test substance generation. Animals were exposed to an aerosol but no MMAD or particle size data were reported. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 10. Exposure frequency and duration Were the exposure frequency (hours/day and days/week) and duration of exposure reported and appropriate for this study type and/or outcome(s) of interest? High (score = 1) The exposure frequency and duration of exposure were reported and appropriate for this study type and/or outcome(s) of interest (e.g., inhalation exposure 6 hours/day, gavage 5 days/week, 2-year duration for cancer bioassays). Medium (score = 2) Minor limitations in exposure frequency and duration of exposure were identified (e.g., inhalation exposure of 4 hours/day instead of 6 hours/day in a repeated exposure study), but are unlikely to have a substantial impact on results. Low (score = 3) The duration of exposure and/or exposure frequency differed significantly from typical study designs (e.g., gavage 1 day/week) and these deficiencies are likely to have a substantial impact on results. Unacceptable (score = 4) The exposure frequency or duration of exposure were not reported OR 195 ------- Confidence Level (Score) Description Selected Score the reported exposure frequency and duration were not suited to the study type and/or outcome(s) of interest (e.g., study length inadequate to evaluate tumorigenicity). These are serious flaws that make the study unusable. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 11. Number of exposure groups and dose/concentration spacing Were the number of exposure groups and dose/concentration spacing justified by study authors (e.g., based on range-finding studies) and adequate to address the purpose of the study (e.g., to evaluate dose-response relationships, identify points of departure, inform MOA/AOP, etc.)? High (score = 1) The number of exposure groups and dose/concentration spacing were justified by study authors and considered adequate to address the purpose of the study (e.g., the selected doses produce a range of responses). Medium (score = 2) There were minor limitations regarding the number of exposure groups and/or dose/concentration spacing (e.g., unclear if lowest dose was low enough or the highest dose was high enough), but the number of exposure groups and spacing of exposure levels were adequate to show results relevant to the outcome of interest (e.g., observation of a dose-response relationship) and the concerns are unlikely to have a substantial impact on results. Low (score = 3) There were deficiencies regarding the number of exposure groups and/or dose/concentration spacing (e.g., narrow spacing between doses with similar responses across groups), and these are likely to have a substantial impact on results. Unacceptable (score = 4) The number of exposure groups and spacing were not reported OR dose groups and spacing were not relevant for the assessment (e.g., all doses in a developmental toxicity study produced overt maternal toxicity). 
These are serious flaws that make the study unusable.
Not rated/applicable
Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance]
Metric 12. Exposure route and method
Were the route and method of exposure reported and suited to the test substance (e.g., was the test substance non-volatile in dietary studies)?
High (score = 1) The route and method of exposure were reported and were suited to the test substance. For inhalation studies, a dynamic chamber was used. While dynamic nose-only (or head-only) studies are generally preferred, dynamic whole-body chambers are acceptable for gases and for vapors that do not condense.
Medium (score = 2) There were minor limitations regarding the route and method of exposure, but the researchers took appropriate steps to mitigate the problem (e.g., mixed diet fresh each day for volatile compounds). These limitations are unlikely to have a substantial impact on results. For inhalation studies, a dynamic whole-body chamber was used for vapors that may condense or for aerosols.28
Low (score = 3) There were deficiencies regarding the route and method of exposure that are likely to have a substantial effect on results. Researchers may have attempted to correct the problem, but the success of the mitigating action was unclear. For inhalation studies, there are significant flaws in the design or operation of the inhalation chamber, such as uneven distribution of test substance in a whole-body chamber, having less than 15 air changes/hour in a whole-body chamber, or using a whole-body chamber that is too small for the number and volume of animals exposed.
Unacceptable (score = 4) The route or method of exposure was not reported OR an inappropriate route or method (e.g., administration of a volatile organic compound via the diet) was used for the test substance without taking steps to correct the problem (e.g., mixing fresh diet). These are serious flaws that make the study unusable. For inhalation studies, either a static chamber was used, there is no description of the inhalation chamber, or an atypical exposure method was used, such as allowing a container of test substance to evaporate in a room.
Not rated/applicable
Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance]
28 This results in a medium score because, in addition to inhalation exposure to the test substance, there may also be significant oral exposure due to rodents grooming test substance that adheres to their fur. The combined oral and inhalation exposure results in a lower POD, which makes a test substance appear more toxic than it really is by the inhalation route.
Domain 4. Test Animals
Metric 13. Test animal characteristics
Were the test animal species, strain, sex, health status, age, and starting body weight reported? Was the test animal from a commercial source or in-house colony? Was the test species and strain an appropriate animal model for the evaluation of the specific outcome(s) of interest (e.g., routinely used for similar study types)?
High (score = 1) The test animal species, strain, sex, health status, age, and starting body weight were reported, and the test animal was obtained from a commercial source or laboratory-maintained colony. The test species and strain were an appropriate animal model for the evaluation of the specific outcome(s) of interest (e.g., routinely used for similar study types).
Medium (score = 2) Minor uncertainties in the reporting of test animal characteristics (e.g., health status, age, or starting body weight) are unlikely to have a substantial impact on results.
The test animals were obtained from a commercial source or in-house colony, and the test species/strain/sex was an appropriate animal model for the evaluation of the specific outcome(s) of interest (e.g., routinely used for similar study types).
Low (score = 3) The source of the test animal was not reported OR the test animal strain or sex was not reported. These deficiencies are likely to have a substantial impact on results.
Unacceptable (score = 4) The test animal species was not reported OR the test animal (species, strain, sex, life-stage, source) was not appropriate for the evaluation of the specific outcome(s) of interest (e.g., genetically modified animals, strain was uniquely susceptible or resistant to one or more outcome of interest). These are serious flaws that make the study unusable.
Not rated/applicable
Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance]
Metric 14. Adequacy and consistency of animal husbandry conditions
Were all husbandry conditions (e.g., housing, temperature) adequate and the same for control and exposed populations, such that the only difference was exposure to the test substance?
High (score = 1) All husbandry conditions were reported (e.g., temperature, humidity, light-dark cycle) and were adequate and the same for control and exposed populations, such that the only difference was exposure.
Medium (score = 2) Most husbandry conditions were reported and were adequate and similar for all groups. Some differences in conditions were identified among groups, but these differences were considered minor uncertainties or limitations that are unlikely to have a substantial impact on results.
Low (score = 3) Husbandry conditions were not sufficiently reported to evaluate if husbandry was adequate and if differences occurred between control and exposed populations. These deficiencies are likely to have a substantial impact on results.
Unacceptable (score = 4) There were significant differences in husbandry conditions between control and exposed groups (e.g., temperature, humidity, light-dark cycle) OR animal husbandry conditions deviated from customary practices in ways likely to impact study results (e.g., injuries and stress due to cage overcrowding). These are serious flaws that make the study unusable.
Not rated/applicable
Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance]
Metric 15. Number of animals per group
Was the number of animals per study group appropriate for the study type and outcome analysis?
High (score = 1) The number of animals per study group was reported, appropriate for the study type and outcome analysis, and consistent with studies of the same or similar type (e.g., 50/sex/group for rodent cancer bioassay, 10/sex/group for rodent subchronic study, etc.).
Medium (score = 2) The reported number of animals per study group was lower than the typical number used in studies of the same or similar type (e.g., 30/sex/group for rodent cancer bioassay, 8/sex/group for rodent subchronic study, etc.), but sufficient for statistical analysis, and this minor limitation is unlikely to have a substantial impact on results.
Low (score = 3) The reported number of animals per study group was not sufficient for statistical analysis (e.g., varying numbers per group with some groups consisting of only one animal) and this deficiency is likely to have a substantial impact on results.
Unacceptable (score = 4) The number of animals per study group was not reported OR the number of animals per study group was insufficient to characterize toxicological effects (e.g., 1-2 animals in each group). These are serious flaws that make the study unusable.
Not rated/applicable
Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance]
Domain 5. Outcome Assessment
Metric 16. Outcome assessment methodology
Did the outcome assessment methodology address or report the intended outcome(s) of interest? Was the outcome assessment methodology (including endpoints and timing of assessment) sensitive for the outcome(s) of interest (e.g., measured endpoints that are able to detect a true health effect or hazard)? Note: Outcome, as addressed in this domain, refers to health effects measured in an animal study (e.g., organ-specific toxicity, reproductive and developmental toxicity).
High (score = 1) The outcome assessment methodology addressed or reported the intended outcome(s) of interest and was sensitive for the outcome(s) of interest.
Medium (score = 2) The outcome assessment methodology partially addressed or reported the intended outcome(s) of interest (e.g., serum chemistry and organ weight evaluated in the absence of histology), but minor uncertainties are unlikely to have a substantial impact on results.
Low (score = 3) Significant deficiencies in the reported outcome assessment methodology were identified OR due to incomplete reporting, it was unclear whether methods were sensitive for the outcome of interest. This is likely to have a substantial impact on results.
Unacceptable (score = 4) The outcome assessment methodology was not reported OR the reported outcome assessment methodology was not sensitive for the outcome(s) of interest (e.g., evaluation of endpoints outside the critical window of development, a systemic toxicity study that evaluated only grossly observable endpoints, such as clinical signs and mortality, etc.). These are serious flaws that make the study unusable.
Not rated/applicable
Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance]
Metric 17. Consistency of outcome assessment
Was the outcome assessment carried out consistently (i.e., using the same protocol) across study groups (e.g., assessment at the same time after initial exposure in all study groups)?
High (score = 1) Details of the outcome assessment protocol were reported and outcomes were assessed consistently across study groups (e.g., at the same time after initial exposure) using the same protocol in all study groups.
Medium (score = 2) There were minor differences in the timing of outcome assessment across study groups, or incomplete reporting of minor details of outcome assessment protocol execution, but these uncertainties or limitations are unlikely to have a substantial impact on results.
Low (score = 3) Details regarding the execution of the study protocol for outcome assessment (e.g., timing of assessment across groups) were not reported, and these deficiencies are likely to have a substantial impact on results.
Unacceptable (score = 4) There were large inconsistencies in the execution of study protocols for outcome assessment across study groups OR outcome assessments were not adequately reported for meaningful interpretation of results. These are serious flaws that make the study unusable.
Not rated/applicable
Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance]
Metric 18. Sampling adequacy
Was sampling adequate for the outcome(s) of interest, including experimental unit (e.g., litter vs. individual animal weight), number of evaluations per dose group, and endpoint (e.g., number of slides evaluated per organ)?
High (score = 1) Details regarding sampling for the outcome(s) of interest were reported and the study used adequate sampling for the outcome(s) of interest (e.g., litter data provided for developmental studies; endpoints were evaluated in an adequate number of animals in each group).
Medium (score = 2) Details regarding sampling for the outcome(s) of interest were reported, but minor limitations were identified in the sampling of the outcome(s) of interest (e.g., histopathology was performed for the high-dose group and controls only, and treatment-related changes were observed at the high dose) that are unlikely to have a substantial impact on results.
Low (score = 3) Details regarding sampling of outcomes were not reported and this deficiency is likely to have a substantial impact on results.
Unacceptable (score = 4) Sampling was not adequate for the outcome(s) of interest (e.g., histopathology was performed on exposed groups, but not controls). This is a serious flaw that makes the study unusable.
Not rated/applicable
Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance]
Metric 19. Blinding of assessors
Were investigators assessing subjective outcomes (i.e., those evaluated using human judgment, including functional observational battery, qualitative neurobehavioral endpoints, histopathological re-evaluations) blinded to treatment group? If blinding was not applied, were quality control/quality assurance procedures for endpoint evaluation cited? Note that blinding is not required for initial histopathology review in accordance with Best Practices recommended by the Society of Toxicologic Pathology. This should be considered when rating this metric.a This metric is not rated/applicable for initial histopathology review or if no subjective outcomes were assessed (i.e., only automated measurements were included and/or human judgment was not applied).
High (score = 1) The study explicitly reported that investigators assessing subjective outcomes (i.e., those evaluated using human judgment, including functional observational battery, qualitative neurobehavioral endpoints, histopathological re-evaluations) were blinded to treatment group or that quality control/quality assurance methods were followed in the absence of blinding. 200 ------- Confidence Level (Score) Description Selected Score Medium (score = 2) The study reported that blinding was not possible, but steps were taken to minimize bias (e.g., knowledge of study group was restricted to personnel not assessing subjective outcome) and this minor uncertainty is unlikely to have a substantial impact on results. Alternately, blinding was not reported; however, lack of blinding is not expected to have a substantial impact on results. Low (score = 3) The study did not report whether assessors were blinded to treatment group for subjective outcomes, and this deficiency is likely to have a substantial impact on results. Unacceptable (score = 4) Information in the study report did not report whether assessors were blinded to treatment group for subjective outcomes or suggested that the assessment of subjective outcomes (e.g., functional observational battery, qualitative neurobehavioral endpoints, histopathological re-evaluations) was performed in a biased fashion (e.g., assessors of subjective outcomes were aware of study groups). This is a serious flaw that makes the study unusable. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 20. Negative control response Were the biological responses (e.g., histopathology, litter size, pup viability, etc.) of the negative control group(s) adequate? High (score = 1) The biological responses of the negative control group(s) were adequate (e.g., no/low incidence of histopathological lesions). Medium (score = 2) There were minor uncertainties or limitations regarding the biological responses of the negative control group(s) (e.g., differences in outcome between untreated and solvent controls) that are unlikely to have a substantial impact on results. Low (score = 3) The biological responses of the negative control group(s) were reported, but there were deficiencies regarding the control responses that are likely to have a substantial impact on results (e.g., elevated incidence of histopathological lesions). Unacceptable (score = 4) The biological responses of the negative control groups were not reported OR there was unacceptable variation in biological responses between control replicates. These are serious flaws that make the study unusable. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] 201 ------- Confidence Level (Score) Description Selected Score Domain 6. Confounding/Variable Control Metric 21 Confounding variables in test design and procedures Were there confounding differences among the study groups in initial body weight or test substance palatability that could influence the outcome assessment (e.g., did palatability issues lead to dehydration and/or malnourishment)? Did reflex bradypnea (i.e., reduced respiration and reduced test substance exposure) induced by respiratory irritants influence outcome assessment? 
Were normal signs of reflex bradypnea misinterpreted as neurologic, behavioral, or developmental effects (e.g. hypothermia, lethargy, unconsciousness, poor performance in behavioral studies, delayed pup development)? High (score = 1) There were no reported differences among the study groups in initial body weight, food or water intake, or respiratory rate that could influence the outcome assessment. Medium (score = 2) The study reported minor differences among the study groups (<20% difference from control) with respect to initial body weight, drinking water and/or food consumption due to palatability issues, or respiratory rate due to reflex bradypnea. These minor uncertainties are unlikely to have a substantial impact on results. Alternately, the lack of reporting of initial body weights, food/water intake, and/or respiratory rate is not likely to have a significant impact on results. Low (score = 3) Initial body weight, food/water intake, and respiratory rate were not reported. These deficiencies are likely to have a substantial impact on results. Unacceptable (score = 4) The study reported significant differences among the study groups with respect to initial body weight, decreased drinking water/food intake due to palatability issues (>20% difference from control) that could lead to dehydration and/or malnourishment, or reflex bradypnea that could lead to decreased oxygenation of the blood. These are serious flaws that makes the study unusable. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 22. Health outcomes unrelated to exposure Were there differences among the study groups in animal attrition or health outcomes unrelated to exposure (e.g., infection) that could influence the outcome assessment? Professional judgement should be used to determine whether or not signs of infection would invalidate the study. Criteria for High, Medium and Low are used when the study is still usable. High (score = 1) Details regarding animal attrition and health outcomes unrelated to exposure (e.g., infection) were reported for each study group and there were no differences among groups that could influence the outcome assessment. Medium (score = 2) Authors reported that one or more study groups experienced disproportionate animal attrition or health outcomes unrelated to exposure (e.g., infection), but data from the remaining exposure groups were valid and the low incidence of attrition is unlikely to have a substantial impact on results OR data on attrition and/or health outcomes unrelated to exposure for each study group were not reported because only substantial differences among groups were noted (as indicated by study authors). Low (score = 3) Data on attrition and/or health outcomes unrelated to exposure were not reported for each study group and this deficiency is likely to have a substantial impact on results. OR data on attrition and/or health outcomes 202 ------- Confidence Level (Score) Description Selected Score are reported and could have substantial impact on results. Unacceptable (score = 4) One or more study groups experienced serious animal attrition or health outcomes unrelated to exposure (e.g., infection). This is a serious flaw that makes the study unusable. 
Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 7. Data Presentation and Analysis Metric 23. Statistical methods Were statistical methods clearly described and appropriate for dataset(s) (e.g., parametric test for normally distributed data)? High (score = 1) Statistical methods were clearly described and appropriate for dataset(s) (e.g., parametric test for normally distributed data). OR no statistical analyses, calculation methods, and/or data manipulation were conducted but sufficient data were provided to conduct an independent statistical analysis. Medium (score = 2) Statistical analysis was described with some omissions that would unlikely have a substantial impact on results. Low (score = 3) Statistical analysis was not described clearly, and this deficiency is likely to have a substantial impact on results. Unacceptable (score = 4) Statistical methods were not appropriate (e.g., parametric test for non- normally distributed data) OR statistical analysis was not conducted AND data were not provided preventing an independent statistical analysis. These are serious flaws that make the study unusable. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 24. Reporting of data Were the data for all outcomes presented? Were data reported by exposure group and sex (if applicable), with numbers of animals affected and numbers of animals evaluated (for quantal data) or group means and variance (for continuous data)? If severity scores were used, was the scoring system clearly articulated? High (score = 1) Data for exposure-related findings were presented for all outcomes by exposure group and sex (if applicable) with quantal and/or continuous presentation and description of severity scores if applicable. Negative findings were reported qualitatively or quantitatively. Medium (score = 2) Data for exposure-related findings were reported for most, but not all, outcomes by exposure group and sex (if applicable) with quantal and/or continuous presentation and description of severity scores if applicable. The minor uncertainties in outcome reporting are unlikely to have substantial impact on results. Low (score = 3) Data for exposure-related findings were not shown for each study group, but results were described in the text and/or data were only reported for some outcomes. These deficiencies are likely to have a substantial impact on 203 ------- Confidence Level (Score) Description Selected Score results. Unacceptable (score = 4) Data presentation was inadequate (e.g., the report does not differentiate among findings in multiple exposure groups) OR major inconsistencies were present in reporting of results. These are serious flaws that make the study unusable. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 8. 
Other (Apply as Needed) Metric: High (score = 1) Medium (score = 2) Low (score = 3) Unacceptable (score = 4) Not rated/applicable Reviewer's comments Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] aCrissman et al. (2004) 204 ------- G.5.2 In Vitro Toxicity Studies Table G-15. Serious Flaws that Would Make In Vitro Toxicity Studies Unacceptable Optimization of the list of serious flaws may occur after pilot calibration exercises. Domain Metric Description of Serious Flaw(s) in Data Source3 Test Substance Test Substance Identity The test substance identity and form (if applicable) could not be determined from the information provided (e.g., nomenclature was unclear and CASRN or structure were not reported) OR the components and ratios of mixtures were not characterized. Test Substance Source The test substance was not obtained from a manufacturer OR if synthesized or extracted, analytical verification of the test substance was not conducted. Test Substance Purity The nature and quantity of reported impurities were such that study results were likely to be due to one or more of the impurities. Test Design Negative Controls A concurrent negative control group was not included or reported OR the reported negative control group was not appropriate (e.g., different cell lines used for controls and test substance exposure). Positive Controls A concurrent positive control or proficiency group was not used (when applicable). Assay Procedures Assay methods and procedures were not reported OR assay methods and procedures were not appropriate for the study type (e.g., in vitro skin corrosion protocol used for in vitro skin irritation assay). Standards for Testing QC criteria were not reported and/or inadequate data were provided to demonstrate validity, acceptability, and reliability of the test when compared with current standards and guidelines. Exposure Characterization Preparation and Storage of Test Substance Information on preparation and storage was not reported OR serious flaws reported with test substance preparation and/or storage conditions will have critical impacts on dose/concentration estimates and make the study unusable (e.g., instability of test substance in exposure media, test substance volatilized rapidly from the open containers that were used as test vessels). Consistency of Administration Critical exposure details (e.g., amount of test substance used) were not reported OR exposures were not administered consistently across and/or within study groups (e.g., 75 mg/cm2 and 87 mg/cm2 administered to reconstructed corneas replicate 1 and replicate 2, respectively, in in vitro eye irritation test) resulting in serious flaws that make the study unusable. Reporting of Concentrations The exposure doses/concentrations or amounts of test substance were not reported resulting in serious flaws. 205 ------- Domain Metric Description of Serious Flaw(s) in Data Source3 No information on exposure duration(s) was reported OR the exposure duration was not appropriate for the study type and/or outcome of interest (e.g., 5 hours for reconstructed epidermis in skin irritation test, 24 hours exposure for bacterial reverse mutation test). 
Exposure Duration
Number of Exposure Groups and Concentration Spacing: The number of exposure groups and dose/concentration spacing were not reported OR the number of exposure groups and dose/concentration spacing were not relevant for the assessment (e.g., all concentrations used in an in vitro mammalian cell micronucleus test were cytotoxic).
Metabolic Activation: No information on the characterization and use of a metabolic activation system was reported.
Test Model
Test Model: The test model and descriptive information were not reported OR the test model was not appropriate for evaluation of the specific outcome of interest (e.g., bacterial reverse mutation assay to evaluate chromosome aberrations).
Number per Group: The number of organisms or tissues per study group and/or replicates per study group were not reported OR the number of organisms or tissues per study group and/or replicates per study group were insufficient to characterize toxicological effects (e.g., one tissue/test concentration/one exposure time for in vitro skin corrosion test, one replicate/strain of bacteria exposed in bacterial reverse mutation assay).
Outcome Assessment
Outcome Assessment Methodology: The outcome assessment methodology was not reported OR the assessment methodology was not appropriate for the outcome(s) of interest (e.g., cells were evaluated for chromosomal aberrations immediately after exposure to the test substance instead of after a post-exposure incubation period, cytotoxicity was not determined prior to the CD86/CD54 expression measurement assay, and labeling antibodies were not tested on proficiency substances in an in vitro skin sensitization test in h-CLAT cells).
Consistency of Outcome Assessment: There were large inconsistencies in the execution of study protocols for outcome assessment across study groups OR outcome assessments were not adequately reported for meaningful interpretation of results.
Sampling Adequacy: Reported sampling was not adequate for the outcome(s) of interest and/or serious uncertainties or limitations were identified in how the study carried out the sampling of the outcome(s) of interest (e.g., replicates from control and test concentrations were evaluated at different times).
Blinding of Assessors: Information in the study report suggested that the assessment of subjective outcomes was performed in a biased fashion (e.g., assessors of subjective outcomes were aware of study groups).
Confounding/Variable Control
Confounding Variables in Test Design and Procedures: There were significant differences among the study groups with respect to the strain/batch/lot number of organisms or models used per group or the size and/or quality of tissues exposed (e.g., the initial number of viable bacterial cells was different for each replicate [10⁵ cells in replicate 1, 10⁴ cells in replicate 2, and 10³ cells in replicate 3], or tissues from two different lots were used for an in vitro skin corrosion test but the control batch quality for one lot was outside of the acceptability range).
Confounding Variables in Outcomes Unrelated to Exposure: One or more replicates or groups (i.e., negative and positive controls) experienced disproportionate growth or reduction in growth unrelated to exposure (e.g., contamination) such that no outcomes could be assessed.
Data Analysis Statistical methods, calculation methods, or data manipulation were not appropriate (e.g., Student's t-test used to compare 2 groups in a multi-group study, parametric test for non-normally distributed data) OR statistical analysis was not conducted AND data enabling an independent statistical analysis were not provided. Data Presentation and Analysis Data Interpretation The reported scoring and/or evaluation criteria were inconsistent with established practices resulting in the interpretation of data results that are seriously flawed. Cytotoxicity Data Cytotoxicity endpoints were not defined, methods were not described, and it could not be determined that cytotoxicity was accounted for in the interpretation of study results. Reporting of Data Data presentation was inadequate (e.g., the report did not differentiate among findings in multiple exposure groups, no scores or frequencies were reported), or major inconsistencies were present in reporting of results. Note: a If the metric does not apply to the study type, the flaw will not be applied to determine unacceptability. 207 ------- Table G-16. Data Quality Criteria for In Vitro Toxicity Studies Confidence Level (Score) Description Selected Score Domain 1. Test Substance Metric 1. Test substance identity Was the test substance identified definitively (i.e., established nomenclature, CASRN, physical nature, physiochemical properties, and/or structure reported, including information on the specific form tested [e.g., salt or base, valence state, isomer, if applicable] for materials that may vary in form)? If test substance was a mixture, were mixture components and ratios characterized? High (score = 1) The test substance was identified definitively (i.e., established nomenclature, CASRN, physical nature, physiochemical properties, and/or structure reported, including information on the specific form tested (e.g., salt or base, valence state, isomer, [if applicable]) for materials that may vary in form. For mixtures, the components and ratios were characterized. Medium (score = 2) The test substance and form (if applicable) were identified, and components and ratios of mixtures were characterized, but there were minor uncertainties (e.g., minor characterization details were omitted) that are unlikely to have a substantial impact on results. Low (score = 3) The test substance and form (if applicable) were identified, and components and ratios of mixtures were characterized, but there were uncertainties regarding test substance identification or characterization that are likely to have a substantial impact on the results. Unacceptable (score = 4) The test substance identity and form (if applicable) could not be determined from the information provided (e.g., nomenclature was unclear and CASRN or structure were not reported) OR the components and ratios of mixtures were not characterized. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 2. Test substance source Was the source of the test substance reported, including manufacturer and batch/lot number for materials that may vary in composition? If synthesized or extracted, was test substance identity verified by analytical methods? 
High (score = 1) The source of the test substance was reported, including manufacturer and batch/lot number for materials that may vary in composition, and its identity was certified by manufacturer and/or verified by analytical methods (melting point, chemical analysis, etc.). Medium (score = 2) The source of the test substance and/or the analytical verification of a synthesized test substance was reported incompletely, but the omitted details are unlikely to have a substantial impact on the results. Low (score = 3) Omitted details on the source of the test substance and/or analytical verification of a synthesized test substance are likely to have a substantial impact on the results. Unacceptable (score = 4) The test substance was not obtained from a manufacturer OR if synthesized or extracted, analytical verification of the test substance was not conducted. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any 208 ------- Confidence Level (Score) Description Selected Score additional comments that may highlight study strengths or important elements such as relevance] Metric 3. Test substance purity Was the purity or grade (i.e., analytical, technical) of the test substance reported and adequate to identify its toxicological effects? Were impurities identified? Were impurities present in quantities that could influence the results? High (score = 1) The test substance purity and composition were such that any observed effects were highly likely to be due to the nominal test substance itself (e.g., ACS grade, analytical grade, reagent grade test substance or a formulation comprising primarily inert ingredients with small amount of active ingredient). Impurities, if identified, were not present in quantities that could influence the results. Medium (score = 2) Minor uncertainties or limitations were identified regarding the test substance purity and composition; however, the purity and composition were such that observed effects were more likely than not to be due to the nominal test substance and impurities, if identified, were unlikely to have a substantial impact on the results. Low (score = 3) Purity and/or grade of test substance were not reported OR the percentage of the reported purity was such that the observed effects may not have been due to the nominal test substance. Unacceptable (score = 4) The nature and quantity of reported impurities were such that study results were likely to be due to one or more of the impurities. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 2. Test Design Metric 4. Negative controls Was a concurrent negative (untreated, sham-treated, and/or vehicle, as necessary) control group included? High (score = 1) Study authors reported using a concurrent negative control group (untreated, sham-treated, and/or vehicle, as applicable) in which all conditions equal except exposure to test substance. Medium (score = 2) Study authors reported using a concurrent negative control group, but all conditions were not equal to those of treated groups; however, the identified differences are considered to be minor limitations that are unlikely to have substantial impact on results. 
Low (score = 3) Study authors acknowledged using a concurrent negative control group, but details regarding the negative control group were not reported, and the lack of details is likely to have a substantial impact on the results. Unacceptable (score = 4) A concurrent negative control group was not included or reported OR the reported negative control group was not appropriate (e.g., different cell lines used for controls and test substance exposure). Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important 209 ------- Confidence Level (Score) Description Selected Score elements such as relevance] Metric 5. Positive controls Was a concurrent positive or proficiency control group included, if applicable, based on study type, and was the response appropriate in this group (e.g., induction of positive effect)? *This metric is applicable studies that require a concurrent positive control. High (score = 1) A concurrent positive control or proficiency control group, if applicable, was used and the intended positive response was induced. Medium (score = 2) A concurrent positive control or proficiency control was used, but there were minor uncertainties (e.g., minor details regarding control exposure or response were omitted) that are unlikely to have a substantial impact on results. Low (score = 3) A concurrent positive control or proficiency control was used, but there were uncertainties regarding the control exposure or response that are likely to have a substantial impact on results (e.g., the control response was not described). Unacceptable (score = 4) A concurrent positive control or proficiency group was not used. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 6. Assay procedures Were assay methods and procedures (e.g., test conditions, cell density culture media and volumes, pre- and post- incubation temperatures, humidity, reaction mix, washing/rinsing methods, incubation with amino acids, slide preparation, instrument used and calibration, wavelengths measured) described in detail and applicable to the study type? High (score = 1) Study authors described the methods and procedures (e.g., test conditions, cell density culture media and volumes, pre- and post-incubation temperatures, humidity, reaction mix, washing/rinsing methods, incubation with amino acids, slide preparation, instrument used and calibration, wavelengths measured) used for the test in detail and they were applicable for the study type (e.g., protocol for in vitro skin irritation test was reported). Medium (score = 2) Methods and procedures were partially described and/or cited in another publication(s), but appeared to be appropriate (e.g., reporting that "calculations were used for enumerating viable and mutant cells" in a mammalian cell gene mutation test using Hprt and xprt genes instead of inclusion of the equations) to the study type, so the omission is unlikely to have a substantial impact on results. Low (score = 3) The methods and procedures were not well described or deviated from customary practices (e.g., post-incubation time was not stated in a mammalian cell gene mutation test using Hprt and xprt genes) and this is likely to have a substantial impact on results. 
Unacceptable (score = 4) Assay methods and procedures were not reported OR assay methods and procedures were not appropriate for the study type (e.g., 210 ------- Confidence Level (Score) Description Selected Score in vitro skin corrosion protocol used for in vitro skin irritation assay). Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 7. Standards for tests For assays with established criteria, were the test validity, acceptability, reliability, and/or QC criteria reported and consistent with current standards and guidelines? Example acceptability and QC criteria for an in vitro skin corrosion test using the EpiSkin™ (SM) model: Acceptability criteria: negative control OD values between >0.6 and <1.5, variability of the positive control replicates should be <20% of negative control, difference of viability between 2 tissue replicates should not exceed 30% in the range of 20-100% viability and for EDs>0.3: QC criteria: Only QC-accepted tissue batches having an IC5o range of 1.0-3.0 mg/mL were used.) * This metric is generally applicable to studies using reconstructed human cells and may not be applicable to other studies. High (score = 1) The test validity, acceptability, reliability, and/or QC criteria were reported and consistent with current standards and guidelines,3 if applicable. Medium (score = 2) Not applicable for this metric. Low (score = 3) Not applicable for this metric. Unacceptable (score = 4) QC criteria were not reported and/or inadequate data were provided to demonstrate validity, acceptability, and reliability of the test when compared with current standards and guidelines. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 3. Exposure Characterization Metric 8. Preparation and storage of test substance Did the study characterize preparation of the test substance and storage conditions? Were the frequency of preparation and/or storage conditions appropriate to the test substance stability and solubility (if applicable)? High (score = 1) The test substance preparation and/or storage conditions (e.g., test substance stability, homogeneity, mixing temperature, stock concentration, stirring methods, centrifugation/filtration, aerosol/vapor generation method, storage conditions) were reported and appropriate (e.g., stability in exposure media confirmed, volatile test substances prepared and stored in sealed containers) for the test substance. Medium (score = 2) The test substance preparation and storage conditions were reported, but minor limitations in the test substance preparation and/or storage conditions were identified (e.g., test substance formulations were stirred instead of centrifuged for a specific number of rotations per minute) that are unlikely to have a substantial impact on results. Low (score = 3) Deficiencies in reporting of test substance preparation, and/or storage conditions are likely to have a substantial impact on results (e.g., available information on physical-chemical properties suggests that stability and/or solubility of test substance in vehicle or culture media may be poor). 
Unacceptable (score = 4) Information on preparation and storage was not reported OR 211 ------- Confidence Level (Score) Description Selected Score serious flaws reported with test substance preparation and/or storage conditions will have critical impacts on dose/concentration estimates and make the study unusable (e.g., instability of test substance in exposure media, test substance volatilized rapidly from the open containers that were used as test vessels). Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 9. Consistency of administration Were exposures administered consistently across study groups (e.g., consistent application methods and volumes, control for evaporation)? High (score = 1) Details of exposure administration were reported and exposures were administered consistently across study groups in a scientifically sound manner (e.g., consistent application methods and volumes, control for evaporation). Medium (score = 2) Details of exposure administration were reported or inferred from the text, but the minor limitations in administration of exposures (e.g., accidental mistakes in dosing) that were identified are unlikely to have a substantial impact on results. Low (score = 3) Details of exposure administration were reported, but deficiencies in administration of exposures (e.g., non-calibrated instrument used to administer test substance) that were reported or inferred from the text are likely to have a substantial impact on results. Unacceptable (score = 4) Critical exposure details (e.g., amount of test substance used) were not reported OR exposures were not administered consistently across and/or within study groups (e.g., 75 mg/cm2 and 87 mg/cm2 administered to reconstructed corneas replicate 1 and replicate 2, respectively, in in vitro eye irritation test) resulting in serious flaws that make the study unusable. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 10. Reporting of concentrations Were exposure doses/concentrations or amounts of test substance reported without ambiguity (e.g., point estimate instead of range, analytical instead of nominal)? High (score = 1) The exposure doses/concentrations or amounts of test substance were reported without ambiguity (e.g., point estimate instead of range, analytical instead of nominal). Medium (score = 2) Not applicable for this metric. Low (score = 3) Not applicable for this metric. Unacceptable (score = 4) The exposure doses/concentrations or amounts of test substance were not reported resulting in serious flaws. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any 212 ------- Confidence Level (Score) Description Selected Score additional comments that may highlight study strengths or important elements such as relevance] Metric 11. Exposure duration Was the exposure duration (e.g., minutes, hours, days) reported and appropriate for this study type and/or outcome(s) of interest? 
High (score = 1) The exposure duration (e.g., minutes, hours, days) was reported and appropriate for the study type and/or outcome(s) of interest (e.g., 60-minute exposure for reconstructed epidermis in skin irritation test, 48-72-hour exposure for bacterial reverse mutation assay). Medium (score = 2) Duration(s) of exposure differed slightly from current standards and guidelinesᵃ for studies of this type (e.g., 65 minutes for reconstructed epidermis in skin irritation test), but the differences are unlikely to have a substantial impact on results. Low (score = 3) Duration(s) of exposure were not clearly stated (e.g., exposure duration was described only in qualitative terms) or duration(s) differed significantly from studies of the same or similar types. These deficiencies are likely to have a substantial impact on results. Unacceptable (score = 4) No information on exposure duration(s) was reported OR the exposure duration was not appropriate for the study type and/or outcome of interest (e.g., 5-hour exposure for reconstructed epidermis in skin irritation test, 24-hour exposure for bacterial reverse mutation test). Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 12. Number of exposure groups and concentration spacing Were the number of exposure groups and dose/concentration spacing justified by study authors (e.g., based on study type, range-finding study, and/or cytotoxicity studies) and adequate to address the purpose of the study (e.g., to evaluate dose-response relationships, inform MOA/AOP)? High (score = 1) The number of exposure groups and dose/concentration spacing were justified by study authors (e.g., based on study type, range-finding study, and/or cytotoxicity studies) and considered adequate to address the purpose of the study (e.g., to evaluate dose-response relationships, inform MOA/AOP). Medium (score = 2) There were minor limitations regarding the number of exposure groups and/or dose/concentration spacing, but the number of exposure groups and spacing of exposure levels were adequate to show results relevant to the outcome of interest (e.g., observation of a dose-response relationship) and the concerns are unlikely to have a substantial impact on results. Low (score = 3) There were deficiencies regarding the number of exposure groups and/or dose/concentration spacing (e.g., one bacterial strain exposed to 2 concentrations of the test substance in bacterial reverse mutation assay), and these concerns are likely to have a substantial impact on interpretation of the results. Unacceptable (score = 4) The number of exposure groups and dose/concentration spacing were not reported OR the number of exposure groups and dose/concentration spacing were not
Were the source, method of preparation, concentration or volume in final culture, and quality control information on the metabolic activation system reported? High (score = 1) Study authors reported that exposures were conducted in the presence of metabolic activation, and the type and source, method of preparation, concentration or volume in final culture, and quality control information of the metabolic activation system were described. Medium (score = 2) The presence of a commonly used metabolic activation system (e.g., Aroclor-, ethanol-, or phenobarbital/β-naphthoflavone-induced rat, hamster, or mouse liver cells) was reported in the study; however, some details regarding type, composition mix, concentration, or quality control information were not described. These omissions are unlikely to have a substantial impact on the results. Low (score = 3) The presence of a metabolic activation system was reported in the study, but the system described was not validated (e.g., rigorous testing to ensure that it is suitable for the purpose for which it is used) or comparable to commonly used systems (e.g., Aroclor-, ethanol-, or phenobarbital/β-naphthoflavone-induced rat, hamster, or mouse liver cells). Unacceptable (score = 4) No information on the characterization and use of a metabolic activation system was reported. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 4. Test Model Metric 14. Test model Were the test models (e.g., cell types or lines, tissue models) and descriptive information (e.g., tissue origin, number of passages, karyotype features, doubling times, donor information, biomarkers) reported? Was the test model from a commercial source or an in-house culture? Was the model routinely used for the outcome of interest (e.g., Chinese hamster ovary cells for micronucleus formation)? High (score = 1) The test model (e.g., cell types or lines, tissue models) and descriptive information (e.g., tissue origin, number of passages, karyotype features, doubling times, donor information, biomarkers) were reported, the test model was obtained from a commercial source or laboratory-maintained culture, and the test model was routinely used for the outcome of interest (e.g., Chinese hamster ovary cells for micronucleus formation). Medium (score = 2) The test model was reported along with limited descriptive information. The test model was routinely used for the outcome of interest. Reporting limitations are unlikely to have a substantial impact on results. Low (score = 3) The test model was reported but no additional details were reported AND/OR the test model was not routinely used for the outcome of interest (e.g., feline cell line for micronucleus formation). This is likely to have a substantial impact on results. Unacceptable (score = 4) The test model and descriptive information were not reported OR the test model was not appropriate for evaluation of the specific outcome of interest (e.g., bacterial reverse mutation assay to evaluate chromosome aberrations). Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 15.
Number per group Was the number of organisms or tissues per study group and/or replicates per study group reported and appropriate for the study type and outcome analysis? High (score = 1) The number of organisms or tissues per study group and/or number of replicates per study group were reported and were appropriate3 for the study type and outcome analysis, and consistent with studies of the same or similar type (e.g., at least two replicates/test substance/3 different exposure times for in vitro skin corrosion test, 3 replicates/strain of bacteria in bacterial reverse mutation assay). Medium (score = 2) The number of organisms or tissues per study group and/or replicates per study group were reported but were lower than the typical number used in studies of the same or similar type (e.g., 3 replicates/strain of bacteria in bacterial reverse mutation assay), but were sufficient for analysis and unlikely to have a substantial impact on results. Low (score = 3) The number of organisms or tissues per study group and/or replicates per study group were reported but were less than recommended by current standards and guidelines3 (e.g., one tissue/test concentration/exposure time for in vitro skin corrosion test). This is likely to have a substantial impact on results. Unacceptable (score = 4) The number of organisms or tissues per study group and/or replicates per study group were not reported OR the number of organisms or tissues per study group and/or replicates per study group were insufficient to characterize toxicological effects (e.g., one tissue/test concentration/one exposure time for in vitro skin corrosion test, one replicate/strain of bacteria exposed in bacterial reverse mutation assay). Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 5. Outcome Assessment Metric 16. Outcome assessment methodology Did the outcome assessment methodology address or report the intended outcome(s) of interest? Was the outcome assessment methodology (including endpoints and timing of assessment) sensitive for the outcome(s) of interest (e.g., measured endpoints that are able to detect a true effect)? High (score = 1) The outcome assessment methodology addressed or reported the intended outcome(s) of interest and was sensitive for the outcome(s) of interest. 215 ------- Confidence Level (Score) Description Selected Score Medium (score = 2) The outcome assessment methodology used only partially addressed or reported the intended outcomes(s) of interest (e.g., mutation frequency evaluated in the absence of cytotoxicity in a gene mutation test), but minor uncertainties are unlikely to have a substantial impact on results. Low (score = 3) Significant deficiencies in the reported outcome assessment methodology were identified (e.g., optimum time for expression of chromosomal aberrations after exposure to test compound was not determined) OR due to incomplete reporting, it was unclear whether methods were sensitive for the outcome of interest. This is likely to have a substantial impact on results. Unacceptable (score = 4) The outcome assessment methodology was not reported OR the assessment methodology was not appropriate for the outcome(s) of interest (e.g., cells were evaluated for chromosomal aberrations immediately after exposure to the test substance instead of after post-exposure incubation period). 
Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 17. Consistency of outcome assessment Was the outcome assessment carried out consistently (i.e., using the same protocol) across study groups (e.g., assessment at the same time after initial exposure in all study groups)? High (score = 1) Details of the outcome assessment protocol were reported and outcomes were assessed consistently across study groups (e.g., at the same time after initial exposure) using the same protocol in all study groups. Medium (score = 2) There were minor differences in the timing of outcome assessment across study groups, or incomplete reporting of minor details of outcome assessment protocol execution, but these uncertainties or limitations are unlikely to have substantial impact on results. Low (score = 3) Details regarding the execution of the study protocol for outcome assessment (e.g., timing of assessment across groups) were not reported, and these deficiencies are likely to have a substantial impact on results. Unacceptable (score = 4) There were large inconsistencies in the execution of study protocols for outcome assessment across study groups OR outcome assessments were not adequately reported for meaningful interpretation of results. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 18. Sampling adequacy Was the reported sampling adequate for the outcome(s) of interest, including number of evaluations per exposure group, and endpoint (e.g., number of replicates/slides/cells/metaphases evaluated per test concentration)? High (score = 1) The study reported adequate sampling for the outcome(s) of interest including number of evaluations per exposure group, and endpoint (e.g., number of replicates/slides/cells/metaphases [at least 300 well-spread 216 ------- Confidence Level (Score) Description Selected Score metaphases scored/concentration in a chromosome aberration test]). Medium (score = 2) Details regarding sampling for the outcome(s) of interest were reported, but minor limitations were identified in the reported sampling of the outcome(s) of interest, but those are unlikely to have a substantial impact on results. Low (score = 3) Details regarding sampling of outcomes were not fully reported and the omissions are likely to have a substantial impact on results. Unacceptable (score = 4) Reported sampling was not adequate for the outcome(s) of interest and/or serious uncertainties or limitations were identified in how the study carried out the sampling of the outcome(s) of interest (e.g., replicates from control and test concentrations were evaluated at different times). Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 19. Blinding of assessors Were investigators assessing subjective outcomes (i.e., those evaluated using human judgment) blinded to treatment group? This metric is not rated/applicable if no subjective outcomes were assessed (i.e., only automated measurements were included and human judgment was not applied). 
High (score = 1) The study explicitly reported that investigators assessing subjective outcomes (i.e., those evaluated using human judgment) were blinded to treatment group or that quality control/quality assurance methods were followed in the absence of blinding. Medium (score = 2) The study reported that blinding was not possible, but steps were taken to minimize bias (e.g., knowledge of study group was restricted to personnel not assessing subjective outcome) and this minor uncertainty is unlikely to have a substantial impact on results. Low (score = 3) The study did not report whether assessors were blinded to treatment group for subjective outcomes, and this deficiency is likely to have a substantial impact on results. Unacceptable (score = 4) Information in the study report suggested that the assessment of subjective outcomes was performed in a biased fashion (e.g., assessors of subjective outcomes were aware of study groups). Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 6. Confounding/Variable Control Metric 20. Confounding variables in test design and procedures Were there confounding differences among the study groups in the strain/batch/lot number of organisms or models used per group, size, and/or quality of tissues exposed, or lot of test substance used that could influence the outcome assessment? High (score = 1) There were no differences reported among study group parameters (e.g., test substance lot or batch, strain/batch/ lot number of organisms or models used per group or size, and/or quality of tissues exposed) that could influence the outcome assessment. Medium Minor differences were reported in initial conditions that are unlikely to have 217 ------- Confidence Level (Score) Description Selected Score (score = 2) a substantial impact on results (e.g., tissues from two different lots were used for in vitro skin corrosion test, and QC data were similar for both lots). Low (score = 3) Initial strain/batch/lot number of organisms or models used per group, size, and/or quality of tissues exposed was not reported. These deficiencies are likely to have a substantial impact on results. Unacceptable (score = 4) There were significant differences among the study groups with respect to the strain/batch/lot number of organisms or models used per group or size and/or quality of tissues exposed (e.g., initial number of viable bacterial cells were different for each replicate [105 cells in replicate 1,10s cell in replicate 2, and 103 cells in replicate 3], tissues from two different lots were used for in vitro skin corrosion test, but the control batch quality for one lot was outside of the acceptability range). Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 21. Confounding variables in outcomes unrelated to exposure Were there differences among the study groups unrelated to exposure to test substance (e.g., contamination) that could influence the outcome assessment? Did the test material interfere in the assay (e.g., altering fluorescence or absorbance, signal quenching by heavy metals, altering pH, solubility or stability issues)? 
High (score = 1) There were no reported differences among the study replicates or groups in the test model unrelated to exposure (e.g., contamination), and the test substance did not interfere with the assay (e.g., signal quenching by heavy metals). Medium (score = 2) Authors reported that one or more replicates or groups experienced disproportionate outcomes unrelated to exposure (e.g., contamination), but data from the remaining exposure replicates or groups were valid and this is unlikely to have a substantial impact on results OR data on disproportionate outcomes unrelated to exposure were not reported because only substantial differences among groups were noted (as indicated by study authors) OR the test material interfered in the assay, but the interference did not cause substantial differences among the groups. Low (score = 3) Data on outcome differences unrelated to exposure were not reported for each study replicate or group. Assay interference was present or inferred, resulting in large variability among the groups. The absence of this information is likely to have a substantial impact on results. Unacceptable (score = 4) One or more replicates or groups (i.e., negative and positive controls) experienced disproportionate growth or reduction in growth unrelated to exposure (e.g., contamination), or assay interference occurred such that no outcomes could be assessed. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 7. Data Presentation and Analysis Metric 22. Data analysis Were statistical methods, calculation methods, and/or data manipulation clearly described and appropriate for the dataset(s)? High (score = 1) Statistical methods, calculation methods, and/or data manipulation were clearly described and presented for the dataset(s) (e.g., frequencies of chromosomal aberrations were statistically analyzed across groups, trend test used to determine dose relationships, or results compared to historical negative control data) OR no statistical analyses, calculation methods, and/or data manipulation were conducted, but sufficient data were provided to conduct an independent statistical analysis. Medium (score = 2) Statistical analysis was described with some omissions that are unlikely to have a substantial impact on results. Low (score = 3) Statistical analysis was not described clearly, and this deficiency is likely to have a substantial impact on results. Unacceptable (score = 4) Statistical methods were not appropriate (e.g., Student's t-test used to compare 2 groups in a multi-group study, parametric test for non-normally distributed data) OR statistical analysis was not conducted AND data were not provided, preventing an independent statistical analysis. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 23. Data interpretation Were the scoring and/or evaluation criteria reported and consistent with standards and guidelines?
High (score = 1) Study authors reported the scoring and/or evaluation criteria (e.g., for determining negative, positive, and equivocal outcomes) for the test and these were consistent with established practices.3 Medium (score = 2) Scoring and/or evaluation criteria were partially reported (e.g., evaluation criteria were reported following 3- and 60-minute exposures, but not for 240-minute exposure in in vitro skin corrosion test), but the omissions are unlikely to have a substantial impact on results. Low (score = 3) Scoring and/or evaluation criteria were not reported and the omissions are likely to have a substantial impact on interpretation of the results. Unacceptable (score = 4) The reported scoring and/or evaluation criteria were inconsistent with established practices, resulting in the interpretation of data results that are seriously flawed. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] 219 ------- Confidence Level (Score) Description Selected Score Metric 24. Cytotoxicity data Were cytotoxicity endpoints defined, if necessitated by study type, and were methods for measuring cytotoxicity described and commonly used for assessment3? High (score = 1) Study authors defined cytotoxicity endpoints (e.g., cell integrity, apoptosis, necrosis, color induction, cell viability, mitotic index) and the methods for measuring cytotoxicity were clearly described and commonly used for assessment. Medium (score = 2) Cytotoxicity endpoints were defined and methods of measurement were partially reported, but the omissions are unlikely to have substantial impact on study results. Low (score = 3) Cytotoxicity endpoints were defined, but the methods of measurements were not fully described or reported, and the omissions are likely to have a substantial impact on the study results. Unacceptable (score = 4) Cytotoxicity endpoints were not defined, methods were not described, and it could not be determined that cytotoxicity was accounted for in the interpretation of study results. Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Metric 25. Reporting of data Were the data for all outcomes presented? Were data reported by exposure group? High (score = 1) Data for exposure-related findings were presented for all outcomes by exposure group. Negative findings were reported qualitatively or quantitatively. Medium (score = 2) Data for exposure-related findings were reported for most, but not all, outcomes by exposure group (e.g., sensitization percentages reported in the absence of incidence data). The minor uncertainties in outcome reporting are unlikely to have substantial impact on results. Low (score = 3) Data for exposure-related findings were not shown for each study group, but results were described in the text and/or data were only reported for some outcomes. These deficiencies are likely to have a substantial impact on results. Unacceptable (score = 4) Data presentation was inadequate (e.g., the report did not differentiate among findings in multiple exposure groups, no scores or frequencies were reported), or major inconsistencies were present in reporting of results. 
Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Domain 8. Other (Apply as Needed) Metric: High (score = 1) Medium (score = 2) Low (score = 3) Unacceptable (score = 4) Not rated/applicable Reviewer's comments [Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance] Note: a For comparison purposes, current standards and guidelines may be reviewed at http://www.oecd-ilibrary.org/environment/oecd-guidelines-for-the-testing-of-chemicals-section-4-health-effects_20745788; https://www.epa.gov/test-guidelines-pesticides-and-toxic-substances; https://www.fda.gov/Food/GuidanceRegulation/GuidanceDocumentsRegulatoryInformation/IngredientsAdditivesGRASPackaging/ucm2006826.htm#TQC.
G.6 References
1. Cooper, GS; Lunn, RM; Ågerstrand, M; Glenn, BS; Kraft, AD; Luke, AM; Ratcliffe, JM. (2016). Study sensitivity: Evaluating the ability to detect effects in systematic reviews of chemical exposures. Environ Int. 92-93: 605-610. http://dx.doi.org/10.1016/j.envint.2016.03.017.
2. Crissman, JW; Goodman, DG; Hildebrandt, PK; Maronpot, RR; Prater, DA; Riley, JH; Seaman, WJ; Thake, DC. (2004). Best practices guideline: Toxicologic histopathology. Toxicol Pathol. 32: 126-131. http://dx.doi.org/10.1080/01926230490268756.
3. EC. (2018). ToxRTool - Toxicological data Reliability assessment Tool. https://hero.epa.gov/heronet/index.cfm/reference/download/reference_id/4262819.
4. ECHA. (2011). Guidance on information requirements and chemical safety assessment. (ECHA-2011-G-13-EN). https://hero.epa.gov/heronet/index.cfm/reference/download/reference_id/4262842.
5. Hartling, L; Hamm, M; Milne, A; Vandermeer, B; Santaguida, PL; Ansari, M; Tsertsvadze, A; Hempel, S; Shekelle, P; Dryden, DM. (2012). Validity and inter-rater reliability testing of quality assessment instruments. (AHRQ Publication No. 12-EHC039-EF). Rockville, MD: Agency for Healthcare Research and Quality. https://hero.epa.gov/heronet/index.cfm/reference/download/reference_id/4262864.
6. Hooijmans, C; de Vries, R; Leenaars, M; Ritskes-Hoitinga, M. (2010). The Gold Standard Publication Checklist (GSPC) for improved design, reporting and scientific quality of animal studies; GSPC versus ARRIVE guidelines. http://dx.doi.org/10.1258/la.2010.010130.
7. Hooijmans, CR; Rovers, MM; de Vries, RBM; Leenaars, M; Ritskes-Hoitinga, M; Langendam, MW. (2014). SYRCLE's risk of bias tool for animal studies. BMC Medical Research Methodology. 14(1): 43. http://dx.doi.org/10.1186/1471-2288-14-43.
8. IPCS. (2010). Guidance on Characterization and Application of Physiologically Based Pharmacokinetic Models in Risk Assessment. https://hero.epa.gov/heronet/index.cfm/reference/download/reference_id/4262900.
9. Koustas, E; Lam, J; Sutton, P; Johnson, PI; Atchley, DS; Sen, S; Robinson, KA; Axelrad, DA; Woodruff, TJ. (2014). The Navigation Guide - Evidence-based medicine meets environmental health: Systematic review of nonhuman evidence for PFOA effects on fetal growth [Review]. Environ Health Perspect. 122(10): 1015-1027. http://dx.doi.org/10.1289/ehp.1307177; https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4181920/pdf/ehp.1307177.pdf.
10. Kushman, ME; Kraft, AD; Guyton, KZ; Chiu, WA; Makris, SL; Rusyn, I. (2013). A systematic approach for identifying and presenting mechanistic evidence in human health assessments. Regul Toxicol Pharmacol. 67(2): 266-277. http://dx.doi.org/10.1016/j.yrtph.2013.08.005; https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3818152/pdf/nihms516764.pdf.
11. Lynch, HN; Goodman, JE; Tabony, JA; Rhomberg, LR. (2016). Systematic comparison of study quality criteria. Regul Toxicol Pharmacol. 76: 187-198. https://hero.epa.gov/heronet/index.cfm/reference/download/reference_id/4262904.
12. Moermond, CTA; Kase, R; Korkaric, M; Ågerstrand, M. (2016). CRED: Criteria for reporting and evaluating ecotoxicity data. Environ Toxicol Chem. 35(5): 1297-1309. http://dx.doi.org/10.1002/etc.3259.
13. NTP. (2015). Handbook for conducting a literature-based health assessment using OHAT approach for systematic review and evidence integration. U.S. Dept. of Health and Human Services, National Toxicology Program. http://ntp.niehs.nih.gov/pubhealth/hat/noms/index-2.html.
14. Samuel, GO; Hoffmann, S; Wright, RA; Lalu, MM; Patlewicz, G; Becker, RA; DeGeorge, GL; Fergusson, D; Hartung, T; Lewis, RJ; Stephens, ML. (2016). Guidance on assessing the methodological and reporting quality of toxicologically relevant studies: A scoping review. Environ Int. 92-93: 630-646. https://hero.epa.gov/heronet/index.cfm/reference/download/reference_id/4262966.
15. U.S. EPA. (2006). Approaches for the application of physiologically based pharmacokinetic (PBPK) models and supporting data in risk assessment (Final Report) [EPA Report] (pp. 1-123). (EPA/600/R-05/043F). Washington, DC: U.S. Environmental Protection Agency, Office of Research and Development, National Center for Environmental Assessment. http://cfpub.epa.gov/ncea/cfm/recordisplay.cfm?deid=157668.
APPENDIX H: DATA QUALITY CRITERIA FOR EPIDEMIOLOGICAL STUDIES
H.1 Types of Data Sources The data quality will be evaluated for the epidemiological studies listed in Table H-1. Table H-1. Types of Epidemiological Studies Data Category Types of Data Sources Epidemiological Studies Controlled exposure, cohort, case-control, cross-sectional, case-crossover H.2 Data Quality Evaluation Domains The data sources will be evaluated against the following six data quality evaluation domains: study participation, exposure characterization, outcome assessment, potential confounding/variability control, analysis, and other. These domains, as defined in Table H-2, address elements of TSCA Science Standards 26(h)(1) through 26(h)(5). Table H-2. Data Evaluation Domains and Definitions Evaluation Domain Definition Study Participation Study design elements characterizing the selection of participants in or out of the study (or analysis sample), which influence whether the exposure-outcome distribution among participants is representative of the exposure-outcome distribution in the overall population of eligible persons. Exposure Characterization Evaluation of exposure assessment methodology that includes consideration of methodological quality, sensitivity, and validation of the methods used, degree of variation in participants, and an established time order between exposure and outcome. Outcome Assessment Evaluation of outcome (effect) assessment methodology that includes consideration of diagnostic methods, training of interviewers, data sources including registries, blinding to exposure status or level, and reporting of all results.
Potential Confounding / Variability Control Valid and reliable methods to reduce research-specific bias, including standardization, matching, adjustment in multivariate models, and stratification. This includes control of potential co-exposures when it is known that there is potential for co-exposure to occur and the co-exposure could influence the outcome of interest. Analysis Appropriate study design chosen for the research question with evaluation of statistical power, reproducibility, and statistical or modeling approaches. Other / Consideration for Biomarker Selection and Measurement Measures of biomarker (exposure and/or effect) data reliability. This includes but is not limited to evaluations of storage, stability and contamination of samples, validity and limits of detection of methods, method requirements, inclusion of matrix-specific considerations, and relationship of biomarker with external exposure, internal dose, or target dose. H.3 Data Quality Evaluation Metrics The data quality evaluation domains are evaluated by assessing two to seven unique metrics. Each metric is binned into a confidence level of High, Medium, Low, and/or Unacceptable. Each confidence level is assigned a numerical score (i.e., 1 through 4) that is used in the method of assessing the overall quality of the study. A summary of the number of metrics and the metric names for each data type is provided in Table H-3. Each domain has between 2 and 7 metrics. Metrics may be modified as EPA/OPPT acquires experience with the evaluation tool to support fit-for-purpose TSCA risk evaluations. Any modifications will be documented. Detailed tables showing confidence level specifications of the metrics are provided in Tables H-6 through H-8 for each data type, including separate tables that summarize the serious flaws which would make the data source unacceptable for use in the hazard assessment. Table H-3. Summary of Metrics for the Seven Data Types Evaluation Domain Number of Metrics Overall Metrics (Metric Number and Description) Study Participation 3 • Metric 1: Participant Selection • Metric 2: Attrition • Metric 3: Comparison Group Exposure Characterization 3 • Metric 4: Measurement of Exposure • Metric 5: Exposure Levels • Metric 6: Temporality Outcome Assessment 2 • Metric 7: Outcome Measurement or Characterization • Metric 8: Reporting Bias Potential Confounding / Variability Control 3 • Metric 9: Covariate Adjustment • Metric 10: Covariate Characterization • Metric 11: Co-exposure Confounding/Moderation/Mediation Analysis 4 • Metric 12: Study Design and Methods • Metric 13: Statistical Power • Metric 14: Reproducibility of Analyses • Metric 15: Statistical Models Other / Consideration for Biomarker Selection and Measurement 7 • Metric 16: Use of Biomarker of Exposure • Metric 17: Effect Biomarker • Metric 18: Method Sensitivity • Metric 19: Biomarker Stability • Metric 20: Sample Contamination • Metric 21: Method Requirements • Metric 22: Matrix Adjustment H.4 Scoring Method and Determination of Overall Data Quality Level A scoring system is used to assign the overall quality of the data source, as discussed in Appendix A. Each data source is assigned an overall qualitative confidence level of High, Medium, Low, or Unacceptable. This section provides details about the scoring system that will be applied to epidemiologic studies, including the weighting factors assigned to each metric score of each domain.
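As a concrete illustration of the scoring system detailed in Sections H.4.1 and H.4.2 below, the short sketch that follows computes an overall study score from individual metric scores and weighting factors. It is an informal illustration only and not part of EPA's evaluation tool: the function name and input structure are hypothetical, the example weights correspond to the Potential Confounding/Variable Control domain in Table H-5, and the confidence cut-offs are assumed from those shown with the worked example in Table H-7 (≥1 and <1.7 = High; ≥1.7 and <2.3 = Medium; ≥2.3 and ≤3 = Low).

```python
# Minimal sketch of the overall-score calculation described in Sections H.4.1/H.4.2.
# Metric names, weights, and scores below are illustrative; the actual weighting
# factors are those listed in Tables H-5 and H-6.

def overall_study_score(metrics):
    """metrics: list of (metric_score, weighting_factor) pairs.
    metric_score is 1 (High), 2 (Medium), 3 (Low), 4 (Unacceptable),
    or None when the metric is not rated/not applicable."""
    # Any single metric scored 4 makes the entire study Unacceptable.
    if any(score == 4 for score, _ in metrics if score is not None):
        return None, "Unacceptable"

    # Not rated/not applicable metrics are excluded from both sums.
    rated = [(s, w) for s, w in metrics if s is not None]
    score = sum(s * w for s, w in rated) / sum(w for _, w in rated)

    # Assumed confidence bins, per the cut-offs shown with Table H-7.
    if score < 1.7:
        level = "High"
    elif score < 2.3:
        level = "Medium"
    else:
        level = "Low"
    return score, level


# Hypothetical single-domain example: the critical metric carries twice the
# weight of each remaining metric, and the domain weights sum to 1.
example = [(1, 0.5), (2, 0.25), (None, 0.25)]
print(overall_study_score(example))  # -> (1.333..., 'High')
```

Applied to the worked example in Table H-7, the same arithmetic yields 8.47/5 ≈ 1.7, the Medium confidence level reported there.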
H.4.1 Weighting Factors The weighting method assumes that each domain carries an equal weight of 1. However, some metrics within a given domain are given greater weights than others in the same domain if they are regarded as key or critical metrics. Thus, EPA will use a weighting approach to reflect that some metrics are more important than others when assessing the overall quality of the epidemiologic data. Each key or critical metric is assigned a higher weighting factor. The critical metrics are identified based on professional judgment in conjunction with consideration of the factors that are most frequently included in other study quality/risk of bias tools for epidemiologic literature. In developing metrics for each domain, several basic elements for epidemiologic studies were incorporated to form the structure of the 6 domains (Blumenthal et al. 2001), each of which is considered to be an equally important aspect of an epidemiologic study. The critical metrics within each domain are those that cover the most important aspects of the domain and are those that more directly evaluate the role of confounding and bias. After pilot testing the evaluation tool, EPA recognized that more attention (or weight) should be given to studies that measure exposure and disease accurately and allow for the consideration of potential confounding factors. Therefore, metrics deemed as critical metrics are those that identify the major biases associated with the domain, evaluate the measurement of exposure and disease, and/or address any potential confounding. EPA/OPPT assigned each critical metric a weighting factor that is twice the value of the other metrics within the same domain. Remaining metrics are assigned a weighting factor of 0.5 times the weighting factor assigned to the critical metric(s) in the domain. The sum of the weighting factors for each domain equals one. Table H-4 identifies the critical metrics for epidemiologic studies and provides a rationale for why these metrics are considered to be of greater importance than others within the domain. Table H-5 identifies the weighting factors assigned to each metric for epidemiologic studies. Table H-4. Epidemiology Metrics with Greater Importance in the Evaluation and Rationale for Selection Domain Critical Metrics with Higher Weighting Factors (Metric Number)a Rationale Study Participation Participant Selection (Metric 1) The participants selected for the study must be representative of the target population. Differences between participants and nonparticipants determine the amount of bias present, and differences should be well-described (Galea and Tracy 2007). Attrition (Metric 2) Study attrition threatens the internal validity of studies, affects sample size, and compromises the precision of the measured associations (Kristman et al. 2004). Exposure characterization Measurement of Exposure (Metric 4) The exposure of interest should be well-defined and measured in a manner that is accurate, precise, and reliable to ensure the internal and external validity of the study findings (Blumenthal et al. 2001, Nieuwenhuijsen 2015). Temporality (Metric 6) Temporality is essential to causal inference. Details must be provided to ensure that the exposure sufficiently preceded the outcome and that enough time has passed since the exposure to observe said effect (Fedak et al. 2015).
Outcome assessment Outcome Measurement or Characterization (Metric 7) The methods used for outcome assessment must be fully described, valid, and sensitive to ensure that the observed effects are true and to enable valid comparisons across studies (Blumenthal et al. 2001). Potential Confounding/Variable Control Covariate Adjustment (Metric 9) Control for confounding variables, either through study design or analysis, is considered important to ensure that any observed effects are attributable to the chemical exposure of interest and not to other factors (Blumenthal et al. 2001). Analysis Study Design and Methods (Metric 12) The study design selected and the analytical techniques applied to the collected data must be suitable to address the research question at hand (Checkoway et al. 2007). a For the remaining metrics within the same domain, a weighting factor of 0.5 times the key metric weighting factor is assigned. H.4.2 Calculation of Overall Study Score A confidence level (1, 2, or 3 for High, Medium, or Low confidence, respectively) is assigned for each relevant metric within each domain. To determine the overall study score, the first step is to multiply the score for each metric (1, 2, or 3 for High, Medium, or Low confidence, respectively) by the appropriate weighting factor to obtain a weighted metric score. The weighted metric scores are then summed and divided by the sum of the weighting factors (for all metrics that are scored) to obtain an overall study score between 1 and 3. The equation for calculating the overall score is shown below: Overall Score (range of 1 to 3) = Σ (Metric Score × Weighting Factor) / Σ (Weighting Factors) Tables H-5 and H-6 present a summary of the domains, metrics, and weighting approach for epidemiological studies with or without biomarkers, respectively. Table H-7 provides a scoring example for epidemiological studies where sample size is not applicable. EPA/OPPT plans to use data with an overall quality level of High, Medium, or Low confidence to quantitatively or qualitatively support the risk evaluations, but does not plan to use data rated as Unacceptable. Studies with any single metric scored as 4 will be automatically assigned an overall quality score of Unacceptable, and further evaluation of the remaining metrics is not necessary. An Unacceptable score means that serious flaws are noted in the domain metric that consequently make the data unusable (or invalid). Any metrics that are not rated/not applicable to the study under evaluation are not considered in the calculation of the study's overall quality score. These metrics are not included in the numerator or denominator of the overall score equation. The overall score is calculated using only those metrics that receive a numerical score. In addition, if a publication reports more than one study or endpoint, each study and, as needed, each endpoint will be evaluated separately. Detailed tables showing quality criteria for the metrics are provided in Tables H-8 and H-9, including a table that summarizes the serious flaws that would make the data unacceptable for use in the human health hazard assessment. Table H-5.
Summary of Domain, Metrics, and Weighting Approach with Biomarkers Domain Metric Range of Metric Scores Metric weighting Factor Domain Weight Range of Weighted Metric Scores Study Participant Selection 1 to 3 0.4 1 0.4 to 1.2 Participation Attrition 1 to 3 0.4 0.4 to 1.2 Comparison Group 1 to 3 0.2 0.2 to 0.6 Exposure Characterization Measurement of Exposure 1 to 3 0.4 1 0.4 to 1.2 Exposure Levels 1 to 3 0.2 0.2 to 0.6 Temporality 1 to 3 0.4 0.4 to 1.2 Outcome Outcome measurement or characterization 1 to 3 0.67 1 0.67 to 2.01 Assessment Reporting Bias 1 to 3 0.33 0.33 to 0.99 Covariate Adjustment 1 to 3 0.5 0.5 to 1.5 Potential Covariate Characterization 1 to 3 0.25 0.25 to 0.75 Confounding/ Variable Control Co-exposure Confounding/Moderation/ Mediation 1 to 3 0.25 1 0.25 to 0.75 Study Design and Methods 1 to 3 0.4 0.4 to 1.2 Analysis Statistical Power 1 to 3 0.2 0.2 to 0.6 Reproducibility of Analyses 1 to 3 0.2 1 0.2 to 0.6 Statistical Models 1 to 3 0.2 0.2 to 0.6 Other (if applicable) Considerations for Use of Biomarker of Exposure 1 to 3 0.143 Effect Biomarker 1 to 3 0.143 1 Method Sensitivity 1 to 3 0.143 Biomarker Biomarker Stability 1 to 3 0.143 0.143 to 0.429 Selection and Measurement Sample Contamination 1 to 3 0.143 (Lakind et al., 2014) Method Requirements 1 to 3 0.143 Matrix Adjustment 1 to 3 0.143 Sum of Weighted Equation: Scores = 6 to 18 Overall Score = Sum of Weighted Scores/Sum of Metric Weighting Factor Sum of Metric Weighting Factors= 6/6=1; 18/6=3 Range of overall score = 1 to 3 228 ------- Table H-6. Summary of Domain, Metrics, and Weighting Approach for Studies without Biomarkers Domain Metric Range of Metric Metric weighting Scores Factor Domain Weight Range of Weighted Metric Scores Study Participation Participant Selection 1 to 3 0.4 1 0.4 to 1.2 Attrition 0.4 0.4 to 1.2 Comparison Group 0.2 0.2 to 0.6 Exposure Characterization Measurement of Exposure 0.4 1 0.4 to 1.2 Exposure Levels 0.2 0.2 to 0.6 Temporality 0.4 0.4 to 1.2 Outcome Assessment Outcome measurement or characterization 0.67 1 0.67 to 2.01 Reporting Bias 0.33 0.33 to 0.99 Potential Confounding/ Variable Control Covariate Adjustment 0.5 1 0.5 to 1.5 Covariate Characterization 0.25 0.25 to 0.75 Co-exposure Confounding/Moderation/Mediation 0.25 0.25 to 0.75 Analysis Study Design and Methods 0.4 1 0.4 to 1.2 Statistical Power 0.2 0.2 to 0.6 Reproducibility of Analyses 0.2 0.2 to 0.6 Statistical Models 0.2 0.2 to 0.6 Equation: Overall Score = Sum of Weighted Scores/Sum of Metric Weighting Factor Sum of Weighted Scores = 5 to 15 Sum of Metric Weighting Factors= 5 5/5=1; 15/5=3 Range of overall score = 1 to 3 229 ------- Table H-7. Example of Scoring for Epidemiologic Studies where Sample Size is Not Applicable Domain Metric Metric Score Metric Weighting Factor Weighted Score Study Participation 1. Participant Selection 1 0.4 0.4 2. Attrition 3 0.4 1.2 3. Comparison Group 2 0.2 0.4 Exposure Characterization 4. Measurement of Exposure 1 0.4 0.4 5. Exposure Levels 1 0.2 0.2 6. Temporality 1 0.4 0.8 Outcome Assessment 7. Outcome measurement or characterization 3 0.67 2.01 8. Reporting Bias 2 0.33 0.33 Potential Confounding/ Variable Control 9. Covariate Adjustment 1 0.67 0.67 10. Covariate Characterization 1 0.33 0.33 11. Co-exposure Confounding/Moderation/Mediation NR NR NR Analysis 12. Study Design and Methods 1 0.4 1.2 13. Statistical Power 1 0.2 0.4 14. Reproducibility of Analyses 3 0.2 0.2 15. 
Statistical Models 3 0.2 0.6 Sum of metric weighting factors = 5; sum of weighted scores = 8.47. Overall Study Score = 8.47/5 ≈ 1.7 = Medium. NR = not rated/not applicable. Equation: Overall Score = Sum of Weighted Scores/Sum of Metric Weighting Factors. High: ≥1 and <1.7; Medium: ≥1.7 and <2.3; Low: ≥2.3 and ≤3. H.5 Data Quality Criteria Table H-8. Serious Flaws that Would Make Epidemiological Studies Unacceptable for Use in the Hazard Assessment Optimization of the list of serious flaws may occur after pilot calibration exercises. Domain Metric Description of Serious Flaw(s) in Data Source Study Participation Participant Selection For all study types: The reported information indicates that selection in or out of the study (or analysis sample) and participation was likely to be significantly biased (i.e., the exposure-outcome distribution of the participants is likely not representative of the exposure-outcome distributions in the overall population of eligible persons). Attrition For cohort studies: The loss of subjects (i.e., incomplete outcome data) was large and unacceptably handled (as described above in the low confidence category) (Source: OHAT). OR Numbers of individuals were not reported at important stages of the study (e.g., numbers of eligible participants included in the study or analysis sample, completing follow-up, and analyzed). Reasons were not provided for non-participation at each stage [STROBE Checklist Item 13 (Von Elm et al., 2008)]. For case-control and cross-sectional studies: The exclusion of subjects from analyses was large and unacceptably handled (as described above in the low confidence category). OR Reasons were not provided for non-participation at each stage [STROBE Checklist Item 13 (Von Elm et al., 2008)]. Comparison Group For cohort studies: Subjects in all exposure groups were not similar, recruited within very different time frames, or had very different participation/response rates (NTP, 2015a). OR Information was not reported to determine if participants in all exposure groups were similar [STROBE Checklist 6 (Von Elm et al., 2008)]. For case-control studies: Controls were drawn from a very dissimilar population than cases or recruited within very different time frames (NTP, 2015a). OR Rationale and/or methods for case and control selection, and matching criteria including the number of controls per case (if relevant), were not reported [STROBE Checklist 6 (Von Elm et al., 2008)]. For cross-sectional studies: Subjects in all exposure groups were not similar, recruited within very different time frames, or had very different participation/response rates (NTP, 2015a). OR Sources and methods of selection of participants in all exposure groups were not reported [STROBE Checklist 6 (Von Elm et al., 2008)]. Exposure Characterization Measurement of Exposure For all study types: Exposure variables were not well defined, and sources of data and detailed methods of exposure assessment were not reported [STROBE Checklist 7 and 8 (Von Elm et al., 2008)]. OR Exposure was assessed using methods known or suspected to have poor validity (Source: OHAT). OR There is evidence of substantial exposure misclassification that would significantly alter results. Exposure Levels For all study types: The levels of exposure are not sufficient or adequate (as defined above) to detect an effect of exposure (Cooper et al., 2016). OR No description is provided on the levels or range of exposure.
Metric: Temporality
• For all study types: The study lacks an established time order, such that exposure is not likely to have occurred prior to outcome (Lakind et al., 2014). OR Exposures clearly fell outside of the relevant exposure window for the outcome of interest. OR For each variable of interest (outcome and predictor), sources of data and details of methods of assessment were not reported (e.g., periods of exposure, dates of outcome ascertainment, etc.) [STROBE Checklist 8 (Von Elm et al., 2008)].

Domain: Outcome Assessment

Metric: Outcome Measurement or Characterization
• For all study types: Numbers of outcome events or summary measures, or diagnostic criteria, were not defined or reported [STROBE Checklist 15 (Von Elm et al., 2008)].

Domain: Potential Confounding/Variable Control

Metric: Covariate Adjustment
• For cohort and cross-sectional studies: The distribution of primary covariates (excluding co-exposures) and known confounders differed significantly between the exposure groups. OR Confounding was demonstrated and was not appropriately adjusted for in the final analyses (NTP, 2015a).
• For case-control studies: The distribution of primary covariates (excluding co-exposures) and known confounders differed significantly between cases and controls. OR Confounding was demonstrated and was not appropriately adjusted for in the final analyses (NTP, 2015a).

Metric: Covariate Characterization
• For all study types: Primary covariates (excluding co-exposures) and confounders were not assessed.

Metric: Co-exposure Confounding/Moderation/Mediation
• For cohort and cross-sectional studies: There is direct evidence that there was an unbalanced provision of additional co-exposures across the primary study groups, which were not appropriately adjusted for.
• For case-control studies: There is direct evidence that there was an unbalanced provision of additional co-exposures across cases and controls, which were not appropriately adjusted for, with significant indication of a biased exposure-outcome association.

Domain: Analysis

Metric: Study Design and Methods
• For all study types: The study design chosen was not appropriate for the research question. OR Inappropriate statistical analyses were applied to assess the research questions.

Metric: Statistical Power (sensitivity)
• For cohort and cross-sectional studies: The number of participants is inadequate to detect an effect in the exposed population and/or subgroups of the total population.
• For case-control studies: The number of cases and controls is inadequate to detect an effect in the exposed population and/or subgroups of the total population.

Domain: Other (if applicable): Considerations for Biomarker Selection and Measurement (Lakind et al., 2014)

Metric: Use of Biomarker of Exposure
• Biomarker in a specified matrix is a poor surrogate (low accuracy and precision) for exposure/dose.

Metric: Effect Biomarker
• Biomarker has undetermined consequences (e.g., biomarker is not specific to a health outcome).

Metric: Method Sensitivity
• Frequency of detection is too low to address the research hypothesis. OR LOD/LOQ (value or %) are not stated.

Metric: Biomarker Stability
• Samples with either unknown storage history and/or no stability data for target analytes and a high likelihood of instability for the biomarker under consideration.

Metric: Sample Contamination
• There are known contamination issues and no documentation that the issues were addressed.

Metric: Method Requirements
• Instrumentation that only allows for possible quantification of the biomarker, but the method has known interferants (e.g., GC-FID, spectroscopy).
Metric: Matrix Adjustment
• If applicable for the biomarker under consideration, no established method for matrix adjustment was used.

Table H-9. Evaluation Criteria for Epidemiological Studies

(Columns: Confidence Level (Score) | Description | Selected Score)

Domain 1. Study Participation

Metric 1. Participant Selection (selection, performance biases)

Instructions: To meet criteria for confidence ratings for metrics where 'AND' is included, studies must address both of the conditions where 'AND' is stipulated. To meet criteria for confidence ratings for metrics where 'OR' is included, studies must address at least one of the conditions stipulated.

High (score = 1)
• For all study types: All key elements of the study design are reported (i.e., setting, participation rate described at all steps of the study, inclusion and exclusion criteria, and methods of participant selection or case ascertainment) AND the reported information indicates that selection in or out of the study (or analysis sample) and participation was not likely to be biased (i.e., the exposure-outcome distribution of the participants is likely representative of the exposure-outcome distributions in the overall population of eligible persons).
Medium (score = 2)
• For all study types: Some key elements of the study design were not present, but available information indicates a low risk of selection bias (i.e., the exposure-outcome distribution of the participants is likely representative of the exposure-outcome distributions in the overall population of eligible persons).
Low (score = 3)
• For all study types: Key elements of the study design and information on the comparison group (i.e., setting, participation rate described at most steps of the study, inclusion and exclusion criteria, and methods of participant selection or case ascertainment) are not reported [STROBE Checklist 4, 5 and 6 (Von Elm et al., 2008)].
Unacceptable (score = 4)
• For all study types: The reported information indicates that selection in or out of the study (or analysis sample) and participation was likely to be significantly biased (i.e., the exposure-outcome distribution of the participants is likely not representative of the exposure-outcome distributions in the overall population of eligible persons).
Not rated/applicable
• Do not select for this metric.
Reviewer's comments
[Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance]

Metric 2. Attrition (missing data/attrition/exclusion, reporting biases)
High (score = 1)
• For cohort studies: There was minimal subject attrition during the study (or exclusion from the analysis sample) and outcome data were largely complete.
OR
• Any loss of subjects (i.e., incomplete outcome data) was adequately* addressed (as described above) and reasons were documented when human subjects were removed from a study (NTP, 2015a).
OR
• Missing data have been imputed using appropriate methods (e.g., random regression imputation), and characteristics of subjects lost to follow-up or with unavailable records are described in an identical way and are not significantly different from those of the study participants (NTP, 2015a).
• For case-control studies and cross-sectional studies: There was minimal subject withdrawal from the study (or exclusion from the analysis sample) and outcome data were largely complete.
OR
• Any exclusion of subjects from analyses was adequately* addressed (as described above), and reasons were documented when subjects were removed from the study or excluded from analyses (NTP, 2015a).
*NOTE for all study types: Adequate handling of subject attrition includes: very little missing outcome data; reasons for missing subjects unlikely to be related to outcome (for survival data, censoring was unlikely to introduce bias); missing outcome data balanced in numbers across study groups, with similar reasons for missing data across groups.
Medium (score = 2)
• For cohort studies: There was moderate subject attrition during the study (or exclusion from the analysis sample).
AND
• Any loss or exclusion of subjects was adequately addressed (as described in the acceptable handling of subject attrition in the high confidence category) and reasons were documented when human subjects were removed from a study.
• For case-control studies and cross-sectional studies: There was moderate subject withdrawal from the study (or exclusion from the analysis sample), but outcome data were largely complete.
AND
• Any exclusion of subjects from analyses was adequately addressed (as described above), and reasons were documented when subjects were removed from the study or excluded from analyses (NTP, 2015a).
Low (score = 3)
• For cohort studies: There was large subject attrition during the study (or exclusion from the analysis sample).
OR
• Unacceptable handling of subject attrition: reason for missing outcome data likely to be related to true outcome, with either imbalance in numbers or reasons for missing data across study groups; or potentially inappropriate application of imputation (Source: OHAT).
• For case-control and cross-sectional studies: There was large subject withdrawal from the study (or exclusion from the analysis sample).
OR
• Unacceptable handling of subject attrition: reason for missing outcome data likely to be related to true outcome, with either imbalance in numbers or reasons for missing data across study groups; or potentially inappropriate application of imputation.
Unacceptable (score = 4)
• For cohort studies: The loss of subjects (i.e., incomplete outcome data) was large and unacceptably handled (as described above in the low confidence category) (Source: OHAT).
OR
• Numbers of individuals were not reported at important stages of the study (e.g., numbers of eligible participants included in the study or analysis sample, completing follow-up, and analyzed), and reasons were not provided for non-participation at each stage [STROBE Checklist Item 13 (Von Elm et al., 2008)].
• For case-control and cross-sectional studies: The exclusion of subjects from analyses was large and unacceptably handled (as described above in the low confidence category).
OR
• Reasons were not provided for non-participation at each stage [STROBE Checklist Item 13 (Von Elm et al., 2008)].
Not rated/applicable
• Do not select for this metric.
Reviewer's comments
[Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance]
Metric 3. Comparison Group (selection, performance biases)
High (score = 1)
• For cohort and cross-sectional studies: Key elements of the study design are reported (i.e., setting, inclusion and exclusion criteria, and methods of participant selection), and indicate that subjects (in all exposure groups) were similar (e.g., recruited from the same eligible population with the same method of ascertainment and within the same time frame using the same inclusion and exclusion criteria, and were of similar age and health status) (NTP, 2015a).
• For case-control studies: Key elements of the study design are reported (i.e., setting, inclusion and exclusion criteria, and methods of case ascertainment or control selection), and indicate that cases and controls were similar (e.g., recruited from the same eligible population with appropriate matching criteria, such as age, gender, and ethnicity, the number of controls described, and eligibility criteria other than outcome of interest as appropriate), recruited within the same time frame, and controls are described as having no history of the outcome (NTP, 2015a).
OR
• For all study types: Baseline characteristics of groups differed, but these differences were considered as potential confounding or stratification variables and were thereby controlled by statistical analysis (Source: OHAT).
Medium (score = 2)
• For cohort studies: There is indirect evidence (e.g., stated by the authors without providing a description of methods) that subjects (in all exposure groups) are similar (as described above for the high confidence rating).
AND
• The baseline characteristics for subjects (in all exposure groups) reported in the study are similar (NTP, 2015a).
• For case-control studies: There is indirect evidence (i.e., stated by the authors without providing a description of methods) that cases and controls are similar (as described above for the high confidence rating).
AND
• The characteristics of cases and controls reported in the study are similar (NTP, 2015a).
• For cross-sectional studies: There is indirect evidence (i.e., stated by the authors without providing a description of methods) that subjects (in all exposure groups) are similar (as described above for the high confidence rating) (Source: OHAT).
AND
• The characteristics of participants (in all exposure groups) reported in the study are similar.
Low (score = 3)
• For cohort studies: There is indirect evidence (i.e., stated by the authors without providing a description of methods) that subjects (in all exposure groups) were similar (as described above for the high confidence rating).
AND
• The baseline characteristics for subjects (in all exposure groups) are not reported (NTP, 2015a).
• For case-control studies: There is indirect evidence (i.e., stated by the authors without providing a description of methods) that cases and controls were similar (as described above for the high confidence rating).
AND
• The characteristics of cases and controls are not reported (NTP, 2015a).
• For cross-sectional studies: There is indirect evidence (i.e., stated by the authors without providing a description of methods) that subjects (in all exposure groups) were similar (as described above for the high confidence rating).
AND
• The characteristics of participants (in all exposure groups) are not reported (Source: OHAT).
Unacceptable (score = 4)
• For cohort studies: Subjects in all exposure groups were not similar, were recruited within very different time frames, or had very different participation/response rates (NTP, 2015a).
OR
• Information was not reported to determine whether participants in all exposure groups were similar [STROBE Checklist 6 (Von Elm et al., 2008)].
• For case-control studies: Controls were drawn from a very dissimilar population than cases or were recruited within very different time frames (NTP, 2015a).
OR
• Rationale and/or methods for case and control selection, and matching criteria including the number of controls per case (if relevant), were not reported [STROBE Checklist 6 (Von Elm et al., 2008)].
• For cross-sectional studies: Subjects in all exposure groups were not similar, were recruited within very different time frames, or had very different participation/response rates (NTP, 2015a).
OR
• Sources and methods of selection of participants in all exposure groups were not reported [STROBE Checklist 6 (Von Elm et al., 2008)].
Not rated/applicable
• Do not select for this metric.
Reviewer's comments
[Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance]

Domain 2. Exposure Characterization

Metric 4. Measurement of Exposure (detection/measurement/information, performance biases)
High (score = 1)
• For all study types: Exposure was consistently assessed (i.e., under the same method and time frame) using well-established methods (e.g., personal and/or industrial hygiene data used to determine levels of exposure, a frequently used biomarker of exposure) that directly measure exposure (e.g., measurement of the chemical in the environment (air, drinking water, consumer product, etc.) or measurement of the chemical concentration in a biological matrix such as blood, plasma, urine, etc.) (NTP, 2015a).
Medium (score = 2)
• For all study types: Exposure was directly measured and assessed using a method that is not well established (e.g., a newly developed biomarker of exposure), but the method was validated against a well-established method and demonstrated high agreement between the two methods.
Low (score = 3)
• For all study types: A less-established method (e.g., a newly developed biomarker of exposure) was used and no method validation was conducted against well-established methods, but there was little to no evidence that the method had poor validity and little to no evidence of significant exposure misclassification (e.g., differential recall of self-reported exposure) (Source: OHAT).
Unacceptable (score = 4)
• For all study types: Exposure variables were not well defined, and sources of data and detailed methods of exposure assessment were not reported [STROBE Checklist 7 and 8 (Von Elm et al., 2008)].
OR
• Exposure was assessed using methods known or suspected to have poor validity (Source: OHAT).
OR
• There is evidence of substantial exposure misclassification that would significantly alter results.
Not rated/applicable
• Do not select for this metric.
Reviewer's comments
[Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance]
Metric 5. Exposure Levels (detection/measurement/information biases)
High (score = 1)
• For all study types: The levels of exposure are sufficient* or adequate to detect an effect of exposure (Cooper et al., 2016).
*Sufficient or adequate for cohort and cross-sectional studies includes the reporting of at least 2 levels of exposure (referent group + 1 or more exposure groups) (Cooper et al., 2016) that capture exposure spatial and temporal variability within the study population (Source: IRIS).
Medium (score = 2)
• Do not select for this metric.
Low (score = 3)
• Do not select for this metric.
Unacceptable (score = 4)
• For all study types: The levels of exposure are not sufficient or adequate (as defined above) to detect an effect of exposure (Cooper et al., 2016).
OR
• No description is provided of the levels or range of exposure.
Not rated/applicable
• Do not select for this metric.
Reviewer's comments
[Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance]

Metric 6. Temporality (detection/measurement/information biases)
High (score = 1)
• For all study types: The study presents an established time order between exposure and outcome.
AND
• The interval between the exposure (or reconstructed exposure) and the outcome has an appropriate consideration of relevant exposure windows (Lakind et al., 2014).
Medium (score = 2)
• For all study types: Temporality is established, but it is unclear whether exposures fall within relevant exposure windows for the outcome of interest (Lakind et al., 2014).
Low (score = 3)
• For all study types: The temporality of exposure and outcome is uncertain.
Unacceptable (score = 4)
• For all study types: The study lacks an established time order, such that exposure is not likely to have occurred prior to outcome (Lakind et al., 2014).
OR
• Exposures clearly fell outside of the relevant exposure window for the outcome of interest.
OR
• For each variable of interest (outcome and predictor), sources of data and details of methods of assessment were not reported (e.g., periods of exposure, dates of outcome ascertainment, etc.) [STROBE Checklist 8 (Von Elm et al., 2008)].
Not rated/applicable
• Do not select for this metric.
Reviewer's comments
[Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance]

Domain 3. Outcome Assessment

Metric 7. Outcome Measurement or Characterization (detection/measurement/information, performance, reporting biases)
High (score = 1)
• For cohort studies: The outcome was assessed using well-established methods (e.g., the "gold standard").
AND
• Subjects had been followed for the same length of time in all study groups.
• For case-control studies: The outcome was assessed in cases (i.e., case definition) and controls using well-established methods (the gold standard).
AND
• Subjects had been followed for the same length of time in all study groups (NTP, 2015a).
• For cross-sectional studies: There is direct evidence that the outcome was assessed using well-established methods (the gold standard) (NTP, 2015a).
Note: Acceptable assessment methods will depend on the outcome, but examples of such methods may include: objectively measured with diagnostic methods, measured by trained interviewers, or obtained from registries (NTP, 2015a; Shamliyan et al., 2010).
Medium (score = 2)
• For all study types: A less-established method was used and no method validation was conducted against well-established methods, but there was little to no evidence that the method had poor validity and little to no evidence of outcome misclassification (e.g., differential reporting of outcome by exposure status).
Low (score = 3)
• For cohort studies: The outcome assessment method is an insensitive instrument or measure.
OR
• The length of follow-up differed by study group (NTP, 2015a).
• For case-control studies: The outcome was assessed in cases (i.e., case definition) using an insensitive instrument or measure (NTP, 2015a).
• For cross-sectional studies: The outcome assessment method is an insensitive instrument or measure (NTP, 2015a).
Unacceptable (score = 4)
• For all study types: Numbers of outcome events or summary measures, or diagnostic criteria, were not defined or reported [STROBE Checklist 15 (Von Elm et al., 2008)].
Not rated/applicable
• Do not select for this metric.
Reviewer's comments
[Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance]

Metric 8. Reporting Bias
High (score = 1)
• For all study types: All of the study's measured outcomes (primary and secondary) outlined in the protocol, methods, abstract, and/or introduction (that are relevant for the evaluation) are reported. This would include outcomes reported with sufficient detail to be included in meta-analysis or fully tabulated during data extraction, and analyses had been planned in advance (NTP, 2015a).
Medium (score = 2)
• For all study types: All of the study's measured outcomes (primary and secondary) outlined in the protocol, methods, abstract, and/or introduction (that are relevant for the evaluation) are reported, but not in a way that would allow for detailed extraction (e.g., results were discussed in the text but accompanying data were not shown).
Low (score = 3)
• For all study types: All of the study's measured outcomes (primary and secondary) outlined in the protocol, methods, abstract, and/or introduction (that are relevant for the evaluation) have not been reported. In addition to not reporting outcomes, this would include reporting outcomes based on a composite score without individual outcome components, or outcomes reported using measurements, analysis methods, or subsets of the data (e.g., subscales) that were not pre-specified, or reporting outcomes that were not pre-specified, or inclusion of unplanned analyses that would appreciably bias results (NTP, 2015a).
Unacceptable (score = 4)
• Do not select for this metric.
Not rated/applicable
• Do not select for this metric.
Reviewer's comments
[Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance]

Domain 4. Potential Confounding/Variable Control
Metric 9. Covariate Adjustment (confounding)
High (score = 1)
• For all study types: Appropriate adjustments or explicit considerations were made for primary covariates (excluding co-exposures) and confounders in the final analyses through the use of statistical models to reduce research-specific bias, including standardization, matching, adjustment in multivariate models, stratification, or other methods that were appropriately justified (NTP, 2015a).
Medium (score = 2)
• For all study types: There is indirect evidence that appropriate adjustments were made (i.e., considerations were made for adjustments for primary covariates (excluding co-exposures) and confounders) without providing a description of methods.
OR
• The distribution of primary covariates (excluding co-exposures) and known confounders did not differ significantly between exposure groups or between cases and controls.
OR
• The majority of the primary covariates (excluding co-exposures) and any known confounders were appropriately adjusted for, and any not adjusted for are considered unlikely to appreciably bias the results.
Low (score = 3)
• For all study types: There is indirect evidence (i.e., no description is provided in the study) that considerations were not made for adjustments for primary covariates (excluding co-exposures) and confounders in the final analyses (NTP, 2015a).
AND
• The distribution of primary covariates (excluding co-exposures) and known confounders was not reported between the exposure groups or between cases and controls (NTP, 2015a).
Unacceptable (score = 4)
• For cohort and cross-sectional studies: The distribution of primary covariates (excluding co-exposures) and known confounders differed significantly between the exposure groups.
OR
• Confounding was demonstrated and was not appropriately adjusted for in the final analyses (NTP, 2015a).
• For case-control studies: The distribution of primary covariates (excluding co-exposures) and known confounders differed significantly between cases and controls.
OR
• Confounding was demonstrated and was not appropriately adjusted for in the final analyses (NTP, 2015a).
Not rated/applicable
• Do not select for this metric.
Reviewer's comments
[Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance]

Metric 10. Covariate Characterization (measurement/information, confounding biases)
High (score = 1)
• For all study types: Primary covariates (excluding co-exposures) and confounders were assessed using valid and reliable methodology (e.g., validated questionnaires, biomarkers).
Medium (score = 2)
• For all study types: A less-established method was used and no method validation was conducted against well-established methods, but there was little to no evidence that the method had poor validity and little to no evidence of confounding.
Low (score = 3)
• For all study types: The primary covariate (excluding co-exposures) and confounder assessment method is an insensitive instrument or measure, or a method of unknown validity.
Unacceptable (score = 4)
• For all study types: Primary covariates (excluding co-exposures) and confounders were not assessed.
Not rated/applicable
• Do not select for this metric.
Reviewer's comments
[Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance]
Metric 11. Co-exposure Confounding/Moderation/Mediation (measurement/information, confounding biases)
High (score = 1)
• For all study types: Any co-exposures to pollutants that are not the target exposure and that would likely bias the results were not present.
OR
• Co-exposures to pollutants were appropriately measured and adjusted for.
Medium (score = 2)
• Do not select for this metric.
Low (score = 3)
• Do not select for this metric.
Unacceptable (score = 4)
• For cohort and cross-sectional studies: There is direct evidence that there was an unbalanced provision of additional co-exposures across the primary study groups, which were not appropriately adjusted for.
• For case-control studies: There is direct evidence that there was an unbalanced provision of additional co-exposures across cases and controls, which were not appropriately adjusted for, with significant indication of a biased exposure-outcome association.
Not rated/applicable
• Enter 'NA' and do not score this metric.
Reviewer's comments
[Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance]

Domain 5. Analysis

Metric 12. Study Design and Methods (reporting bias)
High (score = 1)
• For all study types: The study design chosen was appropriate for the research question (e.g., assess the association between exposure levels and common chronic diseases over time with cohort studies, assess the association between exposure and rare diseases with case-control studies, and assess the association between exposure levels and acute disease with a cross-sectional study design).
AND
• The study uses an appropriate statistical method to address the research question(s) (e.g., repeated measures analysis for longitudinal studies, logistic regression analysis for case-control studies).
Medium (score = 2)
• Do not select for this metric.
Low (score = 3)
• Do not select for this metric.
Unacceptable (score = 4)
• For all study types: The study design chosen was not appropriate for the research question.
OR
• Inappropriate statistical analyses were applied to assess the research questions.
Not rated/applicable
• Do not select for this metric.
Reviewer's comments
[Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance]

Metric 13. Statistical Power (sensitivity, reporting bias)
High (score = 1)
• For cohort and cross-sectional studies: The number of participants is adequate to detect an effect in the exposed population and/or subgroups of the total population.
OR
• The paper reported statistical power high enough (> 80%) to detect an effect in the exposed population and/or subgroups of the total population.
• For case-control studies: The number of cases and controls is adequate to detect an effect in the exposed population and/or subgroups of the total population.
OR
• The paper reported statistical power high enough (> 80%) to detect an effect in the exposed population and/or subgroups of the total population.
Medium (score = 2)
• Do not select for this metric.
Low (score = 3)
• Do not select for this metric.
Unacceptable (score = 4)
• For cohort and cross-sectional studies: The number of participants is inadequate to detect an effect in the exposed population and/or subgroups of the total population.
• For case-control studies: The number of cases and controls is inadequate to detect an effect in the exposed population and/or subgroups of the total population.
Not rated/applicable
• Do not select for this metric.
Reviewer's comments
[Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance]

Metric 14. Reproducibility of Analyses [adapted from Blettner et al. (2001)]
High (score = 1)
• For all study types: The description of the analysis is sufficient to understand precisely what has been done and to be reproducible.
Medium (score = 2)
• Do not select for this metric.
Low (score = 3)
• For all study types: The description of the analysis is insufficient to understand what has been done and to be reproducible, OR a description of the analyses is not present (e.g., statistical tests and estimation procedures were not described, variables used in the analysis were not listed, transformations of continuous variables (such as logarithms) were not explained, rules for categorization of continuous variables were not presented, deletion of outliers was not elucidated, and how missing values were dealt with was not mentioned).
Unacceptable (score = 4)
• Do not select for this metric.
Not rated/applicable
• Do not select for this metric.
Reviewer's comments
[Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance]

Metric 15. Statistical Models (confounding bias)
High (score = 1)
• For all study types: The statistical model building process is transparent (it is stated how/why variables were included or excluded from the multivariate model) AND model assumptions were met.
Medium (score = 2)
• Do not select for this metric.
Low (score = 3)
• For all study types: The statistical model building process is not transparent, OR it is not stated how/why variables were included or excluded from the multivariate model, OR model assumptions were not met, OR a description of the analyses is not present, OR no sensitivity analyses are described, OR model assumptions were not discussed [STROBE Checklist 12e (Von Elm et al., 2008)].
Unacceptable (score = 4)
• Do not select for this metric.
Not rated/applicable
• Enter 'NA' if the study did not use a statistical model.
Reviewer's comments
[Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance]

Domain 6. Other (if applicable): Considerations for Biomarker Selection and Measurement (Lakind et al., 2014)

Metric 16. Use of Biomarker of Exposure (detection/measurement/information biases)
High (score = 1)
• Biomarker in a specified matrix has an accurate and precise quantitative relationship with external exposure, internal dose, or target dose.
AND
• Biomarker is derived from exposure to one parent chemical.
Medium (score = 2)
• Biomarker in a specified matrix has an accurate and precise quantitative relationship with external exposure, internal dose, or target dose.
AND
• Biomarker is derived from multiple parent chemicals.
Low (score = 3)
• Evidence exists for a relationship between the biomarker in a specified matrix and external exposure, internal dose, or target dose, but there has been no assessment of accuracy and precision, or none was reported.
Unacceptable (score = 4)
• Biomarker in a specified matrix is a poor surrogate (low accuracy and precision) for exposure/dose.
Not rated/applicable
• Enter 'NA' and do not score the metric if no biomarker of exposure was measured.
Reviewer's comments
[Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance]

Metric 17. Effect Biomarker (detection/measurement/information biases)
High (score = 1)
• Bioindicator of a key event in an adverse outcome pathway (AOP).
Medium (score = 2)
• Biomarkers of effect shown to have a relationship to health outcomes using well-validated methods, but the mechanism of action is not understood.
Low (score = 3)
• Biomarkers of effect shown to have a relationship to health outcomes, but the method is not well validated and the mechanism of action is not understood.
Unacceptable (score = 4)
• Biomarker has undetermined consequences (e.g., biomarker is not specific to a health outcome).
Not rated/applicable
• Enter 'NA' and do not score the metric if no biomarker of effect was measured.
Reviewer's comments

Metric 18. Method Sensitivity (detection/measurement/information biases)
High (score = 1)
• Limits of detection are low enough to detect chemicals in a sufficient percentage of the samples to address the research question.
Medium (score = 2)
• Do not select for this metric.
Low (score = 3)
• Do not select for this metric.
Unacceptable (score = 4)
• Frequency of detection is too low to address the research hypothesis.
OR
• LOD/LOQ (value or %) are not stated.
Not rated/applicable
• Enter 'NA' and do not score the metric.
Reviewer's comments
[Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance]

Metric 19. Biomarker Stability (detection/measurement/information biases)
High (score = 1)
• Samples with a known history and documented stability data, or those using real-time measurements.
Medium (score = 2)
• Do not select for this metric.
Low (score = 3)
• Samples have known losses during storage, but the difference between low and high exposures can be qualitatively assessed.
Unacceptable (score = 4)
• Samples with either unknown storage history and/or no stability data for target analytes and a high likelihood of instability for the biomarker under consideration.
Not rated/applicable
• Enter 'NA' and do not score the metric if no biomarkers were assessed.
Reviewer's comments
[Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance]

Metric 20. Sample Contamination (detection/measurement/information biases)
High (score = 1)
• Samples are contamination-free from the time of collection to the time of measurement (e.g., by use of certified analyte-free collection supplies and reference materials, and appropriate use of blanks both in the field and lab).
AND
• Documentation of the steps taken to provide the necessary assurance that the study data are reliable is included.
Medium (score = 2)
• Samples are stated to be contamination-free from the time of collection to the time of measurement.
AND
• There is incomplete documentation of the steps taken to provide the necessary assurance that the study data are reliable.
Low (score = 3)
• Samples are known to have contamination issues, but steps have been taken to address and correct the contamination issues.
OR
• Samples are stated to be contamination-free from the time of collection to the time of measurement, but there is no use or documentation of the steps taken to provide the necessary assurance that the study data are reliable.
Unacceptable (score = 4)
• There are known contamination issues and no documentation that the issues were addressed.
Not rated/applicable
• Enter 'NA' and do not score the metric if no samples were collected.
Reviewer's comments
[Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance]

Metric 21. Method Requirements (detection/measurement/information biases)
High (score = 1)
• Instrumentation that provides unambiguous identification and quantitation of the biomarker at the required sensitivity (e.g., GC-HRMS, GC-MS/MS, LC-MS/MS).
Medium (score = 2)
• Do not select for this metric.
Low (score = 3)
• Instrumentation that allows for identification of the biomarker with a high degree of confidence and the required sensitivity (e.g., GC-MS, GC-ECD).
Unacceptable (score = 4)
• Instrumentation that only allows for possible quantification of the biomarker, but the method has known interferants (e.g., GC-FID, spectroscopy).
Not rated/applicable
• Enter 'NA' and do not score the metric if biomarkers were not measured.
Reviewer's comments
[Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance]

Metric 22. Matrix Adjustment (detection/measurement/information biases)
High (score = 1)
• If applicable for the biomarker under consideration, the study provides results, either in the main publication or as a supplement, for adjusted and unadjusted matrix concentrations (e.g., creatinine-adjusted or SG-adjusted and non-adjusted urine concentrations), and reasons are given for the adjustment approach.
Medium (score = 2)
• Do not select for this metric.
Low (score = 3)
• If applicable for the biomarker under consideration, the study only provides results using one method (matrix-adjusted or not).
Unacceptable (score = 4)
• If applicable for the biomarker under consideration, no established method for matrix adjustment was used.
Not rated/applicable
• Enter 'NA' and do not score the metric if not applicable for the biomarker or no biomarker was assessed.
Reviewer's comments
[Document concerns, uncertainties, limitations, and deficiencies and any additional comments that may highlight study strengths or important elements such as relevance]

H.6 References

1. Blettner, M; Heuer, C; Razum, O. (2001). Critical reading of epidemiological papers: A guide. Eur J Public Health 11(1): 97-101.
2. Checkoway, H; Pearce, N; Kriebel, D. (2007). Selecting appropriate study designs to address specific research questions in occupational epidemiology. Occup Environ Med 64: 633-638. http://dx.doi.org/10.1136/oem.2006.029967
3. Cooper, G; Lunn, R; Agerstrand, M; Glenn, B; Kraft, A; Luke, A; Ratcliffe, J. (2016). Study sensitivity: Evaluating the ability to detect effects in systematic reviews of chemical exposures. Environ Int 92-93: 605-610. http://dx.doi.org/10.1016/j.envint.2016.03.017
4. Fedak, KM; Bernal, A; Capshaw, ZA; Gross, S. (2015). Applying the Bradford Hill criteria in the 21st century: How data integration has changed causal inference in molecular epidemiology. Emerg Themes Epidemiol 12: 14. http://dx.doi.org/10.1186/s12982-015-0037-4
5. Galea, S; Tracy, M. (2007). Participation rates in epidemiologic studies [Review]. Ann Epidemiol 17: 643-653. http://dx.doi.org/10.1016/j.annepidem.2007.03.013
6. Kristman, V; Manno, M; Cote, P. (2004). Loss to follow-up in cohort studies: How much is too much? Eur J Epidemiol 19: 751-760.
7. Lakind, JS; Sobus, J; Goodman, M; Barr, DB; Fuerst, P; Albertini, RJ; Arbuckle, T; Schoeters, G; Tan, Y; Teeguarden, J; Tornero-Velez, R; Weisel, CP. (2014). A proposal for assessing study quality: Biomonitoring, Environmental Epidemiology, and Short-lived Chemicals (BEES-C) instrument. Environ Int 73: 195-207. http://dx.doi.org/10.1016/j.envint.2014.07.011; https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4310547/pdf/nihms-656623.pdf
8. Nieuwenhuijsen, MJ (Ed.). (2015). Exposure assessment in environmental epidemiology (2nd ed.). Canada: Oxford University Press.
9. NTP (National Toxicology Program). (2015). Handbook for conducting a literature-based health assessment using OHAT approach for systematic review and evidence integration. U.S. Dept. of Health and Human Services, National Toxicology Program. http://ntp.niehs.nih.gov/pubhealth/hat/noms/index-2.html
10. Shamliyan, T; Kane, RL; Dickinson, S. (2010). A systematic review of tools used to assess the quality of observational studies that examine incidence or prevalence and risk factors for diseases [Review]. J Clin Epidemiol 63(10): 1061-1070. http://dx.doi.org/10.1016/j.jclinepi.2010.04.014
11. Von Elm, E; Altman, DG; Egger, M; Pocock, SJ; Gøtzsche, PC; Vandenbroucke, JP. (2008). The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: Guidelines for reporting observational studies. J Clin Epidemiol 61(4): 344-349. https://hero.epa.gov/heronet/index.cfm/reference/download/reference_id/4263036
12. WHO (World Health Organization). (2001). Epidemiology: A tool for the assessment of risk. In L Fewtrell; J Bartram (Eds.), Water quality: Guidelines, standards and health: Assessment of risk and risk management for water-related infectious disease (pp. 135-160). London, UK: IWA Publishing. http://www.who.int/water_sanitation_health/dwq/iwaforeword.pdf
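To make the scoring arithmetic in Tables H-5 through H-7 concrete, the short sketch below computes an overall study score from metric scores and weighting factors. It is an illustrative aid only, not part of the evaluation criteria: the data layout, the function name, the proportional re-scaling of weighting factors within a domain when a metric is not rated, and the rounding of the overall score to one decimal place before assigning a quality level are assumptions inferred from the Table H-7 example rather than requirements stated in this appendix.

```python
# Illustrative sketch only (assumptions noted above); not part of the EPA evaluation criteria.

def overall_study_score(domains):
    """Compute an overall study score from Table H-5/H-6 style inputs.

    domains maps a domain name to a list of (metric_name, score, weighting_factor)
    tuples, where score is 1-3 or None for a metric that is not rated (NR).
    """
    weighted_sum = 0.0
    factor_sum = 0.0
    for metrics in domains.values():
        rated = [(score, weight) for (_, score, weight) in metrics if score is not None]
        if not rated:
            continue  # a domain with no rated metrics contributes nothing
        raw_total = sum(weight for _, weight in rated)
        for score, weight in rated:
            # Assumed NR handling: re-scale the remaining factors so each domain's
            # weighting factors still sum to 1 (e.g., 0.5/0.25 become 0.67/0.33 in Table H-7).
            adjusted = weight / raw_total
            weighted_sum += adjusted * score
            factor_sum += adjusted
    overall = round(weighted_sum / factor_sum, 1)  # Table H-7 reports 8.47/5 as 1.7
    if overall < 1.7:
        level = "High"
    elif overall < 2.3:
        level = "Medium"
    else:
        level = "Low"
    return overall, level

# The Table H-7 example: Metric 11 (co-exposure confounding) is not rated.
example = {
    "Study Participation": [("Participant Selection", 1, 0.4), ("Attrition", 3, 0.4),
                            ("Comparison Group", 2, 0.2)],
    "Exposure Characterization": [("Measurement of Exposure", 1, 0.4),
                                  ("Exposure Levels", 1, 0.2), ("Temporality", 1, 0.4)],
    "Outcome Assessment": [("Outcome Measurement", 3, 0.67), ("Reporting Bias", 2, 0.33)],
    "Potential Confounding/Variable Control": [("Covariate Adjustment", 1, 0.5),
                                               ("Covariate Characterization", 1, 0.25),
                                               ("Co-exposure Confounding", None, 0.25)],
    "Analysis": [("Study Design and Methods", 1, 0.4), ("Statistical Power", 1, 0.2),
                 ("Reproducibility of Analyses", 3, 0.2), ("Statistical Models", 3, 0.2)],
}

print(overall_study_score(example))  # (1.7, 'Medium'), matching the Table H-7 example
```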