EPA
United States
Environmental Protection
Agency
EPA/100/R-16/001 December 2016
www.epa.gov/osa
Weight of Evidence in Ecological Assessment
[Cover graphic: the WoE process diagram - Search Literature; Design & Conduct Studies; Weigh Body of Evidence (Integrate Evidence; Interpret Bodies of Evidence; Explain Ambiguities & Discrepancies).]
Office of the Science Advisor
Risk Assessment Forum

-------
EPA/100/R-16/001
www.epa.gov/osa
WEIGHT OF EVIDENCE IN ECOLOGICAL ASSESSMENT
Risk Assessment Forum
U.S. Environmental Protection Agency
Washington, DC 20460

-------
DISCLAIMER
This document has been reviewed in accordance with U.S. Environmental Protection Agency (EPA)
policy. Mention of trade names or commercial products does not constitute endorsement or
recommendation for use.
i

-------
CONTENTS
TABLES	iv
FIGURES	v
BOXES	vi
AUTHORS, CONTRIBUTORS, AND REVIEWERS	vii
PREFACE	ix
ACRONYMS AND ABBREVIATIONS	x
EXECUTIVE SUMMARY	xi
1.	INTRODUCTION	1
1.1.	Definitions	1
1.2.	After 50 Years, Advancing Beyond Hill's Considerations	2
1.3.	Scope	3
1.4.	Benefits and Challenges of Weight of Evidence	4
2.	APPLICATIONS OF WEIGHT OF EVIDENCE	7
2.1.	Contaminated Sites	8
2.2.	Environmental Condition	10
2.3.	Existing Pesticides and Industrial Chemicals	10
2.4.	New Pesticides and Industrial Chemicals	11
2.5.	Benchmark Derivation	12
2.6.	Proposed Discharges	12
2.7.	Special Purpose Assessments	12
3.	PROCESSES FOR WEIGHING EVIDENCE	13
3.1.	An Introduction to the Weight of Evidence Process	13
3.2.	Planning the Assessment to Use Weight of Evidence	15
3.3.	Results and Transition	19
4.	ASSEMBLING EVIDENCE	20
4.1.	The Process for Assembling Evidence	20
4.2.	Searching Literature and Assembling Evidence	21
4.2.1.	Search the literature	22
4.2.2.	Screen the studies	23
4.2.3.	Categorize the studies	24
4.2.4.	Derive evidence from data and general knowledge	24
4.3.	Design and Conduct Studies and Assemble the Evidence	25
4.4.	Summary	26
4.5.	Results and Transition	26
5.	WEIGHTING EVIDENCE	27
5.1.	The Process of Weighting Evidence	27
5.2.	Scoring Systems	28
5.3.	Properties to Be Weighted	29
5.4.	Tables of Weights	35
5.5.	Not Combining Weights for Properties of Evidence	37
5.6.	Summary	37
5.7.	Results and Transition	38
ii

-------
6.	WEIGHING BODIES OF EVIDENCE	39
6.1.	The Weighing Process	39
6.2.	Integrating Evidence	39
6.3.	Interpreting Bodies of Evidence	45
6.4.	Explaining Ambiguities and Discrepancies	47
6.5.	Presenting Results	49
6.6.	Summary	50
7.	SPECIAL CASES AND ABBREVIATED PROCESSES	51
7.1.	Weighing Without Weighting	51
7.2.	Weighting a Single Piece of Evidence	52
8.	WEIGHT OF EVIDENCE FOR QUANTITATIVE RESULTS	53
8.1.	Weight of Evidence for the Quality to Be Quantified	53
8.2.	Weight of Evidence for Deriving the Quantity	54
8.2.1.	Combining quantitative evidence	55
8.2.2.	Choosing the best quantitative evidence	55
8.3.	Weighting a Quantitative Result	56
9.	WEIGHT OF EVIDENCE AND UNCERTAINTY	57
10.	WEIGHT-OF-EVIDENCE SUMMARY AND THE PATH FORWARD	59
11.	REFERENCES	61
APPENDIX A. GLOSSARY OF WEIGHT-OF-EVIDENCE TERMS	A-1
APPENDIX B. WEIGHT-OF-EVIDENCE METHODS FOR QUANTITATIVE
RESULTS	B-1
APPENDIX C. WEIGHT-OF-EVIDENCE METHODS FOR DERIVING A MODEL	C-1
APPENDIX D. WEIGHT-OF-EVIDENCE APPROACHES FOR QUALITATIVE
CONCLUSIONS	D-1
APPENDIX E. CHARACTERISTICS OF INFERRED QUALITIES	E-1
iii

-------
TABLES
Table 5-1. Weighting the strength of correlations (absolute value of r) and noting the
logical implication—an example for evidence from stream biological
surveys	31
Table 5-2. Table of standard scores for 3 example types of evidence out of 15 types
in CADDIS	32
Table 5-3. Generic scoring table based on conventional types of evidence, with first
line hypothetically completed	36
Table 5-4. Example scoring table: scoring types of evidence for sufficiency	37
Table 6-1. A generic weight-of-evidence table for n alternative causal hypotheses
(H1, H2, ... Hn), based on causal characteristics and collective properties of
the bodies of evidence	42
Table 6-2. Partial WoE table for alternative possible causes of the decline of San
Joaquin kit foxes	43
Table 6-3. Summary of evidence concerning risks to fish from a diesel spill	44
Table 6-4. Example of weighing evidence for a potential cause, major ions measured
as conductivity, of the loss of macroinvertebrate genera	45
Table 6-5. Weight of evidence for causal determinations in the 2013 lead ISA	47
Table 7-1. Summary of evidence for lead as a cause of mass mortality of tundra
swans in the Coeur d'Alene River Watershed	52
Table 8-1. Qualities that could be identified by qualitative WoE and the associated
quantities that could be derived by the quantitative WoE process	54
Table 8-2. WoE matrix to summarize quantitative risk estimates for four evidence
types or groups (numbered circles) and their weights	56
Table D-1.	Inference based on the sediment quality triad	D-3
Table D-2. Some ways in which the triad decision table might fail	D-4
Table E-1.	Characteristics of causal relationships	E-2
Table E-2.	Potential characteristics of a protective benchmark	E-3
Table E-3.	Characteristics of a contaminant of concern at a contaminated site	E-3
Table E-4.	Potential characteristics of remediation	E-4
iv

-------
FIGURES
Figure S-1. An annotated diagram of the process for WoE to infer qualities by
assembling evidence, weighting it, and weighing the resulting body of
evidence, explained in Sections 4-7	xiii
Figure S-2. An annotated diagram of the process for using WoE to estimate quantities,
explained in Section 8	xiv
Figure 2-1. A framework depicting the relationships among types of environmental
assessments	7
Figure 3-1. A basic framework for all types of environmental assessments	13
Figure 3-2. The basic WoE process	14
Figure 3-3. Conceptual model for a hypothetical ecological risk assessment of the
relationship of phosphorus releases from a vacation home development
to the risk of fish kills in a lake	17
Figure 4-1. An elaboration of the process for assembling evidence, the first step in
WoE	20
Figure 4-2. An exposure-response relationship (black curve) alone is evidence that the
measured chemical can cause the effect	21
Figure 5-1. An elaboration of the process for weighting evidence, the second step in
WoE	27
Figure 6-1. An elaboration of the process for weighing the body of evidence, the third
step in weight of evidence	39
Figure 6-2. Some alternative approaches (a-d) to weighting and weighing evidence
based on different approaches to aggregating evidence	41
Figure 7-1. Steps in abbreviated weight-of-evidence processes: (a) skipping the
weighting step when all evidence is equivalent or (b) weighting a single
piece of evidence when multiple pieces are not available	51
Figure 8-1. Potential steps in a process for using WoE to derive a quantitative result.
Note that the top box of this process diagram encompasses the qualitative
WoE process	53
Figure 9-1. A diagram of the combination of statistical scatter and qualitative weight
to define the confidence that should be afforded an assessment result	57
Figure 10-1. The detailed framework for qualitative WoE	60
v

-------
BOXES
Box 1-1.	Weight versus Weigh, Weighting versus Weighing	2
Box 1-2.	Aspects of Evidence	3
Box 1-3.	Qualities and Qualitative Weight of Evidence	4
Box 1-4.	Potential Benefits and Challenges	5
Box 1-5.	Subjectivity and Objectivity	5
Box 2-1.	Qualitative Questions for Which Evidence is Weighed in Different Types
of Assessments	9
Box 3-1.	Best Practices for Narrative Weight of Evidence	15
Box 3-2.	Standardization of Weight of Evidence	15
Box 3-3.	Weighing Evidence for Adverse Outcome Pathways	18
Box 3-4.	Data Quality Assurance and Weight of Evidence	19
Box 4-1.	Systematic Review	22
Box 5-1.	Relevance of a Piece or Type of Evidence	29
Box 5-2.	Strength of a Piece or Type of Evidence	30
Box 5-3.	Reliability of Evidence	34
Box 6-1.	Collective Properties of Bodies of Evidence. Modified from Norton et al.
(2014)	42
Box B-1.	Combining and Weighting Data in Species Sensitivity Distributions	B-3
Box B-2.	Cleanup Goals by Weight of Evidence Using the Rule of Five	B-4
Box D-1.	Independent Applicability and Weight of Evidence	D-5
vi

-------
AUTHORS, CONTRIBUTORS, AND REVIEWERS
AUTHOR
Glenn W. Suter II
National Center for Environmental Assessment
Office of Research and Development
Cincinnati, OH
TECHNICAL PANEL (Contributors)
Mace G. Barron
National Health and Environmental Effects
Research Laboratory
Office of Research and Development
Gulf Breeze, FL
David Charters
Office of Superfund Remediation and
Technology Innovation
Office of Land and Emergency Management
Edison, NJ
Susan M. Cormier
National Center for Environmental Assessment
Office of Research and Development
Cincinnati, OH
Karen Eisenreich
Office of Pollution Prevention and Toxics
Office of Chemical Safety and Pollution
Prevention
Washington, DC
Kris Garber
Office of Pesticide Programs
Office of Chemical Safety and Pollution
Prevention
Washington, DC
Wade Lehmann
Office of Science and Technology
Office of Water
Washington, DC
Scott Lynn
Office of Science Coordination and Policy
Office of Chemical Safety and Pollution
Prevention
Washington, DC
Chris Sarsony
Office of Air Quality Planning and Standards
Office of Air and Radiation
Research Triangle Park, NC
Glenn W. Suter II (Panel Chairman)
National Center for Environmental Assessment
Office of Research and Development
Cincinnati, OH
Risk Assessment Forum Staff
Lawrence Martin
Office of the Science Advisor
Washington, DC
EPA Peer Reviewers
Daniel A. Axelrad
National Center for Environmental Economics
Office of Policy
Washington, DC
Bruce Duncan
Office of Environmental Assessment
Region 10
Seattle, WA
Wayne Munns
National Health and Environmental Effects
Research Laboratory
Office of Research and Development
Narragansett, RI
Deirdre Murphy
Office of Air Quality Planning and Standards
Office of Air and Radiation
Research Triangle Park, NC
vii

-------
Kathleen Raffaele
Policy Analysis and Regulatory Management
Staff
Office of Land and Emergency Management
Washington, DC
Mary Reiley
Office of Science and Technology
Office of Water
Washington, DC
External Peer Reviewers
Brian W. Brooks
Center for Reservoir and Aquatic Systems
Research
Baylor University
Waco, Texas
Peter M. Chapman
Chapman Environmental Strategies Ltd.
N. Vancouver, British Columbia, Canada
Valery E. Forbes
College of Biological Sciences
University of Minnesota
Saint Paul, Minnesota
Igor Linkov
Risk and Decision Science Focus Area
U.S. Army Engineer Research and Development
Center
Concord, Massachusetts
viii

-------
PREFACE
The impetus for this document includes U.S. Environmental Protection Agency (EPA) policy, outside
advice, and the expressed needs of EPA ecological assessors. Ensuring and maximizing the quality,
objectivity, utility and integrity of information disseminated by the EPA "involves a 'weight-of-evidence'
(WoE) approach that considers all relevant information and its quality, consistent with the level of effort
and complexity of detail appropriate to a particular risk assessment" (U.S. EPA, 2002a). The EPA
Science Advisory Board recommended that the Risk Assessment Forum's Ecological Oversight
Committee consider the development of guidance for using the WoE approach a high priority (SAB,
2012). This document was prepared for EPA ecological assessors who expressed a desire for assistance
in determining which WoE approaches are potentially appropriate for their assessments (U.S. EPA,
2010b).
This document has three goals. The first is to assist ecological assessors who plan to weigh multiple
pieces of scientific evidence. We provide a standard framework consisting of three steps: assemble
evidence, weight evidence, and weigh the body of evidence. We also present a broadly applicable system
for assigning weights by evaluating and scoring the evidence and for combining the weighted evidence to
determine the best-supported hypothesis. Additional material addresses how to deal with bodies of
evidence that are discrepant or anomalous and how to express confidence in inferences based on the
weight of evidence. Finally, we briefly address the weighing of evidence to derive quantities used in or
generated by assessments.
The second goal is to make the weighing of evidence in ecological assessments more formal and
defensible. Weighing evidence is best performed using pre-existing and well-described methods. The
standard framework and default approach to weighing evidence presented in this document are expected
to improve the practice and acceptance of weighing evidence.
The third goal is to make the logic of weighing evidence clearer and more consistent. For example, many
of the EPA's weight-of-evidence analyses are based on Hill's considerations, a 50-year-old list that mixes
characteristics of causal relationships, types of evidence, and properties of evidence. This document
addresses those aspects of evidence separately and uses them systematically.
This document provides methods for weighing ecological evidence. Use of the methods will improve the
consistency and reliability of WoE-based assessments and the defensibility of scientific input to decision
making. This guidance is not meant to be prescriptive, nor does it dictate methods for specific programs
and applications.
Tables, figures, and text boxes throughout the document provide examples of WoE practices. The
specific methods in the examples were designed for particular statutory contexts and might need to be
adapted before they are used in other contexts.
This document was prepared under the auspices of EPA's Risk Assessment Forum. The Risk Assessment
Forum was established to promote scientific consensus on risk assessment issues and to incorporate this
consensus into appropriate risk assessment guidance. To accomplish this purpose, the Forum assembles
experts from throughout EPA in a formal process to study and report on these issues from an
Agency-wide perspective.
ix

-------
ACRONYMS AND ABBREVIATIONS
BBN	Bayesian belief network
EPA	U.S. Environmental Protection Agency
FIFRA	Federal Insecticide, Fungicide, and Rodenticide Act
IRIS	Integrated Risk Information System
ISA	Integrated Science Assessment
LC50	median lethal concentration
LOAEL	lowest-observed-adverse-effect level
MCDA	multi-criteria decision analysis
NE	No Evidence
NOAEL	no-observed-adverse-effect level
OECD	Organization for Economic Cooperation and Development
PRG	preliminary remedial goal
QA	quality assurance
QAPP	Quality Assurance Project Plan
r	correlation coefficient
RQ	risk quotient
SMAV	species mean acute value
SSD	species sensitivity distribution
TMDL	total maximum daily load
WoE	weight of evidence
x

-------
EXECUTIVE SUMMARY
This document presents an approach for using weight of evidence (WoE) in ecological assessments. WoE
integrates multiple pieces of evidence to infer a quality such as causality, teratogenicity, impairment,
protection, or recovery. WoE can also be used to derive quantities such as a benchmark value, a
magnitude of effect or a bioaccumulation factor when multiple estimates are available. WoE is essential
for ecological risk assessment, because diverse laboratory and field information must be assembled,
evaluated and integrated. The EPA has often employed WoE in ecological assessments, but, in the
absence of guidelines, the methods are inconsistent and often informal. Advisory bodies, professional
societies and the assessment science literature all encourage more and better WoE. The proper use of an
explicit WoE methodology can mean the difference between an ecological risk characterization with a
murky rationale and one that is persuasive and consistent with best practices. The system presented here
to improve the practice and acceptance of weighing evidence is broadly applicable for evaluating and
scoring evidence and combining the weighted evidence to determine the best-supported risk
characterization. By helping to make the logic of weighing evidence clearer and more consistent, WoE
will help make ecological assessments more informative and defensible.
Ecological assessments that employ WoE have made numerous important contributions to environmental
protection. The following five examples of high profile ecological assessments demonstrate the value
added through incorporating a WoE methodology. 1) The Bristol Bay, Alaska, watershed assessment to
protect the world's best remaining wild salmon populations, and the people and wildlife that depend on
them, used formal WoE analyses to integrate evidence from laboratory tests, field studies, and effects of
similar mines to estimate risks to salmon of various mining activities. 2) The restoration of thousands of
biologically impaired waters uses a WoE method to determine the cause of impairment
(https://www3.epa.gov/caddis). 3) The assessment to support the definition of waters of the United
States used WoE to show how different types of aquatic systems are connected. 4) The protection of
ambient water quality depends on the ecological assessments that derive aquatic life criteria. The recent
benchmark and proposed criteria method for major ions measured as conductivity are based on effects
observed in the field. They depend on WoE to determine that the relationships in the field are not
confounded and that the same value applies to different areas. 5) The high profile assessment of risks to
pollinators from the neonicotinoid pesticide imidacloprid relies on weighing evidence from laboratory
tests, semi-field tests, crop applications, and reports from beekeepers. WoE has been increasingly
important to ecological assessment and Agency decisions because of the use of eco-epidemiological
approaches.
Although WoE has been used in various types of assessments for programs across the U.S. Environmental
Protection Agency (EPA), approaches have varied. Many of the EPA's WoE applications use a narrative
approach with some guidance provided by lists of considerations, but that approach has been deemed
inadequate by the National Research Council (NRC, 2014). The approach in this document is more
formal. It provides a framework, a set of properties of evidence, a scoring system, tables for presenting
the results of weighting and weighing evidence, a system for organizing evidence in terms of types and the
characteristics they address, and a means of dealing with ambiguous or discrepant results.
The framework and methods in this document also provide an integrated approach to both infer a quality
of interest and estimate an associated quantitative value. For example, WoE can be used to weigh the
evidence that a chemical causes malformations in fish in ambient exposures (qualitative) and to derive the
best estimate of a benchmark concentration for the chemical (quantitative). Unlike qualitative WoE,
quantitative WoE is not an established practice in the EPA. Therefore, quantitative methods are only
briefly discussed.
xi

-------
The frameworks for qualitative and quantitative WoE are presented in Figure S-1 and Figure S-2.
Although the qualitative processes might appear complex, the basic framework is simple: assemble the
evidence, weight the evidence, and weigh the body of evidence. The document explains how to perform
each step (Sections 4-6) and explains when and how the process can be adapted or abbreviated
(Section 7). Figure S-2 previews Section 8, which presents the process for using WoE to estimate
quantities.
The sections preceding the explanation of the process include an introduction that defines and explains
WoE (Section 1), a description of the uses of WoE in the EPA's ecological assessments (Section 2), and
an explanation of how to perform planning and problem formulation for assessments that include WoE
(Section 3). Sections following the process sections describe the complementary relationship between
WoE and uncertainty (Section 9) and the path forward for implementing WoE in the EPA (Section 10).
Section 11 presents the literature cited in support of the document.
Appendix A presents a glossary of terms, as used in this document, related to WoE. Additional
appendices briefly review WoE methods for estimating quantities (Appendix B), selecting models
(Appendix C), and inferring qualities (Appendix D). They provide context for the methods in this
document. Appendix E explains that qualities such as causation have characteristics such as
co-occurrence that can be used to organize bodies of evidence and to provide a basis for determining the
completeness of the body of evidence. Characteristics of causation are established, but characteristics of
other inferred qualities such as protective or impaired are new and have not been tested.
The result of the qualitative WoE process (Figure S-1) is a qualitative conclusion (e.g., the metal mixture
is the most likely cause) and an overall weight that expresses confidence in the conclusion (e.g., weight of
evidence for the conclusion is moderate) or multiple weights (e.g., relevance of the evidence is low, but
strength and reliability are both high).
The result of the quantitative WoE process (Figure S-2) is a weighted quantitative value. In a WoE
analysis, those results could consist of the value and its units (the proposed remedial goal is 2 mg/kg dry
sediment), one or more expressions of weight (the overall weight of evidence is high) and the estimated
statistical scatter of the value (the 95% confidence interval [CI] = ± 0.5 mg/kg dry sediment).
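To make such a result concrete, the following minimal Python sketch combines several estimates of the same quantity using assessor-assigned weights and reports a rough 95% confidence interval. The values, weights, and units are hypothetical, and the weighted mean is only one of the meta-analytic options discussed in Section 8 and Appendix B.

import math

def weighted_estimate(values, weights):
    """Combine quantitative estimates of one parameter.

    values  -- point estimates from different types of evidence
    weights -- assessor-assigned weights (e.g., 1 = low ... 3 = high),
               standing in for the qualitative weights described above
    Returns the weighted mean and a rough 95% confidence interval based
    on the weighted scatter of the estimates.
    """
    total = sum(weights)
    mean = sum(v * w for v, w in zip(values, weights)) / total
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    se = math.sqrt(var / len(values))
    return mean, (mean - 1.96 * se, mean + 1.96 * se)

# Three hypothetical benchmark estimates (mg/kg dry sediment) whose weights
# reflect the relevance, strength, and reliability of each evidence type.
mean, ci = weighted_estimate([1.8, 2.0, 2.4], [3, 2, 1])
print(f"proposed goal = {mean:.2f} mg/kg, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")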
This document provides methods for weighing ecological evidence. Use of the methods will improve the
consistency and reliability of WoE-based assessments and the defensibility of scientific input to decision
making. This guidance is not meant to be prescriptive, nor does it dictate methods for specific programs
and applications.

-------
[Figure S-1 graphic. Assemble Evidence: systematically search the literature; screen out irrelevant and
unacceptable studies; categorize the studies into bins that may be treated as types of evidence; derive
evidence by extracting data, reanalyzing them, and combining them with other information; and design and
conduct new studies to complement existing evidence and create a complete body of evidence. Weight
Evidence: evaluate each piece or type of evidence with respect to three properties (relevance, strength,
and reliability), assign scores, and combine them into the weight of the piece or type of evidence. Weigh
Body of Evidence: integrate the evidence to determine the overall weight of the body of evidence based
on the weights of pieces or types of evidence; interpret bodies of evidence to answer which hypothesis, if
any, is supported by the weight of evidence and with how much weight; and explain ambiguities and
discrepancies, dealing with cases in which the evidence does not support any hypothesis.]
Figure S-1. An annotated diagram of the process for WoE to infer qualities by assembling evidence,
weighting it, and weighing the resulting body of evidence, explained in Sections 4-7.
xiii

-------
[Figure S-2 graphic. Weigh Evidence for the Quality to Be Quantified: use the qualitative WoE framework
to weigh evidence for the quality. Then either Merge Quantitative Evidence (weighted mean or other
meta-analysis to estimate the quantity) or Choose Best Quantitative Evidence (weigh evidence to choose
the best estimate of the quantity). Define Confidence in Quantitative Evidence: integrate the statistical
scatter and qualitative weight of the quantity.]
Figure S-2. An annotated diagram of the process for using WoE to estimate quantities, explained in
Section 8. It begins by weighing evidence to determine the quality of interest (a qualitative WoE as depicted in
Figure S-1), then uses one of two alternative methods to obtain results from multiple quantitative estimates, and
finally defines the confidence in the result.
xiv

-------
1. INTRODUCTION
The process of weighing multiple pieces of evidence is useful for judging the truth of hypotheses,
identifying the best explanation of phenomena, or deriving best models or best estimates. The weight-of-
evidence (WoE) process involves (1) assembling evidence, (2) weighting evidence with respect to
properties, and (3) weighing the integrated body of evidence. This WoE process is embedded in larger
assessment processes, which include planning, problem formulation, analysis, synthesis, and
communicating results.
Scientific investigations of a topic often generate multiple pieces of evidence of various types. For that
reason, WoE is inherent in medicine, engineering, and other applied sciences, including environmental
assessment. Because weighing evidence is so common, the importance of the process is often overlooked
and, as a result, it is generally performed informally and presented as a narrative.
scientific activities, depends on transparency for its credibility. This document encourages more explicit
WoE methods, but it also encourages fitting the WoE process to the assessment to avoid processes that
complicate the assessment without enhancing the results.
The Guidelines for Ecological Risk Assessment did not include WoE (U.S. EPA, 1998). Instead,
guidance was provided for a lines-of-evidence process that is equivalent to a narrative WoE guided by
considerations (see Appendix D.2). Assessment practices now have advanced sufficiently to recommend
WoE for ecological assessments.
1.1. Definitions
WoE is a metaphor adapted from jurisprudence, in which multiple pieces of evidence are metaphorically
placed in the pans of the scales of justice and the side with the greatest weight prevails. The metaphor is
appropriate for any situation in which multiple and diverse pieces of evidence are evaluated to reach a conclusion.
In this document, WoE is defined as an inferential process that assembles, evaluates, and integrates
evidence to perform a technical inference in an assessment. WoE methods have been derived to estimate
a quantity (Appendix B), inform model selection (Appendix C), or reach qualitative conclusions in an
assessment (Appendix D).
As part of that inferential process, WoE characterizes properties of pieces of evidence and of bodies of
evidence. First, WoE determines the degree of support for a hypothesis that a piece or type of evidence
provides (i.e., the weight of a piece of evidence dropped into a pan of the scales). Hence, weights indicate
which pieces and types of evidence make the greatest contribution to the inference. Second, WoE
determines the degree of support for a hypothesis, relative to alternatives, that the available body of
evidence provides (i.e., the accumulated weight in one pan relative to the other). These cumulative
weights not only inform inferences, they also indicate how much confidence assessors have in the
conclusion (Section 9).
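The scales metaphor can be made concrete with a minimal Python sketch, assuming a symbolic scoring system of the kind described in Section 5 (e.g., + to +++). The hypotheses and scores below are hypothetical illustrations; the point is only that pieces of evidence carry weights and that the body of evidence is weighed by accumulating those weights for each hypothesis.

# Map symbolic scores to numbers (+ supports a hypothesis, - weakens it).
SCORES = {"+++": 3, "++": 2, "+": 1, "0": 0, "-": -1, "--": -2, "---": -3}

# Hypothetical weights of pieces of evidence for two candidate causes.
evidence = {
    "carbofuran": ["+++", "++", "+"],
    "avian cholera": ["-", "0", "+"],
}

# Weigh the body of evidence: accumulate the weight in each pan of the scales.
totals = {h: sum(SCORES[s] for s in scores) for h, scores in evidence.items()}
print(totals)                                # {'carbofuran': 6, 'avian cholera': 0}
print("best supported:", max(totals, key=totals.get))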
For example, "WoE was used to determine the likely cause of a bird kill" (the process). "The occurrence
of carbofuran granules in the crops of dead birds is convincing evidence of exposure" (a property of a
piece of evidence). "The body of evidence consistently supports carbofuran as the cause" (a property of
the body of evidence). Further discussion of these terms is presented in Box 1-1. Other terms are defined
in the glossary in Appendix A.
1

-------
1.2. After 50 Years, Advancing Beyond Hill's
Considerations
The touchstone of WoE in health and
environmental assessments is Hill's
considerations (Hill, 1965). Hill recognized that
causal inference requires qualitative WoE
because no amount of quantitative analysis of
associations can suffice (i.e., correlation is not
causation). He listed nine considerations to guide
the process. Hill's considerations include, but do
not distinguish, characteristics of causal
relationships (e.g., temporality), types of
evidence (e.g., experiment), properties of
evidence (e.g., strength), and properties of bodies
of evidence [e.g., consistency (Cormier et al.,
2010)]. EPA has adopted Hill's considerations
for hazard identification in Integrated Risk
Information System (IRIS) assessments and
Integrated Science Assessments (ISAs).1 Hill's
considerations were designed for determining the
causes of observed epidemiological effects,
however, and not for applications in which
effects have not been observed. In particular,
Hill presumed that co-occurrence of the putative
causal agent and the effect of interest (smoking
and lung cancer, in his case) was clearly documented, and his goal was to demonstrate that the
relationship is causal. We, however, must also address cases for which a clear real-world association has
not been demonstrated.
The WoE system presented here treats explanatory implication of evidence, types of evidence, properties
of evidence, and properties of bodies of evidence as separate aspects of WoE (Box 1-2). Because each
serves a different function in the WoE process, they are not treated as equivalent considerations.
Distinguishing them makes clear that Hill's list lacks essential characteristics, types, and properties of
evidence. These aspects are the basis for weighting and weighing evidence for causation and other
qualities as presented in Sections 5-7.
Box 1-1. Weight versus Weigh, Weighting
versus Weighing
The terms weight, weigh, weighting, and weighing,
which are used throughout this document, can be
confusing. This ambiguity results in part from the
fact that weight is both a noun and a verb.
To weight a piece of evidence is to assign
importance to it (a process designated by a verb).
As a result of that action, the evidence has weight
(a result designated by a noun). The weighting
process formalizes the evaluation of evidence, by
assigning a descriptor or score.
To weigh a body of evidence is to combine the
weights assigned to each of the pieces (a process
designated by a verb). As a result of weighing, the
body of evidence has weight (a result designated
by a noun). The weighing process formalizes the
integration of evidence by determining the weight
of the body of evidence.
1 IRIS is a program that evaluates information on health effects of environmental contaminants to determine
what hazards they pose to humans and to develop benchmark values (http://www.epa.gov/iris/). ISAs
assess the scientific evidence for health and welfare effects of criteria air pollutants
(http://www.epa.gov/ncea/isa/).
2

-------
Box 1-2. Aspects of Evidence
This approach to weighing evidence distinguishes aspects of evidence in the WoE process that may not be
clearly distinguished in prior approaches. They are introduced here and explained in more detail as they
appear in the WoE process.
Piece of evidence: A piece is the evidence derived from a particular experiment or observational study. A
piece of evidence is the minimum unit that might be weighted.
Type of evidence: A type is a category of evidence based on the nature of the study from which the evidence
is derived. Examples include single-species, laboratory toxicity tests; microcosms; mesocosms; biomarker
surveys; and community surveys. Types are used to organize and combine pieces of evidence to simplify
weighting and weighing.
Explanatory implication of evidence: An explanation expresses the logical relationships between evidence and
inference. For example, laboratory toxicity tests may provide evidence of the sufficiency of an exposure to
cause the effect, and community surveys may provide evidence of co-occurrence of the effect and putative
cause.
Body of evidence: A body of evidence is all of the evidence that applies to a particular hypothesis.
Property of evidence: Properties are the aspects of evidence that determine how much weight (influence) it
should have. This approach uses three general properties: relevance, strength, and reliability.
Property of bodies of evidence: Bodies of evidence are weighted with respect to collective properties
including number, coherence, diversity, and absence of bias.
1.3. Scope
This document is intended to inform ecological assessments. Human health assessments are mentioned
and cited only as background and for purposes of comparison. Weighing multiple types of evidence has
been performed more widely in ecological assessments than in human health assessments (Krimsky,
2005; Suter, 1993). Although ecological and human health assessments can be performed similarly, the
types of ecological assessments that have driven the development of WoE tend to differ in important
respects from human health assessments. In particular, types of ecological studies that contribute
evidence to Superfund and Clean Water Act assessments include effluent and ambient media toxicity
tests, toxicity identification evaluation, in situ tests, biotic community surveys, and demographic models.
In addition, ecological assessments must weigh evidence related to multiple endpoint species and levels
of organization, which increases the need for formal WoE methods.
This document presents a framework for application of WoE in various assessment contexts. WoE is an
assessment tool, just as modeling and statistical analysis are assessment tools. Therefore, the framework
for WoE is not a substitute for assessment frameworks (see Section 3).
WoE can be carried through the entire assessment process or be applied to an inference that is only a
component of the assessment. WoE can be limited to hazard identification (e.g., Does ozone reduce plant
growth at ambient levels?), the determination of an assumption (e.g., Should a bioaccumulation factor be
used?), or the estimation of a parameter (e.g., weighing field and laboratory estimates of a chemical's
half-life in water). In contrast, a WoE process can carry through an entire condition or causal assessment.
3

-------
This document cites literature that is directly relevant to explaining this approach to WoE but does not
attempt to cover the history of WoE in environmental assessments. Reviews of WoE approaches are
available elsewhere (Rhomberg et al., 2013; Linkov et al., 2009; Pope et al., 2007; Krimsky, 2005; Weed,
2005).
This document focuses on WoE for qualities such as causality and impairment (Box 1-3). WoE methods
to derive quantities and to choose models are discussed in Appendix B and Appendix C, and the use of
qualitative WoE to enhance the derivation of quantitative results is discussed in Section 8. Qualitative
WoE is emphasized because it is more common in ecological assessments than is quantitative WoE
(Appendix D).
1.4. Benefits and Challenges of Weight of
Evidence
Although this document recommends the use of
WoE, it recognizes that WoE can have limitations
in practice. We assume that assessors will
consider the potential benefits and challenges of
WoE (Box 1-4) relative to the requirements of
their assessment before proceeding to implement
the approach recommended here.
WoE techniques inevitably involve subjective
expert judgments. Such judgments can cause
WoE to be criticized for being biased or arbitrary.
However, the objections to subjectivity can be
diminished by some good practices.
1.	Prior to the assessment, specify the WoE
method in as much detail as is practical to
minimize the need for improvised
judgments about methods or assumptions
during the assessment.
2.	Use the standard judgments of a program
expressed as standard criteria for
assigning or integrating weights. For
example, tests performed using an
Organization for Economic Cooperation
and Development (OECD) protocol with good laboratory practices might be given a standard
score of +++ for reliability (a lookup of this kind is sketched after this list).
3.	Be objective, in the sense of being unbiased, by self-auditing (Box 1-5).
4.	Work in groups and attempt to achieve consensus concerning judgments.
5.	When making judgments, try to represent the opinions that knowledgeable and unbiased members
of the scientific community would have, given the evidence and the inference to be made.
6.	Find assessors with sufficient relevant knowledge and experience to qualify as experts.
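A minimal sketch of the second practice, assuming a program has published standard criteria for assigning scores. The protocols and scores in this Python lookup are hypothetical illustrations, not EPA-endorsed values.

# Hypothetical program-level standard scores, so that the same kind of study
# always receives the same reliability weight regardless of the assessor.
STANDARD_RELIABILITY = {
    ("OECD protocol", "GLP"): "+++",     # standard protocol, good lab practices
    ("OECD protocol", "non-GLP"): "++",
    ("ad hoc method", "non-GLP"): "+",
}

def reliability_score(protocol: str, practice: str) -> str:
    # Fall back to the lowest standard score for unrecognized study designs.
    return STANDARD_RELIABILITY.get((protocol, practice), "+")

print(reliability_score("OECD protocol", "GLP"))   # +++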
Box 1-3. Qualities and Qualitative Weight of
Evidence
Qualitative WoE: A qualitative WoE is an
assessment for which the endpoint is a quality
such as a condition, mode of action, source, or
type of effect. It typically includes both
quantitative and qualitative evidence, but the
result is not a numerical quantity.
Quality of Interest: The quality for which evidence
is weighed is most often causality, but it can be
any quality of interest. Examples of qualities that
can be or have been inferred by WoE include:
•	Coxie Creek is impaired.
•	Acid mine drainage causes the impairment.
•	Polychlorinated biphenyls are bioaccumulative.
•	Selenium is teratogenic in fish.
Quality of Evidence: This document does not
recommend weighting the quality of evidence.
Instead, specific properties and sub-properties of
evidence are weighted. This is because the term
quality has been found to be too broad to be
useful in the weighing of evidence (Higgins and
Green, 2011). However, quality is used in a broad
sense that is consistent with U.S. government data
quality policy (U.S. EPA, 2002b).
4

-------
Box 1-4. Potential Benefits and Challenges
Inference by weighing multiple pieces of evidence has advocates who recognize its potential benefits, but
challenging aspects of its application have led some assessors to oppose its use.
Potential Benefits of WoE
The primary potential benefit of WoE is the greater confidence in results obtained by considering all relevant
and reliable evidence. For example, it is not uncommon for causal assessments to consider only statistical
evidence of co-occurrence of an effect and its potential causes. This approach provides much less confidence
than one that also considers evidence of temporal sequence, interaction, and other characteristics of causal
relationships. In many cases, no single type of evidence is sufficient to reach a conclusion. Ecological
assessments of polluted ecosystems commonly benefit from considering evidence from laboratory toxicity
tests of chemicals, tests of effluents, or ambient media and biological surveys. This benefit of WoE occurs
because the body of relevant and scientifically credible evidence provides a more complete picture than does
any piece of evidence alone.
A second potential benefit is an increase in the defensibility of an assessment. An explicit WoE method
demonstrates that all relevant evidence has been considered and no credible evidence, either in support of
or contrary to a hypothesis, has been arbitrarily dismissed. Without an explicit process planned in advance,
reviewers might criticize or even dismiss an assessment for excluding data or evidence that they believe
should have been given more consideration.
Defensibility of assessments also might be increased by the transparency of the processes it uses for
inference. A formal WoE method enables reviewers and readers to understand and critique the processes of
assembling, weighting, and weighing the evidence.
Potential Challenges to WoE
A formal WoE process can require considerable time and effort, which could lead to performance of fewer
assessments or delayed decisions. Completing and documenting a formal systematic literature review or
implementing an evidence scoring system might not be resource effective if the same conclusion can be
obtained with a less resource-intensive assessment. The solution to this challenge is to tailor the WoE
method to the assessment.
Box 1-5. Subjectivity and Objectivity
WoE-based assessments rely on subjective professional judgments because there is no other means of
weighing a diverse body of evidence from models, laboratory tests, field tests, field surveys, and other
information sources to identify the best-supported hypothesis. Although inevitable, subjectivity often is
considered undesirable in assessments as in other scientific contexts.
Objective properties, such as the number of fish species in a stream, are properties external to the
investigator that can be confirmed by any other investigator. Subjective properties such as the reliability of
biological survey results are opinions of the investigator evaluating the survey, not inherent properties. The
extent to which assessors agree in their judgment of reliability is due to shared preferences, not a
measurable attribute of the surveyed community.
Subjective inferences can be objective in another sense, which is defined in federal policy (U.S. EPA, 2002b):
Any inference can be considered objective if it is performed in a disinterested—and therefore
impartial—manner. The guidance in this document aims to create circumstances for which an inference is
minimally influenced by any personal or institutional biases and in which any systematic bias can be
detected.
5

-------
Performing laboratory or field studies to generate data for multiple types of evidence (e.g., chemistry,
toxicity, and biology) for a WoE analysis—as opposed to simply summarizing data from available
literature—is more resource intensive than generating a single type of evidence (e.g., only chemistry).
The value of new information to the assessment should be considered during the planning stage to obtain
sufficient, but not surplus, evidence (Keisler et al., 2014).
Complex WoE methods can obscure rather than clarify the derivation of results, particularly if the method
is not clearly presented. A reader is likely to dismiss the results if the assessment is incomprehensible.
Clear and consistent methods can reduce this problem.
Finally, a fundamental objection to WoE is that it might mix scientifically less robust evidence with
highly reliable evidence. In some cases, the body of at least minimally relevant and reliable evidence
might not contribute information beyond that supplied by a single best piece of evidence. This issue is
addressed by including explicit steps for screening and weighting evidence rather than treating all
evidence as equal.
These objections can be minimized by careful planning and by clearly presenting the WoE process. The
WoE methods should be appropriate to the assessment. The amount of detail in evaluating and scoring
evidence should be appropriate to the amount and diversity of the evidence, the time and resources
available, and the degree to which decision makers and stakeholders wish to engage in a WoE process.
The amount of detail can also be based on the degree to which the results of an assessment are potentially
contentious. If, for example, an informal review of the evidence clearly shows a site is highly toxic and
the obvious decision will be to remediate, a more detailed WoE process might waste resources, prolong
the contamination, and confuse the issue. In situations for which the decision is obvious and could be
urgent, a protracted WoE process is counterproductive.
6

-------
2. APPLICATIONS OF WEIGHT OF EVIDENCE
The uses of WoE in ecological assessments are diverse, in part because ecological assessments are
conducted in diverse regulatory contexts that require different assessment methods. In addition to asking
assessors for predictions of potential adverse outcomes of proposed actions (i.e., to perform risk
assessments), decision makers ask assessors to determine conditions (i.e., to perform condition
assessments), determine the likely causes of adverse conditions (i.e., to perform causal assessments), and
determine the actual outcomes of actions [i.e., to perform outcome assessments (U.S. EPA, 2010b;
Cormier and Suter, 2008)]. Each type of assessment has its own logic and methods and, therefore, weighs
evidence somewhat differently. Figure 2-1 illustrates a way to organize such assessments. Assessments
can be initiated by the results of a prior assessment, and WoE results from prior assessments can inform
subsequent assessments. For example, if a condition assessment identifies a biological impairment, a
causal assessment should follow to identify the cause and source of the causal agent, a predictive risk
assessment should follow to determine the remedy, and an outcome assessment should determine whether
the remedy has sufficiently improved conditions. Assessments also can be initiated by external demands.
For example, a predictive risk assessment might be prompted by an application to market a new pesticide.
[Figure 2-1 graphic: under Problem Detection, a Condition Assessment leads to Causal Assessments;
under Problem Resolution, Predictive Assessments lead to an Outcome Assessment and, ultimately,
Resolution.]
Figure 2-1. A framework depicting the relationships among types of environmental assessments. Modified
from U.S. EPA (2010b); Cormier and Suter (2008).
7

-------
All types of ecological assessments can weigh evidence quantitatively to combine multiple numerical
values or to choose a model (see Appendix B and Appendix C). This document, however, emphasizes
using WoE for integrating evidence to derive a qualitative conclusion that supports a decision. Examples
of qualitative conclusions are that the biotic community of a stream is impaired, that selenium causes cranial
deformities, that low pH caused the impairment, or that coal ash is the source of the arsenic. To determine what
qualities might require WoE, it can be helpful to consider what qualitative questions are answered by the
various types of environmental assessments (Box 2-1).
Environmental problem solving often requires sequences of inferences to derive qualitative results and
then quantitative results. For example, an ISA, which qualitatively weighs evidence to determine whether
relevant concentrations of a pollutant cause particular effects, precedes the development of National
Ambient Air Quality Standards. A qualitative weighing of evidence to determine whether the
contaminated medium is causing significant adverse impacts precedes development of quantitative
cleanup goals for a contaminated site. A qualitative weighing of evidence to determine whether a water
pollutant is causing biological impairment might precede the development of a quantitative total maximum
daily load (TMDL) for the pollutant. These qualitative assessments could be treated as a separate product
from the quantitative assessment (e.g., a qualitative ISA before a quantitative Welfare, Risk and Exposure
Assessment). Alternatively, the qualitative assessment might be nested within the quantitative assessment
as part of the problem formulation (e.g., as a hazard identification).
This section describes the various types of assessments, their application by EPA, and the current and
potential roles of WoE in performing the assessments.
2.1. Contaminated Sites
Contaminated sites provide a wide scope for applying WoE because the contaminated and potentially
biologically impaired conditions allow for the generation of diverse evidence from observation, sampling,
analysis, and testing. As a result, most of the relevant methods and literature on ecological WoE address
contaminated sediments, soils, and waters (Appendix D). In the EPA, the primary venue for
contaminated site assessments is Superfund sites.
The goal of contaminated site assessments is to determine whether contaminants pose an unacceptable
ecological risk and, if so, what should be done to reduce it. When assessing a site, specific impaired areas
can be identified without WoE by comparing contaminant concentrations to screening benchmarks.
Those benchmark concentrations also could serve in some cases as the remedial goals, allowing for
completion of the assessment without weighing evidence. For Superfund sites, however, multiple types
of site-specific evidence are typically collected, including results of toxicity tests of contaminated media,
biological surveys, or analysis of biological samples for biomarkers or body burdens of contaminants
(Luftig, 1999). Such bodies of evidence can be weighed in a merged condition and causal assessment to
answer the question: Is an unacceptable risk associated with the site contaminants? Such a qualitative
assessment is the topic of the approaches for the sediment quality triad and the Massachusetts and Oak
Ridge National Laboratory WoE methods described in Appendices D.3, D.5, and D.7. Such
assessments can also be performed using the approach presented in the following sections.
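As a toy illustration of triad-style weighing (cf. Table D-1), the Python sketch below maps three lines of site evidence to a conclusion. The decision logic is deliberately simplified; the published sediment quality triad decision tables, and their failure modes (Table D-2), are more nuanced.

def triad_conclusion(chemistry_elevated: bool, toxicity_observed: bool,
                     biota_altered: bool) -> str:
    """Toy sediment-quality-triad logic; each argument is one line of evidence."""
    lines = (chemistry_elevated, toxicity_observed, biota_altered)
    if all(lines):
        return "strong evidence of contaminant-induced degradation"
    if not any(lines):
        return "strong evidence of no contaminant-induced degradation"
    return "mixed evidence; explain the discrepancy before concluding"

print(triad_conclusion(True, True, False))   # mixed evidence; explain ...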
8

-------
Box 2-1. Qualitative Questions for Which Evidence is Weighed in Different Types of Assessments
Causal Assessment
Causal assessments identify causes of observed effects and they take one of two general forms.
What causes general effect y? This is the general causal question. Examples: What causes colony collapse
disorder in honeybees? What causes coral bleaching?
What causes specific effect y? This is the specific causal question. Examples: What caused low invertebrate
species richness in the Coal River? What caused the 1980s decline in San Joaquin kit foxes in Elk Hills?
Risk (Predictive) Assessment
Most EPA predictive assessments are risk assessments. They begin with a problem formulation that includes
a causal question regarding what hazard is associated with an agent and could be assessed.
Does agent x cause general effect y? This is a general hazard identification question for assessments like
setting benchmarks or permitting use of a chemical. Examples: Does smoking cause lung cancer? Does
atrazine cause deformities in frogs?
Might agent x cause effect y? This is a hazard identification question asked in risk assessments for specific
actions. Examples: Might a tailings spill at the proposed mine cause a loss of salmon spawning? Might
permitting a new use of atrazine result in an increase in frequency of frog deformities?
Is agent x causing effect y? This is a hazard identification question asked in risk assessments for an agent
under existing conditions. Examples: Are the sediment contaminants reducing invertebrate species richness
or abundance? Is acid deposition reducing forest production?
These causal questions can be answered by a separate causal assessment prior to the risk assessment or in
the problem formulation for the risk assessment. Without hazard identification, defining the endpoint entity
and attribute has no basis beyond using default endpoints such as fish mortality.
Other qualitative questions that might involve WoE also are answered in risk assessments. For example:
•	Should dietary exposure be considered?
•	Will the endangered species occur on the site in the future?
•	Is the site large enough to support an assessment population of the species of concern?
•	Is long-range transport a significant source?
•	What are the contaminants of concern?
•	Is the chemical a persistent organic pollutant?
Condition Assessment
Qualitative questions concerning conditions, such as the following, are relatively straightforward to assess
when standard definitions of conditions are provided.
•	Is the ecosystem impaired?
•	Is the species endangered?
•	Is the water contaminated?
However, framing the question could be conceptually difficult, particularly with respect to the standard of
comparison: natural conditions, reference conditions, historical conditions, or defined standards.
Outcome Assessment
Qualitative questions concerning outcomes are like condition questions but include issues regarding the
efficacy of remedial or regulatory actions.
•	Is bioremediation sufficiently reducing exposure of endpoint species?
•	Is the liner leaking?
•	Is the site acceptable for recreational fishing?
9

-------
Many Superfund assessments treat pieces of evidence as independent lines of evidence without weighing
them (Integral Consulting Inc., 2013). Causal assessments performed by WoE for contaminated sites are
illustrated by the Elk Hills and California Gulch cases (U.S. EPA, 2011c, 2009a). Such WoE assessments
can be sufficient to complete the investigation, provided the decision logic is as follows: If the sediment
or soil presents unacceptable risk due to waste contamination, remediate it (e.g., cap, remove, or biotreat
it). For Superfund sites, however, a separate step defines chemical-specific preliminary remedial goals
(PRGs). Either site-specific values or standard benchmark values might be used as PRGs for the areas
identified as impaired by WoE. Site-specific PRGs can be derived from tests of site media or field
exposure-response relationships used in the qualitative WoE to identify waste-impaired areas. However
they are derived, PRGs represent acceptable risks, and are typically used in selecting a final remedy.
2.2.	Environmental Condition
Although the condition of all ecosystems is of concern, the Clean Water Act is unique among U.S.
antipollution laws in requiring that all sites (i.e., all state waters) be assessed to determine if they are
impaired. The result is each state's 303(d) lists of impaired waters. Most listings are for exceedance of
water quality standards, including concentrations of chemicals, levels of biological pollutants (e.g., fecal
coliform counts), and physical properties (e.g., temperature). In addition, ecological properties can serve
to identify impaired waters based on biological or narrative criteria. Commonly, multiple types of
biological data, termed metrics, are combined into an index (Blocksom and Johnson, 2009; U.S. EPA,
1996). The regulations indicate that states should evaluate "all existing and readily available
information" in developing their 303(d) lists (40 CFR §130.7(b)(5)), which suggests a WoE approach or
other approaches (such as independent applicability) that consider all information (Appendix D, Box D-1).
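As an illustration of how metrics are combined into an index, the Python sketch below scales each metric between its worst- and best-expected values and sums the scores. The metric names, expectations, and impairment threshold are hypothetical, not values from the cited methods.

def score_metric(value, worst, best):
    """Scale a raw metric to 0-10 between worst- and best-expected values."""
    frac = (value - worst) / (best - worst)
    return 10 * min(max(frac, 0.0), 1.0)

metrics = {
    # metric: (observed value, worst expected, best/reference expected)
    "EPT taxa richness": (8, 0, 20),
    "% tolerant individuals": (55, 100, 0),   # lower is better, so reversed
    "total taxa richness": (22, 5, 35),
}

index = sum(score_metric(*v) for v in metrics.values())
print(f"index = {index:.1f} of {10 * len(metrics)}")
if index < 15:   # hypothetical impairment threshold
    print("scores below threshold: candidate for the 303(d) list")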
If a water body is declared impaired based on biological effects, the cause should be identified. Guidance
for using WoE to determine the cause of biological impairment was developed for the Office of Water
(U.S. EPA, 2000), and a web-based support system was developed to aid in its application
(http://www.epa.gov/caddis/). This system is the principal model for the WoE approach presented in this
document.
Once the cause of water body impairment has been determined, the sources can be identified so that
TMDLs can be developed (Clean Water Act §§ 130.2(f)-(i) and 130.7(c)). This is similar to the practice
of determining sources at contaminated sites is known as environmental forensics because it often
involves establishing legal liability (Murphy and Morrison. 2002). Its methods include environmental
fingerprinting, isotope analyses, tracer studies, and transport modeling, and it often uses WoE. TMDLs
are developed and implemented to eliminate the impairment by apportioning pollutant loading among the
sources. This step typically does not involve WoE. The TMDL process often includes evaluation of the
results, which can lead to removal of the stream reach from the 303(d) list. This outcome assessment can
rely on biological endpoints and could involve WoE.
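A TMDL is conventionally expressed as the sum of wasteload allocations for point sources, load allocations for nonpoint sources, and a margin of safety. The Python sketch below apportions a hypothetical loading capacity among hypothetical sources; all names and numbers are illustrative.

# Apportion a water body's loading capacity among sources (all values made up).
loading_capacity = 100.0            # kg/day the water body can assimilate
mos = 0.10 * loading_capacity       # explicit 10% margin of safety
allocatable = loading_capacity - mos

point_shares = {"municipal outfall": 0.5, "industrial outfall": 0.2}
nonpoint_shares = {"agricultural runoff": 0.3}

wla = {s: f * allocatable for s, f in point_shares.items()}    # wasteload allocations
la = {s: f * allocatable for s, f in nonpoint_shares.items()}  # load allocations
assert abs(sum(wla.values()) + sum(la.values()) + mos - loading_capacity) < 1e-9
print(wla, la, {"margin of safety": mos})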
2.3.	Existing Pesticides and Industrial Chemicals
The reregistration of pesticides provides an opportunity to collect multiple pieces and types of evidence
related to pesticide use, as well as conventional laboratory data, generated from the registrant's
data package and from literature reviews by EPA. This body of evidence, which is weighed in a WoE
narrative, can include results from laboratory studies of chemistry and toxicity, mesocosms, field
experiments, surveys, and incident reports; pesticide monitoring and use data; and transport, fate, and
exposure modeling.
10

-------
Risks to animals and plants from pesticides are assessed using a tiered approach. In general, most
assessments are focused on Tier 1, intended to estimate conservative environmental exposures. The risks
associated with those exposures are assessed using risk quotients (RQs), which are predicted exposure
levels divided by effect benchmark levels. RQs are based on the most sensitive endpoints available for
survival, growth, or reproduction. When RQs exceed levels of concern, risks are characterized using a
narrative WoE analysis. Although the process for deriving RQs is well established, the WoE analysis
depends on the nature of the available data and the assessed chemical and its use patterns. Examples of
evidence analyzed to evaluate risk conclusions include comparing estimates of exposure from models or
monitoring data to available toxicity endpoints, evaluating whether laboratory fate studies are consistent
with field dissipation studies, and using species sensitivity distributions to evaluate impacts of
conservative assumptions on risk conclusions.
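The quotient calculation itself is simple, as the following Python sketch shows. The exposure estimate, benchmark, and level of concern are hypothetical; actual levels of concern depend on the taxa and regulatory context.

def risk_quotient(estimated_exposure: float, effect_benchmark: float) -> float:
    """Predicted exposure level divided by an effect benchmark level."""
    return estimated_exposure / effect_benchmark

LEVEL_OF_CONCERN = 0.5   # hypothetical value for this illustration
rq = risk_quotient(estimated_exposure=12.0, effect_benchmark=20.0)   # e.g., ug/L
print(f"RQ = {rq:.2f}")
if rq > LEVEL_OF_CONCERN:
    print("RQ exceeds the level of concern: characterize risk in a WoE narrative")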
Assessments of existing chemicals under the Toxic Substances Control Act (TSCA) can weigh diverse
bodies of evidence to determine whether a chemical substance presents or could present unreasonable risk
of injury to health or the environment. WoE is used when conducting risk evaluations under TSCA
Section 6(b) and making risk management decisions for existing chemical substances. Existing chemical
assessments typically rely on a combination of predictive modeling and other computational approaches, empirical laboratory or field data specific to the chemical being assessed, and empirical laboratory or field data from analogous chemicals. For example, the body of evidence can include empirical or modeled data on use and release and on transport and fate; physical/chemical properties of the chemical in question; exposure data that include field monitoring and modeled data; and hazard data from laboratory toxicity tests or modeling. This information is used to assess risk and, much like the registration of pesticides, RQs are developed and the analysis depends on the available data. The assessments rely on a set of seven study quality and selection considerations (U.S. EPA, 2015b). Implementation of the new
(signed June 22, 2016) Frank R. Lautenberg Chemical Safety for the 21st Century Act may change these
practices. The act expressly uses the term "the weight of the scientific evidence."
2.4. New Pesticides and Industrial Chemicals
Registration of new pesticides is based on a large and diverse body of evidence (relative to other new
chemicals) pursuant to the Federal Insecticide, Fungicide, and Rodenticide Act (FIFRA) pesticide
registration data requirements (http://www2.epa.gov/pesticide-registration/data-requirements). Similar to
risk assessments for existing chemicals, a tiered approach is used. The major difference between risk
assessments for new versus existing pesticides is generally the amount and types of available data. For
new chemicals, the data required under FIFRA generally are available. Incident reports, monitoring data,
and toxicity data for nonstandard test species, however, are rarely available for new chemicals.
The available evidence is also weighed when reviewing, assessing, and regulating new industrial
chemicals under TSCA before they enter commerce. Assessments of new chemicals rely on a combination of predictive modeling and other computational approaches, empirical data from analogous substances and chemical categories, and empirical data the manufacturer submits. As
with existing chemicals, the implementation of the Lautenberg Act is likely to increase the importance of
WoE for new chemicals.
Review of chemicals for potential endocrine-disruptive mechanisms of action is required for all pesticide
active ingredients, all food-use pesticide inert ingredients, and some industrial chemicals. A WoE
approach based on a set of considerations has been used for that purpose (U.S. EPA, 2012b).
2.5.	Benchmark Derivation
The derivation of numerical criteria, standards, remedial goals, screening levels, and other benchmark
concentrations or doses involves performing a type of risk assessment. Qualitative WoE can be applied
during problem formulation to determine what hazards should be considered in the assessment. For
example, ISAs—which identify hazards that may be assessed in Welfare Risk and Exposure Assessments
for National Ambient Air Quality Standards—use WoE to determine what effects on welfare occur in the
relevant range of air pollutant levels (U.S. EPA, 2014b, 2013). If different methods or data sets are used
to derive benchmarks for different exposure routes or modes of action, WoE can be used to determine
which approach to apply, as in the development of IRIS values (human health benchmarks) for
carcinogens versus noncarcinogens (U.S. EPA, 2005b). Finally, if a benchmark is derived using field
data, WoE can be used to determine whether the evidence suggests the relationship is causal or
confounded (U.S. EPA, 2011b).
2.6.	Proposed Discharges
WoE is seldom an option when assessing risks from proposed new discharges because the evidence is limited to predicted release rates of constituent chemicals, which are used in transport models, and the resulting estimated concentrations are compared to benchmark concentrations. Tests of synthetic effluents or analogies to
existing effluents that are expected to be similar to the proposed discharge, however, could create bodies
of evidence that could be weighed.
2.7.	Special Purpose Assessments
Some assessments are performed for particular nonroutine decisions. For example, the EPA performed a
watershed risk assessment to inform potential decisions concerning the mining of metals in the Bristol
Bay, Alaska, watershed (U.S. EPA, 2014a). The evidence used was diverse and included analogies to
prior metal mines and pipelines. The evidence was therefore weighted and weighed using a method like
the one in this document. Similarly, EPA borrowed from the WoE method developed for causal
assessment of impaired waters to assess the connectivity of streams and wetlands to support the definition
of waters of the United States (U.S. EPA, 2015a). These examples show how one-of-a-kind assessments
can adapt the general approach presented in this document.
3. PROCESSES FOR WEIGHING EVIDENCE
3.1. An Introduction to the Weight of Evidence Process
The weighing of evidence is a tool that can be used for any type of assessment, at any point in an
assessment process or throughout the entire assessment (Section 2; Figure 3-1). In a risk assessment,
WoE is used in the problem formulation phase to identify the hazardous properties of the agent being
assessed and the appropriate assessment endpoints. In particular, evidence can be weighed to determine
whether it supports the hypothesis that a chemical causes some particular effect, such as failure to hatch,
effects on a particular taxon, or effects by a particular route of exposure or mode of action (Meek et al., 2014). In the analysis phase, WoE can be used to select assumptions, estimate parameters, and develop models (Section 8, Appendix B, and Appendix C). In the synthesis phase, the models and parameter
estimates are used to characterize risks, determine most likely causes, or determine whether an ecosystem
is impaired. When multiple exposure-response relationships, spatial/temporal associations, or other
relationships have been developed, WoE is used to combine them or to determine which relationship is
the best (weightiest). Other types of assessments use WoE at different points in their frameworks
(Section 2). In a causal assessment, WoE is carried through all analysis and synthesis phases (Norton et al., 2014).
[Figure: Initiator → Planning → Analysis → Synthesis → Decision/Action, with WoE arrows contributing at individual steps and one WoE arrow spanning the entire sequence.]
Figure 3-1. A basic framework for all types of environmental assessments (U.S. EPA, 2010b; Cormier and Suter, 2008). Weight of evidence can contribute to one or more individual steps (WoE on the left) or can be the basis for the entire assessment (WoE on the right).
However used, a formal WoE involves the same basic process (Figure 3-2). First, the evidence is
assembled. This step involves finding published studies or performing new studies, determining whether
their results are acceptably relevant and reliable, and extracting and analyzing data to generate useful
evidence. Second, the pieces of evidence are weighted (i.e., evaluated and scored) to determine their
influence on the results. Third, the body of evidence is weighed (i.e., weights are integrated and the body
of evidence is interpreted) to arrive at a result.
[Figure: Assemble Evidence → Weight Evidence → Weigh Body of Evidence.]
Figure 3-2. The basic WoE process (Suter and Cormier, 2011). The steps are elaborated in Sections 4-6, respectively, in this document.
WoE processes vary among applications, but these three steps are fundamental. First, you must have
evidence. The final step, combining and interpreting (i.e., weighing) the evidence, is necessary to make
the body of evidence inform the inference. The middle step, weighting, is also essential in that different
pieces and types of evidence seldom have the same influence in an inference. However, the explicit assignment of weights is often skipped. This should be done only if all the evidence is found to be equally relevant and reliable (Section 7).
In many cases, multiple pieces of evidence are not explicitly weighed. Instead, evidence is informally
weighed in a narrative (Box 3-1).
WoE can be applied to derive qualitative or quantitative results. It might be used to assess qualities such
as causality, impairment, recovery, or occurrence of a specific effect or used to estimate quantitative
results such as a benchmark value, a model parameter value, or the magnitude of effects. Inferences to
quantities begin with quantitative evidence and apply quantitative methods (Section 8 and Appendix B).
Inferences to qualities, however, use all relevant evidence (qualitative and quantitative) but apply
qualitative inferential methods (Sections 5-7 and Appendix D). Even in WoE methods such as the
Massachusetts system that use numerical scores, those scores are used to document qualitative judgments
to reach a qualitative result (Menzie et al., 1996). The same qualitative methods can be used to make
qualitative judgments about quantitative results (Section 8). For example, they might be used to judge the
reliability of a quantitative effect estimate (e.g., 26 km of stream would be impaired and that result is
highly reliable).
The general approach introduced in this section is intended to be flexible and broadly applicable. It is
intended to provide more rigor and transparency than narrative WoE but less complexity and more
flexibility than numerical systems and indices (Appendix D). The approach also is intended to merge
inferences to qualitative and quantitative results in a useful manner. It can be applied to all types of
environmental assessments and to both data-rich and data-poor cases.
Box 3-1. Best Practices for Narrative Weight of Evidence
Although this document explains how to use a formal method for WoE with explicit weighting and
weighing, we recognize not everyone will follow this approach. Some assessors will continue to use a
traditional narrative approach. In such cases, WoE narratives can benefit from some good practices.
1.	When using evidence from the literature, design and carry out a literature search that will find the
information needed in an unbiased manner (Section 4). Define not only the search terms but also
the criteria for screening the results to obtain relevant and reliable evidence.
2.	Even in a narrative approach, provide a table summarizing the selected evidence and indicate what
was excluded and why.
3.	Avoid narrative reviews that describe one piece of evidence after another and then present a
conclusion. Instead, logically structure the narrative. At a minimum, present the evidence for a
hypothesis and the evidence against it, and explain why one side has greater weight.
4.	Make the narrative organization clear. Use lists and subsection headings to help the reader
understand the logical structure.
5.	Present the results clearly. Which hypothesis is best supported by the WoE, and how much
confidence can be placed in the conclusion?
6.	Express the degree of confidence in the conclusion.
3.2. Planning the Assessment to Use Weight of Evidence
All environmental assessments begin with a
planning phase that defines how the assessment
will be conducted. In risk assessments, this phase
includes two steps termed planning and problem
formulation (U.S. EPA, 1998). Assessment plans identify the evidence needs, the methods that will be used to generate the evidence, and the methods for weighting and weighing multiple pieces and types of evidence. Specifying the WoE method
in advance enhances transparency and
defensibility. In particular, a previously defined
system for evaluating evidence and assigning
scores prevents the adjustment of weights to
achieve a desired result. To the extent feasible
and helpful, standard frameworks and methods
should be used (Box 3-2).
In practice, there are levels of standardization.
Sections 4-7 present a general WoE approach for
EPA, which could reduce the burden of planning
an assessment. Additional standardization can be
achieved at the program level or in individual
regions. The greatest specificity, however,
occurs in the planning of individual assessments.
Box 3-2. Standardization of Weight of
Evidence
Standardization of assessment practices is
desirable in general. Standardization can
increase efficiency by reducing the number of
decisions to be made, improve assessments by
setting standards of practice, and reduce bias by
reducing the opportunities for assessors to make
personal judgments. If the standard practices
are too rigid and prescriptive given the variability
in assessment problems and conditions, they can
reduce efficiency, force less than optimal
assessment practices, and create frustration.
This conundrum is compounded when WoE
approaches are used. Decisions are made concerning, among other things, which inferences
should be based on WoE, what evidence should
be included, what properties of the evidence
should be considered, and how the weights
should be expressed. These decisions should be
made on a program-specific basis.
Not specifying in advance how evidence will be weighed delays the decisions until the data are in hand and the evidence is being generated, weighted, and
weighed. At that point, the consequences of decisions might be apparent to the assessor. For example, if
sediment toxicity tests for an area result in an average 33-percent mortality and no standard for scoring
has been set, an assessor can score the evidence as 0, +, or ++ (ambiguous, weak, or strong). That
judgment could be biased if the assessor knows that the choice can tip the balance for or against dredging
the sediment. The better option is to remove the temptation to choose a weight that gives a preferred answer by standardizing weights whenever appropriate.
Exceptions do exist. A detailed plan, on occasion, can produce nonsensical results when implemented,
due to peculiarities of the case (Johnston et al., 2002). Therefore, allowing for deviations, and
documenting the rationale, as Johnston et al. (2002) did, is essential.
Finally, when a decision is made to use WoE but the method was not prescribed in an analysis plan,
assessors should document how they weighed the evidence and why they made their choices. Expert
judgment, although essential, should be transparent.
Assessment planning should be adapted to include WoE considerations. Stakeholders and decision
makers can be consulted for assurance that the weighing process is acceptable. WoE methods that are
compatible with the potentially available data and evidence should be chosen. Alternatively, data needs
should be specified with WoE in mind. In particular, literature reviews and new studies should be
coordinated to generate complementary evidence. If a field survey will sample benthic invertebrates,
literature reviews should seek toxicity studies of benthic invertebrates. Any new laboratory tests should
address benthic invertebrates and preferably taxa that are sensitive or important at the site. Field
sampling for different pieces of evidence (e.g., chemical concentrations, habitat characteristics, and
occurrence of biological taxa) should be collocated in space and time so that the results can be used to
derive relevant relationships or make comparisons.
When possible, the assessment problem to be addressed by WoE should be formulated in terms of
alternative hypotheses. Identifying the hypothesis best supported by the WoE is conceptually simpler and
more convincing than determining whether a single hypothesis has sufficient WoE. One hypothesis,
considered alone, may appear to have sufficient evidence, but another hypothesis may have more and
weightier evidence. Alternatively, a hypothesis may appear weak, but its status will be unclear until a
stronger alternative is identified. For example, a study of a declining San Joaquin kit fox population
began as an attempt to determine whether toxic chemicals were the cause. Its conclusion that they were
not the cause became more convincing when a strongly supported alternative cause, predation, was
identified (U.S. EPA, 2009a).
Because of the complexity of ecological systems and their responses to interventions, the development of
conceptual models is essential to planning an ecological assessment (U.S. EPA, 1998). Conceptual
models convey the processes and entities that link sources with effects on endpoints. The links can be
used to guide the development of evidence for the WoE process. In risk assessments and other predictive
assessments, questions can be asked such as, what sequence of events must occur following the proposed
action before the endpoint effect can occur? For example, when assessing the risks from input of
phosphorus to a lake, if an endpoint is fish kills due to low dissolved oxygen, the WoE should consider
evidence for the occurrence of algal blooms, respiration and decomposition, and evidence against the
occurrence of mixing because they all can influence dissolved oxygen (Figure 3-3). Similarly, when
assessing the cause of an observed fish kill, if low dissolved oxygen is a hypothesized proximate cause,
evidence of phosphorus input, algal production, decomposition, and the absence of mixing are supporting
evidence that could be generated and evaluated. An approach for applying WoE to conceptual models is
provided by the OECD (2013) guidance on adverse outcome pathways (Box 3-3).
[Figure: conceptual model diagram linking vacation homes (a human activity) through fertilizers and septic systems (sources) to increased dissolved phosphorus, increased algae, increased algal and microbial respiration, and altered dissolved oxygen, with temperature and mixing as modifying factors, ending in increased fish mortality.]
Figure 3-3. Conceptual model for a hypothetical ecological risk assessment of the relationship of phosphorus releases from a vacation home development to the risk of fish kills in a lake (graphic by Kate Schofield, using the conventions in CADDIS, at http://www3.epa.gov/caddis/).
Box 3-3. Weighing Evidence for Adverse Outcome Pathways
Adverse outcome pathways (AOPs) are representations of linkages between molecular initiating events and
adverse outcomes measured at levels of biological organization considered relevant to risk assessments
(Ankley et al., 2010). Although a goal of AOP development is to quantitatively implement them and predict
relevant effects, currently they are primarily a form of hazard identification represented as conceptual
models. For example, an AOP might link the binding of a chemical with a receptor in larval fish through a
series of steps to failure of swim bladder inflation and then to reduced survival and perhaps to reduced
population production (Villeneuve et al., 2014). As with other hazard identification exercises, WoE can be
applied to determine the most likely hazards posed by a chemical and also to reveal gaps and weaknesses in
the evidence. The OECD (2013) has recommended using Hill's considerations, tailored for AOPs, to evaluate
key events, key event relationships, and overall AOPs. The tailored considerations are biological plausibility,
essentiality, and empirical evidence. For each consideration, definitions of high, moderate and low
confidence are provided. For example, moderate confidence in biological plausibility is defined as "the key
event relationship is plausible based on analogy to accepted biological relationships but scientific
understanding is not completely established." These OECD standard weights and definitions are equivalent to
the standard findings and scores in Table 5-2, but types of evidence are not separately evaluated. Example
cases of WoE are provided in the OECD (2013) guidance and derivative publications (Becker et al., 2015;
Villeneuve et al., 2014). Subsequently, another WoE approach for AOPs was developed and demonstrated
using numerical scores that were aggregated in a linear additive fashion into an overall WoE, a procedure
analogous to multi-criteria decision analysis (Collier et al., 2016).
The planning of an assessment should include consideration of uncertainty. However, in the context of
WoE analyses, uncertainty is one component of a set of complex issues related to confidence in the results
(Section 9). As the OECD found when applying WoE to adverse outcome pathways, analysis of
qualitative uncertainty is redundant with WoE (Becker et al., 2015). Conventional statistical measures
such as distribution functions, standard deviations and confidence intervals are used to express the
variability and uncertainty that appear as scatter in the data or scatter in modeling results. Many sources
of uncertainty are unquantified or unquantifiable, however, and conventionally are simply listed. WoE
can address this wider range of qualitative issues that determine confidence in the results—not just how
closely the curve fits the points, but does the curve represent a causal relationship, and how relevant is the
relationship to the case? A formal WoE method is a means to engage with these issues more clearly and
consistently. Ideally, each quantitative result would have a conventional statistical estimate of uncertainty
or variability and associated qualitative weights.
All assessments require professional judgments by assessors, but WoE makes the judgments more
transparent. Clearly indicating where the assessors' judgments end and those of the decision maker begin
is critical. It is clear that the decision maker, after conferring with assessors and potentially with
stakeholders during the planning phase, specifies the topic and scope of the assessment and the decision
to be made (U.S. EPA. 1998). It is also clear that, in the end, the decision maker determines what action
will be taken. Judgments regarding the scientific evidence made between these management decisions
are generally considered to be in the purview of science and are made by assessment scientists to avoid
any potential political biases of decision makers (NRC, 1983). A point of potential ambiguity arises at the conclusion of the assessment. A decision maker might request only a compilation and summary of the evidence and then decide, for example, whether the stream is impaired, which cause is the most likely, or what concentration is the best benchmark value. The boundary between assessment and management,
therefore, should be made clear in the planning phase because it determines the ultimate product of the
assessment.
Because professional judgments play such a prominent role when weighing evidence, avoiding making
the judgments in isolation is essential, particularly for less experienced assessors and for unconventional
assessments. Support for professional judgment can be provided during the assessment by collaboration
within the assessment team or afterwards by internal and external peer reviewers.
In the EPA, the planning of assessments includes the development and approval of a Quality Assurance
Project Plan (QAPP). Although quality assurance (QA) and weighing evidence are not the same, WoE
includes the screening of studies and the evaluation of evidence to weight it with respect to relevance and
reliability, which encompasses evaluating the quality of input data. To avoid duplication of effort, an
explicit WoE process can be used to meet the requirement for a QAPP (Box 3-4 and Section 9).
Box 3-4. Data Quality Assurance and Weight of Evidence
Quality assurance (QA) and WoE both consider the quality of scientific information and thus are related.
The QA process determines whether the quality of environmental data and information supporting EPA
decisions are appropriate for their intended uses (U.S. EPA, 2008). It includes documenting the process for
ensuring information quality, including data generation, data analysis, and assessment methods. Thus, the
development of a QAPP should encompass all aspects of the assessment planning process, including
methods for WoE (U.S. EPA, 2002a). WoE contributes to QA because QA for assessments "involves a
'weight-of-evidence' approach that considers all relevant information and its quality" (U.S. EPA, 2002b).
Therefore, during planning, QA should determine what WoE method is acceptable, and during the
assessment, weighting determines whether the quality of information is acceptable. A formal WoE process
should provide the rigor and transparency needed to meet QA requirements. Assessors should read the QA
guidance and check with their organization's QA official to ensure that they properly merge QA and WoE.
3.3. Results and Transition
In this section we have dealt with determining the role of WoE in the assessment by integrating WoE
considerations into the planning and problem formulation process. That is, we have identified one or
more assessment problems with multiple pieces of evidence to be weighed, we have identified the WoE
approach to be used, and we have determined how it relates to other aspects of the assessment. We are
ready for the first step in the WoE framework, assembling the evidence.
4. ASSEMBLING EVIDENCE
4.1. The Process for Assembling Evidence
The success of WoE depends on identifying or generating useful evidence (Figure 4-1). Evidence is
information that can be used to make an inference. For risk assessments and causal assessments, useful
evidence principally includes information about causality (i.e., a relationship between the exposure and
response and either an estimated exposure level or a response level). Exposure levels are used to solve
the exposure-response relationship to estimate a future response to the potential exposure (e.g., to
estimate the frequency of mortality in a conventional risk assessment) or plausible effects of the exposure
(e.g., to determine whether the level of exposure was sufficient to cause the observed effect in a causal
assessment; Figure 4-2). Response levels considered thresholds for unacceptability (e.g., 10-percent
mortality) are used to solve the exposure-response relationship to estimate the exposure level that will
protect against that unacceptable response (criteria assessments) or the exposure level that would be
required to cause an observed effect (causal assessments; Figure 4-2). Exposure and response
information can occur separately in a condition assessment. For example, evidence that the abundance of
an endangered species is declining is sufficient for a condition assessment to prompt a causal assessment.
Similarly, accumulation of a chemical in fish can be sufficient evidence of condition to prompt a risk
assessment. In addition to information about exposure, response and the relationship between them,
information about environmental conditions such as habitat structure or water chemistry also might be
required to complete the information that constitutes a piece of evidence.
[Figure: Assemble Evidence (Search Literature → Screen → Categorize → Extract and Analyze; Design and Conduct Studies) → Weight Evidence → Weigh Body of Evidence.]
Figure 4-1. An elaboration of the process for assembling evidence, the first step in WoE.
[Figure: an exposure-response curve; the x-axis is concentration.]
Figure 4-2. An exposure-response relationship (black curve) alone is evidence that the measured chemical can cause the effect. With a concentration, it can provide evidence of the level of effect that is expected (blue dashed arrow). With a level of effect, it can provide evidence of the concentration that would cause that level of effect or, in benchmark derivation, a concentration that would prevent greater effects (red dotted arrow).
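To illustrate the two arrows, the sketch below solves a hypothetical log-logistic exposure-response model in both directions; the model form and the parameter values (EC50 and slope) are assumptions chosen for illustration, not part of this guidance.

    # Hypothetical log-logistic exposure-response curve (EC50 and slope invented).
    def effect(conc, ec50=30.0, slope=2.0):
        """Fraction of organisms affected at a given concentration."""
        return 1.0 / (1.0 + (ec50 / conc) ** slope)

    def concentration(effect_level, ec50=30.0, slope=2.0):
        """Inverse of the curve: concentration producing a given fraction affected."""
        return ec50 * (effect_level / (1.0 - effect_level)) ** (1.0 / slope)

    print(effect(60.0))         # blue dashed arrow: 0.8 of organisms affected at concentration 60
    print(concentration(0.10))  # red dotted arrow: concentration 10.0 at a 10-percent effect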
Data alone are not evidence because evidence should indicate a hypothesized spatial, temporal, or causal
relationship or the absence of a relationship. For example, contaminant concentrations are related to a
particular place that has been, or potentially will be, exposed to some source and to reference
concentrations from sites that are not exposed to the source. Concentrations without those relationships
do not constitute evidence.
In sum, evidence usually requires more than one type of information. Recognizing this, Hope and
Clarkson (2014) developed the term "evidence group" to describe the combination of an
exposure-response relationship, information concerning environmental conditions that influence the
relationship, and either an exposure estimate or a response level. Thus, as this document discusses
evidence, note that the discussion typically refers to a set of related bits of information that together
constitute evidence.
4.2. Searching Literature and Assembling Evidence
A systematic literature review can provide more complete information than an informal review and can
reduce the perception of bias in data selection (NRC, 2014). Recognition of the importance of performing
reviews systematically has prompted development of several formal methods for systematic reviews (Box
4-1). Recent peer reviews of high-profile EPA assessments such as the stream and wetland connectivity
report have called for better documented and systematic literature reviews (SAB, 2014). The method for literature review should be described in the analysis plan (U.S. EPA, 1998).
The EPA has developed the Health and Environmental Research Online (http://hero.epa.gov) database to
document the literature searches and screening processes for ISAs and IRIS assessments. It provides
transparency in literature searching, screening, and sorting processes and makes the results available to
the public.
Box 4-1. Systematic Review
Systematic review is an approach for reviewing published scientific evidence based on a formal search
protocol that provides a reasonably complete set of relevant studies that has been screened and reviewed
in a consistent and replicable manner (Bilotta et al., 2014). The goal of systematic review methods is to
ensure that the review is complete, unbiased, and transparent.
Systematic review was developed as a tool for integrating results of clinical trials in evidence-based
medicine. The most prominent example is the Cochrane Collaboration, an organization that reviews
evidence of the efficacy of medical treatments (Higgins and Green, 2011). The Campbell Collaboration
(http://www.campbellcollaboration.org) extends systematic reviews to education, crime and justice, social
welfare, and other issues. Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA)
provides "an evidence-based minimum set of items for reporting in systematic reviews and meta-analyses"
(http://www.prisma-statement.org/). More recently, systematic review has been adapted to mammalian
toxicology, epidemiology, and human health risk assessment. Examples include the industry-funded
Evidence-Based Toxicology Collaboration (http://www.ebtox.com), the foundation-funded Navigation
Guide (Woodruff and Sutton, 2014), and the U.S. National Toxicology Program's Office of Health
Assessment and Translation method (Rooney et al., 2014). Another systematic review system has been
developed for environmental management (CEE, 2013). None of the specific existing methods for
systematic review are directly applicable to weighing evidence in the EPA's ecological assessments. In
particular, systematic review does not integrate heterogeneous evidence, which is typically necessary in
ecological assessments. However, these methods do provide procedures for the assembly of information from the
literature.
4.2.1. Search the literature
The two major components of a literature search are defining the topic and designing the search. The first
step identifies the topic with sufficient specificity to enable assessors or information specialists to design
a search strategy. Examples include the chronic toxicity of arsenic to marine teleost fish and the
biodegradation rate of benzene in fresh surface waters. If the assessments performed by an organization
are sufficiently consistent, developing a standard format and content for defining the search topic might
be desirable. The variants of Population, Exposure, Comparator, Outcomes, and Study Design (PECOS)
systems, developed for reviews of epidemiological or clinical studies, could serve as models for defining
ecological search topics (Woodruff and Sutton, 2014; CRD, 2009).
Comparisons of systematic reviews to conventional literature reviews have revealed that the latter are
often incomplete or biased (Egger et al., 2003). One reason is the lack of professional assistance with searches. Some useful strategies, such as the use of wild cards and truncation to obtain all forms of a search term, as well as procedures for using different databases and search engines, might not be familiar to assessors.
Although guidance on literature searching is available (Gough et al., 2012; Higgins and Green, 2011; CRD, 2009), it is not specific to ecological assessment and could soon become outdated due to advances
in information science such as text mining techniques.
Multiple search strategies and tools can be employed. Some examples include:
•	Electronic databases such as ECOTOX;
•	Literature search engines such as PUBMED, Web of Science, or Google Scholar;
•	General search engines such as Google or Bing;
•	Hand searches of key journals and books;
•	Citation tracking of literature cited in key papers, books, or reports;
•	Citation searching for publications citing key papers, books, or reports;
•	Contacts with key authors and experts;
•	Solicitation of information from stakeholders; and
•	Internal databases (if they are not business confidential).
One should maintain a search log of sources searched, date, terms, and syntax (i.e., the combinations of
the Boolean operators: and, or, not) used. Doing so can make the search explicit, transparent, and
potentially replicable. Downloading search results to reference management software such as Endnote or
Reference Manager, with a distinct name for the results of each search, also can be helpful.
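For example, a search log entry might read as follows (an invented illustration; the database, terms, and hit count are hypothetical):

    2016-06-01 | Web of Science | (arsenic OR arsenate) AND (toxic* OR lethal*) AND (fish OR teleost) | 412 records

The truncated terms (toxic* and lethal*) retrieve all forms of those words, as noted above.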
The inclusion of difficult-to-locate studies (i.e., not published in a journal, not in English, or not indexed
in MEDLINE) influenced the results of medical meta-analyses, but the influence varied among fields
(Egger et al., 2003). Such studies tended to be less comprehensive and to have lower methodological
quality. Identifying relevant studies in an unbiased fashion, however, is important. Issues of study
relevance and reliability should be addressed during screening or weighting of the evidence rather than
during the literature search.
4.2.2. Screen the studies
Screening the search results and studies performed for the assessment identifies irrelevant or clearly
unreliable studies for elimination. Elimination criteria should be defined in advance, to the extent
practicable, to avoid perceived or actual bias. Defining irrelevance is relatively straightforward. If the
topic is chronic toxicity of arsenic to marine teleost fish, a freshwater invertebrate study would be
eliminated. Some issues of relevance might not be apparent in advance. For example, assessors might
realize during the screening process that not all arsenic species are relevant, so they should also screen out
studies of irrelevant arsenic species. Screening for reliability is generally more difficult than screening
for relevance and might not be performed at all. That is, assessors might choose to include all relevant
studies and leave differences in reliability for the evidence-weighting phase. Eliminating studies with
some obviously unacceptable attribute, such as lack of replication or lack of controls in a test, however,
could be efficient. If the weighting phase is skipped (Section 7), performing a more thorough screening
and considering the elimination of marginally reliable studies is important.
Some criteria commonly used to screen studies are not so clearly related to relevance or reliability. They
include language (English only, unless translation services are available); peer review; and use of a
standard protocol or good laboratory practices. Standard protocols ensure consistent and interpretable
results, and good laboratory practices improve methodological reliability. Nonstandard studies, however,
could provide relevant results and might also have good QA. Entire categories of sources could be
eliminated, particularly secondary sources (if possible, primary literature should be used as sources of
information), unpublished reports, or abstracts.
Some EPA programs have specific data-screening criteria. For example, the Office of Pesticide Programs
screens open literature publications using 14 criteria (U.S. EPA, 201^), and the ecological soil screening levels for plants and soil invertebrates have 11 acceptance criteria (U.S. EPA, 2005a). Other screening
criteria are assessment specific, such as the assessment of connectivity of U.S. waters, which used a logic
diagram for screening literature (U.S. EPA, 2015a).
When very large numbers of studies are screened, a tiered process can be used. For example, in the ISAs
for criteria air pollutants, publications are screened based on their titles, then on abstracts, and finally on a
full reading of the text (U.S. EPA, 2013). Duplicate screeners can minimize errors by identifying and
resolving differences.
The EPA has published guidance on determining whether information quality is adequate (see Box 3-4).
This procedure is conceptually equivalent to the screening process described here, but it is not followed
by evidence weighting and weighing. The screening and weighting processes together should be
sufficient to fulfill the EPA's mandate for information quality review.
4.2.3.	Categorize the studies
Sorting studies is generally desirable so that distinct categories of evidence can be weighed before they
are integrated to reach a conclusion (see Section 6.2). The categorization depends on the type of
assessment, amount and diversity of evidence, circumstances of the assessment, and preferences of the
assessors.
The most common approach to categorization is to assign evidence to types based on the sorts of studies
from which the evidence is derived. For example, the Oak Ridge National Laboratory scheme for
ecological assessment of contaminated sites divided evidence into single-chemical toxicity tests, body
burdens, ambient media toxicity tests, biomarkers and pathologies, and biological surveys (Suter et al..
2000; Suter. 1996). Other types of assessments would use different types of evidence. In general, field
studies are separated from laboratory studies, mechanistic studies are separated from studies of overt
effects, and studies of effects at different levels of organization are separated. When only toxicity
benchmark values are used as measures of effects, evidence might be categorized by mode of exposure
[e.g., whole sediment, pore water, or tissue concentrations (Integral Consulting Inc., 2013)]. An
assessment of a specific site might separate evidence from the site from evidence derived from other
locations.
Evidence could also be categorized by the characteristic it illustrates. For example, evidence in causal
assessments can be organized by characteristics of causal relationships such as interaction and
co-occurrence (see Appendix E). Classification in terms of characteristics serves to explain the
implication of the evidence for the inference.
4.2.4.	Derive evidence from data and general knowledge
Data are raw materials for generating evidence. Data are generated by observation or experimentation
and must be obtained from the investigators or extracted from the published studies that pass screening.
Data can include:
•	Numerical data such as the median lethal concentration (LC50) values or raw exposure-response
data;
•	Categorical data such as whether benchmark levels are exceeded; or
•	Narrative data such as the appearance of a water body, the behavior of animals, or a researcher's
interpretations.
In addition to data, generally accepted knowledge such as physical laws and biological principles
(e.g., receptor binding of certain chemicals or the role of species competition in structuring communities)
contributes to evidence generation. Such knowledge can determine the information sought for generating
evidence and the way to structure that evidence. Knowledge also might be the evidence itself. For
example, evidence that a biotic community was exposed to a release from an upstream source is provided
by the general knowledge that flow would carry a persistent and soluble chemical downstream. General
knowledge might appear as facts (water flows downhill) or as a mathematical model (e.g., a transport and
fate model for streams).
If possible, numerical data sets should be obtained from the journal's supplementary material files or
directly from the investigator. Transcribing numbers from data tables is prone to error.
If the data to be extracted are sufficiently consistent (e.g., all are from laboratory toxicity tests),
development of a data form can facilitate extraction and increase completeness and consistency of the
extracted data. Particularly if quantitative analysis is planned, an electronic form can efficiently combine
data extraction and data entry. Any form should be piloted with a subset of the studies.
Ensuring data quality by checking the entered data for consistency with the source is desirable. One way
to accomplish this is to have two individuals separately extract the data and then compare their results.
Discrepancies can be due to errors or to ambiguities in the source that might be variously interpreted.
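A minimal sketch of such a double-entry check, assuming each extractor saves records to a CSV file keyed by a shared study identifier (the file names and the column name are hypothetical):

    import csv

    def load(path):
        # Read extracted records into a dictionary keyed by study ID.
        with open(path, newline="") as f:
            return {row["study_id"]: row for row in csv.DictReader(f)}

    a = load("extractor_a.csv")
    b = load("extractor_b.csv")

    # Flag every study whose two extractions differ, for resolution against the source.
    for study_id in sorted(set(a) | set(b)):
        if a.get(study_id) != b.get(study_id):
            print(f"Discrepancy for study {study_id}; check the original source")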
Some processing of the extracted data could be required before they can be used. In the simplest cases, a
summary statistic such as an annual average is calculated. If evidence will be weighed, additional
processing often is required. Evidence (as opposed to data) generally involves some sort of relationship
such as the impaired stream reach was channelized, but the reference reach was not. Evidence of a
condition could involve simple observation. For example, the absence of a taxon (e.g., no fish) is
evidence of impairment. In addition, pieces of evidence that will be numerically combined must be
converted into the same form and units (e.g., LC50s converted to mg/L).
Processing data to obtain evidence also might require adjusting for the conditions to which the evidence
will be applied. For example, laboratory test data might not be comparable to field data until they have
been adjusted for water properties such as pH, hardness, or dissolved organic matter levels. Adjusting for
the occurrence of mixtures involves applying an additivity model or other combined effects model to the
individual chemical test data. Adjusting for biology might require considering seasonal occurrence of
sensitive life stages.
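For instance, EPA aquatic life criteria for several metals are hardness-dependent and take the general form exp(m × ln(hardness) + b). The sketch below applies that general form with placeholder coefficients; m, b, and the hardness values are illustrative, not values for any particular metal.

    import math

    def hardness_adjusted_benchmark(hardness, m=0.85, b=-3.0):
        # General hardness-dependent benchmark form exp(m*ln(H) + b); the
        # coefficients here are placeholders, not metal-specific values.
        return math.exp(m * math.log(hardness) + b)

    # A laboratory result obtained at one hardness can be compared with field
    # data by evaluating the benchmark at the field hardness.
    print(hardness_adjusted_benchmark(50.0))   # benchmark at 50 mg/L hardness
    print(hardness_adjusted_benchmark(200.0))  # benchmark at 200 mg/L hardness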
Derivation of evidence involves the consideration of variance. Different data sets might appear to be
inconsistent, until it is recognized that their distributions overlap.
4.3. Design and Conduct Studies and Assemble the Evidence
In some cases, studies can be performed to provide evidence for an assessment. In such cases, the
inferential process should determine the evidence needs and drive the research planning. A simple
example is provided by the sediment quality triad [(Chapman, 1996); see Appendix D, Table D-1]. That
WoE method includes standard inference rules that require reliable data for the sediment's chemical
composition, toxicity, and biological community composition. If any of these types of evidence are
missing or unreliable, the inferential method fails. Some EPA offices and programs specify data to be
generated for assessments. A prime example is the Office of Pesticide Programs' FIFRA data
requirements. For other purposes, such as Superfund remedial investigations, the data needs are
determined for individual cases. The Data Quality Objectives Process was developed for that purpose
(Bilyard et al., 1997; U.S. EPA, 1994).
Like data from the literature, data produced for the assessment should be screened for acceptability. The
Office of Solid Waste and Emergency Response provided detailed guidance to determine if contaminant data are usable (U.S. EPA, 1992a, b). If data quality objectives are developed, they should specify
standards for data acceptability. The description of data to be generated is a conventional component of
the analysis plan (U.S. EPA, 1998).
As with data from literature reviews, data generated for the assessment must be input to the assessment's
data management system and analyzed to derive the evidence. Unlike data from the literature, the raw
data from studies performed for the assessment always should be available for analysis. Results of those
studies should be categorized into the same types of evidence as the results of literature reviews.
4.4.	Summary
The process of assembling evidence is more complex than the simple phrase implies. When weighing
evidence, ensuring the process begins with a complete search of the literature is essential. QA procedures
should be applied to the processes of obtaining or extracting data or other information from the literature.
Information should be screened for relevance and reliability. Finally, information should be analyzed and
organized in a way that facilitates its use as evidence.
4.5.	Results and Transition
The results of assembling evidence are pieces of evidence that are at least minimally relevant and reliable,
and are organized and presented in a useful form. The next step in WoE is to assign weights to the
evidence that has been assembled.
5. WEIGHTING EVIDENCE
5.1. The Process of Weighting Evidence
In most assessments, some pieces of evidence will be more influential than others. The evaluation of the
evidence can be implicit, as in a narrative WoE. When a formal WoE method is used, however,
weighting the evidence determines the appropriate degree of influence (Figure 5-1). By not only
evaluating the evidence but also expressing the weights explicitly as scores, assessors transparently
determine the influence that particular pieces of evidence will have when the entire body of evidence is
weighed (see Section 6). Weights are usually assigned to pieces of evidence. However, if the pieces of
evidence in a category (Section 4.2.3) are similar, the category may be assigned a weight (see
Section 6.2).
[Figure: Assemble Evidence → Weight Evidence (Evaluate & Score Relevance, Evaluate & Score Strength, Evaluate & Score Reliability → Combine Weights) → Weigh Body of Evidence.]
Figure 5-1. An elaboration of the process for weighting evidence, the second step in WoE.
The concept of weighting is familiar in statistics. For example, a weighted mean is the mean of study or
sample values multiplied by weights, which are usually the numbers of observations. If one stream
invertebrate survey of 10 sites reported a mean of 20 species and another survey of 40 sites reported a mean of 30 species, the true mean species richness across studies is found by multiplying each mean by its number of sites, adding the resulting products, and dividing by the total number of sites (50). By weighting, the mean is shown
to be 28, not 25. Statistical weighting is an important tool of WoE for deriving quantities (see Section 8
and Appendix B) but is not considered further in this section.
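The survey example above can be worked as a quick sketch:

    # Weighted mean species richness from the two surveys in the example above:
    # each survey's mean is weighted by its number of sites.
    surveys = [(10, 20), (40, 30)]  # (number of sites, mean species richness)

    weighted = sum(n * mean for n, mean in surveys) / sum(n for n, _ in surveys)
    unweighted = sum(mean for _, mean in surveys) / len(surveys)

    print(weighted)    # 28.0, the weighted mean
    print(unweighted)  # 25.0, the misleading unweighted average of the two means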
The qualitative weighting of evidence has two components: evaluation and scoring. Evaluation of
evidence determines the weight—how influential a piece of evidence should be, preferably based on
defined properties. Scoring uses symbols to formalize the results of the evaluation. The weight is
conceptual and the scores are symbolic. For example, a piece of evidence may be evaluated as
moderately relevant, so the score for relevance is ++.
Many uses of WoE have not involved scoring. In these cases, either no results of evaluation are
expressed or the results are expressed by descriptions. When assessors score the evidence, they are
compelled to clarify the evaluation for the reader and themselves. In addition, scoring helps assessors
provide clearly weighted evidence for the next step, the integration of evidence (see Section 6).
The approach presented here for qualitatively weighting evidence has been used in assessments of
specific causation in the Stressor Identification Guidance (U.S. EPA, 2000) and in the Causal Analysis/Diagnosis Decision Information System (CADDIS, http://www3.epa.gov/caddis) and its applications. It has been used in an assessment of general causation for major ions as a cause of the impairment in stream communities (U.S. EPA, 2011b) and has been adapted to integrate evidence in a watershed risk assessment (U.S. EPA, 2014a). Below, the scoring system is presented first and then used
to illustrate the evaluation of evidence.
5.2. Scoring Systems
A generally useful qualitative scoring system uses the symbols +, -, and 0 to represent evidence that, respectively, supports, weakens, or has no effect on the credibility of a hypothesis. More symbols represent greater weight. For example, the Stressor Identification Guidance and the CADDIS system for ecological causal assessment use the following scores.
+ + +, - - -	Convincingly supports or weakens
+ +, - -	Strongly supports or weakens
+, -	Somewhat supports or weakens
0	No effect (neutral or ambiguous)
NE	No evidence
The interpretation of plus or minus symbols is reasonably intuitive, and they have been used in other
WoE systems (Rooney et al., 2014; Higgins and Green, 2011; Fox, 1991; Susser, 1986). These scoring symbols are quite general. They can be applied to a simple quantifiable property like strength of association or to a complex qualitative property like study design. Their regular use could result in clear and consistent expression of WoE results. Nevertheless, other scoring systems might be more useful in
particular applications.
Most other scoring systems used in environmental assessments only distinguish different degrees of
support for a hypothesis. Some evidence, however, is clearly contrary to a hypothesis or has no effect.
When a scoring system does not include those possibilities, a separate step must follow scoring to
distinguish the logical implications of evidence [e.g., the evidence for the hypothesis is ++ for reliability and +++ for strength but it is contrary to the hypothesis (Johnston et al., 2002)]. The use of 0 and -
symbols as well as + makes the logical implication of the evidence part of the weighting process rather
than an afterthought.
Symbols are preferable to numerical scores because their use makes clear that weights cannot be numerically
combined. Two strongly supporting laboratory tests (++ and ++) are not equal to four somewhat
supporting field tests (+, +, +, +). For a test result, a - score for study design and a + score for replication
of the test do not sum to 0, because they are not commensurable. They simply signify different results for
the different qualitative properties. Adding numerical scores generates a number with no units that
signifies no quantity in particular.
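If weights are tabulated in software, one way to honor this constraint is to store scores as categorical values rather than numbers, so that they cannot be summed by accident. The sketch below is a design illustration only; the names are not part of any EPA system.

    from enum import Enum

    class Score(Enum):
        # Qualitative weights; deliberately non-numeric so that arithmetic on
        # them (e.g., summing two ++ scores) raises an error.
        CONVINCING_SUPPORT = "+ + +"
        STRONG_SUPPORT = "+ +"
        SOME_SUPPORT = "+"
        NEUTRAL = "0"
        SOME_WEAKENING = "-"
        STRONG_WEAKENING = "- -"
        CONVINCING_WEAKENING = "- - -"
        NO_EVIDENCE = "NE"

    # Score.STRONG_SUPPORT + Score.STRONG_SUPPORT raises TypeError, which is
    # the point: two ++ results are not "worth" four + results.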
These symbols also facilitate interpretation of bodies of evidence. When scanning a WoE table (see Section 6.2), seeing patterns in the frequencies of +, -, and 0 symbols that indicate which hypotheses are
supported by the weight of evidence is easier than if words or numbers are used to score evidence.
The system described above with three types of weight (positive, negative, and none) and three levels of
weight (low, medium and high) has been found useful in environmental assessments, but more or less
discrimination might be useful in some assessments. Binary (+/-) scores such as accept/reject and
consistent/inconsistent have been recommended because more complex systems could be confusing or overwhelming (Hope and Clarkson, 2014). Alternatively, studies of survey responses have shown that people can distinguish five to seven possible response levels, as in the Likert scale [e.g., very low, low, medium, high, very high (Dawes, 2008)]. Hence, the complexity of the scoring system can be adapted to
the assessment and the desired degree of discrimination.
5.3. Properties to Be Weighted
Various properties of a piece or type of evidence
can contribute to the degree of influence that it
should exert. The specific properties fall within
three general properties: relevance, strength, and
reliability (defined in Box 5-1, Box 5-2, and Box 5-3).
Relevance (Box 5-1) includes biological,
physical/chemical, and environmental aspects. It
is at least minimally ensured when the assembled
evidence is screened. Screening might be
sufficient, and no further weighting of individual
pieces of evidence for relevance may be
necessary. For example, when assessing risk to
black bears, if the only available mammalian
toxicity tests are for laboratory rats and mice, the
relevance of the tests cannot be distinguished
because the relationship of rat and mouse
sensitivities to bear sensitivity is unknown. In
many cases, however, differences in study
relevance are important to consider.
Box 5-1. Relevance of a Piece or Type of
Evidence
The relevance of a piece or type of evidence is
the degree of correspondence between the
evidence and the assessment endpoint to which
it is applied.
Biological relevance—correspondence among
the taxa, life stages, and processes measured or
observed and the assessment endpoint (e.g., a
Daphnia acute lethality test does not correspond
well to insect life-cycle survival and fecundity, so
relevance may be low).
Physical/chemical relevance—correspondence
between the chemical or physical agent tested or
measured at the studied site and the chemical or
physical agent constituting the stressor of
concern (e.g., pyrene may be used to represent a
polycyclic aromatic hydrocarbon mixture, but
relevance may be low).
Environmental relevance—correspondence
between test conditions and conditions at the
assessed site or the environmental conditions in
a studied system and conditions in the region of
concern (e.g., a pond mesocosm does not
correspond well to a stream for many
environmental parameters, so relevance may be
low).
A strong signal is better differentiated from noise than is a weak signal, so a strong signal should be given more weight. Strong evidence shows (1) a large magnitude of difference between a treatment and control in an experiment or between exposed and reference conditions in an observational study, (2) a high degree of association between a putative cause and effect, or (3) a large number of elements in a set of evidence (see Box 5-2). Strength is a property of
the study results, not the type of evidence or
study method. The metrics for strength of
evidence are familiar. Magnitude is typically
represented by absolute and relative differences
(e.g., body burdens were 20 times higher than at
reference sites). Association is typically
represented by correlation coefficients
(e.g., Pearson's r for the correlation of emission
rate and ambient concentrations was 0.8) or
slopes (e.g., the regression of species richness on
dissolved oxygen had a slope of 7). Number is
commonly represented by the number of
elements or frequency of occurrences (e.g., a fish
kill has occurred at snowmelt every year for the
past 6 years). Although strength occurs less
frequently in WoE systems than relevance and
reliability, some weight only strength (Chapman, 2007).
Strength metrics lend themselves to
standardization. For example, correlation
coefficients were calculated for several
associations in regional field data to determine
the WoE for major ions as a cause of the
extirpation of invertebrate genera and for potential confounding by other variables (U.S. EPA, 2011b). Standard scores were developed for the strength of correlations based on the authors' experience with
correlations of parameters in surveys of physical, chemical, and biological properties of streams (Table
5-1). In this example, a correlation coefficient (r) between 0.25 and 0.75 is supportive but is not assigned
an extra plus for strength. Because r > 0.75 is considered relatively strong for a correlation between a
water quality measure and a biological response from a regional data set, it is given a second plus. No
field correlations are believed to be convincing evidence, so none receive three + or - signs. When the scores are used to compare alternative hypotheses, their consistency and reasonableness are more important than the precise values chosen for the cutoffs. Documenting scoring criteria in advance also reduces the opportunity to bias the scoring, compared with scoring performed without previously defined cutoff values.
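The Table 5-1 cutoffs can be expressed as a simple scoring rule. The sketch below is an illustration of that ladder, not an EPA tool; it assumes the wrong-sign case receives the strongest negative score, consistent with the table.

    def score_correlation(r, expected_sign=-1):
        # Score the strength of a field correlation per the Table 5-1 cutoffs.
        # expected_sign is -1 for toxic relationships (e.g., conductivity vs.
        # number of Ephemeroptera), where the correlation should be negative.
        if r * expected_sign < 0:
            return "- -"              # wrong sign weakens the case
        strength = abs(r)
        if strength > 0.75:
            return "+ +"
        if strength > 0.25:
            return "+"
        if strength > 0.1:
            return "0"
        return "-"                    # weak correlations also weaken the case

    print(score_correlation(-0.8))    # "+ +": strong correlation with the expected sign
    print(score_correlation(0.5))     # "- -": correlation with the wrong sign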
Standard scores based on strength and logical implication are provided for evidence in causal assessments
in the CADDIS system. The examples in Table 5-2 are based on standard alternative possible outcomes
(findings) for each of 16 types of evidence and the interpretations of those outcomes in terms of the
degree to which they support a potential cause. These scores show that different types of evidence have
different outcomes with different implications for a hypothesis. The CADDIS table was developed for
causal assessment in aquatic ecosystems. This approach to standard scoring is useful if the types of
evidence and their findings are conventional. Other applications of this approach could have different
types of evidence, possible outcomes, and interpretations.
Box 5-2. Strength of a Piece or Type of
Evidence
The strength of a piece or type of evidence is the
degree of differentiation from control, reference,
or randomness [modified from Norton et al.
(2014)].
Magnitude—degree of difference between the
amount of response at affected sites and at
reference sites or in treatments and controls,
between degrees of exposure or other relevant
differences in the evidence, most commonly
expressed as a difference between means or a
ratio of means.
Association—degree to which variation in a
variable representing a cause explains variation
in a variable representing an effect, most
commonly expressed as a correlation coefficient.
Number—the number of elements of a set of
evidence (e.g., of symptoms or overt effects in a
response or of steps in a causal pathway) that
are reported to be observed or the number of
occurrences.
Table 5-1. Weighting the strength of correlations (absolute value of r) and noting the logical implication—an example for evidence from stream biological surveys (Cormier and Suter, 2013; U.S. EPA, 2011b)

Assessment: The sign of the correlation coefficient depends on the relationship. For toxic relationships such as the correlation between conductivity and number of Ephemeroptera, the sign should be negative. Weak or positive correlations weaken the case for that candidate cause.

Logical Implication and Strength       Score
|r| > 0.75                             + +
0.75 ≥ |r| > 0.25                      +
0.1 < |r| ≤ 0.25                       0
|r| ≤ 0.1                              -
r has the wrong sign                   - -
Table 5-2. Table of standard scores for 3 example types of evidence out of 15 types in CADDIS (http://www.epa.gov/caddis/si step scores.html). Each type of evidence is explained in its own CADDIS page.

Spatial/temporal co-occurrence in site surveys
  Finding: The effect occurs where or when the candidate cause occurs, OR the effect does not occur where or when the candidate cause does not occur.
    Interpretation: This finding somewhat supports the case for the candidate cause, but is not strongly supportive because the association could be coincidental. Score: +
  Finding: Whether the candidate cause and the effect co-occur is uncertain.
    Interpretation: This finding neither supports nor weakens the case for the candidate cause because the evidence is ambiguous. Score: 0
  Finding: The effect does not occur where or when the candidate cause occurs, OR the effect occurs where or when the candidate cause does not occur.
    Interpretation: This finding convincingly weakens the case for the candidate cause because causes must co-occur with their effects. Score: - - -
  Finding: The effect does not occur where and when the candidate cause occurs, OR the effect occurs where or when the candidate cause does not occur, and the evidence is indisputable.
    Interpretation: This finding refutes the case for the candidate cause because causes must co-occur with their effects. Because the evidence is indisputable, other evidence need not be assessed. Score: R

Laboratory tests of site media
  Finding: Laboratory tests with site media show clear biological effects that are closely related to the observed impairment.
    Interpretation: This finding convincingly supports the case for the candidate cause. Score: + + +
  Finding: Laboratory tests with site media show ambiguous effects, OR show clear effects that are not closely related to the observed impairment.
    Interpretation: This finding somewhat supports the case for the candidate cause. Score: +
  Finding: Laboratory tests with site media show uncertain effects.
    Interpretation: This finding neither supports nor weakens the case for the candidate cause. Score: 0
  Finding: Laboratory tests with site media show no toxic effects that can be related to the observed impairment.
    Interpretation: This finding somewhat weakens the case for the candidate cause but is not strongly weakening because test species, responses, or conditions might be inappropriate relative to field conditions. Score: -

Symptoms
  Finding: Symptoms or species occurrences observed at the site are diagnostic of the candidate cause.
    Interpretation: This finding is sufficient to diagnose the candidate cause as the cause of the impairment, even without the support of other types of evidence. Score: D
  Finding: Symptoms or species occurrences observed at the site include some but not all of a diagnostic set, OR symptoms or species occurrences observed at the site characterize the candidate cause and a few others.
    Interpretation: This finding somewhat supports the case for the candidate cause, but is not strongly supportive because symptoms or species are indicative of multiple possible causes. Score: +
  Finding: Symptoms or species occurrences observed at the site are ambiguous or occur with many causes.
    Interpretation: This finding neither supports nor weakens the case for the candidate cause. Score: 0
  Finding: Symptoms or species occurrences observed at the site are contrary to the candidate cause.
    Interpretation: This finding convincingly weakens the case for the candidate cause. Score: - - -
  Finding: Symptoms or species occurrences observed at the site are indisputably contrary to the candidate cause.
    Interpretation: This finding refutes the case for the candidate cause. Score: R
Properties of evidence that suggest the evidence is more reliable and should be given greater weight are
listed in Box 5-3. Additional properties may be applicable in particular cases. Although scoring
numerous properties for every piece of evidence could be burdensome, doing so would provide
completeness and transparency for the weighting process. An alternative is to choose one or a few
properties to be weighted that are judged most important or most likely to discriminate the various pieces
of evidence. For example, if all evidence is consistent with prior knowledge (as is usually the case),
consilience need not be scored. This approach—scoring the most important component properties of
reliability—is generally useful. The least burdensome, but also least transparent, approach is to integrate
the component properties of reliability implicitly for each piece of evidence and assign an overall
reliability score.
Of these 11 component properties of reliability, the most attention has been devoted to study design.
Developing a checklist of design features for each commonly used type of evidence is advisable. For
example, Batley et al. (2002) developed a checklist of considerations for laboratory and field studies of
sediment chemistry, toxicology, and community structure.
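Where a few reliability components are scored per study, a small tabulation keeps the weighting transparent. The checklist items, scores, and integration rule below are hypothetical illustrations, not the Batley et al. (2002) checklist.

```python
# Hypothetical reliability checklist for one piece of evidence; the items
# and the implicit-integration rule are illustrative, not from any source.
checklist = {
    "design_and_execution": "+",   # sound design, well performed
    "minimized_confounding": "0",  # confounders present but analyzed
    "standardization": "+",        # standard method used
    "transparency": "-",           # raw data not available for reanalysis
}

def overall_reliability(scores):
    """Implicitly integrate component scores into one overall symbol."""
    net = sum(+1 if s == "+" else -1 if s == "-" else 0
              for s in scores.values())
    return "+" if net > 0 else "-" if net < 0 else "0"

print(overall_reliability(checklist))  # '+'
```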
Box 5-3. Reliability of Evidence
Reliability consists of inherent properties that make evidence convincing [modified from Norton et al.
(2014)].
Design and execution—evidence generated with a good study design that is well performed is more
reliable.
Abundance—evidence from more numerous data is more reliable because it reflects greater replication or
resolution.
Minimized confounding—evidence is more reliable when the sampling design or analysis controls
extraneous correlates.
Specificity—evidence (e.g., a symptom or set of symptoms) specific to one cause or a few related causes is
more reliable.
Potential for bias—evidence from a study that is not funded by an interested party, is not produced for
advocacy, and is not produced by an investigator with conflicts of interest is more reliable.
Standardization—a standard method decreases the likelihood that the evidence is biased or analyses are
inaccurate.
Corroboration—using models, indicators, or symptoms that have been verified by many studies and are
accepted technical practice can greatly increase reliability.
Transparency—complete description of methods and inferential logic and availability of data for reanalysis
provide the means to check the results and are presumed to increase reliability by reducing the likelihood
of hidden faults.
Peer review—an independent peer review increases reliability of a source of information.
Consistency—the degree to which evidence does not vary in repeated instances within a study (e.g., across
years, locations, sampling teams, or methods) is an indicator of reliability of a piece of evidence; when
weighting types of evidence, consistency among studies of the same type is an indicator of reliability of the
type.
Consilience—evidence shown to be consistent with scientific knowledge and theory, particularly with
respect to underlying mechanisms, is more reliable.
Reliability is less if a study or a body of work appears biased. The potential for bias in scientific publications, although controversial, is well documented (Suter and Cormier, 2014; McGarity and Wagner, 2008). One clearly important source is publication bias, the disinclination of authors to submit and of editors to accept studies with negative results (i.e., no effect is found). This bias can be detected by statistical analysis if the number of studies is sufficiently large (Ferguson and Brannick, 2012; Hunter and Schmidt, 2004). Other sources of bias can be detected only by comparing the results of studies from different sources. In particular, numerous reviews have identified different results from industry-funded studies and studies funded by neutral sources (Suter and Cormier, 2014; McGarity and Wagner, 2008). The most conspicuous example in ecological risk assessment is the difference between industry-funded and other studies and reviews concerning the teratogenicity of atrazine (Raloff, 2010). Such patterns of bias are not used to exclude data, but potential sources of bias should be identified, if possible, because they help interpret the body of evidence (NRC, 2014). For example, if the results of industry-funded studies differ from those of government- or foundation-funded studies, all of them must be carefully reviewed. If the cause of the difference cannot be determined, the reliability of the whole body of evidence may be down-weighted.
Biases may also result from personal ideologies, from a desire for publicity, or from other motives. These more personal sources are more difficult to detect but may show up in non-scientific writings.
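One simple way to screen for such funding-related patterns is to compare summary effect sizes across funder groups. The sketch below is illustrative only; the data, names, and threshold are hypothetical.

```python
from statistics import mean

# Hypothetical effect sizes (e.g., log response ratios) grouped by funder;
# the values and the 2x disparity threshold are illustrative only.
results = {
    "industry":   [0.1, 0.2, 0.0, 0.1],
    "government": [0.6, 0.5, 0.7],
    "foundation": [0.5, 0.6],
}

def flag_funding_discrepancy(groups, ratio=2.0):
    """Return True if mean effects differ by more than `ratio` across funders."""
    means = [abs(mean(v)) for v in groups.values() if v]
    return max(means) > ratio * max(min(means), 1e-9)

if flag_funding_discrepancy(results):
    print("Review all studies: results differ systematically by funding source.")
```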
In general, secondary sources should be screened out (Section 4.2.2), but if information from secondary sources is used, it cannot be considered highly reliable. The creation of secondary sources is an opportunity to introduce errors and biases in the extraction, analysis, and interpretation of data. Bad citing practices have been reported in the ecological literature, resulting in unsubstantiated conclusions (Sanz-Martin et al., 2016; Todd et al., 2010; Todd et al., 2007). Hence, if primary sources are unavailable, information may be taken from secondary sources but should be considered to have weak reliability. Cases in which a secondary source identifies and corrects an error in a primary source may be exceptions.
In addition to relevance, strength, and reliability, the type of evidence has sometimes been a weighting
consideration. In particular, some assessors give more weight to field biological survey results than to
other evidence (Chapman and Anderson, 2005). However, some field surveys have low relevance
because they do not address the endpoint, do not address the sensitive taxa or other sensitive entities, or
measure an irrelevant or insensitive response. In addition, field surveys might be so poorly designed or
executed that the results are unreliable. For example, field data were given priority in a contaminated
sediment study even though the field survey had "many limitations," including no appropriate reference
sites (McPherson et al., 2008). Weighting the properties of the evidence is preferable to assuming that
one type is always weightier.
Statistical analysis also has been a consideration in WoE. For example, a criterion for study evaluation in ISAs is "Are the statistical analyses appropriate, properly performed, and properly interpreted?" (U.S. EPA, 2013). However, the appropriateness of the analyses reported in a publication is immaterial if assessors perform their own analyses. Published studies should include original data, or the authors should make data available for reanalysis. At least for influential evidence, assessors should consider reanalyzing the data so that results are relevant to the assessment, properly interpreted, and free of statistical errors or biases in assumptions. In such cases, evaluating the statistics the authors used is not pertinent. If original data are not available, the evidence could be given a negative score for transparency. To the extent possible, improper interpretations should be corrected (e.g., p > 0.05 does not mean that no effect occurred), and assessors should consider down-weighting evidence that has inappropriate statistics if reanalysis is not feasible (e.g., time and resources are not available).
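For example, an assessor reanalyzing raw data might report an effect estimate with a bootstrap confidence interval instead of relying on the published p-value. The data and function below are hypothetical; this is a minimal sketch of one possible reanalysis, not a prescribed method.

```python
import random

# Hypothetical raw responses extracted from a published study (units arbitrary).
control = [10.2, 9.8, 10.5, 10.1, 9.9, 10.3]
exposed = [9.1, 9.5, 8.8, 9.7, 9.0, 9.4]

def bootstrap_ci(a, b, n=10_000, alpha=0.05, seed=1):
    """Percentile bootstrap CI for the difference in means (b - a)."""
    rng = random.Random(seed)
    diffs = sorted(
        sum(rng.choices(b, k=len(b))) / len(b)
        - sum(rng.choices(a, k=len(a))) / len(a)
        for _ in range(n)
    )
    return diffs[int(n * alpha / 2)], diffs[int(n * (1 - alpha / 2))]

lo, hi = bootstrap_ci(control, exposed)
print(f"Mean difference 95% CI: [{lo:.2f}, {hi:.2f}]")
# A CI excluding zero indicates an effect even if the original test reported p > 0.05.
```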
5.4. Tables of Weights
The primary inferential tool in weighting evidence is tabulations of pieces or categories of evidence and
the weights assigned to them with respect to their properties. Because the assigned weights are expressed
as scores, they are called scoring tables. Scoring tables may be aggregated to the extent that the
weighting process is aggregated. Table 5-3 is a basic, generic scoring table in which both the evidence
and properties are aggregated. The rows could have been individual pieces of evidence, but in this table,
they are conventional types of evidence. Each of the three general properties is assigned a score based on
the evaluation of the evidence each type provides. In the example, when assessing a contaminant as a
potential cause of effects in the field, the results of laboratory tests of the contaminant can be positive but
have low relevance to the field (+), the responses in the tests can be moderately strong (++), and the
reliability of the tests can be high due to standard test designs, good laboratory practices, and regular
audits (+++). In this example, the combined score (overall weight) might be + (weak supporting
evidence) because the low relevance of the tests could make the strength of the results and the reliability
of the method irrelevant to the inference. In other cases, a highly relevant test of the species of concern
may not be sufficient to overcome very low reliability (e.g., due to absence of controls). In general, a
property with very low weight will have greater influence than a moderate- or high-weight property. In
other words, a bad property of a piece of evidence tends to contaminate the whole thing. These examples illustrate why qualitative WoE is not arithmetic—you cannot simply add up the scores. They also illustrate why weights should be explained.
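One way to formalize this weakest-link behavior is a minimum rule, in which the lowest property score caps the overall weight. This is a minimal sketch of one possible convention, not a rule prescribed by this document.

```python
# Scores as signed integers: +1 = '+', +2 = '++', +3 = '+++'.
# The minimum rule below is one illustrative convention, not a prescribed one.
def combine_properties(relevance: int, strength: int, reliability: int) -> int:
    """A low score on any property caps the overall weight (weakest link)."""
    return min(relevance, strength, reliability)

# Example from Table 5-3: relevance +, strength ++, reliability +++.
print(combine_properties(1, 2, 3))  # 1, i.e., '+': weak supporting evidence
```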
Table 5-3. Generic scoring table based on conventional types of evidence, with first line hypothetically completed. The overall weight is positive but low because the test relevance is low.

Types of Evidence                                          Relevance  Strength  Reliability  Overall Weight
Laboratory toxicity tests of a defined agent               +          ++        +++          +
Effluent toxicity tests
Ambient media toxicity tests
Field biological surveys
Field biomarkers and organ or whole-body concentrations
Field symptoms
Table 5-4 presents an example from a general causal assessment to determine whether dissolved major
ions, measured as conductivity, cause extirpation of stream invertebrate genera. This scoring table
presents three types of evidence for one characteristic of causation: the sufficiency of the observed
conductivity levels to cause extirpation. Equivalent tables were presented for evidence related to each
characteristic of causation. The properties of evidence were treated as binary (e.g., either the evidence is
corroborated or not), so no more than one symbol was applied to a property. In this case, evidence is
screened for sufficient relevance, and each relevant type of evidence is given one score designating its
logical implication. Strength is scored using quantitative criteria as in Table 5-1. Corroboration is scored
as the most important component of reliability in this case. Inclusion of a description of the evidence aids
reader understanding.
After having assigned weights to each property of evidence (relevance, strength, and reliability), and possibly to multiple component properties of relevance, strength, or reliability, combining them into an overall weight for each piece or type of evidence might be desirable. If evidence is abundant, providing a single summary weight score can facilitate the integration step. Developing a system for combining the property weights can be relatively straightforward. The system used for the causal assessment from which Table 5-4 was taken is designed to provide a maximum cumulative score of three + or - units, one for each of the three properties. However, as with other processes in WoE, the weights could be combined by expert judgment without defined procedures or criteria to maintain flexibility.
If evidence is scored for multiple component properties, the components should be combined before the three properties are combined. Seldom are multiple component properties of relevance or strength separately scored for a piece of evidence. Reliability has at least 11 component properties that are conceptually distinct, however, and more than one can be applied to a piece of evidence (Box 5-3). In Table 5-4, only corroboration is evaluated, but if multiple component properties of reliability were judged sufficiently important to be evaluated and scored, a separate reliability scoring table would be needed.
In any case, making the weighting convincing and transparent by explaining the reasons for the scores is desirable. This explanation could be presented in the text, but a simple statement, as in the last line of Table 5-4, might be sufficient.
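The capped cumulative rule described above can be stated compactly. This sketch assumes the Table 5-4 convention of at most one symbol per property; the function and variable names are illustrative.

```python
# Minimal sketch of the capped cumulative rule used with Table 5-4:
# each property contributes at most one '+' or '-' unit, so the total
# score ranges from -3 to +3. Names are illustrative assumptions.
def cumulative_score(logical: int, strength: int, corroboration: int) -> str:
    total = sum(max(-1, min(1, s)) for s in (logical, strength, corroboration))
    return ("+" * total) if total > 0 else ("-" * -total) if total < 0 else "0"

# Sufficiency evidence in Table 5-4 scores + on all three properties:
print(cumulative_score(1, 1, 1))  # '+++'
```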
Table 5-4. Example scoring table: scoring types of evidence for sufficiency (U.S. EPA, 2011b). The table is a direct copy and calls out a figure that does not appear in this document. Log = logical implication of relevant evidence; Str = strength; Cor = corroborated evidence. GLIMPSS = Genus-Level Index of Most Probable Stream Status; WVSCI = West Virginia Stream Condition Index.

Laboratory tests of ambient waters
  A test showed acute lethality to an apparently resistant species, Isonychia bicolor, at conductivity levels similar to its XC95.
  Scores: Log +

Field exposure-response relationships of biological metrics
  Ephemeroptera were negatively correlated with conductivity in two data sets, r = -0.61 and -0.72 (Figs. 2b and 4b), and r = -0.90 in Pond et al. (2008). This highly relevant evidence was obtained independently in two separate data sets, with moderate-to-strong correlations. Exposures were in the field with native species. Removal of sites with high levels of potential confounders had little effect on the correlation.
  Scores: Log +, Cor +

Field exposure-response relationships of composite indices
  The field observations showed that, as conductivity increases, indices of stream condition (WVSCI and GLIMPSS) decrease [Fig. 5 and Pond et al. (2008)]. Correlations were strong [r = -0.80; r = -0.90 in Pond et al. (2008)]. Results were further corroborated by U.S. EPA (2010a). Exposures were in the field with native species.
  Scores: Log +, Str +, Cor +

Field exposure-response relationships of susceptible genera
  At 500 µS/cm, the capture probabilities of more than 65% of genera began to decline. Similar results were obtained with West Virginia and Kentucky data sets.
  Scores: Log +, Str +, Cor +

Summary of sufficiency: Exposure to saline waters in Appalachia is sufficient to cause the declines of genera (+) with the salts found in the region's streams. The increases in effects of conductivity are strong even when other stressors are present (+). Different analytical approaches demonstrate that ionic strength is associated with different effect endpoints in different data sets in two states (+). The evidence is consistent. The total score is + + +.
5.5.	Not Combining Weights for Properties of Evidence
Assigning scores to each property of each piece or type of evidence might complete the weighting step.
That is, separate scores for the properties, rather than a single combined weight, could be carried forward
to the integration step. The advantage is that the weights assigned to each property are available to
assessors while they are integrating across evidence within a hypothesis as shown in Table 6-2. In such
cases, the WoE table shows which pieces of evidence are relevant, which are strong, and which are
reliable. One might even score biological, physical/chemical, and environmental relevance separately if
relevance is a particularly important issue in the case. The obvious disadvantage is that the integration
becomes more complex because the assessor must consider weights for multiple properties while
integrating multiple pieces of evidence. In addition, the relative influence of the properties might be
evaluated less consistently and transparently if it is not determined in a separate step.
5.6.	Summary
Once the evidence has been assembled, the WoE analysis begins in earnest with the assignment of
weights to the evidence. Weights are represented by a generally useful scoring system that represents
both the implication of the evidence (+, -, and 0) and the weightiness of the evidence (the number of
symbols). Three properties are scored: relevance, strength, and reliability. The evidence and associated
weights are summarized in a scoring table. Weights for the three properties may be combined into an
overall weight for each piece of evidence or all three may be carried forward into the next step, weighing
the body of evidence.
5.7. Results and Transition
The result of the weighting step is a set of pieces or categories of evidence for each hypothesis, each of
which has been assigned weights indicating how relevant, strong, and reliable it is. It is recommended
that the weights be expressed as qualitative scores and organized in scoring tables. A narrative should
also be provided describing the evidence and explaining the weights. This weighted set of pieces or categories of evidence is integrated and interpreted in the weighing step, which follows.
6. WEIGHING BODIES OF EVIDENCE
6.1. The Weighing Process
After weights have been assigned to pieces or categories of evidence, each resulting body of evidence is
weighed to determine which, if any, hypothesis is supported. First, weighted evidence for each
hypothesis is integrated to form weighed bodies of evidence for the hypotheses (Figure 6-1). Second, the
bodies of evidence are interpreted to determine which hypothesis the evidence best supports. Finally, if
the bodies of evidence are ambiguous or discrepant, the case should be reconsidered and additional
evidence may be required.
[Figure: flow from Assemble Evidence and Weight Evidence into Weigh Body of Evidence, which comprises Integrate Evidence, Interpret Bodies of Evidence, and Explain Ambiguities & Discrepancies.]
Figure 6-1. An elaboration of the process for weighing the body of evidence, the third step in weight of evidence.
6.2. Integrating Evidence
The input to the integration step is weighted pieces or categories of evidence, and the output is the
weighed body of evidence for each hypothesis. The essential purpose of integration is to aggregate the
evidence and associated weights into an overall weight. A secondary purpose of integration is to explain
the role that the evidence serves in the inference.
1.	Integration can aggregate numerous weighted pieces or categories of evidence (depending on the
output of the weighting step) into a body of evidence. This purpose is typically served by
deriving a combined weight for each type of evidence: evidence derived from laboratory acute
tests, biomarkers, biological surveys, etc. Then, the weights are integrated across categories to
weigh the body of evidence for each hypothesis.
2.	Explanation can answer the question, what does the evidence do to support or weaken a
hypothesis? For example, evidence from field surveys can support a hypothesized cause by
showing that the cause and effect co-occur. Because WoE is often used to infer causation, the
most broadly useful explanation is to associate the evidence with the characteristics of causation
such as time order and interaction [Table 6-1, Table 6-4, and Table E-1 (Norton et al., 2014; Cormier et al., 2010)]. Characteristics, however, can be defined for any quality inferred by WoE
such as impairment or remediation (see Appendix E). Even without defined characteristics,
explanation can be performed by determining how each category of evidence relates to the
hypotheses. For example, when weighing evidence for dietary bioaccumulation of a chemical in
fish, concentrations in algae are evidence that the chemical occurs in the food web of the fish. If
the evidence has been associated with characteristics, the sets of evidence for the characteristics
can be used as categories of evidence in place of the conventional types of evidence (e.g., Table
6-4).
Evidence can be integrated in one or more steps, depending on the amount and its diversity and the
circumstances of the assessment (Figure 6-2). The categories into which evidence is integrated can be
types of evidence or characteristics of the quality of interest. If the pieces of evidence are few or all of
one type, the body of evidence can simply be integrated after each piece of evidence is weighted
(Figure 6-2a). If the differences among pieces of evidence within a category are small, weighting the
pieces can be skipped and the category of evidence weighted as a whole (Figure 6-2b). When evidence is
both abundant and diverse, the pieces of evidence can be weighted in each category, the categories weighed based on the weights of the pieces and the collective properties of the category (Box 6-1), and, finally, the body of evidence weighed (Figure 6-2c).
body of evidence can be very methodical and use multiple tiers of categories (Figure 6-2d). For example,
evidence for fish acute toxicity, fish chronic toxicity, invertebrate acute toxicity, and invertebrate chronic
toxicity might be separately weighted and the weights combined to weigh the evidence from all
laboratory toxicity tests. Evidence from a laboratory toxicity test then could be weighed along with field
tests, field surveys, and any other type of evidence to determine whether the weight of the body of
evidence is sufficient to indicate that a chemical exposure is a cause of aquatic community impairment.
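The tiered aggregation just described (Figure 6-2d) can be pictured as a nested roll-up of weighted evidence. In the sketch below, the structure mirrors the document's laboratory-test example, but the numeric scores and the mean-then-round rule are illustrative assumptions, not conventions prescribed by this document.

```python
# Illustrative multi-tier roll-up (Figure 6-2d pattern). Scores are signed
# integers (+3 strongly supports ... -3 strongly weakens). The mean-then-round
# aggregation is a stand-in convention, not one prescribed by the document.
evidence = {
    "laboratory toxicity tests": {
        "fish acute": [2, 3], "fish chronic": [2],
        "invertebrate acute": [1, 2], "invertebrate chronic": [2],
    },
    "field surveys": {"biological surveys": [1, 1, 2]},
}

def roll_up(scores):
    """Aggregate a list or dict of scores into one integer score."""
    vals = list(scores.values()) if isinstance(scores, dict) else scores
    flat = [roll_up(v) if isinstance(v, (dict, list)) else v for v in vals]
    return round(sum(flat) / len(flat))

print(roll_up(evidence))  # overall weight for the body of evidence
```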
The degree to which the weighing of evidence is elaborated depends primarily on a tradeoff between rigor
and disclosure on one hand and efficiency and simplicity on the other. The potential to confuse or
overwhelm the reader with an elaborate weighing process, however, could also be a consideration. To
address this problem, detailed weighting tables and text could be presented in an appendix for peer
reviewers and others who need a deep understanding and a summary provided for decision makers and
stakeholders.
As in the weighting step, the fundamental tool of evidence integration is a table, but this one is called a
WoE table. A WoE table presents the evidence, organized in categories (types or characteristics), and the
WoE scores for each evidence category and alternative hypothesis (Table 6-1). These scores come from
the scoring tables created by weighting the evidence (Section 5). The same symbols are used in the
weighing step as in weighting, because they express the same conceptual weights. In addition, collective
properties of the body of evidence such as coherence are scored (Box 6-1). Finally, the overall weight for
each hypothesis is presented. This generic table form is broadly useful, but the form of the table in any
application depends on the amount of evidence, number of hypotheses, detail of the weighting, degree of
involvement of the decision maker, and other considerations. Examples of WoE tables for integrated
evidence from actual assessments are presented in Table 6-2 through Table 6-4.
[Figure: four aggregation paths: (a) weight pieces of evidence, then weigh the body of evidence; (b) weight a category of evidence, then weigh the body; (c) weight pieces, then weight categories, then weigh the body; (d) weight pieces, then weight narrow categories, then weight broad categories, then weigh the body.]
Figure 6-2. Some alternative approaches (a-d) to weighting and weighing evidence based on different approaches to aggregating evidence. Categories could be types of evidence or evidence for characteristics of the quality of interest. The choice depends on the amount and diversity of evidence, the circumstances of the assessment, and the judgments of the assessors. From Norton et al. (2014).
Box 6-1. Collective Properties of Bodies of Evidence. Modified from Norton et al. (2014).
The following collective properties of the body of evidence can be considered in addition to considering the
relevance, strength, and reliability of the pieces or types of evidence. As with the properties of pieces or
types of evidence, any of these could be used, depending on their relevance and utility for distinguishing
among the hypotheses.
Number—more pieces or categories of evidence within a body of evidence increase the reliability, if they
are consistent and are generated independently.
Coherence—logical consistency among types of evidence increases the reliability of a body of evidence,
particularly when empirical and mechanistic evidence and evidence from a case and from elsewhere are
coherent.
Absence of bias—consistent results from different funders and types of investigators (e.g., academic,
industry, nongovernmental organization, government) are more reliable.
Diversity—evidence from more responses, taxa, life stages, or communities under more conditions is more
likely to detect important exposures, responses, and relationships reliably.
Table 6-1. A generic weight-of-evidence table for n alternative causal hypotheses (H1, H2, ... Hn), based on causal characteristics and collective properties of the bodies of evidence

                                           Combined Weight
                                           H1    H2    ...   Hn
Characteristics of Causation
  Co-occurrence
  Sufficiency
  Time order
  Interaction
  Specific alteration
  Antecedents
Body of Evidence, Collective Properties
  Number
  Coherence
  Absence of bias
  Diversity
Integrated WoE
  WoE for the hypothesis
Table 6-2 presents, for illustrative purposes, a subset of the types of evidence and causal hypotheses from
a WoE table for determining the cause of a precipitous decline in the abundance of San Joaquin kit foxes
on the Elk Hills Naval Petroleum Reserve, California (U.S. EPA, 2009a). The full table occupies
multiple pages. The evidence was scored using the types and standard scores in CADDIS, which
eliminates the need to weight the evidence for relevance, strength, and reliability. However, it requires
accepting the relevance of the standard weights and scores. One collective property of the body of
evidence (coherence) is the primary basis for weighing the body of evidence. The conclusion was that
predation by coyotes was the dominant cause of the decline and the evidence was convincing. The
coyotes appear to have increased on the site (co-occurrence), the cause of death of radio-collared foxes
was known (evidence of exposure or mechanism), population modeling showed that predation was
sufficient alone to account for the decline (stressor-response relationship), and coyote control coincided
with an end to the decline (manipulation of exposure). Toxic chemicals were present and exposure
occurred and was documented by analysis, but far too few foxes were significantly exposed to account for
the decline. Accidents contributed but the contribution was relatively small. Note that the + for
coherence of disease indicates that the evidence consistently fails to support that hypothesis, so that cause
can be eliminated.
Table 6-2. Partial WoE table for alternative possible causes of the decline of San Joaquin kit foxes. Only 4 of 6 potential causes and 7 of 16 types of evidence are included. Adapted from U.S. EPA (2009a).

Types of Evidence                                          Predation  Toxics  Accidents  Disease
Spatial/temporal co-occurrence                             +          +       +          -
Temporal sequence                                          0          0       NE         NE
Evidence of exposure or biological mechanism               ++         ++      ++         - -
Causal pathway                                             +          +       +          0
Manipulation of exposure                                   +          NE      NE         NE
Stressor-response relationships from simulation models     +++        -       +          - -
Coherence                                                  +++        -       +          +
Table 6-3 provides an example of a WoE table from a risk assessment. One hazard in the Bristol Bay watershed assessment was the potential failure of a diesel fuel pipeline at a stream or river crossing, so evidence was weighed for the hypothesis that a spill at a crossing would reduce salmon production (U.S. EPA, 2014a). Unlike the generic types of evidence in Table 6-2, Table 6-3 shows types of evidence that are specific to the hypothesis. Evidence concerning the exposure-response relationship was available from laboratory tests of dissolved or dispersed diesel fuel and from eight studies of diesel spills into streams. Exposure was estimated by modeling hypothetical spills to estimate dissolved concentrations, diesel/water ratios, and total volume spilled. The results of combining the weights across properties for each type are presented as brief narratives rather than as combined scores, but the scores for each property are combined across types of evidence. This WoE table shows the consistency of results (all are supportive). It also illustrates where the weakness in the evidence lies—studies of oil spills in streams have not characterized exposure well, as indicated by ambiguous reliability.
Table 6-4 presents a table with combined scores for assessing whether major ions in streams cause
extirpation of benthic invertebrates (U.S. EPA, 2011b). The evidence is categorized in terms of
characteristics of causation and is derived from the results of scoring tables for each characteristic like
Table 5-4. That is, the summary score from weighting evidence for sufficiency in Table 5-4 (+++) is the
sufficiency score in Table 6-4 (also +++).
Table 6-4 illustrates how explanation of the pieces or types of evidence in terms of characteristics of causation elucidates the meaning of the evidence for a hypothesis. For example, elevated concentrations of a chemical in numerous streams with impaired communities mean that the chemical and community co-occur, and therefore, exposure is likely. If the concentration is greater than concentrations that cause relevant toxic effects, that means the exposure is believed to be sufficient to cause the effect. Although such explanations of the evidence might appear self-evident to assessors, they can be an important part of the inference. Although organizing evidence in terms of characteristics serves primarily to help explain the
implications of the evidence, it can also show that a study can provide evidence of multiple
characteristics. For example, a laboratory toxicity test can provide evidence of sufficiency (the level of
exposure required to cause effects) and of alteration (symptomatic effects of the exposure that may be
observed in the field).
Table 6-3. Summary of evidence concerning risks to fish from a diesel spill (U.S. EPA, 2014a). The risk characterization is based on weighing four pieces of evidence for different routes of exposure. All evidence is qualitatively weighted on three properties: logical implication, strength, and reliability of methods. Here, all pieces of evidence have the same logical implication: all suggest a diesel spill would have adverse effects. Each entry gives the route of exposure and the source of evidence (exposure/E-R).

Dissolved hydrocarbons: model/laboratory acute tests
  Logical implication(a): +   Strength: +   Reliability (exposure): 0   Reliability (E-R(b)): +
  Result: Modeled dissolved diesel concentrations are clearly lethal to invertebrates and approximately lethal to trout.

Dissolved hydrocarbons: model/laboratory-based standard
  Logical implication: +   Strength: ++   Reliability (exposure): 0   Reliability (E-R): ++
  Result: Modeled dissolved diesel concentrations greatly exceed the state standard.

Dispersed hydrocarbons: diesel oil/water ratio/laboratory acute tests
  Logical implication: +   Strength: ++   Reliability (exposure): 0   Reliability (E-R): +
  Result: Diesel oil/water ratios in the spills and in tests suggest lethality to invertebrates and trout.

All routes in actual spills: amount spilled/observed effects
  Logical implication: +   Strength: ++   Reliability (exposure): +   Reliability (E-R): +
  Result: Diesel spills in other streams cause acute biological effects.

Integrated weight of evidence
  Logical implication: +   Strength: ++   Reliability (exposure): 0   Reliability (E-R): +
  Result: The effects by four types of evidence are consistent, and the observed effects are strong. The greatest uncertainty is the relation of laboratory to field exposures.

Notes:
a Logical implication indicates relevance and a particular direction of the evidence (supports or weakens).
b E-R = exposure-response relationship.
Table 6-4. Example of weighing evidence for a potential cause, major ions measured as conductivity, of the loss of macroinvertebrate genera (U.S. EPA, 2011b). The evidence is organized in terms of characteristics of causation. The score follows each characteristic in parentheses.

Co-occurrence (+++)
  Loss of genera occurs when conductivity is high but is rare when conductivity is low.
Antecedence (+++)
  Sources of the ionic mixture are present and are shown to increase stream conductivity in the region.
Interaction (++)
  Aquatic organisms are directly exposed to dissolved ions. Based on first principles of physics, ionic gradients in high-conductivity streams would not favor the exchange of ions across gill epithelia. Physiological studies over the past 100 years have documented the many ways that physiological functions of organisms are affected by the relative amounts and concentrations of ions (i.e., combinations of ions that some genera do not have mechanisms or the capacity to regulate).
Alteration (+++)
  Some genera and other response metrics and assemblages are affected at sites with higher conductivity, whereas others are not. These differences are characteristic of relative sensitivity to high conductivity.
Sufficiency (+++)
  Laboratory analyses report results of effects for a tolerant species, but test durations and most ionic compositions are not representative of exposure in streams. Based on field observations, however, regular increases in effects on invertebrates with increased ionic exposure indicate that exposures are sufficient.
Time order (NE)
  Conductivity is high and extirpation has occurred after mining permits are issued, but conductivity and biological data before and after mining began are not available.
Summary of body of evidence (Very likely)
  Five characteristics are supported and none weaken the body of evidence that increases in conductivity cause extirpation of freshwater benthic invertebrates.
6.3. Interpreting Bodies of Evidence
Interpretation is the step in which the bodies of evidence for each hypothesis are used to determine which,
if any, hypothesis is supported, and therefore, is likely to be true. Depending on the case, the
interpretation of evidence could be performed by comparing alternative hypotheses or by judging whether
the evidence sufficiently supports a particular hypothesis. Comparison of alternatives typically occurs in
specific causal assessments (e.g., What is causing the low species richness in Jones Creek?). The
hypothesis with the greatest weight of evidence is the most likely of the assessed possible causes. This is
not just a matter of counting the scores. In particular, the coherence of each body of evidence with
respect to the associated hypothesis is considered.
Multiple hypotheses might be likely or at least well supported. For example, both a water treatment plant
effluent and stream channelization might be sufficient causes of low species richness in a stream, so both
hypotheses could be accepted.
Judging a single hypothesis can be conceptualized as comparing the hypothesis against its negation
(e.g., WoE for teratogen versus WoE for not a teratogen) or as a comparison of the weight of evidence for
the hypothesis with a standard of sufficient evidence (e.g., a sufficient WoE for teratogenicity versus
insufficient evidence). One particular hypothesis can be judged in some particular types of assessments:
(1) a general causal assessment (e.g., Does selenium cause terata in frogs?), (2) specific causal
assessments for which only one potential cause is considered (e.g., Does the effluent cause the observed
biological impairment?), (3) a condition assessment (e.g., Is the stream impaired?), or (4) an outcome
assessment (e.g., Has the wetland recovered?).
These approaches to weighing evidence are consistent with the conventional standard of proof in civil legal proceedings and public policy, the "preponderance of the evidence." To meet a preponderance-of-evidence standard, a hypothesis must be shown to be more likely than its negation or than the alternative hypotheses. This standard is met by comparing the weights of the bodies of evidence for a hypothesis and its alternatives.
Hume (1748), Laplace (1812), and Sagan (1980) all made the point that extraordinary claims require
extraordinary evidence. This inferential rule should be kept in mind when interpreting bodies of
evidence. A hypothesis is extraordinary if there is no prior evidence for it (e.g., the agent has not
previously been shown to cause the effect or has been shown to be effective only at much higher levels),
or it is simply not plausible. Extraordinary evidence is very weighty evidence: highly relevant, strong,
and reliable.
Interpreting the evidence for hypotheses usually involves applying a three-value logic (e.g., yes, no, maybe; true, false, uncertain; or high, low, intermediate). If the interpretation is comparative, as in determining which alternative cause is best supported by the evidence, examining a summary table of the scored evidence may be enough to identify the likely causal hypothesis. Even if the best hypothesis is not clear, hypotheses that clearly are not supported can nearly always be identified and eliminated. For example, an assessment of the cause of a precipitous decline in smallmouth bass in the Susquehanna and Juniata Rivers, Pennsylvania, using the CADDIS WoE method and existing information was inconclusive (Shull and Pulket, 2015). It did, however, classify the 18 candidate causes into three bins: 2 likely, 8 unlikely, and 8 uncertain. These results are being used to prioritize and design subsequent research and monitoring. Similarly, when judging a single hypothesis (e.g., spilled tailings would pose a substantial risk to salmon production), the hope is for a body of evidence that clearly either supports or counters the hypothesis, but the evidence might be ambiguous (U.S. EPA, 2014a).
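Binning integrated scores with a three-value logic can be made explicit, as in this sketch. The numeric scores, the cutoff of two units, and the labels are hypothetical illustrations in the spirit of the Susquehanna example, not a prescribed procedure.

```python
# Hypothetical integrated scores for candidate causes (-3 .. +3);
# the cutoff of |2| for a clear call is illustrative, not prescribed.
integrated = {"predation": 3, "toxics": -1, "accidents": 1, "disease": -3}

def interpret(score: int) -> str:
    """Map an integrated score to a three-value conclusion."""
    if score >= 2:
        return "likely"
    if score <= -2:
        return "unlikely"
    return "uncertain"

for cause, score in integrated.items():
    print(f"{cause}: {interpret(score)}")
```

Even with such a rule in place, the surrounding text emphasizes that interpretation remains a matter of logic, not score counting, so any such binning should be treated as a starting point for judgment.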
Some WoE systems have standard categorical outcomes defined in terms of properties of the body of evidence (U.S. EPA, 2013, 2005). A description of the potential weights of evidence used for causal
determinations (i.e., categorical outcomes for general causal assessments) of criteria air pollutants is
shown in Table 6-5. In such cases, interpreting the evidence involves matching the properties of the body
of evidence to one of the categories. The definitions of categories, however, could be based on the scored
properties rather than narratives as in this example.
Ultimately, the interpretation of evidence is a matter of logic applied to evidence and background
knowledge, preferably by multiple individuals with a range of expertise. It is not simply a matter of
adding or counting scores. In particular, some pieces of evidence are conclusive alone. For example, if
the effect precedes a potential cause or if an aqueous effect occurs upstream of the putative source, the
cause can be eliminated even if other evidence is positive. More commonly, one piece or type of
evidence will influence the interpretation of other evidence. Even WoE systems that use standard scoring
criteria and arithmetic integration of numerical scores rely on logic to identify and correct results that are
contrary to knowledge of the system (CPA Sediment Task Group, 2008; Johnston et al., 2002). As a
result, applying expert judgment in any WoE is necessary to achieve an adequate, explanatory account.
At this stage, it is desirable to limit the interpretation to the hypotheses and evidence that were defined in
advance. Otherwise, "ad hoc machinations" to achieve an acceptable answer can occur (Douglas, 2012).
If interpreting the evidence does not identify a clear preponderance of evidence for a hypothesis,
techniques that go beyond the simple weighing of evidence can be applied, as explained in the following
section. However, it should be remembered that these techniques violate the admonition against improvised reasoning, so they are less convincing than results of weighing bodies of evidence and hypotheses defined in advance.
Table 6-5. Weight of evidence for causal determinations in the 2013 lead ISA (U.S. EPA, 2013).

Causal relationship
  Evidence is sufficient to conclude a causal relationship with relevant pollutant exposures (i.e., doses or exposures generally within one to two orders of magnitude of current levels). That is, the pollutant has been shown to result in effects in studies in which chance, bias, and confounding could be ruled out with reasonable confidence. Controlled exposure studies (laboratory or small- to medium-scale field studies) provide the strongest evidence for causality, but the scope of inference could be limited. Generally, determination is based on multiple studies conducted by multiple research groups, and evidence that is considered sufficient to infer a causal relationship is usually obtained from the joint consideration of many lines of evidence that reinforce each other.

Likely to be a causal relationship
  Evidence is sufficient to conclude a likely causal association with relevant pollutant exposures. That is, an association has been observed between the pollutant and the outcome in studies in which chance, bias, and confounding are minimized, but uncertainties remain. For example, field studies show a relationship, but suspected interacting factors cannot be controlled, and other lines of evidence are limited or inconsistent. Generally, determination is based on multiple studies in multiple research groups.

Suggestive of a causal relationship
  Evidence is suggestive of a causal relationship with relevant pollutant exposures, but chance, bias, and confounding cannot be ruled out. For example, at least one high-quality study shows an effect, but the results of other studies are inconsistent.

Inadequate to infer a causal relationship
  The available studies are of insufficient quality, consistency, or statistical power to permit a conclusion regarding the presence or absence of an effect.

Not likely to be a causal relationship
  Several adequate studies, examining relationships with relevant exposures, are consistent in failing to show an effect at any level of exposure.
6.4. Explaining Ambiguities and Discrepancies
The results of weighing bodies of evidence can be ambiguous and discrepancies among pieces or types of
evidence can occur. Examples include WoE results for which:
1.	The bodies of evidence for all assessed hypotheses are incoherent.
2.	One hypothesis has sufficient, consistently positive evidence to support acceptance, but some
other hypotheses also have some relevant and reliable positive evidence.
3.	No hypothesis has predominantly positive evidence.
In ambiguous, discrepant, or nonsensical cases, one can stop the assessment process and call for more
data or proceed to perform follow-on analyses to resolve problems with the existing evidence. The latter
course could allow successful completion of the assessment by reconsidering the evidence or by
reformulating the hypotheses. Having completed the WoE analysis, some hypotheses could have been
eliminated, but the grounds for elimination might need re-examination. Also, assessors should have
become deeply familiar with the evidence. That familiarity and the assessors' background knowledge
could make it possible to reinterpret the data or revise the hypotheses to explain and resolve the
ambiguities. Reinterpretations, however, provide opportunities to fabricate false explanations of apparent
patterns in the evidence. Formulating hypotheses to accommodate the evidence is discouraged in
scientific inference because it can lead to bias or self-deception. It has come to be known as HARKing
(hypothesizing after results are known). As in other judgments during the assessment process, clearly
articulating arguments, avoiding overinterpretation, involving multiple assessors with different
perspectives, and using a clear and consistent method to build the case are essential. A generally useful
approach is to (1) list the discrepancies, (2) ask whether each discrepancy could be explained by a
misinterpretation of the evidence or by a misspecification of the hypothesis, (3) determine whether all
discrepancies can be resolved by some combination of reinterpreting evidence and respecifying
hypotheses, and (4) evaluate additional evidence relevant to the explanations.
Reinterpreting the evidence is facilitated by considering ambiguities in the data and the methods used to generate them. As presented in Appendix E, Table E-3, toxicity tests might not include sensitive species, life stages, or responses; the bioavailability or form of a toxicant might be inappropriate; the exposure durations might be too long or too short; etc. Biological surveys could be conducted in the wrong season, be compared to an inappropriate reference, measure responses or taxa that are insensitive to the causal agent, etc. Measures of exposure might miss episodic events, be taken under atypical conditions, not include the causal agents, be disjunct from the biological samples, etc. Any of the evidence could be derived from studies that are biased, poorly designed, or poorly conducted in ways that were not reported or not recognized in the weighting. Inventing explanations is not difficult—the difficulty is determining which are likely to be true.
In addition to reinterpreting the evidence, assessors might explain discrepancies by changing the
hypotheses or adding hypotheses in light of the evidence. At least four strategies can be applied.
Redefine the endpoint effect. Causal relationships are often unclear because the effect is unclear. For
example, determining the cause of decline in peregrine falcons depended on redefining the effect as
reproductive failure associated with thin eggshells. That more specific definition steered assessors away
from potential causes such as habitat loss, shooting, and egg collecting and toward toxic effects. In some
cases, discrepancies might be explained by defining the spatial or temporal scope of the effect more
specifically. For example, defining the effect as the occurrence of fish kills in a stream as a whole will
lead to discrepancies in the evidence if kills occur only in a reach below an outfall and the dead fish drift
downstream.
Reconsider sources and agents. Discrepancies in evidence could be due to sources, components of
emissions, or actions that were not considered, because they were overlooked or of no concern to the
decision maker or stakeholders.
Integrate causal agents or networks. Rather than being alternatives, the proposed causes of an effect
could be acting jointly. Also, a proposed proximate cause might actually be an indirect cause that
contributes to the true proximate cause. For example, habitat disturbance did not cause the decline in San
Joaquin kit foxes on the Elk Hills Petroleum Reserve, but it apparently made the foxes more susceptible
to the major cause, predation (U.S. EPA. 2009a). Such possibilities can be represented in practice by
rearranging the components of a conceptual model or by combining alternative conceptual models. The
conceptual model can be revised by removing the box-and-arrow combinations that are not supported by
evidence and examining how the remaining boxes and arrows might link within the model or among
models for different causes.
Look for patterns in the evidence. Looking for properties that the pieces or categories of evidence
supporting a hypothesis share, which the contrary evidence does not share, might be helpful. This
examination could be facilitated by creating a matrix of evidence versus relevant attributes of the
evidence. Different results might be observed, for example, in field versus laboratory studies, lotic versus
lentic studies, industry-funded versus foundation-funded studies, insect versus crustacean studies, etc.
Reinterpreted evidence or revised hypotheses could explain discrepancies in a manner that seems
convincing to the assessors who developed them, but explanations improvised after the analysis to
account for discrepancies might not convince others without independent evidence. Independent
evidence might have been available but unused in prior weighing of evidence, because it had not been
relevant. For example, if the proposed resolution of a discrepancy was low bioavailability of metals,
measurements of dissolved organic matter might be used to confirm that explanation. Similarly, if low
dissolved oxygen is a proposed cause of a fish kill, the survival of air-breathing organisms such as frogs
and turtles is supportive (Cormier et al., 2010). Ideally, confirmatory evidence would be generated by
identifying a previously unmeasured or unobserved property of the system that should occur under the
revised hypothesis and then taking the measurements or observations that would establish its occurrence.
If the amount of evidence and the number of possible explanations of the evidence are substantial, a
formal reweighing of the evidence to address the revised hypotheses might be appropriate.
6.5. Presenting Results
The results of a qualitative WoE have two parts: the conclusion and the justification. The conclusion, in
the best case, is a statement of the hypothesis that is clearly supported by the WoE (e.g., the cause is the
thermal effluent, the stream is biologically impaired, or the population has recovered). Ambiguous results
require more explanation (e.g., hypothetical causes 1, 3, and 5 can be eliminated, but 2 and 4 are
somewhat supported). Graphical presentations can be useful, such as a revised conceptual model of the
identified causal relationship or a map of the areas biologically impaired by the waste. When the
assessments are sufficiently standardized, a standard reporting format could be implemented as in the
International Uniform Chemical Information Database (IUCLID) dossier for assessing the toxic
properties of chemicals in the European Union (ECHA, 2010).
The justification of a conclusion is a summary of the evidence and logic that support the conclusion. The
justification, which will vary among assessments, typically includes:
1.	Statements of overall weight of evidence for the conclusion (e.g., the evidence convincingly
supports, strongly supports, or somewhat supports the cause) relative to the weights for
alternatives or relative to a standard of evidence.
2.	Statements of consistency or coherence (e.g., all the evidence was either convincingly or strongly
supportive of the cause, except for laboratory acute tests, which had low relevance).
3.	Statements of the extent to which characteristics of causation or other relevant characteristics are
satisfied by the evidence (e.g., evidence was available to support all characteristics of causation
except time order). Alternatively, state the extent to which standard types of evidence were
available (e.g., all three components of the sediment quality triad were available). In any case,
state how the body of evidence provides an explanation of the credibility of the accepted
hypothesis.
4.	Statements explaining why the conclusion is justified despite a low weight of the body of
evidence, if necessary (e.g., even if other causes could be substantially contributing, collection
and treatment of the effluent is likely to improve conditions and is mandated by regulations).
6.6. Summary
Having assigned weights to the pieces of evidence, the process of weighing evidence integrates and interprets the body of evidence to determine which, if any, hypothesis has sufficient weight to be accepted. Depending on the amount and types of evidence, this may include a process of aggregating the evidence
into categories and integrating the weights within categories. Once the weights have been integrated, the
bodies of evidence for all hypotheses should be compared and the results interpreted. Interpretation may
be obvious (e.g., only one hypothesis has positive evidence) but often it requires careful logic and
judgment informed by the knowledge and experience of multiple assessors. If interpretation does not
provide a conclusion, an analysis of the ambiguities and discrepancies that confounded the interpretation
can suggest what additional evidence or alternative hypotheses should be considered. Finally, the results
should be presented in a way that make clear the results and the process that generated them.
7. SPECIAL CASES AND ABBREVIATED PROCESSES
In some circumstances, the full three-step WoE process is not necessary or practical, but a two-step
process is useful. The systematic assembly of evidence is always appropriate, but weighting might be
unnecessary if all evidence is equivalent (Figure 7-1a), and weighing a body of evidence is unnecessary if
only one piece of evidence is available (Figure 7-1b).
Figure 7-1. Steps in abbreviated weight-of-evidence processes: (a) skipping the weighting step when all evidence is equivalent (assemble evidence, then weigh the body of evidence) or (b) weighting a single piece of evidence when multiple pieces are not available (assemble evidence, then weight the evidence).
7.1. Weighing Without Weighting
Some WoE methods presume that all evidence is equal or at least that the differences are uninformative,
so the weighting step (Section 5) can be skipped. For example, if all evidence is reliable and of the same
type (e.g., multiple acute lethality tests performed by standard protocols), each piece of evidence will
have equal influence in most cases. In practice, weighting is often skipped without determining that the
relevance, strength, and reliability of the evidence are similar across pieces of evidence. Instead, if the
pieces of evidence pass the screening step (see Section 3), the differences are assumed unimportant. An
example of weighing evidence without weighting is the causal determination in an ISA regarding
exposure to a specific air pollutant and specific effects (U.S. EPA, 2013).
A case for which distinguishing weights was unnecessary is provided by the studies used to determine
that ingesting lead from lead mining and smelting caused the tundra swan kills in the Coeur d'Alene
basin, Idaho (Table 7-1). The data used in the evidence were generated for the case, so they were all
highly relevant; the studies were conducted by competent and reputable investigators following
prescribed quality standards, so they were highly reliable; and the relationships were all strong. Because
of that ideal situation, the presentation of evidence in WoE Table 7-1 is sufficient. Table 7-1 organizes
the evidence in terms of characteristics of causation, rather than listing pieces or types, and briefly
summarizes the evidence for each.
Table 7-1. Summary of evidence for lead as a cause of mass mortality of tundra swans in the Coeur d'Alene River Watershed (Norton et al., 2014). Based on evidence from URS Greiner Inc. and CH2M Hill (2011).

Co-occurrence: Swan kills occurred in Pb-contaminated lakes and wetlands and not elsewhere in the region.
Sufficiency: Mortality occurred in laboratory tests at Pb doses and body burdens observed in dead or moribund swans in the field; consistent mortality occurred in the field at blood Pb levels >0.5 µg/g.
Time order: No evidence—no pre-mining information on swan mortality.
Interaction: Dead and moribund swans had high blood and liver Pb levels, and Pb-contaminated sediments were found in swan guts and excreta.
Specific alteration: Swans had pathologies characteristic of Pb toxicity, particularly enlarged gall bladders containing viscous dark green bile.
Antecedents: Spills of Pb mine tailings and atmospheric deposition from smelters account for the high sediment Pb levels.
7.2. Weighting a Single Piece of Evidence
The concept of WoE was developed to combine multiple pieces of evidence in a way that gives each
piece proper influence on the conclusion. When only one piece of evidence is available for an inference,
the integration function of WoE does not apply. WoE, however, also leads assessors to consider the
importance that should be assigned to a body of evidence and reveals the assessors' judgments to decision
makers and stakeholders. This purpose of evaluating and interpreting a body of evidence can apply to a
single piece. That is, explicitly evaluating the relevance, strength, and reliability of a single piece of
evidence can be informative.
Weighting a single piece of evidence also might provide consistent communication of the assessment
process. That is, if many pieces of evidence are evaluated and scored for some hypotheses, the reader
will expect to see scores for all hypotheses. If a hypothesis has only a single, stand-alone piece of
evidence, scoring that one piece provides the expected consistency in weighing the evidence for all
hypotheses.
In addition, weighting a single piece of evidence might reveal its inadequacy and lead to obtaining and
weighing more evidence. For example, for the Patrick Bayou Superfund site, assessors intended to use
modeling of toxicity to an amphipod as their primary approach to assessing risks to benthic invertebrates
(Anchor QEA, 2013). However, they found that the data were unreliable due to lack of reference data,
confounding, poor correspondence to tests of other species, and unreliable polychlorinated biphenyl
analyses. As a result, they adopted a WoE approach using the conventional triad of sediment quality
evidence.
8. WEIGHT OF EVIDENCE FOR QUANTITATIVE RESULTS
Quantitative results of ecological assessments, such as benchmark concentrations, areas of habitat lost, or
rate constants, might be derived from multiple pieces of quantitative evidence. Commonly, these
quantitative analyses follow the qualitative analyses that establish the hazard or other qualities to be
quantified (Figure 8-1). Methods for deriving quantities by weighing evidence are discussed in Appendix
B. They fall into two basic approaches: combine the quantitative evidence or choose the best quantitative
evidence. Finally, weights can be assigned to the quantitative results to indicate how influential they
should be.
Figure 8-1. Potential steps in a process for using WoE to derive a quantitative result: weigh evidence for the quality to be quantified; then merge the quantitative evidence or choose the best quantitative evidence; and finally weight the quantitative results. Note that the top box of this process diagram encompasses the qualitative WoE process (Figure 3-2).
8.1. Weight of Evidence for the Quality to Be Quantified
Before deriving a quantity, determining what quality is to be quantified (Table 8-1) is necessary. This
determination could require weighing evidence. For example, conventional risk assessments quantify, if
possible, the magnitude and likelihood of a defined hazard. Environmental hazards are potential effects
on an attribute of a biological entity (the assessment endpoint) resulting from exposure to an agent in
particular conditions (described by the conceptual model). Similarly, criterion assessments quantify a
threshold level of exposure corresponding to an unacceptable level of effect, and the effect itself is a
quality that might be identified by WoE. Qualitative assessment could even precede the derivation of a
model parameter. For example, a narrative WoE was used to determine whether a bioavailability
adjustment factor should be used at a dioxin-contaminated site, and then the factor's value was chosen
based on a prior quantitative WoE in a published review (Integral Consulting Inc., 2013). Quantitative
condition assessments might determine the magnitude and spatial extent of impairment after a site has
been determined impaired by qualitative WoE applied to multiple metrics. In sum, this qualitative WoE
determines whether the property to be quantified is real and significant in the context of the assessment.
Table 8-1. Qualities that could be identified by qualitative WoE and the associated quantities that could be derived by the quantitative WoE process (see Figure 3-2)

Example Quality | Example Corresponding Quantity
General cause: teratogen | Sufficient maternal body burden
General cause: carcinogen | Slope factor
Specific cause: chloride in effluent as cause of fish kills | Allowable maximum concentration
Specific cause: ammonia in stream as cause of biological impairment | Total maximum daily load
Bioaccumulative | Bioaccumulation factor
Exposure | Bioavailable concentration at point of contact
Susceptibility | Probability of responding
Alteration | Area of wetland
In some cases, assessments weigh evidence to derive a quantitative result without a prior WoE for a
quality. For example, the derivation of ambient water quality criteria for aquatic life is nearly always
based on a generic endpoint—protection of aquatic life from effects of direct aqueous exposures on
survival, growth, or reproduction. However, for some criteria the generic endpoint is found to not be
protective because an important specific effect or route of exposure is not adequately addressed. An
example is the water quality criterion for selenium, which addressed a nonstandard effect (skeletal
deformities in vertebrates) and route of exposure (exposure in ovo to selenium accumulated from a food
web by the female parent). Such examples suggest that it may be desirable to weigh evidence for
nonstandard qualities. Even if default qualities are assumed, the first step of WoE, the systematic
assembly of evidence, should be performed to obtain all potentially useful quantitative estimates as well
as associated information.
The place of these qualitative WoE processes in an overall assessment process depends on the type of
assessment. For a conventional risk assessment, a WoE to choose the endpoints (i.e., What effects does
the chemical or other agent cause in which potentially exposed organisms?) would be part of the problem
formulation. The quantitative analysis and characterization phases follow to quantify the risks. In other
cases, separate qualitative and quantitative assessments are performed. For example, a qualitative causal
assessment to determine the cause of an observed effect may be followed by a quantitative risk
assessment to derive a total maximum daily load, cleanup goal or other quantitative benchmark. In cases
like the bioavailability adjustment factor example presented earlier, the qualitative and quantitative WoE
processes might both be embedded in the analytical phase of an assessment.
8.2. Weight of Evidence for Deriving the Quantity
If more than one source of data with acceptable relevance and reliability is available to derive a
quantitative result, the data sets should be weighed. Conventionally, the data sets would be integrated by
combining the quantitative evidence or by choosing the best (i.e., weightiest) evidence. Another
approach, the Rule of Five, has been useful in weighing evidence to determine a contaminated site
cleanup goal (Box B-2).
8.2.1.	Combining quantitative evidence
If the quantitative value can be derived from multiple data sets that are of sufficiently high relevance and
reliability, the data sets can be numerically combined (Appendix B). In past EPA practices, this
combining has involved taking the arithmetic or geometric mean. For example, when multiple acceptable
LC50 values are available for a species-chemical combination, their geometric mean is used in deriving
national ambient water quality criteria (U.S. EPA, 1985). Geometric means of toxicity values are also
used to derive soil screening levels (U.S. EPA, 2005a). Rather than treating all evidence equally, the
values might be weighted before they are combined. Quantitative weighting could use a quantitative
property of the study that influences its reliability, such as the inverse variance.
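As a minimal sketch of this arithmetic (the LC50 values and the log-scale variances below are hypothetical, chosen only for illustration), an inverse-variance weighted geometric mean can be computed as follows:

```python
import math

# Hypothetical LC50 values (mg/L) for one species-chemical combination,
# each with an assumed variance of its log10 value standing in for study precision.
lc50_values = [4.2, 5.1, 3.6, 6.0]          # mg/L; illustrative only
log_variances = [0.04, 0.10, 0.02, 0.25]    # var of log10(LC50); illustrative only

# Unweighted geometric mean, as used in deriving water quality criteria.
logs = [math.log10(v) for v in lc50_values]
geo_mean = 10 ** (sum(logs) / len(logs))

# Inverse-variance weighting: more precise estimates get more influence.
weights = [1.0 / v for v in log_variances]
weighted_log = sum(w * x for w, x in zip(weights, logs)) / sum(weights)
weighted_geo_mean = 10 ** weighted_log

print(f"Geometric mean: {geo_mean:.2f} mg/L")
print(f"Inverse-variance weighted geometric mean: {weighted_geo_mean:.2f} mg/L")
```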
Alternatively, quantities might be weighted based on some qualitative property. In this case, the
qualitative weights are converted to numerical equivalents. For example, high, moderate, and low
reliability of the study design might be converted to weights of 1, 0.7, and 0.5 or some other values,
depending on how much influence study design should have and how great the variance in the study
design might be.
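A corresponding sketch for qualitative weighting, using the example conversion just described (the reliability ratings and LC50 values are hypothetical):

```python
import math

# Numerical equivalents for qualitative study-design reliability,
# using the example values from the text (1, 0.7, and 0.5).
RELIABILITY_WEIGHTS = {"high": 1.0, "moderate": 0.7, "low": 0.5}

lc50_values = [4.2, 5.1, 3.6, 6.0]              # hypothetical LC50s (mg/L)
ratings = ["high", "moderate", "high", "low"]   # hypothetical reliability ratings

logs = [math.log10(v) for v in lc50_values]
weights = [RELIABILITY_WEIGHTS[r] for r in ratings]
weighted_log = sum(w * x for w, x in zip(weights, logs)) / sum(weights)
print(f"Quality-weighted geometric mean: {10 ** weighted_log:.2f} mg/L")
```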
Rather than using numerical weights to express the properties of the individual estimates, qualitative
weights can be applied to the combined estimate. That is, after a combined value is derived, it might be
assigned scores to express one or more qualitative aspects of its relevance or reliability. The weighted
mean or other quantity derived from multiple sources should provide greater confidence than a quantity
derived from any single source. Otherwise, the evidence should not have been combined.
8.2.2.	Choosing the best quantitative evidence
If the range of relevance or reliability of alternative numerical values is large, choosing the best value
rather than combining them is advisable. That is, the numerical values should be weighted for relevance
and reliability, and the weightiest one used. (Note that in some contexts, established policy might require
using the most protective acceptable value to ensure the degree of protection required by law.) Choosing
the best value is particularly advisable when values can be derived from multiple types of evidence. For
example, if a benchmark value of a contaminant could be derived from laboratory toxicity tests, a
mesocosm test, or a regional field survey, the results are likely to represent qualitatively different effects
and should not be averaged, with or without weighting.
Using WoE to choose the best estimate is illustrated by Table 8-2. This table format presents the
magnitude or probability of effects versus integrated weights of evidence for types or groups of evidence
as recommended by Hope and Clarkson (2014). The table shows risk assessment results (probability of
achieving the effect endpoint) estimated by analysis of each of four evidence groups (numbered circles)
with integrated weights derived as in Section 6. In this hypothetical example, Evidence Group 3 might be
chosen because it has the highest weight and is not an outlier in terms of the quantitative result
(probability of effect). The result might be: Group 3 (evidence derived from stream mesocosm tests) is
convincing and gives an estimate of a 15% probability of reduced species richness.
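A sketch of that selection logic follows; the group names, weights, estimates, and the simple outlier screen are all hypothetical simplifications of the approach in Table 8-2:

```python
# Hypothetical evidence groups with integrated weights (number of "+" marks)
# and risk estimates (probability of achieving the effect endpoint).
groups = [
    {"name": "Group 1", "weight": 1, "p_effect": 0.02},
    {"name": "Group 2", "weight": 2, "p_effect": 0.35},
    {"name": "Group 3", "weight": 3, "p_effect": 0.15},
    {"name": "Group 4", "weight": 2, "p_effect": 0.12},
]

def is_outlier(group, others, factor=5.0):
    """A simple screen: flag estimates far from the median of the other groups."""
    estimates = sorted(g["p_effect"] for g in others)
    median = estimates[len(estimates) // 2]
    return group["p_effect"] > factor * median or group["p_effect"] < median / factor

# Keep groups that are not outliers, then choose the weightiest one.
candidates = [g for g in groups
              if not is_outlier(g, [o for o in groups if o is not g])]
best = max(candidates, key=lambda g: g["weight"])
print(f"Chosen: {best['name']} (weight {best['weight']}, "
      f"P(effect) = {best['p_effect']:.0%})")
```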
Using WoE to choose the best value is not an established practice, but it is recommended for choosing
among alternative test data for the same species, endpoint, duration, life stage, and testing conditions by
the European Chemicals Agency (ECHA, 2008). Although the European Chemicals Agency does not
specify a method of weighing, it indicates that various studies can provide supporting information to add
weight to a particular test.
Table 8-2. WoE matrix to summarize quantitative risk estimates for four evidence types or groups (circles; the group chosen in the text, Group 3, is labeled) and their weights. Adapted from Hope and Clarkson (2014).

Probability of Achieving Effect Endpoint | Evidence Weight: 0 | + | ++ | +++
>75% | | | |
51-75% | | | |
25-50% | | | ● |
5-24% | | | ● | ● (Group 3)
<5% | | ● | |
8.3. Weighting a Quantitative Result
Quantitative results are often accompanied by statistical expressions of variability or uncertainty, but
determining how much weight the result should be given might also be useful. Weighting a quantitative
result could inform decision makers and stakeholders about the result's relevance and reliability,
irrespective of data scatter (Section 9). For example, an estimate of biological effects might have a small
confidence interval, but it could have been generated from laboratory test data with low relevance to the
case or might have been generated with poor controls or otherwise unreliable methods. As in weighting
qualitative evidence, the specific properties of relevance and reliability listed in Box 5-1 and Box 5-3 and
the collective properties listed in Box 6-1 could be included in the weighting of quantitative evidence if
they are important.
Strength is generally not a relevant property for weighting quantitative results. For example, an
exposure-response model with a high slope is strong, and therefore, lends weight to a causal hypothesis
(i.e., a qualitative WoE). However, the result of a quantitative analysis to derive a benchmark exposure
has whatever strength it has. An estimate of a benchmark value derived from an exposure-response
relationship with a high slope is not given more weight than one derived from a relationship with a low
slope.
Weighting might be used to determine the applicability of a number to particular uses. For example, at
contaminated sites, weights might be used to distinguish soil benchmark values that are suitable only for
screening from those that might be suitable for cleanup goals. In any case, weights can be used
along with statistical uncertainty measures to communicate confidence in results.
9. WEIGHT OF EVIDENCE AND UNCERTAINTY
The process of weighing evidence can be considered a means not only of deriving a conclusion, but also
of documenting the assessors' confidence in the conclusion. That is, the weights assigned to pieces and
categories of evidence are expressions of how confident the assessors are that the evidence should
influence the conclusion in one direction or another. Additionally, the overall weight of the body of
evidence is an expression of how much confidence the assessors have in that conclusion. If both
laboratory tests and field surveys indicate that a chemical causes spinal deformities in fish, and both
categories of evidence are evaluated and scored as highly relevant, strong, and reliable, we are confident
that the chemical is a cause of deformities. Even if the WoE analysis is limited to a narrative, the
narrative should describe the degree of confidence in the conclusion. For example, the categories of
results in the ISAs and IRIS hazard assessments are explicit expressions of confidence in WoE narratives
(e.g., "likely to be a causal relationship"—see Table 6-3). Such qualitative expressions of confidence are
appropriate for a qualitative conclusion such as the observed relationship is likely causal, the chemical is
likely a carcinogen, or the chemical is a contaminant of concern. If the standard scoring system was used
to weight evidence, the resulting expression of overall confidence might be that the evidence is
convincingly supportive, strongly supportive, or somewhat supportive of the conclusion. The appropriate
expression of weight/confidence depends on the assessment and the decision to be supported.
The confidence concerning numerical values should include qualitative weights as well as conventional
measures of scatter such as range or confidence limits (Figure 9-1) (Spiegelhalter and Riesch, 2011).2 For
example, one would have low confidence in a benchmark value that is based on irrelevant evidence, even
if it is statistically precise. Hence, presentations of quantitative results might have four parts:
1. The quality expressed—threshold for reproductive effects.
2. The numerical result and units—x mg/L.
3. Scatter—95% CI of ±0.2x mg/L.
4. Weight—the body of evidence is highly relevant and moderately reliable, diverse, and coherent.
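One way to keep such presentations uniform is to carry the four parts together, as in this hypothetical sketch (the field names and example values are illustrative, not a prescribed format):

```python
from dataclasses import dataclass

@dataclass
class QuantitativeResult:
    """A four-part presentation of a quantitative assessment result (hypothetical structure)."""
    quality: str   # what is being quantified
    value: str     # numerical result and units
    scatter: str   # statistical variability, e.g., a confidence interval
    weight: str    # qualitative weight of the supporting body of evidence

result = QuantitativeResult(
    quality="threshold for reproductive effects",
    value="x mg/L",
    scatter="95% CI of +/-0.2x mg/L",
    weight="highly relevant and moderately reliable, diverse, and coherent",
)
print(result)
```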
Figure 9-1. A diagram of the combination of statistical scatter and qualitative weight to define the confidence that should be afforded an assessment result: variability and uncertainty combine into scatter; relevance, reliability, and collective properties combine into weight; and scatter and weight together determine confidence.
2 This discussion assumes an objectivist view that conceives uncertainty and variability as probabilities based
on frequencies, which are distinct from qualitative judgments of weight. A subjectivist would consider all
components of Figure 9-1 to be encompassed by degree of belief, and all components could be expressed as
subjective probabilities.
The pieces and types of evidence that are not used to derive the numerical result could still provide
information concerning the result. For example, another piece of evidence might place a limit on the
possible range of effects, and that information can be used to truncate the confidence limit on the
estimated effect. Other evidence might inform the scope of a result. For example, a field-based
benchmark value for a contaminant in streams might be based on benthic invertebrates because they
provide the best data. Fish survey data that are not sufficient to derive a benchmark, however, might be
sufficient to conclude that the invertebrate-based benchmark is likely also protective of the fish
assemblage.
Conceivably, assessors could address the uncertainty concerning the confidence expressed by the WoE.
No methods have been found for estimating or describing such meta-uncertainties. An estimate of WoE
uncertainty, however, can be obtained by replicating the assessment. That is, multiple assessment teams
could be engaged to assemble, weight, and weigh the evidence so that the variance among WoE results
might be estimated. That implies an extraordinary effort for a routine assessment, but replication of a
WoE for a condition assessment has been done experimentally (Bay et al., 2007). Six experts were
provided conventional sediment quality triad data for 25 embayment sites in California. They were asked
to rank the sites from best to worst and to categorize them into one of six standard categories. No
instructions were provided, so the participants apparently used whatever WoE method they used in their
professional practice. The rankings were highly correlated (mean Spearman rank correlation = 0.92).
Differences in categorization of a site were common but mostly small and were primarily due to
differences in relative weighting of the three types of evidence, so common weighting guidelines might
have reduced discrepancies.
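As a sketch of how agreement among replicate rankings might be quantified (the teams and rankings below are hypothetical, not the Bay et al. data), mean pairwise Spearman rank correlation can be computed as follows:

```python
from itertools import combinations

def spearman(rank_a, rank_b):
    """Spearman rank correlation for two rankings without ties."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical site rankings (1 = best) from three assessment teams for six sites.
rankings = {
    "team_1": [1, 2, 3, 4, 5, 6],
    "team_2": [2, 1, 3, 5, 4, 6],
    "team_3": [1, 3, 2, 4, 6, 5],
}

pairs = list(combinations(rankings.values(), 2))
mean_rho = sum(spearman(a, b) for a, b in pairs) / len(pairs)
print(f"Mean pairwise Spearman correlation: {mean_rho:.2f}")
```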
A sensitivity analysis would be more practical than replicating the assessment as a means of determining
the variability in the weighting process. That is, assessors could determine the influence of changing the
weights on WoE results. If the choice between alternative hypotheses could change when, for example,
the reliability of a piece of evidence was scored as + rather than ++, that sensitivity would be noted. Such
cases would tend to occur when few pieces of evidence are considered, when the overall weight for a
body of evidence is marginal, or when the bodies of evidence for alternative hypotheses have similar weights.
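A minimal sketch of such a sensitivity check, assuming a simple additive integration rule and a three-level scoring scale (both hypothetical simplifications):

```python
from itertools import product

# Hypothetical weights (+ = 1, ++ = 2, +++ = 3) for three pieces of evidence
# under each of two candidate hypotheses.
evidence = {
    "hypothesis_A": [3, 2, 2],
    "hypothesis_B": [2, 2, 2],
}

def favored(scores):
    """Simplified integration rule: the larger summed weight wins; ties are ambiguous."""
    totals = {h: sum(s) for h, s in scores.items()}
    best = max(totals.values())
    winners = [h for h, t in totals.items() if t == best]
    return winners[0] if len(winners) == 1 else "ambiguous"

baseline = favored(evidence)

# Perturb each score one step up and down; note any perturbation that changes the conclusion.
for hyp, scores in evidence.items():
    for i, delta in product(range(len(scores)), (-1, 1)):
        perturbed = {h: list(s) for h, s in evidence.items()}
        perturbed[hyp][i] = max(1, min(3, perturbed[hyp][i] + delta))
        if favored(perturbed) != baseline:
            print(f"Conclusion changes if {hyp} score {i + 1} shifts by {delta:+d}")
```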
10. WEIGHT-OF-EVIDENCE SUMMARY AND THE PATH FORWARD
The basic framework of assembling evidence, weighting it, and weighing the body of evidence is
fundamental to making inferences based on WoE (Figure 10-1). Even if a WoE is not explicitly
performed using these steps, assessors should at least consider each of these processes. Adhering to an
explicit framework can improve the WoE results, enhance transparency, and increase the confidence of
reviewers, stakeholders, and decision makers.
The steps to implement the framework for qualitative WoE, with section numbers, include (a schematic sketch follows the list):
• Finding information (4.2.1),
• Designing and conducting studies to fill information gaps (4.3),
• Screening studies (4.2.2),
• Categorizing studies (4.2.3),
• Deriving evidence from the data (4.2.4),
• Evaluating the evidence with respect to properties (5.3),
• Assigning weight scores based on the evaluation (5.3),
• Creating a scoring table to summarize the results of evidence weighting (5.4),
• Integrating the weighted evidence into bodies of evidence for each hypothesis (6.2),
• Creating a WoE table to summarize the results of evidence integration (6.2),
• Interpreting the bodies of evidence (6.3),
• Explaining ambiguities and discrepancies (6.4),
• Reiterating the process if necessary (6.4), and
• Presenting results (6.5).
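A schematic of how these steps fit together, sketched in Python (the function names, data shapes, and placeholder scores are hypothetical, not a prescribed implementation):

```python
# Hypothetical skeleton of the qualitative WoE workflow; each function is a
# placeholder for the corresponding steps listed above.

def assemble_evidence(question):
    """Sections 4.2-4.3: search, screen, and categorize studies; derive evidence."""
    return []  # placeholder for a list of evidence pieces

def weight_evidence(evidence):
    """Section 5: score each piece for relevance, strength, and reliability."""
    return [{"piece": p, "scores": {"relevance": "+", "strength": "+", "reliability": "+"}}
            for p in evidence]

def weigh_body_of_evidence(weighted, hypotheses):
    """Section 6: integrate weights, interpret bodies of evidence, explain discrepancies."""
    return {h: "somewhat supported" for h in hypotheses}  # placeholder interpretation

hypotheses = ["thermal effluent", "fine sediment"]  # hypothetical candidate causes
weighted = weight_evidence(assemble_evidence("What caused the impairment?"))
print(weigh_body_of_evidence(weighted, hypotheses))
```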
In addition to the approach for applying WoE to qualities such as causality and impairment, this
document briefly addresses WoE for quantities and how WoE for qualities and quantities are related
(Section 8). Although WoE for quantities such as benchmark or parameter values is uncommon now,
meta-analysis and other techniques for weighting and combining quantities seem likely to become more
widely used. A combined approach to WoE for qualities and quantities is likely advantageous.
Similarly, this document briefly discusses the relationship between uncertainty and WoE (Section 9). By
combining the somewhat narrow concept of uncertainty (i.e., measures of the scatter of data or of
estimates) with the broader concept that some evidence and conclusions deserve more weight, it should be
possible to improve the communication of confidence in assessment results.
The WoE approach in this document is based on experience in the EPA, particularly with determining the
causes of impairment in ecosystems (http://www3.epa.gov/caddis/). Its application in practice will vary
among programs due to differences in issues, policies, prior practices, and the amount and variety of
evidence potentially available for weighing. As implementation of this approach progresses,
context-specific guidance is expected to be developed in the form of exemplary applications of WoE or
program-specific WoE guidance documents.
Figure 10-1. The detailed framework for qualitative WoE. Assemble evidence: search literature, design and conduct studies, then screen, categorize, and derive evidence. Weight evidence: weight relevance, strength, and reliability, and combine the weights. Weigh body of evidence: integrate evidence, interpret bodies of evidence, and explain ambiguities and discrepancies.
In summary, this document is intended to help ecological assessors improve the practice of WoE without
imposing burdensome prescriptions. If read and applied in that spirit, this document should help advance
the cause of well-informed environmental protection.
11. REFERENCES
Anchor QEA. (2013). Baseline ecological risk assessment report, Patrick Bayou Superfund Site, Deer Park,
Texas. Ocean Springs, MS: Patrick Bayou Joint Defense Group and U.S. EPA.
Anderson, D. (2008). Model based inference in the life sciences. New York, NY: Springer Science and
Business Media.
Ankley, GT; Bennett, RS; Erickson, RJ; Hoff, DJ; Hornung, MW; Johnson, RD; Mount, DR; Nichols,
JW; Russom, CL; Schmieder, PK; Serrrano, JA; Tietge, JE; Villeneuve, DL. (2010). Adverse
outcome pathways: a conceptual framework to support ecotoxicology research and risk
assessment. Environ Toxicol Chem 29: 730-741. http://www.ncbi.nlm.nih.gov/pubmed/20821501
Batley, GE; Burton, GA; Chapman, PM; Forbes, VE. (2002). Uncertainties in sediment quality weight-of-
evidence (WOE) assessments. Hum Ecol Risk Assess 8: 1517-1547.
http://dx.doi.org/10.1080/20028091057466
Bay, S; Berry, W; Chapman, PM; Fairey, R; Gries, T; Long, E; MacDonald, D; Weisberg, SB. (2007).
Evaluating consistency of best professional judgment in the application of a multiple lines of
evidence sediment quality triad. Integr Environ Assess Manag 3: 491-497.
http://www.ncbi.nlm.nih.gov/pubmed/18046798
Becker, RA; Ankley, GT; Edwards, SW; Kennedy, SW; Linkov, I; Meek, B; Sachana, M; Segner, H; Van
Der Burg, B; Villeneuve, DL; Watanabe, H; Barton-Maclaren, TS. (2015). Increasing scientific
confidence in adverse outcome pathways: Application of tailored Bradford-Hill considerations for
evaluating weight of evidence. Regul Toxicol Pharmacol 72: 514-537.
http://www.ncbi.nlm.nih.gov/pubmed/25863193
Benedetti, M; Ciaprini, F; Piva, F; Onorati, F; Fattorini, D; Notti, A; Ausili, A; Regoli, F. (2012). A
multidisciplinary weight of evidence approach for classifying polluted sediments: Integrating
sediment chemistry, bioavailability, biomarkers responses and bioassays. Environ Int 38: 17-28.
http://www.ncbi.nlm.nih.gov/pubmed/21982029
Bilotta, G; Milner, A; Boyd, I. (2014). On the use of systematic reviews to inform environmental policies.
Environ Sci Pol 42: 67-77.
Bilyard, G; Beckert, H; Bascietto, J; Abrams, C; Dyer, S; Haselow, L. (1997). Using the data quality
objectives process during the design and conduct of ecological risk assessment. Washington, DC:
U.S. Department of Energy, Office of Environmental Policy and Assistance.
http://www.monitor2manage.com.au/userdata/downloads/p /Risk%20management%20and%20D
OO.pdf
Black & Veatch Special Projects Corps. (2011). Ecological risk assessment for the estuary at the LCP
chemical site in Brunswick, Georgia: Site investigation/analysis and risk characterization
(Revision 4). Atlanta, GA: U.S. Environmental Protection Agency, Region 4.
http://www.epa.gov/sites/production/files/2Q14-
03/documents/baseline ecological risk assessment april2011pdf.pdf
Blocksom, K; Johnson, B. (2009). Development of a regional macroinvertebrate index for large river
bioassessments. Ecol Indic 9: 313-328.
Blyth, CR. (1972). On Simpson's paradox and the sure-thing principle. J Am Stat Assoc 67: 364-366.
http://www.tandfonline.com/doi/abs/10.1080/Q1621459.1972.10482387
Bombardier, M; Bermingham, N. (1999). The SED-TOX index: toxicity-directed management tool to
assess and rank sediments based on their hazard - concept and application. Environ Toxicol Chem
18: 685-698.
Bombardier, M; Blaise, C. (2000). Comparative study of the sediment-toxicity index, benthic community
metrics and contaminant concentrations. Water Qual Res J Can 35: 753-780.
Borenstein, M; Hedges, L; Higgins, J; Rothstein, H. (2009). Introduction to meta-analysis. Chichester,
U.K.: Wiley.
Carriger, J; Barron, M. (2016). A practical probabilistic graphical modeling tool for weighing ecological
risk-based evidence. Soil Sed Contam 25: 476-487.
Carriger, JF; Barron, MG. (2011). Minimizing risks from spilled oil to ecosystem services using influence
diagrams: The deepwater horizon spill response. Environ Sci Technol 45: 7631-7639.
http://dx.doi.org/10.1021/es201Q37u
CEE (Collaboration for Environmental Evidence). (2013). Guidelines for systematic reviews in
environmental management. Version 4.2. Bangor, UK: CEE.
Chapman, P. (1990). The sediment quality triad approach to determining pollution-induced degradation.
Sci Total Environ 97/98: 815-825.
Chapman, PM. (1996). Presentation and interpretation of Sediment Quality Triad data. Ecotoxicology 5:
327-339. http://www.ncbi.nlm.nih.gov/pubmed/24193872
Chapman, PM. (2007). Determining when contamination is pollution - weight of evidence determinations
for sediments and effluents. Environ Int 33: 492-501.
http://www.ncbi.nlm.nih.gov/pubmed/17027966
Chapman, PM; Anderson, J. (2005). A decision-making framework for sediment contamination. Integr
Environ Assess Manag 1: 163-173. http://www.ncbi.nlm.nih.gov/pubmed/16639882
COA Sediment Task Group. (2008). Canada-Ontario decision-making framework for assessment of Great
Lakes contaminated sediments. Environment Canada and Ontario Ministry of the Environment.
http://publications.gc.ca/collections/collection 2010/ec/En 164-14-2007-eng.pdf
Collier, ZA; Gust, KA; Gonzalez-Morales, B; Gong, P; Wilbanks, MS; Linkov, I; Perkins, EJ. (2016). A
weight of evidence assessment approach for adverse outcome pathways. Regul Toxicol
Pharmacol 75: 46-57. http://www.ncbi.nlm.nih.gov/pubmed/26724267
Cormier, SM; Suter, GW, 2nd. (2008). A framework for fully integrating environmental assessment.
Environ Manage 42: 543-556. http://www.ncbi.nlm.nih.gov/pubmed/18506517
Cormier, SM; Suter, GW, 2nd. (2013). A method for assessing causation of field exposure-response
relationships. Environ Toxicol Chem 32: 272-276.
http://www.ncbi.nlm.nih.gov/pubmed/23161561
Cormier, SM; Suter, GW, 2nd; Zheng, L; Pond, GJ. (2013). Assessing causation of the extirpation of
stream macroinvertebrates by a mixture of ions. Environ Toxicol Chem 32: 277-287.
http://www.ncbi.nlm.nih.gov/pubmed/23147750
Cormier, SM; Suter, GW; Norton, SB. (2010). Causal characteristics for ecoepidemiology. Hum Ecol
Risk Assess 16: 53-73. http://dx.doi.org/10.1080/10807030903459320
CRD (Centre for Reviews and Dissemination). (2009). Systematic reviews: CRD's guidance for
undertaking reviews in health care. York, U.K.: Centre for Reviews and Dissemination,
University of York.
Dagnino, A; Sforzini, S; Dondero, F; Fenoglio, S; Bona, E; Jensen, J; Viarengo, A. (2008). A "weight of
evidence" approach for the integration of environmental "triad" data to assess ecological risk and
biological vulnerability. Integr Environ Assess Manag 4: 314-326.
http://www.ncbi.nlm.nih.gov/pubmed/18393577
Dawes, J. (2008). Do data characteristics change according to the number of scale points used? An
experiment using 5-point, 7-point and 10-point scales. Int J Market Res 50: 61-77.
Doi, SA; Thalib, L. (2008). A quality-effects model for meta-analysis. Epidemiology 19: 94-100.
http://www.ncbi.nlm.nih.gov/pubmed/18090860
Douglas, H. (2012). Weighing complex evidence in a democratic society. Kennedy Inst Ethics J 22: 139-
162. http://www.ncbi.nlm.nih.gov/pubmed/23002581
Duboudin, C; Ciffroy, P; Magaud, H. (2004). Effects of data manipulation and statistical methods on
species sensitivity distributions. Environ Toxicol Chem 23: 489-499.
http://www.ncbi.nlm.nih.gov/pubmed/14982398
EC (European Commission). (2013). Introduction to the new EU water framework directive.
http://ec.europa.eu/environment/water/water-framework/info/intro en.htm.
ECETOC (European Centre for Ecotoxicology and Toxicology of Chemicals). (2009). Framework for the
integration of human and animal data in chemical risk assessment. (Technical Report No. 104).
Brussels, Belgium: ECETOC.
http://www.ecetoc.org/uploads/Publications/documents/TR%20104.pdf
ECHA (European Chemicals Agency). (2008). Guidance on information requirements and chemical
safety assessment. Chapter R.10: characterization of dose [concentration]-response for
environment. Helsinki, Finland: ECHA. http://echa.europa.eu/guidance-documents/guidance-on-
information-requirements-and-chemical-safety-assessment
ECHA. (2010). Practical guide 2: How to report weight of evidence. (ECHA-10-B-05-EN). Helsinki,
Finland: ECHA.
http://echa.europa.eu/documents/10162/13655/pg_report_weight_of_evidence_en.pdf
ECHA. (2015). Read-across assessment framework (RAAF). (ECHA-15-R-07-EN). Helsinki, Finland:
ECHA. http://echa.europa.eu/documents/10162/13628/raaf_en.pdf
Egger, M; Juni, P; Bartlett, C; Holenstein, F; Sterne, J. (2003). How important are comprehensive
literature searches and the assessment of trial quality in systematic reviews? Empirical study.
Health Technol Assess 7: 1-76. http://www.ncbi.nlm.nih.gov/pubmed/12583822
Fenton, N; Neil, M. (2013). Risk assessment and decision analysis with Bayesian networks. Boca
Raton, FL: CRC Press.
Ferguson, CJ; Brannick, MT. (2012). Publication bias in psychological science: prevalence, methods for
identifying and controlling, and implications for the use of meta-analyses. Psychol Methods 17:
120-128. http://www.ncbi.nlm.nih.gov/pubmed/21787082
Forbes, VE; Calow, P. (2002). Species sensitivity distributions revisited: A critical appraisal. Hum Ecol
Risk Assess 8: 473-492. http://www.tandfonline.com/doi/abs/10.1080/10807030290879781
Fox, GA. (1991). Practical causal inference for ecoepidemiologists. J Toxicol Environ Health 33: 359-
373. http://www.ncbi.nlm.nih.gov/pubmed/1875428
Good, I. (1950). Probability and the weighing of evidence. New York, NY: Hafner Press.
Gough, D; Oliver, S; Thomas, J. (2012). An introduction to systematic reviews. London, U.K.: Sage
Publications.
Grapentine, L; Anderson, J; Boyd, D; Burton, GA; DeBarros, C; Johnson, G; Marvin, C; Milani, D;
Painter, S; Pascoe, T; Reynoldson, T; Richman, L; Solomon, K; Chapman, PM. (2002). A
decision making framework for sediment assessment developed for the Great Lakes. Hum Ecol
Risk Assess 8: 1641-1655. http://dx.doi.org/10.1080/20028091Q57538
Gray, G. (1994). Complete risk characterization. In Risk in Perspective (pp. 1-2). Harvard Center for Risk
Analysis. https://cdn1.sph.harvard.edu/wp-content/uploads/sites/1273/2013/06/Complete-Risk-
Characterization-Nov-1994.pdf
Greenberg, M; Charters, D. (2007). The rule of five: A novel approach to derive PRGs. Joint Services
Environmental Management Conference, May 21-24, 2007, Columbus, OH.
Greenland, S; O'Rourke, K. (2001). On the bias produced by quality scores in meta-analysis, and a
hierarchical view of proposed solutions. Biostatistics 2: 463-471.
http://www.ncbi.nlm.nih.gov/pubmed/12933636
Hawkins, CP. (2006). Quantifying biological integrity by taxonomic completeness: its utility in regional
and global assessments. Ecol Appl 16: 1277-1294.
http://www.ncbi.nlm.nih.gov/pubmed/16937797
Hertzberg, RC; Teuschler, LK. (2002). Evaluating quantitative formulas for dose-response assessment of
chemical mixtures. Environ Health Perspect 110 Suppl 6: 965-970.
http://www.ncbi.nlm.nih.gov/pubmed/12634126
Higgins, J; Green, S. (2011). Cochrane handbook for systematic reviews of interventions. Version 5.1.0.
Cambridge, U.K.: The Cochrane Collaboration. http://handbook.cochrane.org/
Hilborn, R; Mangel, M. (1997). The ecological detective: Confronting models with data. Monographs in
population biology. Princeton, NJ: Princeton University Press.
Hill, AB. (1965). The environment and disease: association or causation? Proc R Soc Med 58: 295-300.
http://www.ncbi.nlm.nih.gov/pubmed/14283879
Hope, BK; Clarkson, JR. (2014). A strategy for using weight-of-evidence methods in ecological risk
assessments. Hum Ecol Risk Assess 20: 290-315.
http://dx.doi.org/10.1080/10807039.2013.781849
Hume, D. (1748). An enquiry concerning human understanding. Amherst, NY: Prometheus.
Hunter, JE; Schmidt, FL. (2004). Methods of meta-analysis: Correcting error and bias in research findings
(Second ed.). Thousand Oaks, CA: Sage Publications.
Integral Consulting Inc. (2013). Baseline ecological risk assessment: San Jacinto River Waste Pits
Superfund Site. (9559469). Seattle, WA. http://semspub.epa.gov/src/collection/06/SC32405
Jaworska, J; Gabbert, S; Aldenberg, T. (2010). Towards optimization of chemical testing under REACH:
a Bayesian network approach to Integrated Testing Strategies. Regul Toxicol Pharmacol 57: 157-
167. http://www.ncbi.nlm.nih.gov/pubmed/20156511
Johnston, RK; Munns, WR, Jr.; Tyler, PL; Marajh-Whittemore, P; Finkelstein, K; Munney, K; Short, FT;
Melville, A; Hahn, SP. (2002). Weighing the evidence of ecological risk from chemical
contamination in the estuarine environment adjacent to the Portsmouth Naval Shipyard, Kittery,
Maine, USA. Environ Toxicol Chem 21: 182-194.
http://www.ncbi.nlm.nih.gov/pubmed/11804053
Karr, JR; Fausch, KD; Angermeier, PL; Yant, PR; Schlosser, IJ. (1986). Assessing biological integrity in
running waters: A method and its rationale. (Publication 5). Champaign, Illinois: Illinois Natural
History Survey Special.
http://www.nrem.iastate.edu/class/assets/aecl518/Discussion%20Readings/Karr et al. 1986.pdf
Keisler, J; Collier, Z; Chu, E; Sinatra, N; Linkov, I. (2014). Value of information analysis: the state of
application. Environ Syst Decis 34: 3-23.
Krimsky, S. (2005). The weight of scientific evidence in policy and law. Am J Public Health 95 Suppl 1:
S129-136. http://www.ncbi.nlm.nih.gov/pubmed/16030328
Laplace, P. (1812). A philosophical essay on probabilities. 1902 translation. New York, NY: John Wiley.
Linkov, I; Loney, D; Cormier, S; Satterstrom, FK; Bridges, T. (2009). Weight-of-evidence evaluation in
environmental assessment: review of qualitative and quantitative approaches. Sci Total Environ
407: 5199-5205. http://www.ncbi.nlm.nih.gov/pubmed/19619890
Linkov, I; Massey, O; Keisler, J; Rusyn, I; Hartung, T. (2015). From "weight of evidence" to quantitative
data integration using multicriteria decision analysis and Bayesian methods. ALTEX 32: 3-8.
http://www.ncbi.nlm.nih.gov/pubmed/25592482
Linkov, I; Moberg, E. (2012). Multi-criteria decision analysis: Environmental applications and case
studies. Boca Raton, FL: CRC Press.
Luftig, SD. (1999). Issuance of final guidance: Ecological risk assessment and risk management
principles for superfund sites. Memorandum, October 7. (OSWER Directive 9285.7-28 P).
Washington, D.C.: Office of Emergency and Remedial Response, U.S. EPA.
http://nepis.epa.gov/Exe/ZyPURL.cgi?Dockey=9100L92P.TXT
McDonald, BG; deBruyn, AM; Wernick, BG; Patterson, L; Pellerin, N; Chapman, PM. (2007). Design
and application of a transparent and scalable weight-of-evidence framework: an example from
Wabamun Lake, Alberta, Canada. Integr Environ Assess Manag 3: 476-483.
http://www.ncbi.nlm.nih.gov/pubmed/18046796
McGarity, T; Wagner, W. (2008). Bending science: How special interests corrupt public health research.
Cambridge, MA: Harvard University Press.
McPherson, C; Chapman, PM; Debruyn, AM; Cooper, L. (2008). The importance of benthos in weight of
evidence sediment assessments—a case study. Sci Total Environ 394: 252-264.
http://www.ncbi.nlm.nih.gov/pubmed/18295824
Meek, ME; Palermo, CM; Bachman, AN; North, CM; Jeffrey Lewis, R. (2014). Mode of action human
relevance (species concordance) framework: Evolution of the Bradford Hill considerations and
comparative analysis of weight of evidence. J Appl Toxicol 34: 595-606.
http://www.ncbi.nlm.nih.gov/pubmed/24777878
Menzie, C; Henning, MH; Cura, J; Finkelstein, K; Gentile, J; Maughan, J; Mitchell, D; Petron, S; Potocki,
B; Svirsky, S; Tyler, P. (1996). Special report of the Massachusetts weight-of-evidence
workgroup: A weight-of-evidence approach for evaluating ecological risks. Hum Ecol Risk Assess
2: 277-304. http://dx.doi.org/10.1080/10807039609383609
Murphy, BL; Morrison, RD. (2002). Introduction to environmental forensics. San Diego, CA: Academic
Press.
Newman, MC; Clements, WH. (2008). Ecotoxicology: A comprehensive treatment. Boca Raton, FL:
CRC Press.
Norton, SB; Cormier, SM; Suter, GW. (2014). Ecological causal assessment. Boca Raton, FL: CRC
Press.
NRC (National Research Council). (1983). Risk assessment in the Federal Government: managing the
process. Washington, D.C.: The National Academies Press.
http://www.nap.edu/openbook.php?isbn=0309033497
NRC. (2014). Review of EPA's integrated risk information system (IRIS) process. Washington, D.C.: The
National Academies Press. http://www.nap.edu/catalog/18764/review-of-epas-integrated-risk-
information-system-iris-process
OECD (Organization for Economic Cooperation and Development). (2013). Guidance document on
developing and assessing adverse outcome pathways. Series on testing and assessment No. 184.
Paris, France: Organization for Economic Cooperation and Development.
http://www.oecd.org/officialdocuments/publicdisplaydocumentpdf/?cote=env/jm/mono(2013)6&
doclanguage=en
Pond, G; Passmore, M; Borsuk, F; Reynolds, L; Rose, C. (2008). Downstream Effects of Mountaintop
Coal Mining: Comparing Biological Conditions using Family- and Genus-Level
Macroinvertebrate Bioassessment Tools. J N Am Benthol Soc 27: 717-737.
Pope, C; Mays, N; Popay, J. (2007). Synthesizing qualitative and quantitative health evidence: A guide to
methods. Maidenhead, U.K.: Open University Press.
Raloff, J. (2010). Atrazine paper's challenge: Who's responsible for accuracy. Science News, May 6, 2010.
Rhomberg, L. (2015). Hypothesis-based weight of evidence: An approach to assessing causation and its
application to regulatory toxicology. Risk Anal 35:1114-1124.
http://www.ncbi.nlm.nih.gov/pubmed/24724710
Rhomberg, LR; Goodman, JE; Bailey, LA; Prueitt, RL; Beck, NB; Bevan, C; Honeycutt, M; Kaminski,
NE; Paoli, G; Pottenger, LH; Scherer, RW; Wise, KC; Becker, RA. (2013). A survey of
frameworks for best practices in weight-of-evidence analyses. Crit Rev Toxicol 43: 753-784.
http://www.ncbi.nlm.nih.gov/pubmed/24040995
Rooney, AA; Boyles, AL; Wolfe, MS; Bucher, JR; Thayer, KA. (2014). Systematic review and evidence
integration for literature-based environmental health science assessments. Environ Health
Perspect 122: 711-718. http://www.ncbi.nlm.nih.gov/pubmed/24755067
SAB (Scientific Advisory Board). (2012). SAB review of the EPA's ecological assessment action plan.
(EPA-SAB-12-010). Washington, D.C.: SAB, U.S. EPA.
http://yosemite.epa.gov/sab/sabproduct.nsf/773C41AF81B7B16C85257A8700796DA9/$File/EP
A-SAB-12-010-unsigned.pdf
SAB. (2014). SAB review of the draft EPA report Connectivity of Streams and Wetlands to Downstream
Waters: A Review and Synthesis of the Scientific Evidence. Letter to Administrator Gina
McCarthy. (EPA-SAB-15-001). Washington, D.C.: SAB, U.S. EPA.
http://yosemite.epa.gov/sab/sabproduct.nsf/WebBoard/AF1A28537854F8AB85257D74005003D
2/$File/EPA-SAB-15-001+unsigned.pdf
SAB. (2015). SAB review of the EPA's Evaluation of the Inhalation Carcinogenicity of Ethylene Oxide
(Revised External Review Draft - August 2014). Letter to Administrator Gina McCarthy. (EPA-
SAB-15-002). Washington, D.C.: SAB, U.S. EPA.
http://yosemite.epa.gov/sab/sabproduct.nsf/WebBoard/BD2B2DB4F84146A585257E9A0070E65
5/$File/EPA-SAB-15-012+unsigned.pdf
Sanz-Martin, M; Pitt, K; Condon, R; Lucus, C; de Santana, C; Duarte, C. (2016). Flawed citation
practices facilitate the unsubstantiated perception of a global trend toward increased jellyfish
blooms. Glob Ecol Biogeo 25: 1039-1049.
Semenzin, E; Critto, A; Rutgers, M; Marcomini, A. (2008). Integration of bioavailability, ecology and
ecotoxicology by three lines of evidence into ecological risk indexes for contaminated soil
assessment. Sci Total Environ 389: 71-86. http://www.ncbi.nlm.nih.gov/pubmed/17904618
Semenzin, E; Critto, A; Rutgers, M; Marcomini, A. (2009). DSS-ERAMANIA: Decision support system
for site-specific ecological risk assessment of contaminated sites. In A Marcomini; GWI Suter; A
Critto (Eds.), Decision Support Systems for Risk-Based Management of Contaminated Sites.
New York, NY: Springer.
Semenzin, E; Lanzellotto, E; Hristozov, D; Critto, A; Zabeo, A; Giubilato, E; Marcomini, A. (2015).
Species sensitivity weighted distribution for ecological risk assessment of engineered
nanomaterials: the n-Ti02 case study. Environ Toxicol Chem 34: 2644-2659.
http://www.ncbi.nlm.nih.gov/pubmed/26058704
Shull, D; Pulket, M. (2015). Causal analysis of the smallmouth bass decline in the Susquehanna and
Juniata Rivers. Harrisburg, PA: Pennsylvania Department of Environmental Protection.
http://files.dep.state.pa.us/Water/Drinking%20Water%20and%20Facility%20Regulation/WaterQualityPortalFiles/SusquehannaRiverStudyUpdates/SMB_CADDIS_Report.pdf
Small, MJ. (2008). Methods for assessing uncertainty in fundamental assumptions and associated models
for cancer risk assessment. Risk Anal 28: 1289-1308.
http://www.ncbi.nlm.nih.gov/pubmed/18844862
Smith, E; Lipkovich, I; Ye, K. (2002). Weight-of-Evidence (WOE): Quantitative estimation of probability
of impairment for individual and multiple lines of evidence. Hum Ecol Risk Assess 8: 1585-1596.
Spiegelhalter, D; Riesch, H. (2011). Don't know, can't know: Embracing deeper uncertainties when
analyzing risks. Phil Trans Royal Soc 369: 4730-4750.
Stahl, C; Cimorelli, A; Chow, A. (2002). A new approach to environmental decision analysis: Multi-
criteria integrated resource assessment (MIRA). Bull Sci Technol Soc 22: 443-459.
Susser, M. (1986). Rules of inference in epidemiology. Regul Toxicol Pharmacol 6: 116-128.
http://www.ncbi.nlm.nih.gov/pubmed/2941827
Suter, GW, 2nd; Cormier, S. (2011). Why and how to combine evidence in environmental assessments:
Weighing evidence and building cases. Sci Total Environ 409: 1406-1417.
Suter, GW, 2nd; Cormier, S. (2014). The problem of biased data and potential solutions for health and
environmental assessments. Hum Ecol Risk Assess 21: 1736-1752.
Suter, GW, II. (1993). Ecological risk assessment. Boca Raton, FL: Lewis Publishers.
Suter, GW, II. (1996). Risk characterization for ecological risk assessment of contaminated sites.
(ES/ER/TM-200). Oak Ridge, TN: Oak Ridge National Laboratory.
http://rais.ornl.gov/documents/tm200.pdf
Suter, GW, II; Efroymson, RA; Sample, BE; Jones, DS. (2000). Ecological risk assessment for
contaminated sites. Boca Raton, FL: Lewis Publishers.
Suter, GW, II; Traas, T; Posthuma, L. (2002). Issues and practices in the derivation and use of species
sensitivity distributions. In L Posthuma; GW Suter, II; T Traas (Eds.), Species Sensitivity
Distributions in Ecotoxicology (pp. 437-474). Boca Raton, FL: Lewis Publishers.
Todd, P; Yeo, D; Li, D; Ladle, R. (2007). Citing practices in ecology: can we believe our own words?
Oikos 116: 1599-1601.
Todd, P; Guest, J; Lu, J; Chou, L. (2010). One in four citations in marine biology papers is inappropriate.
Mar Ecol Prog Ser 408: 299-303.
Turner, R; Spiegelhalter, D; Smith, G; Thompson, S. (2009). Bias modeling in evidence synthesis. J R
Stat Soc 172: 21-47.
U.S. EPA (Environmental Protection Agency). (1985). Guidelines for deriving numeric national water
quality criteria for the protection of aquatic organisms and their uses. (PB85-227049).
Washington, D.C.: U.S. EPA. http://www.epa.gov/sites/production/files/2015-
08/documents/guidelines_for_deriving_nnwqc_for_the_protection_of_aquatic_organisms_and_their_uses.pdf
U.S. EPA (Environmental Protection Agency). (1992a). Guidance for data useability in risk assessment
(Part A). (EPA/540/R-92/003). Washington, DC: Office of Emergency and Remedial Response.
https://rais.ornl. gov/documents/USERISKA.pdf
U.S. EPA (Environmental Protection Agency). (1992b). Guidance for data useability in risk assessment
(part B). (9285.7-09B, PB92 -963362). Washington, DC: Office of Emergency and Remedial
Response, U.S. EPA. http://rais.ornl.gov/documents/USERISKB.pdf
U.S. EPA (Environmental Protection Agency). (1994). Guidance for the data quality objectives process:
EPA QA/G-4. (EPA/600/R-96/055). Washington, D.C.: Office of Research and Development,
U.S. EPA.
http://www3.epa.gov/epawaste/hazard/correctiveaction/resources/guidance/qa/epaqag4.pdf
U.S. EPA (Environmental Protection Agency). (1996). Biological criteria: Technical guidance for streams
and small rivers. (EPA/822/B-096/001). Washington, D.C.: Office of Water, U.S. EPA.
http://nepis.epa.gov/Exe/ZyPURL.cgi?Dockey=20003GSJ.TXT
U.S. EPA (Environmental Protection Agency). (1998). Guidelines for ecological risk assessment.
(EPA/630/R-95/002F). Washington, D.C.: U.S. EPA.
http://www.epa.gov/raf/publications/pdfs/ecotxtbx.pdf
U.S. EPA (Environmental Protection Agency). (2000). Stressor identification guidance document.
(EPA/822/B-00/025). Washington, D.C.: Office of Water, Office of Research and Development,
U.S. EPA.
http://permanent.access.gpo.gov/websites/epagov/www.epa.gov/ost/biocriteria/stressors/stressorid.pdf
U.S. EPA (Environmental Protection Agency). (2002a). Guidance for quality assurance project plans:
EPA QA/G-5. (EPA/240/R-02/009). Washington, D.C.: Office of Environmental Information,
U.S. EPA. http://www.epa.gov/quality/qs-docs/g5-final.pdf
U.S. EPA (Environmental Protection Agency). (2002b). Guidelines for ensuring and maximizing the
quality, objectivity, utility, and integrity of information disseminated by the Environmental
Protection Agency. (EPA/260/R-02/008). Washington, D.C.: Office of Environmental
Information, U.S. EPA.
http://www.epa.gov/quality/informationguidelines/documents/EPA_InfoQualityGuidelines.pdf
U.S. EPA (Environmental Protection Agency). (2005a). Guidance for developing ecological soil
screening levels (ECO-SSLs). (Publication 9285.7-55). Washington, D.C.: Office of Solid Waste
and Emergency Response, U.S. EPA. http://rais.ornl.gov/documents/ecossl.pdf
U.S. EPA (Environmental Protection Agency). (2005b). Guidelines for carcinogen risk assessment (pp.
166). (EPA/630/P-03/00 IF). Washington, D.C.: Risk Assessment Forum, Office of Research and
Development, U.S. EPA. http://www2.epa.gov/osa/guidelines-carcinogen-risk-assessment
U.S. EPA (Environmental Protection Agency). (2008). EPA quality policy. (EPA Order CIO 2106.0).
Washington, D.C.: Office of Environmental Information, U.S. EPA.
http://www.epa.gov/sites/production/files/2015-09/documents/epa_order_cio_21060.pdf
U.S. EPA (Environmental Protection Agency). (2009a). Analysis of the causes of a decline in the San
Joaquin kit fox population on the Elk Hills, Naval Petroleum Reserve #1, California. (EPA/600/R-
08/130). Cincinnati, OH: National Center for Environmental Assessment, Office of Research and
Development, U.S. EPA.
http://cfpub.epa.gov/ncea/risk/recordisplay.cfm?deid=200367&CFID=53016685&CFTOKEN=36726798
U.S. EPA (Environmental Protection Agency). (2009b). Guidance on the development, evaluation, and
application of environmental models. (EPA/100/K-09/003). Washington, D.C.: Council for
Regulatory Environmental Modeling, Office of the Science Advisor, U.S. EPA.
http://www.epa.gov/crem/library/cred guidance 0309.pdf
U.S. EPA (Environmental Protection Agency). (2010a). Inferring causes of biological impairment in the
Clear Fork Watershed, West Virginia. (EPA/600/R-08/146). Washington, D.C.: National Center
for Environmental Assessment, Office of Research and Development, U.S. EPA.
http://cfpub.epa.gov/ncea/risk/recordisplay.cfm?deid=201963&CFID=61073289&CFTOKEN=23932460
U.S. EPA (Environmental Protection Agency). (2010b). Integrating ecological assessment and decision-
making at EPA: a path forward. Results of a colloquium in response to Science Advisory Board
and National Research Council recommendations. (EPAS/100/R-10/004). Washington, D.C.:
Risk Assessment Forum, Office of Research and Development, U.S. EPA.
http://www.epa.gov/sites/production/files/2013-09/documents/integrating-ecolog-assess-decision-
making.pdf
U.S. EPA (Environmental Protection Agency). (2011a). Evaluation guidelines for ecological toxicity data
in the open literature. Washington, D.C.: Office of Pesticide Programs, U.S. EPA.
http://www.epa.gov/pesticide-science-and-assessing-pesticide-risks/evaluation-guidelines-
ecological-toxicity-data-open
U.S. EPA (Environmental Protection Agency). (2011b). A field-based aquatic life benchmark for
conductivity in Central Appalachian Streams. (EPA/600/R-10/023F). Washington, DC: Office of
Research and Development, National Center for Environmental Assessment, U.S. EPA.
http://cfpub.epa.gov/ncea/cfm/recordisplay.cfm?deid=233809
U.S. EPA (Environmental Protection Agency). (2011c). Stressor identification (SI) at contaminated sites:
Upper Arkansas River, Colorado (EPA/600/R-08/029). Washington, D.C.: National Center for
Environmental Assessment, Office of Research and Development, U.S. EPA.
http://cfpub.epa.gov/ncea/caddis/recordisplay.cfm?deid=189290
U.S. EPA (Environmental Protection Agency). (2012a). Benchmark dose technical guidance.
(EPA/100/R-12/001). Washington, D.C.: Risk Assessment Forum, U.S. EPA.
http://www2.epa.gov/osa/benchmark-dose-technical-guidance
U.S. EPA (Environmental Protection Agency). (2012b). Weight-of-evidence: evaluating results of EDSP
Tier 1 screening to identify the need for Tier 2 testing (pp. 47). Washington, D.C.: Office of
Chemical Safety and Pollution Prevention, U.S. EPA.
http://www.regulations.gov/#!documentDetail;D=EPA-HQ-OPPT-2010-0877-0021
U.S. EPA (Environmental Protection Agency). (2013). Integrated science assessment for lead.
(EPA/600/R-10/075F). Research Triangle Park, NC: National Center for Environmental
Assessment, Office of Research and Development, U.S. EPA.
http://cfpub.epa.gov/ncea/isa/recordisplay.cfm?deid=255721
U.S. EPA (Environmental Protection Agency). (2014a). An assessment of potential mining impacts on
salmon ecosystems of Bristol Bay, Alaska. Seattle, WA: Region 10, U.S. EPA.
http://cfpub.epa.gov/ncea/bristolbay/recordisplay.cfm?deid=253500
U.S. EPA (Environmental Protection Agency). (2014b). Welfare risk and exposure assessment for ozone.
(EPA/452/R-14/005a). Washington, D.C.: Office of Air Quality Planning and Standards, Office
of Air and Radiation, U.S. EPA.
http://www3.epa.gov/ttn/naaqs/standards/ozone/data/20141021welfarerea.pdf
U.S. EPA (Environmental Protection Agency). (2015a). Connectivity of streams and wetlands to
downstream waters: A review and synthesis of the scientific evidence. (EPA/600/R-14/475F).
Washington, D.C.: National Center for Environmental Assessment, National Exposure Research
Laboratory, National Health and Environmental Effects Research Laboratory, Office of Research
and Development, U.S. EPA. http://cfpub.epa.gov/ncea/cfm/recordisplay.cfm?deid=296414
U.S. EPA (Environmental Protection Agency). (2015b). TSCA work plan chemical risk assessment. N-
methylpyrrolidone: paint stripper use. (740-R1-5002). Washington, D.C.: Office of Chemical
Safety and Pollution Prevention, U.S. EPA.
http://nepis.epa.gov/Exe/ZyPURL.cgi?Dockey=P100M55I.TXT
URS Greiner Inc.; CH2M Hill. (2011). Coeur d'Alene basin remedial investigation/feasibility study.
(URS DCN: 4162500.06200.05,a2). Seattle, WA: Region 10, U.S. EPA.
http://yosemite.epa.gov/r10/cleanup.nsf/8065e3cf3d5268538825778300663abc/70761fd9f6ee19c988256cce0006f78f/$FILE/Preface.pdf
van der Ohe, PC; De Zwart, D; Semenzin, E; Apitz, SE; Gottardo, S; Harris, B; Hein, M; Marcomini, A;
Posthuma, L; Schafer, RB; Segner, H; Brakck, W. (2014). Monitoring programmes, multiple
stress analysis and decision support for river basin management. In J Brils; W Brack; P Negrel;
JE Vermaat (Eds.), Risk-Informed Management of European River Basins. Berlin, Germany: Springer-Verlag.
Villeneuve, D; Volz, DC; Embry, MR; Ankley, GT; Belanger, SE; Leonard, M; Schirmer, K; Tanguay, R;
Truong, L; Wehmas, L. (2014). Investigating alternatives to the fish early-life stage test: a
strategy for discovering and annotating adverse outcome pathways for early fish development.
Environ Toxicol Chem 33: 158-169. http://www.ncbi.nlm.nih.gov/pubmed/24115264
Wang, NC; Jay Zhao, Q; Wesselkamper, SC; Lambert, JC; Petersen, D; Hess-Wilson, JK. (2012).
Application of computational toxicological approaches in human health risk assessment. I. A
tiered surrogate approach. Regul Toxicol Pharmacol 63: 10-19.
http://www.ncbi.nlm.nih.gov/pubmed/22369873
Weed, DL. (2005). Weight of evidence: a review of concept and methods. Risk Anal 25: 1545-1557.
http://www.ncbi.nlm.nih.gov/pubmed/16506981
Woodruff, TJ; Sutton, P. (2014). The Navigation Guide systematic review methodology: a rigorous and
transparent method for translating environmental health science into better health outcomes.
Environ Health Perspect 122: 1007-1014. http://www.ncbi.nlm.nih.gov/pubmed/24968373
APPENDIX A. GLOSSARY OF WEIGHT-OF-EVIDENCE TERMS
The following terms are defined as used in this document. The U.S. Environmental Protection Agency
(EPA) might use these terms differently in other contexts.
Alteration: (1) A characteristic of causal relationships; it specifies that the affected entity is changed by
physical, chemical, or other mechanisms leading to the defined effect. (2) A change in an entity
that is consistent with interaction with a cause.
Ambiguous: Evidence that has no clear meaning or more than one possible meaning. Evidence may be
ambiguous because it is weak (e.g., shows no clear relationships) or unreliable (e.g., reference
sites may have important extraneous differences from exposed sites).
Antecedence: A characteristic of causal relationships; it specifies that the causal interaction is itself
connected to processes that precede it, potentially leading back to a source.
Assessment, environmental: (1) A process of generating and presenting scientific information to inform
an environmental regulatory or management decision. (2) The product of an environmental
assessment process.
Benchmark: A criterion, standard, screening value, effect threshold, or other value used to differentiate
potential exposure levels that are of concern from levels that are not of concern.
Case: (1) The situation that is the subject of an environmental assessment; for example, the case may be a
water body experiencing an algal bloom, a hazardous waste site or a proposed pesticide use.
(2) The weighted body of evidence for an assessment hypothesis; for example, "the lack of
co-occurrence weakens the case for fine sediments as the cause."
Characteristics: Properties that define a quality of interest and that could be supported by evidence. For
example, when weighing evidence for causation, evidence is sought to support six characteristics
of causal relationships (Table E-l).
Coherence: The property of a body of evidence that its constituent pieces are logically linked together, thus forming a reasonable explanation.
Confidence: The credence given a conclusion. For quantitative results, confidence is determined by the
scatter of estimates as well as the weight of the evidence.
Considerations: Sets of heterogeneous aspects of causation or other qualitative hypotheses that are used
to structure narrative weight of evidence (WoE). They are derived from or analogous to Hill's
considerations (Hill, 1965).
Co-occurrence: (1) A characteristic of causal relationships; it specifies that the cause and effect are
collocated in space and time at a scale appropriate to the cause and effect. (2) An instance of
collocation in space and time.
Correspondence: The similarity of a piece or type of evidence to the entity or conditions to which the
evidence will be applied. Evidence is relevant if it corresponds well to all aspects of the case
(biological, physical/chemical, and environmental conditions).
Corroboration: Supporting evidence for an assessment proposition from one or more independent
studies providing similar results.
Data: Unanalyzed results of measurements, counts, or observations used as a basis for reasoning or
calculation.
Discrepant: Evidence that is inconsistent or contrary to established facts or theory.
Endpoint, assessment: An explicit expression of the environmental values to be protected, operationally
defined as an ecological entity and its attributes.
Evaluation: The determination of how much influence a piece or category of evidence should be
assigned. Evaluation plus the scoring of its results constitutes the weighting of evidence.
Evidence: Information that informs inferences regarding a condition, cause, prediction, or outcome.
Evidence, body of: All the applicable evidence used to make inferences concerning a proposition.
Evidence, category of: A grouping of evidence for consistency and to facilitate weighting. In WoE for
environmental assessments, evidence is categorized in terms of conventional types of evidence, or
in terms of characteristics that the evidence supports.
Evidence, line of: (1) A complex piece of evidence including multiple causal or logical steps that
establishes a line of reasoning. (2) A general term used to refer to a piece or type of
evidence—this use is discouraged to reduce ambiguous terminology.
Evidence, piece of: The basic unit of evidence; examples include the results of a toxicity test or a stream
survey.
Evidence, type of: A category of evidence that is based on a distinct form of study. Conventional types
include biological surveys, biomarkers, ambient media toxicity tests, single-substance toxicity
tests, population and ecosystem models, and quantitative structure-activity relationships.
Explanation: The process of translating pieces or types of evidence into a characteristic of causation or
some other attribute that is relevant to a hypothesis. It is part of evidence integration in the
weighing of the body of evidence step.
Hazard: The potentially adverse properties of an agent that, with potential exposure of a receptor, impart a risk. Hazards are identified by a general causal assessment, such as an Integrated
Science Assessment, that typically employs a qualitative WoE.
Hypothesis: A proposition offered as a potential explanation of a phenomenon (the fish kill may be caused by high temperatures) or a potential outcome of a phenomenon (building more rooftops and parking areas will interfere with groundwater recharge).
Inference: (1) The act of reasoning from evidence. (2) A result of such reasoning.
Information: Data or other facts used to derive evidence.
Integration: The first step in the weighing of the body of evidence in which the pieces or categories of
evidence are combined to characterize the body of evidence.
Interaction: (1) A characteristic of causal relationships; it specifies that a causal agent impinges upon,
enters, binds to, or initiates a response in a susceptible entity in a way that can lead to the effect of
concern. (2) An instance of interaction of a causal agent on a susceptible entity that can lead to
an effect.
Interpretation: The second step in the weighing of the body of evidence in which the body of evidence
for each hypothesis is compared to other hypotheses or judged against a standard to determine
what conclusion is best supported by the weight of evidence.
Property: A feature of evidence that determines how much weight it should be assigned. In this
document, the major properties are relevance, strength, and reliability.
Property, collective: A feature of bodies of evidence that, along with the properties of the constituent
pieces of evidence, determines how much weight a body of evidence should be assigned. In this
document, the collective properties are number, coherence, diversity, and absence of bias.
Proposition: A condition, cause, prediction, or outcome hypothesized as a possible outcome of an
assessment.
Qualitative WoE: Weight of evidence to infer a quality such as causation or impairment.
Quality of interest: A distinctive attribute of an entity, relationship, or system that should be determined
to exist or not for a hypothesis in an assessment.
Quantitative WoE: Weight of evidence to estimate a quantity such as a benchmark value or a biodegradation rate.
Reasonable explanation: (1) A statement or account that coherently explains a body of evidence.
(2) Informed reasons for apparent inconsistencies in a body of evidence that provide coherence.
Refutation: The logical process of demonstrating the impossibility of a condition, cause, predicted effect,
or outcome.
Relevance: A property of a piece or type of evidence that expresses the degree of correspondence
between the evidence and the assessment endpoint to which it is applied.
Reliability: A property of evidence determined by the degree to which it has quality or other attributes
that inspire confidence.
Risk: The likelihood of adverse effects associated with exposure to a stressor.
Scatter: The distribution of measured or estimated values due to uncertainty, variability, or both.
Scoring: The process of formalizing the evaluation of evidence by assigning a numeral, term, or symbol.
Evaluation plus scoring constitutes weighting of evidence.
Strength: A property of evidence determined by the degree of differentiation from randomness or from
control, background, or reference conditions.
Sufficiency: (1) A characteristic of a causal relationship; it specifies that the causal agent or event must
be adequate to induce the effect in susceptible entities. (2) An occurrence of enough of an agent
or process to affect a susceptible entity.
Table, scoring: A table created in the weighting step that presents the pieces or categories of evidence
and their scores with respect to relevant properties.
Table, weight of evidence: A table created in the weighing step that summarizes the bodies of evidence
and their weights for one or more hypotheses.
Time order: (1) A characteristic of a causal relationship; it specifies that the cause precedes the effect.
(2) The sequence, in time, of the occurrence of a candidate cause and the effect of concern. It is
sometimes called temporality or temporal sequence.
Uncertainty: Lack of knowledge concerning the state of an organism or system or concerning the true
value of a quantity. Uncertainty is a property of the investigator and, unlike variability, may be
reduced by measurement or observation.
Variability: Heterogeneity over time, space, or members of a population. Variability is a property of
nature and may not be reduced by measurement or observation.
Weigh: Consider the relevance, strength, and reliability of the body of evidence to assess the likelihood
of a hypothesis.
Weight: (noun) The importance to an inference of a piece or category of evidence. (verb) To assign an
importance descriptor or score to a piece or category of evidence.
Weight of evidence: (1) A process of making inferences from multiple pieces of evidence, adapted from
the legal metaphor of the scales of justice. (2) The relative degree of support for a conclusion
provided by evidence. The result of weighing the body of evidence.
APPENDIX B. WEIGHT-OF-EVIDENCE METHODS FOR QUANTITATIVE RESULTS
Assessors are often confronted with multiple estimates of a quantitative value such as the median lethal
concentration (LC50) for a fish species, the half-life of a chemical in surface water, or the pH at which
species richness is reduced in streams. A single estimate might be derived from multiple estimates, by
weighing evidence to identify the best estimate or by using statistical techniques from meta-analysis to
combine estimates (Borenstein et al., 2009; Hunter and Schmidt, 2004).
A simple method for combining values is selection of the single best (i.e., weightiest) estimate. This
approach is commonly used because often one quantitative estimate is clearly superior to all others, so
averaging would diminish accuracy. This approach also might be necessitated by statistical
considerations, if the values cannot be combined because they do not represent independent samples from
a common population of estimates (Borenstein et al., 2009). When this approach is used, a good practice
is to identify in advance the properties of good estimates. Various procedures and criteria can be applied
to various study properties when weighting to identify the best value (see Section 5). For example, LC50
values might be screened using reliability properties such as use of good laboratory practices and then the
most relevant value could be chosen (e.g., the test performed in water most similar to water at the
assessment site).
A common method in meta-analysis is averaging of multiple estimates. The geometric mean is often used because many environmental data sets are skewed and approximately lognormal. Although weighting studies before averaging
them is not common in environmental assessments, weighted averages are conventional in meta-analysis.
The most common statistical model for weighting is the fixed-effects model, which weights by the inverse of the error variance (Borenstein et al., 2009). The term fixed effects indicates that the estimates are assumed to be samples of a common variable from a common population with a single fixed effect level (i.e., the only significant source of differences among the estimates being combined is sampling error). It
has the advantage of giving more weight to larger studies and of normalizing the variance. If this
assumption does not hold, a random-effects model or the IVhet model can be employed (Borenstein et al., 2009).
Weighting also might be based on properties of the individual members of a higher-level entity that is
being assessed. For example, in an assessment of ozone effects on forest production, estimates of
reduction in timber production of individual tree species were weighted by the proportional basal trunk
area before they were averaged to estimate loss of forest production (U.S. EPA. 2014b). Because of
weighting, the average represented the loss of forest production and not just the averages of production
losses for individuals of each species.
Although weighting in meta-analysis is usually based on statistical properties, weights based on
qualitative properties also can be used (Turner et al., 2009). This approach is appropriate when studies
vary significantly in terms of relevance or reliability relative to sampling variance or variance among the
studied systems. For example, chronic toxicity test endpoints, even for the same species, often differ in
the endpoint response, so statistical weighting would not capture the most critical differences among
chronic test data. A statistical model has been developed that incorporates data quality or any other
qualitative scale (Doi and Thalib, 2008). Statistical weighting models depend on variance estimates,
which are not available for many environmental values such as conventional toxicity test endpoints. In
those cases, weights can be assigned using improvised methods. Weighting for study quality has been
common in meta-analysis, but it may introduce bias if not done appropriately (Turner et al., 2009; Greenland and O'Rourke, 2001).
Multiple quantitative estimates also can be used in regression analyses such as derivation of
exposure-response relationships from multiple tests. As in averaging, the values are commonly weighted
by the inverse of their error variance.
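As an illustration, the following sketch fits a straight line to hypothetical pooled exposure-response points, weighting each point by the inverse of its standard error. Note that numpy's polyfit applies its weights to the residuals, so 1/sigma (not 1/sigma squared) is the appropriate weight for Gaussian errors; all data values are invented for the example.

import numpy as np

# Hypothetical pooled exposure-response points from several tests, each
# response estimate accompanied by its own standard error.
exposure = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
response = np.array([0.95, 0.90, 0.70, 0.45, 0.20])
std_err = np.array([0.05, 0.04, 0.08, 0.06, 0.10])

# Weighted least squares on log2(exposure); np.polyfit weights the
# residuals, which is equivalent to inverse-variance weighting of the
# squared residuals.
slope, intercept = np.polyfit(np.log2(exposure), response, 1, w=1.0/std_err)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")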
Quantitative results also can be combined into distributions. A common use is derivation of a value in
environmental assessments from species sensitivity distributions (SSDs). SSDs are statistical distribution
functions that, conventionally, are fit to the results of chemical toxicity tests and are generally interpreted
as representing the distribution of sensitivity of species in communities. They are used to derive an
exposure level that would be protective of a defined proportion of species or genera in communities or to
estimate the proportion of species or genera that a defined exposure level would affect. The derivation of
SSDs illustrates the issues involved in quantitative weighting of evidence (Box B-1).
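A minimal sketch of an SSD calculation is given below, assuming a log-normal distribution and hypothetical species mean acute values; actual criteria derivation imposes additional data requirements and fitting conventions.

import numpy as np
from scipy import stats

# Hypothetical species mean acute values (ug/L) for eight species.
smavs = np.array([12.0, 25.0, 40.0, 55.0, 90.0, 150.0, 310.0, 600.0])

# Fit a log-normal SSD by assuming the log10 values are normal.
log_vals = np.log10(smavs)
mu, sigma = log_vals.mean(), log_vals.std(ddof=1)

# HC5: the concentration below the sensitivity of 95% of species,
# i.e., the 5th percentile of the fitted distribution.
hc5 = 10 ** stats.norm.ppf(0.05, loc=mu, scale=sigma)
print(f"HC5 = {hc5:.1f} ug/L")

# Proportion of species expected to be affected at a given exposure.
affected = stats.norm.cdf(np.log10(100.0), loc=mu, scale=sigma)
print(f"proportion affected at 100 ug/L = {affected:.2f}")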
Finally, rather than combining estimates, data sets from multiple studies have been combined to perform a
statistical test or analysis. Some investigators combine data sets from multiple small studies with
statistically insignificant results to obtain a data set large enough for the differences among treatments to
be statistically significant. This practice is not recommended because it violates the premise of
hypothesis testing that the likelihood of a data set given a null hypothesis is being determined. It is a
form of p-hacking, which is the selection of data sets, tests, or hypotheses to achieve statistical
significance. Alternatively, data sets can be combined to obtain a potentially better estimate of a mean
and confidence interval than simply taking the mean of means. Also, multiple sets of exposure-response
data might be combined to obtain a more accurate regression model of the relationship. As with
averaging estimates, analyzing combined data sets requires attention to the statistical properties of the
data.
Choosing the best estimate poses fewer technical issues than combining estimates. Combining estimates
or data sets to generate an estimate introduces complications in computation and interpretation due to
heterogeneity. In contrast, in a well-designed individual study, what the data and the summary statistics
derived from the data represent is known. Different data sets representing the same relationship but
generated at different times in different places or at different scales also might be so different due to
ecological complexity and variability that they cannot or should not be combined. Further, if the studies
to be combined have greatly different sample sizes and variance structures, statistically combining them
might give counterintuitive results. At worst, studies that all indicate one relationship (e.g., declining
species richness with increasing exposure) might give the opposite result when combined (i.e., increasing
species richness with increasing exposure), a phenomenon known as Simpson's paradox (Blyth, 1972).
Finally, the set of studies may be biased, particularly by the bias against publishing studies that show no
effects. Thus, care should be taken when combining multiple estimates or data sets.
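The following deliberately artificial sketch illustrates Simpson's paradox: two hypothetical studies each show richness declining with exposure, yet the naively pooled data yield a positive slope because one study sampled richer streams at higher exposures.

import numpy as np

# Two hypothetical stream studies; within each, species richness
# declines as exposure increases (slope = -1).
exp_a = np.array([1.0, 2.0, 3.0]); rich_a = np.array([10.0, 9.0, 8.0])
exp_b = np.array([8.0, 9.0, 10.0]); rich_b = np.array([20.0, 19.0, 18.0])

slope_a = np.polyfit(exp_a, rich_a, 1)[0]  # -1.0
slope_b = np.polyfit(exp_b, rich_b, 1)[0]  # -1.0

# Pooling reverses the sign of the relationship.
pooled_slope = np.polyfit(np.concatenate([exp_a, exp_b]),
                          np.concatenate([rich_a, rich_b]), 1)[0]
print(slope_a, slope_b, pooled_slope)  # -1.0, -1.0, positive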
Despite these warnings, the benefits of combining estimates or data sets can be substantial. The
combined estimate might be more accurate, particularly when sampling error is important.
Inconsistencies among studies can be quantified and their sources can be analyzed. The influence on
results of methodological or environmental differences among studies can be quantified. Further, some
cases of combining estimates, such as averaging multiple toxicity test results for the same species
following a standard protocol with moderate water chemistries, are unlikely to go astray.
Qualitative WoE may be used to derive quantitative results without combining or choosing among
estimates. An example is the Rule of Five (Box B-2).
Box B-1. Combining and Weighting Data in Species Sensitivity Distributions
The combining and weighting of data in SSDs has two components. First, to derive the points in the
distribution, the results of multiple tests of a species or of closely related species are combined. When
deriving the species mean acute value (SMAV) for national ambient water quality criteria, multiple LC50s are
combined by taking the geometric mean and assigning that mean value to the species (U.S. EPA, 1985).
SMAV = exp [(Σ log LC50)/n]
The genus geometric mean value (i.e., the genus mean acute value or GMAV) is used when multiple
congeneric species have been tested.
GMAV = exp [(Σ log SMAV)/n]
Weighting is not part of the currently accepted method for combining test results into either the species or
genus means, and its implications have not been investigated. However, one might weight the test results (e.g., LC50s) before averaging, based on the variance, the number of organisms tested, the number of concentrations with partial responses, or some other property.
SMAV = exp [(Σ(wt log LC50))/Σwt]
Also, one might weight the species mean values before calculating the genus means based on the number
of tests of each species or some other property.
GMAV = exp [(Σ(ws log SMAV))/Σws]
(Note: Subscripts t and s indicate weights [w] for tests and species, respectively.)
Second, weighting might be involved in deriving the SSD from the species or genus mean values. If the SSD
is intended to represent aquatic communities, a problem can occur with over- or under-representing
particular taxa or functional groups, which might be resolved by weighting the data to achieve proportional
representation (Duboudin et al., 2004; Forbes and Calow, 2002; Suter et al., 2002). This situation would
require some difficult decisions. Should the weighting be based on the number of species in a higher taxon,
a trophic group, or other grouping; on the relative abundance or biomass; or on some measure of
importance? Invertebrates would be weighted more than vertebrates for species richness, abundance, or
biomass and perhaps much less for importance (depending on the measure of importance). Also, the
assumption that the weighting should be based on particular categories [e.g., algae, invertebrates, and fish
(Duboudin et al., 2004; Forbes and Calow, 2002)] presupposes low variance in sensitivities within those
categories relative to between categories. That assumption is questionable given the large taxonomic
differences within categories (e.g., cyanobacterial versus eukaryotic algae and arthropods versus mollusks,
rotifers, and other invertebrates), which result in large differences in sensitivity. The communities
represented by an SSD also would influence the weights. For example, stream invertebrates are
predominantly benthic insects, but planktonic crustaceans and rotifers are important in lakes. Clearly,
research would be required before introducing a weighting scheme for SSDs.
The derivation of weighted SSDs has been demonstrated by Semenzin et al. (2015), OECD (2013), and U.S. EPA (2013), who weighted species and derived the SSD by weighted bootstrapping.
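The sketch below implements the weighted geometric means shown above. The LC50 values, the weights, and the choice of weighting property are hypothetical and serve only to make the formulas concrete.

import numpy as np

def weighted_geomean(values, weights=None):
    # Geometric mean, optionally weighted: exp[(sum w log x)/(sum w)].
    values = np.asarray(values, dtype=float)
    w = np.ones_like(values) if weights is None else np.asarray(weights, dtype=float)
    return np.exp(np.sum(w * np.log(values)) / np.sum(w))

# Hypothetical LC50s (ug/L) for one species, weighted here by the number
# of test concentrations with partial responses (an illustrative choice).
lc50s, w_t = [120.0, 95.0, 160.0], [3, 1, 2]
smav = weighted_geomean(lc50s, w_t)

# Species mean values for two congeneric species, weighted by the number
# of tests per species, give the genus mean acute value.
gmav = weighted_geomean([smav, 140.0], [3, 1])
print(f"SMAV = {smav:.1f}, GMAV = {gmav:.1f}")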
Box B-2. Cleanup Goals by Weight of Evidence Using the Rule of Five
Superfund cleanup goals for ecological risks are concentrations of contaminants that are higher than the
no-observed-adverse-effect level (NOAEL) and lower than the lowest-observed-adverse-effect level
(LOAEL). Because the interval between these values could be large (i.e., greater than a factor of 10) and
because the measures of effect that define the bounds are not consistent, the common practice of
interpolating by taking the geometric mean is often inadequate. Instead, a technique called the Rule of
Five is used (Greenberg and Charters, 2007; Charters and Greenberg, 2004). The NOAEL-LOAEL interval is
divided into six geometrically scaled intervals delimited by seven nodes. The node chosen as the cleanup
goal depends on characteristics of the NOAEL and LOAEL, including their relevance to the assessment
endpoint and their severity (e.g., Is the LOAEL defined by acute lethality or chronic reproductive effects?).
The choice of node also can be influenced by measures of effects other than the LOAEL and NOAEL and by
qualitative evidence concerning ecological exposure and effects on the site. The weighing of evidence to
choose a node can be, at least in part, rule based. For example, a LOAEL based on mortality would move
the choice one node below the middle node. The WoE also uses expert judgment. An example of applying
the Rule of Five is in the Baseline Ecological Risk Assessment for the Estuary at the LCP Chemical Site in
Brunswick, Georgia.
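A minimal sketch of the node calculation follows, assuming hypothetical NOAEL and LOAEL values; the illustrative rule that a mortality-based LOAEL moves the choice one node below the middle is taken from the example in the text.

import numpy as np

# Hypothetical NOAEL and LOAEL (mg/kg) bounding a cleanup goal.
noael, loael = 10.0, 250.0

# Seven geometrically scaled nodes delimit six equal log-scale intervals
# (node 0 = NOAEL, node 3 = geometric mean, node 6 = LOAEL).
nodes = np.geomspace(noael, loael, num=7)

middle = nodes[3]                 # default choice: the geometric mean
goal_mortality_loael = nodes[2]   # one node lower if LOAEL is mortality based
print(np.round(nodes, 1), middle, goal_mortality_loael)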
Another example of using qualitative methods to weigh evidence for quantities is the selection of
surrogate values. When there are no data for a property of a chemical, the best estimate of the property
may be derived from the best surrogate chemical for which the property is known. This approach, termed
"read across," can be based on weighing multiple attributes of the chemicals including structure,
physical/chemical properties (e.g., melting point, octanol/water partitioning coefficient) or biological
properties [e.g., enzymes induced (ECHA, 2015; Wang et al., 2012)].
APPENDIX C. WEIGHT-OF-EVIDENCE METHODS FOR DERIVING A MODEL
Weight of evidence (WoE) can play three roles in the derivation of mathematical models. First, the
selection of a model from a set of alternatives can be based on WoE, where WoE refers to the relative
weight provided to a model by the evidence (i.e., the available data). Good (1950) defined the WoE for
members of a set of alternative models as their relative likelihoods, given a data set. Models can be
compared using the sum of squares or another simple goodness-of-fit statistic, the Bayes factor, or Akaike's information criterion (Anderson, 2008; Hilborn and Mangel, 1997). If the models represent alternative
causal hypotheses, choice of the model that best explains the data can be said to have chosen the best
causal explanation of the modeled effect (Newman and Clements, 2008). This use of the term WoE is
distinct from the others in this document. Rather than weighting and weighing multiple pieces or types of
evidence, Good's followers weigh the alternative models against each other with respect to a common
data set. That approach is included here for completeness because it has appeared in the ecological
assessment literature (Linkov et al., 2009; Smith et al., 2002). Rhomberg (2015) described using relative
likelihoods for selecting among human toxicological hazards. He admits, however, that it is difficult to
calculate likelihoods of models of toxicological hazards, and in practice, he uses qualitative WoE.
Bayesian model averaging extends this concept to provide a more recognizable example of weighing
evidence to derive a model (U.S. EPA, 2009b). That is, rather than choosing a single best model, one
might combine suitable models after weighting them based on their weights in Good's sense (i.e., their
consistency with the data).
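For illustration, the sketch below computes Akaike weights, a common implementation of the relative likelihoods of models given the data, and a model-averaged prediction; the AIC values and predictions are hypothetical.

import numpy as np

# Hypothetical AIC values for three candidate exposure-response models.
aic = np.array([412.3, 410.1, 418.7])

# Akaike weights: relative likelihoods normalized to sum to 1.
delta = aic - aic.min()
weights = np.exp(-0.5 * delta)
weights /= weights.sum()

# Model averaging: weight each model's prediction by its Akaike weight
# rather than choosing the single best model.
predictions = np.array([0.21, 0.18, 0.30])  # hypothetical EC20 estimates
averaged = np.sum(weights * predictions)
print(weights, averaged)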
A more conventional use of the concept of WoE is the weighing of evidence concerning model
assumptions. For example, to determine whether certain phenomena should be included in a model, one
might assemble, weight, and weigh evidence concerning their occurrence. Such phenomena might
include compensatory processes in a fish population, dietary toxicity in aquatic invertebrates exposed to
an aqueous contaminant, nonlinearities in exposure-response relationships, or avoidance behavior at
subtoxic exposures. The methods for weighing evidence for these model assumptions would be the same
as for other qualitative conclusions. For example, Hertzberg and Teuschler (2002) developed a numerical
index to weigh three properties of evidence for alternative models of mixtures toxicity.
Finally, model selection can be based on weighing a mixture of statistical and biological considerations.
The benchmark dose guidance recommends choosing a dose-response model based on both biological
plausibility and Akaike's information criterion (U.S. EPA, 2012a). A Science Advisory Board panel has
recommended broadening the statistical and nonstatistical considerations to be weighed during model
selection and using a more formal and transparent weighing process (SAB, 2015). For example, for
dose-response modeling, they recommended prioritizing regression models that directly use exposure data
for individuals.
Because any causal hypothesis, in theory, can be represented by a mathematical model, uses of WoE for
model selection could be equivalent to the approaches for deriving qualitative conclusions presented in
Sections 3-7.
APPENDIX D. WEIGHT-OF-EVIDENCE APPROACHES FOR QUALITATIVE CONCLUSIONS
This appendix summarizes the diversity of weight-of-evidence (WoE) approaches that have been used in
assessments to answer questions concerning environmental qualities. Each approach has been useful in
some cases, and they were carefully considered during the development of the approach presented in this
document.
D.l. Narrative Weight of Evidence
Traditional narrative literature reviews and the assessments that adopt their approach describe a body of
evidence and reach a conclusion by methods and criteria that are not explicit. The expertise of the
reviewer is assumed sufficient to ensure a correct conclusion. The WoE is the conclusion that the author
reaches through reading the literature and presents to the reader by logically structuring the review's
narrative. This is a common approach for weighing evidence in environmental assessments. Narrative
WoE is highly flexible, but the method by which results are obtained is not as transparent or reproducible
as for other methods.
D.2. Consideration-Guided Narrative Weight of Evidence
WoE narratives can be given greater structure and logical consistency by organizing the narrative in terms
of a set of considerations. For example, the U.S. Environmental Protection Agency's (EPA's) Integrated
Risk Information System and Integrated Science Assessment documents use modifications of Hill's
considerations when weighing evidence of general causation. Also, the WoE method for Tier 1 screening
of potentially endocrine-disrupting chemicals provides five considerations (U.S. EPA, 2012b), and the
chemical risk assessments conducted pursuant to the Toxic Substances Control Act use seven
considerations (U.S. EPA, 2015b). However, these WoE methods depend on narratives and have not
explicitly weighted or tabulated the evidence for the considerations. Also, the considerations are mixtures
of various types of issues to consider, not consistent sets of criteria, properties, types of evidence, or
characteristics. This method is an advance over unstructured narratives and is generally accepted.
D.3. Evidence Summary and Scoring Tables
Methodological rigor and transparency can be increased by creating summary tables for the evidence
concerning an inference and assigning scores. For example, a set of summary and scoring tables was
developed at ORNL for contaminated site assessments (Suter et al., 2000). The authors included tables
for summarizing the methods and results for each type of evidence: biological surveys, bioindicators,
ambient toxicity tests, tissue analyses, and single-chemical toxicity. Another table combined the results
of the types of evidence (a score and associated explanation) and presented a score for the body of
evidence for an endpoint.
D.4. Criteria-Guided Scoring
Criteria for weighting evidence have been combined with evidence scoring tables in the Stressor
Identification Guidance Document and Causal Analysis/Diagnosis Decision Information System
(CADDIS) to determine the causes of specific environmental impairments (http://www3.epa.gov/caddis/).
Several examples of the application of this approach can be found in the CADDIS case studies. Pieces or
types of evidence were scored based on Susser's method (U.S. EPA, 2000; Susser, 1986), but scoring
types of evidence has been supplemented by scoring characteristics of causation to explain the
relationship of the evidence to the hypotheses (Norton et al., 2014; Cormier et al., 2010). This approach
also has been applied to a risk assessment. The Bristol Bay watershed assessment used tables of evidence
and scores to clarify the WoE for risks to salmon from spills of tailings, product concentrate, and diesel fuel (U.S. EPA, 2014a). Pieces of evidence were described and scored for logical implication (relevance),
strength, and quality (reliability), and summary scores were derived for the bodies of evidence. A version
of criteria-guided scoring is presented in this document that generalizes the approach beyond causal
assessment.
D.5. Standardized Scoring and Weighting
Consistency can be increased by specifying the evidence and criteria to be considered and the scores and
weights to be assigned to the possible outcomes for each type of evidence. An example is provided by the
Massachusetts weight-of-evidence method for assessing ecological risks at contaminated sites (Menzie et
al., 1996). The method numerically scores evidence (termed measurement endpoints) on a scale of 1 to 5
for 10 properties, weights the scores by scaling values based on the importance of the properties,
normalizes the scaled scores, and sums the result to give a "measurement endpoint weight." The body of
evidence is then qualitatively weighed based on concurrence of the numerical weights. The
standardization of scores and weights requires consistency of the cases to which the method is applied and
is facilitated by consensus of all parties.
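The sketch below gives one plausible reading of such a standardized procedure; the scores, the scaling values, and the normalization to a 0-1 scale are assumptions for illustration and are not the published Massachusetts algorithm.

import numpy as np

# Hypothetical scores (1-5) for one measurement endpoint on ten
# properties, with importance scaling values for those properties.
scores = np.array([4, 3, 5, 2, 4, 3, 5, 4, 2, 3], dtype=float)
scaling = np.array([5, 4, 4, 3, 3, 3, 2, 2, 1, 1], dtype=float)

# Scale each score by its property's importance, normalize, and sum to
# a single value on a 0-1 scale (the "measurement endpoint weight").
normalized_scaling = scaling / scaling.sum()
endpoint_weight = np.sum(normalized_scaling * scores) / 5.0
print(f"measurement endpoint weight = {endpoint_weight:.2f}")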
D.6. Indices
Standardized numerical scoring and weighting of evidence has been extended to the calculation of rather
complex risk indices (Benedetti et al., 2012; Dagnino et al., 2008; Semenzin et al., 2008; Bombardier and Blaise, 2000). These indices are not quantitative risk estimates, but they often contain component
calculations that are part of quantitative risk assessments such as sums of toxic units, so that the index
values provide estimates "of relative hazard for the sites being investigated" (Bombardier and
Bermingham, 1999). Because the indices often combine evidence across pieces, types, and endpoints,
many of them are computationally complex and require support systems for data entry and computations
(Semenzin et al., 2009).
Indices are also used to combine multiple metrics from biological survey data into a value that can be said
to represent biological integrity (Karr et al., 1986) or deviation from reference condition (Hawkins, 2006).
These indices are arithmetic combinations of metrics believed to discriminate impaired and unimpaired
waters. Their derivation is not normally considered to constitute a weight-of-evidence process, but they
do combine multiple pieces of evidence to determine a quality—the impairment of a biotic community.
Indices are specialized and their results are difficult to relate to the input data. For example, an index of
biotic integrity serves its purpose of identifying impaired waters by numerically aggregating numerous
incommensurable biological metrics. To assess the cause of the impairment or risks from remedial
actions, however, the individual metrics should be extracted, weighted, and weighed appropriately, as
described in Sections 4-6.
D.7. Rule Based
If the potential evidence results are few and all the results are reliable, an interpretation of each potential body of evidence can be specified in advance. The decision matrix for the sediment quality triad is a commonly used example of this approach to weighing evidence [Table D-1 (Chapman, 1990)]. The triad
is used to determine whether toxic contaminants are impairing a sediment community. Its elements are
chemical analyses of sediments that are compared to toxicological benchmark values, toxicity tests of the
sediment and surveys of the sediment biotic community. If the scores for the triad elements are, for
example, -, +, +, "unmeasured toxic chemicals are causing degradation," but if it is -, -, +, "alteration is
not due to toxic chemicals." The logic of Chapman's decision matrix is correct if the evidence is reliable,
but the evidence might not be sufficiently reliable to support it for many reasons (Table D-2). As a result,
weighting and interpreting the evidence often is required. Thus, the standard decision matrix is an
idealization that serves primarily to clarify the potential relationships among types of evidence. More
recently, ambiguities in results have been recognized and new decision matrices have been developed for
sediment assessments that specify a decision for only a few of the possible outcomes; for the other
outcomes that include conflicting results, they direct assessors to perform additional analyses (COA Sediment Task Group, 2008; Grapentine et al., 2002). Those additional analyses could use any of the
other qualitative WoE methods.
Table D-1. Inference based on the sediment quality triad (Chapman, 1990)

Situation | Chemicals Present | Toxicity | Community Alteration | Possible Conclusions
1 | + | + | + | Strong evidence for pollution-induced degradation
2 | - | - | - | Strong evidence for no pollution-induced degradation
3 | + | - | - | Contaminants are not bioavailable, or are present at nontoxic levels.
4 | - | + | - | Unmeasured chemicals or conditions exist with the potential to cause degradation.
5 | - | - | + | Alteration is not due to toxic chemicals.
6 | + | + | - | Toxic chemicals are stressing the system but are not sufficient to significantly modify the community.
7 | - | + | + | Unmeasured toxic chemicals are causing degradation.
8 | + | - | + | Chemicals are not bioavailable or alteration is not due to toxic chemicals.

Responses are shown as either positive (+) or negative (-), indicating whether measurable and potentially significant differences from control/reference conditions are determined.
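Because the matrix is fully rule based, it can be encoded as a simple lookup table, as in the sketch below; the boolean encoding of the + and - results is an implementation choice, not part of Chapman's method.

# Lookup-table sketch of the triad decision matrix (Table D-1).
# Keys are (chemistry, toxicity, alteration); True indicates a
# measurable difference from control/reference conditions.
TRIAD_MATRIX = {
    (True,  True,  True ): "Strong evidence for pollution-induced degradation",
    (False, False, False): "Strong evidence for no pollution-induced degradation",
    (True,  False, False): "Contaminants not bioavailable or at nontoxic levels",
    (False, True,  False): "Unmeasured chemicals or conditions could cause degradation",
    (False, False, True ): "Alteration is not due to toxic chemicals",
    (True,  True,  False): "Toxic chemicals stress the system but do not modify the community",
    (False, True,  True ): "Unmeasured toxic chemicals are causing degradation",
    (True,  False, True ): "Chemicals not bioavailable, or alteration not due to toxic chemicals",
}

def triad_conclusion(chemistry: bool, toxicity: bool, alteration: bool) -> str:
    return TRIAD_MATRIX[(chemistry, toxicity, alteration)]

print(triad_conclusion(False, True, True))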
Table D-2. Some ways in which the triad decision table might fail (not including errors or poor technique)

Type of Evidence | False Positive (Exaggerated Effect) | False Negative (Minimized Effect)
Single-chemical analyses/toxicity tests | Higher bioavailability in the laboratory; site background is high; overly sensitive test organisms (sensitive test species or life stages not present at the site) | Short duration; insensitive test species or life stages; critical response not included; multiple chemicals interact; incomplete routes of exposure
Whole-medium samples/toxicity tests | Bioavailability increased by handling; sampling below bioactive zone; overly sensitive test organisms; inappropriate reference sites | Chemical lost or altered by handling; insensitive test organisms; inappropriate reference sites; ambient conditions (e.g., UV light) are important to effects; critical response not included; contaminated locations or times missed
Sampling locations/biological surveys | Confounding; inappropriate reference sites | Organisms too rare; response too infrequent; sensitive organisms not sampled; critical response not included; large sample scale; contaminated locations missed; inappropriate reference sites
Another rule for weighing evidence is independent applicability. The Office of Water classifies a water
body as impaired if an adequate piece of evidence demonstrates an impairment, even if other evidence did
not detect the impairment. This policy takes into consideration the possibility of false negatives and that
different types of evidence detect different types of impairment (Box D-1).
A two-stage rule was developed by the European Center for Ecotoxicology and Toxicology of Chemicals
for weighing human and animal data in chemical risk assessments (ECETOC, 2009). The category of
evidence with the highest quality for a chemical is used, but if the quality of animal and human evidence
is similar, an effect is assumed to be caused by the chemical if either category of evidence shows the
effect. This logic could be applied to more than two categories of evidence if equivalent quality scales
could be developed for all categories.
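The two-stage logic can be expressed as a short function. In the sketch below, the numeric quality scale and the threshold for judging qualities "similar" are assumptions for illustration; the source describes the rule only qualitatively.

def assume_effect(human_quality, animal_quality, human_effect, animal_effect):
    # Two-stage rule: if the categories are of similar quality, assume
    # the effect when either category shows it; otherwise defer to the
    # higher-quality category. The threshold of 1 is an assumed value.
    if abs(human_quality - animal_quality) <= 1:
        return human_effect or animal_effect
    return human_effect if human_quality > animal_quality else animal_effect

print(assume_effect(4, 4, False, True))  # similar quality: effect assumed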
D.8. Quantitative Alternatives
Some authors have suggested that quantitative methods are inherently superior for making qualitative
inferences from multiple pieces of evidence (Linkov et al., 2009). In particular, multi-criteria decision
analysis (MCDA) and Bayesian belief networks (BBNs) have been suggested to be alternatives to
conventional WoE (Linkov et al., 2015; Fenton and Neil, 2013).
Box D-1. Independent Applicability and Weight of Evidence
Independent applicability of evidence of impairment has been a policy of the EPA Office of Water. That is,
if a water body fails to meet water quality criteria, is toxic, or is biologically impaired, it is an impaired water
body. Independent applicability can result from two possible logics.
1.	All types of evidence are fallible, so any evidence can generate false negative results. Rather than weigh
evidence of impairment against evidence that failed to detect impairment, the evidence of impairment is
accepted. Therefore, if any type of reliable evidence detects impairment, the water body is considered
impaired. This logic is defensible if the consequences of failing to detect an impaired ecosystem are more serious than the consequences of incorrectly declaring an unimpaired ecosystem impaired. This justification is policy based. In addition, the logic can be defended by recognizing that poor methods or poor implementation are inherently more likely to fail to detect an effect than to detect a false effect. This justification is based on the fact that poor practices introduce extraneous variance, are not spatially comprehensive, and otherwise tend to obscure effects.
Further, it is unlikely that a type of evidence will test or measure the most sensitive conditions, taxa, life
stages, responses, and interactions between species and stressors. Therefore, even good studies are likely
to miss real effects. This does not imply that false positives never occur (see Table D-1), but rather, poor
evidence is more likely to result in false negatives and unprotective decisions.
2.	Each type of evidence addresses a distinct aspect of impairment. If any one aspect is found in a water
body, the water body is impaired in that way. This assumption involves no synthesis. Each type of evidence
is independently evaluated with respect to its aspect of impairment. This is explicitly the case with the
European Union's Water Framework Directive that distinguishes good chemical status from good ecological
and good hydrological statuses (van der Ohe et al., 2014; EC, 2013). European waters should achieve all
three. Equivalently, the policy of independent applicability can be interpreted as a requirement to protect
three types of integrity: chemical, toxicological, and ecological. This approach is supported by the inherent
differences in the types of evidence. For example, a biocriterion based on the index of biotic integrity
discriminates sites on a human disturbance gradient (Blocksom and Johnson, 2009). The index responds to
the multiple stressors that occur commonly on such gradients and is most responsive to those stressors
that are common components of human disturbance. Many of those stressors, such as flashy flow, have no
numeric criteria. On the other hand, water quality criteria are based on responses to particular chemicals
or other stressors and not necessarily to common human disturbances of aquatic systems. Water quality
criteria and biocriteria are criteria for different impairments.
Note that this discussion applies to integrating evidence across types. When applying independent
applicability, one might weigh evidence within types. For example, multiple toxicity tests of an ambient
water could be combined to determine whether the water has unacceptable toxicity. Also, once a water
body is declared impaired, WoE approaches can be applied when assessing the risks associated with
alternative remedial or regulatory actions.
Although WoE integrates evidence concerning some environmental quality of interest, MCDA is intended
to go farther and identify the best decision, given the problem framing, decision criteria, and preferences
[e.g., Which is the best option for dredge spoil disposal given costs, risks, and stakeholder preferences?
(Stahl et al., 2002)]. MCDA conventionally requires that probabilities be estimated for the outcomes of a
decision and that a decision maker assign utilities to attributes of the outcomes (i.e., the decision criteria)
to allow calculation of the expected utility of a decision (Linkov and Moberg, 2012). Weighting is
involved in conventional MCDA when combining the multiple decision criteria (cost, aquatic toxicity,
public acceptance, etc.). The computational methods of MCDA, however, have been adapted for a
variety of uses beyond decision making, including combining numerically weighted evidence (Linkov
and Moberg, 2012). One example is weighing the evidence for adverse outcome pathways [Box 3-3 (Collier et al., 2016)]. Linkov et al. (2015) described MCDA as a suitable proxy for Bayesian analysis.
BBNs are graphical models of the uncertainties and dependencies used to calculate the probability of an
outcome in a causal network (Fenton and Neil, 2013). BBNs adapted for decision analysis are termed influence diagrams (Carriger and Barron, 2011). BBNs are computationally complex and not familiar to
most environmental assessors and stakeholders, but available software makes their application relatively
easy. When BBNs are involved in environmental inferences, they may not have data-derived
probabilities; instead expert judgment is commonly used. These networks are WoE in the sense that the
probability of a variable's state can be considered as its weight and branches of the network can be considered chains of evidence (Small, 2008). For example, the probabilities of fishery closure could be
one weighted piece of evidence and probabilities of levels of ecological impacts might be another; these
can be combined by calculating the conditional probabilities to derive levels of offshore ecosystem
services impacts (Carriger and Barron, 2011). Carriger and Barron (2016) have adapted BBNs to
explicitly weigh multiple types of ecological evidence. They treated the three types of evidence in the
sediment quality triad as branches of a BBN, calculated probabilities that each type was indicative of
effects, weighted each in terms of its importance, and then calculated an overall probability of
impairment. Because this method does not use a causal network, it is no longer a BBN in the
conventional sense. However, it does provide a means of computationally combining evidence rather
than using qualitative logic.
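The sketch below illustrates the general idea of combining weighted evidence probabilities; the probabilities, the importance weights, and the weighted-average combination rule are hypothetical and do not reproduce the published network.

import numpy as np

# Hypothetical probabilities that each triad element indicates effects
# (chemistry, toxicity, community) and importance weights for the three
# types of evidence (illustrative values only).
p_effect = np.array([0.80, 0.60, 0.90])
importance = np.array([0.2, 0.3, 0.5])

# One simple combination: an importance-weighted average interpreted as
# an overall probability of impairment.
p_impaired = np.sum(importance * p_effect) / importance.sum()
print(f"overall probability of impairment = {p_impaired:.2f}")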
Although these methods have not been used for conventional qualitative WoE tasks such as identifying
causes or hazards, nothing in this document precludes using them where appropriate. An example of a
relevant potential regulatory application is the use of BBNs to optimize choices of toxicity tests in
European chemical regulation in place of judgment-based WoE (Jaworska et al., 2010). This case is ideal
for these quantitative methods because it is a relatively simple and consistent problem with a clear
decision structure. Objectively estimating constituent probabilities of, for example, a positive rat
carcinogenicity test given a positive Ames assay is possible. One simply needs a sufficient database of
test results.
D.9. Summarize Only
Some assessors have argued that the weighing of evidence should not determine what conclusion is best
supported by the evidence (Gray, 1994). Instead, they argue that assessors should assemble, clearly
summarize, and simply present the evidence to the decision maker with enough background information
for them to make a decision. Although this approach takes subjective judgments concerning the
integration of evidence away from assessment scientists, it confers responsibility on decision makers to
interpret scientific information, which they might not be trained to do, and requires time for inferential
work that they might not have.
D.10. Conclusions
This summary of WoE approaches is intended to show the range of techniques that might be used for
qualitative WoE. Methods proposed or used, however, do not necessarily fit into these categories. For
example, McDonald et al. (2007) translated qualitative weights to numbers that are combined
algebraically and then translated into a final qualitative conclusion. This method is typical of the WoE
literature, which is replete with methods developed for the individual application. The inclination to
develop a new WoE method for each application might be diminished by the availability in this document
of a generic WoE approach that is flexible but provides some consistency across a variety of applications.
APPENDIX E. CHARACTERISTICS OF INFERRED QUALITIES
Qualitative weight of evidence (WoE) addresses qualities, each of which is defined by certain characteristics. For example, if the quality of interest is causation, the hypothesized relationship should have the characteristics of causation, such as interaction between the causal agent and the affected entity. Association
of evidence with characteristics of causation, or characteristics of any other quality being addressed, can
serve two purposes. First, characteristics explain what the evidence does for the hypothesis (Section 6.2).
For example, aqueous concentrations of a chemical in streams can demonstrate co-occurrence but not
interaction. Second, categorizing the evidence in terms of the characteristics can help to perform and
justify the inference. For example, a WoE table organized in terms of causal characteristics may show
that there is abundant evidence for co-occurrence but none for interaction, which would raise questions
about bioavailability. Only characteristics of causation are established, but considering appropriate
characteristics of other inferred qualities such as impairment or protectiveness also might be useful.
Potential characteristics for five qualities are presented here.
E.l. Characteristics of Causation
In environmental assessments, WoE has been used primarily to assess causality. In fact, WoE has been
equated with determining the degree of credence due a causal hypothesis (Rhomberg et al., 2013; Krimsky, 2005). WoE is particularly associated with causation because causality is a fundamental concept, and no one piece or type of evidence can prove causation (Norton et al., 2014). Rather, the
relationships of evidence to characteristics of causal relationships should be considered. Six
characteristics of causation, developed for ecological causal assessments, have proven useful for that
purpose (Table E-l). By analogy to Hill's considerations, they have been called Cormier's causal
characteristics (CCC).
Few bodies of evidence for causal assessments include evidence for all of these characteristics. Even in
very well-studied cases, evidence of conditions prior to induction of the effect is seldom available to
consider time order (Table 7-1). Even an incomplete WoE table based on characteristics, however,
provides more insight into what is known about causation in a case than a table based only on types of
evidence.
Although these characteristics were developed for causation in specific cases, they also can be relevant to
general causation (i.e., Does agent x cause effect y?), if actual cases are available. The characteristics can
be used to demonstrate that, at least in those actual cases, the agent could cause the effect. They can be
particularly useful when the general question is framed in terms of real-world conditions. An example is
the use of the characteristics to determine that, where aqueous conductivity is elevated due to major
mineral ions in West Virginia and Kentucky, the ion mixture causes the extirpation of invertebrate genera
(Cormier et al., 2013; Cormier and Suter, 2013; U.S. EPA, 2011b).
Although the characteristics apply to, and can be observed in, actual cases, they are not observable in the same way when the general causal question is hypothetical. That is, hypothetically, can x cause y rather than, in fact, did x cause y or is x causing y? In particular, when the only evidence of effects is from
toxicity tests, the co-occurrence and its antecedents are simply the design and setup of the test, and the
time order is inevitable. Sufficiency in hypothetical causal assessments is not of interest because no
particular effect level is identified, so no particular exposure is sufficient. Instead, we use the test data to
establish an exposure-response relationship that can be applied generally. The real causal concerns are
with the relevance of the co-occurrence, interaction, and alteration in the test to those in the field. Are the
conditions, durations, species, and life stages in the tests relevant to the co-occurrence of interest in the
field? Are the interactions of the stressor and test organisms and the resulting alterations plausible in the
field? Are others likely to be more sensitive or important?
Table E-1. Characteristics of causal relationships (a)

Causal Characteristic | Description | What Evidence of a Characteristic Shows
Time order | The cause precedes the effect. | Change in the entity after exposure to the cause and not before
Antecedence | The causal relationship is a result of a larger web of antecedent cause-and-effect relationships. Evidence includes sources and routes of transport. | Earlier events that led to the particular causal event. It demonstrates that a potential causal event is part of a network of prior causal events, which increases confidence that the potential causal event actually occurred (e.g., that it was not a result of a measurement error or hoax).
Co-occurrence | The cause co-occurs with the susceptible entity in space and time. Evidence includes contaminant concentrations in ambient media and habitat alterations. | The presence of both a causal agent and the affected entity in circumstances that create the potential for exposure
Interaction | The cause interacts with the entity in a way that can lead to the effect. Evidence includes injuries, body burdens, and biomarkers. | Signs that the entity has been exposed to the agent in such a way that changes can be initiated
Sufficiency | The intensity, frequency, and duration of the cause are adequate, and the entity is sufficiently susceptible, to produce the type and magnitude of the effect. | Enough of the cause and a sufficiently susceptible entity that can result in the level of the observed effect
Alteration | The entity is changed by physical, chemical, or other mechanisms leading to the defined effect. Evidence includes symptoms and related attributes of the affected entities. | Changes in the entity attributable to or at least appropriate to the cause, which can indicate that the mechanism is acting

(a) Modified from Norton et al. (2014); Cormier et al. (2010).
The characteristics of causation can be arranged in a logical sequence: antecedent causes lead to
co-occurrence of the proximate cause with susceptible or affected organisms, which leads to interaction
with the affected organisms and, if the interaction is sufficient, leads to alteration of the organisms. The
characteristics also serve to prompt assessors to make full use of the evidence with respect to all
characteristics of causation. For example, it is obvious that laboratory test data provide evidence
concerning sufficiency, but the data also might provide evidence of alterations characteristic of organisms
affected by the tested agent that might be observed in the field.
E.2. Characteristics of Protection
Criteria, standards, screening levels and other benchmarks are intended to define a limit on exposure that
would be adequately protective. Protection, like causation, is a difficult concept with multiple
characteristics (Table E-2). Benchmarks are usually derived using one piece or one type of evidence.
WoE might be used, however, to assess the confidence provided by diverse evidence that a benchmark or
a procedure for deriving benchmarks is appropriately protective. Although the U.S. Environmental
Protection Agency has not explicitly used this potential application of WoE, characteristics like those in Table E-2 might be useful.
Table E-2. Potential characteristics of a protective benchmark

Characteristic of Protective Benchmark | Description
Causal relationship | The exposure-response relationship used in the derivation is likely to be causal.
Predictive model of the relationship | The model of the relationship predicts the effects of specified exposures.
Represents sensitive endpoints | Known sensitive endpoint species or other entities and responses can be represented by the relationship.
Exposure metric is relevant | The exposure metric represents the relevant exposure of the sensitive entities.
Set at an appropriately protective level | Independent data demonstrate, with sufficient confidence, a lack of effects of concern on the aquatic community and constituent taxa.
Discriminatory power | The benchmark discriminates between impaired and unimpaired communities with sufficient confidence.
E.3. Characteristics of Contaminants of Concern
Risk assessments of contaminated sites typically develop lists of contaminants of concern based on
chemical analyses of site media. The contaminants of concern are then the subjects of the risk
assessment. Concern may be determined by comparison to ecological benchmarks, but more often
multiple characteristics of a contaminant are weighed to determine which chemicals make the list. The
characteristics in Table E-3 are based on a textbook and a recent Superfund ecological risk assessment
(Black & Veatch Special Projects Corp. 2011; Suter et al. 2000); the list may not be exhaustive. An
illustrative screening sketch follows the table.
Table E-3. Characteristics of a contaminant of concern at a contaminated site

Associated with the waste or source
    Chemicals that are not associated with the waste or effluent of interest could be from an extraneous local source, regional contamination, or natural background.
Frequently detected
    Chemicals that are infrequently detected at the site are of lesser concern and might be artifacts.
Greater than local background
    Chemicals at background levels are not of concern, unless the anthropogenic form is more bioavailable or toxic.
At potentially toxic levels
    Concentrations above relevant ecotoxicological benchmarks are of concern.
Detection limits are inadequate
    Chemicals that could not be detected at toxic levels are of concern, particularly if they are associated with the waste or source.
Belongs to the same class as an identified contaminant of concern
    A chemical that is not of concern alone might belong to a class, such as polycyclic aromatic hydrocarbons, that could be assessed jointly.
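As a purely illustrative sketch, and not an EPA procedure, the screening logic suggested by Table E-3 might be expressed in Python as follows. All field names, thresholds, and data are hypothetical, and the actual weighing of the flagged characteristics is left to the assessor:

    # Illustrative sketch only: flagging which Table E-3 characteristics an
    # analyte exhibits. Field names, thresholds, and data are hypothetical.

    def concern_characteristics(chem):
        """Return the characteristics of concern that an analyte exhibits."""
        reasons = []
        if chem["source_associated"]:
            reasons.append("associated with the waste or source")
        if chem["detect_frequency"] >= 0.05:  # hypothetical frequency cutoff
            reasons.append("frequently detected")
        if chem["max_conc"] > chem["background"]:
            reasons.append("greater than local background")
        if chem["max_conc"] > chem["benchmark"]:
            reasons.append("at potentially toxic levels")
        if chem["detection_limit"] > chem["benchmark"]:
            reasons.append("detection limits are inadequate")
        return reasons

    cadmium = {"source_associated": True, "detect_frequency": 0.8,
               "max_conc": 4.2, "background": 0.5, "benchmark": 0.25,
               "detection_limit": 0.01}
    print(concern_characteristics(cadmium))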
E.4. Characteristics of Biological Impairment
Impairment of biotic populations or communities is determined under the Clean Water Act's
Section 303(d) and in various other contexts. Although a population or community need be impaired in
only one way to be categorized as impaired (Box D-1), multiple characteristics of impairment could
increase confidence in the designation and increase support for a regulatory or remedial action. A few of
the possible characteristics of impairment include low taxonomic richness, a high proportion of dominant
taxa, toxicity, contamination, poor physical habitat, and poor aesthetics. These characteristics are
familiar to ecological assessors and require no descriptions here.
E.5. Characteristics of Remediation
Assessments of the success of remedial actions are common and important examples of outcome
assessments. Remediation is judged in terms of the performance and the effectiveness of the intervention
(Table E-4). Obtaining sufficient evidence to determine whether ecological goals were attained because of
the remedial action, and not just natural recovery, is particularly valuable; a simple illustrative
comparison follows the table.
Table E-4. Potential characteristics of remediation

Performance of the Intervention

Functions
    The remedial technology functions as intended.
Reduces the agent
    The technology substantially reduces the level and extent of the contaminant or other agent that causes the risk.
Achieves the physical/chemical goal
    The levels of the causal agent are reduced sufficiently to achieve the goal.

Effectiveness of the Intervention

Endpoint less exposed
    The exposure of the environmental endpoint entity to the agent and their interaction are reduced as predicted.
Endpoint improved
    The abundance, diversity, function, or other attributes of the endpoint entity improve within the expected time following the remediation.
Endpoint goal achieved
    Reference conditions or other ecological endpoint goals are achieved in response to remediation.
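One way to separate remedial effects from natural recovery is to compare the change at the remediated site with the concurrent change at reference sites, in the manner of a before-after/control-impact design. The Python sketch below is purely illustrative; the function name and the biotic-index scores are hypothetical:

    # Illustrative sketch only: improvement at the remediated site beyond the
    # concurrent change at reference sites (a before-after/control-impact
    # style comparison). All values are hypothetical biotic-index scores.

    def change_beyond_reference(impact_before, impact_after,
                                reference_before, reference_after):
        """Improvement at the remediated site in excess of the reference trend."""
        remediated_change = impact_after - impact_before
        reference_change = reference_after - reference_before  # natural recovery proxy
        return remediated_change - reference_change

    # A 26-point improvement at the site against 5 points of regional recovery:
    print(change_beyond_reference(32.0, 58.0, 55.0, 60.0))  # -> 21.0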
E.6. Summary
Qualitative WoE is performed to support inferences concerning some qualitative property such as
causality, protection, concern, or impairment. Causality has been the principal quality inferred by WoE in
environmental assessments and is the only quality for which characteristics have been available.
Tentative lists of characteristics are presented here as examples to demonstrate the potential use of
characteristics as an organizing principle for WoE. Other qualitative properties such as recovery also
might be addressed by weighing the evidence for defined characteristics. As in causal inference,
weighing evidence for the characteristics of these other properties could give WoE a better explanatory
structure than simply weighing pieces or types of evidence.