Guidance for Comparing Background and Chemical Concentrations in Soil for Cercla Sites


                                    EPA540-R-01-003
                                    OSWER 9285.7-41
                                     September 2002
Guidance for Comparing Background
and Chemical Concentrations in Soil
           for CERCLA Sites
       Office of Emergency and Remedial Response
          U.S. Environmental Protection Agency
              Washington, DC 20460
                                       Recycled/Recyclable
                                       Printed witti Soy/Canoia Ink on paper that
                                       contains at least 50% recycled fiber

-------
Page ii
PREFACE

This document provides guidance to the U.S. Environmental Protection Agency Regions concerning how
the Agency intends to exercise its discretion in implementing one aspect of the CERCLA remedy selection
process. The guidance is designed to implement national policy on these issues.

Some of the statutory provisions described in this document contain legally binding requirements. However,
this document does not substitute for those provisions or regulations, nor is it a regulation itself. Thus, it
cannot impose legally binding requirements on EPA, States, or the regulated community, and may not apply
to a particular situation based upon the circumstances. Any decisions regarding a particular remedy selection
decision will be made based on the statute and regulations, and EPA decision makers retain the discretion
to adopt approaches on a case-by-case basis that differ from this guidance where appropriate. EPA may
change this guidance in the future.
ACKNOWLEDGMENTS

The EPA working group, chaired by Jayne Michaud (Office of Emergency and Remedial Response), included
Tom Bloomfield (Region 9), Clarence Callahan (Region 9), Sherri Clark (Office of Emergency and Remedial
Response), Steve Ells (Office of Emergency and Remedial Response), Audrey Galizia (Region 2), Cynthia
Hanna (Region 1), Jennifer Hubbard (Region 3), Dawn loven (Region 3), Julius Nwosu (Region 10), Sophia
Serda (Region 9), Ted Simon (Region 4), and Paul White (Office of Research and Development). David
Bennett (Office of Emergency and Remedial Response) was the senior advisor for this working group.
Comments and suggestions provided by Agency staff are gratefully acknowledged.

Technical and editorial assistance by Mary Deardorff and N. Jay Bassin of Environmental Management
Support, Inc., and Harry Chmelynski of Sanford Cohen & Associates, Inc., are gratefully acknowledged.
This report was prepared for the Office of Emergency and Remedial Response, United States Environmental
Protection Agency. It was edited and revised by Environmental Management Support, Inc., of Silver Spring,
Maryland, under contract 68-W6-0046, work assignment 007, and contract 68-W-02-033, work assignment 004,
managed by Jayne Michaud. Mention of trade names or specific applications does not imply endorsement or
acceptance by EPA. For further information, contact Jayne Michaud, U.S. EPA, Office of Emergency and
Remedial Response, Mail Code 5202G, 1200 Pennsylvania Avenue, Washington, DC 20460.
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites

-------
                                                                                Page iii
                                    CONTENTS

                                                                                  Page

ACRONYMS AND ABBREVIATIONS  	vi

GLOSSARY	  vii

CHAPTER 1: INTRODUCTION  	1-1
    1.1 Application of Guidance	1-1
    1.2 Goals	1-2
    1.3 Scope of Guidance  	1-2
    1.4 Intended Audience	1-2
    1.5 Definition of Background	1-2

CHAPTER2: SCOPING	2-1
    2.1 When Background Samples Are Not Needed	2-1
    2.2 When Background Samples Are Needed  	2-2
    2.3 Selecting a Reference Area	2-2

CHAPTER 3: HYPOTHESIS TESTING AND DATA QUALITY OBJECTIVES  	3-1
    3.1 Hypothesis Testing  	3-1
       3.1.1  Background Test Form 1  	3-5
       3.1.2  Background Test Form 2	3-7
       3.1.3  Selecting a Background Test Form 	3-8
    3.2 Errors Tests and Confidence Levels	3-8
    3.3 Test Performance Plots	3-9
    3.4 DQO Steps for Characterizing Background  	3-11
    3.5 Sample Size	3-14
    3.6 An Example of the DQO  Process  	3-15

CHAPTER 4: PRELIMINARY DATA ANALYSIS  	4-1
    4.1 Tests for Normality	4-2
    4.2 Graphical Displays  	4-2
       4.2.1  Quantile Plot  	4-3
       4.2.2  Quantile-Quantile Plots	4-4
       4.2.3  Quantile Difference Plot  	4-5
    4.3 Outliers  	4-6
    4.4 Censored Data (Non-Detects)	4-7

CHAPTER 5: COMPARING SITE AND BACKGROUND DATA  	5-1
    5.1 Descriptive  Summary Statistics  	5-2
    5.2 Simple Comparison Methods  	5-3
    5.3 Statistical Methods for Comparisons with Background	5-3


  Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites

-------
Page iv
        5.3.1  Parametric Tests	5-4
        5.3.2  Nonparametric Tests 	5-6
   5.4  Hypothesis Testing  	5-12
        5.4.1  Initial Considerations	5-12
        5.4.2  Examples  	5-12
        5.4.3  Conclusions  	5-14

APPENDIX A: SUPPLEMENTAL INFORMATION FOR DETERMINING "SUBSTANTIAL
   DIFFERENCE"	  A-l
   A. 1  Precedents for Selecting a Background Test Form	  A-l
   A.2  Options for Establishing the Value of a Substantial Difference	  A-4
        A.2.1 Proportion of Mean Background Concentration	  A-4
        A.2.2 A Selected Percentile of the Background Distribution  	  A-5
        A.2.3 Proportion of Background Variability	  A-5
        A.2.4 Proportion of Preliminary Remediation Goal 	  A-5
        A.2.5 Proportion of Soil Screening Level	  A-5
   A.3  Statistical Tests and Confidence Intervals for Background Comparisons  	  A-6
        A.3.1 Comparisons Based on the t-Test 	  A-6
        A.3.2 Comparisons Based on the Wilcoxon Rank Sum Test  	  A-8

APPENDIX B: POLICY CONSIDERATIONS FOR THE APPLICATION OF BACKGROUND DATA
   IN RISK ASSESSMENT AND REMEDY SELECTION 	  B-l
   Purpose 	  B-3
   History	  B-3
   Definitions of Terms	  B-4
   Consideration of Background in Risk Assessment	  B-5
   Consideration of Background in Risk Management	  B-6
   Consideration of Background in Risk Communication	  B-7
   Hypothetical Case Examples	  B-7
        Hypothetical Case 1  	  B-8
        Hypothetical Case 2  	  B-9
        Hypothetical Case 3  	  B-9
   References	  B-10
  Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites

-------
                                                                                        Page v
                                        EXHIBITS

                                                                                         Page

Figure 2.1 Determining the need for background sampling  	2-1
Figure 3.1 Test performance plot: site is not significantly different from background	3-10
Figure 3.2 Test performance plot: site does not exceed background by more than S	3-11
Figure 4.1 Example of a double quantile plot	4-3
Figure 4.2 Example of an empirical quantile-quantile plot	4-4
Figure 4.3 Example of a quantile difference plot 	4-5
Table 3.1 Required sample size for selected values of o  	3-4
Table 3.2 Achievable values of a = (3 for selected values of N  	3-5
Table 3.3 Hypothesis Testing: Type I and Type II Errors	3-8
Table 5.1 Site data	5-8
Table 5.2 Background data  	5-8
Table 5.3 WRS test for Test Form 1 (H0: site < background)	5-8
Table 5.4 Critical Values for the WRS Test	5-9
Table 5.5 WRS test for Test Form 2 (1^: site > background + 100)	5-10
Table 5.6 WRS test for Test Form 2 (H^ site > background + 50)	5-10
Table 5.7 Summary of hypothesis tests	5-15
  Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites

-------
Page vi
                           ACRONYMS AND ABBREVIATIONS

a         Alpha Decision Error (Type I)
ANOVA  Analysis of Variance Procedure
ARAR    Applicable or Relevant and Appropriate Requirements
(3         Beta Decision Error (Type II)
CERCLA  Comprehensive Environmental Response, Compensation, and Liability Act
cm       Centimeter
COC      Chemical of Concern
COPC     Chemical of Potential Concern
CV       Coefficient of Variation
A         Delta (Difference)
DCGL    Design Concentration Guideline Level
DQO      Data Quality Objective
EPA      U.S. Environmental Protection Agency
HA       Alternative Hypothesis
H0        Null Hypothesis
HRS      Hazard Ranking System
K         Tolerance Coefficient
kg        Kilogram
LBGR    Lower Bound of the Gray Region
m         Meter
Mb       Mean Background Concentration
MOD     Minimum Detectable Difference
mg       Milligram
N (n)      Number of Samples
ND       Non-Detect Measurement
NRC      U.S. Nuclear Regulatory Commission
PA/SI     Preliminary Assessment/Site Investigation
PRG      Preliminary Remediation Goal
RAGS     Risk Assessment Guidance for Superfund Vol. I, Human Health Evaluation Manual (Part A)
RCRA    Resource Conservation and Recovery Act
RPM      Remedial Project  Manager
o         Standard Deviation
S         Substantial Difference
SAP      Sampling and Analysis Plan
SSL      Soil Screening Level
TL       Tolerance Limit
TRW     Technical Review Workgroup for Lead
UTL      Upper Tolerance Limit
WRS      Wilcoxon Rank Sum
  Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites

-------
Page vii
GLOSSARY

Background Substances or locations that are not influenced by the releases from a site and
are usually described as naturally occurring or anthropogenic: (1) Naturally
occurring substances present in the environment in forms that have not been
influenced by human activity. (2) Anthropogenic substances are natural and
human-made substances present in the environment as a result of human
activities (not specifically related to the CERCLA site in question).
Background reference area The area where background samples are collected for comparison with samples
collected on site. The reference area should have the same physical, chemical,
geological, and biological characteristics as the site being investigated, but has
not been affected by activities on the site.
Background Test Form 1

Background Test Form 2

Coefficient of variation

A (delta)
Detection limit
Gehan test
Gray region
Hypothesis (statistical)
Within this guidance, the null hypothesis that the mean concentration in
potentially impacted areas is less than or equal to the mean background
concentration.
Within this guidance, the null hypothesis that the mean concentration in
potentially impacted areas exceeds the mean background concentration.
The ratio of the standard deviation to the mean. A unitless measure that allows
the comparison of dispersion across several sets of data. It is often used instead
of the standard deviation in environmental applications because the standard
deviation is often proportional to the mean.
The true difference between the mean concentration of chemical X in
potentially impacted areas and the mean background concentration of chemical
X. Delta is an unknown parameter which describes the true state of nature.
Hypotheses about its value are evaluated using statistical hypothesis tests. In
principle, we can select any specific value for A and then test if the observed
difference is as large as A or not with a given confidence and power.
Smallest concentration of a substance that can be distinguished from zero.
The Gehan test is a generalized version of the WRS test. The Gehan test
addresses multiple detection limits using a modified ranking procedure rather
than relying on the "all ties get the same rank" approach used in the WRS test.
A range of possible values of A where the consequences of making a decision
error are relatively minor—where the statistical test will yield inconclusive
results. The width of the gray region is equal to the MOD for the test. The
location of the gray region depends on the type of statistical test selected.
A statement that may be supported or rejected by examining relevant data. To
determine if we should accept a hypothesis, it is commonly easier to attempt
to reject its converse (that is, first assume that the hypothesis is not true). This
assumption to be tested is called the null hypothesis (H0), which is any testable
presumption set up to be rejected. An alternative hypothesis (H^) is the logical
opposite of the null hypothesis.
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites

-------
Page viii
Hypothesis testing
Judgmental (or
authoritative) samples

MDD (minimum
detectable difference)

Nonparametric data
analysis
Outliers
Parametric data analysis
P-value


Quantile plot


Quantile test


Robustness


S (substantial difference)
Test performance plot
A quantitative method to determine whether a specific statement concerning
A (called the null hypothesis) can be rejected or not by examining data. The
hypothesis testing process provides a formal procedure to quantify the decision
maker's acceptable limits for decision errors.
Samples collected in areas  suspected to have higher contaminant concen-
trations due to operational or historical knowledge. Judgmental samples cannot
be extrapolated to represent the entire site.
The smallest difference in means that the statistical test can resolve. The MDD
depends on sample-to-sample variability, the number of samples, and the
power of the statistical  test. The MDD is a property of the survey design.
A distribution-free statistical method that does not depend on knowledge of the
population distribution.
Measurements (usually larger or smaller than  other data values) that are not
representative  of the sample population from which they were drawn.  They
distort statistics if used in any calculations.
A statistical method that relies on a known probability distribution for the
population from which data are selected. Parametric statistical tests are used
to evaluate statements  (hypotheses) concerning the parameters  of the
distribution. They are usually based on the assumption that the raw data are
normally or lognormally distributed.
The smallest value of a at which the null hypothesis would be rejected for the
given observations. The p-value of the test is sometimes called the critical
level, or the significance level, of the test.
A graph that displays the entire distribution of data, ranging  from the lowest
to the highest value. The vertical axis is the measured concentration, and the
horizontal axis is the percentile of the distribution.
The quantile test is a nonparametric test specifically designed to compare the
upper tails of two distributions. The quantile test may detect differences that
are not detected by the  Wilcoxon rank sum test.
A method of  comparing statistical tests. A robust test is  one with  good
performance (that is not unduly affected by outliers) for a wide variety of data
distributions.
A difference in mean concentrations that is sufficiently large to warrant
additional  interest  based on  health or  ecological information.  S is the
investigation level. If A exceeds S, the difference in concentrations is judged
to be sufficiently large to be of concern, for the purpose of the analysis. A
hypothesis test uses measurements from the  site and from background to
determine if A exceeds S.
A graph that displays the combined effects of the decision error rates, the gray
region for the decision-making process, and the level of substantial difference
between site and background. It is used in the  data quality objective process
during scoping to aid in the selection of reasonable values  for the decision
error rates (a and (3), the MDD, and the required number of samples.
  Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites

-------
                                                                                         Page ix
Tolerance limit



Type I error

Type II error

Walsh's test for outliers

Wilcoxon rank sum
(WRS) test
A confidence limit on a percentile of the population rather than a confidence
limit on the mean. For example, a 95 percent one-sided TL for 95 percent
coverage represents the value below which 95 percent of the population are
expected to fall (with 95 percent confidence).
The probability, referred to as  a (alpha), that the null hypothesis will be
rejected when in fact it is true (false positive).
The probability, referred to as (3 (beta), that the null hypothesis will be
accepted when in fact it is false (false negative).
A nonparametric test for determining the presence of outliers in either the
background or onsite data sets.
A nonparametric test that examines  whether  measurements  from one
population consistently tend to be larger (or smaller) than those from the other
population. It is used for determining whether a substantial difference exists
between site and background population distributions.
  Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites

-------
                                    CHAPTER 1
                               INTRODUCTION
The U.S. Environmental Protection Agency (EPA)
developed  this  document  to  assist  CERCLA
remedial project managers  (RPMs) and human
health  and ecological risk  assessors  during the
remedial investigation process to evaluate back-
ground concentrations at CERCLA sites. An issue
that is often raised at CERCLA sites is whether a
reliable  representation  of background has been
established.1 This document recommends statistical
methods for characterizing background concentra-
tions of chemicals in soil.

The general application of background concentra-
tions during the CERCLA remedial investigation
process is addressed in EPA policy.2 Ecological risk
assessment guidance also provides specific recom-
mendations for applying background  concentration
data.3

This document  supplements  Agency  guidance
included in the  Risk Assessment Guidance for
SuperfimdVol. I, Human Health EvaluationManual
(Part A) (RAGS).1 RAGS contains useful guidance
on background issues that the reader  should also
consult:

 *•  Sampling  needs (Sections 4.4 and 4.6);
 *•  Statistical methods (Section 4.4);
 *•  Exposure  assessment (Section 6.5); and
 *•  Risk characterization (Section 8.6).

This document draws upon many other publications
and statistical references. In general, background
may play a role in the CERCLA process when:

 *  Determining whether a release falls within the
    limitation contained in Section 104(a)(3)(A)  of
    the Comprehensive Environmental Response,
    Compensation, and Liability Act (CERCLA),
    which addresses naturally occurring substances
    in their unaltered form from a location where
    they are naturally found;4

 *•  Developing remedial goals;5

 *•  Characterizing risks from contaminants that
    may also be attributed to background sources;
    and

 *  Communicating  cumulative  risks associated
    with the CERCLA site.

As stated in RAGS, a statistically significant differ-
ence between background samples and site-related
contamination should not, by itself, trigger a cleanup
action. Risk assessment methods should be applied
to ascertain the significance of the chemical concen-
trations. EPA's national policy clarifying the role of
background characterization in the CERCLA risk
assessment and remedy selection process is included
as Appendix B in this document.

1.1    Application of Guidance

This guidance should be applied on a site-specific
basis, with assistance  from a statistician who is
familiar with the CERCLA remedial investigation
process. Not every CERCLA site investigation will
need to characterize chemicals in background areas.
A background evaluation usually is considered when
certain contaminants that pose risks and may drive
an action are believed to be attributable to back-
ground. The need for background characterization,
the timing of sampling efforts, and the required level
  Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites

-------
Page 1-2
of effort should be determined on  a site-specific
basis. EPA should consider whether collecting back-
ground samples is necessary (Chapter 2); when,
where, and how to collect  background samples
(Chapter 3); and how to evaluate the  data (Chapters
4 and 5).

To the extent practicable, this guidance may also be
applicable to sites addressed under removal actions,
especially non-time-critical removal actions, and
Resource Conservation and Recovery Act (RCRA)
corrective actions.

1.2    Goals

The general goals of this guidance are to:

  > Provide  a practical  guide for  characterizing
   background concentrations at CERCLA sites;
   and

  > Present sound options  for  evaluating back-
   ground data sets in comparison to site contam-
   ination data.

1.3    Scope of Guidance

This guidance pertains to the evaluation of chemical
contamination  in  soil at CERCLA  sites. This
guidance may be updated in the future to address
non-soil media. Non-soil media are dynamic and
influenced by upstream or upgradient sources. Such
media—air, groundwater, surface water, and sedi-
ments—typically  require additional analyses of
release and transport, involve more complex spatial
and  temporal  sampling  strategies, and  require
different ways of combining and analyzing data.1

The  user  should  consult the available Agency
guidances and policies  when dealing with sites with
radioactive contaminants. Certain types of CERCLA
sites, such as mining or dioxin-contaminated sites,
may  require  consideration  of  specific  Agency
policies and regulations. Therefore, this guidance
should be applied on  a case-by-case basis, with
consideration of Agency statutes, regulations, and
policies.

1.4    Intended Audience

The  intended audience of this guidance is EPA
human health and ecological risk assessors, RPMs,
and decision makers.

1.5    Definition of Background

Forthe purposes of this guidance, background refers
to substances or locations that are not influenced by
the releases from a site, and are usually described as
naturally occurring or anthropogenic: 1? 6

    1) Naturally occurring - substances present in
   the environment in forms that have not been
   influenced by human activity; and,

   2)  Anthropogenic - natural and human-made
   substances present in the  environment as a
   result of human activities (not specifically
   related to the CERCLA site in question).

Some chemicals may be present in background as a
result of both natural and man-made  conditions
(such as naturally occurring arsenic and arsenic
from pesticide applications or smelting operations).

CERCLA site activity (such  as waste  disposal
practices) may cause naturally occurring substances
to be released into other  environmental media or
chemically transformed. The concentrations of the
released naturally occurring substance may not be
considered as representative of natural background
according to CERCLA 104(a)(3)(A).

Generally,  the  type  of background  substance
(natural or anthropogenic) does not influence the
statistical or technical method used to characterize
background  concentrations.  For comparison pur-
poses  soil  samples should have the same basic
characteristics as the site  sample (i.e., similar soil
depths  and soil types).7 (See Section 2.3).
  Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites

-------
Page 1-3
CHAPTER NOTES

1. U.S. Environmental Protection Agency (EPA). 1989. Risk Assessment Guidance for Superfund Vol. I,
Human Health Evaluation Manual (Part A). Office of Emergency and Remedial Response, Washington,
DC. EPA 540-1-89-002. Hereafter referred to as "RAGS." For information on non-soil media, see
Sections 4.5 and 6.5.

2. U.S. Environmental Protection Agency (EPA). April 2002. Role of Background in the CERCLA Cleanup
Program. Office of Emergency and Remedial Response, Washington, DC. OSWER 9285.6-07P (see
Appendix B of this guidance).

3. U.S. Environmental Protection Agency (EPA) .2001. The Role of Screening-Level Risk Assessments and
Refining Contaminants of Concern in Baseline Ecological Risk Assessments. Office of Solid Waste and
Emergency Response, Washington, DC.

4. CERCLA 104(a)(3)(A) restricts the authority to take an action in response to the release or threat of
release of a "naturally occurring substance in its unaltered form or altered solely through naturally
occurring processes or phenomena, from a location where it is naturally found."

5. The National Oil and Hazardous Substances Pollution Contingency Plan (NCP) (40 CFR Part 3 00) is the
primary regulation that implements CERCLA. The preamble to the NCP discusses the use of background
levels for setting cleanup levels for constituents at CERCLA sites.

"...In some cases, background levels are not necessarily protective of human health, such as in urban or
industrial areas; in other cases, cleaning up to background levels may not be necessary to achieve
protection of human health because the background level for a particular contaminant may be close to
zero, as in pristine areas" (55 FR 8717-8718).

The preamble to the NCP also identifies background as a technical factor to consider when determining
an appropriate remedial level:

"Preliminary remediation goals.. .may be revised to a different risk level within the acceptable risk range
based on the consideration of appropriate factors including, but not limited to: exposure factors,
uncertainty factors, and technical factors...Technical factors may include...background levels of
contaminants..."(55 FR 8717).

6. U.S. Environmental Protection Agency (EPA). 1995. Engineering Forum Issue Paper. Determination
of Background Concentrations of Inorganics in Soils and Sediments at Hazardous Waste Sites. R.P
Breckenridge and A.B. Crockett. Office of Research and Development, Washington, DC. EPA/540/S-
96/500.

7. U.S. Environmental Protection Agency (EPA). July 2000. Draft Ecological Soil Screening Level
Guidance. Office of Emergency and Remedial Response, Washington, DC. EPA 540/F-01/014. OSWER
9345.0-14.
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites

-------
                                     CHAPTER 2

                                       SCOPING
                                                                                         Page 2-1
A first step in determining the need for background
sampling data is gathering and evaluating all of the
available data.  Some information gathered during
the Preliminary Assessment/Site Investigation (PA/
SI) may provide data on  background levels of
chemicals. The SI usually provides the first oppor-
tunity to collect some background samples. Data
collected and assessed for the hazard ranking system
(HRS) process may include both site-related con-
taminants and off-site (or estimated background)
substances.  These data are generally limited in
quantity and sample location and may have limited
value in the remedial investigation. The locations of
all data should be identified and reported when
these data are considered during the remedial inves-
tigation. Sampling locations should be recorded with
sufficient precision to permit follow up confirma-
tory  measurements if required at a later date. The
general  types of information to consider when
determining the need for background sampling are
highlighted in the box below and in Figure 2.1.

     Background Sampling Considerations

   >  Natural variability of soil types
   >  Operational practices
   >  Waste type
   >  Contaminant mobility
Information from preliminary site studies or pub-
lished sources (regional or local data from the state
or U.S. Geological Survey) may be useful for identi-
fying local soil, water,  and air quality charac-
teristics.1 Data from these resources may be useful
for qualitative analyses  of regional conditions.
However, usually they are not sufficient to assess
site-specific conditions in a quantitative manner.2
      Background Data)
     Relevant to Decision*)
     Available Existing Data)
     Existing Data Sufficient)
      for Statistical Tests)
        (No Gaps)
            Yes)
      Existing Data from)
     Appropriate Locations)
     Existing Data of Known)
       and Acceptable)
          Quality)
       Site Unchanged
       Since Sampling)
            Yes)
     Background Sampling)
        Unnecessary)
Background Sampling)
  Recommended)
 *e.g., suspected risk driver that may be attributed to background)
  Figure 2.1 Determining the need for background
  sampling

After compiling and considering the relevant infor-
mation, EPA  should determine if the data are
sufficient for the risk assessment and risk manage-
ment decisions, or if additional site-specific data
should be collected to characterize background.

2.1    When Background Samples Are
       Not Needed

If the  sample quantity, location,  and quality of
existing data can be used to characterize background
chemical concentrations and compare  them to site
data, then additional samples may not be needed. In
some cases, background chemical concentration
  Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites

-------
Page 2-2
levels are irrelevant to the decision-making process.
For example, for a chemical release whose constitu-
ents are  known and not expected to have been
released to the environment from any source other
than the site, background data would not be neces-
sary. In other cases, levels of background constitu-
ents may not exceed risk-based cleanup goals, and,
therefore, further background analysis would not be
relevant.

2.2    When Background  Samples Are
       Needed

In some cases, the existing data may be inadequate
to characterize background. The reasons for this
include, but  are not limited to, the following:

  *  Insufficient number of samples to perform the
    desired  statistical  analysis or to perform the
    tests with the desired level of statistical power;

  >  Inappropriate background  sample  locations
    (such as those affected by another contamina-
    tion source, or in soil types that  do not reflect
    onsite soil types of interest);3'4

  *  Unknown or suspect data quality;

  *  Alterations in the land since the  samples were
    collected (such  as by filling, excavation,  or
    introduction of new anthropogenic sources); and

  >  Gaps in the available data (certain chemicals
    were excluded from the sample analyses,  or
    certain soil types were not collected).

2.3    Selecting a Reference Area

A background reference  area is the area where
background samples will be collected for compari-
son with  the  samples collected on the  site.  A
background reference area should have the same
physical,  chemical,  geological, and biological
characteristics as the site being investigated, but has
not been affected by activities on the site. RAGS
states that "...the locations  of the background
samples must be areas that could not have received
contamination from the site, but that do have the
same basic characteristics as the medium of concern
at the site."2

The ideal background reference area would have the
same distribution of concentrations of the chemicals
of concern as those which would be expected on the
site if the site had never been impacted. In most
situations, this ideal reference area does not exist. If
necessary, more than one reference area  may be
selected if the  site  exhibits a range  of physical,
chemical, geological, or biological variability. Back-
ground reference areas are normally selected from
off-site areas, but are not limited to natural areas
undisturbed by human activities. It may be difficult
to find a suitable background reference area in an
industrial complex. In some cases, a non-impacted
onsite  area  may be  suitable as  a background
reference area.5

Complete discussion of the role of geochemical
properties of soils in the conduct of background
investigations is beyond the scope of this document.
In most cases, geochemical methods require more
detailed site-specific analysis of the local soil types,
biology,  and geology  than is required  for the
background comparison methods discussed in this
document. The methods in this guidance are based
on randomly sampled concentrations of the chemi-
cals of concern.3
  Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites

-------
                                    CHAPTER NOTES

1.  U.S. Environmental Protection Agency (EPA).  October  1988. Guidance for Conducting Remedial
   Investigations and Feasibility Studies Under CERCLA; Interim Final. (NTIS PB89-184626, EPA 540-G-
   89-004, OSWER 9355.3-01).

2.  U.S. Environmental Protection Agency (EPA). 1989. Risk Assessment Guidance for Superfund Vol. I,
   Human Health Evaluation Manual (Part A). Office of Emergency and Remedial Response, Washington,
   DC. EPA 540-1-89-002. Hereafter referred to as "RAGS."

3.  U.S. Environmental Protection Agency (EPA). December  1995. Determination of Background
   Concentrations of Inorganics in Solids and Sediments at Hazardous Waste Sites. R.P. Breckenridge and
   A.B. Crockett, National Engineering Forum Issue. Office of Research and Development, Washington,
   DC. EPA/540/S-96/500.

4.  U.S. Environmental Protection Agency (EPA). July  2000. Draft  Ecological Soil Screening Level
   Guidance. Office of Emergency and Remedial Response, Washington, DC. EPA 540/F-01/014. OSWER
   9345.0-14.

5.  Statistical methods based only on sample data collected from both impacted and non-impacted areas on
   the site are addressed by A. Singh, A.K. Singh, and G. Flatman, "Estimation of background levels of
   contaminants," Mathematical  Geology, Vol. 26, No. 3, 1994.
  Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites

-------
CHAPTER 3

HYPOTHESIS TESTING AND DATA QUALITY
OBJECTIVES
3.1 Hypothesis Testing

The first step in developing a hypothesis test is to
transform the problem into statistical terminology
by developing a null hypothesis and an alternative
hypothesis (see box on next page). These hypotheses
form the two alternative decisions that the hypothe-
sis test will evaluate.

In comparisons with background, the parameter of
interest is symbolized by the Greek letter deIta (A),
the amount by which the mean of the distribution of
concentrations in potentially impacted areas exceeds
the mean of the background distribution (see
definitions below). Delta is an unknown parameter,
and statistical tests may be used to evaluate
hypotheses relating to its possible values. The
statistical tests are designed to reject or not reject
hypotheses about A based on test statistics computed
from limited sample data.

The action level for background comparisons is the
largest value of the difference in means that is
acceptable to the decision maker. In this guidance,
the action level for the difference in means is
defined as a substantial difference (S), which may
be zero or a positive value based on the risk assess-
ment, an applicable regulation, a screening level, or
guidance. In some cases, the largest acceptable
value for the difference in means may be S = 0. This
Definitions

A (delta): The true difference between the mean concentration of chemical X in potentially impacted areas
and the mean background concentration of chemical X. Delta is an unknown parameter which describes
the true state of nature. Hypotheses about its value are evaluated using statistical hypothesis tests. In
principle, we can select any specific value for A and then test if this difference is statistically significant
or not with a given confidence and power.

S (substantial difference): A difference in mean concentrations that is sufficiently large to warrant
additional interest based on health or ecological information. S is the investigation level. If A exceeds S,
the difference in concentrations is judged to be sufficiently large to be of concern, for the purpose of the
analysis. A hypothesis test uses measurements from the site and from background to determine if A
exceeds S. The S value is discussed further in Appendix A.

MDD (minimum detectable difference): The smallest difference in means that the statistical test can
resolve. The MDD depends on sample-to-sample variability, the number of samples, and the power of the
statistical test. The MDD is a property of the survey design.

Gray Region: A range of values of A where the statistical test will yield inconclusive results. The width
of the gray region is equal to the MDD for the test. The location of the gray region depends on the type
of statistical test selected.
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites

-------
Page 3-2
Null and Alternative Hypotheses

A statistical hypothesis is a statement that may be supported or rejected by examining relevant data.
Conventionally, hypotheses are stated in such a way that we know what to expect if they are true.
However, to determine if we should accept a proposed hypothesis, it is commonly easier to reject its
converse (that is, first assume that the hypothesis is not true). This assumption to be tested is called the
null hypothesis (H0)—if the null hypothesis is rejected, then the initial presumption is accepted. A null
hypothesis, then, is any testable presumption set up to be rejected. If we want to show that site
concentration exceeds background, we formulate a null hypothesis that the site concentration is less than
or equal to the background concentration. Similarly, if we want to show that the site concentration is less
than or equal to the background concentration, we formulate a null hypothesis that the site concentration
exceeds the background concentration.

An alternative hypothesis (H^) is the logical opposite of the null hypothesis: if H0 is true, HA is false, and
vice-versa. Consequently, the alternative hypothesis is usually logically the same as the investigator's
research hypothesis. HA is the conclusion we accept if we find sufficient evidence to reject the null
hypothesis H0.

A null hypothesis that specifies the unknown parameter (A) as an equality ("H0: A = 0") has a
corresponding alternative hypothesis that can be higher or lower ("H0: A < 0" or "H0: A > 0"). Such a null
hypothesis is termed "two tailed" or "two sided" because the alternative hypothesis has two possibilities.
A hypothesis test that uses a null hypothesis like "H0: A < 0" is called "one sided" or "one tailed" because
the corresponding alternative hypothesis is true only if the values are greater than zero. One-sided tests
are most often used in background comparisons.
guidance does not establish a value for "S"; the
value for "S" should be considered on a case-by-
case basis. The S value is discussed further in
Appendix A. The determination of S should be
considered during the development of a Quality
Assurance Project Plan as part of the planning
process for the background evaluation.1

Estimates of A are obtained by measuring contamin-
ant concentrations in potentially impacted areas and
in background areas. For example, one estimate of
the mean concentration in potentially impacted areas
is the simple arithmetic average of the measure-
ments from these areas. An estimate of the mean
background concentration is similarly calculated.
An estimate of the difference in means (A) is
obtained by subtracting the mean background con-
centration from the mean concentration in potential-
ly impacted areas. In most cases of interest, the
estimate of A will be a positive number. If there is
little or no contamination on the site, then the esti-
mate for A may be near zero or slightly negative.
Note that the estimated value for A calculated by
this simple procedure (or by any more complicated
procedure) is only an approximation of the true
value of A. Hence, decisions based on any estimated
value for A may be incorrect due to uncertainty
concerning its true value.

Adopting hypothesis tests and a Data Quality
Objective (DQO) approach (Section 3.4) can control
the probability of making decision errors. However,
incorrect use of hypothesis tests can lead to erratic
decisions. Each type of hypothesis test is based on
a set of assumptions that should be verified to
confirm proper use of the test. Procedures for
verifying the selection and proper use of parametric
tests, such as the t-tests, are provided in EPA QA/G-
9, Chapter 4.2 Nonparametric tests generally have
fewer assumptions to verify.

Hypothesis testing is a quantitative method to
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites

-------
Page 3-3
determine whether a specific statement concerning
A (called the null hypothesis) can be rejected.
Decisions concerning the true value of A reduce to
a choice between "yes" or "no." When viewed in
this way, two types of incorrect decisions, or
decision errors, may occur:

* Incorrectly deciding the answer is "yes" when
the true answer is "no;" and
* Incorrectly deciding the answer is "no" when
the true answer is "yes."

While the possibility of decision errors can never be
totally eliminated, it can be controlled. To control
decision errors, it is necessary to control the
uncertainty in the estimate of A. Uncertainty arises
from three sources:

> Sampling error;
> Measurement error; and
> Natural variability.

The decision maker has some control of the first two
sources of uncertainty. For example, a larger num-
ber of samples may lead to fewer decision errors
because the probability of a decision error decreases
as the number of samples increases. Use of more
precise measurement techniques or duplicate meas-
urements can reduce measurement error, thus mini-
mizing the likelihood of a decision error. The third
source of uncertainty is more difficult to control.

Natural variability arises from the uneven distribu-
tion of chemical concentrations on the site and in
background areas. Natural variability is measured by
the true standard deviation (a) of the distribution. A
large value of o indicates that a large number of
measurements will be needed to achieve a desired
limit on decision errors. Since variability is usually
higher in impacted areas of the site than in
background locations, data collected on the site is
used to estimate o. An estimate for o frequently is
obtained from historical data, if available. Estimates
of variability reported elsewhere at similar sites with
similar contamination problems may be used. If an
estimate of the mean concentration in contaminated
areas is available, the coefficient of variation obser-
ved at other sites may be multiplied by the mean to
estimate the standard deviation. If no acceptable
historical source for an estimate of o is available, it
may be necessary to conduct a small-scale pilot
survey on site using 20 or more random samples to
estimate o. Due to the small sample size of the pilot,
it is advisable to use an 80 or 90 percent upper
confidence limit for the estimate of o rather than an
unbiased estimate to avoid underestimating the true
variability. A very crude approximation for o may
be made by dividing the anticipated range (maxi-
mum - minimum) by 6. It is important that overly
optimistic estimates for o be avoided because this
may result in a design that fails to generate data with
sufficient power for the decision.

The hypothesis testing process provides a formal
procedure to quantify the decision maker's accept-
able limits for decision errors. The decision maker's
limits on decision errors are used to establish perfor-
mance goals for data collection that reduce the
chance of making decision errors of both types. The
gray region is a range of possible values of A where
the consequences of making a decision error are
relatively minor. Examples of the gray region are
shown in Figures 3.1 and 3.2 (Section 3.3).

Any useful statistical test has a low probability of
reflecting a substantial difference when the site and
background distributions are identical (false posi-
tive) but has a high probability of reflecting a
substantial difference when the distribution of
contamination in potentially impacted areas greatly
exceeds the background distribution. In the gray
region between these two extremes, the statistical
test has relatively poor performance. When the test
procedure is applied to a site with a true mean
concentration in the gray region, the test may
indicate that the site exceeds background, or may
indicate that the site does not exceed background,
depending on random fluctuations in the sample.

It is necessary to specify a gray region for the test
because the decision may be "too close to call" due
to uncertainty in the estimate of A. This may occur
when the difference in means is small compared to
the MOD for the test. In the gray region, the uncer-
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites

-------
Page 3-4
tainty in the measurement of A is larger than the
difference between A and the action level, so it may
not be  possible for the test  to  yield  a correct
decision with  a high probability.  One step in the
hypothesis test procedure is to assign upper bounds
on the decision error rates for values of A above and
below  the gray region. These bounds limit the
probability of occurrence of decision errors.

The exact definition of the gray region is determined
by the type of hypothesis test that is selected by the
decision maker (see Figures 3.1 and 3.2 in Section
3.3). In general, the gray region for A is to the right
of the origin (A = 0) and bounded from above by the
substantial difference (A = S). Additional guidance
on specifying a gray region for the test is available
in Chapter 6  of Guidance for the Data  Quality
Objectives Process3 The size of the gray region may
also depend on specific regulatory requirements or
policy  decisions that may not be  addressed in the
DQO guidance.

The width of the gray region is called the "minimum
detectable difference" for the statistical test, indica-
ting that differences smaller than the MDD cannot
be detected reliably by the test. If the test is used to
determine if concentrations in the potentially impac-
ted areas exceed background concentrations by more
than S, it is necessary to ensure that MDD for the
test is less than S. In the planning stage, this require-
ment is  met by designing  a sampling plan  with
sufficient power to detect differences as small as S.
If data were collected without the benefit of a samp-
ling plan, retrospective calculation of the power of
the test may be necessary before making a decision.

In the planning stage, the absolute  size of the MDD
is of less importance than the ratio of the MDD to
the natural variability of the contaminant concentra-
tions in the potentially impacted area. This ratio is
termed  the "relative difference"  and defined as
MDD/o, where o  is an estimate  of the standard
deviation of the distribution of concentrations on the
site. The relative difference expresses the power of
resolution of the statistical test in units of uncertain-
ty. Relative differences much less than one standard
deviation (MDD/o «  1)  are more  difficult to
resolve unless a larger number of measurements are
available. Relative differences of more than three
standard deviations  (MDD/o  > 3) are easier to
resolve. As a general rule, values of MDD/o near 1
will result in acceptable sample sizes. The required
number of samples may increase dramatically when
MDD/o  is much smaller than  one.  Conversely,
designs with MDD/o  larger than three may be
inefficient.  If  MDD/o  is  greater   than  three,
additional measurement  precision is  available at
minimal cost by reducing the width of the gray
region. The cost of the data collection plan should
be examined quantitatively for a range of possible
values of the MDD before selecting a final value. A
tradeoff exists between cost (number of samples
required) and benefit (better power of resolution of
the test).

o
(mg/kg)
25
50
75
100
125
150
175
200
MDD/o
2
1
0.67
0.50
0.40
0.33
0.29
0.25
n
3.70
13.55
29.97
52.97
82.53
118.66
161.36
210.63
N
5
16
35
62
96
138
188
245

 Table 3.1 Required sample size for selected values
 of a (a = p = 0.10 and MDD = 50 mg/kg)

The number of measurements required to achieve
the specified  decision  error rates has  a  strong
inverse relationship with the value of MDD/o. An
example of this inverse relationship is demonstrated
in Table 3.1 for hypothetical values ofa = (3 = 0.10
and  MDD =  50 mg/kg. Sample  sizes  may be
obtained using the approximate  formula given in
EPA QA/G-92(Section 3.3.3.1, Box 3-22 , Step 5 of
that document), written here as:

   n = (0.25) z\_a + 2 (Zl.a + Zl.p)2 o2 / (MDD)2,

where zp is the p* percentile of the standard normal
distribution. Note the inverse-squared dependence
  Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites

-------
Page 3-5
of n on MDD/o. Smaller values of a and (3 (leading
to larger values for the z terms) magnify the strength
of this inverse relationship. A recommended sample
size of N = (1.16)n is tabulated for a variety of o
values in the table. Note the dramatic increase in the
sample size as the value of MDD/o is lowered from
1 to 0.25.

Letting a = (3, we can solve for z^a = z^:

z\_a = n / [0.25 + 8o2 / (MOD)2 ].

For any fixed value of MDD/o, the decision error
rate a is a function of n:

a = 1 - S would
lead to a sample size that does not have sufficient
power to distinguish a difference between the site
and background means as small as S. Hence the
minimum acceptable number of samples for the
decision is obtained when MOD = S. If S/o is less
than one, this indicates that MDD/o is also less than
one, and a relatively large number of samples will
be required to make the decision. If S/o exceeds
three, then a reasonably small number of samples
are required for this minimally acceptable test
design. Additional measurement precision is avail-
able at minimal cost by choosing MOD < S. A
binary search procedure would indicate the choice
of MOD = S/2 as the next trial in the cost tradeoff
comparison. If S/o is between one and three, then
selecting MOD = S is a reasonable alternative. If
S/o < 1, then selecting MOD = S is the most cost-
effective choice consistent with the requirement that
MOD < S.

The MOD, in conjunction with the values selected
for the decision error rates, determines the cost of
the survey design and the success of the survey in
determining which areas present unacceptable risks.
From a risk assessment perspective, selection of the
proper width of the gray region is one of the most
difficult tasks. The goal is to make the MOD as
small as possible within the goals and resources of
the cleanup effort.

Two forms of the statistical hypothesis test are
useful for comparisons with background. The null
hypothesis in the first form of the test states that
there is no statistically significant difference
between the means of the concentration distributions
measured at the site and in the selected background
areas. The null hypothesis in the second form of the
test is that the impacted area of the site exceeds
background by a substantial difference. RAGS4
provides guidance for the first form of the back-
ground hypothesis test. Both forms are described in
the next section.

3.1.1 Background Test Form 1

The null hypothesis for background comparisons,
"the concentration in potentially impacted areas
does not exceed background concentration," is
formulated for the express purpose of being
rejected:

*• The null hypothesis (Hg). The mean contaminant
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites

-------
Page 3-6
    concentration  in  samples  from  potentially
    impacted areas is less than or equal to the mean
    concentration in background areas (A < O).5

  *  The  alternative hypothesis (H^).  The  mean
    contaminant concentration  in samples  from
    potentially impacted areas is greater than the
    mean in background areas (A > 0).

When using this form of hypothesis test, the data
should provide statistically significant evidence that
the null hypothesis is  false—the site does exceed
background. Otherwise, the null hypothesis cannot
be rejected  based  on  the available  data, and the
concentrations found  in the potentially  impacted
areas are considered equivalent to background.

An easy way to think about the decision errors that
may occur using Background Test Form 1 is to think
about the criminal justice system in this country and
consider what a jury must weigh to determine guilt.
The  only choices are "guilty" and "not guilty." A
person on trial is presumed "innocent until proven
guilty." When the evidence (data) is  clearly not
consistent with the presumption of innocence, ajury
reaches a "guilty" verdict. Otherwise the verdict of
"not guilty" is rendered when the evidence  is not
sufficient to reject the presumption of innocence. A
jury does not have to be convinced that the defen-
dant is innocent to reach a verdict of "not guilty."
Similarly, when using Background Test Form 1, the
null hypothesis is presumed true until it is rejected.

Two serious problems arise when using Background
Tests Form 1. One type of problem arises when
there is a very large amount of data. In this case, the
MDD for the test will be very small, and the test
may reject the null hypothesis when there is only a
very small difference  between the site and back-
ground mean  concentrations. If the site  exceeds
background by only a small amount, there is a very
high probability that  the null hypothesis will be
rejected if a sufficiently large number of samples is
taken. This case can be avoided by selecting Back-
ground Test Form  2, which incorporates an accep-
table level  for the difference  between site  and
background concentrations.
A second type of problem may arise in the use of
Background Test Form 1 when insufficient data are
available. This may occur, for example, when the
onsite or background variability was underestimated
in the design phase. An estimated value for o is used
during the preliminary phase of the DQO planning
process  to determine the required number  of
samples. When the samples are actually collected, o
can then be re-estimated, and the power of the
analysis  should be re-evaluated.  If the variance
estimate used in the planning stage was too low, the
statistical test is unlikely to reject the null hypothe-
sis due to the lack of sufficient power. Hence, when
using Background Test Form 1, it is always best to
conduct a retrospective power analysis to ensure
that the power of the test was adequate to detect a
site with mean contamination that exceeds back-
ground by more than the MDD. A simple way to do
this is to recompute the required sample size using
the sample variance in place of the estimated vari-
ance that was used to determine the required sample
size in the planning phase. If the actual sample size
is greater than this post-calculated size, then it is
likely that the test has adequate power. The exact
power of the WRS test used for Background Test
Form 1 is difficult to calculate. See Section 5.3.2 for
more information on the power of the WRS test. If
the  retrospective analysis indicates that adequate
power was not obtained, it may be necessary to
collect more samples. Hence, if large uncertainties
exist concerning the variability of the contaminant
concentration in potentially impacted areas, Back-
ground  Test  Form 1  may lead to inconclusive
results. Therefore, the sample size should exceed the
minimum number of samples required to give the
test sufficient power.

Detailed information on the application and charac-
teristics of Background Test Form 1 is available in
the document series Statistical Methods for Evalua-
ting the Attainment of Cleanup Standards. Volume
3, subtitled Reference-Based Standards for Soils
and Solid Media6 contains detailed procedures for
comparing  site measurements with background
reference area data using parametric and nonpara-
metric tests based on Background Test Form 1.
  Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites

-------
Page 3-7
Interpretation of the Statistical Measures
Background Test Form 1
Confidence level = 80%: On average, in 80 out of 100 cases, chemical concentrations in potentially
contaminated areas will be correctly identified as being no different (statistically) from background
concentrations, while in 20 out of 100 cases, concentrations in potentially contaminated areas will be
incorrectly identified as being greater than background concentrations.

Power = 90%: On average, in 90 out of 100 cases, concentrations in potentially contaminated areas will
be correctly identified as being greater than background concentrations, while in 10 out of 100 cases,
concentrations in potentially contaminated areas will be incorrectly identified as being less than or equal
to background concentrations.

Background Test Form 2

Confidence level = 90%: On average, in 90 out of 100 cases, concentrations in potentially contaminated
areas will be correctly identified as exceeding background concentrations by more than S, while in 10 out
of 100 cases, concentrations in potentially contaminated areas will be incorrectly identified as not
exceeding background concentrations by more than S.

Power = 80%: On average, in 80 out of 100 cases, concentrations in potentially contaminated areas will
be correctly identified as not exceeding background concentrations by more than S, while in 20 out of 100
cases, concentrations in potentially contaminated areas will be incorrectly identified as exceeding
background concentrations by more than S.
3.1.2 Background Test Form 2

An alternative form of hypotheses test for compar-
ing two distributions is presented in Guidance for
the Data Quality Objectives Process, EPA QA/G-43
When adapted to the background problem, the null
hypothesis, "the concentration in potentially impac-
ted areas exceeds background concentration," again
is formulated for the express purpose of being
rejected:

* The null hypothesis (H0): The mean contaminant
concentration in potentially impacted areas
exceeds background by more than S. Symboli-
cally, the null hypothesis is written as H0: A >
S.7

*• The alternative hypothesis (H^): The mean
contaminant concentration in potentially impac-
ted areas does not exceed background by more
thanS(HA: A
-------
Page 3-8
3.1.3 Selecting a Background Test Form

When comparing Background Test Forms 1 and 2,
it is important to distinguish between the selection
of the null hypothesis, which is a burden-of-proof
issue, and the selection of the investigation level,
which involves determination of an action level.

Background Test Form 1 uses a conservative
investigation level of A = 0, but relaxes the burden
of proof by selecting the null hypothesis that the
contaminant concentration in potentially impacted
areas is not statistically different from background.
Background Test Form 2 requires a stricter burden
of proof, but relaxes the investigation level from 0
to S. Section 5.4 includes further discussion of how
to choose between Test Forms 1 and 2, and gives
additional guidance for setting up the hypotheses.
See the box on the previous page about Interpreta-
tion of the Statistical Measures.

Regardless of the choice of hypothesis, an incorrect
conclusion could be drawn from the data analysis
using either form of the test. To account for this
inherent uncertainty, one should specify the limits
on the Type I and Type II decision errors. This task
is addressed in Step 6 of the DQO process and
described in Section 3.4.

3.2 Errors Tests and Confidence
Levels

A key step in developing a sampling and analysis
plan is to establish the level of precision required of
the data.3 Whether the null hypothesis (Section 3.1)
will be rejected or not depends on the results of the
sampling.Due to the uncertainties that result from
sampling variation, decisions made using hypothesis
tests will be subject to errors. Decisions should be
made about the width of the gray region and degree
of decision error that is acceptable. These topics are
discussed below and in more detail in Chapter 5.
There are two ways to err when analyzing data
(Table 3.3):

*• Type I Error: Based on the observed data, the
test may reject the null hypothesis when in fact
the null hypothesis is true (a false positive).
This is a Type I error. The probability of
making a Type I error is a (alpha); and

*• Type II Error: On the other hand, the test may
fail to reject the null hypothesis when the null
hypothesis is in fact false (a false negative).
This is a Type II error. The probability of
making a Type II error is (3 (beta).

The acceptable level of decision error associated
with hypothesis testing is defined by two key
parameters—confidence level and power (see the
box at the bottom of the previous page). These para-
meters are closely related to the two error probabili-
ties, a and (3.

*• Confidence level 100(1 - a)%: As the confi-
dence level is lowered (or alternatively, as a is
increased), the likelihood of committing a Type
I error increases.

*• Power 100(1 - /3)%: As the power is lowered (or
alternatively, as (3 is increased), the likelihood
of committing a Type II error increases.
Decision Based
on Sample Data
H0 is not rejected
H0 is rejected
Actual Site Condition
H0 is True
Correct Decision: (1 - a)
Type I Error:
False Positive (a)
H0 is not True
Type II Error:
False Negative ((3)
Correct Decision: (1 - (3)
Table 3.3 Hypothesis Testing: Type I and Type II Errors

Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page 3-9
Although a range of values can be selected for these
two parameters, as the demand for precision
increases, the number of samples and the cost will
generally also increase. The cost of sampling is
often an important determining factor in selecting
the acceptable level of decision errors. However,
unwarranted cost reduction at the sampling stage
may incur greater costs later. The number of
samples, and hence the cost of sampling, can be
reduced but at the expense of a higher possibility of
making decision errors that may result in the need
for additional sampling, unnecessary remediation, or
increased risk. The selection of appropriate levels
for decision errors and the resulting number of
samples is a critical component of the DQO process
that should concern all stakeholders.

Because there is an inherent tradeoff between the
probability of committing a Type I or Type II error,
a simultaneous reduction in both types can only
occur by increasing the number of samples. If the
probability of committing a false positive is reduced
by increasing the level of confidence of the test (in
other words, by decreasing a), the probability of
committing a false negative is increased because the
power of the test is reduced (increasing (3).

For the purposes of this guidance, minimum recom-
mended performance measures are:8

* For Background Test Form 1, confidence level
at least 80% (a = 0.20) and power at least 90%
((3 = 0.10).

* For Background Test Form 2, confidence level
at least 90% (a = 0.10) and power at least 80%
((3 = 0.20).

When using Background Test Form 1, a Type I error
(false positive) is less serious than a Type II error
(false negative). 777/5 approach favors the protection
of human health and the environment. To ensure
that there is a low probability of Type II errors, a
Test Form 1 statistical test should have adequate
power at the right edge of the gray region.

When Background Test Form 2 is used, a Type II
error is preferable to committing a Type I error. This
approach favors the protection of human health and
the environment. The choice of hypotheses used in
Background Test Form 2 is designed to be protec-
tive of human health and the environment by
requiring that the data contain evidence of no sub-
stantial contamination. This approach may be
contrasted to the "innocent until proven guilty"
approach used in Background Test Form 1.

3.3 Test Performance Plots

During the scoping stage for the development of the
sampling plan, the interrelationships among the
decision parameters can be visualized using a test
performance plot. The test performance plot is a
graph that displays the combined effects of the
decision error rates, the gray region for the decision-
making process, and the level of a substantial
difference between site and background. In short, it
displays most of the important parameters developed
in the DQO process.

A test performance plot is used in the planning
stages of the DQO process to aid in the selection of
reasonable values for the decision error rates (a and
(3), the MOD, and the required number of samples.
Selection of these parameters is usually an iterative
process. Trial values of the decision error rates, the
location of the gray region, and its width (the MOD)
are used to generate initial estimates of the required
number of samples and the resulting test perfor-
mance curve. Adjustments to the inputs are made
until a design is achieved that offers acceptable test
performance at an acceptable cost.

Figure 3.1 illustrates an example of a test perfor-
mance plot for decision making on a statistical test
based on the null hypothesis that the mean
concentration in the potentially impacted area does
not exceed mean background concentration (Back-
ground Test Form 1). At the origin of the plot, the
true difference between the means of the site and
background distributions is zero (A=0). Positive
values of the difference between the site and back-
ground mean concentrations (A > 0) are plotted on
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page 3-10
the horizontal axis to the right of the origin, negative
values (A < 0) to the left. The vertical axis shows
the value of the test performance measure, defined
as the power of the test. The power of the test is the
probability of rejecting the null hypothesis which,
for this test form, equals the probability of deciding
the mean concentration in potentially impacted areas
exceeds the mean background concentration. This
probability ranges from 0 to 1.0 (0 to 100 percent).
Test Performance Plot: Test 1
HO: Delta < 0 vs H1: Delta > 0
-20 0 20 40 60 80 100 120 140
Delta * Site - Background Concentration
Figure 3.1 Test performance plot: site is not
significantly different from background

At the left edge of the gray region, the test perfor-
mance curve is no greater than a for potentially
impacted areas with mean contaminant concentra-
tions less than or equal to background mean concen-
tration (A < 0) and greater than a for potentially
impacted areas with mean concentration exceeding
the mean background concentration (A > 0). The test
performance curve increases as the difference
between the potentially impacted area and back-
ground means increases. The number of samples and
the standard deviation, o, determine the rate of
increase. The right edge of the gray region is located
at the MOD (A = MOD). At this value of the
difference between the mean potentially impacted
area and background concentrations, the probability
of deciding that the potentially impacted area
exceeds background is equal to 1 - (3. When using
Background Test Form 1, the test performance curve
equals the power of the test. A statistical software
package for plotting the power of a statistical test
may be used to generate a test performance plot.
EPA has developed two software packages that
generate power curves for the two-sample t-test:
DEFT9 and DataQUEST.10

Figure 3.1 also shows a hypothetical value of a
substantial difference for this chemical of S = 100.
The value of S was developed by conducting an
evaluation of the risks presented by the site. The
value of S is used in the DQO process as an upper
limit for the width of the gray region (MDD). In
some cases, an MDD less than S may be selected for
the test. This is determined by site-specific con-
ditions, summarized by the standard deviation, o. If
the ratio S/o exceeds 3, then a sample design with
an MDD less than S may offer a test with better
power of resolution at little additional cost of
sampling, a strategy often described using the term
"ALARA"— "As Low As Reasonably Achievable."
If the MDD is selected to be smaller than S, then the
design is conservative in the sense that potentially
impacted areas with differences from background
smaller than S can be identified by the test. The test
will have a higher power to reject the null
hypothesis for sites with mean concentrations that
are in the range between the MDD and S higher than
background. In statistical terms, the power of
rejection will be (1 - (3) at A = MDD, and higher
than(l - (3) for all A > MDD.

Selecting an MDD less than S is also useful for
screening a large number of areas using a low cost
sample measurement procedure, with subsequent
confirmatory testing using more expensive proce-
dures before making a final decision. Finally, before
using previously collected data for decision making,
the power of the test should be calculated to
determine if the MDD is less that S.

An equivalent plot in Figure 3.2 shows the test
performance curve for a statistical test using the null
hypothesis that the potentially impacted area does
not exceed background by more than a substantial
difference (Background Test Form 2). For this Test
Form, the MDD again measures the width of the
gray region, but the gray region now extends from a
difference of A = S-MDD on the left to a difference
A = S on the right.
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page 3-11
Test Performance Plot: Test 2
HO: Delta .-Svs H1: Delta < S
•20 0 20 40 GO 80 100 120 140
Delta = Site - Background Concentration
Figure 3.2 Test performance plot: site does not
exceed background by more than S
When using Background Test Form 2, the MDD
may be selected to be as large as S or smaller. The
implications of making the MDD smaller than S for
this Test Form differ from those that occur when
using Background Test Form 1. As the MDD
decreases below S, the test will identify more
potentially impacted areas as not having mean
concentrations that exceed background by more than
S. The sites with mean concentration in the range
between A = 0 and A = S - MDD (those with mean
concentrations only slightly higher than back-
ground) will have a higher probability of being
classified correctly. With this Test Form, a tradeoff
exists between taking more samples and making
more errors. Since the errors tend to occur in sites
that are marginally acceptable, it would be
beneficial for responsible parties to increase the
number of samples and the power of the test.

This second form of background test requires
switching the location of a and (3. The Type I error
(a) for Background Test Form 2 is measured by the
difference between 100% and the test performance
curve at the right of the gray region, while the Type
II error ((3) is measured by the value of the test
performance curve at a difference equal to A = S -
MDD, located at the left of the gray region. When
using Background Test Form 2, the testperformance
curve equals 100% minus the power of the test.
When using Background Test Form 1, a Type I error
could lead to unnecessary remediation while a Type
II error could lead to unacceptable health risks. If
Background Test Form 2 is used, a Type II error
could lead to unnecessary remediation while a Type
I error could lead to unacceptable health risks.
Therefore, one should attempt to reduce the chance
of making either of these errors.

Comparison of Figures 3.1 and 3.2 demonstrates
that the choice a2 = (3], (32 = c^ , and MDD = S will
result in almost identical test performance plots for
Background Test Form 1 and Background Test
Form 2. If MDD is less than S, then Background
Test Form 1 will indicate that more potentially
impacted areas require remediation than Back-
ground Test Form 2. In general, a will differ from (3,
and the value selected for the MDD may be smaller
thanS.

The selection of acceptable decision error rates for
hypothesis testing is a decision that should be made
on a site-specific basis. The consequences of
making a wrong decision (such as failing to reject
the null hypothesis when it is false) should be
considered when specifying acceptable values for
the confidence and power factors (a = 0.20 and (3 =
0.10 are maximum values for Background Test
Form 1).

3.4 DQO Steps for Characterizing
Background

DQOs should be used when developing sampling
and analysis plans (SAPs) to ensure that reliable
data are acquired. The process is outlined here with
a case example for purposes of developing back-
ground sampling plans. For further details, consult
Section 6 of Guidance for the Data Quality Objec-
tives Process^ and Guidance for Data Quality Asses-
sment: Practical Methods for Data Analysis2

The DQO process is the starting point for many
decisions that shape the sampling plan. It involves a
series of steps for making optimal decisions based
on limited data. A careful statement of the DQOs for
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page 3-12
a study will clarify the study objectives, define the
most appropriate type of data to collect, determine
the most appropriate conditions for collecting data,
and specify limits on decision errors. Use of the
DQO process ensures that the type, quantity, and
quality of environmental data used in decision
making will be appropriate for the intended applica-
tion. It improves efficiency by eliminating unneces-
sary, duplicative, or overly precise data. The DQO
process provides a systematic process for defining
an acceptable level for decision errors. The DQO
process and decision parameters establish the
quantity and quality of data needed. A sampling
design is developed to implement these require-
ments by defining the specific measurement proto-
col, sample locations, and number of samples that
will be collected. Detailed procedures for develop-
ing the sampling design are presented in EPA
QA/G-5S11. Many new sampling approaches are
discussed in this document, including ranked set
sampling and adaptive cluster sampling.

Each of the seven steps of the DQO process, listed
below, may be phrased as a question about back-
ground issues:

1. State the Problem
2. Identify the Decision
3. Identify Inputs to the Decision
4. Define Boundaries of Study
5. Develop a Decision Rule
6. Specify Limits on Decision Errors
7. Optimize the Design for Obtaining Data

The examples provided in this section should be
modified to fit the site of concern. A statistician
familiar with the challenges posed by environmental
data should be consulted before data are collected.
The statistician should be involved in discussions
about the goals of the background analysis, time and
cost constraints, limitations of the measurement
techniques, and the availability of preliminary data.

Step 1. State the Problem: Example: Are there
differences between the concentrations of a site con-
taminant and those concentrations that are found in
background samples?
Tasks include:

> Identifying the resources available to resolve the
problem. The team should include the decision
makers, technical staff and data users, and
stakeholders. Members of the technical staff
may include quality assurance managers, chem-
ists, modelers, soil scientists, engineers, geolo-
gists, health physicists, risk assessors, field
personnel, and regulators.
* Developing or refining the comprehensive con-
ceptual site model.

Step 2. Identify the Decision: Example: Are the
chemicals associated with a site-related source or
are they associated with background?

Tasks include:

*• Identifying the chemicals to analyze; and
* Determining if these chemicals are expected to
occur in reference areas selected to reflect
background conditions.

Step 3. Identify Inputs into the Decision: Exam-
ple: What kinds of data are needed? What kinds of
data are available?

Tasks include identifying:
Which chemicals need to be analyzed;
Which soil types and depths need to be
sampled;
Which comparison tests are likely to be used
(see Chapter 5 for details about comparison
tests);
What coefficient of variation is expected for the
data (based on previous samples if possible);
What preliminary remediation goals (PRGs) or
applicable or relevant and appropriate require-
ments (ARARs) should be considered; and
What are the desired power and confidence
levels?
Decision outputs for background characterizations
are discussed in detail in Chapter 5.
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page 3-13
Step 4. Define Boundaries of the Study: Example:
What are the spatial and temporal aspects of the
environmental media that the data should represent
to support the decision?

Tasks include:

* Defining the geographic areas for field investi-
gation;
* Defining the characteristics of the soil data
population of interest;
*• Dividing the soil data population of interest into
strata having relatively homogeneous charac-
teristics;
* Determining the timeframe to which the
decision applies; and
*• Identifying practical constraints that may hinder
sample collection.

Step 5. Develop a Decision Rule: Example: If the
mean concentration in potentially impacted areas
exceeds the mean background concentration, then
the chemical will be treated as site-related.
Otherwise, if the mean concentration in potentially
impacted areas does not exceed the background
mean, the chemical will be treated as background-
related.

Tasks include:

*• Choosing the null hypothesis, H0;
*• Specifying the alternative hypothesis, HA;
*• Specifying the gray region for the hypothesis
test; and
> Determining the level of a substantial difference
above background, S.

Hypothesis testing is an approach that helps the
decision maker through the analysis of data. Chapter
5 discusses the application of hypothesis testing at
CERCLA sites. General information on hypothesis
testing is provided in Section 3.1.

Step 6. Specify the Limits on Decision Errors:
Example: What level of uncertainty is acceptable for
this decision? (For definitions, see Section 3.1 on
Hypothesis Testing, Section 3.2 on Errors, and
Confidence Levels, and Figures 3.1 and 3.2.):

*• Test Form 1—The gray region extends from a
difference of A = 0 on the left to A = MOD on
the right. Acceptable limits on decision errors
are c^ at the left edge of the gray region, and p\
at the right edge. Here, c^ measures the Type I
error rate for Test Form 1, which is the prob-
ability of rej ecting the null hypothesis when it is
true, i.e., the probability of wrongly concluding
that the mean concentration on site exceeds the
background mean when it does not. p^ measures
the Type II error rate for Test Form 1, which is
the probability of not rej ecting the null hypothe-
sis when it is false, i.e., wrongly concluding the
mean concentration on the site does not exceed
background when it does.

*• Test Form 2—The gray region extends from a
difference of A = (S - MOD) on the left to A =
S on the right. The acceptable limits on decision
errors are a2 at the right edge of the gray region,
and (32 at the left edge. Here, a2 measures the
Type I error rate for Test Form 2, which is the
probability of rejecting the null hypothesis
when it is true. For this test, the Type I error
rate is the probability of concluding (wrongly)
that the mean concentration on the site does not
exceed the background mean by more than S
when it does. Similarly, (32 measures the Type II
error rate for Test Form 2, which is the prob-
ability of not rejecting the null hypothesis when
it is false. In this case, the Type II error rate is
the probability of concluding (wrongly) that the
mean concentration on site exceeds the back-
ground mean by more than S when it does not.

Tasks include:

*• Determining the possible range for A;
*• Specifying both types of decision errors (Type
I and Type II—see Section 3.2);12
*• Identifying the potential consequences of each
type of error, specifying a range of possible
values for A (the gray region—see Figures 3.1
and 3.2) where consequences of decision errors
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page 3-14
are relatively minor; and
> Selecting the limits on decision errors (a and (3)
to reflect the decision-maker's concern about
the relative consequences for each type of
decision error (Section 3.2).

Step 7. Optimize the Sampling Design: Example:
What is the most resource-effective sampling and
analysis design for generating data that are
expected to satisfy the DQOs?

Tasks include:

*• Reviewing the DQO outputs and existing
environmental data;
*• Developing general sampling and analysis
design alternatives;
* Verifying that DQOs are satisfied for each
design alternative;
*• Selecting the most resource-effective design
that satisfies all of the DQOs; and
*• Documenting the operational details and theor-
etical assumptions of the selected design in the
sampling and analysis plan.

More information may be required to make a
decision. If the required sample size is too large, it
may be necessary to modify the original DQO
parameters. To reduce sampling cost while maximi-
zing utility of the available resources, one or more
of the constraints used to develop the sampling
design may be relaxed. Gilbert presents useful infor-
mation on how to factor cost into a sampling
design.13

3.5 Sample Size

The RPM should consult with a statistician who has
experience in designing environmental sampling
programs to select the appropriate sampling design.
Several sampling design options are available. See
EPA QA/G5S11 for guidance on sampling design. A
consistent grid to cover the entire site and areas
considered as background should provide a reason-
able characterization of the concentrations onsite
and in background areas. The ideal data sets should
be independent (spatially uncorrelated), unbiased,
and representative of the underlying site and back-
ground populations. These assumptions favor wide-
spread random samples. However, in many instan-
ces, the background analyses should rely on existing
site data collected using judgmental sampling. Such
data sets are often biased, clustered, and correlated.
In certain cases, the existing clustered data set may
be declustered for background analyses. A variety of
de-clustering alternatives exist. For example, the
investigated area can be divided into equally spaced
grids. Each grid can then be represented by average
concentration of measured values within the grid, or
a predefined number of samples can be selected
randomly from each grid. Additional options are
described in other guidance, including Chapter 4 of
RAGS.4

In most DQO applications, after electing to use a
test with confidence level 100(1 - a) percent, the
required number of samples is determined by
simultaneously selecting:

*• the MOD for the test; and
*• the power (1 - (3) of the test at the MOD.

Therefore, limits on the probability of committing
Type I and Type II errors can be used as constraints
on the number and location of samples. The DQO
process is meant to be an iterative process. If the
number of samples determined with the selected
error probabilities is too large for the available
resources, the DQO procedure should be repeated
with more reasonable error objectives until an
acceptable number of samples is determined. To
determine realistic limits for the decision errors, the
number of samples (and the corresponding cost of
sampling) could be estimated for a range of error
probability values, which would indicate the likeli-
hood of making either type of error. Reports of the
results of the DQO process should specify the
number of samples selected and the expected error
probabilities that result from this selection.

Several reference documents give formulas ortables
for selecting the number of samples, given the
specific confidence and power limits.14 Chapter 5
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page 3-15
offers guidance for selecting appropriate statistical
techniques for comparing onsite and background
contaminant concentrations in soil.

Examples of constraints that may be adjusted to
influence the required sample size include:

> Increasing the decision-error rates, a and (3,
while considering the increased costs and risks
associated with the increased probability of
making an incorrect decision;

* Increasing the width of the gray region (MDD),
but not to exceed a substantial difference (MDD
< S); and

> Changing the boundaries. It may be possible to
reduce measurement costs by segregating the
site into subunits that require different decision
parameters due to different risks.

3.6 An Example of the DQO Process

This section presents a hypothetical application of
the DQO process for comparing lead concentrations
in a potentially impacted area to background. The
conceptual site model and remedial goals for
individual sites will determine what sampling and
analysis is done at any site. The example will illus-
trate some outputs of the DQO process and will be
extended to the preliminary data analysis stage in
Chapter 4 and to the hypothesis testing stage in
Chapter 5. RPMs should consult the Technical
Review Workgroup for Lead (TRW) for technical
assistance with lead-contaminated sites.15 This
example only illustrates the DQO process and does
not establish guidance pertaining to subsurface soil
sampling or lead cleanup goals.

Step 1. State the Problem

An abandoned storage yard has been identified as
the previous location of a battery distributorship.
Concerns have focused on this storage area as a pos-
sible source of lead contamination. Other sources of
background lead are present in the vicinity of the
storage yard due to nearby highways and industrial
facilities. Available data are not sufficient to deter-
mine that the concentrations in the potentially
impacted area are different from background chemi-
cal concentrations. The study team has decided to
conduct field measurements.

a. What resources (including necessary personnel)
are available to resolve the problem?

The members of the study team will include the
plant manager, a plant engineer, a chemist with field
sampling experience, a quality assurance officer, a
statistician, a risk assessor, and the remedial project
manager.

b. What characteristics or data will determine the
comprehensive conceptual site model?

Historical site assessment was used to develop a
comprehensive conceptual site model. Due to near-
by highways and industrial sources in the vicinity of
the yard, background lead concentrations in soil are
expected to be above the national average. Also,
because of run-off from paved areas, background
concentration near the paved areas are likely to be
higher than background concentrations in soils
distant from the paved areas. The selection of
appropriate background areas for the comparison
was restricted to areas at least 1,000 meters from
heavily used highways and 30 meters from paved
surfaces. These requirements were selected to match
the relative location of the site with respect to the
surrounding roads and highways.

Step 2. Identify the Decision

Do soils in the storage area have higher lead concen-
trations than found in soils in the surrounding area?

a. What chemical(s) should be analyzed.!

The purpose of the study is to compare total lead
concentrations at the storage yard and in surroun-
ding background areas.

b. Is the chemical likely to be a background
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page 3-16
constituent?

Because of the nearby highways and other industrial
sources in the vicinity of the yard, background lead
concentrations are expected to be elevated. This
example will include statistical evaluation of only
unpaved areas.

Step 3. Identify Inputs into the Decision

a. Which chemicals will be analyzed?

EPA decides to focus on total lead concentration.

b. Which soil types and depths need to be
sampled?

Because there is neither surface evidence, nor
historical record, of excavation in the storage area,
EPA decides to measure total lead concentration in
the first 12 inches of surface soils. Soils in back-
ground locations will be sampled in the same way.
The TRW has recommended soil sieving at 250 (im
to assess exposures to lead on the fine fraction of
soil and dust.16 For background sampling of lead,
this fractionation may be appropriate as it relates to
human health risks.

c. Which comparison tests are likely to be used?

EPA expects that lead concentrations may not be
normally or lognornally distributed. The study team
decides to use a nonparametric statistical test for
differences in the soil lead concentration distribu-
tion in the storage yard and in the surrounding areas.

d. What coefficient of variation is expected?

Based on previous sampling in other areas, a coef-
ficient of variation ranging from 50% to 200% is
expected. Preliminary data collected at the site
indicate a standard deviation of approximately 50
mg/kg. Since this estimate is based on very limited
data, the team decides to use a more conservative,
preliminary estimate of o = 75 mg/kg in the first
stage of planning the survey design.
e. What preliminary remediation goals (PRGs)
may need to be met?

A PRG of 400 mg/kg is available for residential
sites.17

f What are the desired power and confidence
levels?

The study team decides initially on a Type I
decision error limit of a = 0.10 and a Type II
decision error limit of (3 = 0.10 (power = 90%). The
team agrees to review this decision, depending on
the overall cost estimates produced by these
objectives.

Step 4. Define Boundaries of the Study

a. What geographic areas should be investigated?

The study team decides that the entire storage yard
area, approximately 5 acres, will be included in the
study. Four different background areas of approxi-
mately 10,000 m2 were selected at distances of
between 1,000 m and 10,000 m from the storage
yard boundaries.

b. What are the characteristics of the soil data or
population of interest?

Soil samples should be collected in dry, unpaved
areas. Prepared samples should be free of roots,
leaves, and rocks or other consolidated materials.
When preparing the samples, these materials should
be removed using a 3 cm diameter sieve. Oversized
materials should be retained for additional weighing
and analysis, if necessary.

c. How should the soil data be stratified statis-
tically into relatively homogeneous character-
istics!

No stratification is planned for this study.

d. What is the time frame to which the decision
applies!
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page 3-17
Sampling will be conducted during a four-week
period in the fall. Lead concentrations in soil are
relatively static, and decisions based on the samp-
ling results will remain applicable for many years,
barring additional contamination.

e. What practical constraints may hinder sample
collection?

The plant manager agreed to permit EPA sampling
on the storage yard. Permission must be obtained
from the owners of the selected background samp-
ling areas for permission to enter and to collect
background samples on their property.

Step 5. Develop a Decision Rule

If the selected statistical test indicates that the mean
concentration in potentially impacted areas exceeds
the mean background concentration by more than a
substantial difference, then the chemical will be
treated as site-related. Otherwise, if the statistical
test indicates that the mean concentration in
potentially impacted areas does not exceed the back-
ground mean, the chemical will be treated as
background-related.

a. What should the null hypothesis be?

The study team chooses a null hypothesis that the
lead concentrations in the storage yard exceed
background concentrations.

*• H0: Lead concentrations in the storage yard
samples exceed background concentrations by
more than S = 100 mg/kg (see paragraphs c and
d, below, and Appendix A for how 100 mg/kg
was chosen).

b. What is the alternative hypothesis?

The alternative hypothesis is the opposite of the null
hypothesis.

*• HA: Lead concentrations in the storage yard
samples do not exceed the background concen-
trations by more than S = 100 mg/kg.
c. What level constitutes a substantial difference
above background!

The study team decided to use 100 mg/kg as the
value for a substantial difference in lead concentra-
tions between the storage yard and background
areas. Issues pertaining to the selection of a value
for a substantial difference are discussed in
Appendix A.

d. Specify the gray region for the hypothesis test

When using Background Test Form 2, the gray
region of width MDD starts at a difference of A = S
= 100 mg/kg and extends on the left down to A = (S
- MDD). As a trial value, the study team chose to
use an MDD that is one-half of S, 50 mg/kg (refer to
Table 3.1). This MDD represents abalance between
the cost of extra sampling and the expected cost of
remediating the site unnecessarily.

Step 6. Specify the Limits on Decision Errors

a. What is the possible range of the parameter of
interest?
The possible range of lead concentrations in
industrial soil is very wide, ranging from 0 to many
grams per kilogram.

b. What are the acceptable decision errors (Type
I and Type lip.

The team decides that the acceptable limits on
decision errors are a = 0.10 for Type I errors at a
difference of A = S = 100 mg/kg, and (3 = 0.10 for
Type II errors at a difference of A = S/2 = 50 mg/kg.

In Figure 3.2, the test performance curve achieves a
probability of 90% of detecting significant differ-
ence (A = S). The study team is comfortable with
the choice of a 90% confidence level for the test,
because this reduces the chance of a false nega-
tive—deciding that the yard does not exceed back-
ground by more than S.

The choice of (3 = 0.10 and the selected value for the
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page 3-18
MDD equal to one-half the width of the gray region
means that the power of 90% will be required at A
= S/2. The plant manager recognizes that a lower
value of P (higher power) would result in a lower
probability of a Type II error and improve his
chances of passing the test, but he has decided that
the extra sampling costs required to achieve a higher
power are not necessary.

c. What are the potential consequences of each
type or error, specifying a range of possible
parameter values (gray region) where conse-
quences of decision errors are relatively minor!

The team decides that the decision errors are a =
0.10 at A = S, and (3 = 0.10 at A = S/2. The gray
region extends from a difference A = 50 mg/kg to a
difference of A = 100 mg/kg (referto Figure 3.2 and
decisions made in Steps 5d and 6b).

Test Form 2 has at least 100(l-a)% confidence of
correctly detecting a site that exceeds background
by more than S, regardless of the sample size.
Greater sample size increases the power of the test
and reduces (3, which reduces the chance that a site
is remediated unnecessarily. When using Test Form
2, extra samples represent the cost of increasing the
chance that the site is determined to be acceptable
when the true A is less than S. The study team
agrees to review this decision, depending on the
overall cost estimates produced by the decision
objectives.

d. Do the limits on decision errors ensure that they
accurately reflect the study team's concern
about the relative consequences for each type of
decision error?

The study team is satisfied with the choice of the
90% confidence level for the statistical test, because
this will reduce to 10% the chance of falsely
deciding that the yard does not exceed background
by more than 100 mg/kg when it truly does. The use
of a level-a test will provide 90% confidence for all
sample sizes, but may have poor power if the sample
size is too low.
The sample size is fixed by the choice of MDD and
P. Choosing p = 0.10 at a difference of A = 50
mg/kg means that a power of at least 90% will be
obtained if the true lead concentration on the yard is
at or below that value. The plant manager recog-
nizes that a lower value of P (higher power) would
result in a lower probability that the test will decide
the yard exceeds background lead concentrations if
the yard is only 50 mg/kg higher than background.
However, the manager has decided that this extra
power would require more sampling and unwanted
additional sampling costs.

The DQO parameters a, P, S, MDD, and o provide
the information needed to calculate the number of
samples (N) required from each population. N
samples will be collected in contaminated areas, and
N samples will be collected from background areas.

The sample size may be calculated using the app-
roximate formulas presented in Chapter 3 of EPA
QA/G-9.2 The approximate sample size calculated
with the values a = 0.10, P = 0.10, MDD = 50 mg/
kg, with a conservative estimate for o of 75 mg/kg,
is N = 35, as shown in Table 3.1. If the actual o is
measured to be only 50 mg/kg, as indicated by
preliminary data, then only 16 samples would be
required in each area. In this case, a retrospective
power analysis would show that the design had more
than adequate power. If the actual o is measured to
be 100 mg/kg, then 62 samples would be required.
In this latter case, the retrospective power analysis
would indicate that the design did not have adequate
power to make the decision and additional samples
should be collected. The estimate of o is one of the
most important design parameters, and the success
of the survey design will depend strongly on the
accuracy of this estimate. More specific sample-size
calculation procedures are given in MARSSIM.18

Step 7. Optimize the Sampling Design

What is the most resource effective sampling and
analysis design for generating data that are expected
to satisfy the DQOs?

a. Review the DQO outputs and existing environ-
mental data
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page 3-19
The statistician, chemist, and plant engineer on the
study team have reviewed the outputs developed at
each stage of the DQO process.

b. Develop general sampling and analysis design
alternatives

The study team decides to use a randomly-oriented,
rectangular grid sampling strategy for the storage
yard and selected background area. Two random
numbers (x and y) randomly will determine the
starting point selected for the grid. The grid orien-
tation will be determined by a third random number.
The size of the grid will be calculated based on the
number of samples required for each area.

c. Verify that DQOs are satisfied for each design
alternative
Only one sample design is used in this study.

d. Select the most resource-effective design that
satisfies all of the DQOs

Alternative sampling designs may result in lower
sampling costs. The study team agrees to consider
the alternative sample designs suggested in EPA
QA/G-5S before the sampling program begins.12

e. Document the operational details and theoreti-
cal assumptions of the selected design in the
sampling and analysis plan

The EPA team has documented the discussions
leading to each DQO parameter.
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page 3-20
CHAPTER NOTES

1. U.S. Environmental Protection Agency (EPA). 2001. Requirements for Quality Assurance Project Plans,
EPA QA/R-5. http://www.epa.gov/quality/qapps.html.

2. U.S. Environmental Protection Agency (EPA). 2000. Guidance for Data Quality Assessment: Practical
Methods for Data Analysis, EPA QA/G-9, QAOO Version. Quality Assurance Management Staff,
Washington, DC, EPA 600-R-96-084. Available athttp://www.epa.gov/quality/qa_docs.html.
*• Equations for computing retrospective power are provided in the detailed step-by-step instructions for
each hypothesis test procedure in Chapter 3.

3. U.S. Environmental Protection Agency (EPA) .1994. Guidance for the Data Quality Objectives Process,
EPA QA/G-4, EPA 600-R-96-065. Washington DC.

4. U.S. Environmental Protection Agency (EPA). 1989. Risk Assessment Guidance for Superfund Vol. I,
Human Health Evaluation Manual (Part A). Office of Emergency and Remedial Response, Washington,
DC. EPA 540-1-89-002. Hereafter referred to as "RAGS."

5. Mathematically, Background Test Form 1 is written:
H0: A < 0 vs HA: A > 0
with A = 0S - 9B, where 0S is the selected decision parameter (mean, median, etc.) for the site distribution,
and 9B is the same parameter for the background distribution.

6. U.S. Environmental Protection Agency (EPA). 1989. Statistical Methods for Evaluating the Attainment
of Cleanup Standards, EPA 230/02-89-042, Washington DC.

7. Mathematically, Background Test Form 2 uses the substantial difference S as a non-zero action level:
H0: A > S vs HA: A < S
with A = 0S - 9B, where 9S is the selected decision parameter (mean, median, etc.) for the site distribution,
and 9B is the same parameter for the background distribution.

8. U.S. Environmental Protection Agency (EPA). 1990. Guidance for Data Usability in Risk Assessment:
Interim Final, October 1990. EPA 540-G-90-008, PB91-921208, Washington, DC.

9. U.S. Environmental Protection Agency (EPA). 1994. The Data Quality Objectives Decision Error
Feasibility Trials (DEFT) Software (EPA QA/G-4D), EPA/600/R-96/056, Office of Research and
Development, Washington, DC.

10. U.S. Environmental Protection Agency (EPA). 1996. The Data Quality Evaluation Statistical Toolbox
(DataQUEST) Software (EPA QA/G-9D), Office of Research and Development, Washington, DC.

11. Guidance for Choosing a Sampling Designfor Environmental Data Collection, EPAQA/G5S,U.S.EPA,
Office of Environmental Information, Peer Review Draft, Aug. 2000.
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page 3-21
12. For further guidance on the use of hypothesis tests in environmental decision making, see EPA QA/G-4,
Guidance for the Data Quality Objectives Process, EPA/600/R-96/055. The theory of hypothesis testing
is discussed in many introductory statistics textbooks, including the popular text by Mood et al. (1974)
Introduction to the Theory of Statistics, 3rd Ed., MrGraw Hill, Chapter IX. Readers with some
background in statistics may refer to Chapter 5 of Mathematical Statistics: Basic Ideas and Selected
Topics, P.J. Bickel and K.A. Doksum, Holden-Day, 1977, for a discussion of error rates and relative
importance of the errors (p. 168) that can be committed in hypothesis testing.

13. Gilbert, Richard O. 1987. Statistical Methods for Environmental Pollution Monitoring, VanNostrand
Reinhold.

14. Common references for sample selection include:

*• Cochran, W. 1977. Sampling Techniques. New York: John Wiley.

*• Gilbert, Richard O. 1987. Statistical Methods for Environmental Pollution Monitoring. New York:
Van Nostrand Reinhold.

*• U.S. Environmental Protection Agency (EPA). 1989. Statistical Methods for Evaluating the
Attainment of Cleanup Standards. Op. cit.

*• U.S. Environmental Protection Agency (EPA). 1990. Guidance for Data Usability in Risk
Assessment. Op. cit.

15. EPA's Technical Review Workgroup for Lead provides technical assistance for people working on lead-
contaminated sites. For assistance or more information, the reader should refer to their website (http://
epa.gov/superfund/programs/lead) or call the Lead Hotline (800-680-5323).

16. U.S. Environmental Protection Agency (EPA). TR W Recommendations for Sampling and Analysis of Soil
at Lead (Pb) Sites. Office of Emergency and Remedial Response, Washington, DC. EPA 540-F-00-010,
OSWER 9285.7-38.

17. U.S. Environmental Protection Agency (EPA). 1994. Revised Interim Soil Lead Guidance for CERCLA
Sites andRCRA Corrective Action Facilities. OSWER Directive 9355.4-12.

18. U.S. Environmental Protection Agency (EPA), U.S. Nuclear Regulatory Commission, et al. 2000.
Multi-Agency Radiation Survey and Site InvestigationManual(MARSSIM). Revision 1. EPA 402-R-97-
016. Available at http://www.epa.gov/radiation/marssim/ or from http://bookstore.gpo.gov/index.html
(GPO Stock Number for Revision 1 is 052-020-00814-1).
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
CHAPTER 4
PRELIMINARY DATA ANALYSIS
This chapter provides guidance for preliminary data
analysis using graphs and distributions of the data.
Depending upon the quality of existing site and
background data, quantitative analysis used to estab-
lish background concentration may involve a com-
bination of comparative statistical analysis and
graphical methods. The preliminary data analysis is
an integral part of choosing the appropriate methods
for making statistically valid comparisons of site
and background concentrations.

Preliminary data analysis should include a detailed
"hands-on" inspection of the site and background
data before proceeding to the statistical tests.
Graphs are used to identify patterns and relation-
ships within the onsite and background data sets,
and to compare the two data sets. Preliminary data
analysis should be focused on verifying assump-
tions, such as normality, made in the DQO process.
The review should identify anomalies in the data,
including potential outliers. This step is formally a
part of the Data Quality Assessment.1

The preliminary inspection may include develop-
ment of a posting plot,1 which is a map showing the
measured concentration and location of each sam-
ple. The posting plot may reveal likely sources of
contamination, important areas that have not been
sampled, spatial correlations or trends in the data,
and the location of suspected outliers. Note that one
possible outcome of the preliminary data inspection
is that the chemical concentrations detected at the
site are much higher than background ranges
reported for similar soil types. In this case, a formal
background analysis may not be necessary if all or
most of the detected concentrations are well above
the range likely to represent background. Another
possible outcome of the preliminary analysis is that
all chemical concentrations are well below risk-
based screening levels. In this case, background
analysis is not likely to be necessary.

This chapter presents information useful for both
parametric and nonparametric data analysis (defined
in the box below). Parametric statistical methods are
based on the assumption of a known mathematical
form for the probability distributions that represent
the site and background populations. For many
parametric methods, the data user should first deter-
mine whether the data are normally distributed,
using any of several tests for normality.

Parametric and Nonparametric Methods

Parametric: A statistical method that relies on a
known probability distribution for the population
from which the data are selected. Parametric
statistical tests are used to evaluate statements
(hypotheses) concerning the parameters of the
distribution.

Nonparametric: A distribution-free statistical
method that does not depend on knowledge of the
population distribution.
Nonparametric methods do not require that the data
distribution be characterized by a known family of
distributions. Several graphical methods are presen-
ted for nonparametric comparisons.
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page 4-2
4.1 Tests for Normality

Tests for normality are an important step in
assessing the type of statistical test to use for
comparison with background. Parametric tests, such
as the t-test for comparing the means of the site and
background distributions, are usually based on the
assumption of normality for both data sets. Before
using a parametric test for a background compari-
son, tests should be conducted on each data set to
show whether it meets the assumption of normality.
If the raw data are not normally or lognormally
distributed, other types of transformations should be
conducted to approximate normality prior to using
the data sets in parametric statistical comparisons.

Since it is unusual to encounter environmental data
sets that are normally distributed,2 these tests are
most commonly applied after a transformation.
Usually the logarithms of the data have been applied
to the raw data. The test for normality is then
applied to the transformed data sets. In most cases,
direct application of a nonparametric background
comparison test using the raw data is preferred to
using a parametric test on transformed data. This is
particularly true when there are outliers and/or non-
detect values in the raw data. The assumption of
normality is very important as it is the mathematical
basis for the majority of parametric statistical tests.
Examples of how to perform each of these tests can
be found in Chapter 4 of EPA QA/G-9.1

The Shapiro-Wilktest is a powerful general purpose
test for normality or lognormality when the sample
size is less than or equal to 50, and is highly recom-
mended. The Shapiro-Wilk test is an effective
method for testing whether a data set has been
drawn from an underlying normal distribution. It can
also evaluate lognormality if the test is conducted on
logarithms of the data. If the normal probability plot
is approximately linear—the distribution follows a
normal curve—the test statistic will be relatively
high. If the normal probability plot contains signifi-
cant curves, the test statistic will be relatively low.

Another test related to the Shapiro-Wilk test is the
Filliben statistic, also called the "probability plot
correlation coefficient." If the normal probability
plot is approximately linear, the correlation coef-
ficient is relatively high. If the normal probability
plot contains significant curves—the distribution
does not follow a normal curve—the correlation
coefficient will be relatively low. The Filliben test
is recommended for sample sizes less than or equal
to 100.

D 'Agostino 's test for normality or lognormality is
used when sample sizes are greater than 50. This
test is based on an estimate of the standard deviation
obtained using the ranks of the data. This estimate is
compared to the usual mean square estimate of the
standard deviation, which is appropriate for the
normal distribution.

The studentized range test for normality is based on
the fact that almost 100 percent of the area of a
normal curve lies within ± 5 standard deviations
from the mean. The studentized range test compares
the range of the sample to the sample standard
deviation. For example, if the minimum of 50 data
points is 40.2, the maximum is 62.7, and the
standard deviation is 4.2, then the studentized range
is (62.7 - 40.2)74.2 = 5.4. Tables of critical sizes up
to 1,000 are available for determining whether the
absolute value of the studentized range is
significantly large. Using, for example, Table A-2 in
EPA QA/G-91 the upper critical values for the
studentized range test with n = 50 are 5.35 for a =
0.05 and 5.77 for a = 0.01. In this example, the
assumption of normality would be accepted at the
95% confidence level, but rejected at the 99%
confidence level. The studentized range test does
not perform well if the distribution is asymmetric
and if the tails of the distribution are heavier than
the normal distribution. In most cases, this test
performs as well as the Shapiro-Wilk test and is
easier to apply.

4.2 Graphical Displays

Graphical methods provide visual examination of
the site and background distributions, and compari-
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
sons of the two. Graphical methods supplement the
statistical tests described in Chapter 5. Graphical
methods also may be used to verify that the assump-
tions of statistical tests are satisfied, to identify
outliers, and to estimate parameters of probability
distributions that fit to the data. The methods
described in this section assume that separate data
sets are collected on site and in background. In some
situations, an appropriate background area with
similar soil types and chemistry cannot be identi-
fied. Graphical methods designed for analyzing
sample data collected from both impacted and non-
impacted areas on the site are addressed by Singh et
al. (1994).3

4.2.1 Quantile Plot

A quantileplot displays the entire distribution of the
data, ranging from the lowest value to the highest
value. The vertical axis for the quantile plot is the
measured concentration, and the horizontal axis is
the percentile of the distribution. Each ranked data
value is plotted against the percentage of the data
with that value or less.

To construct a quantile plot, the data set is ranked
from smallest to largest. The percentage value for
each data point of rankj is computed as

Percentj = 100 ( rankj - 0.5) / n

where n is the number of values in the data set.
There are two quantile plots in Figure 4.1, one for
the site data and another for background data. If one
or more data values are non-detects, all non-detects
are ranked first, below the first numerical value. The
plot starts with the first numerical value. For
example, if a data set with 10 observations has 2
non-detect values, then the smallest detected value
has rank 3 and a percentage of 100(3 -0.5)/10 = 25.
The highest 8 data points would be shown on the
plot, starting at the 25th percentile.

The slope of the curve in the quantile plot is an
indication of the amount of data in a given range of
values. A small amount of data in a given range will
Page 4-3
result in a large slope for the quantile plot. A large
amount of data in a range will result in a more hori-
zontal slope. A sharp rise near the bottom or the top
of the curve may indicate the presence of outliers.

A graph may contain more than one quantile plot. In
a double-quantileplot, the site and background data
are each plotted in a single graph, providing a direct
visual comparison of the two distributions. A curve
that is higher in the vertical direction indicates a
higher distribution of data values.

An example of the double-quantile plot is shown in
Figure 4.1. The lower curve shows the distribution
of the background data, and the middle curve
(indicated by symbols only) shows the quantile plot
for the site data. In this example, the entire site
distribution is higher than the background distribu-
tion indicating that some degree of contamination is
likely. The close proximity of the site and back-
ground quantile plots near the 70th percentile and
rapid divergence above indicate a larger difference
between the two distributions in the upper 30
percent of the distributions. At the left end of the
plot, the background data distribution falls off more
rapidly to zero concentration below the 20th
percentile than the site data distribution, which has
a y-intercept substantially above zero. The positive
intercept and roughly parallel shape of the three
lines in the plot below the 70th percentile suggest
that the distribution of concentrations on site is
shifted to higher levels than the background distri-
Site and Background vs Percentile
Percentile
+ Background + S
Figure 4.1 Example of a double quantile plot
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page 4-4
bution, with a larger shift above the 70th percentile.
The upper curve in the figure shows the back-
ground distribution augmented by S = 10, a hypo-
thetical value for a substantial difference over back-
ground. In this example, the entire site distribution
lies below the S-augmented background distribution,
indicating that the site does not exceed background
by more than a substantial difference.

Issues affecting the determination of site-specific
values for a substantial difference are discussed in
more detail in Appendix A of this guidance.

The formal statistical test procedures presented in
Chapter 5 may be used to make decisions that
confirm or deny these graphical indications with
predetermined error rates. In this and the following
exhibits, contaminant concentrations are plotted
using a linear scale. If the data are highly variable,
it may be necessary to transform the graph by using
a logarithmic scale for the concentration axis. Use
of the logarithmic transformation does not affect the
ranks of the data.
a known distribution, such as the normal distri-
bution. This application is referred to as a
normal probability plot. If the data follow a
normal distribution, the plot will appear as a
straight line. Probability plots are useful for
determining if the site data or the background
data follow a normal or lognormal distribution.
More information on the use of the quantile-
quantile plot to compare with known parametric
distributions is provided in Section 2.3 of EPA
QA/G-9.1

Empirical Quantile-Quantile Plot. In nonpara-
metric applications, the empirical quantile-
quantile plot is used to compare two data sets.
In our case, the two data sets are the site
distribution and the background distribution. If
there are an equal number of data values in the
two data sets, it is very easy to construct an
empirical quantile-quantile plot. The graph is
constructed by plotting each ranked site value
against the corresponding background value
with the same rank.
4.2.2 Quantile-Quantile Plots

A quantile-quantile plot is useful for comparing two
distributions in a single graph. The vertical axis of
this plot represents the first distribution of values,
and the horizontal axis represents the second
distribution. The scales for the concentration axes
may be either both linear or both logarithmic. If the
two distributions are identical, the quantile-quantile
plot will form a straight line at 45 degrees when
equal scales are used for the two axes. The slope of
this line has a value of one, regardless of the
selected scales. Deviations from this line show the
differences between the two distributions.

There are two common applications of the quantile-
quantile plot. One type is used for parametric appli-
cations, and the other for nonparametric compari-
sons.

*• Parametric Quantile-Quantile Plot. In para-
metric applications of the quantile-quantile plot,
the horizontal axis represents the quantiles from
The empirical quantile-quantile plot is useful
because it provides a direct visual comparison of the
two data sets. An example of the quantile-quantile
plot is shown in Figure 4.2. If the site and back-
ground distributions are identical, the plotted values
would lie on a straight line through the origin with
slope equal to 1, shown in the figure as the line
Interpolated Site Data vs Background
Concentration in Background
Site = Background + Background + S x Q - Q Plot
Figure 4.2 Example of an empirical quantile-quantile
plot
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page 4-5
labeled "Site=Background." Any deviation from this
line shows differences between the two distribu-
tions . The points that mark the empirical Q-Q plot in
the figure are above the line that indicates equality
of the two distributions. This indicates that the
distribution of site measurements exceeds the distri-
bution of background measurements. If the site
differs from the background data distributions only
by an additive difference along the entire distribu-
tion, the plotted site values will lie on a straight line
with slope 1 that does not pass through the origin. If
the site distribution is t units above the background
distribution, the straight line will have slope 1 and
a y intercept at +1.

A hypothetical level of substantial contamination, S,
is shown in the upper plot in Figure 4.2 labeled
"Background + S." Note that the median interpola-
ted site value is plotted against the median of the
background values at the center of the plot. When
this point lies above the equal-distribution line with
slope 1, the median interpolated site value is larger
than the median background value.

When the size of the data set differs from the size of
the background data set, interpolation is used to
construct the empirical quantile-quantile plot.
Detailed procedures for creating a quantile-quantile
plot with unequal sample sizes are provided in
Section 2.3.7.4 of EPA QA/G-9.1

4.2.3 Quantile Difference Plot

The nonparametric quantile difference plot is a
variant of the empirical quantile-quantile plot. When
site data are compared to background data, the
quantity of greatest interest is the amount by which
the site distribution exceeds the background
distribution. This difference can be viewed in the
empirical quantile-quantile plot of Figure 4.2 as the
difference between two sloped lines, the quantile-
quantile plot and the line with slope 1 where site
equals background. More resolution for examining
the differences between the site and background
distributions is obtained by subtracting each back-
ground value from its corresponding interpolated
site value, then plotting the differences versus their
corresponding background values.

An example of the quantile difference plot is shown
in Figure 4.3. In the quantile difference plot, back-
ground is represented by the horizontal axis. The
distribution of background values is shown by the
symbols plotted on this axis. A hypothetical level of
substantial contamination of S = 10 appears in this
plot as a horizontal line, not to be exceeded. In this
example, the entire quantile difference plot lies
between the background and the substantial differ-
ence level, indicating that the site exceeds back-
ground by a small amount, but does not exceed
background by more than a substantial difference.

;kground)
(C
DQ
£
5>
%
1
0

(Site minus Background) vs Background
~ Site = Background + 5
— 4- ^^1.+ +). . + _j. + _|_.+ .+

0 2 4 6 8 10 12 14 16 18 2D 22
Concentration in Background
n Background + Background + S « Q - Diff Plot
Figure 4.3 Example of a quantile difference plot

The quantile difference plot permits a quick visual
evaluation of the amount by which the site exceeds
background. In this example, the largest differences
occur in the upper half of the distribution. It is clear
that the interpolated site values do not exceed
background by more than the hypothetical S = 10
concentration units. This conclusion is not as
obvious using the sloped quantile-quantile plot in
Figure 4.2.

Similar warnings exist for use of the quantile
difference plot as for the empirical quantile-quantile
plot when there are more than twice as many site
values as background values. The empirical quan-
tile-quantile plot and the quantile difference plot
work best when the site and background data sets
are of approximately the same size, and they depend
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page 4-6
upon the choice of S.

4.3 Outliers

Outliers are measurements that are unusually larger
or smaller than the remaining data. They are not
representative of the sample population from which
they were drawn, and they distort statistics if used in
any calculations. Statistical tests based on para-
metric methods generally are more sensitive to the
existence of outliers in either the site or background
data sets than are those based on nonparametric
methods.

Outliers can lead to both Type I and Type II errors.
They can lead to inconclusive results if the results
are highly sensitive to the outliers. There are many
plausible reasons for the presence of outliers in a
data set:

> Data entry errors. Data that are extremely high
or low should be verified for data entry errors.

> Missing values and non-detects. It is important
that missing value and non-detect codes are not
read as real data. For example the number 999
might be a code for missing data, but the com-
puter program used to analyze the data, if not
properly designated, could misread this as an
extreme value of 999. This is easily remedied.

> Sampling error. In this case the sample results
for the sample that is not from the population of
interest should be deleted. However, using data
from a population other than the one of concern
is not easily recognized. Therefore, this type of
error can result in the presence of outliers in the
data set.

> Non-normal population. An outlier might also
exist when a sample is from the population of
interest, but its distribution has more extreme
values than the normal distribution. In this
situation, the sample can be retained if a
statistical approach is selected for which the
outliers do not have undue impact.
Outliers may misrepresent the sample population
from which they were taken, and any conclusion
drawn that is based on these results may be suspect.
Outliers may be true measurements of conditions on
the site, or may be due to faulty sample collection,
cross contamination, lab equipment failure, or
improper data entry. To determine which case
applies, the outliers should first be identified. If
there is a large number of outliers in the data set, it
may be necessary to reassess the area. Outliers in
the site data set have different implications from
outliers in the background data set. For example, an
onsite outlier can indicate a "hot spot," which
indicates that the one spot needs attention. An
outlier in the background data set, however, might
indicate that one of the background samples was
collected in a location that is not truly background.
In such a case, an outlier test should be used (along
with a qualitative study of where the sample in
question was collected) to see if that data point
should be discarded from the background set.
Additional guidance for handling outliers is
provided in EPA QA/G-9, Section 4.4.l

Data points that are flagged as outliers should be
eliminated from the data set if field or laboratory
records indicate that the sample location was not a
reasonable reference area, or if there was a problem
in collecting or analyzing the sample. However,
background areas are not necessarily pristine areas.
A data point should not be eliminated from the
background data set simply because it is the highest
value that was observed. The use of nonparametric
hypothesis tests for background comparisons greatly
reduces the sensitivity of test results to the presence
of outliers. Parametric tests based on the lognormal
distribution may yield results that are extremely
sensitive to the presence of one or more outliers.

Statistical outlier tests give probabilistic evidence
that an extreme value does not "fit" with the
distribution of the remainder of the data and is
therefore a statistical outlier. There are five steps
involved in treating extreme values or outliers:

1. Identify extreme values that may be potential
outliers;
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page 4-7
2. Apply statistical tests;

3. Review statistical outliers and decide on their
disposition;

4. Conduct data analyses with and without statisti-
cal outliers; and

5. Document the entire process.

More guidance on handling outliers is given in
Chapters.

4.4 Censored Data (Non-Detects)

Contamination on the site or in background areas
may be present at concentrations close to the detec-
tion limits. A sample is said to be "censored" when
certain values are unknown, although their existence
is known. Type I censoring occurs when the sample
is censored by reference to a fixed value. Non-detect
measurements are examples of Type I left censoring.
The specific value is unknown, but the existence of
a concentration value in the closed interval from 0
to the reporting limit may be inferred. The value
may be 0, or a small positive value less than the
detection limit. If other measurements collected on
the site indicate concentrations above the detection
limit, then the likelihood that at least some of the
non-detects represent small positive values is
increased. Concentration values may be censored at
their detection limits or at some arbitrary level based
on detection limits.

A detection limit is the smallest concentration of a
substance that can be distinguished from zero. Con-
sequently, non-detects may not represent the
absence of a chemical but its presence at a concen-
tration below its reliable minimum detection level.
Many parametric statistical methods require numeri-
cal values for all data points. One approach is to
impute a surrogate value for non-detects, commonly
assumed to be half the reporting limit. The use of
L / V2 has also been recommended.4 Alternatively,
a random value between the reporting limit and zero
may be chosen to represent each non-detect for the
purposes of testing assumptions concerning distribu-
tions. Both approaches may seriously affect the
estimated distribution parameters.5

If less than 15 percent of the site and background
samples are non-detects, then distributions of both
the background and the site sample may be
determined by using surrogate values. Probability
plots and goodness-of-fittests may be performed for
each data set, first including the non-detects as part
of the sample using random values for non-detects,
and second, excluding the non-detects from the
sample. If the two sets of estimated parameters
differ only slightly, then the non-detect problem is
of lesser importance. However, if the two sets of
estimates differ significantly, then the surrogate
value approach should be re-evaluated.

If more than 15 percent and less than 50 percent of
the measurements in the background sample set or
the site sample set are non-detects, the use of
specialized methods for analyzing non-detects is
recommended. Section 4.7 of EPA QA/G-91 des-
cribes in detail several methods for estimating the
mean and standard deviation of data sets with non-
detects.

If more than 50 percent of the measurements in
either the background sample set or the site sample
set are non-detects, it may not be possible to com-
pare the means of the two distributions. An alterna-
tive approach is to compare the upper percentiles of
the two distributions by comparing the proportion of
the two populations that is above a fixed level.1
Comparisons maybe made for the upper percentiles
of each distribution despite the large number of non-
detects.

Nonparametric methods may be used to avoid the
necessity of imputing surrogate values for non-
detect measurement. Nonparametric methods are
often based only on the ranks of the data, and the
non-detect values can be assigned unambiguous
ranks without the need for assigning surrogate
values. Bootstrapping and other nonparametric
methods have recently received attention.6
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page 4-8
CHAPTER NOTES

U.S. Environmental Protection Agency (EPA). 2000. Guidance for Data Quality Assessment: Practical
Methods for Data Analysis, EPA QA/G-9, QAOO Version. EPA 600-R-96-084. Quality Assurance
Management Staff, Washington, DC. Available athttp://www.epa.gov/quality/qa_docs.html.

See Chapter 2 for more information on the role of preliminary data analysis in the Data Quality
Assessment process.
See Section 2.3 for information on the use of a quantile-quantile plot, and Section 2.3.7.4 for detailed
procedures for creating a quantile-quantile plot with unequal sizes.
See Section 2.3.9.1 for guidance on preparing a posting plot.
See Section 3.3.2.1 for recommendations on dealing with high proportions of non-detects.
See Chapter 4 for examples of how to test for normality.
See Section 4.4 for guidance on outliers.
See Section 4.7 for methods of estimating mean and standard deviation of data with non-detects.
See Table A-2 for critical values.

U.S. Environmental Protection Agency (EPA). 1992. Supplemental Guidance to RAGS: Calculating the
Concentration Term, Publication 9285.7-081, Office of Solid Waste and Emergency Response,
Washington, DC.

Singh, A., A.K. Singh, and G. Flatman. 1994. "Estimation of background levels of contaminants,"
Mathematical Geology, 26:3.

See, for example, cleanup regulations in the Model Toxics Control Act, State of Washington, WAC 173-
340-708(1 l)(e).

A detailed consideration of non-detects is included in Statistical Guidance for Ecology Site Managers,
Supplement S-6, Analyzing site of background data with below-detection limit or below-PQL values
(Censored data sets), Washington State Department of Ecology, Olympia, WA, August 1993.

U.S. Environmental Protection Agency (EPA). 1997. The Lognormal Distribution in Environmental
Applications, EPA/600/R-97/006. Office of Research and Development, Environmental Sciences
Division, Las Vegas, NV.
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
CHAPTER 5

COMPARING SITE AND
BACKGROUND DATA
This chapter provides guidance on selecting quan-
titative statistical approaches for comparing site data
to background data. Statistical methods allow for
specifying (controlling) the probabilities of making
decision errors and for extrapolating from a set of
measurements to the entire site in a scientifically
valid fashion.1

Several methods are available for comparing back-
ground to site data. These can be divided into several
major categories: data ranking and plotting, descrip-
tive summaries, simple comparisons, parametric
tests, and nonparametric tests. For many of these
methods, data users first should determine whether
the data are normally distributed, using any of several
tests for normality. Data can also be assessed in terms
of the whole data set from the site, or with a focus on
outliers in the background data set or in the con-
taminant concentrations at the site (see Chapter 4).

The issue of randomness is an important element of
most statistical procedures when sample results are to
be extrapolated to the entire site or background
sampling area, rather than only representing the areas
where measurements were made. The statistical tests
discussed in this chapter assume that the data
constitute a random sample from the population. If a
sample of measurements is to represent the entire
site, every sampling point within the area represented
by the sample should have a non-zero probability of
being selected as part of the sample. If all points have
an equal opportunity for selection, the sampling
procedure will generate a simple random sample. A
random sample implies independence, loosely
meaning that the samples are also uncorrelated. If
samples are too closely spaced, then adjacent
samples may exhibit a high degree of correlation.
This lack of independence is avoided by using a grid
sampling technique.

Most procedures presented in this chapter require a
simple random sample. Stratification of the site will
usually result in differing probabilities of selection
within each stratum. A stratified sample is not a
simple random sample, and a statistician should be
consulted before conducting the analysis. In this
context, the statistician would advise on the
appropriate calculations to use for estimation and
hypothesis testing if a stratified design has been
selected.

Judgmental (or "authoritative"2) samples are those
collected in areas suspected to have higher
contaminant concentrations due to operational or
historical knowledge. Judgmental samples may
result from sampling conducted for overall site
characterization, developing exposure point concen-
trations, or sampling specifically to delineate areas
requiring remediation. Judgmental samples cannot
be extrapolated to represent the entire site. In some
cases, there is a great deal of bias associated with
the collection of judgmental samples. The statistical
hypothesis testing procedures recommended in this
chapter are based on random samples and should not
be used on judgmental samples. If judgmental samp-
ling is used on site, while background measurements
are collected randomly, direct comparison of the
means of the two data sets is not recommended.

Graphical methods, such as posting plots, may be
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page 5-2
Method
Descriptive Summary
*D Mean
*D Median
>D Standard deviation
>D Variance
*D Percentiles
Simple Comparisons
Parametric Tests
>D Student t-test
*D Behrens-Fisher Student t-test
Nonparametric Tests
*D Wilcoxon Rank Sum Test
(also called the "Mann-
Whitney Test")
*D Gehan Test
Application
Preliminary examination of data
for comparison with site history
and land use activities in the
establishment of background. Use
as a preliminary screening tool.
Used with very small data sets.
Tests require approximate
normality of the estimated means.
Use if a larger number of data
points are available (n > 25). For
smaller data sets, examine data
for normality or lognormality in
distribution.4
Use when data are not normally
distributed, as rank-ordered tests
make no assumption on
distribution.
Comments
Simple and straightforward; less
statistical rigor.
Not recommended
Statistically robust and used
frequently in parametric data
analysis.
Statistically robust and used
frequently in background
estimation.
used to display judgmental data. These displays may
reveal likely sources and pathways of contamination.
Kriging3 and other spatial smoothing algorithms may
be applied to identify areas with suspected high con-
centrations for conducting the remediation, although
the estimated mean concentrations should be recog-
nized for their upward bias.

Depending upon the data and other site-specific
considerations, statistical analysis should involve one
or a combination of the following methods:

> Parametric statistical comparison methods invol-
ving comparison of one or more parameters of
the distribution of site samples with the corres-
ponding parameter of the background distribu-
tion, such as the Student t-test; or
> Nonparametric tests, such as Wilcoxon Rank
Sum (WRS) test.

The box at the top of this page lists examples of the
statistical tests and applications recommended for
establishing background constituent concentrations.
These and other useful tests are discussed in more
detail in the following sections.

5.1 Descriptive Summary Statistics

Several statistics can be used to describe data sets.
These statistics may be used in many of the tests
described later in this chapter. There are two
important features of a data set: central tendency
and dispersion.

Estimators of central tendency include the arith-
metic mean, median, mode, and geometric mean.
The sample mean is an arithmetic average for simple
random sampling designs; however for complex
sampling designs, such as stratification, the sample
mean is a weighted arithmetic average. The sample
mean is influenced by extreme values (large or
small) and can easily be influenced by non-detects.
The sample median value is directly in the middle of
the data when the measurements are ranked in order
from smallest to largest. More simply, the median is
the middlemost value in the data set when the
number of data values is odd. When the number of
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page 5-3
data points is even, the median is usually defined as
the average of the two middlemost values. The
median is less affected by the presence of values
recorded as being below the detection limit.

The dispersion around the central tendency is
described by such items as the range, variance,
sample standard deviation, and coefficient of varia-
tion. The easiest measure of dispersion is the sample
range. For small samples, the range is easy to inter-
pret and may adequately represent the spread of the
data. For large samples, the range is not very
informative because it only considers and is greatly
influenced by extreme values. The sample variance
measures the dispersion from the mean of a data set
and is affected by extreme values and by a large
number of non-detects. The coefficient of variation
(CV) is a unitless measure that allows the compari-
son of dispersion across several sets of data. The CV
is often used instead of the standard deviation in
environmental applications because the standard
deviation is often proportional to the mean. The
standard deviation is affected by values below the
detection limit, and some method of substituting
numerical values for these should be found.4

5.2 Simple Comparison Methods

Simple comparison methods rely on descriptive sum-
mary statistics, such as comparing means or maxi-
mums. These approaches can be used with very small
data sets but are highly uncertain.

5.3 Statistical Methods for
Comparisons with Background

Many statistical tests and models are only appropriate
for data that follow a particular distribution.
Statistical tests that rely on knowledge of the form of
the population distribution for the data are known as
parametric tests, because the test is usually phrased
in terms of the parameters of the distribution assumed
for the data. Two of the most important distributions
for tests involving environmental data are the normal
distribution and the lognormal distribution. A normal
distribution has only two parameters, the mean and
variance. Lognormal distributions also have only
two parameters, but there are several common ways
to parameterize the lognormal distribution. In this
chapter, use of parametric comparison methods like
t-tests or ANOVA may require normalization of
data by conversion to a log scale.5

Tests for the distribution of the data (such as the
Shapiro-Wilk test for normality) often fail if there
are insufficient data, if the data contain multiple
populations, or if there is a high proportion of
non-detects in the sample.6 Tests for normality lack
statistical power for small sample sizes. In this
context, "small" may be defined roughly as less than
20 samples, either on site or in background areas.
Some standard tests for a particular distribution
against all alternatives, such as the Lilliefors form of
the Kolmogorov-Smirnoff test, require as many as
50 samples. Therefore, for small sample sizes or
when the distribution cannot be determined, non-
parametric tests should be used to avoid incorrectly
assuming the data are normally distributed when
there is not enough information to test this
assumption.

Statistical tests that do not assume a specific
mathematical form for the population distribution
are called distribution-free or nonparametric statisti-
cal tests. Nonparametric tests have good test perfor-
mance for a wide variety of distributions, and their
performance is not unduly affected by outliers.
Nonparametric tests can be used for normal or non-
normal data sets. If one or both of the data sets fail
to meet the test for normality, or if the data sets
appear to come from different types of distributions,
then nonparametric tests may be the only alternative
for the comparison with background. However, for
normal data with no outliers or non-detect values,
the parametric methods discussed in the next section
are somewhat more powerful. Nonparametric tests
are discussed in Section 5.3.2.

The relative performance of different testing proce-
dures may be summarized by comparing their p-
values. The p-value of a statistical test is defined as
the smallest value of a at which the null hypothesis
would be rejected for the given observations. (The
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page 5-4
p-value of the test is sometimes called the critical
level, or the significance level, of the test.)

Statistical tests may also be compared based on their
robustness. Robustness means that the test has good
performance for a wide variety of data distributions,
and that performance is not unduly affected by
outliers. In addition, nonparametric tests used to
compare population means and medians generally are
unaffected by a reasonable number of non-detect
values. There are different circumstances that should
be considered:

*• If a parametric test for comparing means is
applied to data from a non-normal population
and the sample size is large, the parametric test
will work well. The central limit theorem ensures
that parametric tests for the mean will work
because parametric tests for the mean are robust
to deviations from normal distributions as long as
the sample size is large. Unfortunately, the
answer to the question of how large is large
enough depends on the nature of the particular
distribution. Unless the population distribution is
very peculiar, you can safely choose a parametric
test for comparing means when there are at least
24 data points in each group.

*• If a nonparametric test for comparing means is
applied to data from a normal population and the
sample size is large, the nonparametric test will
work well. In this case, the p values tend to be a
little too large, but the discrepancy is small. In
other words, nonparametric tests for comparing
means are only slightly less powerful than
parametric tests with large samples.
*• If a parametric test is applied to data from a
non-normal population and the sample size is
small (for example, less than 20 data points),
the p value may be inaccurate because the
central limit theory does not apply in this case.

*• If a nonparametric test is applied to data from a
non-normal population and the sample size is
small, the p values tend to be too high. In other
words, nonparametric tests may lack statistical
power with small samples.

In conclusion, large data sets do not present any
problem. In this case the nonparametric tests are
powerful and the parametric tests are robust.
However, small data sets are challenging. In this
case the nonparametric tests are not powerful, and
the parametric tests are not robust.

5.3.1 Parametric Tests

Parametric statistical tests, examples of which are
listed in the box at the bottom of this page, assume
the data have a known distributional form. They
may also assume that the data are statistically
independent or that there are no spatial trends in the
data. Parametric statistical comparison methods, in
the context of this guidance, involve comparison of
one or more distribution parameters of site samples
with corresponding parameters of the background
distribution.

Tests for the distribution of the data offer clues on
metals detected frequently at higher concentrations.
For example, as a general rule, naturally occurring
Parametric Tests
Test
t-test
Upper Tolerance Limit (UTL)
Extreme Value (Dixon's) Test
Rosner's Test
Discordance Test
Purpose
Test for difference in means
Test for outliers
Test for one outlier
Test for up to 10 outliers
Test for one outlier
Assumptions
Normality, equal variances
Normality
Normality, not including outlier
Normality, sample size 25 or larger
Normality, not including the outlier
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page 5-5
aluminum, iron, calcium, and magnesium tend to be
normally distributed, while trace metals tend to have
lognormal distributions.

Tests of Means

The most common method for background compari-
sons involves a comparison between means using t-
tests or similar parametric methods. If the estimated
means do not differ by a statistically significant
amount (given a predetermined level of significance
such as 0.05), then there is no substantial difference
in the mean of the site data as compared to the mean
of the background data.

To conduct a t-test, a null hypothesis should first be
developed. (See Section 3.1 for developing null
hypotheses.) The t-statistic calculated from the data
is then compared to a critical value for the test which
depends on the level of confidence selected to
determine whether or not the null hypothesis should
be rejected. Although the t-test is derived based on
normality, the conclusion that the data do not follow
a normal distribution does not discount the t-test.
Generally, the t-test is robust and therefore not
sensitive to small deviations from the assumptions of
normality.

If the two populations have significantly different
variances, the two-sample t-test should not be used
for comparing means. Procedures are available to test
for equality of variance. Instructions for performing
Bartlett's test and Levene's test are presented in EPA
QA/G-9, Section 4.5.2

Any t-test should be discussed with a statistician
prior to use since there are a number of variations
and assumptions that can apply. The Student t-test
has good application when comparing background
sites to potentially contaminated sites.7

Methods such as Cochran's Approximation to the
Behrens-Fischer Student t-test may be useful when
replicated measurements are available. This statisti-
cal comparison method requires that two or more
discrete samples be taken at each sampling station.
Note that the choice of a specific t-test depends on
site-specific information and other statistical con-
siderations.

Tests of Outliers

There are many parametric tests for outliers, based
on deviations from the normal distribution. Three of
these tests are explained in detail in EPA QA/G-9,2
including Dixon's test, Rosner's test, and the
Discordance test shown in the box on the previous
page. In addition to these tests, suspected outliers
may be identified using a tolerance limit approach.
There are parametric and nonparametric forms of
tolerance intervals. This section discusses the
parametric version.8 A nonparametric version of
tolerance intervals is presented in Section 5.3.2.

While mean tests explore whether the true means of
two populations are significantly different, other
tests can be used to indicate whether a single sample
is likely to be an outlier in the data set. This type of
test can be useful in identifying a "hot spot" that
may exceed background, even if the average site
concentration does not seem to be different from
background. One such test is the tolerance interval.
A thorough discussion of normal, Poisson, and non-
parametric tolerance limits can be found in Chapter
4 of Gibbons.9

A tolerance limit (TL) is a confidence limit on a
percentile of the data, rather than a confidence limit
on the mean. Tolerance limits provide an interval
within which at least a certain proportion of the
population lies with a specified probability that the
stated interval does indeed "contain" that proportion
of the population. An example would be a situation
in which you are trying to draw a random sample,
and want to know how large the sample size should
be so that you can be 95 percent sure that at least 95
percent of the population lies between the smallest
and the largest observation in the sample. Similarly,
one-sided TLs can be developed. Establishing a TL
is recommended for identifying outliers.

For example, a 95 percent one-sided TL for 95
percent coverage represents the value below which
95 percent of the population are expected to fall
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page 5-6
(with 95 percent confidence).

5.3.2 Nonparametric Tests

The statistical tests discussed in the previous section
rely on the mathematical properties of the population
distribution (normal or lognormal) selected for the
comparison with background. When the data do not
follow the assumed distribution, use of parametric
statistical tests may lead to inaccurate comparisons.
Additionally, if the data sets contain outliers or non-
detect values, an additional level of uncertainty is
faced when conducting parametric tests. Since most
environmental data sets do contain outliers and non-
detect values, it is unlikely that the current
widespread use of parametric tests is justified, given
that these tests may be adversely affected by outliers
and by assumptions made for handling non-detect
values.
Nonparametric Tests
Test Assumptions
Wilcoxon Both samples are randomly selected
Rank Sum from respective populations and
(WRS) mutually independent; distributions
are identical (except for possible
difference in location parameter).
Gehan Multiple detection limits and non-
Test detect.
Quantile Populations are identical except for
Test differences above a given percentile

Tests that do not assume a specific mathematical
form for the underlying distribution are called
distribution-free or nonparametric statistical tests.
The property of robustness is the main advantage of
nonparametric statistical tests. Nonparametric tests
have good test performance for a wide variety of
distributions, and that performance is not unduly
affected by outliers.

Nonparametric tests can be used for normal or non-
normal data sets. If one or both of the data sets fail to
meet the test for normality, or if the data sets appear
to come from different types of populations, then
nonparametric tests may be the only alternative for
the comparison with background. If the two data
sets appear to be from the same family of distribu-
tions, use of a specific statistical test that is based on
this knowledge is not necessarily required because
the nonparametric tests will perform almost as well.
However, for normal data with no outliers or non-
detect values, the parametric methods discussed in
the previous section are somewhat more powerful.

Several nonparametric test procedures, including
three listed in the box at left, are available for
conducting background comparisons. Nonpara-
metric tests compare the shape and location of the
two distributions instead of a statistical parameter
(such as mean). Nonparametric tests are currently
used by some EPA regions on a case-by-case basis.
These methods have varying levels of sensitivity
and data requirements and should be considered as
the preferred methods whenever data are heavily
censored (a high percentage of non-detect values).

Wilcoxon Rank Sum Test for Background
Comparisons

The Wilcoxon Rank Sum (WRS)10 test is an
example of a nonparametric test used for determin-
ing whether a difference exists between site and
background population distributions. The WRS tests
whether measurements from one population consis-
tently tend to be larger (or smaller) than those from
the other population. This test determines which
distribution is higher by comparing the relative
ranks of the two data sets when the data from both
sources are sorted into a single list. One assumes
that any difference between the background and site
concentration distributions is due to a shift in the
site concentrations to higher values (due to the
presence of contamination in addition to back-
ground).

Two assumptions underlying this test are: 1) sam-
ples from the background and site are independent,
identically distributed random samples, and 2) each
measurement is independent of every other measure-
ment, regardless of the set of samples from which it
came. The test assumes also that the distributions of
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page 5-7
the two populations are identical in shape (variance),
although the distributions need not be symmetric.

The WRS test has three advantages for background
comparisons:

> The two data sets are not required to be from a
known type of distribution. The WRS test does
not assume that the data are normally or log-
normally distributed, although a normal distribu-
tion approximation often is used to determine the
critical value for the test for large sample sizes.

> It allows for non-detect measurements to be
present in both data sets (see box below).11 The
WRS test can handle a moderate number of non-
detect values in either or both data sets by
treating them as ties.12 Theoretically, the WRS
test can be used with up to 40 percent or more
non-detect measurements in either the back-
ground or the site data. If more than 40 percent
of the data from either the background or site are
non-detect values, the WRS test should not be
used.13

> It is robust with respect to outliers because the
analysis is conducted in terms of ranks of the
data. This limits the influence of outliers because
a given data point can be no more extreme than
the first or last rank.

Procedures for Non-Detect Values in WRS Test

If there are t non-detect values, they are consider-
ed as "ties" and are assigned the average rank for
this group. Their average rank is the average of the
first t integers, (t+l)/2. If more than one detection
limit was in use, all observations below the largest
detection limit should be treated as non-detect
values. Alternatively, the Gehan test may be
performed.14

The WRS test may be applied to either null hypothe-
sis in the two forms of background test discussed in
Chapter 3: no statistically significant difference or
exceed by more than a substantial difference. In
either form of background test, the null hypothesis
is assumed to be true unless the evidence in the data
indicates that it should be rejected in favor of the
alternative.

WRS Test Procedure for Background Test Form 1

Null Hypothesis (H0): The mean of the site distribu-
tion is less than or equal to the mean of the
background (A < 0).

Alternative Hypothesis (H^): The mean of the site
distribution exceeds the mean of the background
distribution (A > 0).

The WRS test for Background Test Form 1 is
applied as outlined in the following steps. The lead-
contaminated storage yard example from Chapter 3
serves to illustrate the procedure. (Although EPA
selected Background Test Form 2 in this example,
both forms of the test are evaluated.)

Hypothetical data for the storage yard example is
shown in Tables 5.1 and 5.2 for the onsite and
background areas, respectively. There is one non-
detect measurement (ND) in the data collected on
site and five in the background data set. The
background non-detects were treated as 0 values
when adding S to the background measurements.
This is a more conservative approach than using !/>
the detection limit or other surrogate or random
numbers for the non-detect values.15 The WRS test
is very robust to this small modification as it is
unlikely that any reasonable surrogate value will
affect significantly the assigned rank of the non-
detects in the combined data set.

Table 5.3 demonstrates the WRS test procedure for
Background Test Form 1, testing the null hypothesis
that there is no statistically significant difference
between the site and background distributions. The
background measurements (m = 20) and the site
measurements (n = 20) are ranked in a single list in
order of increasing size from 1 to N, where N = m +
n = 40. At the top of the list, all six non-detect
values are considered as ties and are assigned an
average rank of 3.5 = (6 + 1) + 2. (See the box
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page 5-8
Data
(mg/kg)
ND
34.0
39.5
48.6
54.9
70.9
72.1
81.3
83.2
86.2
88.2
96.1
98.3
104.3
105.6
129.0
139.3
156.9
167.9
208.4
Source

Site
Site
Site
Site
Site
Site
Site
Site
Site
Site
Site
Site
Site
Site
Site
Site
Site
Site
Site
Site
Table 5.1 Site data
Data
(mg/kg)
ND
ND
ND
ND
ND
0.1
15.7
46.1
48.1
49.3
53.5
58.0
59.7
68.0
88.5
96.5
115.8
122.9
126.8
147.5
Source
Background
Background
Background
Background
Background
Background
Background
Background
Background
Background
Background
Background
Background
Background
Background
Background
Background
Background
Background
Background
Ranks for
Rank

3.5
3.5
3.5
3.5
3.5
3.5
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
820

Data
(mg/kg)
ND
ND
ND
ND
ND
ND
0.1
15.7
34.0
39.5
46.1
48.1
48.6
49.3
53.5
54.9
58.0
59.7
68.0
70.9
72.1
81.3
83.2
86.2
88.2
88.5
96.1
96.5
98.3
104.3
105.6
115.8
122.9
126.8
129.0
139.3
147.5
156.9
167.9
208.4

Source

Site
Background
Background
Background
Background
Background
Background
Background
Site
Site
Background
Background
Site
Background
Background
Site
Background
Background
Background
Site
Site
Site
Site
Site
Site
Background
Site
Background
Site
Site
Site
Background
Background
Background
Site
Site
Background
Site
Site
Site
Sum of Ranks

Site

3.5

9
10

20
21
22
23
24
25

29
30
31

35
36

38
39
40
491.5
ws
Background

3.5
3.5
3.5
3.5
3.5
7
8

11
12

14
15

17
18
19

32
33
34

328.5
wb
Table 5.2 Background data
Table 5.3 WRS test for Test Form 1
H0: site < background
entitled, "Procedures for Non-Detect Values in
WRS Test," on the previous page). The ranks for
each area are shown in the two columns at the right
of the exhibit. The sum of the ranks of the site measure
ments (Ws = 491.5) and the sum of the ranks of the
background measurements (Wb = 328.5) are shown
at the bottom of the exhibit.16 The sum of the ranks
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page 5-9
of the site measurements (Ws = 491.5) is the test
statistic used for Background Test Form 1. The sum
of the site ranks is used as the test statistic for
background test from 1 because EPA is looking for
evidence that the site distribution exceeds the back-
ground distribution. If Ws is greater than the tabula-
ted critical value for the test, the null hypothesis that
there is no significant difference will be rejected.

Most readily available tables for the WRS test only
extend up to sample sizes of n = m = 20. Critical
values for the WRS test when n and m exceed 20
may be calculated from the large sample approxima-
tion using this equation:
Wcnt = m(N+l)/2 + za[ nm(N
where N = n + m and za is the 100(1 - a)* percentile
of the standard normal distribution. The first term is
the expected value of the sum of ranks W, calculated
under the assumption that the null hypothesis is true .
The second term is a standard normal variate times
the standard deviation of W, under the same assump-
tions. The first factor in the expectation term m
represents the number of ranks that were summed,
each having expectation (N+l)/2 under the equality
assumption included in the null hypothesis.

Table 5.4 shows the critical values for the WRS test
for selected values of a for data sets with n = m = 20.
The critical value for a = 0. 10 is 458, and the critical
value for a = 0.05 is 471. Since Ws exceeds the
critical values for most commonly used values of a,
the null hypothesis is rejected. Hence, the site is
distinguishable from background at a confidence
a Critical Value
0.20
0.15
0.10
0.05
0.025
0.010
0.005
0.001
442
449
458
471
482
495
504
521
Table 5.4 Critical Values for the WRS Test
for n = m = 20
level of 95 percent. Note that the null hypothesis
would not be rejected at a = 0.01.

WRS Test Procedure for Background Test Form 2

Null Hypothesis (H0): The site distribution exceeds
the background distribution by more than a
substantial difference S (A > S).

Alternative Hypothesis (H^): The site distribution
does not exceed the background distribution by
more than S (A < S).

The WRS test for Background Test Form 2 is
applied as outlined in the following steps. The lead
example will again serve as an illustration of the
procedure. In the example from Chapter 3, EPA
chose to use Background Test Form 2, with a = 0.10
and a substantial difference of S = 100 mg/kg. First,
the background measurements are adjusted by
adding S = 100 mg/kg to each measured value.
Table 5.5 contains two columns on the right that
show the S-adjusted background data for S = 50
mg/kg and S = 100 mg/kg.

The adjusted background measurements and the
measurements from the site in Table 5.5 are ranked
in increasing order from 1 to 40. Note that the five
adjusted background measurements that were non-
detects are tied at 100 mg/kg. They are all assigned
the average rank of 16 for that group of tied
measurements.

The sum of the ranks of the adjusted measurements
from background, Wb = 544, is the test statistic for
Background Test Form 2. Note that the test statistic
for Background Test Form 2 differs from the test
statistic for Background Test Form 1. In this case,
we are looking for evidence that S plus the back-
ground distribution is greater than the site distribu-
tion. Earlier, in Background Test Form 1, we were
looking for evidence that the site distribution
exceeds the (unmodified) background distribution.
The critical value for the WRS test (Table 5.4) for
a = 0.10 is 458. Since Wb is greater than the critical
value, the null hypothesis that the site exceeds
background by more than a substantial difference of
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page 5-10
Ranks for
Rank

1
2
3
4
5
6
7
8
9
10
11
12
13
16
16
16
16
16
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
820

Data
(mg/kg)
ND
34.0
39.5
48.6
54.9
70.9
72.1
81.3
83.2
86.2
88.2
96.1
98.3
100.0
100.0
100.0
100.0
100.0
100.1
104.3
105.6
115.7
129.0
139.3
146.1
148.1
149.3
153.5
156.9
158.0
159.7
167.9
168.0
188.5
196.5
208.4
215.8
222.9
226.8
247.5

Source

Site
Site
Site
Site
Site
Site
Site
Site
Site
Site
Site
Site
Site
Background+S
Background+S
Background+S
Background+S
Background+S
Background+S
Site
Site
Background+S
Site
Site
Background+S
Background+S
Background+S
Background+S
Site
Background+S
Background+S
Site
Background+S
Background+S
Background+S
Site
Background+S
Background+S
Background+S
Background+S
Sum of Ranks

Site

1
2
3
4
5
6
7
8
9
10
11
12
13

20
21

23
24

276
ws
Background
+ 100

16
16
16
16
16
19

25
26
27
28

30
31

33
34
35

37
38
39
40
544
wb
Ranks for
Rank

1
2
3
4
7
7
7
7
7
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
820

Data
(mg/kg)
ND
34.0
39.5
48.6
50.0
50.0
50.0
50.0
50.0
50.1
54.9
65.7
70.9
72.1
81.3
83.2
86.2
88.2
96.1
96.1
98.1
98.3
99.3
103.5
104.3
105.6
108.0
109.7
118.0
129.0
138.5
139.3
146.5
156.9
165.8
167.9
172.9
176.8
197.5
208.4

Source

Site
Site
Site
Site
Background+S
Background+S
Background+S
Background+S
Background+S
Background+S
Site
Background+S
Site
Site
Site
Site
Site
Site
Background+S
Site
Background+S
Site
Background+S
Background+S
Site
Site
Background+S
Background+S
Background+S
Site
Background+S
Site
Background+S
Site
Background+S
Site
Background+S
Background+S
Background+S
Site
Sum of Ranks

Site

1
2
3
4

13
14
15
16
17
18

25
26

40
379
Ws
Background
+ 50

7
7
7
7
7
10

23
24

27
28
29

37
38
39

441
wb
Table 5.5 WRS test for Test Form 2
H0: site > background + 100

100 mg/kg is rejected at the 90 percent confidence
level.

Table 5.6 shows the WRS test for the lead example
using Background Test Form 2 with a smaller (more
conservative) value for a substantial difference, S =
Table 5.6 WRS test for Test Form 2
H0: site > background + 50

50 mg/kg. The sum of the ranks of the S-adjusted
background measurements is Wb = 441. After
examination of these data, it is clear that the null
hypothesis that the site exceeds background by more
than 50 mg/kg cannot be rejected at any reasonable
level of confidence.
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page 5-11
In conclusion, site concentrations in this example are
significantly higherthan background concentrations.
The site distribution may exceed background by 50
mg/kg or more, but it is unlikely that the site distribu-
tion is more than 100 mg/kg above background.

Power of the WRS Test

The exact power of the WRS test is difficult to calcu-
late. An approximation by Lehmann17 is based on the
Mann-Whitney form of the WRS test statistic.18 The
Mann-Whitney test statistic is equal to the Wilcoxon
rank sum statistic minus its smallest possible value.
Thus the Mann-Whitney test statistic is

WMW = Ws-n(n+l)/2

for Background Test Form 1, and
W
MW
= Wb-m(m+l)/2
for Background Test Form 2. In each case, the power
of the WRS test is calculated using the mean and
variance of the approximating normal distribution for
the corresponding Mann-Whitney test statistic.
Lehmann describes the method for approximating the
power of the WRS test used in Background Test
Form 1. The U.S. Nuclear Regulatory Commission
(NRC)19 offers a detailed application of the normal
approximation for the power of Background Test
Form 2 using a different notation for the gray region
than is used in this guidance. At the upper end of the
gray region, EPA's substantial difference (S) is
called by NRC the "Design Concentration Guideline
Level" (DCGL). The width of the gray region (EPA's
MOD) is defined by NRC as A = DCGL - LBGR,
where "LBGR" is the lower bound of the gray region.
The NRC document also implements Lehmann's
power approximation for Background Test Form 1,
obtained by letting LBGR = O.20 The NRC document
also contains tables for use in evaluating the mean
and variance of W^ and W*^ in Lehmann's
approximation for the power of the WRS test. Due to
the differences in notation noted above, the NRC
tables are tabulated in terms of the design parameter
A/a, which corresponds with MDD/o in this
guidance.
Gehan's Form of the WRS Test

The Gehan test is a generalized version of the WRS
test.21 If there are a large number of non-detect
measurements and several different detection levels,
Gehan's form of the WRS test is a more powerful
test for the background comparison. The Gehan test
addresses multiple detection limits using a modified
ranking procedure rather than relying on the "all ties
get the same rank" approach used in the WRS test.15
After the modified ranking is completed, the
standard WRS test procedure discussed above is
applied to determine if the null hypothesis should be
rejected. It has been recommended that there should
be at least 10 data values in each data set to use this
test.

Quantile Test

In many instances, releases of chemicals have
impacted only portions of the site. Under such
conditions, chemical concentrations in relatively
small areas at the site could be elevated relative to
the underlying background concentrations. As a
result, only a small portion in the upper tail of the
distribution of site measurements would be expected
to be shifted to higher concentrations than the distri-
bution of background measurements. The quantile
test is a nonparametric test that is designed to com-
pare the upper tails of the distributions. The quantile
test may detect differences that are not detected by
the WRS test. The quantile test is described in detail
in Chapter 7 of Gilbert and Simpson.1

Walsh's Tests for Outliers

Walsh's test is a nonparametric test for determining
the presence of outliers in either the background or
onsite data sets. This test was developed to detect up
to a specified number of outliers, r. The test requires
large sample sizes (n > 60 for aS 0.10; and n > 220
for aS 0.05). Procedures for conducting this test is
discussed in Section 4.4 of EPA QA/G-9.2

Nonparametric Tolerance Intervals

The parametric tolerance intervals presented in
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page 5-12
Section 5.2 are derived based on the assumption of a
normal distribution. If the data are not normal and are
not easily transformed to normality, then non-
parametric tolerance intervals may be calculated for
the background distribution to provide a tolerance
level for screening site data. A readable discussion of
the use of nonparametric tolerance intervals is
provided by Glick.22

5.4 Hypothesis Testing

Hypothesis testing was discussed in detail in Section
3. Here, some of this information is reviewed, and
additional aspects of such testing are discussed. The
emphasis is on classical methods for testing hypothe-
ses, including parametric and nonparametric
methods. The Bayesian approach is an alternative to
classical methods for hypothesis testing, but is not
included in the discussion. The Bayesian approach
for comparing the means of two populations is
discussed by many authors, including Box and Tiao ,23
Bayesian methods permit the incorporation of prior
knowledge and provide the ability to update informa-
tion as results come in from successive rounds of
sampling.

5.4.1 Initial Considerations

For CERCLA sites, use of a null hypothesis and
alternative hypothesis is recommended when compar-
ing data sets from potentially impacted areas with
background data. For example, a null hypothesis
could be "there is no difference between the mean
contaminant concentration in samples from poten-
tially impacted areas and background data sets." The
alternative hypothesis would be "there is a difference
between mean contaminant concentration in samples
from potentially impacted areas and background data
sets." To conduct the comparison, parametric or non-
parametric statistical tests are recommended. Use of
parametric comparison methods like t-tests or
AN OVA may require normalization of data, such as
the conversion to a log scale. Depending upon the
data and other site-specific considerations, statistical
analysis should involve one or a combination of the
following methods:
*• A preliminary descriptive analysis involving the
comparison of median, mean, and upper range
concentrations between sample sets considered
site-related and background;

> Parametric statistical comparison methods in-
volving the comparison of one or more para-
meters of the distribution of site samples with
corresponding parameters of the (assumed or
sampled) background distribution, such as
Gosset's Student t-test or Cochran's Approxi-
mation to the Behrens-Fischer Student t-test; or

> Nonparametric tests, such as the Wilcoxon
Rank Sum test (on a case-by-case basis).

Once a test has been selected, the assessor should
consider several questions:

* What should the null and alternative hypotheses
be? What are we testing? What are we trying to
support or rej ect about the site and background?

*• Should the test be one-tailed or two-tailed?
Should we ask whether the site and background
are from the same population, or should we
focus on whether one is more contaminated than
the other?

*• What confidence level should be used? At what
"cut-off point do we accept or reject the
hypothesis?

5.4.2 Examples

It may be easiest to explore these questions by using
an example. Suppose we have an area that meets our
criteria for local background. The data from this
area for Chemical X (mg/kg) are as follows:

66 67 68 68 69 69 69
70 70 70 71 71 71 72
72 72 72 73 74 74 75

These data were collected randomly and are normal-
ly distributed. There are 21 measurements (n = 21),
with an average of 70.6 mg/kg and a standard
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page 5-13
deviation of 2.37 mg/kg.

We also have data from an onsite process area. These
data for Chemical X (mg/kg) are as follows:

62 63 64 65 66 68 68
69 69 70 71 71 72 72
72 73 74 75 77 78 80

These data were collected randomly and are normally
distributed. There are 21 measurements (n = 21),
with an average of 70.4 mg/kg and a standard
deviation of 4.86 mg/kg.

The background and onsite areas appear to be
similar, but some of the onsite data exceed the
background data. We would like to be able to state
with a given level of confidence whether the data are
essentially from the same population, or not. If we
use the t test to compare the true means of these data
sets, we could test the hypothesis that the background
mean and the site mean are essentially equal (H0, the
null hypothesis). If H0 is not true, then we would
support the alternative hypothesis that the means are
not equal. This is a two-tailed test, because H0 could
be rejected if the site mean is greater than the
background mean or if the site mean is less than the
background mean.
Example 1:
HO: Ms
H
A
Mb
Mb
(Note that this is a two-tailed version of Test Form
1.) Using the equations in EPA QA/G-9, for t, we
find thatt = 0.1693.2 At 40 degrees of freedom, for a
two-tailed test, our t falls below the t of 0.681, where
a = O.5.24 Therefore, if we had chosen an a of 0.01
(99 percent confidence), 0.05 (95 percent confi-
dence), or 0.1 (90 percent confidence), we would not
reject our null hypothesis. Only if we were testing at
less than 50 percent confidence would we reject HQ.

When using Test Form 1, the higher the confidence
limit, the more likely this test is to find that the site is
from the same population as background. Choosing
the rejection range for the hypothesis involves balan-
cing both kinds of error. In general, EPA recom-
mends a minimum confidence limit of 80 percent
and a maximum confidence limit of 95 percent.

Suppose we want to compare our background data
set with another onsite process area. The data for
Chemical X (mg/kg) are as follows:

56 58 60 62 66 67 68
70 72 73 75 76 81 82
84 85 87 90 91 92 103

These data were collected randomly and are
normally distributed. There are 21 measurements (n
= 21), with an average of 76.1 mg/kg and a standard
deviation of 12.68 mg/kg.

Is this area significantly different from background?
The arithmetic mean is 76.1 mg/kg, compared to the
background mean of 70.6 mg/kg. But is this differ-
ence truly significant? After all, the mean of the first
process area, 70.4 mg/kg, was different from the
background mean. According to the t test, however,
we did not find the difference of 0.2 mg/kg to be
significant at the 80-99 percent confidence levels.
What about the second process area?

Suppose we decide that we are really interested in
whether the site is "dirty" (above background).
Instead of a 2-tailed test, we could perform a 1-
tailed test:
Example 2:
HO: Ms > Mb
HA: Ms < Mb
(Note that this is Test Form 2 with S = 0.) This test
is 1-tailed because the rejection region is only on
one side of the distribution; that is, we are only
interested in whether the site is greater than the
background.

To use the normal distribution theory correctly, for
a 1-tailed t test, with 40 degrees of freedom, the t of
-1.95 is calculated for the background mean minus
the site mean. This t falls between the 95 percent
and 97.5 percent confidence levels. If we were
testing at 80 percent or 95 percent confidence, we
would reject HO and find that the site is less than or
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page 5-14
equal to background—in other words, "clean." At 99
percent confidence, HO could not be rejected. In this
case, therefore, a lower confidence limit seems to
increase the chances of finding that the site is clean,
where in our earlier tests we found that a lower con-
fidence limit decreased the chances of considering
the site clean. Why is this?

The difference is in the setup of the hypotheses. In
the first case (example 1), the null hypothesis was
that the site and background were from the same
population (the site was clean). In the later case, the
null hypothesis was that the site mean exceeded the
background mean (the site is dirty). In essence, we
have shifted the burden of proof. If we are really
interested in whether the site is dirty (greater than
background), then our last test could have looked at
these hypotheses:
Example 3:
HO: Ms < Mb
HA: Ms > Mb
(Note that this is a one-tailed version of Test Form
1.) Using the site mean minus the background mean
for this test, we derive a t of 1.95. At the 80 percent
confidence level, we would rej ect H0 and find that the
site is dirty. At the 95 percent confidence level and
above, we would accept H0 and find that the site is
clean because the data are insufficient to support this
higher level of confidence demanded for rejection.
Once again, with Test Form 1, a lower confidence
level results in a more conservative approach to
environmental protection.

There is another problem, besides burden-of-proof,
with Example 2. As discussed in Chapter 3, the null
hypothesis that there is a substantial difference (Test
Form 2, A > 0) should only be tested if some minimal
difference (S) is specified. This is because the null
hypothesis H0: A > 0 (i.e., H0 (is > (ib) will be rejected
only if the site mean is significantly below the back-
ground mean. In a more typical case, the site mean
may be almost equal to or slightly below the back-
ground mean, and the null hypothesis will only be
rejected when a large number of samples is collected
to reduce the uncertainty to below the magnitude of
the difference in means.
Essentially, Test Form 1 uses the default assumption
that the site is "clean" unless it can be shown other-
wise; Test Form 2 uses the default assumption that
the site is "dirty" unless it can be shown otherwise.

5.4.3 Conclusions

Now we return to our original three questions. Table
5.7 also summarizes this information.

*• What should the null and alternative hypotheses
be?
*• Should the test be one-tailed or two-tailed?
*• What confidence level should be used?

To determine whether the site and background are
from the same population, these hypotheses can be
used in a two-tailed Test Form 1:

HO: Ms = Mb
HA: Ms * Mb

For this test, the confidence level should be at least
80 percent but no more than 95 percent. For a more
conservative test, use the lower end of the confi-
dence range.

To determine whether the site is significantly
greater than background, these hypotheses can be
used in a one-tailed Test Form 1:

HO: Ms ^Mb
HA: Ms > Mb

For this test, the confidence level should be at least
80 percent; for a more conservative test, use the
lower end of the confidence range and require
adequate power.

If testing the hypotheses in reverse—Test Form 2
—to show whether the site is greater than back-
ground + S, use a higher confidence level, such as
95 percent, and specify a substantial difference S.
(See Appendix A for guidance on choosing S.) To
determine whether the site exceeds background by
more than S, these hypotheses can be used in a one-
tailed Test Form 2:
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
H
Page 5-15
For this test, the confidence level should be at least
80 percent; for a more conservative test, use higher
levels of the confidence range.
What to test:
H0: site and background are from
the same population; vs.
HA: site and background are from
different populations
(Two-tailed, Test Form 1)
H0: site is less than or from the same
population as background; vs.
HA: site is greater than background
(One-tailed, Test Form 1)
H0: site is greater than background +
S; vs.
HA: site is less than or from the
same population as background + S
(One-tailed, Test Form 2)
HO
M-s=M-b
^ "b
ns > nb+s
HA
Us* "b
M-s>M-b
ns < nb+s
Recommended alpha
80-95% confidence (a = 0.2
to 0.05)
[More conservative: a = 0.2]
80-95% confidence (a = 0.2
to 0.05)
[More conservative: a = 0.2]
80-95% confidence (a = 0.2
to 0.05)
[More conservative: a =
0.05]
Rejection criteria
For 2-sided t test, e.g.,
reject H0 if t > t^ or if t
<-ta/2
For 1-sided t test, e.g.,
reject H0 if |t|>ta
For 1-sided t test, e.g.,
reject H0 if t>ta*
For 1-sided t test, e.g.,
reject H0 if |t|>ta
For 1-sided t test, e.g.,
reject H0 if t<-ta*
* Assuming the test statistic, t, is calculated using site mean minus background mean (or background mean + S, for
Test Form 2) in the numerator.
Table 5.7 Summary of hypothesis tests
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page 5-16
CHAPTER NOTES

1. Gilbert, R.O. & J.C. Simpson. June 1994. Statistical Methods for Evaluating the Attainment of Cleanup
Standards, Volume 3. EPA 230-R-94-004.

2. U.S. Environmental Protection Agency (EPA). July 2000. Guidance for Data Quality Assessment:
Practical Methods for Data Analysis, EPA QA/G-9, QAOO Version. Quality Assurance Management
Staff, Washington, DC, EPA 600-R-96-084. Available at http://www.epa.gov/quality/qa_docs.html.
*• See Section 1.3.1 for guidance on "authoritative samples."
*• See Section 3.3.1.1 for guidance on how to calculate t.

3. Cressie, N. 1991. Statistics for Spatial Data, New York: John Wiley & Sons. See Section 3.2.

4. A variety of methods for addressing non-detects are presented in Section 4.7 of EPA QA/G-9, op. cit.
Simulation results using LA/2 are reported by Hornung, R.W. and Reed, L.D. "Estimation of average
concentration in the presence of nondetectable values," Applied Occupational and Environmental
Hygiene, 5(1), p.45-51, January, 1990.

5. Although the use of t-tests after logarithmic transformation to approximate normality has a long history,
some authors have recommended against using t-tests on log transformed data. See A. K. Singh, A.
Singh, and M. Engelhardt, The Lognormal Distribution in Environmental Applications, EPA Technology
Support Center Issue, U.S. EPA Office of Research and Development, National Exposure Research
Laboratory, Las Vegas, NV, EPA/600/R-97/006, December 1997. The authors note that the H-statistic
based on the Upper Confidence Limit for the mean of a lognormal has erratic performance with sample
sizes smaller than 30.

6. When using parametric statistical tests, a limit of 15 percent non-detect measurements in either data set
is suggested in the Navy's Procedural Guidance for Statistically Analyzing Environmental Background
Data. Nonparametric statistical methods are recommended if this limit is exceeded.

7. Michigan Department of Environmental Quality Waste Management Division. April 1994. Guidance
Document: Verification of Soil Remediation. Revision 1. http://www.deq.state.mi.us/wmd/docs/vsr.html.

8. Devore, J.L. 2000. Probability andStatistics forEngineering andthe Sciences, 5th Ed., Duxbury Press,
Pacific Grove, California.

9. Gibbons, R.D. 1994. Statistical Methods for Groundwater Monitoring, John Wiley & Sons, Inc.

10. The WRS test is also called the Mann-Whitney test, which is mathematically equivalent to the WRS test.
Sometimes, the combined name is used: Wilcoxon-Mann-Whitney test.

11. In general, the use of "non-detect" values in data reporting is not recommended. Wherever possible, the
actual result of a measurement, together with its uncertainty, should be reported. Estimated concen-
trations should be reported for data below the detection limit, even if these estimates are negative,
because their relative magnitude compared to the rest of the data is of importance.

Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page 5-17
12. The Gehan test discussed in the next section should be considered if there are many non-detect values
with different detection levels.

13. A limit of 50 percent non-detect values is suggested in the Navy's Procedural Guidance for Statistically
Analyzing Environmental Background Data. A more conservative limit of 40 percent non-detect values
is recommended in the Multi-Agency Radiation Survey and Site Investigation Manual (MARSSIM).

14. As a third alternative, Morris DeGroot (1986) recommends that in the case of ties, the WRS could "be
carried out twice. In the first test, the smaller ranks in each group of tied observations should be assigned
tot he x's and the larger ranks should be assigned to thej's. In the second test, these assignments should
be reversed. If the decision to accept or rej ect H0 is different for the two assignments, or if the calculated
tail areas are very different, the data must be regarded as inconclusive." Probability and Statistics, 2nd
Edition. Addison-Wesley, Reading, MA, pp. 517 and 584.

15. Additional information on the treatment of non-detects is given by Newman, et al, (1989) "Estimating
the mean and variance for environmental samples with below detection limit observations," Water
Resources Bulletin 25(4): 905-916.

16. Critical values for the WRS test are available in many published texts and reference books. Two sources
are W.J. Conover (1980) PracticalNonparametric Statistics, 2ndEd., John Wiley & Sons, New York;
and Zwillinger, D. and S. Kokoska (2000) CRC Standard Probability and Statistics Tables and
Formulae, Chapman and Hall/CRC Press, Boca Raton, Florida.

17. Lehmann, E.L. and H.J.M. D'Abrera. 1998. Nonparametrics: Statistical Methods Based on Ranks,
revised 1st Ed., Prentice Hall, New Jersey. Section 2.3.

18. Many tables of critical values for the WRS test, including Lehmann's, are expressed in terms of the
Mann-Whitney form of the test statistic. Care should be exercised in selecting an appropriate table when
results of the WRS test are evaluated.

19. Gogolak, C.V., G.E. Powers, and A.M. Huffert.^4 Nonparametric Statistical Methodology for the Design
and Analysis of Final Status Decommissioning Surveys, Office of Nuclear Regulatory Research, U.S.
Nuclear Regulatory Commission (NRC),NUREG-1505. June 1998 (Rev. 1). See "Scenario A" in Section
10.4.

20. See "Scenario B" in Section 10.5 of Gogolak et al., op cit. This scenario, with LBGR > 0, introduces a
new set of WRS tests not discussed in this guidance. These tests have null hypotheses similar to the null
hypothesis used in Background Test Form 1, but with a less stringent definition of "clean" (H0: A <
LBGR). The test statistic for rejecting these null hypotheses is obtained by ranking in one column the
site measurements, which are first adjusted by subtracting the LBGR from each measurement, together
with the (unadjusted) background measurements. The WRS test statistic is defined as the sum of the
ranks of the adjusted site measurements.

21. Detailed instructions for conducting the Gehan test are found in Handbook for Statistical Analysis of
Environmental Background Data, Naval Facilities Engineering Command, SWDFV and EFA WEST,
July, 1999, Section 3.7. Also see Appendix C.I of the draft Supplemental Guidance to RAGS: Region
4 Bulletins - Addition #1, Statistical Tests for Background Comparison at Hazardous Waste Sites, U.S.

Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page 5-18
EPA, Waste Management Division, Region 4, Office of Technical Services, November, 1998. An earlier
use is found in Millard, W.P., and S.J. Deverel. 1988. "Non-Parametric Statistical Methods for
Comparing Two Sites Based on Data with Multiple Non-Detect Limits." Water Resources Research,
24:12, pp. 2087-2098.

22. Glick,N. "Breaking records and breaking boards", American Mathematical Monthly, Vol. 85,No. l,pp.
2-26, 1978.

23. Bayesian Inference in Statistical Analysis, G.E.P. Box and G.C. Tiao, Addison-Wesley, Reading, MA,
1973.

24. In this context, degrees of freedom (n - 1) is the number of independent observations ("n") minus the
number of independent parameters estimated in computing the variation. The shape of the t-distribution
curve depends upon the number of degrees of freedom. Distributions with fewer degrees of freedom have
heavier tails.
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
APPENDIX A

SUPPLEMENTAL INFORMATION
FOR DETERMINING "SUBSTANTIAL
DIFFERENCE"
"Substantial difference" (S) is the difference in
mean concentration in potentially impacted areas
over background levels that presents a "substantial
risk."

In situations where regulatory requirements indi-
cate that contamination at or below background
concentrations presents an unacceptably high risk,
it may not be possible to define a reasonable level
for a substantial difference. However, background
analysis is important in situations where back-
ground chemicals occur at concentrations above
risk-based criteria, and the statistical methods
presented in this guidance are useful tools for
background analysis in these situations.

This guidance does not establish a value for S
because it should be developed within the Quality
Assurance Project Plan on a case-by-case basis as
part of the planning process for site investigations.1
This Appendix does not establish policy, and is
provided as supplemental information on the statis-
tical considerations that support a selection for S.

A.I Precedents for Selecting a
Background Test Form

Hypothesis testing is used to make decisions under
conditions of uncertainty. The DQO process
provides a way for decision makers to determine the
requirements of the test, based on evaluation of the
consequences of making a Type 1 error (a) or a
Type 2 error ((3). Statisticians involved in develop-
ing the theory of hypothesis testing have noted the
asymmetry between these two types of errors:

The justification for fixing the Type 1 error to
be a (usually small and often taken as . 05 or
. 01) seems to arise from those testing situations
where the two hypotheses are formulated in
such a way that one type of error is more
serious than the other. The hypotheses are
stated so that the Type 1 error is more serious,
and hence one wants to be certain that it is
small2

These opinions are echoed by Bickel and Doksum,3
who use the symbol H for the null hypothesis (we
have used H0) and K for the alternative (our HA):

Even when we leave the area of scientific
research the relative importance of the errors
we commit in hypothesis testing is frequently
not the same. ...There is a general convention
that, if the labeling of H and K Is free, the label
H is assigned so that type 1 error is the most
important to the experimenter.

These opinions relate to the choice between two
complementary hypothesis tests with the difference
being the reversal of the null and the alternative.
This is a burden-of-proof issue. The comparison of
the two background test forms also involves
selection of an appropriate value for a "substantial
difference." It is also important to distinguish
between the value that characterizes a "substantial
difference over background" and the appropriate
risk-based "action level" for the chemical of
concern.
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page A-2
Existing EPA guidance in the data quality obj ectives
(DQO) process for choosing the null hypothesis has
focused on the burden-of-proof, when the contamin-
ant concentration is to be compared to a fixed, risk-
based action level, L. The choice of test forms for
this type of decision includes either
and
or
a) H0: X < L vs. HA: X > L

b) H0: X > L vs. HA: X < L,
where X represents the parameter of interest for the
distribution of contaminant concentrations in the
potentially impacted areas. Hypothesis test a
compares the site concentrations to the action level
using a null hypothesis that the site does not exceed
the action level and an alternative hypothesis that
the site exceeds the action level. Hypothesis test b is
the opposite of test a, using a null hypothesis, the
site exceeds the action level. Background issues are
not addressed directly in this framework.

One way to address background comparisons is to
reformulate the hypotheses using the difference
(delta—A) between the distribution of contaminant
concentrations and background:
and
a') H0: A < S vs. HA: A > S

b') H0: A > S vs. HA: A < S.
In hypothesis tests a' and b', concentrations in
potentially impacted areas and in background
locations are compared to determine if there is or is
not a substantial difference between the two areas.
Test a'uses the null hypothesis that the site does not
exceed background by more than a substantial
difference, while the opposite test b' uses the null
hypothesis that the site exceeds background by more
than a substantial difference (S). Approaches for
selecting a value for S are addressed in the follow-
ing section. Note that Test Form b 'is the one discus-
sed in Section 3.1.2 (Background Test Form 2).

Background Test Form 1 focuses interest on com-
parisons using a "substantial" difference of S = 0. In
this case, the two alternative tests are
a") H0: A < 0 vs. HA: A > 0

b") H0: A > 0 vs. HA: A < 0.
Background Test Form 1 (Section 3.1.1) is identical
with test a". This discussion demonstrates that the
two background tests addressed in this paper are not
opposite forms of the same test in the same sense
that tests a and b are opposite forms of the same test
with the same threshold. Since the guidance review-
ed in this section compares opposite forms of tests
with the same action level, the guidance does not
contain a direct recommendation for choosing

The two background test forms differ both in
terms of burden of proof and in the choice of a
substantial difference:

> Test Form 1 uses a conservative value for a
substantial difference of S = 0, but relaxes
the burden of proof by selecting the null
hypothesis that there is no statistically signi-
ficant difference.
> Test Form 2 requires a stricter burden of
proof, but permits a larger value for a
substantial difference.

between Test Forms 1 and 2. Distinguishing charac-
teristics are listed in the box below.

EPA QA/G-94 (Section 1.2) provides the following
guidance on the selection of an appropriate null
hypothesis in a choice between Test Forms a and b:

The decision on what should constitute the null
hypothesis and what should be the alternative is
sometimes difficult to ascertain. In many cases,
this problem does notarise because the null and
alternative hypotheses are determinedby speci-
fic regulation. However, when the null hypothe-
sis is not specified by regulation, it is necessary
to make this determination. The test of hypothe-
sis procedure prescribes that the null hypothesis
is only rejected in favor of the alternative,
provided there is overwhelming evidence from
the data that the null hypothesis is false. In
other words, the null hypothesis is considered to
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page A-3
be true unless the data show conclusively that
this is not so. Therefore it is sometimes useful to
choose the null and alternative hypotheses in
light of the consequences of possibly making an
incorrect decision between the null and alterna-
tive hypotheses. The true condition that occurs
with the more severe decision error (not what
would be decided in error based on the data)
should be defined as the null hypothesis. For
example, consider the two decision errors:
"decide a company does not comply with
environmental regulations when it truly does "
and "decide a company does comply with
environmental regulations when it truly does
not. " If the first decision error is considered
[the] more severe decision error, then the true
condition of this error, "the company does
comply with the regulations " should be defined
as the null hypothesis. If the second decision
error is considered the more severe decision
error, then the true condition of this error, "the
company does not comply with the regulations "
should be defined as the null hypothesis.

For background comparisons, that guidance may be
extrapolated. When deciding between Test Forms a "
and b", there are two possible decision errors:

(i) decide the site exceeds background when it
truly does not; and

(ii) decide the site does not exceed background
when it truly does.

Decision error (i) occurs when a "clean" site is
wrongly rejected. If decision error (i) is more
serious than decision error (ii), and if the choice is
between tests a " and b " with a substantial difference
of 0, then Background Test Form 1 (a") should be
selected.

When deciding between Test Forms a' and b', there
are two possible decision errors:

(i) decide the site exceeds background + S when it
truly does not; and
(ii) decide the site does not exceed background + S
when it truly does.

Decision error (ii) occurs when atruly contaminated
site goes undetected. If decision error (ii) is
considered more serious than error (i) and the choice
is between tests a" and b " with a substantial differ-
ence of S, then Background Test Form 2 should be
selected. Note that this logic does not provide a
direct comparison of the two forms of background
tests considered here, but does indicate situations
when Test Forms 1 or 2 may be recommended over
their respective opposites.

Chapter 6 of EPA QA/G-45 is succinct and
definitive for deciding between Test Form a and b:

"Define the null hypothesis (baseline condition)
and the alternative hypothesis and assign the
terms "false positive" and "false negative" to
the appropriate decision error.

"In problems that concern regulatory compli-
ance, human health, or ecological risk, the
decision error that has the most adverse
potential consequences should be defined as the
null hypothesis (baseline condition). In statisti-
cal hypothesis testing, the data must conclu-
sively demonstrate that the null hypothesis is
false. That is, the data must provide enough
information to authoritatively reject the null
hypothesis (reject the baseline condition) in
favor of the alternative. Therefore, by setting
the null hypothesis equal to the true state of
nature that exists when the more severe decision
error occurs, the decision maker guards against
making the more severe decision error by
placing the burden of proof on demonstrating
that the most adverse consequences will not be
likely to occur."

This suggests that environmental concerns are not
like the jury trial process, and that the "innocent
until proven guilty" assumption is an environ-
mentally risky approach. From this viewpoint, a
more protective approach would be to presume guilt,
and demand proof of innocence: "guilty until proven
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page A-4
innocent." Remember that this comparison assumes
that opposite forms of the same test (a and b) are
being evaluated. Extrapolation of this logic to the
background problem would indicate that Test Form
2 is preferred over its true opposite, but Test Form
1 is not preferred over its opposite.

EPA guidance6 adopts a conservative approach by
stating that when the results of the investigation are
uncertain, erroneously concluding that the sample
area does not attain the cleanup standard is prefer-
able to concluding that the sample area attains the
cleanup standard when it actually may not. Again
the recommended approach favors protection of
human health and the environment.

A.2 Options for Establishing the
Value of a Substantial Difference

Selection of an appropriate value to represent a
substantial difference when testing for differences
between concentrations in potentially impacted
areas and background areas depends on the intended
application of the test and a variety of factors. These
factors include site and background variability and
appropriate cleanup goals.

The term "substantial difference" (S) was defined at
the beginning of this Appendix as the difference in
mean concentration that presents a "substantial
risk." Alternatively, S may represent a selected "not-
to-exceed" action level that is appropriate for the
decision at hand. The application of either test form
for a background comparison requires that an upper
bound be established for the magnitude of the
difference before the site is determined to exceed
background. When using Test Form 1, the power of
the test is specified at the right edge of the gray
region, which has a width equal to the minimum
detectable difference (MDD). In this case, the value
of S serves as an upper bound for the width of the
gray region. When using Test Form 2, S is explicitly
incorporated into the test procedure. S is measured
in concentration units above the mean background
concentration. The decision to use a specific value
for a substantial difference may be based on direct
risk assessment, a generic regulatory value, or other
level selected to reflect site-specific conditions.

Background comparisons may be conducted at
various stages of site characterization and remedia-
tion cycle. In the characterization stage, areas with
some likelihood of contamination may be compared
to background areas to determine if contamination
is present in excess of background levels. For
example, the goal at this stage may be to determine
the areal extent of contamination on a large site. The
site is divided into sub-units that are compared to
background to determine if contamination is present
in the sub-unit. At this stage, Background Test Form
1 is useful for determining if the difference between
the site mean and the background mean is signifi-
cantly greater than zero. An upper bound for the
MDD of the test is set by determining a value of the
substantial difference S which will represent athres-
hold value for identifying possibly contaminated
sub-units.

Later in the site evaluation process, background
comparisons may be used to determine if a sub-unit
with known contamination has been sufficiently
remediated. At this stage, Background Test Form 2
is useful to demonstrate that the remediation was
successful. If the goal of the remediation is to
reduce contamination to near-background levels,
than an appropriate value of S is selected that will
repre sent the maximum amount by which a remedia-
ted sub-unit may exceed background.

A.2.1 Proportion of Mean Background
Concentration

One choice for selecting a value of S is to use a
specified proportion of typical mean background
concentrations for the contaminant of concern:

S = rMb

where Mb is the mean background concentration and
r is the specified proportion. This choice may be ap-
propriate for determining if contamination exists in
a sub-unit, or if a sub-unit has been remediated suc-
cessfully. There is no theoretical reason for restric-
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page A-5
ting r to proportions less than 1, if background
concentrations are far below the level that presents
a substantial risk. Values of r near 1 may require a
high number of samples, because the MDD for the
test should be less than S.

The required sample size is determined by MDD/o,
where o is the standard deviation of the concentra-
tions in potentially impacted areas. Even if the area
has little or no contamination, then o will be
approximately as large as the background standard
deviation, which is usually at least as large as the
background mean. Hence, if r is less than 1, then it
is very likely that MDD/o also is less than 1. If there
is contamination in the potentially impacted area,
then MDD/o will be much less than 1.

A.2.2 A Selected Percentile of the Back-
ground Distribution

Due to the high variability in background concen-
trations of many chemicals, defining S as a fraction
of the mean background concentration may not be
appropriate. Another choice for a value to represent
a substantial difference is to use a specified percen-
tile of the distribution of background concentrations
for the contaminant of concern:

S = (Bp-Mb)

where Bp is the p* percentile of the background
distribution and Mb is the mean background concen-
tration. Values of p less than 0.85 may require a
high number of samples, because the MDD for the
test should be less than S. This is because the 85th
percentile is approximately 1 standard deviation
above the background mean. When there is little or
no contamination on the site, S is approximately
equal to o, and hence, MDD/o usually will be near
1. If there is contamination, then MDD/o will be
much less than 1.

A.2.3 Proportion of Background Variability

A third choice for selecting a value to represent a
substantial difference is to use a specified propor-
tion of variance of background concentrations for
the contaminant of concern:

S = rob

where ob is the standard deviation of background
concentrations and r is the specified proportion.
This choice for a substantial difference is closely
related to the use of a percentile of the background
distribution discussed in Section A.3.2.

Areas with relatively high mean background con-
centrations generally also have high variance of
background. Values of r less than 1 may require a
high number of samples, for the reasons noted in
Section A.2.2.

A.2.4 Proportion of Preliminary Remediation
Goal

The concept of calculating risk-based soil concen-
trations to serve as reference points for establishing
site-specific cleanup levels was introduced in
RAGS.7 If a preliminary remediation goal (PRG) is
available for the contaminant of concern, the value
of S may be based on a proportion of the PRG:

S = r•PRG

A proportion less than 1 may be required, because
the total risk will be the sum of the incremental risk
due to S plus the risk due to background concen-
trations of the contaminant. If the PRG is less than
the mean or standard deviation of background, a
high number of samples may be required for conclu-
sive test results.

A.2.5 Proportion of Soil Screening Level

If a PRG is not available for the contaminant of
concern, a risk-based value of S may be based on the
soil screening level (SSL) for the contaminant.8

S = r-SSL

SSLs are based on a 10~6 individual risk for carcino-
gens and a hazard quotient of 1 for noncarcinogens.
SSLs were established to identify the lower bound
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page A-6
of the range of risks of interest in decision making,
and are not cleanup goals. SSL target risks should
be adjusted to reflect established cleanup level
targets. Again, a proportion less than 1 may be
required, because the total individual risk will be the
sum of the incremental risk due to S plus the risk
due to background concentrations of the contamin-
ant. If the (adjusted) SSL is less than the mean or
standard deviation of background, a high number of
samples may be required for the background
comparison.

A.3 Statistical Tests and Confidence
Intervals for Background
Comparisons

This section provides supplementary material on the
use of hypothesis tests and confidence intervals for
conducting background comparisons. The science of
statistics is often divided into two parts: estimation
theory and hypothesis testing. Estimation theory
includes the calculation of confidence intervals as
estimates for population parameters, while hypothe-
sis testing focuses on the use of statistical tests to
accept or reject hypotheses concerning these para-
meters. Although only the use of hypothesis tests
has been discussed in the main text, the one-to-one
correspondence between hypothesis tests for A con-
ducted at level a and the estimated 100( 1 -a) percent
confidence interval for A permits the use of either
method to conduct a background comparison. While
the emphasis of this section is technical in nature,
mathematical proofs of results have been omitted.

When using Test Form 1, a one-sided, level-a
hypothesis test of the null hypothesis A < 0 will only
reject the null hypothesis if we conclude that A is
significantly greater than zero by comparing the test
statistic to the tabulated critical value. The critical
value is selected to ensure that the probability the
test statistic will exceed the critical value by chance
alone is less than a. A similar conclusion is reached
when the lower bound of the one-sided, lOO(l-a)
percent confidence interval for A is greater than
zero. There are two ways to reach the same conclu-
sion that A is significantly greater than zero. A two-
sided confidence interval for A is often more useful
than a one-sided confidence interval to summarize
the information about A that is contained in the data.
In this case, a two-sided, 100(1-a) percent confi-
dence interval for A will correspond to a one-sided,
level-a/2 hypothesis test for A.

A.3.1 Comparisons Based on the t-Test

Background comparisons based on the t-test rely on
the assumption of a normal distribution for the data,
or for a transformation of the data. Hypotheses are
tested using the t-statistic, which follows the Student
t-distribution. Similar results are obtained by esti-
mating a confidence interval for A = (iy - (ix, where
(iy is the mean concentration in the potentially
impacted areas and (ix is the mean background
concentration.

NORMAL THEORY, CASE 1: EQUAL BUT UNKNOWN
VARIANCES9

For simplicity, we first assume that the site data (Y1;
..., YJ and background data (X1; ..., Xm) are
independent random samples from normal distribu-
tions with the same variance, a2, but with different
means, (iy and (ix, respectively:
and
In this case, the test statistic for the two-sample t-
test is based on the difference in the estimated
means, Mv and 1VL, where
and
y = 2Yj/n~N[ny,a2/n]

x = SXJ/m~N[nx,o2/m].
A pooled estimate for a2, the common variance of
the distributions, is

sp2=[I(YJ-My)2 + I(XJ-Mx)2]/(n + m-2).

The test statistic for conducting a t-test using
Background Test Form 1 is
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page A-7
where
In Background Test Form 1, the test statistic ^ has
the standardized Student-t distribution with n+m-2
degrees of freedom if (iy = (ix (A = 0). Let t^
represent the 100(1-0)* quantile of the Student t-
distribution with n+m-2 degrees of freedom. The
value t!_a is the critical value for the test. If the test
statistic t] exceeds the critical value t^, the null
hypothesis in Background Test Form 1 (H0: A < 0)
may be rejected with 100(1-a) percent confidence.

The test statistic for conducting a two-sample t-test
using Background Test Form 2 is

t2 = (Mx+S-My)/s*

where the quantity S is a substantial difference. The
test statistic t2 has a standard Student-t distribution
with n+m-2 degrees of freedom when (is = (iB + S. If
the test statistic t2 exceeds the critical value t^, then
the null hypothesis in Background Test Form 2 (H0:
A > S) may be rejected with 100(1-a) percent
confidence.

A 100(1-a) percent confidence interval for A is an
interval denoted as (A1; A2) that satisfies the require-
ment

PrjAj < A< A2} > 1 -a.

Here Aj represents the lower limit of the confidence
interval, and A2 represents the upper limit of the
confidence interval. Although one-sided hypothesis
tests were considered on the previous page, the
desired confidence interval is two-sided and sym-
metric, meaning that there is a probability of a/2 that
A will be below this interval and a probability of a/2
that it will be above this interval.

If the lower limit of a 100(1-a) percent confidence
interval for A is greater than zero, then the mean in
the potentially impacted area is significantly greater
than the background mean. This means that a one-
sided, level-a/2 test of the null hypothesis H0: A < 0
(Test Form 1) will reject the null hypothesis.
Similarly, if the upper limit of a 100(1-a) percent
confidence interval for A is less than S, then the
difference between the mean in the potentially
impacted area and the background mean is
significantly less than a substantial difference. This
means that a one-sided, level-a/2 test of the null
hypothesis H0: A > S (Test Form 2) will reject the
null hypothesis.

A symmetric confidence interval for the difference
A = (iy - (ix is constructed using t^^, which repre-
sents the 100(l-a/2)th quantile of the Student t-
distribution with n+m-2 degrees of freedom. A
100(1-a) percent confidence interval for A has the
form (A l5 A 2), where the lower bound is

A1 = (My-Mx)-t1.a/2s*

and the upper bound is

A2 = (My-Mx)+t1.a/2s*.

Although the distribution of the test statistic for the
two-sample Student t-test is derived based on the
assumption of normal distributions with equal
variances, the test is robust and has demonstrated
good performance when the variances are unequal,
and when the population distributions are not
normal. However, the estimates My, Mx and sp2 are
sensitive to outliers in either data set. If either or
both data sets contain non-detects, then the test will
be sensitive to most common methods of handling
these values. Confidence intervals derived using the
two-sample test statistic are expected to have similar
properties.

NORMAL THEORY, CASE 2: UNEQUAL, UNKNOWN
VARIANCES10

Now assume that the site data (Y1; ..., Yn) and
background data (X1; ..., Xm) are independent
random samples from normal distributions with
different means, (iy and (ix, and different variances,
oy2 and ox2, respectively:
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page A-8
and
Estimates for the sample variances are
and
x2 = I(XJ-Mx)2/(m-l).
An estimate of the approximate degrees of freedom
is
v = T2/b
where
T = sy2/n + sx2/m
and
b = (sy2/n)2 / (n - 1) + (sx2/m)2 / (m - 1).

A symmetric confidence interval for the difference
A = (iy - (ix is constructed using the Student t-
distribution with v* degrees of freedom, where v* is
the closest positive integer to v. Let t^^ represent
the 1 00( 1 -a/2)* quantile of this t-distribution with v*
degrees of freedom. An approximate 100(1 -a)
percent confidence interval for A has the form (A 1;
A 2), where the lower bound is
and the upper bound is

A2 = (My-Mx)+t1.a/2i1/2

A.3.2 Comparisons Based on the Wilcoxon
Rank Sum Test

The Wilcoxon Rank Sum (WRS)1 : test is a nonpara-
metric test for testing whether there is a difference
between the site and background population distri-
butions. The WRS test examines whether measure-
ments from one population tend to be consistently
larger (or smaller) than those from the other
population. The test determines which is the higher
distribution by comparing the relative ranks of the
two data sets when the data from both sources are
sorted into a single list. One assumes that any
difference between the site and background concen-
tration distributions represents a shift of the site
concentrations to higher values due to the presence
of contamination in addition to background. The
WRS test is most effective when contamination is
spread throughout a site.

Two assumptions underlying the WRS test are:

1) Samples from the background and site are
independent, identically distributed random
samples; and

2) Each measurement is independent of every
other measurement, regardless of the set of
samples from which it came.

The WRS test assumes that the distributions of the
two populations are identical in shape (variance),
although the distributions need not be symmetric.

The WRS test has three advantages over the t-test
for background comparisons:

1) The two data sets are not required to be from a
known type of distribution. The WRS test does
not assume that the data are normally or log-
normally distributed, although a normal distri-
bution approximation often is used to determine
the critical value for the test for large sample
sizes.

2) The WRS test is robust with respect to outliers
because the analysis is conducted in terms of
ranks of the data. This limits the influence of
outliers because a given data point can be no
more extreme than the first or last rank.

3) The WRS test allows for non-detect measure-
ments to be present in both data sets. The WRS
test can handle a moderate number of non-detect
values in either or both data sets by treating
them as ties.12

Theoretically, the WRS test can be used with up to
40 percent or more non-detect measurements in
either the background or the site data. Such a high
proportion of non-detects indicates that there will be
a large number of ties. In this case, the simple
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page A-9
expediency of assigning all ties the same ranks may
not be adequate. More specific procedures have
been developed to address data sets with a large
number of ties.13 If more than 40 percent of the data
from either the background or site are non-detect
values, the WRS test should not be used.

The WRS test may be applied to both forms of
background test, no statistically significant differ-
ence or exceed by more than a substantial differ-
ence. In either form of background test, the null
hypothesis is assumed to be true unless the evidence
in the data indicates that it should be rejected in
favor of the alternative.

The WRS test for Background Test Form 1 is
applied as outlined in the following steps. The site
and background measurements are ranked in a single
list in increasing order from 1 to N, where N = m +
n. All tied values are assigned the average of the
ranks for that group of measurements. All non-
detect values are considered as ties and are assigned
an average rank (if there are a total of ^ non-detects,
they all are assigned rank (t+l)/2, which is the
average of the first t integers).

The sum of the ranks of the site measurements (Wy)
and the sum of the ranks of the background
measurements (Wx) are sufficient statistics for the
test, where Wy + Wx = N(N + 1)12. The sum of the
ranks of the site measurements (Wy) is the test
statistic used for Background Test Form 1. To
conduct the test, Wy is compared with wa, which is
the critical value for a level-a WRS test for the
appropriate values of n and m.14 If Wy exceeds the
critical value for the test, the null hypothesis that
there is no statistically significant difference (A < 0)
may be rejected with 100(1-a) percent confidence.

The WRS test for Background Test Form 2 is
applied as outlined in the following steps. First, the
background measurements are adjusted by adding
the substantial difference S to each measured
value.15 Second, the S-adjusted background data and
the site data are ranked in a single list in increasing
order from 1 to N. Finally, all tied values are
assigned the average of the ranks for that group of
measurements.

The sum of the ranks of the S-adjusted background
measurements (Wx+s) is the test statistic for
Background Test Form 2. If Wx+s is greater than the
critical value forthe test, wa, the null hypothesis that
the site exceeds background by more than a
substantial difference (A > S) is rejected at the
100(1 -a) percent confidence level.

Nonparametric confidence intervals for A are
derived based on the Mann-Whitney form of the
WRS test (Section 5.3.2). The Mann-Whitney test
statistics are computed from the set of all possible
differences between the site and background data
sets:
There are n times m possible differences in this set,
so a computer program may be required to perform
the necessary calculations. Let the symbol Zr (r = 1,
. . . , nm) represent the r*-ranked difference in the
ordered set of all possible differences between the
site and background data. A symmetric nonpara-
metric confidence interval for A is constructed using
the ^-smallest ranked difference (Zk) and the &*-
largest ranked difference (Z^^) in the set of all
possible differences, where k depends on n, m, and
a.16 Thus, a 100 x (l-a) percent confidence interval
for A is a closed interval of the form
with
(Al5 A2) = (Zk,Znm.k+1)

k = Wa/2-n(n+l)/2.
Here, as noted above for the WRS test, w^ is the
tabulated critical value for a level-a/2 WRS test for
the appropriate values of n and m. This confidence
interval satisfies the requirement

Pr{ Aj< A< A2} > 1 -a.
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page A-10

APPENDIX A NOTES
1. U.S. Environmental Protection Agency (EPA). 2001. Requirements for Quality Assurance Project Plans,
EPA QA/R-5. http://www.epa.gov/QUALITY/qapps.html).

2. Mood, A.M., Graybill, F. A. And Boes, D. C., Introduction to the theory of statistics, 3rd Ed., McGraw
Hill, Boston, Mass., 1974, p. 411.

3. Bickel, P.J., and Doksom, K. A..,Mathematical Statistics: Basic Ideas and Selected Topics, Holden Day,
San Francisco, 1977, p. 168.

4. U.S. Environmental Protection Agency (EPA). July 2000. Guidance for Data Quality Assessment:
Practical Methods for Data Analysis, EPA QA/G-9, QAOO Version. Quality Assurance Management
Staff, Washington, DC. EPA 600-R-96-084. Available at http://www.epa.gov/quality/qa_docs.html.

5. U.S. Environmental Protection Agency (EPA). 1994. Guidance for the Data Quality Objectives Process,
EPA QA/G-4, EPA 600-R-96-065. Washington DC.

6. U.S. Environmental Protection Agency (EPA). 1989. Statistical Methods for Evaluating the Attainment
of Cleanup Standards Volume 3, subtitled Reference-Based Standards for Soils and Solid Media, EPA
230-02-89-042. Washington DC.

7. U.S. Environmental Protection Agency (EPA). 1989. Risk Assessment Guidance for Superfund Vol. I,
Human Health Evaluation Manual (Part A). Office of Emergency and Remedial Response, Washington,
DC. EPA 540-1-89-002.

8. U.S. Environmental Protection Agency (EPA). 1996. Soil Screening Guidance: Technical Background
Document, EPA 540-R-95-128.

9. Zwillinger, D. and S. Kokoska. 2000. CRC Standard Probability and Statistics Tables and Formulae,
Chapman and Hall/CRC Press, New York, Section 9.6.2.

10. Zwillinger and Kokoska, Op. Cit., Section 9.6.3.

11. The WRS test is also called the Mann-Whitney test, which is mathematically equivalent to the WRS test.
Sometimes, the combined name is used: Wilcoxon-Mann-Whitney test. See Section 5.3.2.

12. The Gehan form of the WRS test should be considered if there are many non-detect values with different
detection levels.

13. If there are many ties, see instructions for applying the WRS test in Conover, W.J., Practical
Nonparametric Statistics, 2nd Ed., John Wiley & Sons, Inc., New York, NY, 1980.

14. Critical values for the WRS test are available in many published texts and reference books. Two sources
are Conover, W.J., Practical Nonparametric Statistics, 2nd Ed., John Wiley & Sons, NY, 1980; and CRC
Standard Probability and Statistics Tables and Formulae, D. Zwillinger and S. Kokoska, Chapman and
Hall/CRC Press, Boca Raton, Florida, 2000.

Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page A-11
15. Conover, Practical Nonparametric Statistics, 2nd Ed., Op. Cit., p. 223, Equation 8.

16. Conover, Practical Nonparametric Statistics, 2nd Ed., Op. Cit., p. 224.
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
-------
APPENDIX B

POLICY CONSIDERATIONS FOR THE
APPLICATION OF BACKGROUND DATA IN RISK
ASSESSMENT AND REMEDY SELECTION
Role of Background in the CERCLA Cleanup Program
U.S. Environmental Protection Agency
Office of Solid Waste and Emergency Response
Office of Emergency and Remedial Response

April 26, 2002

OSWER 9285.6-07P
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page B-2
Table of Contents

Purpose B-3

History B-3

Definitions of Terms B-4

Consideration of Background in Risk Assessment B-5

Consideration of Background in Risk Management B-6

Consideration of Background in Risk Communication B-7

Hypothetical Case Examples B-7

References B-10
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page B-3
Purpose

This document clarifies the U.S. Environmental Protection Agency (EPA) preferred approach for the
consideration of background constituent concentrations of hazardous substances, pollutants, and contamin-
ants in certain steps of the remedy selection process, such as risk assessment and risk management, at
Comprehensive Environmental Response, Compensation, and Liability Act (CERCLA or "Superfund") sites.
To the extent practicable, this document may also be applicable to sites addressed under removal actions and
time-critical actions. In general, the presence of high background concentrations of hazardous substances,
pollutants, and contaminants found at a site is a factor that should be considered in risk assessment and risk
management.

The primary goal of the CERCLA program is to protect human health and the environment from current and
potential threats posed by uncontrolled releases of hazardous substances, pollutants, and contaminants.
Contamination at a CERCLA site may originate from releases attributable to the CERCLA site in question,
as well as contamination that originated from other sources, including natural and/or anthropogenic sources
not attributable to the specific site releases under investigation (EPA, 1995a). In some cases, the same
hazardous substance, pollutant, and contaminant associated with a release is also a background constituent.
These constituents should be included in the risk assessment, particularly when their concentrations exceed
risk-based concentrations. In cases where background levels are high or present health risks, this information
may be important to the public. Background information is important to risk managers because the CERCLA
program, generally, does not clean up to concentrations below natural or anthropogenic background levels.

A comprehensive investigation of all background substances found in the environment usually will not be
necessary at a CERCLA site. For example, radon background samples normally would not be collected at
a chemically contaminated site unless radon, or its precursor (radium, Ra-226) was part of the CERCLA
release. Also, EPA normally would not analyze background samples for Ra-226 at a cesium (Cs-137) site,
or dioxin at a lead site where dioxin was not the subject of a CERCLA release into the environment.

This document provides guidance to EPA Regions concerning how the Agency intends to exercise its
discretion in implementing one aspect of the CERCLA remedy selection process. The guidance is designed
to implement national policy on these issues.

Some of the statutory provisions described in this document contain legally binding requirements. However,
this document does not substitute for those provisions or regulations, nor is it a regulation itself. Thus, it
cannot impose legally-binding requirements on EPA, States, or the regulated community, and may not apply
to a particular situation based upon the circumstances. Any decisions regarding a particular remedy selection
decision will be made based on the statute and regulations, and EPA decision makers retain the discretion
to adopt approaches on a case-by-case basis that differ from this guidance where appropriate. EPA may
change this guidance in the future.

History

Background issues are discussed in a number of EPA documents.1 A need for CERCLA-specific guidance
> Risk Assessment Guidance for Superfund Volume I, Human Health Evaluation Manual [RAGS] (EPA, 1989).
(continued...)

Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page B-4
was identified during risk assessment reform discussions with stakeholders in 1997. An issue that is often
raised at CERCLA sites is whether a reliable representation of background is established (EPA, 1989). To
assist Regions with this issue, EPA developed a peer-reviewed practical guide to sampling and statistical
analysis of background concentrations in soil at CERCLA sites (EPA, 2001b).

EPA has developed this policy to respond to questions about the general application of background
concentration during the CERCLA remedial investigation process.2 This policy encourages national
consistency and responds to the Agency's goals for risk characterization and communication of risks to the
public as expressed in other EPA policy and guidance, including:

* Policy for Risk Characterization which provides principles for fully, openly, and clearly characterizing
risks (EPA, 1995b); and

*• Cumulative Risk Assessment Guidance which encourages programs to better advise citizens about the
environmental and public health risks they face (EPA, 1997c).

Definitions of Terms

For the purposes of this policy, the following definitions are used.

Background refers to constituents or locations that are not influenced by the releases from a site, and is
usually described as naturally occurring or anthropogenic (EPA, 1989; EPA, 1995a):

1) Anthropogenic - natural and human-made substances present in the environment as a result of
human activities (not specifically related to the CERCLA release in question); and,

2) Naturally occurring - substances present in the environment in forms that have not been
influenced by human activity.

Chemicals (or constituents) of concern (COCs) are the hazardous substances, pollutants, and contaminants
that, at the end of the risk assessment, are found to be the risk drivers or those that may actually pose
1 (...continued)
*• Preamble to the National Oil and Hazardous Substances Pollution Contingency Plan (NCP, 1990a).
*• Role of the Baseline Risk Assessment in Superfund Remedy Selection Decisions (EPA, 1991).
> Determination ofBackground Concentrations oflnorganics in Soils and Sediments at Hazardous Waste Sites (EPA,
1995a).
* Soil Screening Guidance: User's Guide (EPA, 1996).
* Ecological Risk Assessment Guidance for Superfund (EPA, 1997a).
> Rules of Thumb for Superfund Remedy Selection (EPA, 1997b).
> Soil Screening Guidance for Radionuclides: User's Guide (EPA, 2000).
* ECO Update. The Role of Screening-Level Risk Assessments and Refining Contaminants of Concern in Baseline
Ecological Risk Assessments (EPA, 200 la).

2 The process of determining when risks warrant remedial actions and the degree of cleanup for specific hazardous
substances, pollutants, and contaminants involves many factors that are not addressed in this document. Additional
guidance is provided in the EPA (1991) Role of the Baseline Risk Assessment in Superfund Remedy Selection Decisions.
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page B-5

unacceptable human or ecological risks.3 The COCs typically drive the need for a remedial action (EPA,
1999a).

Chemicals (or constituents) of potential concern (COPCs) generally comprise the hazardous substances,
pollutants, and contaminants that are investigated during the baseline risk assessment. The list of COPCs may
include all of the constituents whose data are of sufficient quality for use in the quantitative risk assessment,
or a subset thereof (EPA, 1989).

Screening is a common approach used by risk assessors to refine the list of COPCs to those hazardous
substances, pollutants, and contaminants that may pose substantial risks to health and the environment.
Screening involves a comparison of site media concentrations with site-specific risk-based values4

Consideration of Background in Risk Assessment

A baseline risk assessment generally is conducted to characterize the current and potential threats to human
health and the environment that may be posed by hazardous substances, pollutants, and contaminants at a
site. EPA's 1989 Risk Assessment Guidance for Superfund (RAGS) provides general guidance for selecting
COPCs, and considering background concentrations. In RAGS, EPA cautioned that eliminating COPCs based
on background (either because concentrations are below background levels or attributable to background
sources) could result in the loss of important risk information for those potentially exposed, even though
cleanup may or may not eliminate a source of risks caused by background levels. In light of more recent
guidance for risk-based screening (EPA, 1996; EPA, 2000) and risk characterization (EPA, 1995c), this
policy recommends a baseline risk assessment approach that retains constituents that exceed risk-based
screening concentrations. This approach involves addressing site-specific background issues at the end of
the risk assessment, in the risk characterization. Specifically, the COPCs with high background
concentrations should be discussed in the risk characterization, and if data are available, the contribution of
background to site concentrations should be distinguished.5 COPCs that have both release-related and
background-related sources should be included in the risk assessment. When concentrations of naturally
occurring elements at a site exceed risk-based screening levels, that information should be discussed
qualitatively in the risk characterization. To summarize:

*• The COPCs retained in the quantitative risk assessment should include those hazardous substances,
pollutants, and contaminants with concentrations that exceed risk-based screening levels.
3 Guidance for determining if site risks are unacceptable is discussed in the EPA (1991) Role of the Baseline Risk
Assessment in Superfund Remedy Selection Decisions. As stated in the EPA (1991) memorandum, "EPA uses the
general 10"4 to 10"6 risk range as a "target range" within which the Agency strives to manage risks as part of a Superfund
cleanup." The risk used in this decision generally is the "cumulative site risk" to an individual using reasonable
maximum exposure (RME) assumptions for either current or future land use and includes all exposure pathways which
the same person may consistently face. See also EPA (1989) RAGS, Section 8.3.

4 Risk-based values or concentrations are generally based on a cancer risk of one-in-a-million (IxlO"6) or a hazard
quotient of 1.0 for noncarcinogens (EPA, 1996) or screening-level ecological risk values (EPA, 1997a; EPA, 2001a).
COPCs with concentrations below the screening levels might be excluded fromthe risk assessment unless there are other
pathways or conditions that are not addressed by the screening values (EPA, 1996).

5 Technical guidance should be consulted for sampling and analysis of background concentration data (EPA, 200 Ib).

Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page B-6
> The Risk Characterization should include a discussion of elevated background concentrations of COPCs
and their contribution to site risks.

> Naturally occurring elements that are not CERCLA hazardous substances, pollutants, and contaminants,
but exceed risk-based screening levels should be discussed in the risk characterization.

This general approach is preferred in order to:

> Encourage national consistency in this area;

> Present a more thorough picture of risks associated with hazardous substances, pollutants, and
contaminants at a site; and

> Prevent the inadvertent omission of potentially release-related hazardous substances, pollutants, and
contaminants from the risk assessment.

This approach is consistent with the Policy for Risk Characterization which provides principles for fully,
openly, and clearly characterizing risks (EPA, 1995b). Risks identified during the baseline risk assessment
should be clearly presented and communicated for risk managers and for the public. Risk characterization
is one of many factors in determining appropriate CERCLA risk management actions (EPA, 1991; EPA,
1995b).

Consideration of Background in Risk Management

Where background concentrations are high relative to the concentrations of released hazardous substances,
pollutants, and contaminants, a comparison of site and background concentrations may help risk managers
make decisions concerning appropriate remedial actions. The contribution of background concentrations to
risks associated with CERCLA releases may be important for refining specific cleanup levels for COCs that
warrant remedial action.6

Generally, under CERCLA, cleanup levels are not set at concentrations below natural background levels.
Similarly, for anthropogenic contaminant concentrations, the CERCLA program normally does not set
cleanup levels below anthropogenic background concentrations (EPA, 1996; EPA, 1997b; EPA, 2000). The
reasons for this approach include cost-effectiveness, technical practicability, and the potential for
recontamination of remediated areas by surrounding areas with elevated background concentrations. In cases
where area-wide contamination may pose risks, but is beyond the authority provided under CERCLA, EPA
may be able to help identify other programs or regulatory authorities that are able to address the sources of
area-wide contamination, particularly anthropogenic (EPA, 1996; EPA, 1997b; EPA, 2000). In some cases,
as part of a response to address CERCLA releases of hazardous substances, pollutants, and contaminants,
EPA may also address some of the background contamination that is present on a site due to area-wide
contamination.
6 For example, in cases where a risk-based cleanup goal for a COC is below background concentrations, the cleanup
level may be established based on background.

Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page B-7
The determination of appropriate CERCLA response actions and chemical-specific cleanup levels includes
the consideration of nine criteria as provided in the National Oil and Hazardous Substances Pollution
Contingency Plan (NCP, 1990b). In cases where applicable or relevant and appropriate requirements
(ARARs) regarding cleanup to background levels apply to a CERCLA action, the response action generally
should be carried out in the manner prescribed by the ARAR. In the case where a law or regulation is
determined to be an ARAR and it requires cleanup to background levels, the ARAR will normally apply and
be incorporated into the Record of Decision, unless the ARAR is waived.

Consideration of Background in Risk Communication

EPA strives for transparency in decision-making (EPA, 1995c) and encourages programs to better advise
citizens about the environmental and public health risks they face (EPA, 1997c). The presence of high
background concentrations of COPCs may pose challenges for risk communication. For example, the
discussion of background may raise the expectation that EPA will address those risks under CERCLA. The
knowledge that background substances may pose health or environmental risks could compound public
concerns in some situations.

On the other hand, knowledge of background risks could help some community members place CERCLA
risks in perspective. Also, the information about site and background risks can be helpful for both risk
managers who make an appropriate CERCLA decision, and for members of the public who should know
about environmental risk factors that come to light during the remedial investigation process.

As a general policy matter, EPA strives for early and frequent outreach to communities in order to share
information and encourage involvement (EPA, 200 Ic). EPA has made a clear commitment to fully, openly,
and clearly characterize and communicate risks (EPA, 1995b; EPA, 1995c). There is no one-size-fits-all
technique that can help explain risks associated with CERCLA releases or with background levels, or the
basis of risk management decisions. Approaches will depend on the site, the issues, and the level of
community interest. Early on in the process, Regions should clarify their understanding of stakeholder
expectations and clearly explain the relevant constraints and limitations of the CERCLA remedial process
(EPA, 1999b;EPA, 2001c).

In some cases where area-wide contamination may pose a risk, but is beyond the authority of the CERCLA
program, communication of potential risks to the public may be most effective when coordinated with public
health agencies. Examples of situations where Regions might coordinate risk communication with local, state
or federal health officials are sites where widespread lead contamination or high levels of naturally occurring
radiation have been found, but are not the subject of a CERCLA release into the environment. Public health
agency officials may combine education and outreach efforts to inform residents about ways to reduce
exposures and risks.

Hypothetical Case Examples

Three general hypothetical case examples are given to show how background may be considered in risk
assessment and risk management at CERCLA sites:

Case 1 presents an example of a chemical site with widespread background contamination.

Case 2 presents an example of a radiation site with both natural- and release-related sources.

Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page B-8
Case 3 presents an example of a site with hazardous substances, pollutants, and contaminants from both
natural- and release-related sources.

In these examples, it is presumed that adequate samples are collected from appropriate background reference
locations and evaluated using appropriate statistical methods. It is presumed that background is not used to
screen out substances from the risk assessment. For simplicity, only one pathway7 is used for hypothetical
human health risk assessments.8

Based on the presumptions above, the basic concepts these examples are designed to highlight are:

> Background issues should be discussed in the risk characterization portion of the baseline risk
assessment in order to inform risk management decisions;

> Information about unacceptable risks should be communicated to public; and

> Other factors, such as the nine criteria provided in the NCP, should be considered by the risk manager
in making final decisions.

Hypothetical Case 1

The ABC Industrial Site risk assessment included all COPCs that exceed site-specific risk-based concen-
trations for soil pathways. The results of the risk assessment identified the following COPCs with risks above
or at the high end of the 10~4 to 10~6 risk range: arsenic, dieldrin, and 4,4-DDT. The hazard quotients were
below 1.0.

Arsenic is a potential background substance—it is a common naturally occurring element—but is also a
hazardous substance that was released at this site. The available site characterization data indicate that soil
arsenic concentrations may be naturally occurring or consistent with background concentrations. Dieldrin
and DDT are present at high concentrations that contribute to an unacceptable site risk. However, only
dieldrin is known to be associated with the CERCLA site activities and releases. Since there are no known
historical uses of DDT at this CERCLA site, the RPM suspects that the DDT in soil originated from area-
wide agricultural pesticide applications in this part of the state. Based on this information, the RPM requests
additional sampling of background locations for arsenic and DDT analysis. A statistical comparison of
sampling data for arsenic and 4,4-DDT in on-site samples and background samples indicates that site
concentrations for DDT are consistent with background concentrations. Local and regional data support the
conclusion that DDT is an area-wide contaminant. The additional data indicate that arsenic concentrations
7 At most CERCLA sites, risks for the reasonably maximum exposed individual typically are combined across several
exposure pathways to estimate the total risks at a CERCLA site. This is done only for the pathways which the same
individual would be likely to face consistently (EPA, 1989). Depending on the particular CERCLA site, risks could be
calculated for the entire area of the site or for separate units (see Section 4.5 of RAGS (EPA, 1989)). More technical
guidance for characterizing background concentrations and comparing data sets is provided in EPA (200 Ib) and other
technical references cited previously in this document.

8 Guidance on the consideration of background concentrations during screening level ecological risk assessments is
provided in EPA (200la).

Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
Page B-9
on the site are above background concentrations. Therefore, the arsenic risks cannot be attributed solely to
background.

In this example, arsenic and dieldrin are the soil COCs for which cleanup goals should be derived. The risk
characterization should present information about DDT as an area-wide background contaminant that is
unrelated to releases at this site, and the Agency should explain whether or not it will be addressed. The RPM
should consider whether other regulatory programs or authorities are able to address the area-wide DDT
contamination in a coordinated response effort. If available, the location(s) of additional information on
pesticide use in this part of the state should be provided for concerned citizens.

Hypothetical Case 2

At ABC Radium Production Site, site characterization data indicate that radium (Ra-226) and inorganics are
present in soil. Arsenic concentrations exceed screening levels but are assumed to be within naturally
occurring levels. To confirm this assumption, the RPM evaluates site-specific background samples for
comparison to site concentrations. The site-specific background analysis confirms that arsenic concentrations
collected on the site are consistent with background concentrations in soils. There are no known regional
anthropogenic sources of arsenic (such as smelters or pesticide manufacturers). Arsenic, in this case, is
considered to be a naturally occurring substance and is excluded from further consideration in the quantifica-
tion of site risks. However, the finding of natural background arsenic at concentrations that may pose health
risks should be discussed in the text of the risk characterization.

The risk assessment indicates that Ra-226 exceeds the high end of the acceptable risk range of 10~4 to 10~6.
It is commonly known that Ra-226 occurs naturally in the environment. Samples collected in an appropriate
background location near this site indicate that Ra-226 levels from natural sources are lower than the site
levels, but are associated with a risk at the upper end of the risk range (10~4).

In this example, only Ra-226 should be a COC for which a cleanup goal should be derived. The risk
characterization, however, should include a discussion of natural background levels of both arsenic and Ra-
226.

Hypothetical Case 3

XYZ Site contains buried chemical wastes, but some anecdotal accounts indicate that radium may have been
used. Preliminary site characterization data show that arsenic, manganese, and Ra-226 concentrations exceed
the site-specific, risk-based concentrations. A comparison of arsenic and manganese concentrations in
groundwater samples collected from upgradient background locations indicates that only manganese site
concentrations are consistent with background levels and considered to be naturally occurring. Naturally
occurring manganese is not considered further in the quantification of risks, but is included in a qualitative
discussion of risks in the risk characterization.

The RPM decides to analyze for Ra-226 both at the site and in background locations because it is commonly
known that Ra-226 occurs naturally in the environment. Samples are collected in an appropriate background
location near this site. The samples indicate that Ra-226 levels at this site are not different from naturally
occurring levels. Therefore, Ra-226 is not a COPC for further consideration in the quantification of risks.
Subsequent site investigation data confirms the use of chemicals, but not radionuclides.
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
PageB-10
In this example, only arsenic risks are quantified in the risk assessment. The baseline risk for groundwater
indicates that arsenic poses an unacceptable risk. The risk characterization should include a discussion of
the natural Ra-226 and manganese concentrations because the levels exceeded risk-based concentrations. Site
characterization data indicate that site disposal activities caused naturally occurring arsenic in soil to be
mobilized and leach to groundwater. Arsenic, therefore, is the subject of a CERCLA release into the
environment and a cleanup goal for it should be derived.

References

U.S. Environmental Protection Agency (EPA). 1989. Risk Assessment Guidance for Superfund (RAGS):
Volume I: Human Health Evaluation Manual (HHEM), (Part A), Interim Final, Office of Emergency
and Remedial Response, Washington, DC. EPA/540/1-89/002, OSWER 9285.70-02B.

U.S. Environmental Protection Agency (EPA). 1991. Role of the Baseline Risk Assessment in Superfund
Remedy Selection Decisions, Office of Emergency and Remedial Response, Washington, DC. OSWER
9355.0-30.

U.S. Environmental Protection Agency (EPA). 1995a. Engineering Forum Issue Paper. Determination of
Background Concentrations of Inorganics in Soils and Sediments at Hazardous Waste Sites, R.P
Breckenridge and A.B. Crockett, Office of Research and Development, Office of Solid Waste and
Emergency Response, Washington, DC. EPA/540/S-96/500.

U.S. Environmental Protection Agency (EPA). 1995b. Policy for Risk Characterization at the U. S.
Environmental Protection Agency, Science Policy Council, Washington, DC. http://www.epa.gov/OSP/
spc/rcpolicy.htm.

U.S. Environmental Protection Agency (EPA). 1995c. Risk Characterization Handbook, Science Policy
Council, Washington, DC. EPA 100-B-00-002.

U.S. Environmental Protection Agency (EPA). 1996. Soil Screening Guidance: User's Guide. Office of
Emergency and Remedial Response, Washington, DC. EPA/540-R-96/018, OSWER 9355.4-23.

U.S. Environmental Protection Agency (EPA). 1997a. Ecological Risk Assessment Guidance for Superfund:
Process for Designing and Conducting Ecological Risk Assessments, Interim Final, EPA/540-R-97-006,
OSWER 9285.7-25.

U.S. Environmental Protection Agency (EPA). 1997b. Rules of Thumb for Superfund Remedy Selection.
Office of Emergency and Remedial Response, Washington, DC. EPA 540-R-97-013, OSWER9355.0-69.

U.S. Environmental Protection Agency (EPA). 1997c. Cumulative Risk Assessment Guidance-Phase I
Planning and Scoping, Science Policy Council, Washington, DC. http://www.epa.gov/osp/spc/cumrisk2.
htm.

U.S. Environmental Protection Agency (EPA). 1999a. A Guide to Preparing Superfund Proposed Plans,
Records of Decision, and Other Remedy Selection Decision Documents, Office of Emergency and

Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------
PageB-11
Remedial Response, Washington, DC. OSWER 9200.1-23.P.

U.S. Environmental Protection Agency (EPA). 1999b. Risk Assessment Guidance for Superfund: Volume 1
- Human Health Evaluation Manual Supplement to Part A: Community Involvement in Superfund Risk
Assessments, Office of Emergency and Remedial Response, Washington, DC. OSWER 9285.7-01E-P.

U.S. Environmental Protection Agency (EPA). 2000. Soil Screening Guidance for Radionuclides: User's
Guide, Office of Radiation and Indoor Air, OSWER 9355.4-16A.

U.S. Environmental Protection Agency (EPA). 2001a. ECO Update, The Role of Screening-Level Risk
Assessments and Refining Contaminants of Concern in Baseline Ecological Risk Assessments, OSWER
9345.0-14.

U.S. Environmental Protection Agency (EPA). 200 Ib. Guidance for Characterizing Background Chemicals
in Soil at Superfund Sites, External Review Draft, Office of Emergency and Remedial Response,
OSWER. 9285.7-41. [Replaced by Guidance for Comparing Background and Chemical Concentrations
in Soil for CERCLA Sites, EPA 540-R-01-003, September 2002.]

U.S. Environmental Protection Agency (EPA). 2001 c. Early andMeaningfiil Community Involvement, Office
of Emergency and Remedial Response, OSWER 9230-0-9.

NCP, 1990a. Preamble to the National Oil and Hazardous Substances Pollution Contingency Plan (NCP),
40 CFR Part 300, 53 Federal Register 51394 and 55 Federal Register 8666.

NCP, 1990b. National Oil and Hazardous Substances Pollution Contingency Plan (NCP), 40 CFR Part 300,
55 Federal Register 8666, March 8, 1990.
Guidance for Comparing Background and Chemical Concentrations in Soil for CERCLA Sites
-------