EPA
Unified Guidance

                STATISTICAL ANALYSIS OF
          GROUNDWATER MONITORING DATA AT
                    RCRA FACILITIES

                    UNIFIED GUIDANCE

     OFFICE OF RESOURCE CONSERVATION AND RECOVERY
   PROGRAM IMPLEMENTATION AND INFORMATION DIVISION
        U.S. ENVIRONMENTAL PROTECTION AGENCY

         EPA 530-R-09-007                    March 2009

                                     DISCLAIMER

       This Unified Guidance has been prepared to assist EPA's Regions, the States and the regulated
community in testing and evaluating groundwater monitoring data under 40 CFR Parts 264 and 265 and
40 CFR Part 258. This guidance is not a rule, is not legally enforceable, and does not confer legal rights
or impose legal obligations on any member of the public,  EPA, the States or any other agency. While
EPA has made every effort to ensure the accuracy of the discussion in this guidance, the obligations of
the regulated community are determined by the relevant statutes, regulations, or other legally binding
requirements. The use of the term "should" when used in this guidance does not connote a requirement.
This guidance may not apply in a particular situation based on the circumstances. Regional and State
personnel retain the discretion to adopt approaches on a case-by-case basis that differ from this guidance
where appropriate.

       It should be stressed that this guidance is a work in progress. Given the complicated nature of
groundwater and geochemical behavior, statistical applications describing and evaluating data patterns
have evolved over time. While many new  approaches and a conceptual framework have been provided
here based on our understanding  at the time of publication, outstanding issues remain. The Unified
Guidance sets out mostly classical statistical  methods using  reasonable interpretations of existing
regulatory objectives and  constraints. But even  these highly  developed mathematical models deal
primarily with sorting out chance effects from potentially real differences or trends. They do not exhaust
the possibilities of  groundwater  definition  using other technical or  scientific  techniques  (e.g.,
contaminant modeling or geostatistical evaluations). While providing a workable decision framework,
the models and  approaches offered within  the Unified Guidance are only approximations of a complex
underlying reality.

       While providing a basic understanding of underlying statistical principles, the guidance does not
attempt to reproduce the more thorough explanations and derivations found in standard texts and
papers. Nor does it comprehensively cover all potential statistical approaches; it confines itself to
reasonable, current methods that will work in the present RCRA groundwater context. While it
is highly likely that methods promoted in this guidance will be applied using commercial or proprietary
statistical software, a detailed discussion of software applications is beyond the scope of this document.

       This document has been reviewed by the Office of Resource Conservation and Recovery (formerly
the Office of Solid Waste), U.S. Environmental Protection Agency, Washington, D.C., and approved for
publication.  Mention of  trade names, commercial  products, or  publications  does not constitute
endorsement or recommendation for use.
          "It is far better to have an approximate answer to the right
          question than a precise answer to the wrong question..." — John
          Mauser
                             ACKNOWLEDGMENTS

      EPA's Office of Solid Waste developed initial versions of this document under the direction of
James R. Brown and Hugh Davis, Ph.D. of the Permits and State Programs Division. The final draft
was developed and edited under the direction of Mike Gansecki, EPA Region 8. The guidance was
prepared by Kirk M. Cameron, Ph.D., Statistical Scientist and President of MacStat Consulting, Ltd. in
Colorado Springs, Colorado.  It also  incorporates the substantial efforts on the  1989 Interim  Final
Guidance of Jerry Flora, Ph.D. and Ms. Karen Bauer, both — at the time — of Midwest Research
Institute in Kansas City, Missouri. Science Applications  International Corporation (SAIC) provided
technical support in developing this document under EPA Contract No. EP-WO-5060.

      EPA also  wishes to acknowledge the  time and effort spent in reviewing and improving the
document  by a workgroup composed of statisticians, Regional and State personnel,  and industry
representatives — Dr. Robert  Gibbons, Dr. Charles Davis, Sarah Hession, Dale Bridgford, Mike  Beal,
Katie Crowell, Bob Stewart, Charlotte Campbell, Evan Englund, Jeff Johnson, Mary Beck, John Baker
and Dave  Burt.  We also wish to acknowledge the excellent comments  by a number of state,  EPA
Regional and industry parties  on the September 2004 draft. Finally, we gratefully acknowledge the
detailed reviews, critiques and comments of Dr. Dennis Helsel of the U.S. Geological Survey, Dr. James
Loftis of Colorado State University, and Dr.  William Huber of Quantitative Decisions Inc., who
provided formal peer reviews of the September 2004 draft.

     A  special note of thanks is due to Dave Bartenfelder and Ken Lovelace of the EPA CERCLA
program, without whose assistance this document would not have been completed.
                              EXECUTIVE SUMMARY

     The Unified Guidance provides a suggested framework and recommendations for the statistical
analysis of groundwater monitoring data at RCRA facility units subject to 40 CFR Parts 264 and 265
and 40 CFR Part 258, to determine whether groundwater has been impacted by a hazardous constituent
release. Specific statistical methods are identified in the RCRA regulations, but their application is not
described in any detail. The Unified Guidance provides examples and background information that will
aid in successfully conducting the required statistical analyses. The Unified Guidance draws upon the
experience  gained in the last decade in implementing  the RCRA Subtitle C and D groundwater
monitoring programs and new research that has emerged since earlier Agency guidance.

      The  guidance is  primarily oriented  towards  the  groundwater monitoring  statistical analysis
provisions of 40 CFR Parts 264.90 to .100.  Similar requirements for groundwater monitoring at solid
waste landfill facilities  under 40 CFR Part 258 are also  addressed.  These regulations  govern the
detection, characterization and response to releases from regulated units into the uppermost  aquifer.
Some of the methods and strategies set out in this guidance may also be appropriate for analysis of
groundwater monitoring data from solid waste management units subject to 40 CFR 264.101. Although
the focus of this guidance is to address the RCRA regulations, it can be used by the CERCLA program
and for improving remedial actions at other groundwater monitoring programs.

      Part I of the Unified Guidance introduces the context for statistical testing at RCRA facilities. It
provides an overview of the regulatory requirements, summarizing the current RCRA Subtitle C and D
regulations and outlining the statistical methods in the final rules, as well as key regulatory sections
affecting statistical decisions. It explains the basic groundwater monitoring framework, philosophy and
intent of each stage of monitoring — detection, compliance (or assessment), and corrective action —
and certain features common to the groundwater monitoring environment. Underlying statistical ideas
common to all statistical test procedures are identified, particularly issues involving false positives
arising from multiple statistical comparisons and statistical power to detect contamination.
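
       The multiple-comparison concern can be made concrete with a short sketch in R (the well count,
constituent count, and per-test significance level below are hypothetical choices for illustration, not
values drawn from the guidance): when many independent tests are run, the chance of at least one false
positive grows rapidly.

```r
# Hypothetical monitoring network: 10 compliance wells and 5 constituents,
# each tested once per evaluation at a per-test alpha of 0.01.
alpha.per.test <- 0.01
n.tests <- 10 * 5

# Chance of at least one false positive somewhere on the site, assuming the
# individual tests are statistically independent.
site.wide.fpr <- 1 - (1 - alpha.per.test)^n.tests
site.wide.fpr   # close to 40% for these assumed inputs
```

Controlling this cumulative error rate while retaining power to detect real releases is the design
problem taken up in Part I.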

     A new component of the Unified Guidance addresses issues of statistical design: what factors are
important in constructing a reasonable and effective statistical monitoring program. These include the
establishment and updating of background data, designing an acceptable detection monitoring plan, and
statistical strategies for compliance/assessment monitoring and corrective action. This part also includes
a short summary of statistical methods recommended in the Unified Guidance, detailing conditions for
their appropriate use.

      Part II of the Unified Guidance covers diagnostic evaluations of historical facility data for the
purpose of checking key assumptions implicit in the recommended statistical  tests and for making
appropriate adjustments to the data (e.g., consideration of  outliers, seasonal autocorrelation, or non-
detects). Also included is a discussion of groundwater sampling and how hydrologic factors  such as
flow and gradient can impact the sampling program. Concepts of statistical and physical independence
are compared, with  caveats provided regarding the impact of dependent data on statistical test results.
Statistical methods  are suggested for  identifying special kinds of dependence  known  as  spatial and
temporal variation, including reasonable approaches when these dependencies are observed. Tests for
trends are also included in this part.
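
       As a minimal illustration of the kind of diagnostic checking described above (the measurements
below are invented, and the Shapiro-Wilk test shown is only one of several checks covered in Part II), a
normality test can be run on a candidate background data set before a parametric procedure is chosen:

```r
# Hypothetical background measurements (mg/L), for illustration only.
background <- c(3.9, 4.4, 4.1, 5.0, 4.6, 4.2, 4.8, 4.3, 5.3, 4.0)

# Shapiro-Wilk test of normality; a small p-value suggests the normal model
# is a poor fit and a transformation or non-parametric test may be preferable.
shapiro.test(background)
```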

      Part III of the Unified Guidance presents a range of detection monitoring statistical procedures.
First, there is a discussion of the Student's t-test and its non-parametric counterpart, the Wilcoxon rank-
sum test, when comparing two groups of data (e.g., background versus one downgradient well). This
part defines both parametric and non-parametric prediction limits, and their application to groundwater
analysis when  multiple  comparisons are involved. A variety  of prediction limit possibilities are
presented  to  cover likely  interpretations of  sampling and  testing requirements  under the  RCRA
regulations.
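
       To make the two-sample comparison concrete, here is a brief hedged sketch in R (the
concentrations are invented and the significance level is left at its default; neither is a recommendation),
run as one-sided upper-tailed tests since detection monitoring looks for increases over background:

```r
# Hypothetical background and downgradient-well concentrations (mg/L).
background   <- c(4.2, 3.8, 5.1, 4.6, 4.0, 4.9, 4.4, 5.0)
downgradient <- c(5.8, 6.3, 5.5, 6.9)

# Parametric comparison: Welch's t-test, testing whether the downgradient
# mean exceeds the background mean.
t.test(downgradient, background, alternative = "greater")

# Non-parametric counterpart: Wilcoxon rank-sum test.
wilcox.test(downgradient, background, alternative = "greater")
```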

      Substantial detailed guidance is offered for using prediction limits with retesting procedures, and
how various retesting algorithms might be constructed.  The final chapter of this Part considers another
statistical method especially useful for intrawell comparisons, namely the Shewhart-CUSUM  control
chart. A brief discussion of analysis of variance [ANOVA] and tolerance limit tests identified in the
RCRA regulations is also provided.
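
       A sketch of the parametric prediction limit idea underlying these strategies is given below, using
an invented background sample and the usual normal-theory formula for a single future value; the
retesting adjustments discussed in Part III are not shown.

```r
# Hypothetical background data (mg/L) from one well, for illustration only.
bg <- c(4.1, 3.7, 4.9, 4.4, 4.0, 4.6, 4.2, 5.1)
n  <- length(bg)

# 99% upper prediction limit for the next single measurement, assuming
# approximate normality: xbar + t * s * sqrt(1 + 1/n).
upl <- mean(bg) + qt(0.99, df = n - 1) * sd(bg) * sqrt(1 + 1/n)
upl   # a new measurement above this limit would prompt further evaluation
```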

      Part IV of the Unified Guidance is devoted to statistical methods recommended for compliance
or  assessment monitoring  and corrective  action.  Compliance  monitoring typically involves  a
comparison of downgradient well data to a groundwater protection standard [GWPS], which may be a
limit derived from background or a fixed concentration limit (such as in 40 CFR 264.94 Table 1, an
MCL,  a  risk-based limit,  an alternate  concentration  limit,  or a  defined clean-up  standard under
corrective action).  The  key statistical procedure is the  confidence interval, and several confidence
interval tests (mean,  median, or upper  percentile)  may be appropriate  for  compliance  evaluation
depending on the circumstances.  The choice depends on the distribution of the data,  frequency of non-
detects, the type of standard being compared,  and whether or not the data exhibit a significant trend.
Discussions in this part consider  fixed compliance standards  used in a variety of EPA programs and
what  they might represent in statistical terms.  Strategies for corrective action  differ from those
appropriate for compliance monitoring primarily because statistical hypotheses are changed, although
the same basic statistical methods  may be  employed.
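
       As a minimal sketch of the confidence interval approach (the data and the standard are
hypothetical, and which bound of the interval is compared to the standard depends on whether the
hypothesis structure is that of compliance/assessment or of corrective action, as discussed in Part IV):

```r
# Hypothetical compliance-well concentrations (ug/L) and an assumed fixed
# standard (e.g., an MCL-type limit); values are for illustration only.
well <- c(9.1, 11.4, 10.2, 12.0, 9.8, 10.7, 11.1, 10.4)
gwps <- 10

# 99% confidence interval on the mean.
ci <- t.test(well, conf.level = 0.99)$conf.int
ci

# Compliance/assessment asks whether the mean exceeds the standard (lower
# bound above the GWPS); corrective action asks whether it has returned
# below it (upper bound below the GWPS).
c(exceedance = ci[1] > gwps, attained = ci[2] < gwps)
```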

     Since some programs  will also utilize background as standards for compliance and corrective
action monitoring, those tests and discussions under Part III detection monitoring (including statistical
design in Part I) may pertain in identifying the appropriate standards and tests.

       A glossary of important statistical terms, references and a subject index are provided at the end
of the  main  text.  The Appendices contain additional notes on a number of topics including previous
guidance, a special study for the guidance, more detailed statistical power discussions, and an extensive
set of statistical tables for implementing the methods outlined in the Unified Guidance. Some tables,
especially those for prediction limit retesting procedures, have been extended  within the  Unified
Guidance beyond published sources in order to  cover a wider variety of plausible scenarios.
                           TABLE OF CONTENTS

DISCLAIMER	i
ACKNOWLEDGMENTS	ii
EXECUTIVE SUMMARY	iii
TABLE OF CONTENTS	v
          PART I.  STATISTICAL DESIGN AND PHILOSOPHY


CHAPTER 1.   OBJECTIVES AND POTENTIAL USE OF THIS GUIDANCE

   1.1 OBJECTIVES	1-1
   1.2 APPLICABILITY TO OTHER ENVIRONMENTAL PROGRAMS	1-3

CHAPTER 2.   REGULATORY OVERVIEW

   2.1 REGULATORY SUMMARY	2-1
   2.2 SPECIFIC REGULATORY FEATURES AND STATISTICAL ISSUES	 2-6
      2.2.1 Statistical Methods Identified under §264.97(h) and §258.53(g)	2-6
      2.2.2 Performance Standards under §264.97(i) and §258.53(h)	 2-7
      2.2.3 Hypothesis Tests in Detection, Compliance and Corrective Action Monitoring	2-10
      2.2.4 Sampling Frequency Requirements	2-10
      2.2.5 Groundwater Protection Standards	2-12
   2.3 UNIFIED GUIDANCE RECOMMENDATIONS	         2-13
      2.3.1 Interim Status Monitoring	2-13
      2.3.2 Parts 264 and 258 Detection Monitoring Methods	 2-14
      2.3.3 Parts 264 and 258 Compliance/assessment Monitoring	2-15

CHAPTER 3.    KEY STATISTICAL CONCEPTS

   3.1 INTRODUCTION TO GROUNDWATER STATISTICS	3-2
   3.2 COMMON STATISTICAL ASSUMPTIONS	3-4
      3.2.1 Statistical Independence	 3-4
      3.2.2 Stationarity.	3-5
      3.2.3 Lack of Statistical Outliers	3-7
      3.2.4 Normality	 3-7
   3.3 COMMON STATISTICAL MEASURES	3-9
   3.4 HYPOTHESIS TESTING FRAMEWORK	3-12
   3.5 ERRORS IN HYPOTHESIS TESTING	 3-14
      3.5.1 False Positives and Type I Errors	3-15
      3.5.2 Sampling Distributions, Central Limit Theorem	3-16
      3.5.3 False Negatives, Type II Errors, and Statistical Power.	 3-18
      3.5.4 Balancing Type I and Type II Errors	3-22

CHAPTER 4.    GROUNDWATER MONITORING PROGRAMS AND STATISTICAL
ANALYSIS

   4.1 THE GROUNDWATER MONITORING CONTEXT	 4-1
   4.2 RCRA GROUNDWATER MONITORING PROGRAMS	4-3
   4.3 STATISTICAL SIGNIFICANCE IN GROUNDWATER TESTING	 4-6
      4.3.1 Statistical Factors	 4-8
      4.3.2 Well System Design and Sampling Factors	 4-8
      4.3.3 Hydrological Factors	 4-9
      4.3.4 Geochemical Factors	4-10
      4.3.5 Analytical Factors	 4-10
      4.3.6 Data or Analytic Errors	 4-11

CHAPTER 5.    ESTABLISHING  AND UPDATING BACKGROUND

   5.1 IMPORTANCE OF BACKGROUND	5-1
      5.1.1 Tracking Natural Groundwater Conditions	5-2
   5.2 ESTABLISHING AND REVIEWING BACKGROUND	5-2
      5.2.1 Selecting Monitoring Constituents and Adequate Sample Sizes	5-2
      5.2.2 Basic Assumptions About Background.	 5-4
      5.2.3 Outliers in Background.	 5-5
      5.2.4 Impact of Spatial Variability	 5-6
      5.2.5 Trends in Background.	5-7
      5.2.6 Summary: Expanding Background Sample Sizes	5-8
      5.2.7 Review of Background.	5-10
   5.3 UPDATING BACKGROUND	5-12
      5.3.1 When  to Update	5-12
      5.3.2 How to Update	 5-12
      5.3.3 Impact of Retesting	 5-14
      5.3.4 Updating When Trends are Apparent.	5-14

CHAPTER 6.    DETECTION MONITORING PROGRAM  DESIGN

   6.1 INTRODUCTION	6-1
   6.2 Elements of the Statistical Program Design	 6-2
      6.2.1 The Multiple Comparisons Problem	6-2
      6.2.2 Site-Wide False Positive  Rates [SWFPR]	6-7
      6.2.3 Recommendations for Statistical Power.	6-13
      6.2.4 Effect Sizes and Data-Based Power Curves	 6-18
      6.2.5 Sites Using More Than One Statistical Method.	6-21
   6.3 HOW KEY ASSUMPTIONS IMPACT STATISTICAL DESIGN	6-25
      6.3.1 Statistical Independence	6-25
      6.3.2 Spatial Variation: Interwell vs. Intrawell Testing	6-29
      6.3.3 Outliers	 6-34
      6.3.4 Non-Detects	 6-36
   6.4 DESIGNING DETECTION MONITORING TESTS	6-38
      6.4.1 T-Tests	6-38
      6.4.2 Analysis Of Variance [ANOVA]	6-38
      6.4.3 Trend  Tests	 6-41
      6.4.4 Statistical Intervals	6-42
      6.4.5 Control Charts	 6-46
   6.5 SITE DESIGN EXAMPLES	6-46
CHAPTER 7.   STRATEGIES FOR COMPLIANCE/ASSESSMENT MONITORING AND
CORRECTIVE ACTION

   7.1 INTRODUCTION	7-1
   7.2 HYPOTHESIS TESTING STRUCTURES	7-3
   7.3 GROUNDWATER PROTECTION STANDARDS	7-6
   7.4 DESIGNING A STATISTICAL PROGRAM	7-9
      7.4.1 False Positives and Statistical Power in Compliance/Assessment.	7-9
      7.4.2 False Positives and Statistical Power In Corrective Action	7-12
      7.4.3 Recommended Strategies	7-13
      7.4.4 Accounting for Shifts and Trends	7-14
      7.4.5 Impact of Sample Variability, Non-Detects, And Non-Normal Data	7-17
   7.5 COMPARISONS TO BACKGROUND DATA	7-20

CHAPTER 8.   SUMMARY OF RECOMMENDED METHODS

   8.1 SELECTING THE RIGHT STATISTICAL METHOD	8-1
   8.2 TABLE 8.1 INVENTORY OF RECOMMENDED METHODS	8-4
   8.3 METHOD SUMMARIES	8-9
               PART II.  DIAGNOSTIC METHODS AND TESTING

CHAPTER 9.   COMMON EXPLORATORY TOOLS

   9.1 TIME SERIES PLOTS	9-1
   9.2 BOX PLOTS	9-5
   9.3 HISTOGRAMS	9-8
   9.4 SCATTER PLOTS	9-13
   9.5 PROBABILITY PLOTS	9-16

CHAPTER 10.    FITTING DISTRIBUTIONS

   10.1 IMPORTANCE OF DISTRIBUTIONAL MODELS	10-1
   10.2 TRANSFORMATIONS TO NORMALITY	10-3
   10.3 USING THE NORMAL DISTRIBUTION AS A DEFAULT	10-5
   10.4 COEFFICIENT OF VARIATION AND COEFFICIENT OF SKEWNESS	10-9
   10.5 SHAPIRO-WILK AND SHAPIRO-FRANCIA NORMALITY TESTS	10-13
      10.5.1 Shapiro-Wilk Test (n < 50)	 10-13
      10.5.2 Shapiro-Francia Test (n > 50)	 10-15
   10.6 PROBABILITY PLOT CORRELATION COEFFICIENT	10-16
   10.7 SHAPIRO-WILK MULTIPLE GROUP TEST OF NORMALITY	10-19

CHAPTER 11.    TESTING EQUALITY OF VARIANCE

   11.1 BOX PLOTS	11-2
   11.2 LEVENE'S TEST	11-4
   11.3 MEAN-STANDARD DEVIATION SCATTER PLOT	 11-8
CHAPTER 12.    IDENTIFYING OUTLIERS

   12.1 SCREENING WITH PROBABILITY PLOTS	12-1
   12.2 SCREENING WITH BOX PLOTS	 12-5
   12.3 DIXON'S TEST	12-8
   12.4 ROSNER'S TEST	12-10

CHAPTER 13.    SPATIAL VARIABILITY

   13.1 INTRODUCTION TO SPATIAL VARIATION	13-1
   13.2 IDENTIFYING SPATIAL VARIABILITY	13-2
      13.2.1 Side-by-Side Box Plots	 13-2
      13.2.2 One-Way Analysis of Variance for Spatial Variability.	13-5
   13.3 USING ANOVA TO IMPROVE PARAMETRIC INTRAWELL TESTS	13-8

CHAPTER 14.    TEMPORAL VARIABILITY

   14.1 TEMPORAL DEPENDENCE	14-1
   14.2 IDENTIFYING TEMPORAL EFFECTS AND CORRELATION	14-3
      14.2.1 Parallel Time Series Plots	 14-3
      14.2.2 One-Way Analysis of Variance for Temporal Effects	14-6
      14.2.3 Sample Autocorrelation Function	  14-12
      14.2.4 Rank von Neumann Ratio Test.	  14-16
   14.3 CORRECTING FOR TEMPORAL EFFECTS AND CORRELATION	  14-18
      14.3.1 Adjusting the Sampling Frequency and/or Test Method.	  14-18
      14.3.2 Choosing a Sampling Interval Via Darcy's Equation	  14-19
      14.3.3 Creating Adjusted, Stationary Measurements	  14-28
      14.3.4 Identifying Linear Trends Amidst  Seasonality: Seasonal Mann-Kendall Test..... 14-37

CHAPTER 15.    MANAGING NON-DETECT DATA

   15.1 GENERAL CONSIDERATIONS FOR NON-DETECT DATA	15-1
   15.2 IMPUTING NON-DETECT VALUES BY SIMPLE SUBSTITUTION	 15-3
   15.3 ESTIMATION BY KAPLAN-MEIER	 15-7
   15.4 ROBUST REGRESSION ON ORDER STATISTICS	15-13
   15.5 OTHER METHODS FOR A SINGLE CENSORING  LIMIT	  15-21
       15.5.1  Cohen's Method.	  15-21
       15.5.2  Parametric Regression on Order Statistics	15-23
   15.6 USE OF THE 15% AND 50% NON-DETECT RULE	15-24

              PART III.  DETECTION MONITORING TESTS


CHAPTER 16.    TWO-SAMPLE TESTS

   16.1 PARAMETRIC T-TESTS	16-1
      16.1.1 Pooled Variance T-Test.	 16-4
      16.1.2 Welch's T-Test.	  16-7
      16.1.3 Welch's T-Test and Lognormal Data	16-10
   16.2 WILCOXON RANK-SUM TEST	16-14
   16.3 TARONE-WARE TWO-SAMPLE TEST FOR CENSORED DATA	16-20
CHAPTER 17.    ANOVA, TOLERANCE LIMITS, AND TREND TESTS

   17.1 ANALYSIS OF VARIANCE [ANOVA]	17-1
      17.1.1 One-Way Parametric F-Test	 17-1
      17.1.2 Kruskal-Wallis Test	 17-9
   17.2 TOLERANCE LIMITS	17-14
      17.2.1 Parametric Tolerance Limits	  17-15
      17.2.2 Non-Parametric Tolerance Intervals	  17-18
   17.3 TREND TESTS	  17-21
      17.3.1 Linear Regression	  17-23
      17.3.2 Mann-Kendall Trend Test.	  17-30
      17.3.3 Theil-Sen Trend Line	  17-34

CHAPTER 18.    PREDICTION LIMIT PRIMER

   18.1 INTRODUCTION TO PREDICTION LIMITS	18-1
      18.1.1 Basic Requirements for Prediction Limits	 18-4
      18.1.2 Prediction Limits With Censored Data	  18-6
   18.2 PARAMETRIC PREDICTION LIMITS	18-7
      18.2.1 Prediction Limit for m Future Values	18-7
      18.2.2 Prediction Limit for a Future Mean	18-11
   18.3 NON-PARAMETRIC PREDICTION LIMITS	18-16
      18.3.1 Prediction Limit for m Future Values	18-17
      18.3.2 Prediction Limit for a Future Median	18-20

CHAPTER 19.    PREDICTION LIMIT STRATEGIES WITH RETESTING

   19.1 RETESTING STRATEGIES	19-1
   19.2 COMPUTING SITE-WIDE FALSE POSITIVE RATES [SWFPR]	19-4
      19.2.1 Basic Subdivision Principle	 19-7
   19.3 PARAMETRIC PREDICTION LIMITS WITH RETESTING	19-11
      19.3.1 Testing Individual Future Values	19-15
      19.3.2 Testing Future Means	  19-20
   19.4 NON-PARAMETRIC PREDICTION LIMITS WITH RETESTING	  19-26
      19.4.1 Testing Individual Future Values	19-30
      19.4.2 Testing Future Medians	  19-31

CHAPTER 20.    MULTIPLE COMPARISONS USING CONTROL CHARTS

   20.1 INTRODUCTION TO CONTROL CHARTS	 20-1
   20.2 BASIC PROCEDURE	 20-2
   20.3 CONTROL CHART REQUIREMENTS AND ASSUMPTIONS	  20-6
      20.3.1 Statistical Independence and Stationarity	 20-6
      20.3.2 Sample Size, Updating Background.	  20-8
      20.3.3 Normality and Non-Detect Data	  20-9
   20.4 CONTROL CHART PERFORMANCE CRITERIA	20-11
      20.4.1 Control Charts with Multiple Comparisons	  20-12
      20.4.2 Retesting in Control Charts	  20-14
   PART IV.  COMPLIANCE/ASSESSMENT AND CORRECTIVE
ACTION TESTS

CHAPTER 21.    CONFIDENCE INTERVALS

   21.1 PARAMETRIC CONFIDENCE INTERVALS	21-1
      21.1.1 Confidence Interval Around a Normal Mean	 21-3
      21.1.2 Confidence Interval Around a Lognormal Geometric Mean	 21-5
      21.1.3 Confidence Interval Around a Lognormal Arithmetic Mean	 21-8
      21.1.4 Confidence Interval Around an Upper Percentile	  21-11
   21.2 NON-PARAMETRIC CONFIDENCE INTERVALS	  21-14
   21.3 CONFIDENCE INTERVALS AROUND TREND LINES	  21-23
      21.3.1 Parametric Confidence Band Around Linear Regression	  21-23
      21.3.2 Non-Parametric Confidence Band Around a Theil-Sen Line	  21-30

CHAPTER 22.    COMPLIANCE/ASSESSMENT AND CORRECTIVE ACTION TESTS

   22.1 CONFIDENCE INTERVAL TESTS FOR MEANS	22-1
      22.1.1 Pre-Specifying Power in Compliance/Assessment	22-2
      22.1.2 Pre-Specifying False Positive Rates in Corrective Action	 22-9
   22.2 CONFIDENCE INTERVAL TESTS FOR UPPER PERCENTILES	22-18
      22.2.1 Upper Percentile Tests in Compliance/Assessment	 22-19
      22.2.2 Upper Percentile Tests in Corrective Action	 22-20

APPENDICES1

A.I   REFERENCES
A.2   GLOSSARY
A.3   INDEX
B     HISTORICAL NOTES
C.I   SPECIAL STUDY:  NORMAL VS.  LOGNORMAL PREDICTION LIMITS
C.2   CALCULATING STATISTICAL POWER
C.3   R SCRIPTS
D     STATISTICAL TABLES
 The full table of contents for Appendices A through D is found at the beginning of the Appendices

       PART  I.    STATISTICAL  DESIGN AND
                              PHILOSOPHY
     Chapter 1 provides introductory information, including the purposes and goals of the guidance, as
well as its potential applicability to other environmental programs. Chapter 2 presents a brief discussion
of the existing regulations and identifies key portions of these rules which need to be addressed from a
statistical standpoint, as well as some recommendations. In Chapter 3, fundamental statistical principles
are highlighted which play a prominent role in the Unified Guidance including the notions of individual
test false positive and negative decision errors and the accumulation of such errors across multiple tests
or comparisons. Chapter 4 sets the groundwater monitoring program context, the nature of formal
statistical  tests for  groundwater and  some caveats in identifying statistically significant increases.
Typical groundwater monitoring scenarios also are described in this chapter. Chapter 5 describes how
to establish background and how to periodically update it. Chapters 6 and 7 outline various factors to be
considered when designing a  reasonable  statistical strategy for  use  in detection  monitoring,
compliance/assessment monitoring,  or  corrective  action.  Finally,  Chapter  8  summarizes  the
recommended statistical tests and methods, along with  a concise review  of assumptions, conditions of
use, and limitations.
 CHAPTER 1.   OBJECTIVES AND  POTENTIAL USE OF THIS
                                   GUIDANCE
       1.1   OBJECTIVES	1-1
       1.2   APPLICABILITY TO OTHER ENVIRONMENTAL PROGRAMS	1-3
1.1 OBJECTIVES

      The  fundamental  goals  of  the  RCRA  groundwater monitoring  regulations  are  fairly
 straightforward. Regulated parties are to accurately characterize existing groundwater quality at their
 facility, assess whether a hazardous constituent release has occurred and, if so, determine whether
 measured levels meet the compliance standards. Using  accepted  statistical testing,  evaluation of
 groundwater quality should have a high probability of leading to correct decisions about  a facility's
 regulatory status.

       To implement these goals,  EPA first promulgated regulations in 1980  (for interim  status
 facilities) and 1982 (permitted facilities) for detecting contamination of groundwater at hazardous waste
 Subtitle C land disposal facilities.  In 1988, EPA revised portions of those regulations found at 40 CFR
 Part 264,  Subpart F. A similar set of regulations applying to Subtitle  D municipal and industrial waste
 facilities was adopted in 1991 under 40  CFR Part 258. In April 2006, certain modifications were made
 to the 40 CFR Part 264 groundwater monitoring regulations affecting statistical testing and decision-
 making.

       EPA  released the Interim Final Guidance [IFG] in  1989  for implementing the statistical
 methods and sampling procedures identified in the 1988 rule. A second guidance document followed in
 July  1992  called Addendum to Interim Final  Guidance  [Addendum], which  expanded certain
 techniques and also served as guidance for the newer Subtitle D regulations.

       As the RCRA groundwater monitoring program has matured, it became apparent that the existing
 guidance needed to be updated to adequately cover statistical methods and issues important to detecting
 changes  in  groundwater.1 Research conducted in  the area of groundwater statistics since 1992 has
 provided a number of improved statistical techniques. At the same time, experience gained in applying
 the regulatory  statistical  tests in groundwater monitoring  contexts has identified certain constraints.
 Both needed  to be factored  into the guidance. This Unified Guidance document addresses these
 concerns and supersedes both the earlier IFG and Addendum.

1  Some recommendations in EPA's Statistical Training Course on Groundwater Monitoring were developed to better
   reflect the reality of groundwater conditions at many sites, but were not generally available in published form. See RCRA
   Docket #EPA\530-R-93-003, 1993.

       The Unified Guidance offers guidance to owners and operators, EPA Regional and State
 personnel, and other interested parties in selecting, using, and interpreting appropriate statistical
 methods for evaluating data under the RCRA groundwater monitoring regulations. The guidance
 identifies recent approaches and recommends a consistent framework for applying these methods. One
 key aspect of the Unified Guidance is providing a systematic application of the basic statistical principle
 of balancing false positive and false negative errors in designing good testing procedures (i.e., minimizing
 both the risk of falsely declaring a site to be out-of-compliance and of missing  real  evidence of an
 adverse change in the groundwater). Topics addressed in the guidance include basic statistical concepts,
 sampling design and sample sizes, selection of appropriate statistical approaches, how to check data and
 run statistical tests, and the interpretation of results. References for the suggested procedures and to
 more general statistical texts are provided. The guidance notes when expert statistical consultation may
 be advisable. Such guidance may also have applicability to other remedial activities as well.
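
       The balancing described above can be illustrated with a short, hypothetical power calculation in R
 (the sample size, effect size, and variability are assumed values chosen only for illustration): tightening
 the false positive rate reduces the power to detect a real increase unless more data or a larger effect is
 available.

```r
# Power of a one-sided, two-sample t-test for an assumed mean increase of
# 2 concentration units, standard deviation 1.5, and 8 samples per group.
power.t.test(n = 8, delta = 2, sd = 1.5, sig.level = 0.05,
             type = "two.sample", alternative = "one.sided")$power

# The same comparison with a stricter false positive rate of 0.01 has
# noticeably lower power -- the trade-off a good design must balance.
power.t.test(n = 8, delta = 2, sd = 1.5, sig.level = 0.01,
             type = "two.sample", alternative = "one.sided")$power
```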

       Enough commonality exists in sampling, analysis, and evaluation under the RCRA regulatory
 requirements that the Unified Guidance often suggests relatively general strategies. At the same time,
 there may be  situations where site-specific considerations for sampling and  statistical  analysis are
 appropriate or needed.  EPA policy has been  to  promulgate regulations that are specific enough to
 implement, yet flexible in accommodating a wide variety of site-specific environmental factors.  Usually
 this is accomplished by specifying criteria appropriate for the majority of monitoring situations,  while at
 the same time allowing alternatives that are also protective of human health and the environment.

       40 CFR Parts 264 and 258 allow the use of other sampling procedures and test methods2 beyond
 those explicitly identified in the regulations,3 subject to approval by the Regional Administrator or state
 Director. Alternative test methods must be able to meet the performance standards at §264.97(i) or
 §258.53(h). While some of these performance standards are quite specific, others are much less so.
 Accordingly, further guidance is provided concerning the types of procedures that should
 generally satisfy such performance standards.

2 For example, §264.97(g)(2), §264.97(h)(5) and §258.53(g)(5)

3 §264.97(g)(1), §264.97(h)(1-4), and §258.53(g)(1-4) respectively

       Although the Part 264  and 258  regulations explicitly  identify  five basic formal statistical
 procedures for testing two- or multiple-sample comparisons characteristic of detection  monitoring, the
 rules are silent on specific tests under compliance or corrective action monitoring when a groundwater
 protection standard is fixed (a  one-sample comparison). The rules also  require consideration of data
 patterns (normality, independence, outliers, non-detects, spatial and temporal dependence), but do not
 identify specific tests.  This  document expands  the potential  statistical procedures  to  cover these
 situations  identified in  earlier  guidance, thus  providing a comprehensive single EPA  reference  on
 statistical methods  generally recommended for RCRA groundwater monitoring programs. Not every
 technique will be appropriate in a given situation, and in many cases more than one statistical approach
 can be used. The Unified Guidance is meant to be broad enough in scope to cover a high percentage of
 the potential situations a user might encounter.

       The Unified  Guidance is not designed  as a treatise for statisticians; rather it is  aimed at the
 informed groundwater professional with a limited background in statistics. Most methods discussed are
 well-known to statisticians, but  not necessarily to regulators, groundwater engineers or scientists. A key
 thrust  of the Unified Guidance has been to tailor the standard statistical techniques to the RCRA
 groundwater arena and its unique constraints. Because of this emphasis, not every variation of each test
 is discussed in detail. For example, groundwater monitoring in a detection monitoring program is
 generally concerned  with increases  rather than decreases  in  concentration levels  of monitored
 parameters. Thus, most detection monitoring tests in the Unified Guidance are presented as one-sided
 upper-tailed tests. In the sections covering compliance and corrective action monitoring (Chapters 21
 and 22 in Part IV), either one-sided lower-tail or upper-tail tests are recommended depending on the
 monitoring  program.  Users requiring two-tailed tests  or additional information may need to consult
 other guidance or the statistical references listed at the end of the Unified Guidance.

       The Unified Guidance is not intended to cover all statistical methods that might be applicable to
 groundwater. The technical literature is even more extensive, including other published frameworks for
 developing  statistical  programs at RCRA facilities. Certain statistical methods and general strategies
 described in the Unified Guidance are outlined in American Society for Testing and Materials [ASTM]
 documents  entitled  Standard  Guide for  Developing Appropriate  Statistical  Approaches for
 Groundwater Detection Monitoring Programs (D6312-98[2005]) (ASTM, 2005) and Standard Guide
for Applying Statistical Methods for Assessment and Corrective Action Environmental Monitoring
 Programs (D7048-04) (ASTM, 2004).

      The first  of these  ASTM guidelines  primarily  covers  strategies for detection  monitoring,
 emphasizing the use of prediction limits and control charts. It also contains a series of flow diagrams
 aimed at guiding the user to an appropriate statistical approach. The second guideline covers statistical
 strategies useful in compliance/assessment monitoring and corrective action. While not identical  to
 those described in the Unified Guidance, the ASTM guidelines do provide an alternative framework for
 developing  statistical programs at RCRA facilities and are worthy of careful consideration.

       EPA's primary consideration in  developing the Unified Guidance was to select methods both
 consistent with the RCRA regulations, as well as straightforward to implement. We believe the methods
 in the guidance are not only effective, but also understandable and easy to use.

1.2 APPLICABILITY TO  OTHER ENVIRONMENTAL PROGRAMS

       The  Unified Guidance is tailored to the  context of the RCRA groundwater monitoring
 regulations. Some of the techniques described are unique to this guidance. Certain regulatory
 constraints and the nature of groundwater monitoring limit how statistical procedures are likely to be
 applied. These include typically small sample sizes during a given evaluation period, monitoring and
 evaluation conducted at least annually (and typically at least semi-annually), often a large number of
 potential monitoring constituents, background-to-downgradient well comparisons, and a limited set of
 identified statistical methods. There are also unique regulatory performance constraints such as
 §264.97(i)(2), which requires a minimum single-test false positive (α) level of 0.01 and a minimum 0.05
 level for multiple comparison procedures such as analysis of variance [ANOVA].

      There are  enough commonalities with other regulatory groundwater monitoring programs  (e.g.,
 certain distributional features of routinely monitored background groundwater constituents) to allow for
 more general use of the tests and methods in the Unified Guidance. Many of these test methods and the
 consideration  of false positive and  negative errors in  site design are directly applicable to corrective
 action  evaluations of solid waste  management units under  40 CFR  264.101  and Comprehensive
 Environmental  Response,  Compensation,  and Liability  Act [CERCLA]  groundwater monitoring
 programs.

      There are also comparable situations involving other environmental media to which the Unified
 Guidance  statistical methods might be applied. Groundwater detection  monitoring involves either a
 comparison between different monitoring stations  (i.e., downgradient compliance wells vs. upgradient
 wells) or a contrast between past and present data within a given station (i.e., intrawell comparisons).
 To the extent that an environmental monitoring station is essentially fixed in location (e.g., air quality
 monitors,  surface water stations) and measurements are made over time, the same statistical methods
 may be applicable.

       The Unified Guidance also details methods to compare background data against measurements
 from regulatory compliance points. These procedures (e.g., Welch's t-test, prediction limits with
 retesting, etc.) are designed to contrast multiple groups of data. Many environmental problems involve
 similar comparisons, even if the groups of data  are not collected at fixed monitoring stations (e.g., as in
 soil sampling). Furthermore, the guidance describes diagnostic techniques for checking the assumptions
 underlying many statistical  procedures. Testing of normality is ubiquitous in environmental statistical
 analysis. Also common are checks of statistical independence in time series data, the assumption of
 equal variances across  different populations, and the need to identify outliers. The Unified Guidance
 addresses each of these topics, providing useful  guidance and worked out examples.

       Finally, the  Unified  Guidance discusses  techniques for  comparing datasets against  fixed
 numerical standards (as in compliance monitoring or corrective action). Comparison of data against a
 fixed standard is encountered in many regulatory programs. The methods described in Part IV of the
 Unified Guidance could therefore have wider applicability, despite being tailored to the  groundwater
 monitoring data context.

       EPA recognizes that many guidance users  will make use of either commercially available or
 proprietary statistical software in applying these statistical methods. Because of the wide range and
 diversity of such software, the Unified Guidance does not evaluate software usage or applicability. Certain
 software is provided with the guidance. The guidance  limits itself to describing the basic statistical
 principles underlying the application of the  recommended tests.

               CHAPTER 2.   REGULATORY  OVERVIEW
       2.1   REGULATORY SUMMARY	2-1
       2.2   SPECIFIC REGULATORY FEATURES AND STATISTICAL ISSUES	2-6
         2.2.1   Statistical Methods Identified Under §264.97(h) and §258.53(g)	2-6
         2.2.2   Performance Standards Under §264.97(i) and §258.53(h)	2-7
         2.2.3   Hypothesis Tests in Detection, Compliance/Assessment, and Corrective Action Monitoring	2-10
         2.2.4   Sampling Frequency Requirements	2-10
         2.2.5   Groundwater Protection Standards	2-12
       2.3   UNIFIED GUIDANCE RECOMMENDATIONS	2-13
         2.3.1   Interim Status Monitoring	2-13
         2.3.2   Parts 264 and 258 Detection Monitoring Methods	2-14
         2.3.3   Parts 264 and 258 Compliance/assessment Monitoring	2-15
      This chapter generally summarizes the RCRA groundwater monitoring regulations under 40 CFR
 Parts 264, 265 and 258 applicable to this guidance. A second section  identifies the most critical
 regulatory statistical issues and how they  are addressed by this guidance. Finally, recommendations
 regarding interim status facilities and certain statistical methods in the regulations are presented at the
 end of the chapter.

2.1 REGULATORY SUMMARY

       Section 3004 of RCRA directs EPA to establish regulations applicable to owners and operators
 of facilities that treat, store, or dispose of hazardous waste as may be necessary to protect human health
 and the environment. Section 3005 provides for the  implementation of these standards under permits
 issued to owners and operators by EPA or authorized States. These regulations are codified in 40 CFR
 Part 264. Section 3005 also provides that owners and operators of facilities in existence at the time of
 the  regulatory or statutory requirement for a permit, who apply for and  comply with applicable
 requirements, may operate until a permit determination is made. These  facilities are commonly known
 as interim status facilities, which must comply with the standards promulgated in 40 CFR Part 265.

      EPA first promulgated the groundwater monitoring regulations under Part 265 for interim status
 surface impoundments, landfills and land treatment  units ("regulated units") in  1980.1 Intended as a
 temporary system for units awaiting full permit requirements, the rules set out a minimal detection and
 assessment monitoring system consisting of at least a single upgradient and three downgradient wells.
 Following collection of the minimum number of samples prescribed in the rule for four  indicator
 parameters — pH, specific conductance, total organic carbon (TOC) and total organic halides (TOX) —
 and certain constituents  defining overall groundwater quality, the owner/operator of a land disposal
 facility is  required to implement a detection monitoring program.  Detection monitoring consists of
 upgradient-to-downgradient comparisons using the Student's t-test of the four indicator parameters at
 no less than a .01 level of significance (α). The regulations refer to the use of "replicate" samples for
 contaminant indicator comparisons. Upon failure of a single detection-level test, as well as a repeated
 follow-up test, the facility is required to conduct an assessment program identifying concentrations of
 hazardous waste constituents from the unit in groundwater. A facility can return to detection monitoring
 if none of the latter constituents are detected. These regulations are still in effect today.

1  [45 FR 33232ff, May 19, 1980] Interim status regulations; later amended in 1983 and 1985

       Building on the interim status rules,  Subtitle C regulations for Part 264 permitted hazardous
 waste facilities followed in 1982,2 where the basic elements of the present RCRA groundwater
 monitoring program are defined. In §264.91, three monitoring programs —  detection  monitoring,
 compliance monitoring,  and corrective  action  — serve  to protect  groundwater  from  releases  of
 hazardous waste constituents at  certain regulated land disposal units (surface  impoundments, waste
 piles, landfills,  and land  treatment). In  developing permits,  the Regional Administrator/State Director
 establishes groundwater protection  standards [GWPS] under §264.92 using concentration limits
 [§264.94] for certain monitoring constituents [§264.93]. Compliance well monitoring locations are
 specified in the permit following the rules in §264.95 for the required compliance  period [§264.96].
 General monitoring requirements were established in §264.97, along with specific detection [§264.98],
 compliance [§264.99], and corrective action  [§264.100] monitoring requirements. Facility owners and
 operators are  required to  sample groundwater at specified intervals and to use a statistical procedure to
 determine whether  or not hazardous wastes  or  constituents from  the facility are contaminating the
 groundwater.

2 [47 FR 32274ff, July 26, 1982] Permitting Requirements for Land Disposal Facilities

       As found in §264.91, detection monitoring is the first stage of monitoring when no or minimal
 releases have been identified, designed to allow identification of significant changes in the groundwater
 when compared to background  or established baseline levels. Downgradient well observations are
 tested against established background data, including measurements from upgradient wells. These are
 known as two- or multiple-sample tests.

       If there is statistically significant evidence  of a release of hazardous constituents [§264.91(a)(l)
 and  (2)], the  regulated unit  must  initiate compliance  monitoring,  with  groundwater  quality
 measurements compared to   the groundwater protection standards [GWPS]. The owner/operator is
 required to conduct a more extensive Part 261 Appendix VIII (later Part 264 Appendix IX)3 evaluation
 to determine if additional hazardous constituents must be added to the compliance monitoring list.

3 [52 FR 25942, July 9, 1987] List (Phase I) of Hazardous Constituents for Groundwater Monitoring; Final Rule

       Compliance/assessment as well as corrective action monitoring differ from detection monitoring
 in  that groundwater well data are tested against the groundwater protection standards  [GWPS]  as
 established in the permit. These may  be fixed health-based standards such as Safe Drinking Water Act
 [SDWA] maximum concentration limits [MCLs],  §264.94 Table  1 values,  a value  defined from
 background,  or alternate-concentration limits as  provided  in  §264.94(a).  Statistically, these are
 considered single-sample tests against a fixed limit (a background limit can either be a single- or  two-
 sample test depending on how the limit is defined).  An exceedance occurs when a constituent level is
 shown to be significantly greater than the GWPS or compliance standard.

       If a hazardous monitoring constituent under compliance monitoring statistically exceeds the
 GWPS  at any  compliance well, the facility is  subject to corrective action  and monitoring under
 §264.100. Following remedial action, a return to compliance consists of a statistical demonstration that
 the concentrations of all relevant hazardous constituents lie below their respective standards. Although
 the rules  define a three-tiered approach, the Regional Administrator or  State Director can assess
 available information at the  time of permit development to  identify which monitoring program is
 appropriate [§264.91(b)].

       Noteworthy features of the 1982  rule included retaining  use  of  the four Part 265  indicator
 parameters, but allowing for additional constituents in detection monitoring. The number of upgradient
 and downgradient wells was  not specified;  rather the requirement is  to  have a  sufficient number of
 wells  to  characterize  upgradient  and downgradient  water quality passing beneath a regulated  unit.
 Formalizing the "replicate" approach in the 1980 rules and the use of Student's t-test, rules under
 §264.97 required the use of aliquot replicate samples, which  involved analysis of at least four physical
 splits of a single volume of water. In addition, Cochran's Approximation to the Behrens-Fisher [CABF]
 Student's t-test was specified for detection monitoring at no less than a .01 level of significance (α).
 Background sampling was specified  for a one-year period consisting  of  four quarterly samples  (also
 using the aliquot approach). The rules allowed use of a repeated, follow-up test subsequent to failure of
 a detection monitoring test. A minimum of semi-annual sampling was required.

       In  response to a number of concerns  with  these regulations, EPA  amended portions of the 40
 CFR Part 264 Subpart F regulations including statistical methods and sampling procedures on October
 11, 1988.4 Modifications to the regulations included requiring (if  necessary)  that owners  and/or
 operators more accurately characterize the hydrogeology and  potential contaminants at the facility. The
 rule also identifies specific performance standards in  the regulations that all  the statistical methods and
 sampling procedures  must meet  (discussed  in  a  following  section).  That  is,  it is  intended that the
 statistical methods and sampling procedures meeting these performance  standards defined in §264.97
 have a low probability both of indicating contamination when it is not present (Type I error),  and of
 failing to detect contamination that actually is present (Type  n error). A facility owner and/or operator
 must demonstrate that a procedure is appropriate for the site-specific conditions at the  facility, and
 ensure that it meets  the performance standards. This demonstration applies to  any of the statistical
 methods  and  sampling  procedures outlined in the  regulation as well as  any alternate methods or
 procedures proposed by facility owners and/or operators.

4 [53 FR 39720, October 11, 1988] 40 CFR Part 264: Statistical Methods for Evaluating Groundwater Monitoring From
  Hazardous Waste Facilities; Final Rule

       In addition, the amendments removed the required use of the CABF Student's t-test, in favor of
 five different statistical methods deemed to be more appropriate for analyzing groundwater monitoring
 data (discussed in a following section).  The CABF procedure is still retained in Part 264, Appendix IV,
 as  an  option, but there  are no longer specific citations in the regulations  for this test.  These newer
 procedures offer greater flexibility in designing a groundwater statistical  program appropriate to site-
 specific conditions.  A sixth option allows the use of alternative statistical  methods, subject to approval
 by  the Regional  Administrator. EPA  also  instituted new   groundwater  monitoring   sampling
 requirements,  primarily aimed  at ensuring  adequate statistical sample  sizes for use in analysis of
 variance  [ANOVA] procedures,  but  also allowing alternative sampling  plans to be approved  by the
 Regional Administrator. The requirements identify the need for statistically  independent samples to be
 used during evaluation. The Agency further recognizes that the selection of appropriate hazardous
 constituent monitoring parameters is an essential part of a reliable statistical evaluation. EPA addressed
 this issue in a 1987 Federal Register notice.5

5  [52 FR 25942, July 9, 1987] op. cit.

       §264.101 requirements for corrective action at non-regulated units were added in 1985 and later.6
 The Agency determined that since corrective action at non-regulated units would work under a different
 program, these units are not required to follow the detailed steps of Subpart F monitoring.

6  [50 FR 28747, July 15, 1985] Amended in 1987, 1993, and 1998

       In 1991, EPA promulgated Subtitle D groundwater monitoring regulations for municipal solid
 waste landfills in 40 CFR Part 258.7 These rules also incorporate a three-tiered groundwater monitoring
 strategy (detection monitoring, assessment monitoring, and corrective action), and describe statistical
 methods for determining whether background concentrations or the groundwater protection standards
 [GWPS] have been exceeded.

7  [56 FR 50978, October 9, 1991] 40 CFR Parts 257 & 258: Solid Waste Disposal Facility Criteria: Final Rule, especially
   Part 258 Subpart E Groundwater Monitoring and Corrective Action

       The statistical methods and related performance standards in 40 CFR Part 258 essentially mirror
 the requirements found as  of 1988  at 40 CFR Part 264 Subpart F, with certain differences. Minimum
 sampling frequencies are different than in the Subtitle C regulations.  The rules also specifically provide
 for the  GWPS using either current MCLs or  standardized risk-based limits  as well  as background
 concentrations.  In addition, a specific list of hazardous constituent analytes is identified in 40 CFR Part
 258, Appendix I for detection-level monitoring,  including the use of unfiltered (total) trace elements.

       The  1988  and  1991  rule  amendments identify certain statistical  methods  and sampling
 procedures believed appropriate for  evaluating groundwater monitoring  data under a variety  of
 situations. Initial guidance to implement these methods was released  in 1989 as: Statistical Analysis of
 Groundwater Monitoring Data at RCRA Facilities: Interim Final Guidance [IFG]. The IFG covered
 basic  topics such as checking distributional assumptions, selecting  one of the methods and sampling
 frequencies.  Examples  were  provided  for  applying  the recommended statistical procedures and
 interpreting the results.  Two types of compliance tests were provided for comparison to the GWPS —
 mean/median confidence intervals and upper limit tolerance intervals.

       Given additional interest from users of the comparable regulations adopted for Subtitle D solid
 waste facilities in 1991, and with experience gained in implementing various tests, EPA actively sought
 to improve existing groundwater statistical guidance. This culminated in a July 1992  publication of:
 Statistical Analysis of Groundwater Monitoring Data  at RCRA Facilities: Addendum to  Interim
 Final Guidance [Addendum].

       The 1992 Addendum included a chapter  devoted to retesting strategies, as well as new guidance
 on several non-parametric techniques not covered within the IFG. These included the Wilcoxon rank-
 sum test, non-parametric tolerance  intervals, and non-parametric prediction intervals.  The Addendum
 also included a reference approach  for evaluating statistical power to ensure that contamination could
 be adequately  detected.  The Addendum  did not replace the IFG  — the two documents contained
 overlapping material but were mostly intended to complement one another based on newer information
 and comments from statisticians and users of the guidance. However, the Addendum changed several
 recommendations within the IFG and replaced certain test methods first published in the IFG. The two
 documents provided contradictory guidance on several points, a concern addressed by this guidance.

       More recently in April 2006, EPA promulgated further changes to certain 40 CFR Part 264
 groundwater monitoring provisions as part of the Burden Reduction Initiative Rule.8  A brief summary
 of the regulatory changes and the potential effects on existing RCRA groundwater monitoring programs
 is provided. Four items of specific interest are:

         •  Elimination of the requirements to sample four successive times per statistical evaluation
            under §264.98(d) and §264.99(f) in favor of more flexible, site-specific options as identified
            in §264.97(g)(1) & (2);

         •  Removal of the requirements in §264.98(g) and §264.99(g) to annually sample all
            monitoring wells for Part 264 Appendix IX constituents in favor of a specific subset of
            wells;

         •  Modifications of these provisions to allow for a specific subset of Part 264 Appendix IX
            constituents tailored to site needs; and

         •  A change in the resampling requirement in §264.98(g)(3) from "within a month" to a site-
            specific schedule.

       These changes to the  groundwater monitoring provisions require coordination between the
 regulatory agency and owner/operator with final  approval by the agency. Since the regulatory changes
 are not issued under the 1984 Hazardous and Solid Waste Amendments [HSWA] to RCRA, authorized
 State RCRA program adoption  of these rules is discretionary.  States may choose  to maintain  more
 stringent requirements, particularly if already  codified  in existing regulations. Where EPA has direct
 implementation authority, the provisions would go into effect following promulgation.

       The first provision reaffirms the  flexible  approach in  the Unified Guidance for detection
 monitoring sampling frequencies and testing options. State RCRA programs using the four successive
 sampling requirement can continue to do so under §264.97(g)(1), but the rule now allows for
 alternate sampling frequencies under §264.97(g)(2) in both detection and compliance monitoring. The
 second and third provisions provide more site- and waste-specific options for Part  264 Appendix IX
 compliance monitoring. The final provision provides more  flexibility when resampling these Appendix
 IX constituents.

       Since portions of the earlier and the most recent rules are still operative,  all are considered in the
 present Unified Guidance. The effort to create  this guidance began in 1996, with a draft release in
 December 2004, a peer review in 2005, and a final version completed in 2009.
8  [71 FR 16862-16915] April 4, 2006

2.2 SPECIFIC REGULATORY  FEATURES AND STATISTICAL ISSUES

       This section describes critical portions of the RCRA groundwater monitoring regulations which
 the present guidance addresses. The regulatory language is provided below in bold and italics.9 Each
 issue is then briefly discussed in statistical terms, along with how the Unified Guidance addresses it.

2.2.1  STATISTICAL METHODS IDENTIFIED  UNDER §264.97(h) AND §258.53(g)

       The owner or operator will specify one of the following statistical methods to be  used in
       evaluating groundwater monitoring data for each  hazardous constituent which,  upon
       approval by the Regional Administrator, will be specified in the unit permit. The statistical test
       chosen shall be conducted separately for each hazardous constituent in each well...

       1. A parametric analysis of variance (ANOVA) followed by multiple comparison procedures
          to identify statistically  significant evidence of contamination. The method must include
          estimation and testing of the contrasts  between each compliance well's mean  and the
          background mean levels for each constituent.

       2. An analysis of variance  (ANOVA)  based on ranks followed by multiple  comparison
          procedures to identify statistically significant  evidence of contamination. The  method
          must include estimation and testing of the contrasts between  each compliance well's
          median and the background median levels for each constituent.

       3. A tolerance interval or prediction interval procedure in which an interval for each
          constituent is established from the distribution of the background data, and the level of
          each constituent in each compliance well is compared to the upper tolerance or prediction
          limit.

       4. A control chart approach that gives control limits for each  constituent.

       5. Another statistical  method  submitted by  the owner or operator and approved by  the
          Regional Administrator.

     Part III of the Unified  Guidance addresses these  specific  tests, as  applied to a detection
monitoring program.  It is assumed that  statistical testing will be conducted separately for each hazardous
constituent in each monitoring well.  The recommended non-parametric ANOVA method based on ranks
is identified  in this guidance as the Kruskal-Wallis test. ANOVA tests are discussed in Chapter 17.
Tolerance interval and prediction limit  tests are discussed  separately in Chapters  17 and  18, with
particular attention given  to implementing prediction limits with  retesting when  conducting  multiple
comparisons in Chapter 19. The recommended type of control chart is the combined Shewhart-CUSUM
control chart test, discussed  in Chapter 20. Where a groundwater protection standard  is based on
background levels, application of these  tests is discussed in Part I, Chapter 7 and Part IV, Chapter 22.
9  The following discussions somewhat condense the regulatory language for ease of presentation and understanding. Exact
  citations for regulatory text should be obtained from the most recent Title 40 Code of Federal Regulations.


     If a groundwater protection standard involves a fixed limit, none of the listed statistical methods in
these regulations directly apply. Consequently, a number of other single-sample tests for comparison
with a  fixed limit are recommended in Part IV. Certain statistical limitations encountered when using
ANOVA and tolerance level tests in detection and compliance monitoring are also discussed in these
chapters. Additional use  of ANOVA tests for diagnostic identification of spatial variation or temporal
effects  is discussed in Part II, Chapters 13 and 14.

2.2.2  PERFORMANCE STANDARDS  UNDER §264.97(i) AND §258.53(h)

       Any statistical method chosen under §264.97(h) [or §258.53(g)] for specification  in the unit
       permit shall comply with the following performance standards, as appropriate:

       1.  The statistical method used to evaluate ground-water monitoring data shall be appropriate
         for the distribution of chemical parameters or hazardous constituents. If the distribution of
          the chemical parameters or hazardous constituents is shown by the owner or operator to be
          inappropriate for a normal theory  test,  then  the  data should be  transformed or a
          distribution-free test should be used. If the distributions for  the constituents differ, more
          than one statistical method may be needed.

       2.  If an individual well comparison procedure is used to compare an individual compliance
          well  constituent  concentration   with  background  constituent  concentrations  or a
          groundwater protection standard, the test shall be done at a Type I error level no less than
          0.01 for each testing period. If a multiple comparisons procedure is used, the Type I
          experiment-wise error rate for each testing period shall be no less than 0.05; however, the
          Type I error of no less than 0.01 for individual well comparisons must be maintained. This
         performance standard does not apply to  control charts, tolerance intervals,  or prediction
          intervals.

       3.  If a control chart approach is used to evaluate groundwater monitoring data, the specific
          type of control chart and its associated parameter values shall be proposed by the owner or
          operator and approved by the Regional Administrator if he or she finds it to be protective
          of human health and the environment.

       4.  If a tolerance interval or a prediction interval is used to evaluate groundwater monitoring
          data, the levels of confidence, and for tolerance intervals, the percentage of the population
          that the interval must contain, shall be proposed by the owner or operator and approved by
          the Regional Administrator  if he or she finds it protective of human health  and the
          environment.  These parameters  will be determined after  considering the number of
          samples in the background data base, the data distribution, and  the range of the
          concentration values for each constituent of concern.

       5.  The statistical method shall account for data below the limit of detection with one or more
         procedures  that are protective  of human health and the  environment. Any practical
          quantification limit  (pql)  approved by the Regional Administrator under §264.97(h) [or
          §258.53(g)] that is used in the statistical method shall be the lowest concentration level that
          can be reliably achieved within specified limits  of precision and accuracy during routine
          laboratory operating conditions available  to the facility.


       6.  If necessary,  the statistical method shall include procedures to control or correct for
          seasonal and spatial variability as well as temporal correlation in the data.

     These performance  standards pertain to both the listed tests as well as others (such as those
recommended in Part IV of the guidance for comparison to fixed standards). Each of the performance
standards is addressed in Part I of the guidance for designing statistical monitoring programs and in
Part II of the guidance covering diagnostic testing.

     The first performance standard considers distributional properties of sample data; procedures for
evaluating normality, transformations to normality, or use of non-parametric (distribution-free) methods
are found in Chapter 10. Since some  statistical tests also require an assumption of equal variances
across  groups, Chapter 11 provides the relevant diagnostic tests. Defining an appropriate distribution
also requires  consideration of possible  outliers. Chapter 12 discusses techniques useful  in outlier
identification.

     The second performance standard identifies minimum  false positive error rates required when
conducting certain  tests.  "Individual well comparison procedures" cited  in  the  regulations include
various ANOVA-type tests, Student's t-tests, as well as one-sample compliance monitoring/corrective
action tests against a fixed standard. Per the regulations, these significance level (α) constraints do not
apply to the other listed statistical methods — control charts, tolerance intervals, or prediction intervals.

     When comparing an individual  compliance well against background, the probability of the test
resulting in a false positive or Type I error should be no less than 1 in 100 (1%). EPA required a
minimum Type I error level because, for a given test and fixed sample size, the false positive and false
negative rates are inversely related. By setting a floor of 1% on the Type I error rate, EPA felt that the
risk of incurring false positives would remain acceptably low, while preserving adequate statistical
power (i.e., the test's ability to control the false negative rate, that is, the rate of missing or not detecting
true changes in groundwater quality).

     Though a procedure to test an individual well like the Student's t-test may be appropriate for the
smallest  of facilities,  more  extensive  networks  of  groundwater monitoring wells and monitoring
parameters will generally require a multiple comparisons procedure. The 1988 regulations recognized
this need in specifying a one-way analysis of variance  [ANOVA]  procedure as the method of choice for
replacing the CABF Student's t-test. The F-statistic in an ANOVA does indeed
control the  site-wide or experiment-wise  error rate  when  evaluating  multiple upgradient and
downgradient  wells,  at  least for a  single  constituent.  Using  this  technique allowed  the Type I
experiment-wise error rate for each constituent to be controlled to  about 5% for each testing period.

      To maintain adequate statistical power, the regulations also mandate that the ANOVA procedure
be run at a minimum 5% false positive rate per constituent.  But  when  a full set of well-constituent
combinations is considered (particularly large suites of detection monitoring analytes at numerous
compliance wells), the site-wide false positive rate can be much greater than 5%. The one-way ANOVA
is  inherently an interwell technique, designed to simultaneously compare datasets from different well
locations.  Constituents with significant natural spatial variation are likely to  trigger the ANOVA  F-
statistic even in the absence of real contamination, an issue discussed in Chapter 13.
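
     As a rough numerical illustration only (not a regulatory calculation, and treating the separate
constituent tests as approximately independent), the cumulative site-wide chance of at least one false
trigger grows quickly with the number of constituents tested, even when each ANOVA is run at the 5%
experiment-wise rate per constituent. A minimal Python sketch:

        # Minimal sketch: cumulative site-wide false positive rate when k approximately
        # independent tests are each run at a 5% significance level per evaluation.
        alpha = 0.05
        for k in (1, 5, 10, 20, 40):
            site_wide = 1 - (1 - alpha) ** k          # P(at least one false positive)
            print(f"{k:>2} constituents: site-wide rate ~ {site_wide:.0%}")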

     Control charts, tolerance intervals, and prediction intervals provide alternate testing strategies for
simultaneously controlling false positive rates while maintaining adequate power to detect contamination
during detection monitoring. Although the rules do not require a minimum nominal false positive rate as
specified in the second performance standard,  use of tolerance or prediction intervals combined with a
retesting strategy can result in sufficiently low experiment-wise Type I error rates and the ability to
detect real contamination. Chapters 17, 18 and 20 consider how tolerance limits, prediction limits, and
control charts can be designed to meet the third and fourth performance standards specific to these
tests considering the number of samples  in background, the  data distribution,  and  the  range  of
concentration  values for each  constituent of concern  [COC]. Chapters  19  and  20  on multiple
comparison procedures using prediction limits or control charts identify how retesting can be used to
enhance power and meet the specified false positive objectives.

     The fifth performance standard requires statistical tests to account for non-detect data. Chapter 15
provides some alternative approaches for either adjusting or modeling sample data in the presence of
reported non-detects. Other chapters include modifications of standard tests to properly account for the
non-detect portion of data sets.

     The sixth performance standard requires  consideration of spatial or temporal (including seasonal)
variation in the  data. Such patterns can have  major statistical consequences and need to be carefully
addressed. Most classical statistical tests in this guidance require assumptions of data independence and
stationarity. Independence roughly means that observing a given sample  measurement does not allow a
precise prediction of other sample measurements.  Drawing colored balls  from an urn at random
illustrates and fits this requirement; in groundwater, sample volumes are assumed to be drawn more or
less at random from the population  of possible same-sized volumes comprising the underlying aquifer.
Stationarity assumes that the population being sampled has a constant mean and variance across time
and space.  Spatial  or temporal variation in the  well means and/or variances  can negate these test
assumptions.  Chapter 13 considers the use of ANOVA techniques to establish evidence of spatial
variation.  Modification  of  the statistical  approach  may  be necessary  in  this case;   in  particular,
background levels will need to be established  at each compliance well for future comparisons (termed
intrawell tests).  Control chart, tolerance limit, and prediction limit tests can  be  designed for intrawell
comparisons; these topics are considered in Part III of this guidance.

     Temporal variation can occur for a number of reasons — seasonal fluctuations, autocorrelation,
trends  over time, etc. Chapter 14 addresses these forms of temporal variation, along with recommended
statistical procedures. In order to achieve stationarity and independence, sample data may need to be
adjusted to  remove trends or other forms of temporal dependence. In these cases, the residuals remaining
after trend  removal  or  other  adjustments  are used for formal testing  purposes.  Correlation  among
monitoring constituents  within and  between compliance wells can occur, a subject also treated in this
chapter.
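
     Where a simple linear trend is present, one common adjustment (shown here as a minimal sketch
with hypothetical data; Chapter 14 describes the recommended procedures) is to fit and remove the
trend, carrying the residuals forward for formal testing:

        import numpy as np

        # Minimal sketch: detrend a hypothetical quarterly series so that the
        # residuals, rather than the raw measurements, are used in later tests.
        time = np.arange(12)                                   # 12 sampling events
        conc = 10 + 0.4 * time + np.random.default_rng(1).normal(0, 1, 12)

        slope, intercept = np.polyfit(time, conc, deg=1)       # least-squares linear fit
        residuals = conc - (intercept + slope * time)          # detrended values for testing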

     When evaluating statistical methods by these performance standards, it is important to recognize
that the ability of a particular  procedure to operate correctly in  minimizing unnecessary  false positives
while detecting possible contamination depends on several factors. These include  not only the choice of
significance level and test hypotheses, but  also the statistical test itself,  data distributions, presence or
absence of  outliers and non-detects, the presence or absence of spatial and temporal variation, sampling
requirements, number of samples and comparisons to be made, and frequency of sampling. Since all of
these  statistical  factors  interact to  determine  the procedure's effectiveness, any proposed statistical
procedure  needs  to be evaluated in its entirety, not by individual components. Part I, Chapter 5
discusses evaluation of potential background databases considering all of the performance criteria.

2.2.3 HYPOTHESIS  TESTS  IN   DETECTION,  COMPLIANCE/ASSESSMENT,   AND
       CORRECTIVE ACTION MONITORING

     The Part 264 Subpart F groundwater monitoring regulations do not specifically identify the test
hypotheses to be  used  in detection monitoring (§264.98),  compliance monitoring  (§264.99),  and
corrective  action  (§264.100). The same is true  for the parallel  Part 258 regulations for  detection
monitoring (§258.54), assessment  monitoring  (§258.55),  and  assessment of corrective  measures
(§258.56), as well as for evaluating interim status indicator parameters (§265.93) or Appendix III
constituents. However, the language of these regulations as well as accepted statistical principles allow
for clear  definitions  of the  appropriate  test  hypotheses.  Two-  or multiple-sample  comparisons
(background vs. downgradient well data) are usually involved in detection monitoring (the comparison
could also be made against an ACL limit based on background data). Units under detection monitoring
are initially presumed not to be contributing a release to the groundwater unless demonstrated otherwise.
From a statistical  testing standpoint, the population  of downgradient well measurements is assumed to
be equivalent to or no worse than those of the background population; typically this translates into an
initial or null hypothesis that the downgradient population mean is equal to or less than the background
population  mean.  Demonstration  of a release is triggered when one or more well constituents indicate
statistically significant levels above background.

     Compliance and corrective action tests generally compare single sets of sample data to a fixed limit
or a background standard. The language of §264.99 indicates that a significant increase above a GWPS
will demonstrate the need for corrective action. Consequently, the null hypothesis is that the compliance
population  mean  (or  perhaps an  upper percentile) is  at or  below a given standard. The  statistical
hypothesis  is thus quite similar to that of detection monitoring. In contrast, once an exceedance has been
established and  §264.100 is triggered, the null hypothesis  is  that  a site is contaminated  unless
demonstrated to be significantly below the GWPS.  The same principles apply to Part 258 monitoring
programs. In Part 265, the detection monitoring hypotheses apply to an evaluation of the contaminant
indicator parameters. The general subject of hypothesis testing is discussed in Chapter 3, and specific
statistical hypothesis formulations are found in Parts III and IV of this guidance.

2.2.4 SAMPLING FREQUENCY REQUIREMENTS

     Each  of the RCRA  groundwater  monitoring  regulations defines somewhat different minimum
 sampling requirements. §264.97(g)(1) & (2) provides two main options:

        1.  Obtaining a sequence of at least four samples taken at an interval that ensures, to the
           greatest extent technically feasible, that a statistically independent sample is obtained, by
           reference  to the  uppermost  aquifer  effective  porosity, hydraulic  conductivity,   and
           hydraulic gradient, and the fate and transport characteristics of potential contaminants;
           or

        2.  An alternate sampling procedure proposed by the owner or operator and approved by the
           Regional Administrator if protective of human health and the environment.

       Additional  regulatory  language  in  detection  [§264.98(d)]  and  compliance  [§264.99(f)]
monitoring reaffirms the first approach:

           [A] sequence of at least four samples from each well (background and compliance wells)
           must be collected at least semi-annually during detection/compliance monitoring...

       Interim status sampling requirements under §265.92[c] read as follows:

          (1)  For all monitoring wells, the owner or operator must establish initial background
          concentrations or values of all parameters specified in paragraph (b) of this section. He
          must do this quarterly for one year;

          (2)  For each  of the indicator parameters specified in paragraph (b)(3) of this section, at
          least four replicate  measurements must  be obtained for each  sample and the  initial
          background arithmetic mean and variance must be determined by pooling  the replicate
          measurements for the respective parameter concentrations or values  in samples obtained
          from upgradient wells during the first year.

       The requirements under Subtitle D §258.54(b) are somewhat different:

          The monitoring frequency for all constituents listed in Appendix I to this part,... shall be at
          least semi-annual during the active life of the facility.... A minimum  of four independent
          samples from each well (background and downgradient) must be collected and analyzed
          for  the Appendix I constituents...  during the first semi-annual event.  At least one sample
          from each well (background and downgradient)  must be collected and analyzed during
          subsequent semi-annual events...

     The  1980 and 1982 regulations required four analyses of essentially a single physical sample for
certain  constituents, i.e.,  the  four contaminant  indicator  parameters. The need  for statistically
independent  data was recognized  in the 1988 revisions to Part  264 and in the Part 258 solid waste
requirements.  In the latter rules, only a minimum single sample is required in successive semi-annual
sampling events. Individual Subtitle C programs have also made use of the provision in §264.97(g)(2) to
allow for fewer than four  samples collected during  a  given semi-annual  period, while other State
programs require the four successive sample measurements. As noted, the recent changes in the April
2006 Burden Reduction Rule removed the explicit requirements to obtain at least four samples per
evaluation period under 40 CFR §264.98(d) and §264.99(f), allowing more general
flexibility under the §264.97(g) sampling options. Individual State RCRA programs should be consulted
as to whether these recent rule changes may be applicable.

     The  requirements of Parts 264 and 258 were generally intended to provide sufficient data for
ANOVA-type tests in detection monitoring. However, control chart, tolerance limit, and prediction limit
tests can be applied with  as few as one new sample per evaluation, once background data are established.
The guidance provides maximum flexibility in offering a range of prediction limit options in Chapter
18 in order to address these various sample size requirements. Although not discussed in detail, the same
conclusions pertain to the use of control charts or tolerance limits.

     The  use  of the term  "replicate" in the Part 265 interim status regulations  can be a significant
problem, if interpreted to mean repeat analyses of splits (or aliquots) of a single physical sample. The
regulations indicate the need for statistical independence among sample data for testing purposes. This
guidance discusses the technical statistical problems that arise if replicate (aliquot) sample data are used
with the required Student's t-test in Part 265. Thus, the guidance recommends, if possible, that interim
status statistical evaluations be based on independent sample data as discussed in Chapters 13 and 14
and at the end of this chapter. A more standardized Welch's version of the Student's t-test for unequal
variances is provided as an alternative to the CABF Student's t-test.

2.2.5 GROUNDWATER  PROTECTION STANDARDS

     Part 265 does not use the term groundwater protection standards.  A first-year requirement under
§265.92(c)(1) is:

          For  all  monitoring  wells,   the  owner  or  operator  must  establish  background
          concentrations or values of all parameters specified in paragraph (b) of this section. He
          must do this quarterly for one year.

       Paragraph (b) includes water supply parameters listed in Part 265 Appendix III, which also
provides  a Maximum Level for each constituent. If a facility owner or operator  does not develop and
implement an assessment plan under §265.93(d)(4), there is a requirement in §265.94(a)(2) to report the
following information to the Regional Administrator:

          (i) During the first year when initial background concentrations are being established for
          the facility: concentrations or values  of  the parameters listed in §265.92(b)(l) for each
          groundwater monitoring well within 15 days after completing each quarterly analysis. The
          owner or operator must separately identify for each  monitoring well any parameters whose
          concentrations or  value has been found to exceed the maximum contaminant levels in
          Appendix III.

     Since the Part  265 regulations are explicit in requiring a one-to-one comparison, no statistical
evaluation is needed or possible.

     §264.94(a) identifies the permissible concentration limits as a GWPS under §264.92:

           The Regional Administrator will specify in the facility permit concentration limits in the
          groundwater for hazardous constituents established under §264.93. The concentration of a
          constituent:

          (1) must not exceed the background level of that constituent in the groundwater at the time
          the limit is specified in the permit; or

          (2) for any of the constituents listed in Table 1, must not exceed the respective value given
          in that table if the background level is below the value given in Table 1; or

          (3) must not exceed an alternate limit established by the Regional Administrator  under
          paragraph (b) of this section.

       The RCRA Subtitle D regulations establish the following standards under §258.55(h) and (i):

          (h)  The owner or operator must establish a groundwater protection standard for  each
          Appendix II constituent detected in groundwater. The groundwater protection standard
          shall be:

             (1)  For  constituents for which a maximum contaminant level (MCL) has  been
             promulgated under Section 1412 of the Safe Drinking Water Act (codified) under 40
             CFR Part 141, the MCL for that constituent;

             (2) for constituents for which  MCLs have  not  been promulgated, the background
             concentration  for  the  constituent  established from  wells  in  accordance  with
              §258.51(a)(1); or

             (3) for constituents for which the background level is higher than the MCL identified
              under paragraph (h)(1) of this section or health based levels identified under
              §258.55(i)(1), the background concentration.

          (i) The Director of an approved State program may establish an alternative groundwater
          protection standard for constituents for which MCLs have not been established. These
          groundwater protection standards shall be appropriate health based levels that satisfy the
          following criteria:

             (1) the level is derived in a manner consistent with Agency guidelines for assessing
              health risks of environmental pollutants [51 FR 33992, 34006, 34014, 34028, Sept. 24,
             1986]

             (2) to (4)... [other detailed requirements for health risk assessment procedures]

       The two principal alternatives for defining a groundwater protection standard [GWPS] are either
a limit based on background data or a fixed health-based value (e.g., MCLs, §264.94 Table 1 values, or a
calculated risk limit). The Unified Guidance discusses these two basic kinds of standards in Chapters 7
and 21.  If a background limit is applied, some definition of how the limit is constructed from  prior
sample data is required at the time of development.  For fixed health-based limits, the regulatory program
needs to consider the statistical characteristic of the data (e.g., mean, median, upper percentile)  that best
represents the  standard in order to conduct appropriate  statistical comparisons. This subject is also
discussed in Chapter 21; the guidance provides a number of testing options in this regard.

2.3  UNIFIED GUIDANCE RECOMMENDATIONS

2.3.1 INTERIM STATUS MONITORING

     As discussed in Chapter 14, replicates required for the four contaminant indicator parameters are
not statistically independent when analyzed  as aliquots or splits from a single physical sample. This
results in incorrect estimates of variance and the degrees of freedom when used in a Student's t-test. One
of the most  important revisions in the 1988 regulations  was to require that  successive  samples  be
independent.   Therefore, at  a minimum, the Unified Guidance recommends that only independent
water quality sample data be applied to the detection monitoring Student's t-tests in Chapter 16.


     There are other considerations limiting the application of these tests as well. As noted in Chapter
5, at least two of the indicator parameters (pH and specific conductance) are likely to exhibit natural
spatial differences among monitoring wells. Depending on site groundwater characteristics, TOC and
TOX may also vary spatially. TOX analytical limitations described in SW-84610  also note that levels of
TOX are affected by inorganic chloride levels, which themselves can vary spatially by well. In short, all
four indicator parameters may need to be evaluated on an intrawell basis, i.e., using historical data from
compliance monitoring wells.

     Since  this option  is  not  identified in existing Part  265 regulations for  indicator detection
monitoring, a  more appropriate strategy is to develop an alternative groundwater quality assessment
monitoring plan under §265.90(d)(3)  and  (4)  and §265.93(d)(3) and  (4).  These  sections of  the
regulations require evaluation of hazardous waste constituents  reasonably derived from the regulated
unit (either those which served as a basis  for listing in  Part  265  Appendix VII or which are found in
§261.24  Table  1).  Interim status  units  subject to a  permit  are  also subject  to  the groundwater
contaminant information collection provisions under §270.14[c], which potentially include all hazardous
constituents (a wider range of contaminants, e.g., Part 264 Appendix IX) reasonably expected from the
unit. While an interim status facility can return  to indicator  detection monitoring if no hazardous
constituent releases have been identified, such a return is itself optional.

     EPA  recommends that  interim status facilities develop  the §265.90(d)(3) &  (4)  alternative
groundwater quality assessment monitoring plan, if possible, using  principles and procedures found in
this guidance for monitoring design and statistical evaluation. Unlike Part 264 monitoring, there are no
formal compliance/corrective action steps associated with statistical testing. A regulatory agency may
take appropriate  enforcement  action if data indicate  a release or significant  adverse  effect.  The
monitoring plan can be applied for an indefinite period until permit  development. Multi-year collection
of semi-annual or quarterly hazardous constituent data is more determinative of potential releases.  The
facility or the  regulatory agency may also wish to continue evaluation of some or all of the Part 265
water quality indicators. Eventually these groundwater data can be used to establish which  monitoring
program(s) may be appropriate at the time of permit development under §264.91(b).

2.3.2 PARTS 264 AND 258  DETECTION  MONITORING METHODS

     As described in Chapter 13, many of the commonly monitored inorganic analytes exhibit natural
spatial variation among wells. Since the two ANOVA techniques in  §264.97(h) and §258.53(g) depend
on an assumption of a single common  background population, these tests may not be  appropriate in
many situations. Additionally, at least 50% of the data should be  detectable in order to compare either
well means or medians. For many hazardous trace elements, detectable percentages are considerably
lower.  Interwell ANOVA techniques would also not be generally useful in these cases.  ANOVA may
find limited applicability in detection monitoring  with  trace  organic  constituents,  especially where
downgradient levels are considerably higher than background  and there is a high percentage of detects.
Based on ranks alone,  it may be possible to determine that compliance well(s) containing one or more
hazardous  constituents exceed background.  However, the Unified Guidance  recommends avoiding
ANOVA techniques in the limiting situations just described.
10 Test Methods for Evaluating Solid Waste (SW-846), EPA OSWER, 3rd Edition and subsequent revisions, Method 9020B,
  September 1994


      Another detection monitoring method receiving less emphasis in this guidance is the tolerance
limit. In previous guidance, an upper tolerance limit based on background was suggested to identify
significant increases in downgradient well  concentration levels. While still acceptable by regulation
 (e.g., under existing RCRA permits), use of prediction limits is preferable to tolerance limits in
 detection monitoring for the following reasons. The construction of a tolerance limit is nearly identical
 to that of a prediction limit. In parametric normal distribution applications, both methods use the general
 formula: x̄ + κs. The kappa (κ) multiplier varies depending on the coverage and confidence levels
 desired, but in both cases some multiple of the standard deviation (s) is added or subtracted from the
 sample mean (x̄). For non-parametric limits, the similarity is even more apparent. Often the identical
 statistic (e.g., the maximum observed value in background) can either be used as an upper prediction
 limit or an upper tolerance limit, with only a difference in statistical interpretation.
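
     As a brief numerical sketch (hypothetical background data and an assumed per-test significance
level; the formal procedures appear in Chapters 17 and 18), a normal-theory upper prediction limit for a
single future measurement uses the x̄ + κs form with κ built from a Student's t-quantile:

        from math import sqrt
        from statistics import mean, stdev
        from scipy.stats import t

        # Minimal sketch: upper prediction limit for ONE future measurement,
        # using the general form x-bar + kappa*s. Background values are hypothetical.
        background = [4.1, 3.8, 5.0, 4.6, 4.3, 3.9, 4.8, 4.4]
        n, xbar, s = len(background), mean(background), stdev(background)

        alpha = 0.01                                      # assumed per-test false positive rate
        kappa = t.ppf(1 - alpha, df=n - 1) * sqrt(1 + 1/n)
        upper_pl = xbar + kappa * s                       # compare the next compliance value to this limit
        print(round(upper_pl, 2))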

     More fundamentally, given the wide variety of circumstances in which retesting strategies are now
encouraged and  even necessary, the mathematical underpinnings of retesting with prediction limits are
well established while those for retesting with  tolerance limits are not. Monte Carlo simulations were
originally conducted for the  1992  Addendum to develop appropriate retesting strategies  involving
tolerance limits.  Such simulations were found insufficient for the Unified Guidance.11

     While the  simultaneous  prediction limits presented in the Unified Guidance consider the actual
number of comparisons in defining exact false positive error rates,  some tolerance limit approaches
(including past guidance) utilized an approximate and less precise pre-selected low level of probability.
On balance,  there is little  practical need for  recommending two highly similar (but  not identical)
methods in the Unified Guidance, both for the reasons just provided and to avoid confusion over which
method to use.  The final regulation-specified  detection monitoring  method — control charts  — is
comparable to prediction limits, but possesses some unique benefits and so is also recommended in this
guidance.

2.3.3 PARTS  264 AND 258  COMPLIANCE/ASSESSMENT  MONITORING

     A  second  use  of tolerance  limits  recommended  in  earlier  guidance was for  comparing
downgradient monitoring well data to a fixed  limit  during compliance/assessment monitoring. In this
case, an upper tolerance limit constructed on each compliance well  data set could be used to identify
non-compliance with a  fixed GWPS limit. Past guidance also used upper confidence limits around an
upper proportion in defining these tolerance limits.  A number of problems were identified using this
approach.

     A tolerance limit  makes statistical sense  if the limit represents an upper percentile, i.e., when a
limit is not  to  be exceeded  by more than, for  instance,  1% or  5% or  10% of  future individual
concentration values. However, GWPS limits can also be interpreted as long-term averages, e.g., chronic
 risk-based values, which are better approximated by a statistic like the mean or median. Chapters 7 &
22 discuss important considerations when identifying the appropriate statistical parameter to be
compared against a fixed GWPS limit.

11 1) There were minor errors in the algorithms employed; 2) Davis and McNichols (1987) demonstrated how to compute
   exact kappa multipliers for prediction limits using a numerical algorithm instead of employing an inefficient simulation
   strategy; and 3) further research (as noted in Chapter 19) done in preparation of the guidance has shown that repeated
   prediction limits are more statistically powerful than retesting strategies using tolerance limits for detecting changes in
   groundwater quality.

     More importantly, since a tolerance limit constructed as an upper confidence limit around an upper
proportion by design overestimates the true value of that population proportion, demonstrating an
exceedance of a GWPS by such a limit does not necessarily indicate that the corresponding population
proportion also exceeds the standard, leading to a high false positive rate. Therefore, the Unified
Guidance recommends that the compliance/assessment
monitoring null hypothesis be structured so that the compliance population characteristic (e.g.,  mean,
median, upper percentile) is assumed to be less than or equal to the fixed standard unless demonstrated
otherwise.  The correct test statistic  in this situation is then the lower confidence limit. The  upper
confidence limit is used in corrective action to identify whether a constituent has returned to compliance.
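
     A minimal numerical sketch of this recommendation (hypothetical compliance data and an assumed
99% confidence level; Chapters 21 and 22 give the full procedures): compare a one-sided lower
confidence limit on the compliance-well mean to the fixed GWPS, and declare an exceedance only when
the lower limit is above the standard:

        from math import sqrt
        from statistics import mean, stdev
        from scipy.stats import t

        # Minimal sketch: lower confidence limit (LCL) on a compliance-well mean,
        # compared against a fixed groundwater protection standard. Data are hypothetical.
        compliance = [9.2, 11.5, 10.8, 12.1, 9.9, 11.0]
        gwps = 12.0                                       # assumed fixed standard (e.g., an MCL)

        n, xbar, s = len(compliance), mean(compliance), stdev(compliance)
        lcl = xbar - t.ppf(0.99, df=n - 1) * s / sqrt(n)  # 99% one-sided lower confidence limit

        print("exceedance demonstrated" if lcl > gwps else "no significant exceedance shown")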

     To ensure consistency  with the underlying  statistical presumptions  of compliance/assessment
monitoring (see Chapter 4)  and to  maintain control  of  false positive rates, the Unified Guidance
recommends that this tolerance interval approach be replaced with a more coherent and comprehensive
strategy based on the use  of confidence intervals (see Chapters 21 and 22). Confidence intervals can be
applied in a consistent  fashion to GWPS concentration limits representing either long-term averages or
upper percentiles.

            CHAPTER  3.  KEY STATISTICAL CONCEPTS
        3.1  INTRODUCTION TO GROUNDWATER STATISTICS	3-2
        3.2  COMMON STATISTICAL ASSUMPTIONS	3-4
          3.2.1   Statistical Independence	3-4
          3.2.2   Stationarity	3-5
          3.2.3   Lack of Statistical Outliers	3-7
          3.2.4   Normality	3-7
        3.3  COMMON STATISTICAL MEASURES	3-9
        3.4  HYPOTHESIS TESTING FRAMEWORK	3-12
        3.5  ERRORS IN HYPOTHESIS TESTING	3-14
          3.5.1   False Positives and Type I Errors	3-15
          3.5.2   Sampling Distributions, Central Limit Theorem	3-16
          3.5.3   False Negatives, Type II Errors, and Statistical Power	3-18
          3.5.4   Balancing Type I and Type II Errors	3-22
     The success of any discipline rests on its ability to accurately model and explain real problems.
Spectacular successes have been registered during the past four centuries by the field of mathematics in
modeling fundamental processes in mechanics and physics. The last century,  in turn, saw the rise of
statistics and its fundamental theory of estimation and hypothesis testing. All of the tests described in the
Unified Guidance are based upon this theory and involve the same key concepts. The purpose of this
chapter is to  summarize the statistical  concepts underlying  the  methods presented in  the Unified
Guidance, and to consider each  in the practical context of groundwater monitoring. These include:

     •  Statistical inference: the difference between samples and populations; the concept of sampling.
     •  Common statistical assumptions used in groundwater monitoring: statistical independence,
        stationarity, lack of outliers, and normality.
     •  Frequently-used statistical measures: mean, standard deviation, percentiles, correlation
        coefficient, coefficient of variation, etc.
     •  Hypothesis testing: How probability distributions are used to model the behavior of groundwater
        concentrations and how the statistical evidence is used to "prove" or "disprove" the validity of
        competing models.
     •  Errors in hypothesis testing: What false positives (Type I errors) and false negatives (Type II
        errors) really represent.
     •  Sampling distributions and the Central Limit Theorem: How the statistical behavior of test
        statistics differs from that of individual population measurements.
     •  Statistical power and power curves: How the ability to detect real contamination depends on the
        size or degree of the concentration increase.
     •  Type I vs. Type II errors: The tradeoff between false positives and false negatives; why it is
        generally impossible to minimize both kinds of error simultaneously.

3.1 INTRODUCTION TO GROUNDWATER STATISTICS

     This section briefly covers some basic statistical terms and principles used in this guidance. All of
these topics are more thoroughly discussed in standard textbooks. It is presumed that the user already has
some familiarity with the following terms and discussions.

     Statistics is a branch  of applied mathematics, dealing with the description, understanding,  and
modeling of data. An integral part of statistical analysis is the testing of competing mathematical models
and  the management  of data  uncertainty. Uncertainty  is present because  measurement  data exhibit
variability and because knowledge of the medium being sampled is limited. The fundamental aim of almost every
statistical analysis is to draw inferences. The data analyst must infer from the observed data something
about the physical world without knowing or seeing all the possible facts or evidence.  So the question
becomes: how closely do the measured data mimic reality, or put another way, to what extent do the data
correctly identify a physical truth  (e.g., the compliance  well  is contaminated with  arsenic above
regulatory limits)?

     One way to ascertain whether an aquifer is contaminated with certain chemicals would  be to
exhaustively sample and measure every physical volume of groundwater underlying the site of interest.
Such a collection of measurements would be impossible to procure in practice and would be infinite in
size, since sampling would have to be continuously conducted over time at a huge number of wells and
sampling depths.  However,  one would possess the entire population of possible measurements at that
site and the exact statistical distribution of the measured concentration values.

     A statistical distribution is an organized  summary  of a set of data values, sorted into the relative
frequencies of occurrence of different measurement levels (e.g., concentrations of 5 ppb or less  occur
among 30 percent of the values,  or levels of 20 ppb or  more only occur 1  percent  of the time).  More
generally, a distribution may refer to a mathematical model (known as a probability distribution) used to
represent the  shape and statistical characteristics of a given population and chosen according to  one's
experience with the type of data involved.
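
     As a small illustration with hypothetical values, the relative frequencies that make up an empirical
distribution can be tabulated directly from a data set:

        import numpy as np

        # Minimal sketch: summarize hypothetical concentration measurements (ppb)
        # as relative frequencies below/above chosen levels, plus a percentile.
        conc = np.array([2.1, 3.5, 4.9, 5.0, 6.2, 7.7, 9.8, 12.4, 15.1, 21.3])

        frac_at_or_below_5 = np.mean(conc <= 5.0)     # proportion of values at or below 5 ppb
        frac_at_or_above_20 = np.mean(conc >= 20.0)   # proportion of values at or above 20 ppb
        p95 = np.percentile(conc, 95)                 # empirical 95th percentile

        print(frac_at_or_below_5, frac_at_or_above_20, round(p95, 1))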

     By contrast to the population,  a statistical sample is  a finite subset of the population,  typically
called a data  set. Note that  the statistical definition of sample is usually different from a geological or
hydrological definition of the same term. Instead of a physical volume or mass, a statistical sample is a
collection of measurements, i.e., a set of numbers. This collection might contain only a single value, but
more generally has a number of measurements denoted as the sample size, n.

     Because a sample is only a partial representation of the population, an inference is usually desired
in order to conclude something from the observed data  about the underlying population. One or more
numerical characteristics of the population might be of  interest, such as the true average contaminant
level or the upper 95th percentile of the concentration distribution. Quantities computed from the sample
data are known as statistics, and can be used to reasonably estimate the desired but unknown population
characteristics. An example is when testing sample data against a regulatory standard such as a
maximum contaminant level [MCL] or background level. A sample mean estimate of the average
concentration can be used to judge whether the corresponding population characteristic — the true mean
concentration (denoted by the Greek letter μ) — exceeds the MCL or background limit.

     The accuracy of these estimates depends on how representative the sample measurements are of the
underlying population. In a representative sample, the distribution of sample values has the best
chance of closely matching the population distribution. Unfortunately, the degree of representativeness
of a given sample is almost never known. So it is quite important to understand precisely how the sample
values were obtained from the population and to  explore whether or not they appear representative.
Though there  is no guarantee that  a sample  will  be adequate, the  best  protection  against an
unrepresentative sample is to select measurements from the population at random. A random sample
implies that each potential population value has an equivalent chance of being selected depending only
on  its likelihood of  occurrence.  Not only  does random sampling  guard against selection of an
unrepresentative portion of the population distribution, it also enables a mathematical estimate to be
drawn of the statistical uncertainty associated with the ability of a given sample to represent the desired
characteristic of the population. It  can be  very difficult to gauge the uncertainty surrounding a sample
collected haphazardly or by means of professional judgment.

     As a simple example, consider an urn filled with red and green balls. By thoroughly mixing the urn
and blindly sampling (i.e., retrieving) 10  percent of the balls,  a very nearly random sample  of the
population of balls will be obtained, allowing a fair estimate of the true overall proportion of one color
or the other. On the other hand, if one looked into the urn while sampling and only picked red balls or
tried  to  alternate between red  and  green,  the   sample would  be  far from random  and  likely
unrepresentative of the true proportions.
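
     A short simulation (purely illustrative, with a hypothetical urn) shows how a blind random draw
tends to recover the true proportion:

        import random

        # Minimal sketch: estimate the proportion of red balls from a 10% random sample.
        random.seed(7)
        urn = ["red"] * 600 + ["green"] * 400            # true proportion of red = 0.60

        sample = random.sample(urn, k=len(urn) // 10)    # blind draw of 10% of the balls
        estimate = sample.count("red") / len(sample)
        print(round(estimate, 2))                        # typically close to 0.60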

     At first glance, groundwater measurements obtained during routine monitoring would not seem to
qualify as random samples. The well points are generally not placed in random  locations or at random
depths, and the physical samples are usually collected at regular, pre-specified intervals.  Consequently,
further  distinctions  and  assumptions are necessary  when performing  statistical evaluations  of
groundwater data. First, the distribution  of a given  contaminant may not be spatially uniform  or
homogeneous.  That is, the local distribution of measured values at one  well may not be the same  as at
other  wells. Because this  is often  true for naturally-occurring groundwater constituents, the statistical
population(s) of interest may be well-specific.  A statistical sample gathered from a particular well must
then be treated as potentially representative only of that well's local population.  On the other hand,
samples drawn from a number of reference background wells for which no significant differences are
indicated may permit the pooled data to serve as an estimate of the overall well field behavior for that
particular monitoring constituent.

     The distribution of a contaminant may also not be temporally uniform  or stationary over time. If
concentration values indicate a trend, perhaps because a plume intensifies or dissipates or natural in-situ
levels rise  or fall due to drought conditions, etc., the distribution is said to be  non-stationary.  In this
situation, some of the measurements collected over time may not be representative of current conditions
within the aquifer.  Statistical  adjustments might be needed or the data  partitioned into  usable and
unusable values.

     A similar difficulty is posed by cyclical or seasonal trends. A long-term constituent concentration
average at a well location or the entire site may  essentially be constant over time, yet temporarily
fluctuate up and down on a seasonal basis. Given a fixed interval between sampling  events, some of this
fluctuation may go unobserved due to the non-random nature of the sampling times.  This could result in
a sample that is unrepresentative of the population variance and possibly of the population mean as well.
In such settings, a shorter (i.e., higher frequency) or staggered sampling interval may be needed to better
capture key characteristics of the population as a part of the distribution of sample measurements.



     The difficulties in identifying a valid statistical framework for groundwater monitoring highlight a
fundamental assumption governing almost every statistical procedure and test.  It is the presumption that
sample data from a given population should be independent and identically distributed, commonly
abbreviated as i.i.d.  All of the mathematics and statistical formulas contained  in this guidance are built
on this basic assumption. If it is not satisfied, statistical conclusions and test results may be invalid or in
error. The associated statistical uncertainty may be different than expected from a given test procedure.

     Random sampling of a single, fixed, stationary population will guarantee independent, identically-
distributed sample data. Routine groundwater sampling  typically does not. Consequently, the Unified
Guidance discusses  both below and in later chapters what assumptions about the sample data must be
routinely or periodically checked. Many but not all of these assumptions are a simple consequence of the
i.i.d. presumption. The guidance also discusses how sampling ought to be conducted and designed to get
as close as possible to the i.i.d. goal.

3.2 COMMON  STATISTICAL ASSUMPTIONS

     Every statistical test or procedure makes certain assumptions about the  data used to compute the
method. As noted above, many of these assumptions flow as a natural consequence of the presumption
of independent, identically-distributed data (i.i.d.). The most common assumptions are briefly  described
below:

3.2.1  STATISTICAL INDEPENDENCE

     A major advantage of truly random sampling of a population is that the measurements will be
statistically independent. This means that observing or knowing the value of one measurement does not
alter or influence the probability of observing any other measurement in the population. After  one value
is selected, the next  value is sampled again at random without regard to the previous measurement, and
so on. By contrast, groundwater samples are not chosen at  random times or  at random locations. The
locations are fixed and typically few in number. The intervals between sampling events are  fixed and
fairly regular.  While samples  of independent data exhibit no pairwise correlation  (i.e., no  statistical
association of similarity or dissimilarity between pairs of sampled measurements), non-independent or
dependent data do exhibit pairwise correlation and often other, more complex forms of correlation.
Aliquot split sample pairs are generally not independent because of the positive correlation induced by
the  splitting  of the same physical groundwater sample. Split measurements tend to be highly similar,
much more so than the random pairings of data from distinct sampling events.

     In a similar vein, measurements  collected close together in time from the same well tend to be
more highly correlated than pairs collected  at longer  intervals.   This is especially true  when the
groundwater is so slow-moving  that the  same general volume of groundwater is  being sampled on
closely-spaced consecutive sampling events. Dependence may also be exhibited spatially across a well
field. Wells  located  more closely in space and screened in the same hydrostratigraphic zone may show
greater similarity in concentration patterns than wells that are farther apart. For both of these temporal or
time-related  and spatial dependencies, the observed correlations are a result not only  of the non-random
nature of the sampling but also the fact that many groundwater populations are not uniform throughout
the  subsurface. The aquifer may instead exhibit pockets  or sub-zones of higher or lower concentration,
perhaps due to location-specific  differences in natural geochemistry or the dynamics of contaminant
plume behavior over time.


     As a mathematical construct, statistical independence is essentially impossible to check directly in
a set of sample data — other than by ensuring ahead of time that the measurements were collected at
random. However, non-zero pairwise correlation, a clear sign of dependent data, can be checked and
estimated in a variety of ways.  The Unified Guidance describes two methods for identifying temporal
correlation  in Chapter 14: the  rank von Neumann ratio test and the sample autocorrelation function.
Measurable correlation  among consecutive sample pairs  may  dictate the need for  decreasing  the
sampling frequency or for a more complicated data adjustment.
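
     As a rough illustration of this kind of screening (the quarterly values below are hypothetical, and
the formal diagnostics are those described in Chapter 14), the lag-1 sample autocorrelation for a series
of measurements from a single well might be estimated as follows:

# Illustrative sketch only: estimate the lag-1 sample autocorrelation of a short
# series of quarterly measurements from one well.  Values near zero suggest little
# serial dependence; values well above zero suggest that consecutive samples may
# not be statistically independent.  The data are hypothetical.
import numpy as np

x = np.array([10.2, 11.0, 10.8, 12.1, 11.7, 12.4, 12.0, 13.1])  # hypothetical quarterly results

def lag1_autocorrelation(series):
    """Sample autocorrelation at lag 1: association of x[t] with x[t+1]."""
    s = np.asarray(series, dtype=float)
    dev = s - s.mean()
    # ratio of the lag-1 autocovariance to the variance, both taken about the overall mean
    return np.sum(dev[:-1] * dev[1:]) / np.sum(dev ** 2)

print(f"lag-1 autocorrelation = {lag1_autocorrelation(x):.2f}")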

     Defining and  modeling wellfield spatial  correlation is beyond the  scope of this guidance, but is
very much  the purview of the field of geostatistics. The Unified Guidance instead looks for evidence of
well-to-well spatial variation,  i.e., statistically identifiable  differences in mean and/or variance levels
across the well field. If evident, the statistical approach would need to be modified so that distinct wells
are treated  as individual populations with statistical testing being conducted separately at each one (i.e.,
intrawell comparisons).

3.2.2 STATIONARITY

     A stationary statistical distribution is one  whose population characteristics do not change over time
and/or space. In  a  groundwater context, this  means that the  true population  distribution of a given
contaminant is the same no matter where or when it is sampled. In the strictest form  of stationarity, the
full distribution must be exactly the same at every time and location. However, in  practice, a weaker
form is usually assumed: that the population mean (μ) and variance (denoted by the Greek symbol σ²)
are the same over time and/or space.

     Stationarity is important to  groundwater statistical analysis because of the way that monitoring
samples must be collected. If a sample set somehow represented the entire population  of possible aquifer
values, stationarity would not be an issue in theory. A limited number of physical groundwater samples,
however, must be individually collected from  each sampled location. To generate a statistical sample,
the individual measurements must be pooled together over time from multiple sampling events within a
well, or pooled together across space by aggregating data from multiple wells, or both.

     As long as the contaminant distribution is stationary, such pooling poses no statistical problem. But
with a non-stationary distribution, either the mean and/or variance  is changing over time in any given
well, or the means and variances differ at distinct locations.  In either case, the pooled measurements are
not identically-distributed even if they may be statistically independent.

     The effects of non-stationarity are commonly seen in four basic ways in the groundwater context:
1) as spatial variability, 2) in the existence of trends and/or seasonal variation,  3) via other forms of
temporal variation,  and 4) in the lack of homogeneity of variance.  Spatial variability (discussed more
extensively in Chapter 13) refers to statistically identifiable differences in mean  and/or variance levels
(but usually means) across the well field (i.e.,  spatial non-stationarity). The existence of such variation
often  precludes the pooling of data  across multiple background wells  or the  proper  upgradient-to-
downgradient comparison of background wells against distinct  compliance wells.  Instead,  the usual
approach is to perform intrawell comparisons, where well-specific background data  is culled from the
early sampling history at  each well. Checks  for spatial variability are conducted graphically with the aid
of side-by-side box plots (Chapter 9) and through the use  of analysis of variance [ANOVA, Chapter
13].


     A trend over time at a given well location indicates that the mean level is not stationary but is
instead rising or falling. A seasonal trend is similar in that there are periodic increases and decreases.
Pooling  several  sampling events together  thus mixes  measurements with  differing  statistical
characteristics.  This can violate the identically-distributed presumption of almost all statistical tests and
usually leads to an inflated estimate of the current population variance. Trends  or seasonal variations
identified in (upgradient) background wells or in intrawell background data from compliance wells can
severely impact the accuracy and effectiveness  of statistical procedures described in this guidance if data
are pooled  over time to establish background limits. The approach that should be taken will vary with
the circumstance. Sometimes the trend component might need to be estimated and removed  from the
original data, so that what gets tested are the data residuals (i.e., values that result from subtracting the
estimated trend from the original data) instead of the raw measurements.  In other cases, an alternate
statistical approach might be needed such as a test for (positive) trend or construction of a confidence
band around an estimated trend. More discussion of these options is presented in Chapters 6, 7, 14, and
21.

     To identify a linear trend, the Unified Guidance describes simple linear regression and the Mann-
Kendall test in Chapter 17. For seasonal patterns or a combination of linear and  seasonal trend effects,
the guidance discusses the seasonal Mann-Kendall test and the use of ANOVA tests to identify seasonal
effects. These diagnostic procedures are also presented in Chapter 14.

     Temporal variations are distinguished in this guidance from trends or seasonal effects by the lack
of a regular or identifiable pattern. Often a temporal effect will be observed as a temporary shift in
concentration levels that is similar in magnitude and direction at multiple wells. This can occur at some
sites,  for instance,   due  to rainfall  or recharge events. Because  the  mean level  changes  at  least
temporarily, pooling data over time again violates the assumption of identically-distributed data. In this
case, the temporal effect can be identified by looking for parallel traces on a time series plot of multiple
wells and then more formally by performing a one-way ANOVA for temporal effects. These procedures
are described  in Chapter  14.  Once  identified, the  residuals from the  ANOVA  can be  used for
compliance testing, since the common temporal effect has been removed.

     Lastly, homogeneity of variance is important in ANOVA tests, which simultaneously evaluate
multiple groups of data, each representing a sample from a distinct statistical population.  In the latter
test, well means need not be the  same; the reason for performing the test in the first place is to find out
whether the means do indeed differ. But the procedure assumes that all the group variances are equal or
homogeneous. Lack  of homogeneity or stationarity in the variances  causes the test  to be much less
effective at discovering differences in the well means.  In extreme cases, the concentration levels would
have to differ by large amounts before the ANOVA would correctly register a statistical difference. Lack
of homogeneity of variance can be identified graphically via the use of side-by-side box plots and then
more formally with the use of Levene's test. Both these methods are discussed further in Chapter 11.
Evidence of unequal  variances may necessitate the use of a transformation to stabilize the variance prior
to running the ANOVA. It might also preclude  use of the ANOVA altogether for compliance testing, but
require intrawell approaches to be considered instead.

     ANOVA is not the only  statistical procedure which assumes homogeneity of variance. Prediction
limits and  control charts require a similar assumption between background and  compliance well data.
But if only one new  sample  measurement is collected per well per evaluation period (e.g.,  semi-
annually) it can be difficult to formally test this assumption with the diagnostic methods cited above. As
an alternative, homogeneity of variance can be periodically tested when a sufficient sample size has been
collected for each compliance well (see Chapter 6).

3.2.3 LACK OF STATISTICAL OUTLIERS

     Many authors have noted that outliers — extreme, unusual-looking measurements — are a regular
occurrence among groundwater data (Helsel and Hirsch, 2002; Gibbons and Coleman, 2001). Sometimes
an outlier results from nothing more than a typographical error on a laboratory data sheet or file.  In
others, the fault is an incorrectly calibrated measuring device or a piece of equipment that was not
properly decontaminated. An unusual measurement might also reflect the sampling of a temporary, local
'hot spot' of higher concentration. In each of these situations, outliers in a statistical context represent
values that are inconsistent with the distribution of the remaining measurements. Tests for outliers thus
attempt to  infer  whether the suspected outlier could have reasonably been  drawn from the same
population as the other measurements, based on the sample data  observed up to that point. Statistical
methods to help identify potential outliers are discussed in Chapter 12, including both Dixon's and
Rosner's tests, as well as references to other methods.

     The basic problem with including statistical outliers in analyzing groundwater data is that they do
not come from the same distribution as the other measurements in the sample and so fail the identically-
distributed presumption of most tests.  The consequences can be dramatic, as can be seen for instance
when considering non-parametric prediction limits. In this testing method, one of the largest values
observed in the background data, such as the maximum, is often the statistic selected as the prediction
limit. If a large outlier is present among the background measurements, the prediction limit may be set to
this value despite being unrepresentative of the background population. In effect, it arises from another
population, e.g., the 'population' of typographical errors. The prediction limit could then be much higher
than warranted based on the observed background data and may provide little if any probability that truly
contaminated  compliance wells will be identified. The test will then have lower than expected statistical
power.

     Overall, it pays to try to identify  possible outliers and to either correct the value(s) if possible, or
exclude known outliers from subsequent statistical analysis. It is  also possible to select a statistical
method that is resistant to the presence of outliers,  so that the test results are still likely to be accurate
even if one or more outliers is unidentified. Examples of this last strategy include setting non-parametric
prediction limits to values other than the background maximum using repeat testing (see Chapter 18) or
using Sen's slope procedure to estimate the rate of change in a linear trend (Chapter 17).

3.2.4 NORMALITY

     Probability distributions introduced in Section 3.1 are  mathematical models used to approximate
or represent the statistical characteristics of populations. Knowing  the exact form and defining equation
of a probability distribution allows one to assess how likely or unlikely it will be to observe particular
measurement values (or ranges of values) when selecting or drawing independent, identically distributed
[i.i.d] samples from the associated population. This can be done as follows. In the case of a continuous
distributional model,  a curve  can be drawn to  represent the  probability distribution by plotting
probability values along the y-axis and measurement or concentration values along the x-axis. Since the
continuum of x-values along this curve is infinite,  the probability of occurrence of any single possible
value is negligible (i.e., zero), and does not equal the height  of the curve. Instead, positive probabilities
can be computed for ranges of possible values by summing the area under the distributional curve
associated with the desired range. Since by definition the total area under any probability distribution
curve sums to unity, all probabilities are then numbers between 0 and 1.

     Probability distributions form the basic building blocks of all statistical testing procedures. Every
test  relies on comparing  one  or more  statistics computed from  the sample data against a reference
distribution.  The reference distribution is in turn a probability distribution summarizing the expected
mathematical behavior of the statistic(s) of interest.   A formal  statistical test utilizes this reference
distribution  to make  inferences about the  sample statistic in terms of two contrasting conditions or
hypotheses.

     In any event, probability distributions used in statistical testing make differing assumptions about
how the underlying  population  of  measurements  is  distributed. A  case  in  point  is  simultaneous
prediction limits using retesting (Chapter  19). The first  and most common version of this test (Davis
and  McNichols,  1987)  is based on an  assumption that the sample  data  are  drawn  from a normal
probability distribution. The normal distribution is the well-known bell-shaped curve, perhaps the single
most important and frequently-used distribution in statistical analysis.  However, it is not the only one.
Bhaumik  and Gibbons (2006) proposed  similar  prediction limits  for data drawn  from  a gamma
distribution  and Cameron (2008) did the same for Weibull-distributed measurements. This more recent
research demonstrates that prediction limits with similar  statistical decision error rates can vary greatly
in magnitude, depending on the type of data distribution assumed.

     Because many tests  make an explicit assumption concerning the distribution represented by the
sample data, the form and exact type of distribution often has to be checked using a goodness-of-fit test.
A goodness-of-fit test assesses how closely the observed sample data resemble a proposed distributional
model. Despite the wide variety of probability distributions identified in the statistical  literature, only a
very few goodness-of-fit tests generally are needed in practice. This is because most tests are based on an
assumption of normally-distributed or normal data. Even  when an  underlying distribution is not normal,
it is often possible to use a mathematical transformation of the raw measurements  (e.g., taking the
natural logarithm or log of  each value) to normalize the data set.   The  original values can  be
transformed into a set of numbers that behaves as if drawn from a normal distribution.  The transformed
values can then be utilized in and analyzed with a normal-theory test (i.e., a procedure that assumes the
input data are normal).

     Specific goodness-of-fit tests for checking and identifying data distributions are found in Chapter
10 of this guidance. These methods  all are designed to check the fit to normality of the sample data.
Besides the normal, the lognormal distribution is also commonly used as a model for groundwater data.
This distribution is not symmetric in shape like the  bell-shaped normal curve, nor does it have similar
statistical  properties.  However, a simple  log transformation of lognormal  measurements works to
normalize such a data set. The transformed values can be tested using one of the standard goodness-of-
fit tests of normality to confirm that the original data were indeed lognormal.
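
     As a rough sketch of this idea (not the specific procedures of Chapter 10), a standard normality
test such as the Shapiro-Wilk test can be run on both the raw and the log-transformed values; the
concentrations below are hypothetical:

# Rough illustration only: apply the Shapiro-Wilk normality test to raw and to
# log-transformed values.  A small p-value suggests a departure from normality,
# so a higher p-value on the logged data is consistent with a lognormal model.
# The data are hypothetical.
import numpy as np
from scipy import stats

conc = np.array([1.2, 0.8, 3.5, 2.1, 9.7, 1.6, 4.4, 0.9, 2.8, 6.3])  # hypothetical concentrations

w_raw, p_raw = stats.shapiro(conc)          # test the raw measurements
w_log, p_log = stats.shapiro(np.log(conc))  # test the natural-log transformed values

print(f"raw data: W = {w_raw:.3f}, p = {p_raw:.3f}")
print(f"log data: W = {w_log:.3f}, p = {p_log:.3f}")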

     More generally, if a  sample shows evidence of non-normality using the techniques in Chapter 10,
the initial remedy is  to try  and find  a suitable normalizing transformation.  A set of useful possible
transformations in this  regard  has been termed the ladder of powers (Helsel and Hirsch,  2002).  It
includes not  only the natural logarithm, but also other mathematical power transformations  such as the
square root,  the  cube root, the square, etc. If none of these  transformations creates an adequately
normalized data set, a second approach is to consider what are known as non-parametric tests. Normal-
theory and other similar parametric statistical  procedures  assume that the  form of the underlying
probability distribution  is  known.  They  are called  parametric  because the  assumed  probability
distribution is generally characterized by a small set of mathematical parameters. In the case of the
normal distribution, the general formula describing its shape and properties is completely specified by
two parameters: the population mean (μ) and the population variance (σ²). Once values for these
quantities are known, the exact distribution representing a particular normal population can be computed
or analyzed.

     Most parametric tests do not require knowledge of the exact distribution represented by the sample
data, but rather just the type of distribution (e.g., normal, lognormal, gamma, Weibull, etc.). In more
formal terms, the test assumes  knowledge of the family of distributions indexed  by the characterizing
parameters. Every different combination of population mean and variance  defines a different normal
distribution, yet all belong  to the normal family. Nonetheless, there are many data  sets for which a
known distributional family cannot be  identified. Non-parametric methods may then be appropriate,
since a known distributional form is not  assumed. Non-parametric tests are discussed in various chapters
of the Unified Guidance. These tests are typically based on either a ranking or an ordering of the sample
magnitudes in order to assess their statistical performance and accuracy.   But even non-parametric tests
may make use of a normal approximation to define how expected rankings are distributed.

     One other common difficulty in checking for normality among groundwater measurements is the
frequent presence of non-detect values,  known in statistical terms as left-censored measurements. The
magnitude  of these sample concentrations is known only  to lie  somewhere between zero  and  the
detection or reporting limit; hence the true concentration is  partially 'hidden' or  censored on the left-
hand side of the numerical concentration scale. Because the most effective normality tests assume that
all the sample measurements are known  and quantified and not censored, the Unified Guidance suggests
two possible approaches in this circumstance. First, it is usually possible to simply assume that the true
distributional form  of the underlying  population cannot be identified, and to instead apply a non-
parametric test alternative. This solution is not always ideal, especially when using prediction limits  and
the background sample size is  small, or when using control  charts (for which there is no current non-
parametric alternative to the Unified Guidance recommended test).

     As a second alternative, Chapter 10 discusses  methods for assessing approximate normality in the
presence  of non-detects. If normality can be established,  perhaps through a normalizing transformation,
Chapter 15 describes methods for estimating the mean and variance parameters of the specific normal
distribution needed for constructing tests (such as prediction limits or control  charts), even though the
exact value of each non-detect is unknown.

3.3  COMMON  STATISTICAL MEASURES

     Due to the variety of statistical tests and other methods presented in the Unified Guidance, there
are  a  large number of equations and formulas of relevance  to specific situations. The most common
statistical measures used in many settings are briefly described below.

     Sample mean and standard deviation — the mean of a set of measurements of sample size n is
simply the arithmetic average of each of the numbers in the sample (denoted by x_i), described by formula
[3.1] below. The sample mean is a common estimate of the center or middle of a statistical distribution.
That is, x̄ is an estimate of μ, the population mean. The basic formula for the sample standard deviation
is given in equation [3.2]. The sample standard deviation is an estimate of the degree of variability
within a distribution, indicating how much the values typically vary from the average value or mean.
Thus, the standard deviation s is an estimate of the population standard deviation σ. Note that another
measure of variability, the sample variance, is simply the square of the standard deviation (denoted by
s²) and serves as an estimate of the population variance σ².


                         \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i                                    [3.1]

                         s = \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2 }    [3.2]

     Coefficient of Variation — for positively-valued measurements, the sample coefficient of
variation provides a quick and useful indication of the relative degree of variability within a data set. It is
computed as s/x̄ and so indicates whether the amount of 'spread' in the sample is small or large relative
to the average observed magnitude. Sample coefficients of variation can also be calculated for other
distributions such as the logarithmic (see the discussion of logarithmic statistics below and Chapter 10,
Section 10.4).
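
     A minimal sketch of equations [3.1] and [3.2], together with the coefficient of variation s/x̄,
applied to a small hypothetical sample:

# Minimal sketch of the sample mean [3.1], sample standard deviation [3.2], and
# coefficient of variation (s / x-bar) for a small hypothetical data set.
import numpy as np

x = np.array([5.0, 10.0, 15.0, 15.0, 15.0, 20.0, 25.0])   # hypothetical measurements

n = len(x)
x_bar = x.sum() / n                              # sample mean, equation [3.1]
s = np.sqrt(((x - x_bar) ** 2).sum() / (n - 1))  # sample standard deviation, equation [3.2]
cv = s / x_bar                                   # sample coefficient of variation

print(f"mean = {x_bar:.1f}, standard deviation = {s:.2f}, CV = {cv:.2f}")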

     Sample percentile — the pth percentile of a sample (denoted as x_p) is the value such that
p × 100% of the measurements are no greater than x_p, while (1 − p) × 100% of the values are no less
than x_p. Sample percentiles are computed by making an ordered list of the measurements (termed the
order statistics of the sample) and either selecting an observed value from  the sample that comes closest
to satisfying  the above definition or interpolating between the pair  of  sample values closest  to the
definition if no single value meets it.

     Slightly different estimates  of the sample  percentile are  used to perform the interpolation
depending on the software package or statistics textbook. The Unified Guidance follows Tukey's (1977)
method for computing  the lower and upper quartiles (i.e., the 25th and 75th sample percentiles, termed
hinges by Tukey) when constructing box plots (Chapter 9). In that setting, the pair of sample values
closest  to the desired percentile  is simply  averaged. Another popular method for more generally
computing sample percentiles is to set the rank of the desired order statistic as k = (n+1) × p. If k is not
an integer, perform linear interpolation between the pair of ordered sample values with ranks just below
and just above k.
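
     A short sketch of this k = (n+1) × p convention appears below; other packages interpolate slightly
differently, and the data are hypothetical:

# Sketch of the percentile rule described above: rank k = (n + 1) * p, with linear
# interpolation between the two order statistics bracketing k when k is not an
# integer.  This is only one of several common conventions.
import numpy as np

def percentile_np1(values, p):
    """p-th sample percentile (0 < p < 1) using the k = (n+1)*p convention."""
    v = np.sort(np.asarray(values, dtype=float))
    n = len(v)
    k = (n + 1) * p
    lo = min(max(int(np.floor(k)), 1), n)   # rank just below k, clamped to 1..n
    hi = min(max(int(np.ceil(k)), 1), n)    # rank just above k, clamped to 1..n
    frac = k - np.floor(k)
    return v[lo - 1] + frac * (v[hi - 1] - v[lo - 1])

data = [5, 10, 15, 15, 15, 20, 25]                               # hypothetical sample
print(percentile_np1(data, 0.25), percentile_np1(data, 0.75))    # lower and upper quartiles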

     Median  and  interquartile  range —  the  sample median  is the  50th percentile  of  a  set of
measurements, representing the midpoint of an ordered list of the values. It is usually denoted as x̃ or
x_.50, and represents an alternative estimate of the center of a distribution. The interquartile range [IQR] is
the difference between the 75th and 25th sample percentiles, thus equal to (x_.75 − x_.25). The IQR offers an
alternative estimate of variability  in a population, since it represents the  measurement range  of the
middle 50% of the ordered sample values. Both the median and the interquartile range are key statistics
used to construct box plots (Chapter 9).
     The median and interquartile range can be very useful alternatives to the mean and standard
deviation for estimating data centrality and dispersion, especially when samples are drawn from a highly
skewed (i.e., non-symmetric) distribution or when one or more outliers is present. The table below depicts
two data sets, one with an obvious outlier, and demonstrates how these statistical measures compare.

     The median and interquartile range are not affected by the inclusion of an outlier (perhaps an
inadvertent reporting of units in ppb rather than ppm). Large differences between the mean and median,
as well as between the standard deviation and interquartile range, in the second data set can indicate that
an anomalous data point may be present.
                    Data Set #1               Data Set #2
                         5                         5
                        10                        10
                        15                        15
                        15                        15
                        15                        15
                        20                        20
                        25                    25,000

                    Mean   = 15               Mean   > 3,500
                    Median = 15               Median = 15
                    s      = 6.5              s      > 9,000
                    IQR    = 10               IQR    = 10
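
     The comparison in the table can be reproduced directly, showing how strongly the mean and
standard deviation react to the single outlier while the median and IQR do not (the IQR below is taken
from the quartiles at ranks 2 and 6 of the seven ordered values, matching the table):

# Reproduces the table above: summary statistics for the same data with and
# without a single large outlier.
import numpy as np

set1 = np.array([5, 10, 15, 15, 15, 20, 25], dtype=float)
set2 = np.array([5, 10, 15, 15, 15, 20, 25000], dtype=float)   # same data plus an outlier

for label, d in [("Data Set #1", set1), ("Data Set #2", set2)]:
    d_sorted = np.sort(d)
    iqr = d_sorted[5] - d_sorted[1]          # 20 - 10 = 10 for both data sets
    print(f"{label}: mean = {d.mean():.0f}, median = {np.median(d):.0f}, "
          f"s = {d.std(ddof=1):.1f}, IQR = {iqr:.0f}")
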
     Log-mean, log-standard deviation and Coefficient of Variation — The lognormal distribution
is a frequently-used model in groundwater statistics. When lognormally distributed data are log-
transformed, the resulting normally-distributed measurements can then be input into normal-theory tests.
The Unified Guidance frequently makes use of quantities computed on log-transformed values. Two of
these quantities, the log-mean and the log-standard deviation, represent the sample mean and standard
deviation computed using log-transformed values instead of the raw measurements. Formulas for these
quantities — denoted ȳ and s_y to distinguish them from the measurement-scale mean (x̄) and standard
deviation (s) — are given below. Prior to calculating the logarithmic mean and standard deviation, the
measurement-scale data must first be log-transformed. Taking logarithms of the sample mean (x̄) and
the sample standard deviation (s) based on the original measurement-scale data will not give the correct
result.
                         \bar{y} = \frac{1}{n} \sum_{i=1}^{n} \ln\left( x_i \right)                                  [3.3]

                         s_y = \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} \left( \ln\left( x_i \right) - \bar{y} \right)^2 }    [3.4]
     A population logarithmic coefficient of variation can be estimated from the logarithmically
transformed data as CV_log = √(e^(s_y²) − 1). It is based solely on the logarithmic standard deviation, s_y, and
represents the intrinsic variability of the untransformed data.
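
     A minimal sketch of equations [3.3] and [3.4] and the logarithmic coefficient of variation,
emphasizing that the individual measurements are log-transformed before any statistics are computed
(the data are hypothetical):

# Minimal sketch of the log-mean [3.3], log-standard deviation [3.4], and
# logarithmic coefficient of variation.  Logs are taken observation by
# observation first; the statistics are then computed on the logged values.
import numpy as np

x = np.array([1.2, 0.8, 3.5, 2.1, 9.7, 1.6, 4.4, 0.9, 2.8, 6.3])   # hypothetical concentrations

y = np.log(x)                           # natural log of each measurement
y_bar = y.mean()                        # log-mean, equation [3.3]
s_y = y.std(ddof=1)                     # log-standard deviation, equation [3.4]
cv_log = np.sqrt(np.exp(s_y ** 2) - 1)  # logarithmic coefficient of variation

print(f"log-mean = {y_bar:.2f}, log-standard deviation = {s_y:.2f}, CV_log = {cv_log:.2f}")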

     Sample correlation coefficient — correlation is a common numerical measure of the degree of
similarity or linear association between two random variables, say x and y. A variety of statistics are used
to estimate the correlation depending on the setting and how much is known about the underlying
distributions of x and y. Each measure is typically designed to take on values in the range of −1 to +1,
where −1 denotes perfect inverse correlation (i.e., as x increases, y decreases, and vice-versa), while +1
denotes perfect correlation (i.e., x and y increase or decrease together), and 0 denotes no correlation (i.e.,
x and y behave independently of one another). The most popular measure of linear correlation is
Pearson's correlation coefficient (r), which can be computed for a set of n sample pairs (x_i, y_i) as:

     r = \frac{ \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) }{ \sqrt{ \sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2 } }    [3.5]
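
     A brief sketch of equation [3.5] for a set of hypothetical paired measurements:

# Pearson's correlation coefficient r, equation [3.5], computed for hypothetical
# paired measurements (e.g., two constituents measured on the same sampling events).
import numpy as np

x = np.array([2.1, 3.4, 2.9, 4.8, 5.5, 6.0])
y = np.array([1.0, 1.9, 1.4, 2.6, 3.2, 3.1])

dx = x - x.mean()
dy = y - y.mean()
r = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

print(f"r = {r:.3f}")    # always falls between -1 and +1
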
3.4  HYPOTHESIS TESTING  FRAMEWORK

     An important component of statistical analysis involves the testing of competing mathematical
models, an activity known as hypothesis testing. In hypothesis testing,  a formal comparison is made
between two mutually exclusive possible statements about reality.  Usually these statements concern the
type or form of underlying statistical population from which the sample  data originated, i.e., either the
observed data  came from one statistical population or from another, but  not both. The sample data are
used to judge which statistical model identified by the two hypotheses is  most consistent with the
collected observations.

     Hypothesis testing is similar in nature to what takes place in a criminal trial. Just as one of the two
statements in a hypothesis test is judged true and the other false, so the defendant is declared either
innocent or guilty. The opposing lawyers each develop their theory  or model of the crime and what really
happened. The jury must then decide whether the available evidence better supports the prosecution's
theory or the defense's  explanation.  Just as  a strong presumption of innocence is given to a criminal
defendant, one of the  statements  in a statistical hypothesis is initially favored over the other. This
statement, known as the null hypothesis [Ho], is only rejected as false if the sample evidence strongly
favors the other side of the hypothesis, known as the alternative hypothesis [HA].

     Another  important parallel is that the same mistakes  which can occur in statistical hypothesis
testing are made in criminal trials. In a criminal proceeding, the innocent can falsely be declared guilty or
the guilty can wrongly be judged innocent. In the same way, if the null hypothesis [Ho] is a  true
statement about reality but is rejected in favor of the alternative hypothesis [HA], a mistake akin to
convicting the innocent has occurred. Such a mistake is known in statistical terms as a false positive or
Type I error. If the alternative hypothesis [HA] is true but is rejected in favor of Ho, the mistake is akin to
acquitting the guilty.  This mistake is known as a false negative or Type II error.

     In a criminal investigation, the test hypotheses can be reversed. A detective investigating a crime
might consider a list of probable suspects as potentially guilty (the null hypothesis [Ho]), until substantial
evidence is found to exclude one or more suspects [HA]. The burden of proof for accepting the
alternative hypothesis and the kinds of errors which can result are the opposite of those in a legal trial.


      Certain steps are involved in conducting any statistical hypothesis test.  First, the null hypothesis
Ho must be specified and is given presumptive weight in the hypothesis testing framework. The observed
sample (or a statistic derived from these data) is assumed to follow a known statistical distribution,
consistent with the distributional model used to describe reality under Ho. In groundwater monitoring, a
null hypothesis might posit that concentration measurements of benzene, for  instance, follow a normal
distribution with zero  mean. This statement is contrasted against the alternative hypothesis, which is
constructed as a competing model of reality. Under HA, the observed data or statistic follows a different
distribution, corresponding to a different distributional model. In the simple example above, HA might
posit that benzene concentrations follow a normal distribution, but this time with a mean no less than 20
ppb, representing a downgradient well that has been contaminated.

       Complete descriptions of statistical hypotheses are usually not made. Typically, a shorthand
formula is used for the two competing statements. Denoting the true population mean as the Greek letter
μ, and a possible value of this mean as μ0, a common specification is:

                              Ho: μ ≤ μ0    versus    HA: μ > μ0                                [3.6]

This formulation clearly distinguishes between the location (i.e., magnitude) of the population mean μ
under the two competing models, but it does not specify the form of the underlying population itself. In
most parametric tests, as explained in Section 3.2, the underlying model is assumed to be the normal
distribution, but this is not a necessary condition or the basic assumption in all tests. Note also that a
family of distributions is specified by the hypothesis, not two individual, specific distributions. Any
distribution with a true mean no greater than μ0 satisfies the null hypothesis, while any distribution from
the same family with true mean larger than μ0 satisfies the alternative hypothesis.
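
     As an illustration of this one-sided hypothesis structure only (not a recommendation of any
particular compliance test), a one-sample Student's t-test of Ho: μ ≤ μ0 against HA: μ > μ0 might be
computed as follows, using hypothetical data and a hypothetical fixed value μ0:

# Illustration of the one-sided structure in [3.6] via a one-sample t-test of
# H0: mu <= mu0 against HA: mu > mu0.  A small one-sided p-value would favor HA.
# The data and mu0 are hypothetical.
import numpy as np
from scipy import stats

x = np.array([21.5, 18.9, 24.2, 20.7, 22.8, 19.6])    # hypothetical compliance-well results
mu0 = 20.0                                            # hypothetical comparison value

t_stat = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(len(x)))
p_value = stats.t.sf(t_stat, df=len(x) - 1)           # upper-tail probability under H0

print(f"t = {t_stat:.2f}, one-sided p = {p_value:.3f}")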

     Once the statistical hypothesis has been  specified, the next step is to actually collect the data and
compute whatever test  statistic is required based on the observed measurements and the kind of test.
The pattern of the observed measurements or the  computed test statistic is then compared with the
population model predicted or described under Ho. Because this model is specified as a statistical
distribution, it can be used to assign probabilities to different results. If the observed result or pattern
occurs with very low probability under the null hypothesis model (e.g., with at most a 5% or 1%
chance), one of two outcomes is assumed to have occurred. Either the result is a "chance" fluctuation in
the data representing a real but unlikely outcome under Ho, or the null hypothesis was an incorrect
model to begin with.

     A low probability of occurrence under Ho is cause for rejecting the null hypothesis in favor of HA,
as long as the probability of occurrence under the latter alternative is also not too small. Still, one should
be careful to understand that statistics involves the art of managing uncertainty. The null hypothesis may
indeed be true, even if the measured results seem unlikely to have arisen under the Ho model. A small
probability of occurrence is not the same as no possibility of occurrence. The judgment in favor of HA
should be made with full recognition that a false positive mistake is always possible even if not very
likely.

     Consider the measurement  of benzene  in groundwater in  the  example  above.  Given natural
fluctuations in groundwater composition from week-to-week or month-to-month and  the  variability
introduced in the lab during the measurement process, the fact that one or two samples show either non-
detect  or very low levels of benzene does not guarantee that the true mean benzene concentration at the
well is essentially zero. Perhaps the true mean is higher, but the specific sample values collected were
drawn from the "lower tail" of the benzene distribution just by chance or were measured incorrectly in
the lab. Figure 3-1 illustrates this possibility, where the full benzene distribution is divided into a lower
tail portion that has  been sampled and a remaining portion that has not so far been observed. The
sampled values are not representative of the entire population distribution, but only of a small part of it.

     In a similar vein, if the observed result or pattern can occur with moderate to high probability
under the null hypothesis, the model represented by Ho is accepted as consistent with the sample
measurements. Again, this does not mean the null hypothesis is necessarily true. The alternative
hypothesis could be true instead, in which case the judgment to accept Ho would be considered a false
negative. Nevertheless, the sample data do not provide sufficient evidence or justification to reject the
initial presumption.

           Figure 3-1.  Actual, But Unrepresentative Benzene Measurements

           [Figure: probability density curve of the true benzene population distribution;
            only the lower tail ("Sampled Values") has been observed. x-axis: benzene
            concentration; y-axis: probability.]
3.5  ERRORS IN HYPOTHESIS TESTING

     In order to properly interpret the results of any statistical test, it is important to understand the risks
of making a wrong decision. The risks of the two possible errors or mistakes mentioned above are not
fixed quantities; rather, false positive and false negative risks are best thought of as statistical parameters
that can  be adjusted when performing a  particular test.  This flexibility allows one, in  general, to
"calibrate" any test to meet specific risk or error criteria. However, it is important to recognize what the
different  risks represent. RCRA groundwater regulations stipulate that any test procedure maintain a
"reasonable balance" between the risks of false positives and false negatives. But  how does one decide
on a reasonable balance? The answer lies in a proper understanding of the real-life  implications attached
to wrong judgments.
3.5.1 FALSE POSITIVES AND TYPE  I ERRORS

     A false positive or Type I error occurs whenever the null hypothesis [Ho] is falsely rejected in
favor of the alternative hypothesis [HA]. What this means in terms of the underlying statistical models is
somewhat different for every test. Many of the tests in the Unified Guidance are designed to address the
basic groundwater detection monitoring framework, namely, whether the concentrations at downgradient
wells are significantly greater than background. In this case, the null hypothesis is that the background
and  downgradient wells  share the same underlying distribution  and that downgradient concentrations
should be consistent with background in the absence  of any contamination. The alternative hypothesis
presumes  that downgradient well  concentrations are  significantly  greater than background and come
from a distribution with an elevated concentration.

     Given this  formulation of HQ and HA,  a Type I error occurs whenever one decides that the
groundwater at downgradient locations is significantly higher than  background when in reality it is the
same in distribution. A judgment of this sort concerns the underlying statistical populations and not the
observed sample data. The measurements  at a downgradient well may indeed be higher than those
collected in background.  But the  disparity must be great enough  to decide  with confidence that the
underlying populations also differ. A proper statistical test must account for  not just the difference in
observed mean levels but also  variability in the data likely to be  present in the  underlying statistical
populations.

     False positive mistakes can cause regulated  facilities to incur substantial unnecessary costs and
oversight  agencies to  become unnecessarily  involved.  Consequently, there is  usually  a desire by
regulators and the regulated community alike to minimize the false positive rate (typically denoted by
the Greek letter α). For reasons that will become clear below, the false positive rate is inversely related
to the false negative rate for a fixed sample size n. It is impossible to completely eliminate the risk of
either Type I or Type II errors, hence the regulatory mandate to minimize the inherent tradeoff by
maintaining a "reasonable balance" between false positives and false negatives.

     Type I errors  are strictly defined in terms  of the hypothesis  structure of the test. While the
conceptual groundwater detection monitoring framework assumes that false positive errors are incorrect
judgments  of a release when there is none,  Type I  errors  in other statistical tests  may  have  a very
different meaning.  For instance,  in  tests  of normality (Chapter  10) the null hypothesis is that the
underlying population is normally-distributed, while the alternative is that the population follows some
other, non-normal pattern. In this setting, a false positive represents the mistake of falsely  deciding the
population to be non-normal, when in fact it is normal  in distribution. The implication of such an error is
quite different, perhaps  leading  one to  select  an alternate test method or to needlessly attempt  a
normalizing transformation of the data.

     As a matter of terminology, the false positive rate α is also known as the significance level of the
test. A test conducted at the α = .01 level of significance means there is at most a 1% chance or
probability that a Type I error will occur in the results. The test is likely to lead to a false rejection of the
null hypothesis at most about 1 out of every 100 times the same test is performed. Note that this last
statement says nothing about how well the test will work if HA is true, when Ho should be rejected. The
false positive rate strictly concerns those cases where Ho is an accurate reflection of the physical reality,
but the test rejects Ho anyway.

3.5.2 SAMPLING DISTRIBUTIONS, CENTRAL LIMIT THEOREM

     The false positive rate of any statistical test can be calibrated to meet a given risk criterion. To see
how this is done, it helps to understand the concept of sampling distribution. Most statistical test
decisions are  based on the magnitude  of a  particular test  statistic computed from the sample data.
Sometimes the test statistic is relatively simple, such as the sample mean (x̄), while in other instances
the statistic is  more complex and non-intuitive. In every case, however, the test statistic is formulated as
it is for a specific purpose: to enable the analyst to identify the distributional behavior of the test statistic
under the null hypothesis. Unless one knows the expected behavior of a test statistic, probabilities cannot
be assigned to specific outcomes for deciding when the probability is too low to be a chance fluctuation
of the data.

     The distribution of the test statistic is known as its sampling distribution. It is given a special
name, in part,  to distinguish the behavior of the test statistic from the potentially different distribution of
the individual observations or measurements used to calculate the  test. Once identified, the sampling
distribution can be used to establish critical points of the test associated with specific maximal false
positive rates for any given significance level (α). For most tests, a single level of significance is
generally chosen.

     An example  of this  idea can be illustrated via the F-test. It is used  for instance in parametric
analysis  of variance [ANOVA]  to  identify differences in the population means at  three or  more
monitoring wells.  Although ANOVA assumes that the individual measurements input  to the test are
normally-distributed, the test statistic under  a null hypothesis  [Ho]  of no differences between  the true
means follows an F-distribution. More specifically, it applies  to one member of the F-distribution family
(an example using 5 wells and 6 measurements per well is pictured in Figure 3-2). As seen in the right-
hand tail of this  distribution by summing the area under the  distributional curve,  large values of the F-
statistic become less and less probable as they increase in magnitude. For a given significance level (α),
there is a corresponding F-statistic value such that the probability of exceeding this cutoff value is α or
less. In such situations, there is at most an α × 100% chance of observing an F-statistic under Ho that is
as large or larger than the cutoff (shaded area in Figure 3-2). If α is quite small (e.g., 5% or 1%), one
may then judge the null hypothesis to be an untenable model and accept HA. As a consequence, the
cutoff value can be defined as an α-level critical point for the F-test.
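
     As a brief sketch of this idea, the α-level critical points of the F-distribution pictured in
Figure 3-2 (4 and 25 degrees of freedom) can be looked up numerically rather than from a printed table:

# Upper-tail critical points of the F-distribution with 4 and 25 degrees of
# freedom (Figure 3-2): the value exceeded with probability alpha under H0.
from scipy import stats

for alpha in (0.05, 0.01):
    crit = stats.f.ppf(1 - alpha, dfn=4, dfd=25)
    print(f"alpha = {alpha:.2f}: F critical point = {crit:.2f}")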

     Because  test  statistics can be quite complicated, there is no easy rule for determining the sampling
distribution of a  particular test. However, the sampling behavior of some statistics is a consequence of a
fundamental result known as the Central Limit Theorem. This theorem roughly states that averages or
sums of identically-distributed random variables will follow an approximate normal distribution,
regardless of the distributional behavior of the individual measurements. This averaged distribution has
the same mean μ as the population of individual measurements, while its variance is scaled relative to
the underlying population variance σ² by a factor of the sample size n on which the average or
sum is based. Specifically, the variance is greater by a factor of n in the case of a sum (n·σ²) and
smaller by a factor of n in the case of an average (σ²/n). The approximation of the averages or sums to
the normal distribution improves as sample size increases (also see the discussion of power and sample
size in Section 3.5.3).
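
     A small simulation sketch of this behavior, using an arbitrary skewed (lognormal) population
purely for illustration:

# Simulation sketch of the Central Limit Theorem: averages of n skewed
# (lognormal) measurements are centered on the population mean and have a
# variance close to sigma^2 / n, shrinking as n grows.
import numpy as np

rng = np.random.default_rng(1)
pop = rng.lognormal(mean=1.0, sigma=0.75, size=200_000)   # stand-in for the population
mu, var = pop.mean(), pop.var()

for n in (5, 10, 25):
    means = rng.lognormal(mean=1.0, sigma=0.75, size=(20_000, n)).mean(axis=1)
    print(f"n = {n:2d}: mean of averages = {means.mean():.2f} (mu = {mu:.2f}), "
          f"variance of averages = {means.var():.2f} (sigma^2/n = {var / n:.2f})")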

                  Figure 3-2. F-Distribution with 4 and 25 Degrees of Freedom

                  [Figure: probability density curve of the F-distribution with 4 and 25
                   degrees of freedom; x-axis: F-statistic value, y-axis: probability. The
                   upper tail area beyond the α-level critical point is shaded.]
     Because of the Central Limit Theorem, a number of test statistics at least approximately follow the
normal distribution.  This allows critical points for these tests to be  determined from a table of the
standard normal distribution. The Central Limit Theorem also explains why sample means provide a
better estimate of the true  population  mean than  individual  measurements  drawn  from the same
population (Figure 3-3). Since the sampling distribution of the mean is centered on the true average (μ)
of the underlying population and the variance is lower by a factor of n, the sample average x̄ will tend to
be much closer to μ than a typical individual measurement.
                        Figure 3-3.  Effect of Central Limit Theorem

                        [Figure: the sampling distribution of the mean is much narrower than
                         the underlying population distribution and is centered on the same
                         population mean.]
3.5.3 FALSE NEGATIVES, TYPE II ERRORS, AND  STATISTICAL POWER

     False negatives or Type II errors are the logical opposites of false positive errors.  An error of this
type occurs whenever the null hypothesis [Ho] is accepted, but instead the alternative hypothesis [HA] is
true. The false negative rate is denoted by the Greek letter β. In terms of the groundwater detection
monitoring framework, a Type II error represents a mistake of judging the compliance point
concentrations to be consistent with  background, when in reality the compliance point distribution is
higher on average. False negatives  in this context describe the risk of missing or not identifying
contaminated groundwater when  it really exists.  EPA has traditionally been more concerned with such
false negative errors, given its mandate to protect human health and the environment.

     Statistical power is an alternate way  of describing false negative  errors.  Power is merely the
complement of the false negative rate. If β is the probability of a false negative, (1 − β) is the statistical
power of a particular test. In terms of the hypothesis structure, statistical power represents the probability
of correctly rejecting the null hypothesis. That is, it is the minimum chance that one will decide to accept
HA, given that HA is true. High power translates into a greater probability of identifying contaminated
groundwater when it really exists.

     A convenient way to keep track of the differences between false positives, false negatives, and
power is via a Truth Table (Figure 3-4). A truth table distinguishes between the underlying truth of each
hypothesis Ho or HA and the decisions made on the basis of statistical testing. If Ho is true, then a
decision to accept the alternative hypothesis (HA) is a false positive error which will occur with a
probability of at most α. Because only one of two decisions is possible, Ho will also be accepted with a
probability of at least (1 − α). This is also known as the confidence probability or confidence level of the
test, associated with making a 'true negative' decision. Similarly, if HA is actually true, making a false
negative decision error by accepting the null hypothesis (Ho) has at most a probability of β. Correctly
accepting HA when true then has a probability of at least (1 − β) and is labeled a 'true positive' decision.
This probability is also known as the statistical power of the test.

     For any application of a test to a particular sample, only one of the two types of decision errors can
occur. This is because only one of the two mutually exclusive hypotheses will be a true statement. In the
detection monitoring context, this means that if a well is uncontaminated (i.e., Ho is true), it may be
possible to commit a Type I false positive mistake, but it is not possible to make a Type II false negative
error. Similarly, if a contaminated well is tested (i.e.,  HA is true), Type I false positive errors cannot
occur, but a Type  II false negative error might occur.
                       Figure 3-4. Truth Table in Hypothesis Testing

                                              DECISION
                                   Accept Ho                 Accept HA

         TRUTH    Ho true          OK                        Type I
                                   (True Negative)           (False Positive)
                                   (1 − α)                   (α)

                  HA true          Type II                   OK
                                   (False Negative)          (True Positive)
                                   (β)                       (1 − β)
     Since the false positive rate can be fixed in advance of running most statistical tests by selecting α,
one might think the same could be done with statistical power. Unfortunately, neither statistical power
nor the false negative rate can be fixed in advance for a number of reasons. One is that power and the
false negative rate depend on the degree to which the true mean concentration level is elevated with
respect to the background null condition. Large concentration increases are easier to detect than small
increments. In fact, power can be graphed as an increasing function of the true concentration level in
what is termed a power curve (Figure 3-5). A power curve indicates the probability of rejecting Ho in
favor of the alternative HA for any given alternative to the null hypothesis (i.e., for a range of possible
mean-level increases above background).
     In interpreting the power curve below, note that the x-axis is labeled in terms of relative
background standard deviation units (σ) above the true background population mean (μ). The zero point
along the x-axis is associated with the background mean itself, while the kth positive unit along the axis
represents a 'true' mean concentration in the compliance well being tested equal to μ + kσ. This mode of
scaling the graph allows the same power curve to be potentially applied to any constituent of interest
subject to the same test conditions. This is true no matter what the typical background concentration
levels of a chemical found in groundwater may be. But it also means that the same point along
the power curve will represent different absolute concentrations for different constituents.  Even if the
background means are the same, a two standard deviation increase in a chemical with highly variable
background concentrations will correspond to a  larger population mean increase at a compliance well
than the same relative increase in a less variable constituent.

     As  a simple example, if the background  population averages for arsenic  and manganese both
happen to be 10 ppb, but the arsenic standard deviation is 5 ppb while that for manganese is only 2 ppb,
then a compliance well with a mean equivalent to a three  standard deviation increase over background
would have an average arsenic level of 25 ppb, but an average manganese level of only 16 ppb. For both
constituents, however,  there  would be  approximately a  50% probability of detecting a difference
between the compliance well and background.
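
     The arithmetic behind this example is simply μ + kσ evaluated for each constituent:

# Converts a k standard deviation increase into an absolute compliance-well mean
# (mu + k*sigma) for the two constituents in the example above.
background = {"arsenic": (10.0, 5.0), "manganese": (10.0, 2.0)}   # (mean, std dev) in ppb
k = 3                                                             # three-standard-deviation increase

for constituent, (mu, sigma) in background.items():
    print(f"{constituent}: mu + {k}*sigma = {mu + k * sigma:.0f} ppb")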

                             Figure 3-5.  Example Power Curve

                             [Figure: power curve plotting the probability (%) of rejecting Ho
                              on the y-axis against the true mean increase, in standard
                              deviations (SDs) above background, on the x-axis.]
     Because the power probability depends on the relative difference between the actual downgradient
concentration level and background, power cannot typically be fixed ahead of time like the critical false
positive rate for a test.  The true concentration level (and  associated power) in  a compliance well is
unknown. If it were known, no hypothesis test would be needed.  Additionally, it is often not clear what
specific magnitude of increase  over background is environmentally significant.   A two  standard
deviation increase  over the background  average might  not be protective of human health and/or the
                                             3-20
        March 2009

-------
Chapter 3. Key Statistical Concepts
Unified Guidance
environment for some monitoring situations. For others, a four standard deviation increase or more may
be tolerable before any threat is posed.

     Because the exact ramifications of a particular concentration increase are uncertain, it is difficult
to set a minimum power requirement (or a maximum false negative rate) for a given
statistical test.  Some  State statutes contain water quality non-degradation provisions, for which  any
measurable increase might be of concern. By emphasizing relative power as in Figure 3-5, all detection
monitoring constituents  can be evaluated for significant concentration  increases on a common footing,
subject only to differences in measurement variability.

     Another key factor affecting statistical power is sample size. All other test conditions being equal,
larger sample sizes provide higher statistical power and a lower false negative rate (β). Statistical tests
perform more accurately with larger data sets, leading to greater power and fewer errors in the process.
The Central Limit Theorem illustrates why this is true. Even if a downgradient well mean level is only
slightly greater than background, with enough measurements the upgradient and downgradient sample
means will have so little variance in their sampling distributions that they will tend to hover very
close to their respective population means. True mean differences in the underlying populations can be
distinguished with higher probability as sample sizes increase. In Figure 3-6, the sampling distributions
of means of size 5 and 10 from two different normal populations are provided for illustration. The
narrower sampling distributions for the n = 10 means are more clearly distinguished from each
other than those for means of sample size n = 5. This implies higher probability and power to distinguish
between the two population means.

             Figure 3-6. Why Statistical Power Increases with Sample Size

       [Figure: sampling distributions of the mean for n = 10 and for n = 5, shown with the actual
       populations; the n = 10 distributions are narrower and more clearly separated]
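
     The effect of sample size on power can also be sketched numerically. The fragment below uses a
simplified one-sided z-test approximation with a known background standard deviation; it is not one of
the formal detection monitoring tests presented later in this guidance, but it illustrates why the narrower
n = 10 sampling distributions in Figure 3-6 translate into higher power for the same true increase.

    # Illustration only: approximate power of a simplified one-sided z-test comparing a
    # compliance well mean to a known background mean, when the true mean sits k
    # background standard deviations above background. Power rises as n increases.
    from scipy.stats import norm

    def approx_power(k, n, alpha=0.05):
        """1 - beta for the simplified z-test; k = true increase in SD units, n = sample size."""
        z_crit = norm.ppf(1 - alpha)               # critical point on the standardized scale
        return 1 - norm.cdf(z_crit - k * n ** 0.5)

    for n in (5, 10):
        print(n, round(approx_power(k=1.0, n=n), 2))   # e.g., ~0.72 at n = 5 versus ~0.94 at n = 10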
3.5.4 BALANCING TYPE I AND TYPE II  ERRORS

     In maintaining an appropriate balance between false positive and false negative error rates, one
would ideally like to simultaneously minimize both kinds of errors. However, both risks are inherent to
any statistical test procedure, and the risk of committing a Type I error is indirectly but inversely related
to the risk of a Type II error unless the sample size can be increased. It is necessary to find a balance
between the two error rates. But given that the false negative rate depends largely on the true
compliance point concentrations, it is first necessary to designate what specific mean difference (known
as an effect size) between the background and compliance point populations should be considered
environmentally important. A minimum power requirement can be based on this difference (see
Chapter 6).

       EXAMPLE 3-1

     Consider a  simple example of using the downgradient sample mean to test the proposition that the
downgradient population  mean is 4 ppb larger than background.  Assume that extensive sampling has
demonstrated that the  background population mean is equal to 1 ppb. If the true downgradient mean
were the same as the  background level, curves of the two sampling distributions would  coincide (as
depicted in Figure 3-7). Then a critical point (e.g., CP = 4.5 ppb) can be selected so that the risk of a
false positive mistake is α. The critical point establishes the decision criterion for the test. If the observed
sample mean based on randomly selected data from the downgradient sampling distribution exceeds the
critical point, the downgradient population will be declared higher in concentration than the background,
even though this is not the case. The frequency with which such a wrong decision will be made is just the
area under the sampling distribution to the right of the critical point, which equals α.

          Figure 3-7.  Relationship Between Type I and Type II Errors, Part A

       [Figure: H0: μ = 1 ppb; HA: μ = 5 ppb. Under H0, the background and compliance point
       populations completely overlap. CP = critical point]
     If the true downgradient mean is actually 5 ppb, the sampling distribution of the mean will instead
be centered over 5 ppb  as in the right-hand curve (i.e., the downgradient population) in Figure 3-8.
Since there really is a difference between the two populations, the alternative hypothesis and not the null
hypothesis is true. Thus, any observed sample mean drawn from the downgradient population that falls
below the critical point is a false negative mistake. Consequently, the area under the right-hand
sampling distribution in Figure 3-8 to the left of the critical point represents the frequency of Type II
errors (β).

     The false negative rate (β) in Figure 3-8 is obviously larger than the false positive rate (α) of
Figure 3-7. This need not be the case in general, but the key point is to understand that for a fixed
sample size, the Type I and Type II error rates cannot be simultaneously minimized. If α is increased by
selecting a lower critical point in Figure 3-7, the false negative rate in Figure 3-8 will also be lowered.
Likewise, if α is decreased by selecting a higher critical point, β will be enlarged. If the false positive
rate is indiscriminately lowered, the false negative rate (or reduced power) will likely reach unacceptable
levels even for mean concentration levels of environmental importance. Such reasoning lay behind
EPA's decision to mandate minimum false positive rates for t-tests and ANOVA procedures in both the
revised 1988 and 1991 RCRA rules.

           Figure 3-8. Relationship Between Type I and Type II Errors, Part B

       [Figure: under HA, the background and compliance point populations differ; the background
       population sits to the left of the downgradient sampling distribution. CP = critical point]
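
     The interplay of α and β in Example 3-1 can also be sketched numerically. In the fragment below,
the normal sampling distributions of the mean are centered at 1 ppb (H0) and 5 ppb (HA) as in Figures
3-7 and 3-8; the 2 ppb standard error is an assumed value chosen only for illustration, since the example
does not specify one. Lowering the critical point raises α while reducing β, and raising it does the reverse.

    # Illustration of Example 3-1: false positive (alpha) and false negative (beta) rates
    # for a fixed critical point, with normal sampling distributions of the mean centered
    # at 1 ppb (H0) and 5 ppb (HA). The 2 ppb standard error is assumed for illustration.
    from scipy.stats import norm

    mu0, mu1, se = 1.0, 5.0, 2.0
    for cp in (4.5, 3.5):                                 # lowering CP raises alpha and lowers beta
        alpha = 1 - norm.cdf(cp, loc=mu0, scale=se)       # area above CP under the H0 curve
        beta = norm.cdf(cp, loc=mu1, scale=se)            # area below CP under the HA curve
        print(cp, round(alpha, 3), round(beta, 3))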

  CHAPTER 4.   GROUNDWATER MONITORING PROGRAMS
                     AND  STATISTICAL ANALYSIS
       4.1   THE GROUNDWATER MONITORING CONTEXT	4-1
       4.2   RCRA GROUNDWATER MONITORING PROGRAMS	4-3
       4.3   STATISTICAL SIGNIFICANCE IN GROUNDWATER TESTING	4-6
         4.3.1  Statistical Factors	4-8
         4.3.2  Well System Design and Sampling Factors	4-8
         4.3.3  Hydrological Factors	4-9
         4.3.4  Geochemical Factors	4-10
         4.3.5  Analytical Factors	4-10
         4.3.6  Data or Analytic Errors	4-11
     This chapter provides an overview of the basic groundwater monitoring framework, explaining the
intent of the federal groundwater statistical regulations and offering insight into the key identification
mechanism of groundwater monitoring, the statistically significant increase [SSI]:

    ❖  What are statistically significant increases and how should they be interpreted?
    ❖  What factors, both statistical and non-statistical, can cause SSIs?
    ❖  What factors should be considered when demonstrating that an SSI does not represent evidence
        of actual contamination?

4.1 THE GROUNDWATER MONITORING CONTEXT

     The RCRA regulations frame a  consistent approach to groundwater monitoring, defining  the
conditions under which  statistical  testing takes place. Upgradient and  downgradient wells must be
installed to monitor the uppermost aquifer in order to identify releases or changes in existing conditions
as expeditiously as possible.  Geological  and hydrological expertise is  needed to properly locate  the
monitoring wells in the aquifer passing beneath the monitored unit(s). The regulations identify a variety
of design and sampling requirements for groundwater monitoring (such as measuring well piezometric
surfaces and identifying flow directions) to assure that this basic goal is achieved. Indicator or hazardous
constituents are measured in these wells at regular time intervals; these sample data serve as the basis for
statistical comparisons. For identifying releases under detection monitoring, the regulations generally
presume  comparisons of observations from downgradient wells against those  from upgradient wells
(designated as  background). The rules also recognize certain situations  (e.g.,  mounding  effects) when
other means to define background may be necessary.

     The Unified Guidance may apply to facility groundwater monitoring programs straddling a wide
range of conditions. In addition to units regulated under Parts 264 and 265 Subpart F and Part 258 solid
waste landfills, other non-regulated units  at Subtitle C facilities or CERCLA sites  may utilize similar
programs. Monitoring can vary from a regulatory minimum of one upgradient and three  downgradient
wells, to very  large facilities with multiple units, and perhaps 50-200 upgradient  and  downgradient
wells. Although the rules presume that monitoring will occur in the  single uppermost aquifer likely to be
affected by a release, complex geologic conditions may require sampling and evaluating a number of
aquifers or strata.
     Detection monitoring constituents may include indicators  like common ions and other general
measures  of water  quality, pH, specific conductance, total organic carbon [TOC] and total organic
halides  [TOX].  Quite often, well monitoring data  sets are  obtained  for filtered or unfiltered trace
elements (or both) and sizeable suites of hazardous trace organic constituents, including volatiles, semi-
volatiles, and pesticide/herbicides. Measurement and analysis of hazardous constituents using standard
methods (in SW-846  or elsewhere) have become fairly routine over time. A large number of analytes
may be potentially available as monitoring constituents for statistical testing, perhaps  50-100 or more.
Identification  of the  most  appropriate  constituents  for  testing depends  to  a great extent on  the
composition of the managed wastes (or their decomposition products) as measured in leachate analyses,
soil gas sampling, or from prior knowledge.

     Nationally,  enough groundwater monitoring  experience  has  been gained in using routine
constituent  lists and  analytical techniques to suggest some common  underlying patterns.  This is
particularly true  when defining background conditions in groundwater.  Sampling frequencies have also
been standardized enough (e.g., semi-annual or quarterly sampling) to enable reasonable computation of
the sorts of sample sizes  that can be used for statistical testing. Nevertheless, complications can and do
occur over time — in the form of changes in  laboratories, analytical methods, sampled wells,  and
sampling frequencies — which can affect the quality and availability of sample data.

     Facility status  can also affect what data are potentially available for evaluation and testing — from
lengthy regulated unit monitoring records under the Part 265 interim status requirements at sites awaiting
either operational or post-closure 264 permits or permit re-issuance, to a new solid waste facility located
in a zone  of  uncontaminated  groundwater with little  prior  data.  Some combined RCRA/CERCLA
facilities  may  have  collected  groundwater information  under  differing   program  requirements.
Contamination from offsite or non-regulated units (or  solid waste management units) may complicate
assessment of likely contaminant sources or contributions.

     Quite  often, regulators  and regulated parties find themselves  with considerable  amounts of
historical constituent-well monitoring data that must be assessed for appropriate action, such as a permit,
closure, remedial action  or enforcement decision. Users will  need to  closely consider the diagnostic
procedures in Part II  of the Unified Guidance, with an eye towards selection of one or more appropriate
statistical tests in Parts III and IV. Selection will depend on key factors such as the number  of wells  and
constituents, statistical characteristics of the observed data, and historical patterns of contamination (if
present), and  may  also  reflect preferences for  certain types of tests. While the Unified Guidance
purposely identifies  a  range of tests which might fit a situation, it is generally recommended that one set
of tests be selected for final implementation, in order to avoid "test-shopping" (i.e., selecting tests during
permit implementation based on the most favorable  outcomes).  EPA recognizes that the  final permit
requirements are approved by the regulatory agency.

     All of the above situations share some features in  common. A certain number of facility wells will
be designated as compliance points, i.e., those locations  considered as  significant from a regulatory
standpoint for assessing potential releases.  Similarly, the most appropriate and critical indicator and/or
hazardous constituents for monitoring will be identified. If  detection monitoring (i.e., comparative
evaluations of compliance wells against background) is deemed  appropriate for some or all wells  and
constituents, definitions  of background or reference comparison levels  will  need to be  established.
Background data can be obtained  either from the upgradient wells or  from the  historical  sampling
database as described  in Chapter 5. Choice of background will depend on how statistically comparable
the compliance point data are with respect to background and whether individual  constituents exhibit
spatial or temporal variability at the facility.

     Compliance/assessment or corrective action monitoring may be appropriate choices when there is a
prior or historical indication of hazardous constituent releases from a regulated unit. In those situations,
the regulatory agency will establish GWPS limits. Typically, these limits are found in established
regulatory tables or SDWA drinking water MCLs, derived through risk-based calculations, or determined
from background data.
For remedial actions,  site-specific levels may be developed which account  not only  for risk, but
achievability and implementation costs as well. Nationally, considerable experience has been gathered in
identifying cleanup targets which might be applicable at a given facility, as well as how practical those
targets are likely to be.

     Use  of the Unified Guidance should thus be viewed in an overall context.  While  the guidance
offers important considerations and suggestions in selecting and designing a statistically-based approach
to monitoring, it is important to realize that it is only a part of the overall decision  process at a facility.
Geologic and hydrologic expertise, risk-based decisions, and legal and practical considerations by the
regulated  entity  and  regulatory agency  are  fundamental  in developing   the final   design  and
implementation. The guidance does not attempt to address the many other relevant decisions which
impact the full design of a monitoring system.

4.2  RCRA  GROUNDWATER MONITORING PROGRAMS

     Under the RCRA regulations,  some form of statistical testing of sample data  will generally be
needed to determine whether there has been a release, and if so, whether concentration levels lie below
or  above  a  protection  standard.   The  regulations  frame  the   testing programs  as  detection,
compliance/assessment, and corrective action monitoring.

     Under RCRA permit  development and during routine evaluations, all three monitoring program
options may need to be simultaneously considered. Where sufficient hazardous constituent data from site
monitoring or other evidence of a release exists, the regulatory agency can evaluate which monitoring
program(s) are  appropriate under §264.91. Statistical principles and testing provided in the  Unified
Guidance can be used to develop presumptive evidence for one program over another.

     In some applications, more than one monitoring program may be appropriate.  Both the number of
wells  and constituents to be tested can vary among the three monitoring programs at  a given site.  The
types  of non-hazardous indicator constituents used for detection monitoring might not be applied  in
compliance or corrective action monitoring, which focus instead on hazardous constituents. Only a few
compliance well constituents may exceed their respective GWPSs. The focus of a corrective action
monitoring program might then be placed on those constituents, with the remaining well constituents
evaluated under the other monitoring schemes. But following the general regulatory structure, the three monitoring
systems are presented below and elsewhere in the guidance as an ordered sequence:

      Detection monitoring  is appropriate either when there is  no  evidence of  a release from a
regulated unit, or when the unit situated in a historically contaminated area is  not impacted by current
RCRA waste management  practices. Care must be taken to avoid  a situation where the constituents
might reasonably have originated offsite or from units not subject to testing, since any  adverse change in
groundwater quality would be attributed to on-site causes. Whether an observed change in groundwater
quality is in fact due to a release from on-site waste activities at the facility may be open to dispute
and/or further demonstration. However, this basic framework underlies each of the statistical methods
used in detection monitoring.

     A crucial step in  setting up a detection monitoring program is to establish a set of background
measurements, a baseline or reference level for statistical  comparisons (see Chapter 5). Groundwater
samples  from  compliance wells are then compared against  this baseline to measure  changes in
groundwater quality. If at least one chemical parameter on the monitoring list indicates a statistically
significant increase above the baseline [SSI, see Section 4.3], the facility or regulated unit moves into
the next phase: compliance or assessment monitoring.

       Compliance or assessment monitoring1 is appropriate when there is reliable statistical evidence
that  a concentration increase over the baseline has occurred.  The purpose of compliance/assessment
monitoring is two-fold: 1) to assess the extent  of contamination (i.e., the size of the increase, the
chemical parameters involved, and the locations on-site where contamination is  evident); and 2) to
measure  compliance with pre-established  numerical  concentration  limits generally  referred  to as
GWPSs.  Only the  second purpose is fully addressed using formal statistical tests. While important
information can be gleaned  from  compliance well  data,  more complex analyses (e.g.,  contaminant
modeling) may be needed to address the first goal.

     GWPSs  can be fixed health- or risk-based limits, against which  single-sample tests are made. At
some sites, no specific  fixed concentration  limit  may be assigned or readily available for one or more
monitoring parameters. Instead, the comparison is made against a limit developed from background data.
In this case,  an appropriate statistical approach might be to  use  the background measurements to
compute a statistical limit and set it as  the GWPS.  See Chapter  7 for further  details.  Many of the
detection monitoring design principles (Chapter 6) and statistical tests (Part III) can also be applied to
a set of constituents defined by a background-type GWPS.

     The RCRA Parts  264 and  258 regulations require an expanded analysis of potential hazardous
constituents (Part 258 Appendix II for municipal landfills or Part 264 Appendix IX for hazardous waste
units) when detection monitoring indicates a release and compliance monitoring is potentially triggered.
The purpose is to better gauge which hazardous constituents have actually impacted groundwater. Some
detection monitoring programs may require only limited testing of indicator parameters. This additional
sampling can be used to determine which wells have been impacted and provide some understanding of
the on-site distribution of hazardous constituent concentrations in groundwater. The course of action
decided by the Regional Administrator or State Director will depend on the number of such chemicals
present at quantifiable levels and on the actual concentration levels.
1 The terms compliance monitoring (§264.99 & 100) and assessment monitoring (§258.55 & 56) are used interchangeably in
  this document to refer to RCRA monitoring programs. Compliance monitoring is generally used for permitted hazardous
  waste facilities under RCRA Subtitle C, while assessment monitoring is applied to municipal solid waste landfills regulated
  under RCRA Subtitle D. The term "assessment" is also used in 40 CFR 265 Subpart F for a second phase of additional
  analyte testing. Occasional use is also made of the term "compliance wells," which refers to downgradient monitoring wells
  located at the point(s) of compliance under §264.95 (any of the three monitoring programs may apply when evaluating
  these wells).

     Following the occurrence of a valid statistically  significant increase [SSI] over baseline during
detection monitoring, the statistical presumption in compliance/assessment monitoring is quite similar to
the detection stage. Given G as a fixed compliance or background-derived GWPS, the null hypothesis is
that true concentrations (of the underlying compliance point population) are no greater than G. This
compares to the detection monitoring presumption that  concentration levels do not exceed background.
One reason for the similarity is that compliance limits  may be higher than background levels in  some
situations. An  increase over background in these situations does not necessarily imply an increase over
the compliance limit, and the latter must be formally tested. On the other hand, if a health- or risk-based
limit is below a background level, the RCRA regulations provide that the GWPS should be based on
background.
     The Subtitle D regulations for municipal solid waste landfills [MSWLF] stipulate2 that if "the
concentrations of all Appendix II constituents are shown to be at or below background values, using the
statistical procedures in §258.53(g), for two consecutive sampling events, the owner or operator... may
return  to detection  monitoring." In other words, assessment monitoring may be  exited  in  favor of
detection monitoring when concentrations at the compliance wells are statistically indistinguishable from
background for two consecutive sampling periods. While a demonstration that concentration levels are
below  background would generally not be realistic, it may be possible to show that compliance  point
levels of contaminants  do not exceed an upper limit computed from the background data. Conformance
to the limit would then indicate  an  inability to  statistically distinguish between background and
compliance point  concentration levels.

     If a hazardous constituent under compliance  or assessment monitoring statistically exceeds a
GWPS, the facility is  subject to corrective action.  Remedial activities must be undertaken to remove
and/or prevent the further  spread  of contamination  into groundwater.  Monitoring under corrective
action  is used to track the progress of remedial activities and to determine if the facility has returned to
compliance. Corrective action is usually preceded or accompanied by  a formal Remedial Investigation
[RI] or RCRA Facility  Investigation [RFI] to further delineate  the nature and extent of the contaminated
plume.  Corrective action may be confined to a single regulated unit if only that unit exhibits SSIs above
a standard during  the detection and  compliance/assessment monitoring phases.

     Often, clean-up  levels are established by the  Regional Administrator or State  Director during
corrective action.  Remediation must continue until these clean-up levels are met. The focus of remedial
action  and monitoring would be  on those hazardous  constituents  and well locations exceeding the
GWPSs. If specific clean-up levels have not been met, corrective action must continue until there is
evidence of a statistically significant decrease [SSD] below the compliance limit for three  consecutive
years.  At this point,  corrective action may be exited and  compliance monitoring re-started.   (As
described above  and in  Chapter  7, the protocol  for  assessing corrective action compliance with a
background-type  standard can differ).  If subsequent concentrations are statistically indistinguishable
from background  or no detectable concentrations can be demonstrated for three consecutive  years in any
of the contaminants that triggered corrective measures in the first place, corrective action may be exited
in favor of detection monitoring.
2 [56 FR 51016] October 9, 1991
4.3  STATISTICAL SIGNIFICANCE IN GROUNDWATER TESTING

     The outcome of any statistical test is judged either to be statistically significant or non-significant.
In groundwater monitoring, a valid statistically significant result can force a change in the monitoring
program, perhaps even leading to remedial  activity. Consequently, it is important to understand what
statistically significant results represent and what they do not. In the language of groundwater hypothesis
testing (Chapter 3),  a statistically significant test result is a decision to reject the null hypothesis (Ho)
and to accept the alternative hypothesis (//A), based on the observed pattern of the sample data. At the
most elementary level, a statistically significant increase [SSI] (the kind  of result typically of interest
under RCRA detection and compliance monitoring) represents an observed increase in concentration at
one or more compliance wells. In order to be declared an SSI, the change in concentration must be large
enough after accounting for variability in the sample data, that the result is unlikely  to have occurred
merely by chance. What constitutes a statistically  significant result depends on the phase  of monitoring
and the type of statistical test being employed.

     If the detection monitoring statistical test being used is a t-test or Wilcoxon rank-sum test
(Chapter 16), an SSI occurs whenever the t-statistic or W-statistic is larger than an α-level critical point
for the test. If a retesting procedure is chosen using a prediction limit (Chapter 19), an SSI occurs only
when both the initial compliance sample or initial mean/median and one or more resamples all exceed
the upper prediction limit. For control charts (Chapter 20), an SSI occurs whenever either the CUSUM
or Shewhart portions of the chart exceed their respective control limits. In another variation, an SSI only
occurs if one or another of the CUSUM or Shewhart statistics exceeds the control limits when
recomputed using one or more resamples. For tests of trend (Chapter 17), an SSI is declared whenever
the slope is significantly greater than zero at some significance level α.
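
     As a concrete illustration of one of these comparisons, the sketch below computes a basic normal-
theory upper prediction limit for a single future sample from n background measurements. It omits the
retesting procedures and multiple-comparison adjustments discussed in Chapter 19, and the background
values shown are hypothetical.

    # Sketch of a basic normal-theory upper prediction limit for one future sample,
    # computed from n background measurements. The retesting adjustments described in
    # Chapter 19 are not included, and the background values are hypothetical.
    import numpy as np
    from scipy.stats import t

    def upper_prediction_limit(background, alpha=0.01):
        x = np.asarray(background, dtype=float)
        n = x.size
        return x.mean() + t.ppf(1 - alpha, df=n - 1) * x.std(ddof=1) * np.sqrt(1 + 1.0 / n)

    background = [3.1, 2.8, 3.5, 3.0, 2.9, 3.3, 3.6, 3.2]   # hypothetical background concentrations (ppb)
    print(round(upper_prediction_limit(background), 2))      # a compliance sample above this limit would flag a potential SSI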

     In compliance/assessment monitoring, tests are often made against a fixed compliance limit or
GWPS. In this setting, one can utilize a confidence interval around a mean, median, upper percentile or a
trend  line (Chapter  21). A confidence interval is an  estimated concentration or measurement range
intended to contain a given statistical characteristic of the population from which the sample is drawn. A
most common formulation is a two-way confidence interval around a normally-distributed mean μ, as
shown below:

        \bar{x} - t_{1-\alpha,\,n-1}\,\frac{s}{\sqrt{n}} \;\le\; \mu \;\le\; \bar{x} + t_{1-\alpha,\,n-1}\,\frac{s}{\sqrt{n}}                                        [4-1]

where x̄ is the mean of a sample of size n, s is the sample standard deviation, and t_{1-α, n-1} is an upper
percentile selected from a Student's t-distribution. By constructing a range around the sample mean (x̄),
this confidence interval is designed to locate the true population mean (μ) with a high degree of
statistical confidence (1-2α) or, conversely, with a low probability of error (2α). If a one-way lower
confidence interval is used, the right-hand term in equation [4-1] would be replaced by +∞ at confidence
level 1-α. In a similar fashion, the upper 1-α confidence interval would be defined in the range from -∞
for the left-hand term to the right-hand term in equation [4-1].
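
     A brief numerical sketch of equation [4-1] follows. It computes both the two-way interval and the
corresponding one-way lower confidence limit on a normal mean, and then compares that lower limit
against a fixed GWPS in the manner described in the next paragraphs; the sample concentrations and the
GWPS value are illustrative only.

    # Sketch of equation [4-1]: a two-way confidence interval on a normal mean (confidence
    # 1 - 2*alpha) and the one-way lower confidence limit (confidence 1 - alpha) that is
    # compared against a fixed GWPS. Sample values and the GWPS are hypothetical.
    import numpy as np
    from scipy.stats import t

    def mean_confidence_limits(sample, alpha=0.05):
        x = np.asarray(sample, dtype=float)
        n = x.size
        half_width = t.ppf(1 - alpha, df=n - 1) * x.std(ddof=1) / np.sqrt(n)
        return (x.mean() - half_width, x.mean() + half_width), x.mean() - half_width

    sample = [8.2, 7.9, 9.1, 8.5, 8.8, 9.4, 8.0, 8.6]       # hypothetical compliance well data (ppb)
    gwps = 7.0                                               # hypothetical fixed standard (ppb)
    two_way, lcl = mean_confidence_limits(sample)
    print(two_way, lcl, lcl > gwps)                          # True here would indicate an SSI over the GWPS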

     When using a lower confidence interval on the mean, median,  or upper percentile,  an SSI occurs
whenever the lower edge of the confidence interval range exceeds the GWPS. For a confidence interval
around a trend line, an SSI is declared whenever the lower confidence limit around the estimated trend
line first exceeds the GWPS at some point in time. By requiring that a lower confidence limit be used as
the basis of comparison, the statistical test will account for data variability  and ensure that the apparent
violation is unlikely to have occurred by chance. Figure 4-1 below visually depicts comparisons to a
fixed GWPS of lower confidence intervals both for a stationary statistic like a mean and around an
increasing trend. Where the confidence interval straddles the limit, the test results are inconclusive. In
similar fashion, an SSD can be identified by using upper confidence intervals.

      Figure 4-1. Confidence Intervals Around Means, Percentiles, or Trend Lines

       [Figure: confidence intervals over time for means/percentiles and for an increasing trend, each
       compared against a fixed GWPS; results are out-of-compliance where the interval lies entirely
       above the GWPS]
     SSIs offer the primary statistical justification for moving from detection monitoring to compliance
monitoring, or from compliance/assessment monitoring to corrective action.  However, it is important
that an SSI be  interpreted correctly. Any SSI at a compliance well represents a probable increase in
concentration level,  but it does not automatically imply or prove that contaminated groundwater from
the facility is the cause of the increase. Due to the complexities of the groundwater medium and the
nature of statistical testing, there are numerous reasons why a test may exhibit a statistically significant
result. These may or may not be indications of an actual release from a regulated unit.
     It is always reasonable to allow for a separate demonstration once an SSI occurs, to determine
whether or not the increase is actually due to a contaminant release. Such a demonstration will rely
heavily on hydrological and geochemical  evidence from the site, but could include additional  statistical
factors. Key questions and factors to consider are listed in the following sections.

4.3.1 STATISTICAL FACTORS

    ❖  Is the result a false positive? That is, were the data tested simply an unusual sample of the
        underlying population, triggering an SSI? Generally, this can be evaluated with repeat sampling.
    ❖  Did the test correctly identify an actual release of an indicator or hazardous constituent?
    ❖  Are there corresponding SSIs in upgradient or background wells? If so, there may be evidence of
        a natural in-situ concentration increase, or perhaps migration from an off-site source.
    ❖  Is there evidence of significant concentration differences between separate upgradient or
        background wells, particularly for inorganic constituents? If so, there may be natural spatial
        variations between distinct well locations that have not been accounted for. These spatial
        differences could be local or systematic (e.g., upgradient wells in one formation or zone;
        downgradient wells in another).
    ❖  Could observed SSIs for naturally occurring analytes be due to longer-term (i.e., seasonal or
        multi-year) variation? Seasonal or other cyclical patterns should be observable in upgradient
        wells. Is this change occurring in both upgradient and downgradient wells? Depending on the
        statistical test and frequency of sampling involved, an observed SSI may be entirely due to
        temporal variation not accounted for in the sampling scheme.
    ❖  Do time series plots of the sampling data show parallel "spikes" in concentration levels from
        both background and compliance well samples that were analyzed at about the same time?
        Perhaps there was an analytical problem or change in lab methodology.
    ❖  Are there substantial correlations among within-well constituents (in both upgradient and
        downgradient wells)? Highly correlated analytes treated as independent monitoring constituents
        may generate incorrect significance levels for individual tests.
    ❖  Were trends properly accounted for, particularly in the background data?
    ❖  Was a correct assumption made concerning the underlying distribution from which the
        observations were drawn (e.g., was a normal assumption applied to lognormal data)?
    ❖  Was the test computed correctly?
    ❖  Were the data input to the test of poor quality? (see various factors below)
4.3.2 WELL SYSTEM  DESIGN AND SAMPLING FACTORS

    ❖  Were early sample data following well installation utilized in statistical testing? Initial well
        measurements are sometimes highly variable during a 'break in' sampling and analysis period
        and potentially less trustworthy.
    ❖  Was there an effect attributable to recent well development, perhaps due to the use of hazardous
        constituent chemicals during development or present in drilling muds?
    ❖  Are there multiple geological formations at the site, leading to incorrect well placements?

    ❖  Has there been degradation of the well casings and screens (e.g., PVC pipe)? Deteriorating PVC
        materials can release organic constituents under certain conditions. Occasionally, even stainless
        steel can corrode and release a number of metallic trace elements.
    ❖  Have there been changes in well performance over time?
    ❖  Were there excessive holding times or incorrect use of preservatives, cooling, etc.?
    ❖  Was there incorrect calibration or drift in the field instrumentation? This effect should be
        observable in both upgradient and downgradient data and possibly over a number of sample
        events. The data themselves may be compromised or useless.
    ❖  Have there been 'mid-stream' changes in sampling procedures, e.g., increased or decreased well
        purging? Have sampling or purging techniques been consistently applied from well to well or
        from sampling event to sampling event?

4.3.3 HYDROLOGICAL FACTORS

    ❖  Does the site have a history of previous waste management activity (perhaps prior to RCRA), and
        is there any evidence of historical groundwater contamination? Previous contamination or waste
        management contaminant levels can limit the ability to distinguish releases from the regulated
        unit, particularly for those analytes found in historical contamination.
    ❖  Is there evidence of groundwater mounding or other anomalies that could lead to the lack of a
        reliable, definable gradient? Interwell statistical tests assume that changes in downgradient
        groundwater quality only affect compliance wells and not upgradient (background) wells.
        Changes that impact background wells also, perhaps in a complex manner involving seasonal
        fluctuations, are often best resolved by running intrawell tests instead.
    ❖  Is there hydrologic evidence of any migration of contaminants (including DNAPL) from off-site
        sources or from other non-regulated units? Are any of these contaminants observed upgradient of
        the regulated units?
    ❖  Have there been other prior human or site-related waste management activities which could
        result in the observed SSI changes for certain well locations (e.g., buried waste materials,
        pipeline leaks, spills, etc.)?
    ❖  Have there been unusual changes in groundwater directions and depths? Is there confidence that
        the SSI did indeed correspond to a potential unit release based on observed groundwater
        directions, distance of the well from the unit, other well information, etc.?
    ❖  Is there evidence of migration of landfill gas affecting one or more wells?
    ❖  Have there been increases in well turbidity and sedimentation, which could affect observed
        contaminant levels?
    ❖  Are there preferential flow paths in the aquifer that could affect where contaminants are likely to
        be observed or not observed?
    ❖  Are the detected contaminants consistent with those found in the waste or leachate of the
        regulated unit?
    ❖  Are there other nearby well pumping or extraction activities?


4.3.4 GEOCHEMICAL FACTORS

    ❖  Were the measurements that triggered the SSI developed from unfiltered or filtered trace element
        sample data? If unfiltered, is there any information regarding associated turbidity or total
        suspended solid measurements? Unusual increases in well turbidity can introduce excess
        naturally occurring trace elements into the samples. This can be a particularly difficult problem in
        compliance monitoring when comparing data to a fixed standard, but can also affect detection
        monitoring well-to-well comparisons if turbidity levels vary.
    ❖  Were there changes in associated analytes at the "triggered" well consistent with local
        geochemistry? For example, given an SSI for total dissolved solids [TDS], did measured
        cations/anions and pH also show a consistent change? As another example, slight natural
        geochemical changes can result in large specific conductance changes. Did other constituents
        demonstrate a consistent change?
    ❖  Is there evidence of a simultaneous release of more than one analyte, consistent with the
        composition of the waste or leachate? In particular, is there corollary evidence of degradation or
        daughter products for constituents like halogenated organics? For groundwater constituents with
        identified SSIs, is there a probable relationship to measured concentrations in waste or waste
        leachate? Are leachate concentrations high enough to be detectable in groundwater?
    ❖  If an SSI is observed in one or more naturally occurring species, were organic hazardous
        constituents not normally present in background and found in the waste or leachate also
        detected? This could be an important factor in assessing the source of the possible release.
    ❖  Have aquifer mobility factors been considered? Certain soluble constituents like sodium,
        chloride, or conservative volatile organics might be expected to move through the aquifer much
        more quickly than easily adsorbed heavy metals or 4-5 ring polynuclear aromatic [PNA]
        compounds.
    ❖  Do the observed data patterns (particularly for naturally occurring constituents in upgradient
        wells or other background conditions) make sense in an overall site geochemical context,
        especially as compared with other available local or regional site data and published studies? If
        not, suspect background data may need to be further evaluated for potential errors prior to formal
        statistical comparisons.
    ❖  Do constituents exhibit correlated behavior among both upgradient and downgradient wells due
        to overall changes in the aquifer?
    ❖  Have there been natural changes in groundwater constituents over time and space due to multi-
        year, seasonal, or cyclical variation?
    ❖  Are there different geochemical regimes in upgradient vs. downgradient wells?
    ❖  Has there been a release of soil trace elements due to changes in pH?

4.3.5 ANALYTICAL  FACTORS

    ❖  Have there been changes in laboratories, analytical methods, instrumentation, or procedures,
        including specified detection limits, that could cause apparent jumps in concentration levels? In
        some circumstances, using different values for non-detects with different reporting limits has
        triggered SSIs. Were inexperienced technicians involved in any of the analyses?

    ❖  Was more than one analytical method used (at different points in time) to generate the
        measurements?
    ❖  Were there changes in detection/quantification limits for the same constituents?
    ❖  Were there calibration problems, e.g., drift in instrumentation?
    ❖  Was solvent or other laboratory contamination (e.g., phthalates, methylene chloride extractant,
        acetone wash) introduced into any of the physical samples?
    ❖  Were there known or probable interferences among the analytes being measured?
    ❖  Were there "spikes" or unusually high values on certain sampling events (either for one
        constituent among many wells or related analytical constituents) that would suggest laboratory
        error?

4.3.6 DATA OR ANALYTIC ERRORS

    ❖  Were there data transcription errors (incorrect decimal places, analyte units, or data column
        entries)? Such errors can often be identified because the resulting values are highly improbable.
    ❖  Were there calculation errors either in the analytical portions (e.g., incorrect trace element
        valence assumptions or dilution factors) or in the statistical portions (mathematical mistakes,
        incorrect equation terms) of the analysis?

          CHAPTER 5.   ESTABLISHING AND  UPDATING
                                BACKGROUND
       5.1   IMPORTANCE OF BACKGROUND	5-1
         5.1.1  Tracking Natural Groundwater Conditions	5-2
       5.2   ESTABLISHING AND REVIEWING BACKGROUND	5-2
         5.2.1  Selecting Monitoring Constituents and Adequate Sample Sizes	5-2
         5.2.2  Basic Assumptions About Background	5-4
         5.2.3  Outliers in Background	5-5
         5.2.4  Impact of Spatial Variability	5-6
         5.2.5  Trends in Background	5-7
         5.2.6  Expanding Initial Background Sample Sizes	5-8
         5.2.7  Review of Background.	5-10
       5.3   UPDATING BACKGROUND	5-12
         5.3.1  When to Update	5-12
         5.3.2  How to Update	5-12
         5.3.3  Impact of Retesting	5-14
         5.3.4  Updating When Trends are Apparent	5-14
     This chapter discusses the importance and use of background data in groundwater monitoring.
Guidance is provided for the proper identification, review, and periodic updating of background. Key
questions to be addressed include:

    ❖  How should background be established and defined?
    ❖  When should existing background data sets be reviewed?
    ❖  How and when should background be updated?
    ❖  What impact does retesting have on background updating?


5.1 IMPORTANCE OF BACKGROUND

     High  quality background  data  is the  single most  important  key to a  successful  statistical
groundwater monitoring program, especially for detection monitoring. All of the statistical tests listed in
the RCRA regulations are  predicated  on  having  appropriate  and representative  background
measurements. As indicated in Chapter 3, a statistical sample is representative if the distribution of the
sample measurements best  follows the distribution of the population from which the sample is drawn.
Representative background data has a similar but slightly  different connotation. The most  important
quality of background is that it reflects historical conditions unaffected by the activities against which it
will be compared. These conditions could range from an uncontaminated aquifer to an historically
contaminated site baseline unaffected by recent RCRA-actionable contaminant releases. Representative
background data will therefore have numerical characteristics closely matching those arising from the
site-specific aquifer being evaluated.

     Background must also be appropriate  to the statistical test.  All RCRA detection monitoring tests
involve comparisons of compliance point data against background.  If natural groundwater conditions
have changed over time — perhaps due to cycles of drought and recharge — background measurements
from five or ten years  ago may  not  reflect current  uncontaminated conditions.   Similarly,  recent
background data obtained using improved analytical methods may not be comparable to older data. In
each case, older background data may  have to be discarded in favor of more recent measurements in
order to construct an appropriate comparison. If intrawell tests are utilized due to strong evidence of
spatial  variability, traditional upgradient well  background data will not provide  an appropriate
comparison.  Even if the  upgradient measurements are reflective  of uncontaminated  groundwater,
appropriate background data must be obtained from each compliance point well.  The main point is that
compliance samples should be tested against data which can best represent background conditions now
and those likely to occur in the future.

5.1.1  TRACKING NATURAL GROUNDWATER CONDITIONS

     Background measurements, especially from upgradient wells, can provide essential  information for
other than formal statistical testing. For one, background data can be used to gauge mean levels and
develop estimates of variability in naturally occurring groundwater constituents. They can also be used
to confirm the presence or absence of anthropogenic or non-naturally occurring constituents in the site
aquifer.  Ongoing sampling of  upgradient background  wells  provides a means of tracking  natural
groundwater conditions.  Changes that occur in parallel between the  compliance point and background
wells may signal site-wide aquifer changes in groundwater quality not specifically attributable to onsite
waste management.  Such observed changes may also  be indicative  of analytical  problems  due to
common artifacts of laboratory analysis (e.g., re-calibration of lab equipment,  errors in batch  sample
handling, etc.), as well as indications of groundwater mounding, changes in groundwater gradients and
direction, migration of contaminants from other locations or offsite, etc.

     Fixed  GWPS  like  maximum   contaminant  levels   [MCLs]   may  be contemplated  for
compliance/assessment monitoring or corrective action. Background data analysis is important  if it is
suspected that naturally occurring levels of the constituent(s) in question are higher than the standards or
if a given hazardous constituent does not have  a  health- or  risk-based standard.  In the first case,
concentrations in  downgradient wells may indeed exceed the standard, but may not  be attributable to
onsite waste management if natural background levels also exceed the standard.  The Parts 264 and 258
regulations recognize these possibilities, and allow for GWPS to be based on background levels.

5.2 ESTABLISHING AND  REVIEWING BACKGROUND

     Establishing appropriate background  depends on  the  statistical approach contemplated  (e.g.,
interwell vs. intrawell). This section outlines the major  considerations concerning how to select and
develop background data including monitoring constituents and sample sizes,  statistical  assumptions,
and the presence of data outliers, spatial variation or trends.  Expanding and reviewing background data
are also discussed.

5.2.1  SELECTING MONITORING  CONSTITUENTS AND ADEQUATE SAMPLE SIZES

     Due to the  cost of management, mobilization,  field labor, and especially laboratory analysis,
groundwater monitoring  can be an expensive endeavor. The most efficient way to limit costs and still
meet environmental performance requirements is  to minimize the total number of samples which must
be collected and analyzed. This will require tradeoffs between the number of monitoring constituents
chosen, and the frequency of background versus compliance well testing. The number of compliance
wells  and  annual  frequency of testing also affect  overall  costs, but are  generally  site-specific
considerations. By limiting the number of constituents and ensuring adequate background sample sizes,
it is possible to select certain statistical tests which help minimize future compliance (and total) sample
requirements.

     Selection of an appropriate number of detection monitoring constituents should be dictated by the
knowledge of waste or waste leachate composition and the corresponding groundwater concentrations.
When historical background data are available, constituent choices may be influenced by their statistical
characteristics. A few representative constituents or analytes may serve to accurately assess the potential
for a release. These constituents should stem from the regulated wastes, be sufficiently mobile and stable,
and occur at high enough concentrations to be readily detected in the groundwater. Depending on the
waste composition, some non-hazardous organic or inorganic indicator analytes may serve the same
purpose. The guidance suggests that 10-15 formal detection monitoring constituents should be
adequate for most site conditions. Other constituents can still be reported but not directly incorporated
into formal detection monitoring, especially when large, simultaneously analyzed suites (e.g., ICP trace
elements or volatile and semi-volatile organics) are run. The focus of adequate background and future
compliance test sample sizes can then be limited to the selected monitoring constituents.

      The RCRA regulations do not consistently specify how many  observations must be collected in
background. Under  the Part 265 Interim  Status regulations, four quarterly background measurements are
required during the first year of monitoring.  Recent modifications to Part 264 for Subtitle C facilities
require a  sequence of at least four observations  to be collected in background during an interval
approved by the Regional Administrator.   On the other hand, at least four measurements must be
collected from each background well  during the  first semi-annual period  along with  at least one
additional observation during each subsequent period, for Subtitle D facilities under Part 258.  Although
these are minimum requirements in the regulations, are they  adequate sample sizes  for background
definition and  use?

     Four observations from a population are rarely enough  to adequately characterize its statistical
features; statisticians generally consider sample sizes of n ≤ 4 to be insufficient for good statistical
analysis. A decent  population survey, for example, requires several hundred and often a few to several
thousand participants to generate  accurate  results. Clinical trials of medical  treatments are usually
conducted on  dozens to hundreds of patients. In groundwater tests, such large sample sizes are  a rare
luxury.  However, it is feasible to obtain small sample  sets of up to  n = 20 for individual background
wells, and potentially larger sample sizes if the data characteristics allow for pooling of multiple well
data.

     The Unified Guidance recommends that a minimum of at least 8  to 10 independent background
observations be collected before running most statistical tests. Although still a small sample size by
statistical standards, these levels allow for minimally acceptable estimates of variability and evaluation
of trend and goodness-of-fit. However, this recommendation should be considered a temporary
minimum until additional background  sampling can be conducted  and the background sample size
enlarged (see further discussions below).

     Small  sample sizes in background can be  particularly  troublesome,  especially in controlling
statistical test  false positive and negative rates. False  negative rates in detection monitoring, i.e., the
statistical  error of failing to identify a real concentration  increase above background,  are in part a
function of sample size. For a fixed false positive test rate, a smaller sample size results in a higher false
negative rate.   This  means  a decreased probability (i.e., statistical power) that real increases above
background will be detected.  With certain parametric tests,  control of the false positive rate using very
small sample sets comes at the price of extremely low power.  Power may be adequate using a non-
parametric test, but control of the false positive rate can be lost. In both cases, increased background
sample sizes result in better achievable false positive and false negative error rates.

     The  overall recommendation of the guidance is to  establish background sample sizes as large as
feasible.  The final tradeoff comes in the selection of the type of detection tests to be used. Prediction
limit, control chart, and tolerance limit tests can utilize very small future sample sizes per compliance
well (in some cases a single initial sample), but require larger background sample sizes to have sufficient
power.   Since background samples generally are obtained from historical  data  sets  (plus future
increments as needed), total annual sample sizes (and costs) can be somewhat minimized in the future.

5.2.2 BASIC ASSUMPTIONS ABOUT BACKGROUND

     Any background sample should satisfy the key statistical  assumptions described in Chapter 3.
These  include statistical independence of  the  background  measurements, temporal  and  spatial
stationarity, lack of statistical outliers, and  correct distribution assumptions of the background sample
when a parametric statistical  approach is selected.  How independence and autocorrelation impact the
establishment  of background is  presented  below, with additional discussions on outliers,  spatial
variability and  trends in the following sections.   Stationarity assumptions are considered both in the
context of temporal and spatial variation.

     Both the Part 264 and 258 groundwater regulations  require statistically independent measurements
(Chapter  2).   Statistical independence is  indicated by random data sets.  But randomness is  only
demonstrated by the presence of mean and variance stationarity and the lack of evidence for effects such
as autocorrelation, trends, spatial and temporal variation.   These tests (described  in Part II of this
guidance)  generally require at least 8 to 10 separate background measurements.

     Depending on site groundwater velocity, too-frequent sampling at any given background well can
result in highly autocorrelated, non-independent data. Current or proposed sampling frequencies can be
tested for autocorrelation or other  statistical  dependence using the diagnostic procedures in Chapter 14.
Practically speaking, the best way  to ensure  some degree  of statistical independence is to allow as much
time as possible to elapse between sampling events. But a balance must be drawn between collecting as
many measurements as possible from a given well over  a specified time period, and ensuring that the
sample measurements are statistically independent. If significant  dependence  is identified in already
collected background, the interval between sampling events may  need to be lengthened to minimize
further autocorrelation. With fewer sampling  events per evaluation period,  it is also possible that a
change in  statistical method may be needed, say from analysis of variance [ANOVA], which requires at
least 4 new background measurements per evaluation, to prediction limits or control  charts, which may
require new background only periodically (e.g.,  during a biennial update).
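
     As a simple screen for temporal dependence (the formal diagnostic procedures appear in Chapter 14), the
lag-1 sample autocorrelation of a background series can be examined; values near zero are consistent with
independent measurements. A minimal sketch, assuming the numpy library, an evenly spaced series, and no
non-detects (the data shown are hypothetical):

    import numpy as np

    def lag1_autocorrelation(x):
        # Correlation of each measurement with the one collected just before it
        x = np.asarray(x, dtype=float)
        d = x - x.mean()
        return float(np.sum(d[1:] * d[:-1]) / np.sum(d * d))

    series = [10.2, 11.5, 10.8, 12.0, 11.1, 10.6, 11.9, 12.3]   # hypothetical quarterly results
    print(round(lag1_autocorrelation(series), 2))
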
                                              5-4                                    March 2009

-------
Chapter 5. Background	Unified Guidance

5.2.3 OUTLIERS IN  BACKGROUND

     Outliers or observations not derived from the same population as the rest of the sample violate the
basic statistical assumption of identically-distributed measurements. The Unified Guidance recommends
that outlier testing be performed on background data, but that outliers generally not be removed unless some
basis for a likely error or discrepancy can be identified.  Such possible errors or  discrepancies could
include data recording errors, unusual sampling and laboratory procedures or conditions, inconsistent
sample turbidity, and values significantly outside the historical ranges of background data. Management
of potential outliers carries both positive and negative risks, which should be carefully understood.

      If an outlier value with much higher concentration  than other background  observations is not
removed from  background prior to statistical testing, it will tend to increase both the background sample
mean and  standard deviation.  In turn,  this may  substantially raise the magnitude  of a parametric
prediction limit or control limit calculated from that sample. A subsequent compliance well test against
this background  limit will be much less likely to identify an exceedance. The  same is true with non-
parametric prediction limits, especially when the maximum background value is taken as the prediction
limit. If the  maximum  is an  outlier  not representative  of the background population, few truly
contaminated compliance wells are likely to be identified by such a test, lowering the statistical power of
the method and the overall quality of the statistical monitoring program.
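
     The numerical impact of a single extreme value is easy to demonstrate. The sketch below (illustrative only;
the data are hypothetical, the formula assumes normality, and the scipy library is assumed available) compares
a parametric 99% upper prediction limit computed with and without one suspect high value:

    import statistics
    from scipy import stats

    def upper_prediction_limit(data, conf=0.99):
        # Normal upper prediction limit for a single future sample
        n = len(data)
        mean, sd = statistics.mean(data), statistics.stdev(data)
        kappa = stats.t.ppf(conf, df=n - 1) * (1.0 + 1.0 / n) ** 0.5
        return mean + kappa * sd

    background = [12.0, 14.5, 13.2, 15.1, 12.8, 14.0, 13.6, 12.4]   # hypothetical results
    with_outlier = background + [85.0]                              # one suspect high value added

    print(round(upper_prediction_limit(background), 1))
    print(round(upper_prediction_limit(with_outlier), 1))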

     Because  of these concerns, it may  be advisable at times to remove high-magnitude outliers in
background even if the reasons for these apparently extreme observations are  not  known. The overall
impact of removal will tend to improve the power of prediction limits and control charts, and thus result
in a more environmentally protective program.

     But strategies that involve automated evaluation and removal of outliers may unwittingly eliminate
the evidence of real and  important changes to background conditions.  An example  of this phenomenon
may have occurred during the  1970s in some early ozone  depletion measurements over Antarctica
(http://www.nas.nasa.gov/About/Education/Ozone/history.html).  Automated  computer   routines for
outlier detection apparently removed several measurements indicating a sharp reduction in ozone
concentrations, and thus delayed identification of the enlarging ozone hole by many years.  Later
review of the raw observations revealed that the automated routines had statistically classified as
outliers those measurements which were more extreme than most of the data from that period. Thus,
there is some merit in saving  and revisiting apparent 'outliers' in future investigations,  even if removed
from present databases.

     In groundwater data collection and testing, background conditions may not be static over time.
Caution should be observed in removing observations which may signal a change in natural groundwater
quality. Even when conditions have not  changed, an apparently extreme  measurement may represent
nothing more  than a portion of the background  distribution that has  yet  to  be  observed.   This is
particularly true if the background data set contains fewer than 20 samples.
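
     This caution can be quantified: when measurements are independent and drawn from the same continuous
distribution, the chance that the next uncontaminated sample will exceed the current background maximum is
1/(n+1). The quick calculation below (a sketch, not a guidance formula) shows why small background sets
regularly produce values beyond the observed range:

    # Probability that a new, uncontaminated sample exceeds the background maximum of n values
    for n in (8, 10, 20, 40):
        print(n, round(1.0 / (n + 1), 3))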

     In balancing these contrasting risks in retaining or removing one or more outliers, analyses of
historical data patterns can sometimes provide more definitive information depending on the types of
analytes and methods.  For example, if a potential order-of-magnitude higher outlier is identified in a
sodium data set used as a monitoring constituent, cation-anion  balances can help  determine if this
change is geochemically probable.  In this  case,  changes  to other intrawell ions or TDS should be

                                             5-5                                     March 2009

-------
Chapter 5. Background	Unified Guidance

observed.  Similarly, if a trace element outlier is identified in a single well sampling event and occurred
simultaneously with other trace element maxima measured using the same analytical method (e.g., ICP-
AES) either in the same well or groups of wells, an analytical  error should be strongly suspected.  On
the other hand, an isolated increase without any other evidence could be a real but extreme background
measurement.  Ideally,  removal of one or more statistically identified outliers should be based on other
technical information or knowledge which can support that decision.

5.2.4 IMPACT  OF SPATIAL VARIABILITY

     In the absence of contamination, comparisons made between upgradient-to-downgradient wells
assume that the concentration distribution is spatially stationary across the well field (Chapter 3). This
implies that every well should have the same population mean and variance, unless a release occurs to
increase the concentration levels at  one or more compliance wells. At many sites, this is not the case for
many naturally occurring constituents. Natural or man-made differences in mean levels — referred to as
spatial variability or spatial variation — impact how background must be established.

     Evidence of spatial variation should drive the selection of an intrawell statistical  approach  if
observed among wells known to be uncontaminated (e.g., among a group of upgradient  background
locations).  Lack  of spatial  mean differences and a common variance allow for interwell comparisons.
Appropriate background differs between the two approaches.

     With  interwell tests,  background is derived from distinct, initially upgradient background wells,
which may be enhanced by data from historical compliance wells also shown not to exhibit significant
mean and variance differences. Future data from each of these compliance wells are then tested against
this common background.  On  the other hand, intrawell background is derived  from and represents
historical groundwater conditions in each individual compliance well. When the population mean levels
vary across a well field,  there is little  likelihood  that the upgradient background will provide an
appropriate comparison by which to judge any given compliance well.

     Although spatial variability impacts the choice of background, it does so only for those constituents
which evidence  spatial differences across the well  field.  Each  monitoring constituent should be
evaluated on its own statistical merits.  Spatial  variation in some constituents (e.g., common ions and
inorganic parameters) does not preclude the use of interwell background for other infrequently detected
or non-naturally  occurring analytes.  At  many  sites,  a mixture of  statistical  approaches may be
appropriate: interwell tests for part of the monitoring list and intrawell tests for another portion. Distinct
background observation sets will need to be developed under such circumstances.

     Intrawell background  measurements should be selected  from the available historical samples  at
each  compliance  well  and should include  only those  observations  thought  to  be uncontaminated.
Initially, this might result in very few measurements (e.g., 4 to 6). With such a small background sample,
it can be very difficult to develop an adequately  powerful intrawell prediction limit or control chart, even
when retesting is employed (Chapter 19).  Thus, additional background data will be needed to augment
the testing power.  One option is to periodically augment the existing background data base with recent
compliance well  samples  (discussed in  a further section below).  Another possible remedy  is to
statistically  augment  the   available  sample data  by  running  an analysis  of  variance [ANOVA]
simultaneously on all the sets of intrawell background from the  various upgradient and compliance wells
(see Chapter 13). The root mean squared error [RMSE] from this procedure can be used in place of the

                                              5-6                                    March 2009

-------
Chapter 5. Background	Unified Guidance

background standard deviation in parametric prediction and control limits to substantially increase the
effective background sample size of such tests, despite the limited number of observations available per
well.
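
     A rough sketch of this calculation follows (the well data are hypothetical and the numpy library is assumed;
pooling the within-well sums of squares is equivalent to the one-way ANOVA root mean squared error). The RMSE
carries N - k degrees of freedom, where N is the total number of observations and k the number of wells, rather
than the few degrees of freedom available at any single well:

    import numpy as np

    def anova_rmse(groups):
        # Pooled within-well standard deviation = one-way ANOVA root mean squared error
        sse = sum(np.sum((np.asarray(g, dtype=float) - np.mean(g)) ** 2) for g in groups)
        df = sum(len(g) - 1 for g in groups)    # N - k degrees of freedom
        return (sse / df) ** 0.5, df

    wells = [                                   # hypothetical intrawell background, 4-6 samples per well
        [21.0, 23.5, 22.1, 24.0],
        [30.2, 28.9, 31.5, 29.7, 30.8],
        [18.4, 19.9, 17.8, 20.3, 19.1, 18.7],
    ]
    rmse, df = anova_rmse(wells)
    print(round(rmse, 2), df)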

     This strategy will only work if the key assumptions of ANOVA can be satisfied (Chapter 17),
particularly the requirement of equal variances  across wells. Since natural differences in mean levels
often correspond to similar differences in variability, a transformation of the data will often be necessary
to homogenize the variances prior to running the ANOVA. For  some constituents, no transformation
may work  well enough to  allow the  RMSE to be used as a  replacement estimate for the  intrawell
background standard deviation. In that case, it  may not be possible to construct reasonably  powerful
intrawell background limits  until background has been updated once or twice (see Section 5.3).

5.2.5 TRENDS IN BACKGROUND

     A key implication of the independent and identically distributed assumption [i.i.d] is that a series
of sample measurements  should be stationary over time (i.e., stable  in mean level and variance). Data
that are trending upward or downward violate this assumption since  the  mean level is changing.
Seasonal fluctuations also violate this assumption since both the mean and variance will likely oscillate.
The proper handling of trends  in background depends on the statistical approach and the cause of the
trend.  With interwell tests and  a common  (upgradient) background,  a trend can  signify several
possibilities:

   »»»  Contaminated background;
   »»»  A 'break-in' period following new well installation;
   »»»  Site-wide changes in the aquifer;
   »»»  Seasonal fluctuations, perhaps on the order of several months to a few years.
     If upgradient well background becomes contaminated, intrawell testing may be needed to avoid
inappropriate comparisons.  Groundwater flow  patterns should also be  re-examined to determine if
gradients are properly defined or if groundwater mounding might be occurring. With newly-installed
background wells, it may be necessary to discard  initially collected observations  and to wait several
months for  aquifer disturbances due to  well construction  to stabilize.  Site-wide changes  in  the
underlying  aquifer should be identifiable as similar trends in both upgradient and compliance wells. In
this case, it might be possible to  remove a common  trend from both the background and compliance
point  wells and to perform interwell testing on the trend residuals. However, professional statistical
assistance may be needed to do this correctly. Another option would be to switch to intrawell trend tests
(Chapter 17).

     Seasonal fluctuations in interwell background which are also observed in compliance wells can be
accommodated by modeling the  seasonal trend  and removing it from all background and compliance
well data.  Data  seasonally-adjusted in this way (see Chapter 14  for details) will  generally be less
variable than the unadjusted measurements and lead to more powerful tests than if the seasonal patterns
had been ignored. For this  adjustment to work  properly, the same seasonal trend  should be  observed
across the well field and not be substantially different from well to well.
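
     One simple form of seasonal adjustment (Chapter 14 gives the guidance procedure; the sketch below is a
simplified version that assumes a fixed quarterly cycle, complete years of data, no non-detects, and the numpy
library) subtracts each season's mean from its measurements and adds back the grand mean, leaving
deseasonalized values centered at the same overall level:

    import numpy as np

    def deseasonalize(values, season_index, n_seasons=4):
        # Replace each value with: value - (its season's mean) + (grand mean)
        values = np.asarray(values, dtype=float)
        seasons = np.asarray(season_index) % n_seasons
        grand_mean = values.mean()
        adjusted = values.copy()
        for s in range(n_seasons):
            mask = seasons == s
            adjusted[mask] = values[mask] - values[mask].mean() + grand_mean
        return adjusted

    quarterly = [50, 62, 55, 47, 52, 64, 57, 45]    # hypothetical two years of quarterly data
    quarters = [0, 1, 2, 3, 0, 1, 2, 3]             # quarter index of each measurement
    print(np.round(deseasonalize(quarterly, quarters), 1))
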
                                              5-7                                    March 2009

-------
Chapter 5. Background	Unified Guidance

     Roughly linear trends in intrawell background usually signify the need to switch from an intrawell
prediction limit or control chart to an explicit trend test, such as linear regression or the Mann-Kendall
(Chapter 17). Otherwise the background variance will be overestimated and biased on the high side,
leading to  higher than expected and ultimately less powerful prediction and control  limits. Seasonal
fluctuations in intrawell background can be treated in one of two ways. A seasonal Mann-Kendall trend
test built to accommodate such fluctuations can be employed (Section 14.3.4).  Otherwise, the seasonal
pattern can be estimated and removed from the background data, leaving a  set of seasonally-adjusted
data to be  analyzed with  either  a prediction limit or control chart.  In  this  latter approach, the same
seasonal pattern needs to be extrapolated beyond the current background to more recent measurements
from the compliance well being tested. These later observations also need to be seasonally-adjusted prior
to comparison against the adjusted background, even if there is not enough compliance data yet collected
to observe the same seasonal cycles.

     When trends are apparent in background, another option is to modify the groundwater monitoring
list to include only those constituents that appear to be temporally stable. Only certain analytes may
indicate evidence of trends or seasonal fluctuations. More powerful statistical  tests might be constructed
on constituents that appear to be stationary. All such changes to the monitoring list and method of
testing may require approval of the Regional Administrator or State Director.

5.2.6 EXPANDING  INITIAL BACKGROUND SAMPLE SIZES

       In the initial development of a detection monitoring statistical program under a permit or other
legal mechanism, a period of review will identify the appropriate monitoring constituents. For new sites
with no  prior data, plans for initial background definition need to be developed as part of  permit
conditions.  A more typical situation occurs for  interim status or older facilities which have already
collected substantial historical data in site monitoring wells.  For the most part, the  suggestions below
cover ways of expanding background data sets from existing information.

       Under  the  RCRA interim status regulations, only a single upgradient well is required as a
minimum.   Generally speaking, a  single  background well  will not generate observations that are
adequately representative of the underlying aquifer.  A single background well draws groundwater from
only one possible background location. It is accordingly not possible to determine if spatial variation is
occurring in the upgradient aquifer.  In addition, a single background well can only be sampled so often
since measurements that are collected too frequently run the risk of being autocorrelated. Background
observations collected from a single well are typically neither representative of the aquifer nor numerous
enough to construct powerful, accurate statistical tests. One way to expand background is to
install at least 3-4 upgradient wells and collect additional data under permit.

       The early RCRA regulations  also allowed for aliquot replicate sampling as a means of expanding
background and other well sample  sizes.   This approach consisted of analyzing  splits or  aliquots of
single water quality samples.  As indicated in Chapter 2, this approach is not recommended in the
guidance. The generally limited analytical variability among aliquots does not adequately capture the overall
variation seen in independent water quality samples, and results in incorrect estimates of variability and
degrees of freedom (a function of sample size).

       Existing historical groundwater well data under consideration will need to meet the assumptions
discussed earlier in this chapter (independence, stationarity, etc.), including the use of statistical methods

                                               5-8                                     March 2009

-------
Chapter 5.  Background	Unified Guidance

which can deal with outliers, spatial and temporal variation, and trends.  Presuming these
conditions are met, it is statistically desirable to develop as large a background sample size as practical.
But no matter how many measurements are utilized, a larger sample size is advantageous only if the
background samples are both appropriate to the tests selected and representative of baseline conditions.

      In limited  situations, upgradient-to-downgradient,  interwell comparisons may be determined to be
appropriate using ANOVA testing of well mean differences.  To ensure appropriate and representative
background,  other conditions may also need to be satisfied when data from separate wells are pooled.
First, each  background well  should be  screened at the same hydrostratigraphic position as other
background wells.  Second, the groundwater chemistry at each of these wells should be similar. This can
be checked via  the use of standard geochemical bar charts, pie charts, and tri-linear diagrams  of the
major constituent groundwater ions and cations (Hem, 1989).  Third, the statistical characteristics of the
background wells should be similar — that is, they should be spatially stationary, with approximately
the same means and variances.   These conditions are particularly important for major water  quality
indicators, which generally reflect aquifer-specific characteristics.   For infrequently detected analytes
(e.g.,  filtered trace  elements like  chromium, silver,  and zinc), even data collected from wells from
different aquifers and/or geologic strata  may be statistically  indistinguishable  and  also eligible for
pooling on an interwell basis.

      If a one-way ANOVA (Chapter 13) on the set of background wells finds significant differences in
the mean  levels  for  some constituents,  and hence,  evidence of spatial variability, the guidance
recommends using intrawell tests. The data gathered from the  background wells will generally not be
used  in  formal  statistical  testing,  but are  still invaluable in ensuring that appropriate  background  is
selected.1 As indicated in the discussions above and Chapter 13, it may be possible to pool constituent
data from a number of upgradient and/or compliance wells having a common variance when parametric
assumptions  allow, even if mean differences exist.

      When larger historical databases are available, the data can be reviewed  and diagnostically tested
to determine which observations best represent  natural groundwater conditions  suitable  for  future
comparisons. During this  review, all historical well data collected from both upgradient and compliance
wells can be evaluated for potential inclusion into background.  Wells suspected of prior contamination
would need to be excluded, but otherwise each uncontaminated data point adds to the overall statistical
picture of background conditions at the  site and can be used to enlarge the background database.
Measurements can be preferentially selected to establish background samples, so long as a consistent
rationale is used (e.g., newer analytical methods, substantial outliers in a portion of a data set, etc.).
Changes to an aquifer over time may require selecting newer data representing current groundwater
quality over earlier results, even if the earlier data are valid.
1 If the spatial variation is ignored and data are pooled across wells with differing mean levels (and perhaps variances) to run
  an interwell parametric prediction limit or control chart test, the pooled standard deviation will tend to be substantially
  larger than expected.  This will result in a higher critical limit for the test. Using pooled data with spatial variation will also
  tend to increase observed maximum values in background, leading to higher and less powerful non-parametric prediction
  limit tests. In either application, there will be a loss of statistical power for detecting concentration changes at individual
  compliance wells.  Compliance wells with naturally higher mean levels will also be more frequently determined to exceed
  the limit than expected, while real increases at compliance wells with naturally lower means will go undetected more often.

                                                
-------
Chapter 5. Background	Unified Guidance

5.2.7 REVIEW OF BACKGROUND

     As mentioned above, if a large historical database is available, a critical review of the data can be
undertaken  to  help  establish  initially  appropriate  and  representative  background samples.  We
recommend that other reviews of background also take place periodically. These include the following
situations:

    »»»  When periodically updating background, say every 1-2 years (see Section 5.3)
    »»»  When performing a 5-10 year permit review
     During these reviews, all observations designated as background should be evaluated to ensure that
they still  adequately reflect current natural or baseline groundwater  conditions.  In particular, the
background samples should be investigated for apparent trends or outliers. Statistical outliers may need
to be removed, especially if an error or discrepancy can be identified, so that subsequent compliance
tests can be improved. If  trends are indicated, a change in the statistical method or approach may be
warranted (see earlier section on "Trends in Background").

     If background has been updated or enlarged since  the last review,  and is being utilized in
parametric tests, the assumption of normality (or other distributional fit) should be re-checked to ensure
that the augmented background data are still consistent with a parametric approach. The presence of non-
detects and multiple reporting limits (especially with changes in analytical methods over time) can prove
particularly troublesome in checking distribution assumptions. The methods of Chapter 10 "Fitting
Distributions" and Chapter 15 "Handling Non-Detects" can be consulted for guidance.

     Other periodic checks of the  revised background should also be conducted, especially in relation to
accumulated knowledge from other sites regarding analyte concentration patterns in groundwater. The
following are potential sources for comparison and evaluation:

    »»»  reliable regional groundwater data studies or investigations from nearby sites;
    »»»  published literature; EPA or other agency groundwater databases like STORET;
    »»»  knowledge of typical patterns for background inorganic constituents and trace elements. An
       example is found in Table 5-1 at the end of this chapter. Typical surface and groundwater levels
       for filtered trace elements can also be found in the published literature (e.g., Hem, 1989).
       Certain common features of routine groundwater monitoring analytes  summarized  in Table 5-1
have been observed in  Region 8  and  other background data sets, which can have implications for
statistical applications.  Common water quality indicators like cations and anions, pH, TDS, and specific
conductance are almost always measurable (detectable) and generally have limited within-well
variability.  These  would be more amenable to  parametric applications;  however, these measurable
analytes are also most likely to exhibit well-to-well spatial variation and various kinds of within- and
between-well temporal variation including seasonal and annual trends.  Many of these analytes are also
highly correlated with one another within a well, and would not meet the criterion for independent data if
simultaneously used as monitoring constituents.

       A second group of common indicator analytes (nitrate/nitrite species, fluoride, TOC and TOX)
is less frequently detected and subject to more analytical detection instability (higher and lower


                                             
-------
Chapter 5. Background	Unified Guidance

detection/quantitation limits).  As such, these analyte data are somewhat less reliable.  There is less
likelihood of temporal variation, although they can exhibit spatial well differences.

       Among routinely monitored 0.45µ-filtered trace elements, different groups stand out.  Barium is
routinely detected with limited variation within most wells, but does exhibit spatial variation.  Arsenic
and selenium  commonly occur in groundwater as oxyanions, and data can range from virtually non-
detectable to  always detected in  different site  wells.   The largest group  of trace elements can be
considered colloidal  metals (Sb, Al, Be, Cd, Cr, Co, Fe, Hg, Mn, Pb, Ni, Sn, Tl, V and Zn).  While Al,
Mn and Fe are more commonly detected, variability is often quite high; well-to-well spatial  variability
can occur at  times.   The remaining colloidal metals  are  solubility-limited  in most background
groundwater, generally <1 to <10 µg/L. But even with filtration, some natural colloidal geologic solid
materials can often be detected in individual samples. Since naturally occurring Al, Mn and Fe soil solid
levels are much higher,  the effects on measured groundwater levels  are more pronounced and variable.
For most of the analytically and solubility-limited colloidal metals, there may not be any discernible well
spatial differences. Often these data can be characterized by a site-wide lognormal distribution, and it may
be possible to pool individual well data to form larger background sample sizes.

       With unfiltered trace element data, it is more difficult to generalize even regarding background
data.  The method of well sample extraction and the aquifer characteristics will determine how much
solids material may be present in the samples.  Excessive amounts of sample solids can result in higher
levels  of  detection but  also elevated average values  and variability even for solubility-limited trace
elements.  The effect is most clearly  seen when TSS is simultaneously collected with unfiltered data.
Increases are proportional  to the amount of TSS and the natural background levels for trace elements in
soil/solid materials.  It is  recommended that TSS always be simultaneously monitored with unfiltered
trace elements.

       Most trace organic monitoring constituents are absent or non-detectable under clean background
conditions. However, with existing upgradient sources, it is more difficult to generalize.  More soluble
constituents like benzene or chlorinated hydrocarbons may be amenable to parametric distributions,  but
changes in groundwater levels or direction can drastically affect observed levels.  For sparingly soluble
compounds like polynuclear aromatics (e.g., naphthalene), aquifer effects can result in highly  variable
data less amenable to statistical applications.

       Table 5-1 was based on the use of analytical methods common in the 1990s to the present.
Detectable filtered trace element data for the most part were limited by the available analytic techniques,
generally  SW-846 Method  6010 ICP-AES and select AA (atomic absorption) methods with  lower
detection limits in the 1-10 ppb range. As newer methods are  incorporated (particularly Method 6020
ICP-MS  capable of parts-per-trillion  detection  limits  for  trace  elements), higher  quantification
frequencies may result in data demonstrating more complex spatial and temporal characteristics.  Table
5-1 merely provides  a rough guide to  where various data patterns  might occur. Any extension  of these
patterns to other facility data sets should be determined by the formal guidance tests in Part II.

     The background database can also be specially organized and summarized to examine common
behavior among related analytes (e.g., filtered trace elements using ICP-AES) either over time or across
wells  during common sampling events. Parallel time  series plots (Chapter 9) are very useful in this
regard.  Groups of related analytes can be graphed on the same set of axes, or groups of nearby wells for
the same analyte. With either plot, highly suspect sampling events can be identified if a similar spike in
                                              
-------
Chapter 5. Background	Unified Guidance

concentration or other unusual pattern occurs  simultaneously at  all the wells or in all the analytes.
Analytical measurements that appear to be in error might be removed from the background database.

       Cation-anion balances and other more sophisticated geochemical analysis programs can also be
used to evaluate the reliability of existing water quality background data. A suite of tests like linear or
non-parametric  correlations, simple or non-parametric ANOVA described in later chapters offer overall
methods for evaluating historical data for background suitability.
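
     A minimal sketch of a charge-balance check is shown below (illustrative only; the ion concentrations are
hypothetical and must already be expressed in milliequivalents per liter, and the acceptable percent-difference
threshold, often in the range of 5-10%, should be set by the project chemist):

    def charge_balance_error(cations_meq, anions_meq):
        # Percent difference between total cation and total anion charge (meq/L)
        total_cat, total_an = sum(cations_meq), sum(anions_meq)
        return 100.0 * (total_cat - total_an) / (total_cat + total_an)

    # Hypothetical major-ion results, already converted to meq/L
    cations = {"Ca": 3.2, "Mg": 1.6, "Na": 2.1, "K": 0.1}
    anions = {"HCO3": 3.9, "SO4": 1.8, "Cl": 1.4}
    print(round(charge_balance_error(cations.values(), anions.values()), 1))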

5.3  UPDATING  BACKGROUND

     Due both to the complex behavior of groundwater and the need for sufficiently large sample sizes,
background once obtained should not  be regarded as a single fixed quantity.  Background  should be
sampled regularly throughout the  life of the facility, periodically reviewed and revised as necessary. If a
site  uses traditional,  upgradient-to-downgradient  comparisons,  it might seem  that updating of
background is  conceptually simple: collect new measurements from each background well at each
sampling event  and add these to the overall  background sample. However, significant trends or changes
in one or more upgradient wells might indicate problems with individual wells, or be part of a larger site-
wide groundwater change. It is worthwhile to consider the following principles for updating, whether
interwell or intrawell testing is used.

5.3.1 WHEN  TO  UPDATE

     There are  no firm rules on how often to update background data. The Unified Guidance  adopts the
general principle that  updating should occur when enough new measurements have been collected to
allow a two-sample statistical comparison between the existing background data and a potential set of
newer data. As mentioned in the following section, trend testing might also be used. With quarterly
sampling, at least 4 to 8 new measurements should be  gathered to enable such a test; this implies that
updating would take place every  1-2 years. With semi-annual sampling, the same principle would call
for updating every 2-3 years.

     Updating  should generally not  occur more frequently,  since adding  a  new observation to
background every one or two sampling rounds does not allow a  statistical evaluation of whether the
background mean is stationary over time. Enough new data needs to be collected to ensure that a test of
means (or medians in the case of non-normal data) can be conducted. Adding individual observations to
background can introduce subtle trends that might go  undetected  and ultimately reduce the statistical
power of formal monitoring tests.

     Another practical aspect is that when background is updated,  all statistical background limits (e.g.,
prediction and control limits) need to be recomputed to account for the revised background sample. At
complex  sites,  updating the limits at  each well and constituent  on the  monitoring list may require
substantial  effort. This includes resetting the cumulative sum [CUSUM] portions of control charts to
zero after re-calculating the control limits and prior to additional testing against those limits.  Too-
frequent updating could thereby reduce the efficacy of control chart  tests.

5.3.2 HOW TO UPDATE

     Updating  background is primarily a concern for intrawell tests, although some of the  guidelines
apply to interwell data. The common (generally upgradient) interwell background pool can be tested for
                                             
-------
Chapter 5. Background	Unified Guidance

trends  and/or changes at intervals depending  on the sampling frequencies identified above.  Those
recently collected measurements from the background well(s) can be added to the existing pool if a
Student's t-test or Wilcoxon rank-sum test (Chapter 16) finds no significant difference between the two
groups at the α = 0.01 level of significance. Individual background wells should also be evaluated in the
same manner for their respective newer data.  Two-sample tests of the interwell background data are
conducted to gauge whether or not background groundwater conditions have changed substantially since
the last update,  and are not tests  for indicating a potential  release  under detection monitoring.   A
significant t-test or Wilcoxon rank-sum result should spur a closer investigation and review of the
background sample,  in order to determine which observations are most representative of the current
groundwater conditions.
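
     A minimal sketch of such an update check is given below (the data are hypothetical; the scipy library is
assumed, the t-test presumes approximately normal data, and the Wilcoxon rank-sum test is run in its
Mann-Whitney form). A p-value below 0.01 on the chosen test would prompt a closer review of the background
sample rather than an automatic update:

    from scipy import stats

    existing = [8.1, 7.4, 9.0, 8.6, 7.9, 8.3, 8.8, 7.7]    # current background sample
    candidate = [8.4, 9.1, 8.0, 8.7]                       # newer measurements proposed for addition

    # In practice, choose one of the two tests based on the data distribution (Chapter 16)
    t_stat, t_p = stats.ttest_ind(existing, candidate, equal_var=False)
    u_stat, u_p = stats.mannwhitneyu(existing, candidate, alternative="two-sided")

    print("t-test p =", round(t_p, 3), "; rank-sum p =", round(u_p, 3))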

     With  intrawell tests using prediction limits or control charts, updating is performed both to enlarge
initially  small  well-specific  background  samples  and  to  ensure   that  more  recent  compliance
measurements are not already impacted by  a  potential  release (even if not  triggered  by the  formal
detection monitoring tests). A finding of significance using the above two-sample tests means that the
most recent data  should not be added to intrawell background.  However,  the same caveat as above
applies: these are not formal  tests  for determining a  potential release and the  existing tests and
background should continue to be used.

     Updating intrawell background should  also not occur  until at least 4 to 8 new compliance
observations have been collected. Further, a potential update is predicated on there being no statistically
significant  increase  [SSI] recorded for that well constituent, including since the last update.  Then a t-
test  or Wilcoxon rank-sum comparison can  be conducted at each compliance well between existing
intrawell background and the potential set of newer background.  A non-significant result implies that
the newer compliance data  can be re-classified as background measurements and added to the existing
intrawell background sample. On  the other hand, a  determination of significance  suggests that  the
compliance observations should be reviewed to determine whether a gradual trend or other change has
occurred that was missed by the intervening prediction limit or control  chart tests.  If intrawell tests
make use of a common pooled variance, the assumption of equal variance in the pooled wells should
also be checked with the newer data.

     Some users may wish to evaluate historical and future background data for potential trends.  If
plots of data versus time suggest either an overall trend in the combined data sets or distinct differences
in the respective  sets, linear or non-parametric trend tests covered in Chapter 17 might  be used.  A
determination of a significant trend might occur even if the two-sample tests are inconclusive, but
individual group sample sizes should be large enough to avoid flagging a trend that is based on
too few samples and may simply reflect random variation.  A trend in the newer data may reflect or depart from
the historical data conditions.  Some  form of statistical adjustments may  be necessary, but see Section
5.3.4 below.
                                             5-13                                    March 2009

-------
Chapter 5. Background	Unified Guidance

5.3.3 IMPACT OF RETESTING

     A key question when updating intrawell background is how to handle the results of retesting.2 If a
retest confirms an  SSI, background should not be updated.  Rather, some regulatory action at the site
should be taken.  But what if an initial exceedance of a prediction or control limit is disconfirmed by
retesting? According to the logic of retesting (Chapter 19), the well passes the compliance test for that
evaluation and monitoring should continue as usual. But what should be done with the initial exceedance
when it comes time to update background at the well?

     The initial exceedance may be due  to a laboratory  error or other anomaly that has caused the
observation to be an outlier.  If so, the error should be documented and not included in the updated
background sample.  But if the exceedance is not explainable as an outlier or error,  it may represent a
portion of the background population that  has heretofore not been sampled. In that case, the data  value
could be included in the updated background sample (along with the repeat sample)  as evidence of the
expanded but true range of background  variation. Ultimately,  it is important to characterize the
background conditions at the site as completely and accurately as possible, so as  to minimize both false
positive and false negative decision errors in compliance testing.

     The severity and classification of the initial  exceedance will depend on the  specific  retesting
strategy that has been implemented (Chapter 19).  Using the same background data in a parametric
prediction limit or control chart test, background limits are proportionately lower as the 1-of-m order
increases (higher m). Thus, a 1-of-4 prediction limit will be lower than a 1-of-3 limit, and similarly the
1-of-3 limit lower than for a 1-of-2 test.  An initial exceedance triggered by a 1-of-4 test limit and
disconfirmed by a repeat sample might not trigger a lower-order prediction limit test.  The initial sample
value may represent an upper tail value from the true distribution. Retesting schemes derive much  of
their statistical power by allowing more frequent initial  exceedances,  even if some  of these represent
possible  measurements  from background. The  initial and subsequent resamples taken together are
designed to identify which initial exceedances truly  represent SSIs and which do not.   These tests
presume that occasional  excursions beyond the background limit will occur. Unless the exceedance can
be documented as  an outlier or other anomaly, it should probably be included in the updated intrawell
background sample.

5.3.4 UPDATING WHEN TRENDS ARE APPARENT

     An increasing or decreasing trend may be apparent between the existing background and the newer
set  of candidate background values,  either using a time series  plot or applying Chapter 17  trend
analyses.   Should  such  trend  data  be added  to the existing background sample? Most detection
monitoring tests assume that background is stationary over time, with no discernible trends or seasonal
variation.  A mild trend will probably make very little difference, especially if a Student's t-test or Wilcoxon
rank-sum  test between the existing and candidate background data sets is non-significant. More severe
or continuing trends are likely to be flagged as SSIs by formal intrawell prediction limit or control chart
tests.
2 With interwell tests, the common (upgradient) background is rarely affected by retests at compliance point wells (unless the
  latter were included in the common pool). Should retesting fail to confirm an initial exceedance, the initial value can be
  reported alongside the disconfirming resamples in statistical reports for that facility.

                                              
-------
Chapter 5.  Background	Unified Guidance

      With interwell tests, a stronger trend in the common upgradient background may signify a change
in natural groundwater quality across the aquifer or an incomplete characterization of the full range of
background  variation. If a change  is evident,  it may be necessary to  delete  some of the earlier
background values from the updated background sample, so as to ensure that compliance testing is based
on current groundwater conditions and not on outdated measures of groundwater quality.
                                            5-15                                   March 2009

-------
Chapter 5.  Background
Unified Guidance
Table 5-1. Typical Background Data Patterns for Routine Groundwater Monitoring Analytes

 Analyte Group                        Detection Frequency            Within-Well            Typical Within-Well    Data Grouping
                                                                     Variability (CV)       Distribution

 Inorganic Constituents and Indicators
 Major ions, pH, TDS,                 High to 100%                   Generally low          Normal                 Intrawell
   Specific Conductance                                                (.1-.5)
 CO3, F, NO2, NO3                     Some to most detectable        Moderate (.2-1.5)      Normal, Log or NPM     Intrawell/
                                                                                                                     Interwell

 0.45µ Filtered Trace Elements
 Ba                                   High to 100%                   Low (.1-.5)            Normal                 Intrawell
 As, Se                               Some wells high, others        Moderate (.2-1.5)      Normal, Log or NPM     Intrawell/
                                        low to zero                    (some wells)                                  Interwell
 Al, Mn, Fe                           Low to moderate                Moderate to high       Log or NPM             Intrawell/
                                                                                                                     Interwell
 Sb, Be, Cd, Cr, Cu, Hg, Pb,          Zero to low                    Moderate to high       Log or NPM             Interwell
   Ni, Ag, Tl, V, Zn                                                   (.5->2.0)                                     or NDC

 Trace Organic and Indicator Analytes (patterns at sites with prior contamination; generally absent in clean sites)
 VOAs-BETX and                        Variable, can be high          Variable by site       Normal, Log or NPM     Intrawell,
   Cl-Hydrocarbons                                                     and wells                                     Interwell or NDC
 BNAs, Other Trace Organics           Generally low-mod              Variable by site       Normal, Log or NPM     Intrawell,
                                                                       and wells                                     Interwell or NDC
 Indicators: TOX, TPH,                Variable                       Variable by site       Normal, Log or NPM     Intrawell,
   TOC, sulfide                                                        and wells                                     Interwell or NDC

     NPM - non-parametric methods;  NDC - never-detected constituents
                                                                   5-16
        March 2009

-------
Chapter 6.  Detection Monitoring Design	Unified Guidance

CHAPTER  6.  DETECTION  MONITORING PROGRAM DESIGN
        6.1  INTRODUCTION	6-1
        6.2  ELEMENTS OF THE STATISTICAL PROGRAM DESIGN	6-2
          6.2.1  The Multiple Comparisons Problem	6-2
          6.2.2  Site-Wide False Positive Rates [SWFPR]	6-7
          6.2.3  Recommendations for Statistical Power	6-13
          6.2.4  Effect Sizes and Data-Based Power Curves	6-18
          6.2.5  Sites Using More Than One Statistical Method	6-21
        6.3  HOW KEY ASSUMPTIONS IMPACT STATISTICAL DESIGN	6-25
          6.3.1  Statistical Independence	6-25
          6.3.2  Spatial Variation: Interwell vs. Intrawell Testing	6-29
          6.3.3  Outliers	6-34
          6.3.4  Non-Detects	6-36
        6.4  DESIGNING DETECTION MONITORING TESTS	6-38
          6.4.1  T-Tests	6-38
          6.4.2  Analysis Of Variance [ANOVA]	6-38
          6.4.3  Trend Tests	6-41
          6.4.4  Statistical Intervals	6-42
          6.4.5  Control Charts	6-46
        6.5  SITE DESIGN  EXAMPLES	6-46
6.1  INTRODUCTION

      This chapter addresses the initial statistical design of a detection monitoring program, prior to
routine implementation.   It considers what important elements should be  specified  in  site permits,
monitoring development plans or during periodic reviews.  A good statistical design can be critically
important for ensuring that the routine process of detection monitoring meets the broad objective of the
RCRA regulations:  using statistical testing to accurately evaluate whether or not there is a release to
groundwater at one or more compliance wells.

      This guidance recommends a comprehensive detection monitoring program  design, based on two
key performance characteristics: adequate  statistical power and a low  predetermined site-wide false
positive rate [SWFPR].  The design approach presented in Section 6.2 was developed in response to the
multiple comparisons problem affecting RCRA and other groundwater detection programs, discussed in
Section 6.2.1. Greater detail in applying design cumulative false positives and assessing power follows
in the next three sub-sections.  In Section 6.3, consideration is given to data features that impact proper
implementation of statistical testing, such as  outliers and non-detects, using interwell  versus intrawell
tests,  as well as the presence of spatial variability or trends.  Section 6.4 provides a general discussion of
specific detection testing methods listed in the regulations and their appropriate use. Finally, Section 6.5
applies the design concepts to three hypothetical site examples.

      The principles  and statistical tests which this chapter covers for a  detection monitoring program
can also  apply  to compliance/corrective  action monitoring when a background  standard  is used.
Designing a background standards compliance program is discussed in Chapter 7 (Section 7.5).
                                              6-1                                    March 2009

-------
Chapter 6.  Detection Monitoring Design	Unified Guidance

6.2  ELEMENTS OF THE STATISTICAL PROGRAM DESIGN

6.2.1 THE  MULTIPLE COMPARISONS PROBLEM

     The foremost goal in detection monitoring  is to identify a real release to  groundwater when it
occurs.    Tests must have adequate  statistical power  to  identify  concentration  increases  above
background.  A second critical goal is to avoid false positive decision errors, evaluations where one or
more wells are falsely declared to be contaminated when in fact their concentration distribution is similar
to background. Unfortunately, there is a trade-off (discussed in Chapter 3) between maximizing power
and minimizing the false positive rate in designing a statistical testing protocol.  The statistical power of
a given test procedure using a fixed background sample size (n) cannot be improved without increasing
the risk of false positive error (and vice-versa).

     In RCRA and other groundwater detection monitoring programs, most facilities must monitor and
test for multiple constituents at all  compliance wells one or more times per year.  A separate statistical
test1 for each monitoring constituent-compliance well pair is  generally conducted semi-annually.  Each
additional background comparison test increases the cumulative risk of making a false positive
mistake, known statistically as the multiple comparisons problem.

     The false positive rate α (or Type I error) for an individual test is the probability that the test will
falsely indicate an exceedance of background. Often, a single fixed low false positive error rate typically
found in textbooks or regulation, e.g., α = .01 or .05, is applied to each statistical test performed for
every well-constituent pair at a facility.  Applying such a common false positive rate (α) to each of
several tests can result in an acceptable cumulative false positive error if the number of tests is quite
small.

      But as  the number of tests increases, the false positive rate associated with the testing network as a
whole  (i.e., across all well-constituent pairs) can be surprisingly high.  If enough tests are  run, at least
one test is likely to indicate potential contamination even if a release has not occurred. As an example, if
the testing network consists of 20 separate well-constituent pairs and a 99% confidence upper prediction
limit is used for each test (α = .01), the expected overall network-wide false positive rate is about 18%.
There is nearly a 1 in 5 chance that one or more tests will falsely identify a release to groundwater at
uncontaminated wells. For 100 tests and the same statistical procedure, the overall network-wide false
positive rate increases to more than 63%, creating additional steps to verify the lack of contamination at
falsely triggered wells.  This cumulative false positive error represents the chance of at least one false
positive among the well-constituent tests, although more than one can occur.  Controlling this cumulative
false positive error rate is essential in addressing the multiple comparisons problem.
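
     The network-wide rates quoted above follow from the relationship SWFPR = 1 - (1 - α)^n for n independent
tests each run at significance level α. A short check of the two examples (a sketch that assumes the tests are
independent):

    def cumulative_false_positive_rate(alpha, n_tests):
        # Chance of at least one false positive among n independent tests
        return 1.0 - (1.0 - alpha) ** n_tests

    print(round(cumulative_false_positive_rate(0.01, 20), 3))     # roughly 0.18
    print(round(cumulative_false_positive_rate(0.01, 100), 3))    # roughly 0.63
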
1   The number of samples collected may not be the same as the number of statistical tests (e.g., a mean test based on 2
  individual samples). It is the number of tests which affects the multiple comparisons problem.
2  To minimize later confusion, note that the Unified Guidance applies the term "comparison" somewhat differently than most
  statistical literature. In statistical theory, multiple tests are synonymous with multiple comparisons, regardless of the kind of
  statistical test employed. But because of its emphasis on retesting and resampling techniques, the Unified Guidance uses
  "comparison" in referring to the evaluation of a single sample value or sample statistic against a prediction or control chart
  limit. In many of the procedures described in Chapters 19 and 20, a single statistical test will involve two or more such
  individual comparisons, yet all the comparisons are part of the same (individual) test.

                                               6-2                                     March 2009

-------
Chapter 6.  Detection Monitoring Design	Unified Guidance

     Three main strategies (or their combination) can be used to counter the excessive cumulative false
positive error rate: 1) the number of tests can be reduced; 2) the individual test false positive rate can
be lowered, or 3) the type of statistical test can be changed.  A fourth strategy to increase background
sample sizes may also be appropriate.  Under an initial monitoring design, one usually works with fixed
historical sample sizes. However, background data can later be updated in periodic program reviews.

     To make use of these strategies, a sufficiently low target cumulative SWFPR needs to be initially
identified for design purposes. The target cumulative error applies to a certain regular time period.  The
guidance recommends and uses a value of 10% over a one-year period of testing.  Reasons for this particular
choice are  discussed in Section 6.2.2. These strategies have consequences for the overall test power of a
well monitoring network, which are considered following control of the false positive error.

     The number  of tests  depends on the number  of monitoring constituents, compliance wells and
periodic  evaluations.  Statistical testing  on a  regular basis can be limited to constituents shown to be
reliable  indicators of a contaminant release  (discussed further in  Section 6.2.2). Depending  on site
conditions, some constituents may  need to be tested only at wells for a single regulated waste unit, rather
than across the entire facility well network.  The frequency of evaluation  is a program decision, but
might be modified in certain circumstances.

     Monitoring data for other parameters should still be routinely collected and reported to trace the
potential arrival of new chemicals into the groundwater, whether from changes  in waste management
practices or degradation over time into hazardous daughter products. By limiting statistically  evaluated
constituents to the most useful  indicators, the overall number of statistical tests can be reduced to help
meet the SWFPR objective.  Fewer tests also imply a  somewhat higher single test false positive error
rate, and therefore an improvement in power.

     As a second strategy, the Type I error rate (αtest) applied to each individual test can be lowered to
meet the SWFPR. Using the Bonferroni adjustment (Miller, 1981), the individual test error is designed
to limit the overall (or experiment-wise) false positive rate α associated with n individual tests by
conducting each individual test at an adjusted significance level of αtest = α/n. Computational details for
this approach are provided in a later section.
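
     A brief illustration of this allocation is given below (a sketch; the 10% annual SWFPR target comes from
this chapter, while the numbers of wells, constituents, and evaluations are hypothetical):

    target_swfpr = 0.10                        # annual site-wide false positive rate target
    wells, constituents, events = 10, 5, 2     # hypothetical network: 100 tests per year
    n_tests = wells * constituents * events

    alpha_per_test = target_swfpr / n_tests    # Bonferroni-type allocation: alpha / n
    print(n_tests, round(alpha_per_test, 4))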

      A  full Bonferroni adjustment strategy was neither implemented in previous guidance3 nor allowed
by regulation.  However, the principle  of partitioning individual test error rates to meet an overall
cumulative false positive error target is highly  recommended as a design  element in this  guidance.
Because  of RCRA regulatory limitations, its application  is  restricted to certain  detection monitoring
3 A Bonferroni adjustment was recommended in the 1989 Interim Final Guidance [IFG] as a post-hoc (i.e., 'after the fact')
  testing strategy for individual background-to-downgradient well comparisons following an analysis of variance [ANOVA].
  However, the adjustment does not always effectively limit the risks to the intended 5% false positive error for any ANOVA
  test. If more than 5 compliance wells are tested, RCRA regulations restrict the single test error rate to a minimum of α =
  1% for each of the individual post-hoc tests following the F-test. This in effect raises the cumulative ANOVA test risk
  above 5% and considerably higher with a larger number of tested wells. At least one contaminated well would typically be
  needed to trigger the initial F-test prior to post-hoc testing. This fact was also noted in the 1989 IFG. Additionally, RCRA
  regulations mandate a minimum α error rate of 5% per constituent tested with this strategy.  For sites with extensive
  monitoring  parameter lists, this  means  a substantial risk of at least one false positive test result  during any statistical
  evaluation.

                                                 6-3                                      March 2009

-------
Chapter 6.  Detection Monitoring Design	Unified Guidance

tests— prediction and tolerance limits along with control charts. Where not restricted by regulation, the
Bonferroni approach could be used to design workable single-test or post-hoc testing for ANOVAs to
meet the overall SWFPR criterion.

        Using this strategy of defining individual false positive test rates to meet a cumulative error
target, the effect on  statistical power is direct.  Given a statistical test and fixed sample  size, a lower
false positive rate coincides with lower power of the  test to detect contamination at the well.  Some
improvement in single test power can be gained by increasing background sample sizes at a fixed test
error rate.  However, the third strategy of utilizing a  different or modified  statistical test is generally
necessary.

       This strategy involves choices among certain detection monitoring tests (prediction limits, control
charts and tolerance intervals) to enhance both power and false positive error control. Except for small
sites with  a very limited  number of tests,  any of  the three detection monitoring options should
incorporate  some manner of retesting. Through proper design, retesting can simultaneously achieve
sufficiently high statistical power while maintaining control of the SWFPR.

       RECOMMENDED GUIDANCE CRITERIA

      The design of all testing strategies should specifically address the multiple comparisons problem in
light of these two fundamental  concerns— an acceptably low false positive site-wide error rate and
adequate power.  The  Unified Guidance  accordingly recommends two statistical performance criteria
fundamental to good design of a detection monitoring program:

      1.  Application of an annual cumulative SWFPR design target, suggested at 10% per year.

      2.  Use of EPA reference power curves [ERPC] to gauge the cumulative, annual ability of any
         individual  test to detect contaminated groundwater when it exists. Over the course of a
         single year assuming normally-distributed background data,  any single test performed at
         the site should have the ability to detect  3 and 4 standard deviation increases  above
         background at specific power levels at least as high as the reference curves.

     False positive  rates  (or errors)  apply  both to  individual tests and  cumulatively to all tests
conducted in some time period.  Applying the SWFPR annual 10% rate places different sites and state
regulatory programs  on an  equal footing, so that no facility is  unfairly burdened by false positive test
results. Use of a single overall target allows  a proper comparison to be made between alternative test
methods in  designing  a statistical program.  Additional  details  in applying the  SWFPR include the
following:

    »»» The SWFPR  false positive rate should be measured on a site-wide basis, partitioned among the
       total  number of annual statistical tests.

    »»» The SWFPR applies to all statistical tests conducted in an annual or calendar year period.

    »»» The total number of annual statistical tests used in SWFPR calculations depends on the number
       of valid monitoring constituents, compliance wells and evaluation periods per year.  The number
       of tests  may or may not coincide with the number of annual sampling events, for example,  if data
       for a future mean test are collected quarterly and tested semi-annually.

    »»»  The Unified Guidance recommends a uniform approach for dealing with monitoring constituents
        not historically detected in background (e.g., trace organic compounds routinely analyzed in large
        analytical suites).  It is recommended that such constituents not be included in SWFPR
        computations, and that an alternate evaluation protocol (referred to as the Double Quantification
        rule, discussed in Section 6.2.2) be used.

     Statistical power refers to the ability of a test to identify real  increases in  concentration levels
above background (true SSIs).  The power of a test is evaluated in terms of population characteristics and
represents average behavior over repeated sampling (i.e., an infinitely large number of samples).  Power is
reported as a fraction between 0 and 1, representing the probability that the test will identify a specific
level or degree  of increase above  background. Statistical power varies with the size of the average
population concentration  above background— generally fairly low  power to detect  small  incremental
concentrations and substantially increasing power at higher concentrations.

     The  ERPC describe the cumulative,  annual  statistical  power to detect  increasing  levels  of
contamination above a true background mean.  These curves are based on  specific normal detection
monitoring prediction limit tests of single future samples against background  conducted once, twice, or
four times in a year. Reference curve power is linked to relative, not absolute, concentration levels.
Actual  statistical  test power  is  closely  tied  to the underlying  variability  of  the  concentration
measurements. Since individual data set variability will differ by site, constituent, and often by well, the
EPA reference power curves provide a generalized  ability to estimate power by standardizing variability.
By convention, all background concentration data are assumed to follow a standard normal distribution
(occasionally referred to in this document as a Z-normal distribution) with a true mean μ = 0 and
standard deviation σ = 1.0. Then, increases above background are measured in increments of k standard
deviation units, corresponding to kσ mean units above baseline. When the background population can be
normalized via a transformation, the same normal-based ERPC can be used without loss of generality.

       Ideally, actual test power should be assessed using the original  concentration data and associated
variability, referred to as  effect size power analysis.  The power of any statistical test can be readily
computed and compared to the appropriate reference  curve, if not analytically, then by Monte Carlo
simulation. But the reference power curves laid out in the Unified Guidance offer an important standard
by which to judge the adequacy of groundwater statistical programs and  tests. They  can be universally
applied to all RCRA sites and offer a uniform way to assess the environmental and health protection
afforded by a particular statistical detection monitoring program.4

     Consequently,  it is  recommended that design of any detection monitoring statistical  program
include an assessment of  its ability to meet the power standards set out  in the Unified Guidance. The
reference  power curve  approach does not place  an  undue statistical burden on facility owners  or
operators, and is believed to be generally protective of human health and the environment.
4 The ERPCs are specifically intended for comparing background to compliance data in detection monitoring. Power issues
  in compliance/assessment monitoring and corrective action are considered in Chapters 7 and 22.


     Principal features of the ERPC approach include the following:

   »»»  Reference curves are based on upper 99% prediction limit tests of single future samples against
       background.  The background sample consists of n = 10 measurements, a minimally adequate
       background sample size typical of RCRA applications.  It is assumed that the background sample
       and compliance well data are normally distributed and from the same population.

   »»»  The three reference curves described below are matched to the annual frequency of statistical
       evaluations: one each for quarterly, semi-annual, and annual evaluations. The annual cumulative
       false positive testing error is maintained at 1%, testing 1, 2, or 4 single future samples annually
       against the same background.  This represents the ability to identify a release to groundwater in at
       least one of the 1, 2 or 4 tests over the course of a year. Reporting power on an annual basis was
       chosen to correspond with the application of a cumulative annual SWFPR.

   »»»  In the absence of an acceptable effect size increase (Section 6.2.4), the Unified Guidance
        recommends that any statistical test provide at least 55-60% annual power for detecting a 3σ (i.e.,
        3 standard deviation) increase above the true background mean and at least 80-85% annual
        power for detecting increases of 4σ.  The percent power criteria change slightly for the respective
        reference power curves, depending on the annual frequency of statistical evaluations. For normal
        populations, a 3σ increase above the background average approximately corresponds to the upper
        99th percentile of the background distribution, implying better than a 50% chance of detecting
        such an increase.  Likewise, a 4σ increase corresponds to a true mean greater than the upper
        99.99th percentile of the background distribution, with better than a 4-in-5 chance of detecting it.

   »»»  A  single statistical  test is not adequately powerful  unless  its  power matches or betters the
       appropriate reference curve, at least  for mean-level  increases of 3 to 4 standard deviation units.
       The same concept can be applied to the overall detection monitoring test design.  It is assumed
       for statistical  design  purposes that each individual  monitoring well and constituent is of equal
       importance, and assigned a common test false positive error.  Effective power then measures the
       overall ability of the statistical program to identify any  single constituent release in any well,
       assuming all remaining constituents and wells are at background levels. If a number of different
       statistical methods are employed  in a single design, effective power can be defined with respect
       to  the least powerful of the methods being employed.  Applying effective power in this manner
       would  ensure that every well and  constituent is evaluated with adequate statistical power to
       identify potential contamination, not just those where more powerful tests are applied.

   »»»  While the Unified  Guidance  recommends  effective power as a  general  approach, other
       considerations  may  outweigh statistical  thoroughness. Not all  wells and  constituents are
       necessarily of equal practical importance.  Specific site circumstances may also result in some
       anomalous weak test power (e.g., a number of missing samples in a background data  set for one
       or  more constituents), which might be remedied by eventually increasing background size. The
       user  needs to consider all factors including effective statistical  power criteria in assessing the
       overall strength of a detection monitoring program.

6.2.2 SITE-WIDE FALSE POSITIVE RATES [SWFPR]

     In this section, a number of considerations in developing and applying the SWFPR are provided.
Following a brief discussion of SWFPR computations, the next section explains the rationale for the
10% design target SWFPR. Additional detail regarding the selection of monitoring constituents follows,
and a final discussion of the Double Quantification rule for never-detected constituents is included in the
last section.

        For cumulative false positive error and SWFPR computations, the following approach is used. A
cumulative false positive error rate α_cum is calculated as the probability of at least one statistically
significant outcome for a total number of tests n_T in a calendar year, each run at a single-test false positive
error rate α_test, using the properties of the Binomial distribution:

                                α_cum = 1 - (1 - α_test)^n_T

        By rearranging to solve for α_test, the 10% design SWFPR (0.1) can be substituted for α_cum and the
needed per-test false positive error rate calculated as:

                                α_test = 1 - (1 - α_cum)^(1/n_T) = 1 - (0.9)^(1/n_T)

       Although these calculations are relatively straightforward and were used to develop certain K-
factor  tables in the Unified Guidance (discussed in Section  6.5  and in later chapters), a further
simplification is possible using the Bonferroni approximation.  This assumes that cumulative, annual
SWFPR is roughly the additive sum of all the individual test errors. For low false positive rates typical
of guidance  application, the Bonferroni results are satisfactorily close to the Binomial formula for most
design considerations.

     Using  this principle,  the  design 10%  SWFPR can be  partitioned  among  the  potential annual
statistical  tests  at  a facility in a number of ways.  For facilities  with  different annual  monitoring
frequencies, the SWFPR can be divided among quarterly or semi-annual period tests. Given α_SWFPR = 0.1
and n_E evaluation periods, the quarterly cumulative false positive target rate α_E at a facility conducting
quarterly testing would be α_E = α_SWFPR/n_E = 0.1/4 = 0.025 or 2.5% (and similarly for semi-annual testing).
The  total or sub-divided  SWFPR can likewise be  partitioned among dedicated monitoring well
groupings at a multi-unit facility or among individual monitoring constituents as needed.
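
     To make these computations concrete, the short sketch below (in Python, with illustrative function
names and inputs) compares the exact Binomial relation with the Bonferroni approximation and shows the
per-evaluation partitioning just described.

```python
# Sketch of the SWFPR partitioning calculations described above (illustrative inputs only).

def per_test_alpha_exact(swfpr, n_tests):
    """Exact per-test rate from the Binomial relation: alpha_test = 1 - (1 - SWFPR)**(1/n_T)."""
    return 1.0 - (1.0 - swfpr) ** (1.0 / n_tests)

def per_test_alpha_bonferroni(swfpr, n_tests):
    """Bonferroni (additive) approximation: alpha_test ~ SWFPR / n_T."""
    return swfpr / n_tests

swfpr = 0.10      # annual site-wide false positive rate target
n_tests = 100     # e.g., 5 constituents x 10 wells x 2 semi-annual evaluations
print(per_test_alpha_exact(swfpr, n_tests))       # ~0.001053
print(per_test_alpha_bonferroni(swfpr, n_tests))  # 0.001

# Partitioning the annual SWFPR among quarterly evaluation periods: alpha_E = SWFPR / n_E
n_eval = 4
print(swfpr / n_eval)                             # 0.025, i.e., 2.5% per quarterly evaluation
```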

       DEVELOPMENT AND RATIONALE FOR THE SWFPR

     The  existing RCRA Part 264 regulations  for parametric or non-parametric analysis  of variance
[ANOVA] procedures mandate a Type I error of at least 1% for any individual  test, and  at least  5%
overall. Similarly, the RCRA Part 265 regulations require a minimum 1% error for indicator parameter
tests.  The minimum false positive requirements are motivated by statistical power. If the
Type I error is set too low, the power of any given test will be unacceptably low. EPA was
historically not  able to specify a minimum level of acceptable power within the RCRA regulations. To
do so would require specification of a minimum difference of environmental concern between the null
and alternative  test hypotheses.  Limits on current knowledge about the  health and/or environmental
effects associated with incremental changes in concentration levels of Part 264 Appendix IX or Part 258
Appendix II constituents greatly complicate this task. Tests of non-hazardous  or low-hazard indicators
might have different power requirements than for hazardous constituents.  Therefore, minimum false
positive rates  were  adopted  for  ANOVA-type  procedures until  more  specific guidance  could  be
recommended.  EPA's main  concern was adequate statistical  power to  detect real contamination of
groundwater, and not enforcing commonly-used false positive test rates.

      This  emphasis is  evident  in §264.98(g)(6)  and  §258.54(c)(3)  for  detection monitoring  and
§264.99(i) and §258.55(g)(2)  for compliance monitoring. Both pairs of provisions allow the owner or
operator to demonstrate that any statistically significant difference between background and compliance
point wells or between  compliance point wells  and  the GWPS is an  artifact caused by an error in
sampling, analysis,  statistical evaluation, or natural  variation  in  groundwater chemistry.  The rules
clearly expect that there will be occasional false positive errors, but existing rules are silent regarding the
cumulative frequency of false positives at regulated facilities.

      As previously noted, it is  essentially impossible to maintain a low cumulative  SWFPR for
moderate to large monitoring networks if the Type I errors for individual tests must be kept at or above
1%.  However, the RCRA regulations do not impose similar false positive error requirements on the
remaining  control  chart, prediction  limit and tolerance interval  tests.   Strategies that incorporate
prediction limit or control chart retesting can achieve very low individual  test false positive rates while
maintaining adequate power to detect contamination.  Based on prediction limit research in  the  1990's
and after, it became clear that these alternative methods with suitable retesting could also control the
overall cumulative false positive error rate to manageable levels.

       This guidance suggests the use of an annual SWFPR of .1 or 10% as a fundamental element of
overall detection monitoring design.  The choice  of a 10% annual SWFPR was made in light  of the
tradeoffs between false positive control and testing power. An annual period was chosen to put different
sized facilities on a common footing regardless of variations in scheduled testing. It is recognized that
even with such a limited error rate, the probability of false positive outcomes over a number of years
(such  as in the lifetime of  a 5-10 year permit) will be higher.   However,  such relatively limited
eventualities can be identified and adjusted for, since the  RCRA regulations do allow for demonstration
of a false positive error.  State programs may choose to use a different annual rate such as 5%  depending
on  the circumstances.   But  some predefined  SWFPR in a given  evaluation  period  is  essential for
designing a detection monitoring program, which can  then be translated into target individual test rates
for any alternative statistical testing strategy.

      To implement this  recommendation, a given facility should identify its yearly evaluation schedule
as  quarterly,  semi-annual, or  annual. This  designation is used both  to select an appropriate EPA
reference power curve by which to gauge acceptable  power, and to select prediction limit and control
chart multipliers useful in  constructing detection  monitoring tests. Some of the strategies described in
the Unified Guidance in later chapters require that more than one observation per compliance well be
collected prior to  statistical testing. The  cumulative, annual false positive rate is linked not  to the
frequency  of sampling  but  rather to the frequency of statistical evaluation.  When resamples (or
verification resamples)  are incorporated into a statistical procedure  (Chapter 19), the  individual
resample comparisons comprise  part  of a  single test. When  a single future  mean of m  individual
observations is evaluated against a  prediction  limit,  this constitutes  a  test based on one  mean
comparison.

NUMBER OF TESTS AND CONSTITUENTS

     In designing a detection monitoring program to achieve the target SWFPR, the number of annual
statistical  tests to be conducted needs to be identified.  This number is calculated as the number  of
distinct monitoring  constituents  x the number of compliance wells in  the network x the number  of
annual evaluations.  Five constituents and 10 well locations statistically evaluated semi-annually
constitute 100 annual tests (5 × 10 × 2), since each distinct well-constituent pair represents a different
statistical test that must be evaluated against its respective background. Even smaller facilities are
likely to have a substantial number of such tests, each incrementally adding to the SWFPR.
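
     As a rough illustration of how quickly individual tests accumulate, the following sketch counts the
annual tests for the hypothetical 5-constituent, 10-well, semi-annual network above and shows the
cumulative false positive probability that would result if every test were run at a conventional 1% error
rate; names and inputs are illustrative.

```python
# Illustrative count of annual statistical tests and the cumulative false positive risk
# that would result without any adjustment of the per-test error rate.

n_constituents = 5
n_wells = 10
n_evaluations = 2                      # semi-annual statistical evaluations

n_tests = n_constituents * n_wells * n_evaluations
print(n_tests)                         # 100 annual well-constituent tests

# Chance of at least one false positive per year if every test used a conventional alpha of 0.01,
# using the Binomial relation from Section 6.2.2 -- far above a 10% annual SWFPR target.
alpha_test = 0.01
print(1 - (1 - alpha_test) ** n_tests) # ~0.63
```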

     While the retesting strategies outlined in Chapters 19 and 20 can aid tremendously in limiting the
SWFPR and ensuring adequate statistical power, there are practical limits to meeting these goals due to the
limited number of groundwater observations that can be collected and/or the number of retests which can
feasibly be run.  To help  balance the risks of false positive  and false negative errors, the number  of
statistically-tested monitoring parameters should be limited to constituents thought  to be  reliable
indicators of a contaminant release.

     The  guidance assumes that data from large suites of trace elements and organics along with a set  of
inorganic  water  quality indicators (pH, TDS, common  ions, etc.) are  routinely collected as part  of
historical  site groundwater monitoring.  The number of constituents potentially available for testing can
be quite large,  perhaps as many as 100 different analytes.  At some sites, the full monitoring lists are too
large to feasibly limit the SWFPR while maintaining sufficiently high power.

     Non-naturally  occurring chemicals such as volatile organic compounds  [VOC]  and semi-volatile
organic compounds  [SVOC]  are often viewed as excellent indicators of groundwater contamination, and
are thereby included in the monitoring programs of many facilities. There is  a common misperception
that the greater the number of VOCs and SVOCs on the monitoring list, the greater the statistical power
of the monitoring program. The reasoning is that if none of these chemicals should normally be detected
in groundwater —  barring a release —  testing for more  of them ought to improve the  chances  of
identifying contamination.

     But  including a large suite of VOCs and/or SVOCs among the mix of monitoring parameters can
be counterproductive to the  goal of maintaining adequate effective  power  for the site as a whole.
Because of the trade-off between statistical power and false positive  rates (Chapter 3), the power  to
detect groundwater contamination in one  of these wells even with a retesting strategy in place may be
fairly low unless background  sample  sizes are quite large.  This is  especially  true if the regulatory
authority only  allows for a  single retest.

     Suppose  40  VOCs  and certain  inorganic  parameters  are  to  be  tested  semi-annually at 20
compliance wells totaling  1600 annual statistical tests.  To maintain a  10% cumulative annual SWFPR,
the per-test false positive rate would then need to be set at approximately α_test = .0000625.  If only 10
constituents were selected for formal testing, the per-test rate would be increased to α_test = .00025.  For
prediction limits  and other detection tests, higher false positive test rates translate to lower K-factors and
improved  power.

      Some means of reducing the number  of tested constituents is  generally necessary to design an
effective detection monitoring system.  Earlier discussions have already suggested one obvious first step:
eliminating historically non-detected constituents in background from the formal list of detection
monitoring constituents (discussed further in the following section).  These constituents are still
analyzed and informally tested, but do not count against the SWFPR.

     Results of waste and leachate testing and possibly soil gas analysis should serve as the initial basis
for designating constituents that are reliable leak detection indicators. Such specific constituents actually
present in, or derivable from, waste or soil gas samples, should be further evaluated to determine which
can be analytically detected a reasonable proportion of the time.  This  evaluation should include
considerations of how soluble and mobile a constituent may be in the underlying aquifer. Additionally,
waste or leachate concentrations should be high enough relative  to the groundwater levels to allow for
adequate detection.  By limiting monitoring and statistical tests to a smaller set of parameters that have
reasonable detection frequencies and are significant components of the facility's waste, unnecessary
statistical tests can be avoided while focusing on the reliable identification of truly contaminated groundwater.

     Initial leachate testing should not serve as  the sole basis for designating monitoring parameters.
At many active hazardous waste facilities and solid waste landfills, the composition of the waste may
change over time. Contaminants  that initially were all non-detect may not remain so.  Because of this
possibility, the Unified Guidance recommends that the list of monitoring  parameters subject to formal
statistical evaluation be periodically reviewed, for example, every three to five years. Additional leachate
compositional  analysis and testing may be necessary, along with the measurement of constituents not on
the monitoring list but of potential health or environmental concern. If previously undetected parameters
are discovered in this evaluation, the permit authority should consider revising the monitoring  list to
reflect those analytes that will best identify potentially contaminated groundwater in the future.

     Further reductions are possible  in the number of constituents used for formal detection monitoring
tests,  even among constituents  periodically or always detected.  EPA's  experience at hazardous waste
sites and landfills across the country has shown that VOCs and  SVOCs detected in a release generally
occur in clusters; it is less common to detect only a single constituent at a given location. Statistically,
this implies that groups of detected VOCs or SVOCs are likely to be correlated. In  effect, the correlated
constituents are measuring a release in similar fashion and  not providing fully independent measures.
At petroleum refinery sites, benzene, toluene, ethylbenzene and xylenes measured in a VOC scan are
likely to be detected together.  Similarly, at sites having releases of 1,1,1-trichloroethane, perhaps 10-12
intermediate chlorinated hydrocarbon degradation compounds  can  form in the aquifer over time.
Finally, among water quality indicators like common ions and TDS, there is a great deal of geochemical
inter-relatedness.   Again,  two or three indicators from  each of these  analyte groups may suffice as
detection monitoring constituents.

     The overall goal should be to select only the most reliable monitoring constituents for detection
monitoring test purposes.  Perhaps 10-15 constituents may  be a reasonable target, depending on site-
specific needs.  Those analytes not  selected should  still continue  to be  collected and evaluated.  In
addition to using the informal test to identify previously undetected constituents described in the next
section, information on the remaining constituents (e.g., VOCs, SVOCs  and trace elements) can still be
important in assessing groundwater conditions, including additional confirmation of a detected release.

DOUBLE  QUANTIFICATION RULE

     From the previous discussion, a full set of site historical monitoring parameters can be split into
three distinct groups: a) those reliable indicators and hazardous constituents selected for formal detection
monitoring testing and contributing to the SWFPR; b) other analytes which may be occasionally or even
frequently detected and will be monitored for general groundwater quality information but not tested;
and c) those  meeting the "never-detected" criteria. The last group may still be of considerable interest
for eventual formal testing,  should site or waste management conditions change and new compounds be
detected. All background measurements in the "never-detected"  group should be non-detects, whether
the full historical set or a subgroup considered most representative (e.g., recently collected background
measurements using an improved analytical method5).  The following rule is suggested to provide a
means of evaluating "never-detected"  constituents.

     Under the Double Quantification rule approach, formal statistical tests should be designed for each of
the constituents in the first group.  Calculations involving the SWFPR should cover these constituents, but
not include constituents in the second group or the third '100% non-detect' category.  Any constituent in
this third group should be evaluated by the following simple, quasi-statistical rule6:

          A confirmed exceedance is registered if any well-constituent pair in the '100%
          non-detect' group exhibits quantified measurements (i.e., at or above the
          reporting limit [RL]) in two consecutive sample and resample events.
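
     A minimal sketch of how the rule might be checked for a single well-constituent pair is shown below;
the function name and the example values are hypothetical, and the logic simply encodes the
two-consecutive-quantified-measurements condition stated above.

```python
# Hypothetical sketch of the Double Quantification rule for one well-constituent pair in the
# '100% non-detect' group: a confirmed exceedance requires quantified results (at or above the
# reporting limit) in both the initial sample and the verification resample.

def double_quantification_exceedance(initial_result, resample_result, reporting_limit):
    """Return True only if both consecutive measurements are at or above the RL."""
    return initial_result >= reporting_limit and resample_result >= reporting_limit

# Example with a hypothetical reporting limit of 5 ug/L
print(double_quantification_exceedance(7.2, 6.5, 5.0))   # True  -> confirmed exceedance
print(double_quantification_exceedance(7.2, 2.0, 5.0))   # False -> not confirmed
```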
     It is assumed, when estimating an SWFPR using the Bonferroni-type adjustment, that each well-
constituent test is at equal risk for a specific, definable false positive error. As a justification for this
Double Quantification rule,  analytical procedures involved  in identifying a reported non-detect value
suggest that  the error  risk is probably  much lower  for most chemicals analyzed as "never-detected."
Reporting limits are set high enough  so that if a chemical is not present at all in the  sample, a detected
amount will  rarely be recorded on the  lab sheet. This is particularly the case since method detection
limits  [MDLs] are often  intended as  99%  upper  prediction  limits  on the measured signal  of  an
uncontaminated laboratory sample. These limits are then commonly multiplied by a factor of 3 to 10 to
determine the RL.

     Consequently,  a  series of measurements for VOCs or SVOCs on samples  of uncontaminated
groundwater  will tend to be listed as a string of non-detects with possibly a very occasional low-level
detection.  Because the observed measurement levels (i.e., instrument signal levels)  are usually  known
only to the chemist, an approximate prediction limit  for the  chemical basically has to be set at the RL.
However, the true measurement distribution is likely to be clustered much more closely around zero than
the RL (Figure 6-1), meaning that the false positive rate associated with setting the RL as the prediction
limit is likely already much lower than the Bonferroni-adjusted error rate calculated above. A similar
chain of reasoning would apply to site-specific chemicals that may be on the monitoring list but have
never been detected at the facility. Such constituents would also need a prediction limit set at the RL.

5 Note: Early historical data for some constituents (e.g., certain filtered trace elements) may have indicated occasional and
  perhaps unusual detected values using older analytical techniques or elevated reporting limits. If more recent sampling
  exhibits no detections at lower reporting limits for a number of events, the background review discussed in Chapter 5 may
  have determined that the newer, more reliable recent data should be used as background.  These analytes could also be
  included in the '100% non-detect' group.

6 The term "quasi-statistical" indicates that although the form is a statistical prediction limit test, only an approximate false
  positive error rate is implied for the reporting limit critical value.  The test form follows 1-of-2 or 1-of-3 non-parametric
  prediction limit tests using the maximum value from a background data set (Chapter 19).

    Figure 6-1. Hypothetical Distribution of Instrument Signals in Uncontaminated Groundwater
                                 [x-axis: Measured Concentration]
     In general, a minimally sufficient number of samples should be available to justify placing
constituents in the "never-detected" category.  Even this recommendation needs to consider individual
background well versus pooled well data.  Depending on the number of background wells (including
historical compliance well data used as background which reflect the same non-detect patterns), certain
risks may have to be taken to implement this strategy. With the same total number of non-detects (e.g.,
4 each in 5 wells versus 20 from a single well), the relative risk can change. Certain non-statistical
judgements may be needed, such as the likelihood of particular constituents arising from the waste or
waste management unit.  At a minimum, we recommend that at least 6 consecutive non-detect values
initially be present in each well of a pooled group,  and  additional background well sampling should
occur to raise this number to 10-15.

     Having  10-15 non-detects as a basis, a maximum worst-case probability of a future false  positive
exceedance under Double Quantification rule testing could be estimated.  But it should be kept  in mind
that the true individual comparison false positive rates based on analytical considerations are likely to be
considerably lower.  The number of non-detect constituents evaluated under the rule will also play a role.
There will  be some cumulative false positive error based  on the number  of comparisons at some true
false positive single test error or errors.  Since the true false positive test rates cannot be known (and may
vary considerably among analytes), it is somewhat problematic to make this cumulative false  positive
error estimate. Yet there is some likelihood that occasional false positive exceedances will occur under
this rule.
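
     The worst-case probability mentioned above can be explored by simulation. The sketch below treats
the rule as a 1-of-2 non-parametric prediction limit on the background maximum (as footnote 6 notes) and
assumes, purely as a pessimistic bounding case, that compliance measurements come from the same
continuous distribution as the n background values; sample sizes and the distribution choice are
illustrative, and rates based on analytical considerations should be much lower.

```python
# Illustrative Monte Carlo sketch of a worst-case false positive probability for the rule,
# treated as a 1-of-2 non-parametric prediction limit on the background maximum: both a new
# sample and its resample exceed the maximum of n background values, with every value drawn
# from the same continuous distribution (a deliberately pessimistic assumption).
import numpy as np

def simulated_false_positive_rate(n_background, n_trials=200_000, seed=1):
    rng = np.random.default_rng(seed)
    background = rng.normal(size=(n_trials, n_background))
    new_pair = rng.normal(size=(n_trials, 2))        # initial sample and resample
    bg_max = background.max(axis=1)
    both_exceed = (new_pair[:, 0] > bg_max) & (new_pair[:, 1] > bg_max)
    return both_exceed.mean()

for n in (6, 10, 15):
    # Exact combinatorial value for comparison: 2 / [(n + 1)(n + 2)]
    print(n, simulated_false_positive_rate(n), 2 / ((n + 1) * (n + 2)))
```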

     Some flexibility will be required in evaluating such outcomes, particularly if there is doubt that a
confirmed exceedance is actually due to a release from the regulated unit. In this circumstance, it might
be appropriate to allow for a second resample as more definitive confirmation.

     In implementing the Double Quantification rule, consideration should be given to how soon a
repeat  sample  should be taken.   Unlike  detectable parameters, the  question of autocorrelation is
immaterial since the compound should not be present in the background aquifer.  A sufficiently long
interval should occur between the initial and repeat samples to minimize the possibility of a systematic
analytical error.  But the time interval should be short enough to avoid  missing a subsequent real
detection due to seasonal changes in the aquifer depth or flow direction.  It is suggested that 1-2 months
could be appropriate, but will  depend on site-specific hydrological conditions.

     Using this rule, it should be possible to construct adequately powerful prediction and control limits
for naturally-occurring and detectable inorganic and organic chemicals in almost every setting.  This is
especially helpful at larger sites, since the total number of tests  on which the per-test false positive rates
(atest) are based will be significantly reduced. Requiring a verified quantification for previously non-
detected constituents should ensure that spurious  lab  results do not falsely trigger  a facility into
compliance/assessment monitoring, and will more reliably  indicate the presence of chemicals  that have
heretofore not been found in background.

6.2.3 RECOMMENDATIONS  FOR STATISTICAL POWER

     The second  but  more  important regulatory goal  of a testing  strategy  is to ensure  sufficient
statistical power for detecting contaminated groundwater.  Technically, in the context of groundwater
monitoring, power refers to  the probability that  a statistical  test will  correctly identify  a significant
increase in concentration above background. Note that power is typically defined with respect to a single
test, not a network of tests.  In this guidance, cumulative power is assessed for a single test over an
annual period, depending on the frequency  of the evaluation.  Since some testing procedures may
identify contamination more readily when several wells in  the network are contaminated as opposed to
just one or two,  the Unified Guidance  recommends that all testing strategies be compared on the
following more stringent common basis.

      The effective power of a testing protocol across a network of well-constituent pairs is defined as
the probability of detecting contamination in the monitoring  network when one  and only  one well-
constituent pair is contaminated. Effective power is a conservative measure of how a testing regimen
will perform across the network, because the set of statistical tests must uncover one contaminated well
among many clean ones (i.e., like 'finding a  needle in a haystack').  As mentioned above, this initial
judgment may need to be qualified with effect size and other site-specific considerations.

       INTRODUCTION TO  POWER CURVES

     Perhaps the best way to describe the power function associated with a particular testing procedure
is via a graph, such as the example below of the power of a  standard normal-based upper prediction limit
with 99% confidence (Figure 6-2). The power in percent is plotted along the y-axis against the
standardized mean level of contamination along the x-axis. The standardized contamination levels are
presented in units of standard  deviations above the baseline  (defined as the true background mean). This
allows different power curves to be compared across constituents, wells, or well-constituent pairs. These
standardized units Δ in the case of normally-distributed data may be computed as:

        Δ = [(Mean Contamination Level) - (Mean Background Level)] / (SD of Background Population)        [6.1]

     In some  situations, the probability that  contamination will be detected by a particular testing
procedure may be difficult if not impossible to derive analytically and will have to be simulated using
Monte Carlo analysis on a computer. In these cases, power is typically estimated by generating normally-
distributed random values at different mean contamination levels and repeatedly simulating the test
procedure. With enough repetitions a reliable power curve can be plotted.

     In the case of the normal power curve in Figure 6-2, the power values were computed analytically,
using properties of the non-central t-distribution. In particular, the statistical  power of a normal 99%
prediction limit for the next single future value can be calculated as:

                        1 - β = Pr[ t_{n-1, δ} > t_{.99, n-1} ],   with   δ = Δ / √(1 + 1/n)                [6.2]

where Δ is the number of standardized (i.e., standard deviation) units above the background population
mean, (1-β) is the fractional power, δ is a non-centrality parameter, and:

                        t_{n-1, δ}                                                                           [6.3]

represents a non-central t-variate with (n-1) degrees of freedom and non-centrality parameter δ.
Equation [6.2] was used with n = 10 to generate Figure 6-2.7

     On a general power curve, the power at Δ = 0 represents the false positive rate or size of the
statistical test, because at that point no contamination is actually present (i.e., the background condition),
even though the curve indicates how often a significant concentration increase will be detected.  One
should be careful to distinguish between the SWFPR across many statistical tests and the false positive
rate represented on a curve measuring effective power. Since the effective power is defined as the testing
procedure's ability to identify a single contaminated well-constituent  pair, the  effective power curve
represents an individual test, not a network of tests.  Therefore, the value of the curve at Δ = 0 will only
indicate the false  positive rate associated with an  individual test (atest),  not across the network  as  a
whole.  For many of the  retesting strategies discussed in  Chapters  19 and 20,  the individual  per-test
false positive rate will be quite small and may appear to be nearly zero on the effective power curve.
7 For users with access to statistical software containing the non-central t-distribution, this power curve can be duplicated.
  For example, the Δ = 3σ fractional power can be obtained using the following inputs: a central t-value of t.99, 9 = 2.821,
  9 df, and δ = 3/√(1 + 1/10) = 2.8604.  The fractional power is .5414. It should be noted that the software may report the
  probability as β rather than (1-β). For more complex power curves involving multiple repeat samples or multiple tests,
  integration is necessary to generate the power estimates.
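
     The fractional power quoted in this footnote can be reproduced directly. The sketch below is a
minimal evaluation of equation [6.2] for the n = 10 curve of Figure 6-2, assuming scipy is available;
variable names are illustrative.

```python
# Sketch of equation [6.2]: power of a 99% normal prediction limit on a single future sample,
# evaluated with scipy's non-central t distribution.
import numpy as np
from scipy.stats import t, nct

def prediction_limit_power(delta_sd, n):
    """Power to detect a mean shift of delta_sd background standard deviations."""
    df = n - 1
    t_crit = t.ppf(0.99, df)                   # central t critical value, t_.99,n-1
    nc = delta_sd / np.sqrt(1.0 + 1.0 / n)     # non-centrality parameter delta
    return nct.sf(t_crit, df, nc)              # Pr[ t_{n-1,delta} > t_.99,n-1 ]

# Reproduce the footnote's worked value at Delta = 3 with n = 10 background samples (~0.54)
print(round(prediction_limit_power(3.0, 10), 4))

# A few points along the Figure 6-2 curve
for d in (1, 2, 3, 4):
    print(d, round(prediction_limit_power(d, 10), 3))
```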
        Figure 6-2.  Normal Power Curve (n = 10) for 99% Prediction Limit Test
        [y-axis: Power, 0.00 to 1.00; x-axis: Δ (SDs above Background), 0 to 4]

     To properly interpret a power curve, note that not only is the probability greater of identifying a
concentration increase above background (shown as a decimal value between 0 and 1 along the vertical
axis) as the magnitude of the increase gets bigger (as measured along the horizontal axis), but one can
determine the probability of identifying certain kinds of increases. For instance, with effective  power
equivalent  to that in  Figure 6-2, any mean concentration increase of at least 2 background  standard
deviations will be detected about 25% of the time, while an increase of 3 standard deviations
will be  detected with approximately 55% probability or better than 50-50 odds. A mean increase of at
least 4 standard deviations will be detected with about 80% probability.

     An increase of 3  or 4 standard  deviations above the  baseline may or may not have practical
implications for human health or the environment. That will  ultimately depend on  site-specific factors
such  as the  constituents  being monitored, the  local  hydrogeologic environment,  proximity to
environmentally sensitive populations, and the observed  variability in background concentrations. In
some circumstances, more sensitive testing procedures might be warranted. As a general guide especially
in the  absence of direct site-specific information,  the  Unified  Guidance  recommends  that when
background is approximately normal in distribution,8 any statistical test should be able to detect a 3
  probability as (P) rather than (1-P). For more complex power curves involving multiple repeat samples or multiple tests,
  integration is necessary to generate the power estimates.
  If a non-parametric test is performed, power (or more technically, efficiency) is often measured by Monte Carlo simulation
  using normally distributed data. So these recommendations also apply to that case.
                                              6-15
                                                                        March 2009

-------
Chapter 6. Detection Monitoring Design	Unified Guidance

standard deviation increase at least 55-60% of the time and a 4 standard deviation increase with at least
80-85% probability.

       EPA REFERENCE POWER CURVES

       Since effect sizes discussed in the next section often cannot or have not been quantified, the
Unified Guidance recommends using the ERPC as a suitable basis of comparison for proposed testing
procedures. Each reference power curve corresponds to one of three typical yearly statistical evaluation
schedules — quarterly, semi-annual, or annual —  and represents  the cumulative  power achievable
during a single year at one  well-constituent pair by a 99% upper (normal) prediction limit based on n =
10 background measurements and one new measurement from the compliance well (see Chapter 18 for
discussion of normal prediction limits). The ERPC are pictured in Figure 6-3 below.
     Any proposed  statistical test procedure with effective power  at least as high as the appropriate
ERPC, especially in the range of three or more  standard deviations above the background mean, should
be considered to have reasonable power.9 In particular, if the effective power first exceeds the ERPC at a
mean concentration increase no greater than 3 background standard deviations (i.e., Δ ≤ 3), the power is
labeled 'good;' if the effective power first exceeds the ERPC at a mean increase between 3 and 4
standard deviations (i.e., 3 < Δ ≤ 4), the power is considered 'acceptable;' and if the first exceedance of
the ERPC does not occur until an increase greater than 4 standard deviations (i.e., Δ > 4), the power is
considered 'low.'
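
     One way such a classification might be automated, sketched below for the annual-evaluation case
only, is to compute the reference curve from the n = 10 prediction limit power of equation [6.2] and find
the first standardized increase at which a candidate effective power curve matches it. The candidate curve
supplied here is hypothetical, and quarterly or semi-annual reference curves would additionally require
the cumulative-power integration noted in footnote 7.

```python
# Illustrative sketch: classify a candidate effective power curve against the annual ERPC,
# here approximated by the n = 10, 99% prediction limit power of equation [6.2].
import numpy as np
from scipy.stats import t, nct

def annual_erpc(delta_sd, n=10, conf=0.99):
    """Approximate annual reference curve: power of a 1-of-1 normal prediction limit test."""
    df = n - 1
    nc = np.asarray(delta_sd, dtype=float) / np.sqrt(1.0 + 1.0 / n)
    return nct.sf(t.ppf(conf, df), df, nc)

def classify_power(delta_grid, effective_power):
    """Label power 'good', 'acceptable', or 'low' by where it first meets the reference curve."""
    reference = annual_erpc(delta_grid)
    meets = np.where(np.asarray(effective_power) >= reference)[0]
    if meets.size == 0:
        return "low"
    first_delta = delta_grid[meets[0]]
    if first_delta <= 3:
        return "good"
    return "acceptable" if first_delta <= 4 else "low"

# Hypothetical candidate curve (in practice obtained analytically or by Monte Carlo simulation);
# the grid starts above zero because the classification concerns increases above background.
deltas = np.linspace(0.5, 5.0, 46)
candidate = annual_erpc(deltas, n=20)     # stand-in: a larger-background, more powerful test
print(classify_power(deltas, candidate))  # 'good' for this stand-in curve
```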
     With respect to the ERPCs, one should keep the following considerations in mind:

  1.    The effective power of any testing method applied to a groundwater monitoring network can be
       increased merely by relaxing the SWFPR guideline, letting the SWFPR become larger than 10%.
       This is why a maximum annual SWFPR of 10% is suggested as standard guidance, to ensure fair
       power comparisons  among competing tests and to limit the overall network-wide false positive
       rate.

  2.    The ERPCs are based on cumulative power  over a one-year period.  That is, if a single well-
        constituent pair is contaminated at standardized level Δ during each of the yearly evaluations, the
       ERPC  indicates the  probability that  a  99%  upper  prediction limit test  will  identify the
       groundwater  as impacted during  at least one of those evaluations. Because the number of
       evaluations not only varies by facility, but also impacts the cumulative one-year power, different
       reference power curves  should be employed  depending on a  facility's  evaluation schedule.
       Quarterly evaluators  should  utilize the  quarterly reference power curve  (Q);  semi-annual
       evaluators the semi-annual curve (S); and annual evaluators the annual curve (A).

   3.    If Monte Carlo simulations are used to evaluate the power of a proposed testing method, it
        should incorporate every aspect of the procedure, from initial screens of the data to final
        decisions concerning the presence of contamination. This is especially applicable to strategies
        that involve some form of retesting at potentially contaminated wells.

9 When using a retesting strategy in a larger network, the false positive rate associated with a single contaminated well (used
  to determine the effective power) will tend to be much smaller than the targeted SWFPR. Since the point at which the
  effective power curve intersects Δ = 0 on the standardized horizontal axis represents the false positive rate for that
  individual test, the effective power curve by construction will almost always be less than the EPA reference power curve for
  small concentration increases above background. Of more concern is the relative behavior of the effective power curve at
  larger concentration increases, say two or more standard deviations above background.

   4.    Although monitoring networks incorporate multiple well-constituent pairs, effective power can
       be gauged by simulating contamination in one and only one constituent at a single well.

   5.    The ERPCs should be considered a minimal power standard. The prediction limit test used to
       construct these reference curves  does not incorporate retesting  of any sort,  and is based on
       evaluating a single new measurement from the contaminated well-constituent pair. In  general,
       both retesting and/or the evaluation of multiple compliance point measurements tend to improve
       statistical power, so proposed tests that include such elements should be able to match the ERPC.

   6.    At sites employing multiple types of test procedures (e.g., non-parametric prediction limits for
       some constituents, control charts for other constituents), effective power should be computed for
       each type of procedure to  determine which type  exhibits the least statistical  power. Ensuring
       adequate power across the site implies that the least powerful procedure should match or exceed
       the appropriate ERPC, not just the most powerful procedure.

                         Figure 6-3.  EPA Reference Power Curves
        [Curves for Annual, Semi-Annual, and Quarterly evaluation schedules; x-axis: SD Units Above BG]

6.2.4 EFFECT SIZES AND DATA-BASED POWER CURVES

       EFFECT SIZES

     If site-specific or chemical-specific risk/health information is available, particularly for naturally-
occurring constituents, it can be used in some circumstances to develop an effect size of importance. An
effect size (φ) is simply the smallest concentration increase above the mean background level that is
presumed or known to have a measurable, deleterious impact on human health and/or the environment,
or that would clearly signal the presence of contamination.

     When an effect size can be quantified for a given constituent and is approved by the regulating
authority, the acceptable power of the statistical test can be tailored to that amount. For instance, if an
effect size for lead in groundwater at a particular site is φ = 10 ppb, one might require that the statistical
procedure have an 80% or 95% chance of detecting such an increase. This would be true regardless of
whether the power curve for lead at that site matches the ERPC. In some cases, an agreed-upon effect
size will result in a more stringent power requirement compared to the ERPCs. In other cases, the power
standard might be less stringent.

     Effect sizes are not known or have not been  determined  for many groundwater constituents,
including many inorganic parameters that have detection frequencies high  enough to be amenable to
effect size calculations. Because of this, many users will routinely utilize the relative power guidelines
embodied in the ERPC. Even if a specific effect size cannot be determined, it is helpful to consider the
site-specific and test-specific  implications of a three or four standard deviation concentration  increase
above background. Taking the background sample mean (x̄) as the estimated baseline, and estimating
the underlying population variability by using the sample background standard deviation (s),  one can
compute the approximate actual concentrations associated with a three, four, five, etc. standard deviation
increase above the baseline (as would be done in computing a data-based power curve; Section 6.2.4).
These concentration values will only be approximate, since the true background mean (μ) and standard
deviation (σ) are unknown. However, conducting this analysis can be useful in at least two ways. Each
is illustrated by a simple example.

       By associating the standardized units on a reference power curve with specific but approximate
concentration levels, it is possible to evaluate whether the anticipated power characteristics of the chosen
statistical method are adequate for the site in question. If not, another method with better power might
be needed.  Generally, it is useful to discuss  and report statistical power in terms of concentration
levels rather than theoretical units.

        EXAMPLE 6-1

       A potential permit GWPS for lead is 15 ppb, while natural background lead levels are normally
distributed  with  an average of 6  ppb and a standard  deviation of 2  ppb.   The regulatory agency
determines  that a statistical test should be able to identify an exceedance of this GWPS with high power.
Further assume that the power curve for a particular statistical test indicated  40% power at 3  standard
deviations and 78% power at 4σ above background (a low power rating).

        By comparing the actual standard deviation estimate to the required target increase φ = (15 - 6)/2 =
4.5 standard units, the power at the critical effect size would be 80% or higher using Figure 6-2 as a
rough guide. This might be sufficient for monitoring needs even though the test did not meet the EPA
reference criteria. Of course, the results apply only to this specific well-constituent test.
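
        The arithmetic in this example can be checked as follows; the sketch converts the GWPS-based
effect size into standardized units and then, purely as the rough Figure 6-2 guide the example uses,
evaluates the equation [6.2] power at that effect size (scipy assumed; inputs taken from the example).

```python
# Sketch of the Example 6-1 effect-size arithmetic, using the Figure 6-2 curve (equation [6.2],
# n = 10) purely as the rough guide cited in the example.
import numpy as np
from scipy.stats import t, nct

gwps = 15.0        # ppb, potential permit standard for lead
bg_mean = 6.0      # ppb, background average
bg_sd = 2.0        # ppb, background standard deviation

phi = (gwps - bg_mean) / bg_sd
print(phi)         # 4.5 standard deviation units

n = 10
power = nct.sf(t.ppf(0.99, n - 1), n - 1, phi / np.sqrt(1.0 + 1.0 / n))
print(round(power, 2))   # above 0.80, consistent with the example's rough reading of Figure 6-2
```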

       For a given background  sample, one can consider the regulatory and environmental impact of
using that particular background as the basis of comparison in detection monitoring. Especially when
deciding between interwell and intrawell tests at the same  site,  it is not unusual for the intrawell
background from an individual well to exhibit much  less variability than a larger set of observations
pooled from multiple upgradient wells. This difference can be important since  an intrawell  test and an
interwell test applied to the same site —  using identical relative power criteria — might be associated
with different risks to human health and the environment. A similar type of comparison might also aid in
deciding whether the degrees of freedom  of an intrawell test ought to be enlarged via a pooled estimate
of the intrawell standard deviation (Chapter 13), whether a non-adjusted intrawell test is adequate, or
whether more background sampling ought to be conducted prior to running intrawell tests.

        EXAMPLE 6-2

     The standard deviation of an intrawell background population is σ_intra = 5 ppb, but that of
upgradient, interwell background is σ_inter = 10 ppb. With the increased precision of an intrawell method,
it may be possible to detect a 20 ppb increase with high probability (representing a Δ = 4σ_intra increase),
while the corresponding probability for an interwell test is much lower (i.e., 20 ppb = 2σ_inter, so Δ = 2). Of
course, even if the intrawell test meets the ERPC target at four standardized units above background,
consideration should be given as to whether or not 20 ppb is a meaningful increase.
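
     The comparison in this example can be made concrete with the short sketch below, which assumes
for illustration that both tests behave like the n = 10, 99% prediction limit of Figure 6-2, so that only the
standardized size of the 20 ppb increase differs between the intrawell and interwell cases.

```python
# Sketch comparing the same 20 ppb increase under intrawell vs. interwell variability, assuming
# for illustration that both tests follow the Figure 6-2 power curve (n = 10, 99% prediction limit).
import numpy as np
from scipy.stats import t, nct

def fig_6_2_power(delta_sd, n=10):
    return nct.sf(t.ppf(0.99, n - 1), n - 1, delta_sd / np.sqrt(1.0 + 1.0 / n))

increase = 20.0    # ppb
sd_intra = 5.0     # ppb, intrawell background standard deviation
sd_inter = 10.0    # ppb, pooled upgradient (interwell) standard deviation

print(round(fig_6_2_power(increase / sd_intra), 2))   # Delta = 4 -> roughly 0.8
print(round(fig_6_2_power(increase / sd_inter), 2))   # Delta = 2 -> roughly 0.25
```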

     One caveat is that calculation of either effect sizes or data-based power curves (see below) requires
a reasonable estimate of the background standard deviation (σ). Such calculations may often be possible
only for naturally-occurring inorganics or other constituents with  fairly high detection frequencies in
groundwater. Otherwise, power computations based on an effect size or the estimated standard deviation
(s) are likely to be unreliable due to the presence of left-censored measurements (i.e., non-detects).

     A type of effect  size  calculation  is  presented  in  Chapter  22 regarding  methods for
compliance/assessment and corrective  action monitoring. A  comparable  effect size is computed by
considering changes in mean concentration levels  equal  to  a multiple of a  fixed GWPS or clean-
up/action level. While  the mean level changes are multiples of the concentration limit and in that sense
still relative, because they are tied to a fixed concentration standard, the power  of the test can be linked
to specific concentration levels.

       DATA-BASED POWER CURVES

     Even if basing power on a specific effect size is impractical for a given facility or constituent, it is
still possible to relate power to absolute concentration  levels rather  than to  the standardized units of the
ERPC. While exact statistical power depends on the unknown population standard deviation (σ), an
approximate power curve can be constructed based on the estimated background standard deviation (s).
Instead of an estimate of power at a single effect size (depicted in Example  6-1), the actual power over a
range of effect sizes can be evaluated.  Such a graph is denoted in the Unified Guidance as  a data-based
power curve, a term first coined by Davis  (1998).

     Since the sample standard deviation (s) is calculated from actual groundwater measurements, this
in turn changes an abstract power curve based on relative concentrations (i.e., kσ units above the
baseline  mean) into one displaying approximate,  but absolute, concentrations (i.e., ks units above
baseline). The advantages of this approach include the following:

    »»»  Approximate data-based power curves allow the user to determine statistical  power  at any
        desired effect size (φ).

    »»»  Even if the effect size (φ) is unspecified, data-based power curves tie the performance of the
       statistical test back to actual concentration levels of the population being tested.

    »»»  Once the theoretical power curve of a particular statistical test is known,  a  data-based power
       curve  is extremely easy to construct. One merely substitutes the observed background standard
        deviation (s) for σ and multiplies by k to determine concentration values along the horizontal axis
       of the power curve. Even if the theoretical power curve is unknown, the same calculations can be
       made on the reference curve to derive an approximate site-specific, data-based power curve for
       tests roughly matching the performance of the ERPCs.

    »»»  If the choice between  an interwell test  and an intrawell approach is a  difficult one  (Section
       6.3.2), helpful power comparisons can  be made between intrawell and interwell tests at the same
       site using data-based power curves. Even if both tests meet the ERPC criteria, they may be based
       on different sets of background measurements,  implying that the  interwell standard deviation
        (s_inter) might differ from the intrawell standard deviation (s_intra). By plotting both data-based
       power curves on the same set of axes, the comparative performance of the tests  can be gauged.

       ►EXAMPLE 6-3

     The following  background sample is used to construct a  test with theoretical  statistical power
similar to the ERPC for annual evaluations  (see Figure 6-2). What will an approximate data-based
power curve look like, and  what is the approximate power for detecting a concentration  increase of 75
ppm?
                          Sulfate Concentrations (ppm)
               Quarter         BW-1          BW-2
                1/95            560           550
                4/95            530           570
                7/95            568           540
               10/95            490           542
                1/96            510           590

               Mean (pooled) = 545.0 ppm
               SD (pooled)   =  29.7 ppm

       SOLUTION
     The  sample standard deviation of the pooled background sulfate concentrations is 29.7 ppm.
Multiplying this amount by the number of standard deviations above background along the x-axis in
Figure 6-2 and re-plotting, the approximate data-based power curve of Figure 6-3 can be generated.
Then the statistical power for detecting an increase of 75 ppm is almost 40%.

                Figure 6-3. Approximate s-Based Power Curve for Sulfate
                [Figure: statistical power (0.0 to 1.0) plotted against sulfate concentration
                increase, 0 to 200 ppm]
     Had the pooled sample size been n = 16 using the same test and sample statistics, a different and
somewhat more powerful theoretical power curve would result.  This theoretical curve can be generated
(for a 1-of-1 prediction limit test) using the non-central t-distribution described earlier, if a user has the
appropriate statistical software package.  The power for a 75 ppm increase, calculated using the
non-centrality parameter δ = (75/s)/√(1 + 1/16) = 2.45 and the critical point t.99,15 = 2.602, is closer to
46%.  The larger background sample size makes for a more powerful test.  ◄
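
     For readers who wish to reproduce this calculation, the following minimal sketch (Python with scipy
assumed) evaluates the same non-central t-distribution power; all numeric inputs come from the example
above.

        # Minimal sketch: approximate power of a 1-of-1 normal prediction limit test
        # (alpha = 0.01, background n = 16) against a 75 ppm increase, per Example 6-3.
        import math
        from scipy import stats

        s, n, alpha, increase = 29.7, 16, 0.01, 75.0

        delta = (increase / s) / math.sqrt(1.0 + 1.0 / n)   # non-centrality parameter (~2.45)
        t_crit = stats.t.ppf(1.0 - alpha, df=n - 1)         # t critical point (~2.602)
        power = stats.nct.sf(t_crit, df=n - 1, nc=delta)    # P[T'(15, delta) > t_crit] (~0.46)
        print(round(delta, 2), round(t_crit, 3), round(power, 2))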

6.2.5 SITES USING MORE THAN ONE STATISTICAL METHOD

     There is no requirement that a facility apply one and only one statistical method to its groundwater
monitoring program.  The RCRA  regulations explicitly  allow for the  use of multiple  techniques,
depending on the distributional properties of the constituents being monitored and the characteristics of
the site.  If some constituent data contain a high percentage of non-detect values,  but  others can be
normalized, the statistical approach should vary by constituent.

     With interwell testing, parametric prediction limits might be used for certain constituents and
non-parametric prediction limits for other constituents with mostly non-detect data. If intrawell testing is
used, the most appropriate statistical technique for a given constituent might differ from one group of
wells to another. Depending on the monitoring constituent, available individual well background, and
other site-specific factors, some combination of intrawell prediction limits, control charts, and Wilcoxon
rank-sum tests might come into play. At other sites, a mixture of intrawell and interwell tests might be
conducted.

     The Unified Guidance offers a range of possible methods which can be matched to the statistical
characteristics of the observed data. The primary goal is that the statistical program should maximize the
odds  of making correct judgments about groundwater quality. The guidance  SWFPR and  ERPC
minimum power criteria serve as comprehensive guides for assessing any of the statistical methods.

     One major concern  is how statistical power should be compared when multiple methods  are
involved. Even if each method is designed so as not to exceed the recommended SWFPR, the effective
power for identifying contaminated  groundwater may vary considerably by technique and specific type
of test. Depending on the well network and statistical characteristics of available data, a certain control
chart test may or may not be as powerful as normal prediction limits.  In turn, a specific non-parametric
prediction limit test may be more powerful than some parametric versions.  It is important that effective
power be defined consistently, even at sites where more than one statistical method is employed.

     The guidance encourages employing the effective power concept in assessing  the ability of the
statistical  program to correctly identify and  flag real concentration increases above background.  As
already  defined, effective power is the probability that such an increase will be identified even  if only
one well-constituent pair is contaminated. Each well-constituent pair being tested should be considered
equally  at risk of containing a true increase above background. This also implies that the effective power
of each  statistical test in use should meet the criteria of the EPA reference curves.  That is, the test with
the least power should still have adequate power for identifying mean concentration increases.

     The Unified Guidance does not recommend that a single composite measure of effective power be
used  to gauge  a program's ability to identify  potential contamination.   To understand  this last
recommendation, consider the following hypothetical example.  Two  constituents  exhibiting different
subsurface travel times and diffusive potentials in the underlying aquifer are monitored with different
statistical techniques. The constituent  with the  faster travel time might be measured using a test with
very low effective power (compared to the ERPC), while the slower moving parameter is measured with
a test having very high effective power. Averaging the separate power results  into a single composite
measure might result in an effective power roughly equivalent to the ERPC. Then the chances of
identifying a release in a timely manner would  be diminished unless rather large concentrations of the
faster constituent began appearing  in compliance wells.  Smaller mean increases — even if 3 or 4
standard deviation units above background levels — would have little chance of being detected, while
the time it took for more readily-identified levels of the slower constituent to arrive at compliance wells
might be too long  to be  environmentally protective.  Statistical power  results   should be reported
separately,  so that the effectiveness of each distinct test can be adequately judged.  Further data-specific
power evaluations could still be necessary to identify the appropriate test(s).

     The following basic  steps are  recommended for  assessing effective power at sites using multiple
statistical methods:

  1.   Determine the number and assortment of distinct statistical tests. Different power characteristics
       may be exhibited by different statistical techniques.  Specific control charts, t-tests, non-
      parametric prediction limits, etc. all tend to vary in their performance. The  performance of a
      given technique is  also strongly affected by the data characteristics.  Background sample sizes,
      interwell versus intrawell choices, the number of retests and type of retesting plan, etc., all affect
       statistical power. Each distinct data configuration and retesting plan will  delineate a slightly
      different statistical test method.

  2.    Once the various methods have been identified, gauge the effective power of each.10 Often the
       easiest way to measure power is via Monte Carlo simulation. Effective power involves a single
       well-constituent pair, so the simulation needs to incorporate only one population of background
       measurements representing  the baseline condition  and  one  population of compliance point
       measurements.

  3.    To run a Monte Carlo simulation, repeat the following algorithm a large number of times (e.g., N
       = 10,000). Randomly generate a set of measurements from the background population in order to
       compute either a comparison limit for a control chart or some type of prediction limit test, or the
        background portion for a t-test or Wilcoxon rank-sum calculation, etc. Then generate compliance
       point samples  at successively higher  mean  concentration levels,  representing increases  in
       standard deviation units above the baseline average.  Perform each distinct test on the simulated
       data,  recording the result of each iteration. By determining how frequently the concentration
       increase is  identified at each successive mean level (including retests if necessary), the effective
       power for each distinct method can be estimated and compared.

       ►EXAMPLE 6-4

     As a  simple  example of measuring effective power, consider a site using two different statistical
methods. Assume that most of the constituents will be tested interwell with a 1-of-3 parametric normal
prediction  limit retesting plan for  individual observations  (Chapter  19).  The remaining  constituents
having low detection rates and small well sample sizes will be tested  intrawell with a Wilcoxon rank-
sum test.

     To measure  the effective power of the normal prediction  limits, note that the same number of
background measurements (n = 30) is likely to be available for each of the relevant constituents. Since
the per-constituent false positive rate (α_c) and the number of monitored wells (w) will also be identical
for these chemicals, the same K multiplier can be used  for each prediction limit, despite  the fact that the
background mean and standard deviation will almost certainly vary by constituent.

     Because of these identical data and well configurations,  the  effective power  of each normal
prediction limit will also be the same,11  so that only one prediction limit  test need be simulated. It is
sufficient to assume the background population has a standard normal distribution.  The compliance
point population at the  single contaminated well also has a  normal distribution with the same standard
deviation but a mean (μ) shifted upward to reflect successive relative concentration increases of 1
standard deviation, 2 standard deviations, 3 standard deviations, etc.

     Simulate the power by conducting a large number of iterations (e.g., N = 10,000-20,000) of the
following algorithm: Generate 30 random observations from background and compute the sample mean
and standard deviation. Calculate the prediction limit by adding the background mean to K times the
background standard deviation. For a 1-of-3 retesting plan, generate 3 values from the compliance point
distribution (i.e., a normal distribution with unit standard deviation but mean equal to μ). If the first of
these measurements does not exceed the prediction limit, record a score of zero and move on to the next
iteration. If, however, the first value is an exceedance, test the second value and possibly the third. If
either resample does not exceed the prediction limit, record a score of zero and move to the next
iteration. But if both resamples are also exceedances, record a score of one. The fraction of iterations (N)
with scores equal to one is an estimate of the effective power at a concentration level of μ standard
deviations above the baseline.

10 Since power is a property of the statistical method and not linked to a specific data set, power curves are not needed for all
   well-constituent pairs, but only for each distinct statistical method. For instance, if intrawell prediction limits are employed
   to monitor barium at 10 compliance wells and the intrawell background sample size is the same for each well, only one
   power curve needs to be created for this group of tests.

11 Statistical power measures the likely performance of the technique used to analyze the data, and is not a statement about the
   data themselves.
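
     A minimal simulation sketch of this 1-of-3 algorithm is shown below (Python with numpy assumed).
The K multiplier used here is a placeholder; an actual value would be taken from the Chapter 19 tables for
the site's background size, number of tests, and SWFPR design.

        # Minimal sketch of the 1-of-3 prediction limit power simulation described above.
        # KAPPA is a placeholder, not an actual Chapter 19 multiplier.
        import numpy as np

        rng = np.random.default_rng(42)
        N_ITER, N_BG, KAPPA = 10_000, 30, 2.0     # iterations, background n, assumed multiplier

        def effective_power(mu_shift):
            """Fraction of iterations flagged for a mean increase of mu_shift SDs."""
            hits = 0
            for _ in range(N_ITER):
                bg = rng.standard_normal(N_BG)                      # standard normal background
                limit = bg.mean() + KAPPA * bg.std(ddof=1)          # parametric prediction limit
                comp = rng.normal(loc=mu_shift, scale=1.0, size=3)  # initial sample + 2 resamples
                # Flag only if the initial sample and both resamples all exceed the limit
                if comp[0] > limit and comp[1] > limit and comp[2] > limit:
                    hits += 1
            return hits / N_ITER

        for shift in (0, 1, 2, 3, 4):
            print(f"mean shift = {shift} SD: estimated effective power = {effective_power(shift):.3f}")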

     In the case of the intrawell Wilcoxon rank-sum test, the power will depend on the number of
intrawell background samples available at each well and for each  constituent.12 Assume for purposes of
the example that  all the intrawell  background sizes  are  the same with n = 6  and that two  new
measurements will be collected at each well during the evaluation period. The power will also depend on
the frequency of non-detects in the underlying groundwater population. To simulate this  aspect of the
distribution for each separate  constituent, estimate  the proportion (p) of observed  non-detects across a
series of wells. Then set an RL for purposes of the simulation equal to z_p, the pth quantile of the standard
normal distribution.

     Finally, simulate the effective power by repeating a large number of iterations of the following
algorithm:  Generate  n =  6  samples from  a standard normal distribution to  represent intrawell
background. Also generate two samples from a normal distribution with unit standard deviation and
mean equal to μ, to represent new compliance point measurements from a distribution with mean level
equal to μ standard deviations above background. Classify any values as non-detects that fall below z_p.
Then jointly rank the background and compliance values and compute the Wilcoxon rank-sum test
statistic, making any necessary adjustments for ties (e.g., the non-detects). If this test statistic exceeds its
critical value, record a score of one for the iteration. If not, record a score of zero. Again estimate the
effective power at mean concentration level μ as the proportion of iterations (N) with scores of one.
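
     The sketch below (Python with numpy and a recent version of scipy assumed) parallels the Wilcoxon
algorithm just described. It uses the Mann-Whitney form of the rank-sum test with a tie-corrected normal
approximation and an assumed 0.05 significance level in place of a tabulated critical value; the 25%
non-detect proportion is likewise an illustrative assumption.

        # Minimal sketch of the intrawell Wilcoxon rank-sum power simulation described above.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(7)
        N_ITER, N_BG, N_NEW, ALPHA = 10_000, 6, 2, 0.05
        p_nd = 0.25                                  # assumed non-detect proportion
        z_p = stats.norm.ppf(p_nd)                   # censoring point (simulated RL)

        def wilcoxon_power(mu_shift):
            hits = 0
            for _ in range(N_ITER):
                bg = rng.standard_normal(N_BG)
                comp = rng.normal(loc=mu_shift, scale=1.0, size=N_NEW)
                # Censor values below the simulated RL so non-detects become tied values
                bg = np.maximum(bg, z_p)
                comp = np.maximum(comp, z_p)
                res = stats.mannwhitneyu(comp, bg, alternative="greater", method="asymptotic")
                if res.pvalue < ALPHA:
                    hits += 1
            return hits / N_ITER

        for shift in (0, 1, 2, 3, 4):
            print(f"mean shift = {shift} SD: estimated power = {wilcoxon_power(shift):.3f}")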

     As a last step, examine the  effective  power for each of the two techniques. As long as the power
curves of the normal prediction  limit and the Wilcoxon rank-sum test both meet the criteria of the
ERPCs, the statistical program taken as a whole should provide acceptable power.  ◄
12 Technically, since the Wilcoxon rank-sum  test will often be applied to non-normal data, power will also depend
  fundamentally on the true underlying distribution at the compliance well. Since there may be no way to determine this
  distribution, approximate power is measured by assuming the underlying distribution is instead normal.




6.3  HOW KEY ASSUMPTIONS IMPACT STATISTICAL DESIGN

6.3.1 STATISTICAL INDEPENDENCE

       IMPORTANCE OF INDEPENDENT, RANDOM MEASUREMENTS

     Whether a facility is in detection monitoring, compliance/assessment, or corrective action, having
an appropriate and valid sampling program is critical. All statistical procedures infer information about
the underlying population  from the observed sample measurements. Since these populations are only
sampled a few times a year, observations should be carefully chosen to provide accurate information
about the underlying population.

     As discussed in Chapter 3, the mathematical theory behind standard statistical tests assumes that
samples were randomly obtained from the underlying population. This is necessary to ensure that the
measurements are independent and identically distributed [i.i.d.]. Random sampling means that each
possible concentration value in the population has an equal or known chance of being selected any time
a measurement is taken. Only random sampling guarantees with sufficiently high probability that a set of
measurements is adequately representative of the underlying population. It also ensures that human
judgment will not bias the sample results, whether by intention or accident.

     A number of factors make classical random sampling of groundwater virtually impossible.  A
typically small number of wells represents only a very small portion of an entire well-field.  Wells are
screened at specific depths  and combine potentially different horizontal and vertical flow regimes.   Only
a minute portion of flow that passes  a  well is actually  sampled.  Sampling normally occurs  at fixed
schedules, not randomly.

     Since a typical aquifer cannot be sampled at random, certain assumptions are made concerning the
data from  the  available  wells.  It  is  first  assumed that the  selected well locations  will generate
concentration data  similar to a randomly distributed  set of  wells. Secondly, it is  assumed that
groundwater flowing through the well screen(s) has a concentration distribution identical to the aquifer
as a whole. This second assumption is unlikely to be valid unless groundwater is flowing through the
aquifer at a pace fast enough and in such a way as to allow adequate mixing of the  distinct water
volumes  over a relatively  short (e.g., every few months or so) period of time, so that groundwater
concentrations seen at an existing well could also have been observed at other possible well locations.

     Adequate sampling of aquifer concentration distributions cannot be accomplished unless enough
time elapses between sampling events to allow different portions of the aquifer to pass through the well
screen.    Most  closely-spaced  sampling events  will  tend   to  exhibit  a statistical dependence
(autocorrelation). This means that pairs of consecutive measurements taken in a series will be positively
correlated, exhibiting a stronger similarity in concentration levels than expected from pairs collected at
random times. This would be particularly true for overall water quality indicators which are continuous
throughout an aquifer and only vary slowly with time.

     Another form of statistical dependence is  spatial correlation. Groundwater concentrations  of
certain constituents exhibit natural spatial variability, i.e., a distribution that varies depending on the
location of the  sampling coordinates.  Spatially variable constituents exhibit mean and occasionally
variance differences from one well to another.  Pairs of spatially variable measurements collected from
the same or nearby locations exhibit greater similarity than those collected from distinct, widely-spaced,
or distant wells.

     Natural spatial variability can result from a number  of geologic and hydrological processes,
including varying soil composition  across an aquifer. Various geochemical,  diffusion, and adsorption
processes may dominate depending on the specific locations being measured. Differential flow paths
can also impact the spatial distribution of contaminants in groundwater, especially if there is limited
mixing of distinct groundwater volumes over the period of sampling.

     An adequate groundwater monitoring sampling program needs to account not only for site-specific
factors such as hydrologic characteristics, projected flow rates, and directional patterns, but also for
meeting data assumptions such as independence.  Statistical adjustments are necessary, such as selecting
intrawell comparisons for spatially distinct wells or removing autocorrelation effects in the case of time
dependence.

       DARCY'S EQUATION AND AUTOCORRELATION

     Past EPA  guidance  recommended  the use of  Darcy's equation  as  a  means  of establishing  a
minimum time interval between samples.  When validly applied  as a basic estimate of groundwater
travel time in a given aquifer, the Darcy equation ensures that separate volumes of groundwater are
being sampled (i.e., physical independence).  This increases the probability that the samples will also be
statistically independent.
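
     As a rough illustration only (the guidance's worked treatment appears in Chapter 14), the sketch
below applies the Darcy relationship to estimate a travel-time-based sampling interval. The average linear
(seepage) velocity is taken as v = K·i / n_e, and the separation distance and all numeric inputs are
hypothetical.

        # Rough sketch: gauging a minimum sampling interval from groundwater travel time.
        # All numeric inputs below are hypothetical illustration values.
        K = 1.2e-3      # hydraulic conductivity (cm/s), assumed
        i = 0.005       # hydraulic gradient (dimensionless), assumed
        n_e = 0.25      # effective porosity, assumed
        d = 10.0        # separation distance (cm) defining a 'new' volume of water, assumed

        v = K * i / n_e                    # average linear (seepage) velocity, cm/s
        t_days = d / v / 86400.0           # time to traverse the distance, in days
        print(f"seepage velocity = {v:.2e} cm/s; minimum interval ~ {t_days:.1f} days")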

     The Unified Guidance in Chapter  14 also includes  a  discussion  on  applying Darcy's equation.
Caution is advised in its use, however, since Darcy's equation cannot guarantee temporal independence.
Groundwater travel  time  is only  one  factor that  can influence the  temporal pattern  of aquifer
constituents.   The measurement  process  itself can  affect time related dependency.  An imprecise
analytical method might impart  enough  additional variability to  make the  measurements essentially
uncorrelated even in a short sampling interval.  Changes in analytical methods or laboratories and even
periodic re-calibration of analytical  instrumentation can impart time-related dependencies in a data set
regardless of the time intervals between samples.

     The overriding interest is  in  the behavior of  chemical contaminants  in  groundwater, not  the
groundwater itself. Many  chemical compounds do  not travel at  the same  velocity  as  groundwater.
Chemical characteristics such as adsorptive potential,  specific gravity, and molecular size can influence
the way chemicals move in the subsurface.  Large molecules, for example, will tend to travel slower than
the average linear velocity of groundwater because of matrix interactions. Compounds that exhibit  a
strong adsorptive potential will undergo a similar fate, dramatically changing  time of travel predictions
using the Darcy equation.  In some  cases,  chemical interaction with the matrix material will alter the
matrix structure  and its associated hydraulic conductivity and may result in an increase in contaminant
mobility. This last effect has been observed, for instance, with certain organic  solvents in clay units (see
Brown and  Andersen, 1981).

     The Darcy  equation is also not valid in turbulent  and non-linear laminar flow regimes. Examples of
these particular hydrological environments include karst and  'pseudo-karst' (e.g.,  cavernous basalt and
extensively fractured rock) formations. Specialized methods have been investigated by Quinlan  (1989)
for developing alternative monitoring procedures. Dye tracing as described by Quinlan (1989) and Mull,
et al.  (1988)  can be  useful  for  identifying  flow paths and travel times  in  these  two particular
environments;  conventional groundwater monitoring wells are often of little value in designing an
effective monitoring system in these types of environments.

     Thus, we suggest that Darcy's equation not be exclusively relied upon to gauge statistical sampling
frequency.   At many sites, quarterly or semi-annual sampling often provides a reasonable balance
between maintaining statistical independence among observations and enabling early detection of
groundwater problems. The Unified Guidance recommends three tools to explore or test for time-related
dependence among groundwater measurements. Time series plots (Chapter 9) can be  constructed on
multiple wells to examine whether there is a time-related dependence in the pattern  of  concentrations.
Parallel traces on such a plot may indicate correlation across wells as part of a natural temporal, seasonal
or induced laboratory effect. For longer data series, direct estimates of the autocorrelation in a series of
measurements  from a single well can be made using either the sample autocorrelation function or the
rank von Neumann ratio (Section 14.2).
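
     A minimal sketch of the first of these direct estimates, the sample autocorrelation function, is given
below (Python with numpy assumed; the quarterly series is hypothetical). Section 14.2 gives the formal
procedures and critical values.

        # Minimal sketch: sample autocorrelation estimates for a single well's series.
        import numpy as np

        x = np.array([10.2, 11.0, 10.7, 12.1, 11.8, 12.5,
                      11.9, 13.0, 12.6, 13.4, 12.9, 13.8])   # hypothetical quarterly results

        def sample_acf(series, max_lag=4):
            """Return the lag-k autocorrelation estimates r_k for k = 1..max_lag."""
            series = np.asarray(series, dtype=float)
            n = len(series)
            dev = series - series.mean()
            denom = np.sum(dev ** 2)
            return {k: float(np.sum(dev[: n - k] * dev[k:]) / denom) for k in range(1, max_lag + 1)}

        # r_k values within roughly +/- 2/sqrt(n) of zero are generally consistent
        # with temporal independence at that sampling spacing.
        print(sample_acf(x), " 2/sqrt(n) =", round(2 / np.sqrt(len(x)), 2))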

       DATA  MIXTURES INCLUDING ALIQUOT REPLICATE SAMPLES

     Some facility data sets may contain both single and aliquot replicate groundwater measurements
such as duplicate splits.  An entire data set may also consist of aliquot  replicates from a number of
independent water quality samples.  The guidance recommends against using aliquot data directly in
detection monitoring tests, since they are almost never statistically independent. Significant positive
correlation almost always exists between such duplicate samples or among aliquot sets.  However, it is
still possible to utilize some of the aliquot information within a larger water quality data set.

     Lab  duplicates  and field  splits can provide valuable information about the level of measurement
variability  attributable to  sampling and/or analytical techniques. However, to  use  them as  separate
observations in a prediction limit, control chart, analysis of variance [ANOVA] or other procedure, the
test must be specially structured to account for multiple data values per sampling event.

     Barring the use of these more  complicated methods, one suggested strategy has been to simply
average each set of field splits  and  lab duplicates and treat the resulting mean as a single observation in
the overall data set. Despite eliminating the dependence between field splits and/or lab duplicates, such
averaging  is not  an ideal solution. The  variability in means of two  correlated  measurements is
approximately  30% less than the variability associated with two single independent measurements. If a
data set consists of a mixture  of single measurements and lab  duplicates and/or field splits, the
variability of the averaged values  will be less than the  variability of the single  measurements.  This
would imply that the final data set is not identically distributed.
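
     A small simulation can make this effect concrete. In the sketch below (Python with numpy assumed),
the correlation between duplicate splits and the measurement standard deviation are purely hypothetical;
the point is simply that averaged duplicates have a smaller standard deviation than single measurements,
so a mixture of the two is not identically distributed.

        # Sketch: variability of single measurements vs. averaged correlated duplicates.
        import numpy as np

        rng = np.random.default_rng(1)
        n, sigma, rho = 100_000, 1.0, 0.4          # assumed SD and split correlation

        singles = rng.normal(0.0, sigma, n)

        # Correlated duplicate pairs (original + split), then averaged
        z1 = rng.normal(0.0, sigma, n)
        z2 = rho * z1 + np.sqrt(1.0 - rho**2) * rng.normal(0.0, sigma, n)
        averaged = (z1 + z2) / 2.0

        print("SD of single measurements :", round(float(singles.std(ddof=1)), 3))
        print("SD of averaged duplicates :", round(float(averaged.std(ddof=1)), 3))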

     When data are not identically  distributed, the  actual false positive and false negative rates of
statistical tests may be higher or lower than expected.  The effect of mixing single measurements and
averaged aliquot replicates might be balanced out in a two-sample t-test if sample sizes are roughly
equal.  However, the impact of non-identically distributed data can be substantial for an upper prediction
limit test of a future single sample where the background sample includes a mixture of aliquot replicates
and  single measurements.   Background variability  will be  underestimated, resulting in a  lowered
prediction limit and a higher false positive rate.

     One statistically defensible but expensive  approach is to perform the same number of aliquot
replicate measurements on all physical  samples collected from  background  and compliance  wells.
Aliquot replicates can be averaged,  and the same variance reduction will  occur in all  the final
observations. The statistical test degrees of freedom, however, are based on the number of independent,
averaged samples.

     Mixing single and averaged aliquot data is a serious problem if the component of variability due to
field sampling methods and laboratory measurement error is a substantial fraction of the overall sample
variance. When natural variability  in groundwater concentrations  is the largest component, averaging
aliquot replicate measurements will do little to weaken the assumption of identically-distributed data.
Even when variability due to sampling and analytical methods is a large component of the total variance,
if the percentage of samples with aliquot replicate measurements is fairly small (say, 10% or less), the
impact  of  aliquot replicate averaging should usually be negligible. However,  consultation  with  a
professional statistician is recommended.

     The simplest alternative is to  randomly select one value from each aliquot replicate set along with
all non-replicate individual measurements, for use in statistical testing. Either this approach  or the
averaged replicate method described above will result in smaller degrees of freedom than the strategy of
using all the aliquots, and will more accurately reflect the statistical properties of the data.

       CORRECTING FOR TEMPORAL  CORRELATION

     The Unified Guidance  recommends two general  methods  to  correct for observable  temporal
correlation.  Darcy's  equation is mentioned  above as a rough guide  to physical independence of
consecutive groundwater  observations. A more  generally applicable strategy for yet-to-be-collected
measurements involves adjusting  the  sampling  frequency  to  avoid autocorrelation  in consecutive
sampling events.   Where autocorrelation is a serious  concern, the  Unified  Guidance recommends
running  a pilot study at two or  three  wells and  analyzing the study  data  by using the  sample
autocorrelation function (Section 14.3.1). The autocorrelation function plots the strength of correlation
between consecutive measurements  against  the time  lag between  sampling events.  When the
autocorrelation  becomes insignificantly  different  from  zero  at a particular  sampling interval, the
corresponding sampling frequency is the maximum that will ensure uncorrelated sampling events.

     Two other strategies are recommended for  adjusting already collected data. First, a  longer data
series at a single well can be  corrected for seasonality by estimating and removing the seasonal trend
(Section 14.3.3). If both a linear trend and seasonal fluctuations are evident, the seasonal Mann-Kendall
trend test can be run to identify the trend despite the seasonal effects (Section 14.3.4).  A second strategy
is for sites where a temporal effect (e.g., temporal dependence, seasonality) is apparent across multiple
wells.  This involves estimating a  temporal  effect via a one-way ANOVA and  then creating adjusted
measurements using the ANOVA residuals (Section 14.3.3).  The  adjusted data can then be utilized in
subsequent statistical procedures.
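
     The second strategy can be sketched as follows (Python with numpy assumed; the values are
hypothetical). Treating sampling events as the one-way ANOVA factor, the adjusted data are the ANOVA
residuals re-centered on the grand mean; Section 14.3.3 describes the full procedure and its diagnostics.

        # Sketch: adjusting multi-well data for a common temporal effect using one-way
        # ANOVA residuals (sampling events as the factor). Values are hypothetical.
        import numpy as np

        # rows = sampling events, columns = wells
        data = np.array([
            [10.1, 11.3,  9.8],
            [12.4, 13.1, 12.0],   # an apparent site-wide (e.g., seasonal) high
            [ 9.5, 10.2,  9.1],
            [11.0, 11.8, 10.6],
        ])

        grand_mean = data.mean()
        event_means = data.mean(axis=1, keepdims=True)

        adjusted = data - event_means + grand_mean   # residuals + grand mean
        print(np.round(adjusted, 2))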

6.3.2 SPATIAL VARIATION: INTERWELL VS.  INTRAWELL TESTING

       ASSUMPTIONS IN BACKGROUND-TO-DOWNGRADIENT COMPARISONS

     The  RCRA groundwater  monitoring  regulations initially presume  that  detection monitoring
background can be defined on the basis of a definable groundwater gradient. In a considerable number of
situations, this approach is problematic. No groundwater gradient may be measurable for identifying
upgradient and downgradient well locations around a regulated unit. The hydraulic gradient may change
in direction, depth or magnitude due  to seasonal fluctuations.   Groundwater mounding or other flow
anomalies can occur. At most locations, significant spatial variability among wells exists for certain
constituents. Where spatial variation is a natural artifact of the  site-specific geochemistry, differences
between upgradient and downgradient wells are unrelated to on-site waste management practices.

     Both the Subtitle C and Subtitle D RCRA regulations allow for a determination that background
quality may include sampling of wells not hydraulically upgradient of the waste management area. The
rules recognize that this can occur either when  hydrological information is unable to indicate which
wells  are  hydraulically  upgradient or when sampling other wells  will be "representative or more
representative than that provided by the upgradient wells."

     For upgradient-to-downgradient  well comparisons, a crucial detection monitoring assumption is
that downgradient well  changes  in groundwater quality are only caused by on-site waste management
activity.  Up- and down-gradient well measurements are also assumed to be comparable and equal on
average unless  some waste-related change  occurs.  If other factors  trigger significant increases in
downgradient well locations, it may be very difficult  to pinpoint the monitored unit as the source or
cause of the contaminated groundwater.

     Several  other  critical  assumptions apply  to the interwell approach.  It is  assumed that the
upgradient and downgradient well samples are drawn from the same aquifer and that wells are screened
at essentially the same hydrostratigraphic position. At some sites, more than one aquifer underlies the
waste site or landfill, separated by confining layers of clay or other less permeable material.  The fate
and transport characteristics of groundwater contaminants likely  will  differ in each aquifer, resulting in
unique concentration patterns. Consequently, upgradient  and downgradient observations may not be
comparable (i.e., drawn from the same statistical population).

     Another  assumption  is that groundwater  flows in a definable pathway from  upgradient  to
downgradient wells beneath the regulated unit. If flow paths are  incorrectly determined or this does not
occur, statistical comparisons can be invalidated.  For example, a real release may be occurring at a site
known to have groundwater mounding beneath the monitored unit.  Since the groundwater may move
towards both the downgradient and upgradient wells, it may not be possible to detect the release if both
sets of wells become equally or similarly contaminated.  One exception to this might occur if certain
analytes are shown to exhibit uniform behavior  in both historical upgradient and downgradient wells
(e.g., certain infrequently detected trace elements). As long as  the flow pathway from the unit to the
downgradient wells is assured, then an interwell test based on this combined background could still
reflect a real exceedance in the downgradient wells.13

     Groundwater flow should also move at a  sufficient velocity beneath the site,  so that the  same
groundwater observed at upgradient well locations is subsequently monitored at downgradient wells in
the course of an evaluation period (e.g., six months or a year).   If groundwater flow is much slower,
measurements from upgradient and downgradient wells may be more akin to samples from two separate
aquifers. Extraneous factors may separately influence the downgradient and background populations,
confusing the determination of whether or not a release has occurred.

     While statistical testing can determine whether there are significant differences between upgradient
and downgradient well measurements, it cannot determine why such differences exist. That is primarily
the  concern  of  a  hydrologist who  has carefully reviewed  site-specific  factors.  Downgradient
concentrations may be greater than background  because contamination of the underlying aquifer has
occurred. The increase may be due to other factors,  including  spatially variable  concentration levels
attributable to changing soil  composition and geochemistry from one well location to another. It could
also be due to the migration  of contaminants from off-site sources reaching downgradient wells.  These
and  other factors (including those summarized in Chapter 4 on SSI Increases) should be considered
before deciding that statistically significant  background-to-downgradient differences  represent site-
related contamination.

     An example of how background-to-downgradient well differences can be misleading is illustrated
in Figure 6-4 below. At this Eastern coastal site,  a Subtitle D landfill was located just off a coastal river
emptying  into  the  Atlantic  Ocean  a short  distance downstream.   Tests of specific  conductance
measurements comparing the  single upgradient  well  to  downgradient  well data indicated significant
increases at all  downgradient wells, with one well indicating levels more  than an order of magnitude
higher than background concentrations.

     Based on  this analysis, it was initially concluded that waste management activities at the landfill
had impacted groundwater.  However, further hydrologic investigation  showed that nearby river water
also exhibited  elevated levels  of specific  conductance,  even  higher than  measurements at  the
downgradient wells.  Tidal fluctuations and changes in river discharge caused sea water to periodically
mix with the coastal river water at a location near the downgradient wells.  Mixed river and sea water
apparently seeped into the aquifer, impacting downgradient wells but not at the upgradient location.  An
off-site  source  as opposed to  the landfill itself was  likely responsible for the observed elevations in
specific conductance. Without this additional hydrological information, the naive statistical comparison
between upgradient and downgradient wells would have reached an incorrect conclusion.
13  The same would be true of the "never-detected" constituent comparison, which does not depend on the overall flow
  pathway from upgradient to downgradient wells.

                           Figure 6-4.  Landfill Site Configuration
                           [Site map figure: landfill adjacent to a coastal river near the sea, with
                           the upgradient well, downgradient wells, and river specific conductance
                           data noted]
TRADEOFFS IN INTERWELL AND  INTRAWELL APPROACHES

     The choice  between  interwell and intrawell  testing  primarily  depends  on  the  statistical
characteristics of individual constituent data behavior in background  wells.  It  is presumed that  a
thorough background  study described in Chapter 5 has been completed.  This involves selecting the
constituents deemed appropriate for detection  monitoring, identifying distributional characteristics, and
evaluating  the constituent  data  for trends,  stationarity,  and mean spatial  variability among wells.
ANOVA tests can be used to assess both well mean  spatial variability and the potential for pooled-
variance estimates if an intrawell approach is needed.

     As discussed in Chapter 5, certain classes of potential monitoring constituents are more likely to
exhibit spatial variation.  Water quality indicator parameters are  quite frequently spatially variable.
Some authors,  notably Davis and McNichols  (1994) and Gibbons  (1994a), have  suggested that
significant  spatial variation is  a nearly ubiquitous feature at RCRA-regulated landfills and  hazardous
waste sites, thus  invalidating the use of interwell  test methods.  The Unified Guidance  accepts that
interwell tests still have an important role in groundwater monitoring, particularly for certain classes of
constituents like non-naturally occurring VOCs and some trace elements.  Many sites may best be served
by a statistical program which combines interwell and intrawell procedures.

     Intrawell testing is an appropriate and recommended alternative strategy for many constituents.
Well-specific backgrounds afford intrawell tests  certain advantages over the interwell approach.  One
key advantage is that confounding results due to spatial variability are eliminated, since all data used in an
intrawell test are obtained from a single location.  If natural background levels change substantially from
one well to the next, intrawell background provides the most accurate baseline for use in statistical
comparisons.

      At times, the variability in a set of upgradient background measurements pooled from multiple
wells can be larger than the variation in individual intrawell background wells.  Particularly if not
checked with ANOVA well mean testing, interwell variability could substantially increase if changes in
mean levels from one location to the next are also incorporated. While pooling should not occur among
well means determined to be significantly different using ANOVA, a more likely situation is that the true
means and variances of the pooled wells differ slightly from well to well.  The ANOVA test might still
conclude that the mean differences were insignificant and that the equal variance assumption is satisfied. The net
result (as explained below) is that intrawell tests can be more statistically powerful than comparable
interwell tests using upgradient background, despite employing a smaller background sample size.

     Another advantage of using intrawell background is that a reasonable baseline for tests of future
observations can be established at historically contaminated wells. In this case, the intrawell background
can be used to track the onset of even more extensive contamination in the future.  Some compliance
monitoring wells  exhibit chronic elevated contaminant levels (e.g., arsenic) considerably above other site
wells which may not be clearly attributed to a regulated unit  release.  The regulatory agency has the
option of continuing detection monitoring  or  changing to  compliance/corrective action monitoring.
Unless the  agency has already determined that the pre-existing contamination is subject to compliance
monitoring or remedial action under RCRA, the detection monitoring option would be to test for recent
or  future  concentration increases above the  historical  contamination  levels by  using intrawell
background as a well-specific baseline.

     Intrawell tests are not preferable for all groundwater  monitoring scenarios.  It may be unclear
whether a given compliance well was historically contaminated prior to being regulated or more recently
contaminated.  Using intrawell  background to  set  a  baseline  of comparison may ignore  recent
contamination subject to compliance testing and/or  remedial action. Even more  contamination in the
future would then be required to trigger a statistically significant increase [SSI] using the intrawell test.
The Unified Guidance  recommends the use  of intrawell  testing only when  it is clear that  spatial
variability is not the result of recent contamination attributable to the regulated unit.

     A second concern is that intrawell tests typically utilize a smaller set  of background data than
interwell methods. Since statistical power depends significantly on background sample size, it may be
more difficult to  achieve comparable statistical power with intrawell tests than with interwell methods.
For the latter, background data can be collected from multiple wells when appropriate, forming a larger
pool of measurements than would be available at a  single well. However, it may also be possible to
enhance intrawell sample sizes for parametric tests using the pooled-variance approach.

     Traditional  interwell tests can be appropriate for certain constituents if the hydraulic assumptions
discussed earlier are verified and there is no evidence of significant spatial variability. Background data
from other  historical compliance wells not significantly different from upgradient wells using ANOVA
may also be used in some cases.   When these conditions are met, interwell tests can be preferable as
generally more powerful tests.  Upgradient groundwater quality can then be more easily monitored in
parallel to  downgradient locations.  Such upgradient monitoring can signal changes in natural in-situ
concentrations or possible migration from off-site sources. 14

       For most situations, the background constituent data patterns will determine which option is most
feasible. Clear indications of spatially distinct well means through ANOVA testing will necessitate
some form of intrawell methods. The further choice is then which type of statistical test will provide
the best power.

       It may be possible to increase the effective sample size associated with a series of intrawell tests.
As explained in Chapters 13 & 19, the K-multipliers for intrawell prediction limits primarily depend on
the number of background measurements used to estimate the standard deviation.  It is first necessary to
determine that  the intrawell background in a series of compliance  wells is both uncontaminated  and
exhibits similar levels of variability from  well to well.  Background data from these wells can then be
combined to form a pooled intrawell  standard deviation estimate with larger degrees of freedom, even
though  individual well  means vary.   A  transformation may be needed to stabilize the well-to-well
variances.  If one or more of the compliance wells is already contaminated, these should  not be mixed
with uncontaminated well data in obtaining the pooled standard deviation estimate.

       A  site-wide  constituent pattern of  no  significant  spatial variation will generally favor  the
interwell testing approach.  But given the potential for hydrological and other issues discussed above,
further evaluation of intrawell methods may be appropriate.  Example 6-2 provided an illustration of a
specific intrawell constituent having a lower absolute standard deviation than an interwell pooled data
set, and hence greater relative and absolute power.  In making such  an  interwell-intrawell comparison,
the specific test and all  necessary design inputs  must be considered.  Even if a given intrawell data set
has a low background standard deviation compared to an interwell counterpart, the advantage in absolute
terms over the  relative power approach will change with differing design inputs. The simplest way to
determine if the intrawell approach might be advantageous is to calculate the  actual background limits of
a potential test using existing intra- and inter-well data sets. In a given prediction limit test, for example,
the actual lower limit will determine the more powerful test.
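
     The sketch below (Python with numpy assumed) illustrates this simplest comparison for a parametric
prediction limit. The interwell set echoes Example 6-3's pooled sulfate background, while the intrawell set
and both multipliers are hypothetical; actual K values would come from the Chapter 19 tables for the
chosen retesting plan and false positive design.

        # Sketch: compare the actual limits an interwell and an intrawell background would
        # produce; the lower limit identifies the more powerful test. Inputs are illustrative.
        import numpy as np

        interwell_bg = np.array([560., 530., 568., 490., 510., 550., 570., 540., 542., 590.])
        intrawell_bg = np.array([552., 548., 556., 561., 544., 549., 553., 558.])

        kappa_inter, kappa_intra = 2.2, 2.6    # placeholder multipliers (larger for smaller n)

        pl_inter = interwell_bg.mean() + kappa_inter * interwell_bg.std(ddof=1)
        pl_intra = intrawell_bg.mean() + kappa_intra * intrawell_bg.std(ddof=1)

        print(f"interwell limit = {pl_inter:.1f}; intrawell limit = {pl_intra:.1f}")
        print("more powerful test:", "intrawell" if pl_intra < pl_inter else "interwell")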

      If desired,  approximate data-based power curves (Section  6.2.4) can be constructed to evaluate
absolute power over a range of concentration level  increases.  In practice,  the method for comparing
interwell versus intrawell testing strategies with the same well-constituent pair involves the following
basic steps:

   1.  Given the interwell background sample size (n_inter), the statistical test method (including any
        retesting), and the individual per-test α for that well-constituent pair, compute or simulate the
        relative power of the test at multiples of k·s_inter above the baseline mean level.  Let k range from 0
        to 5 in increments of 0.5, where the interwell population standard deviation (σ_inter) has been
        replaced by the sample background standard deviation (s_inter).
14 The same can be accomplished via intrawell methods if upgradient wells continue to be sampled along with required
  compliance well locations. Continued tracking of upgradient background groundwater quality is recommended regardless
  of the testing strategy.

                                              6^33                                     March 2009

-------
Chapter 6. Detection Monitoring Design	Unified Guidance

   2.  Repeat Step 1 for the intrawell test.  Use the intrawell background sample size (n_intra), statistical
        test method, background sample standard deviation (s_intra), and the same individual per-test α to
        generate a relative power curve.

   3.  On  the same graph, plot overlays of the estimated data-based interwell and  intrawell  power
       curves  (as  discussed  in  Section 6.2.4).  Use  the  same  range  of (absolute, not  relative)
       concentration increases over baseline along the horizontal axis.

   4.  Visually inspect the data-based power curves to determine which method offers better power
        over a wider range of possible concentration increases (a brief sketch of such a comparison
        follows this list).
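
     A compact sketch of these four steps is given below (Python with numpy, scipy, and matplotlib
assumed). For illustration it uses a 1-of-1 normal prediction limit at a per-test α of 0.01 as the common
test and hypothetical sample sizes and standard deviations; an actual comparison would substitute the
site's chosen tests and retesting plans.

        # Sketch: overlay interwell and intrawell data-based power curves for an
        # illustrative 1-of-1 prediction limit test (alpha = 0.01). Inputs are hypothetical.
        import numpy as np
        from scipy import stats
        import matplotlib.pyplot as plt

        def power_curve(n, s, alpha=0.01, k_grid=np.arange(0, 5.5, 0.5)):
            """Approximate power vs. absolute concentration increase (k * s)."""
            t_crit = stats.t.ppf(1 - alpha, df=n - 1)
            delta = k_grid / np.sqrt(1 + 1.0 / n)          # non-centrality parameters
            return k_grid * s, stats.nct.sf(t_crit, df=n - 1, nc=delta)

        x_inter, p_inter = power_curve(n=24, s=41.0)   # interwell: larger n, larger s (assumed)
        x_intra, p_intra = power_curve(n=8,  s=18.0)   # intrawell: smaller n, smaller s (assumed)

        plt.plot(x_inter, p_inter, label="interwell (n=24, s=41)")
        plt.plot(x_intra, p_intra, label="intrawell (n=8, s=18)")
        plt.xlabel("Concentration increase above baseline")
        plt.ylabel("Approximate power")
        plt.legend()
        plt.show()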

     The  Unified  Guidance  recommends that users  apply the most powerful statistical  methods
available in detecting and identifying contaminant releases for each well-constituent pair.  The  ERPC
identifies a minimum acceptable standard for judging the relative power of particular tests. However,
more powerful methods based on absolute power may be considered preferable in certain circumstances.

     As a final concern, very small individual well samples in the early stages of a monitoring program
may make it difficult to utilize an intrawell method having both sufficient statistical power and meeting
false positive design criteria.  One  option would be to temporarily defer tests on those well-constituent
pairs until additional background observations can be collected.  A second option is to use the intrawell
approach despite  its inadequate power, until the  intrawell background is sufficiently large via periodic
updates (Chapter 5).  A third option might be to use a more powerful intrawell test (e.g., a higher order
1-of-m parametric or non-parametric prediction limit test).  Once background is increased, a lower order
test might suffice.  Depending on the type of test, some control may be lost over power (parametric tests)
or over the false positive rate (non-parametric tests).  These tradeoffs are considered more fully in Chapter 19. For the
first two options and the parametric test under the third  option,  there is some added risk that a release
occurring during the period of additional data collection might be missed.  For the non-parametric test
under the third option, there is an increased risk of a false positive error. Any of these options might
be included as special permit conditions.

6.3.3 OUTLIERS

     Evaluation of outliers should begin with historical  upgradient and possibly compliance well data
considered  for defining initial background, as described in Chapter 5, Section 5.2.3.  The key goal is to
select the data most representative  of near-term and likely future background. Potentially discrepant or
unusual values can occur for many reasons, including 1) a contaminant release that significantly impacts
measurements at compliance wells; 2) true but extreme background groundwater measurements; 3)
inconsistent sampling or analytical chemistry methodology resulting in laboratory contamination or other
anomalies; and 4) errors in the transcription of data values or decimal points.  While values arising from
the first two conditions may appear discrepant, they would not be considered outliers.

     When appraising extensive background  data sets with long periods of  acquisition and somewhat
uncertain quality, it is recommended that a formal statistical evaluation of outliers not be conducted until
a thorough review of data quality (errors, etc.) has been performed.  Changes in analytical
methodologies and the presence of sample interferences or dilutions can affect the historical data record.
Past and current treatment of non-detects should also  be investigated, including whether there  are
multiple reporting limits  in the data base. Left-censored values can impact whether or not the sample
appears normal  (Chapter 15),  especially if  the  data  need to be normalized via a transformation.
Techniques for evaluating censored data should be considered,  especially those  which can properly
account for multiple RLs.  Censored probability plots (Chapter 15) or quasi-nonparametric box plots
(Chapter 12) adapted by John Tukey (1977) can be used as methods to screen for outliers.
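
     A minimal version of such a screen is sketched below (Python with numpy assumed; the data and the
conventional 1.5 x IQR fences are illustrative). Formal identification should still rely on the Chapter 12
procedures.

        # Sketch: quasi-nonparametric box plot screen for potential outliers (Tukey fences).
        import numpy as np

        data = np.array([12.1, 10.4, 11.8, 9.7, 13.0, 10.9, 11.2, 48.5, 12.6, 10.1])  # assumed

        q1, q3 = np.percentile(data, [25, 75])
        iqr = q3 - q1
        lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

        flagged = data[(data < lower_fence) | (data > upper_fence)]
        print("screening fences:", round(lower_fence, 2), "to", round(upper_fence, 2))
        print("values flagged for further review:", flagged)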

     The guidance also recommends that statistical testing of potential outliers be performed on
initial background data, including historical compliance well data potentially considered as additional
background data.  Recognizing the potential risks  as discussed in Chapter 5, removal of significant
outliers may be appropriate even if no probable error or discrepancy can be firmly identified. The risk is
that  high values  registering  as  statistical  outliers  may reflect  an extreme, but real  value  from the
background population rather than a true outlier, thereby increasing the likelihood of a false positive
error. But the effect of removing outliers from the background data will usually be to improve the odds
of detecting upward changes in concentration levels at compliance wells, thus providing further
protection of human health and the environment.  Automated screening and removal of background data
for statistical outliers is not recommended without some consideration of the likelihood of an outlier
error.

     A statistical outlier is defined as a value  originating from a different statistical population than the
rest of the sample.  Outliers or  observations not derived from the same population as the rest of the
sample violate the basic statistical assumption of identically-distributed measurements. If an outlier is
suspected,  an initial helpful step is to construct a probability plot of the ordered sample data versus the
standardized normal distribution (Chapter 12). A probability plot is designed to judge  whether the
sample data are consistent with a normal population model. If the data can be normalized, a probability
plot of the transformed observations should also be constructed.  Neither is a formal test, but both can still
provide important visual evidence as to whether the suspected outlier(s) should be further evaluated.

     Formal testing for outliers should be done only if an observation seems particularly high compared
to the rest of the sample. The data can be evaluated with either Dixon's or Rosner's tests (Chapter 12).
These outlier tests assume that the rest of the data, except for the suspect observation(s), are normally
distributed (Barnett and Lewis, 1994). It is recommended that tests also be conducted on transformed
data if the original data indicate one or more potential outliers.  Lognormal and other skewed
distributions can  exhibit apparently  elevated  values in  the original concentration  domain, but still be
statistically indistinguishable when normalized via a transformation. If the latter is the case, the outlier
should be retained and the data set treated as fitting the transformed distribution.

     Future background  and compliance well data  may also be  periodically  tested for  outliers.
However, removal of outliers should  only take  place under certain conditions, since a true elevated value
may  fit the pattern of a release or a change in historical background  conditions.   If either Dixon's or
Rosner's test identifies an observation as a statistical outlier, the measurement should not be treated as
such until a specific physical reason for the  abnormal  value can be determined. Valid reasons might
include  contaminated sampling  equipment,  laboratory  contamination   of  the   sample,  errors in
transcription of the data values, etc. Records documenting the sampling and analysis of the measurement
(i.e., the  "chain of custody") should be thoroughly investigated. Based on this review, one of several
actions might be taken as a general rule:

    »»»  If an error in transcription, dilution, analytical procedure, etc. can be identified and the correct
        value recovered, the observation should be replaced by its corrected value and further statistical
        analysis done with the corrected value.

    »»»  If it can be shown that the observation is in error but the correct value cannot be determined, the
        observation should be removed from the data set and further statistical analysis performed on the
        reduced data set. The fact that the observation was removed and the reason for its removal should
        be documented when reporting results of the analysis.

    »»»  If no error in the value can be documented, it should be assumed that the observation is a true but
        extreme value.  In this case, it should not be altered or removed. However, it may be helpful to
        obtain another observation in order to verify or confirm the initial measurement.

6.3.4 NON-DETECTS

     Statistically, non-detects are considered 'left-censored'  measurements because the concentration of
any non-detect is known or assumed only to fall within a certain range of concentration values (e.g.,
between 0 and the RL). The direct estimate has been censored by limitations of the measurement process
or analytical technique.

     As noted, non-detect values can affect evaluations of potential outliers. Non-detects and detection
frequency also impact  what detection  monitoring tests  are appropriate  for a given constituent. A low
detection frequency  makes it  difficult to  implement parametric statistical  tests, since it may not  be
possible to determine if the underlying population is normal or can be normalized.  Higher detection
frequencies  offer more options,  including simple  substitution  or estimating the mean and standard
deviation of samples  containing non-detects by means of a censored estimation technique (Chapter 15).

     Estimates of the background mean  and standard deviation are needed to construct parametric
prediction and control chart limits, as well as confidence intervals. If simple substitution is appropriate,
imputed values for individual  non-detects can be used  as  an alternate way to construct  mean and
standard deviation estimates.  These estimates are also needed to update the cumulative sum [CUSUM]
portion of control charts or to compute means of order p compared against prediction limits.

     The Unified Guidance recommends simple substitution only when no more than 10-15% of
the sample observations are non-detects. In those circumstances, substituting half the RL for each non-
detect is not likely to substantially impact the results of statistical testing.  Censored estimation
techniques like Kaplan-Meier or robust regression on order statistics [ROS] are recommended any time
the detection frequency is at least 50% (see Chapter 15).
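
     As a rough illustration of the decision rule above, the following Python sketch (not part of the
guidance) checks the non-detect fraction of a hypothetical sample and applies simple substitution at half
the RL only when censoring is light; the data, reporting limit, and cutoffs are illustrative assumptions.

```python
# Illustrative sketch of choosing between simple substitution and a censored
# estimator based on the non-detect fraction; all values are hypothetical.
import numpy as np

conc = np.array([3.2, 4.1, 2.0, 5.0, 3.7, 4.4, 2.5, 3.9, 4.8, 3.1])   # ug/L
nd   = np.array([False, False, True, False, False, False, False, False, False, False])
rl   = np.full_like(conc, 2.0)            # reporting limit for each sample

nd_fraction = nd.mean()
if nd_fraction <= 0.15:
    # Light censoring: substitute half the RL for each non-detect
    adjusted = np.where(nd, rl / 2.0, conc)
    print(f"ND fraction {nd_fraction:.0%}: mean = {adjusted.mean():.2f}, "
          f"sd = {adjusted.std(ddof=1):.2f}")
else:
    # Heavier censoring (but >= 50% detects): use Kaplan-Meier or robust ROS
    # as described in Chapter 15 rather than substitution.
    print(f"ND fraction {nd_fraction:.0%}: use a censored estimation technique")
```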

     For  lower  detection  frequencies,  non-parametric  tests  are recommended.  Non-parametric
prediction limits (Chapter 18) can be constructed as an  alternative to  parametric prediction limits or
control charts. The Tarone-Ware two-sample test  (Chapter 16) is specifically designed to accommodate
non-detects  and serves as an alternative  to the t-test. By  the  same token, the Kruskal-Wallis test
(Chapter 17)  is a non-parametric, rank-based alternative to the parametric ANOVA.  These latter tests
can be used when the non-detects and detects can be jointly sorted and partially ordered (except for tied
values).

     When all data are non-detect, the Double Quantification rule (Section 6.2.2) can be used to define
an approximate non-parametric prediction limit, with the RL as  an upper bound. Before doing this, it
should be determined whether chemicals never or not recently detected in groundwater should even be
formally tested. This will depend on whether the monitored constituent from a large analytical  suite is
likely to originate in the waste or leachate.

     Even if a data set contains only  a small proportion of non-detects,  care should be taken when
choosing between the method detection limit [MDL], the quantification  limit [QL], and the RL in
characterizing 'non-detect' concentrations.  Many non-detects are  reported with  one of three data
qualifier flags: "U," "J," or "E." Samples with a U data qualifier represent 'undetected' measurements,
meaning that the signal  characteristic  of that analyte could not be observed or distinguished  from
'background noise' during lab  analysis. Inorganic  samples with an E flag and organic samples with  a J
flag  may  or  may not be reported  with an estimated concentration.  If no concentration estimate is
reported, these samples represent 'detected,  but not quantified' measurements. In this case, the actual
concentration is assumed to be  positive, falling somewhere between zero and the QL or possibly the RL.

     Since the  actual  concentration  is  unknown, the suggested  imputation  when using  simple
substitution is to replace each non-detect having a qualifier of E or J by one-half the RL. Note, however,
that E and J samples reported with estimated concentrations should be treated as valid measurements  for
statistical purposes. Substitution of one-half the RL is not recommended for these measurements, even
though the degree of uncertainty associated with the estimated concentration is probably greater than that
associated with measurements above the RL.

     As a general rule, non-detect concentrations should not be  assumed to be bounded above by the
MDL. The MDL is usually estimated on the basis of ideal laboratory conditions  with physical  analyte
samples that  may or  may not account for matrix or other interferences encountered when analyzing
specific field  samples. For certain trace element analytical methods, individual laboratories may report
detectable limits closer to an MDL than a nominal QL.  So long as the laboratory has confidence in the
ability to  quantify  at its lab- or occasionally event-specific  detection level, this  RL  may also  be
satisfactory. The RL should typically be taken as a more reasonable upper bound for non-detects when
imputing estimated concentration values to these measurements.

     RLs are sometimes but not always equivalent to a particular laboratory's QLs. While analytical
techniques may  change  and  improve  over time leading  to  a lowering of the  achievable  QL, a
contractually negotiated RL might be much higher. Often a multiplicative factor is built into the RL to
protect a contract lab against particular liabilities. A good practice is to periodically review a given
laboratory's capabilities, to encourage reporting non-detects at actual QLs whenever possible, and to
request standard qualifiers with all data measurements as well as estimated concentrations for E- and
J-flagged samples.

     Even when no estimate of concentration can  be made, a lab  should regularly report the distinction
between 'undetected'  and 'detected, but not quantified' non-detect measurements. Data sets with such
delineations can be used  to advantage in rank-based non-parametric procedures. Rather than  assigning
the same tied rank to all non-detects (Chapter 16), 'detected but not quantified' measurements should
be given larger ranks than those assigned to 'undetected' samples.  These two types of non-detects should
be treated as two distinct groups of tied observations for use in the non-parametric Wilcoxon rank-sum
procedure.
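
     A small hypothetical sketch of this ranking idea is shown below; the coding of the two non-detect
classes and the example data are assumptions, not part of the guidance.

```python
# Rank 'U' (undetected) samples below 'J/E' (detected-but-not-quantified)
# samples, which in turn rank below quantified results, for use in a
# rank-based test such as the Wilcoxon rank-sum. Values are hypothetical.
import numpy as np
from scipy.stats import rankdata

values = np.array([np.nan, np.nan, np.nan, 4.2, 5.1, 3.8, np.nan, 6.0])
flags  = np.array(['U',    'U',    'J',    '',  '',  '',  'J',    ''])

# Sort key: undetected lowest, detected-not-quantified next, then the
# quantified concentrations themselves (assumed positive here).
key = np.where(flags == 'U', -2.0,
      np.where(flags == 'J', -1.0, values))

ranks = rankdata(key, method='average')
print(ranks)   # the two 'U' samples share one tied rank, the two 'J' samples another
```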

6.4 DESIGNING DETECTION  MONITORING TESTS

     In the following sections, the main formal detection monitoring tests covered in this guidance are
described in the context  of site  design  choices.   Advantages as well  as limitations are  presented,
including the use of certain methods as diagnostic tools in determining the appropriate formal test(s).

6.4.1  T-TESTS

     A statistical comparison between two sets of data is known as a two-sample test. When normality
of the sample data can be presumed, the parametric Student t-test is commonly used (Section 16.1). This
test compares two distinct populations, represented by two samples.  These samples  can either be
individual well data sets, or a common pooled background versus individual compliance well data. The
basic goal of the t-test is to determine whether there is any statistically significant difference between the
two population means. Regulatory requirements for formal use of two-sample t-tests are limited to the
Part 265 indicator parameters, and have generally been superseded in the Parts 264 and 258 rules by tests
which can account for multiple comparisons.

     When the sample data are non-normal and may contain non-detects, the Unified Guidance provides
alternative two-sample tests to the parametric t-test. The Wilcoxon rank-sum test (Section 16.2) requires
that the combined  samples be sorted and ranked. This test evaluates potential differences in population
medians rather than the means. The Tarone-Ware test (Section 16.3) is specially adapted to handle left-
censored measurements, and also tests for differences in population medians.

     The t-test or a non-parametric variant is recommended as a validation tool when updating intrawell
or other background data  sets (Chapter 5).   More recently collected data considered for background
addition are compared to the  historical data set.   A non-significant  test result  implies no mean
differences, and the newer data may be added to the original set. These tests are generally useful  for any
two-sample diagnostic comparisons.
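
     A minimal sketch of such a two-sample check, using hypothetical data and an assumed update
significance level, is given below; it is offered only as an illustration of the comparison, not as a
prescribed procedure.

```python
# Compare newer candidate data against historical background before pooling,
# using Welch's t-test and the Wilcoxon rank-sum (Mann-Whitney) test.
import numpy as np
from scipy import stats

historical = np.array([10.2, 11.5, 9.8, 10.9, 12.1, 11.0, 10.4, 11.8])
newer      = np.array([11.1, 10.7, 12.0, 11.4])

t_stat, t_p = stats.ttest_ind(historical, newer, equal_var=False)
u_stat, u_p = stats.mannwhitneyu(historical, newer, alternative='two-sided')

alpha = 0.01                  # assumed per-test significance level for the update check
if t_p > alpha and u_p > alpha:
    background = np.concatenate([historical, newer])    # pool the two data sets
    print(f"No significant difference (t p = {t_p:.3f}); updated background n = {background.size}")
else:
    print("Newer data differ from historical background; investigate before pooling")
```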

6.4.2  ANALYSIS OF VARIANCE  [ANOVA]

     The parametric one-way ANOVA is an extension of the t-test to  multiple sample groups.  Like its
two-sample counterpart, ANOVA tests  for significant differences in one or more  group (e.g., well)
means. If an overall significant difference is found as measured  by the F-statistic, post-hoc statistical
contrasts  may be used to  determine where the differences lie among individual  group means.   In the
groundwater detection  monitoring context,  only  differences  of mean well increases relative  to
background are considered of importance. The ANOVA test also has wide applicability as a  diagnostic
tool.

       USE OF ANOVA IN FORMAL DETECTION MONITORING TESTS

     RCRA regulations under Parts 264 and  258 identify parametric and non-parametric ANOVA as
potential detection monitoring tests. Because of its flexibility and power, ANOVA can sometimes be an
appropriate  method  of statistical  analysis  when groundwater monitoring is  based on an  interwell
comparison of background and compliance well data.  Two types of ANOVA are presented  in the
Unified Guidance: parametric and non-parametric one-way ANOVA (Section 17.1).   Both methods
attempt to assess whether distinct monitoring wells  differ in average concentration  during  a given
evaluation period.15

     Despite the potential attractiveness of ANOVA tests, use in formal detection monitoring is limited
by these important factors:

         •  Many monitoring constituents exhibit significant spatial variability and cannot make use of
            interwell comparisons;

         •  The test can be confounded by a large number of well network comparisons;

         •  A minimum well sample size must be available for testing; and

         •  Regulatory false positive error rate restrictions limit the ability to effectively control the
            overall false positive rate.

     As  discussed in Section 6.2.3, many if not most inorganic monitoring constituents exhibit spatial
variability, precluding an interwell form of testing.  Since ANOVA is inherently an interwell procedure,
the guidance recommends against its use for these constituents and  conditions.   Spatial variability
implies that the average groundwater concentration levels vary from well to well because of existing on-
site conditions.   Mean differences of this  sort can be identified by ANOVA,  but the cause of the
differences cannot. Therefore, results of a statistically significant ANOVA might be falsely attributed as
a regulated unit release to groundwater.

     ANOVA  testing might be  applied to synthetic  organic and trace element constituent data.
However, spatial variation across a site is also likely to occur from  offsite or prior site-related organic
releases.    An  existing contamination  plume  generally  exhibits varying  average  concentrations
longitudinally, as well  as in cross-section  and depth.  For other organic  constituents never detected  at a
site, ANOVA testing would be unnecessary.   Certain trace elements like  barium, arsenic and selenium
do often exhibit some spatial  variability.  Other trace element data generally have low overall detection
rates, which may also preclude ANOVA applications. Overall, very few routine monitoring constituents
are both measurable (i.e., mostly detectable) and sufficiently free of spatial variation to warrant using
ANOVA as a formal detection monitoring test. Other guidance tests better serve this purpose.

     ANOVA  has good power for  detecting real  contamination provided the network is  small to
moderate  in size.   But for  large  monitoring networks, it may  be difficult to  identify single well
contamination.  One explanation is that the ANOVA F-statistic simultaneously combines all compliance
well effects into a single number, so that many other uncontaminated wells with their own variability can
mask the test's ability to detect the contaminated well.  This might occur at larger sites with
multiple waste units, or if only the edge of a plume happens to intersect one or two boundary wells.

     The statistical power of ANOVA depends significantly on having  at least 4 observations per well
available for testing.  Since the measurements must be statistically independent, collection of four well
observations may necessitate  a wait of several months to a few years if the natural groundwater velocity
is low.  In this case, other strategies (e.g., prediction limits) might be considered that allow each new
groundwater measurement to be tested as it is collected and analyzed.

15 Parametric ANOVA assesses differences in means; the non-parametric ANOVA compares median concentration levels.
   Both statistical measures are a kind of average.

     The  one-way ANOVA test in the RCRA regulations is not designed to control the false positive
error rate for multiple constituents.  The rules mandate a minimum false positive error rate (α) of 5% per
test application.  With an overall false positive rate of approximately 5% per constituent, a potentially
very high  SWFPR can result as the number of constituents tested by ANOVA increases and if tests are
conducted more than once per year.

     For  these reasons, the  Unified Guidance does not generally recommend  ANOVA  for formal
detection monitoring. ANOVA might be applicable to a small number of constituents, depending on the
site.  Prediction limit and control chart strategies using retesting are usually more flexible and offer the
ability to  accommodate even  very large monitoring networks, while meeting the false positive and
statistical power targets recommended by the guidance.

       USE OF ANOVA IN  DIAGNOSTIC TESTING

     In contrast, ANOVA is a versatile tool for diagnostic testing, and is frequently used in the guidance
for that purpose.  Parametric or non-parametric one-way versions are the principal means of identifying
prior spatial variability among background monitoring wells (Chapter 13). Improving sample sizes
using intrawell pooled variances also makes use of ANOVA (Chapter 13). Equality of variances among
wells  is evaluated with ANOVA (Chapter 11).  ANOVA is  also applied when determining certain
temporal trends in parallel well sample constituent data (Chapter 14).

     Tests of natural spatial variability can be made by running ANOVA prior to any waste disposal at a
new facility located above an undisturbed aquifer (Gibbons, 1994a).  If ANOVA identifies significant
upgradient and downgradient well differences when wastes have not yet been managed on-site, natural
spatial variability is the likely cause.  Prior on-site contamination might also be revealed in the form of
significant ANOVA differences.

     Sites with multiple upgradient background wells  can initially conduct an ANOVA on historical
data from just these locations.  Where  upgradient wells are  not significantly different for a  given
constituent, ANOVA testing can be extended to existing historical compliance well data for evaluating
potential additions to the upgradient background data base.

     If intrawell tests are chosen because of natural spatial variation, the results of a one-way ANOVA
on background data from multiple wells can sometimes  be used to  improve intrawell background  limits
(Section 13.3). Though the amount of intrawell background  at  any given well may be small, the
ANOVA provides an estimate of the root mean  squared error [RMSE], which is very close  to  an
estimate of the average per-well standard deviation. By substituting the  RMSE for the usual  well-
specific standard deviation (s), a  more powerful and accurate intrawell limit can be constructed, at least
at those sites where intrawell background across the group of wells can be normalized and the variances
approximately equalized using a common transformation.
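
     The sketch below (hypothetical data) shows how the RMSE from a one-way ANOVA across wells
can serve as a pooled per-well standard deviation, under the normalization and equal-variance conditions
noted above.

```python
# One-way ANOVA across wells, with the root mean squared error (RMSE)
# computed as the pooled within-well standard deviation.
import numpy as np
from scipy import stats

wells = [np.array([5.1, 4.8, 5.6, 5.0, 4.7, 5.3]),      # well 1
         np.array([7.2, 6.9, 7.8, 7.1, 7.5, 6.8]),      # well 2
         np.array([4.1, 4.6, 3.9, 4.4, 4.2, 4.8])]      # well 3

f_stat, p_val = stats.f_oneway(*wells)                  # tests for well-to-well mean differences

n_total = sum(w.size for w in wells)
k = len(wells)
ss_within = sum(((w - w.mean()) ** 2).sum() for w in wells)
rmse = np.sqrt(ss_within / (n_total - k))               # df = N - k

print(f"F = {f_stat:.2f}, p = {p_val:.4f}; pooled RMSE = {rmse:.3f} with df = {n_total - k}")
```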

     Although the Unified Guidance primarily makes use of one-way ANOVA, many kinds of ANOVA
exist. The one-way ANOVA applications discussed so far (in formal detection monitoring or to assess
well mean differences) utilize data from spatial locations as the factor of interest.  In some situations,
correlated behavior may exist for a constituent among well samples evaluated in different temporal
events.  A constituent measured in a group of wells may simultaneously rise or fall in different time
periods. Under these conditions, the data are no longer random and independent.  ANOVA can be used
to assess the significance of such systematic changes, making time the factor of interest. Time can also
play a role if the sample data exhibit cyclical seasonal patterns or if parallel upward or downward trends
are observed both in background and compliance point wells.

      If time is an important second factor, a two-way ANOVA is probably appropriate.  This procedure
is discussed in Davis (1994). Such a method can be used to test for and adjust data for seasonality,
parallel trends, or changes in lab performance that cause temporal (i.e., time-related) effects. It is
somewhat more complicated to apply than a one-way test. The main advantage of a two-way ANOVA is
to separate components of overall data variation into three sources: well-to-well mean-level differences,
temporal effects, and random  variation  or statistical  error.  Distinguishing the sources of variation
provides a more powerful test of whether significant well-to-well differences actually exist compared to
using only a one-way procedure.

     A significant temporal factor does not necessarily mean that the one-way ANOVA will not identify
actual well-to-well spatial differences. It  merely does not have as strong a  chance of doing so. Rarely
will  the one-way ANOVA identify non-existent well-to-well differences. One situation where this can
occur is when there is a  strong statistical  interaction between the well-to-well factor and the time factor
in the two-way ANOVA.  This would imply that changes in  lab performance or seasonal cycles affect
certain wells (e.g., compliance point) to a different degree or in a different manner than other wells (e.g.,
background). If this  is  the  case, professional consultation is recommended before  conducting  more
definitive statistical analyses.

6.4.3 TREND  TESTS

     Most formal detection monitoring tests in the guidance compare background and compliance point
populations under the key assumption that the populations are stationary over time. The distributions in
each group or  well  are assumed to be  stable during the period of monitoring,  with only random
fluctuations around a constant mean level.  If a significant trend occurs in the background data, these
tests cannot be directly  used.  Trends can  occur for several reasons including natural cycles, gradual
changes in aquifer parameters or the effects of contaminant migration from off-site sources.

     Although not specifically provided for in  the RCRA regulations, the guidance necessarily includes
a number of tests for evaluating potential trends.  Chapter 17, Section 17.3 covers three basic trend
tests: (1) linear regression is a parametric method requiring normal and independent trend residuals,
and can be used both to identify a linear trend and estimate its magnitude; (2) for non-normal data
(including sample data with left-censored measurements), the Mann-Kendall test offers a non-parametric
method for identifying trends; and (3) to gauge trend magnitude with non-normal data, the Theil-Sen
trend line can be used.
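
     The following sketch applies the three approaches to a short hypothetical record; note that the
Mann-Kendall test is implemented here through Kendall's tau against the time index, an equivalent
formulation rather than the guidance's own algorithm.

```python
# Linear regression, a Mann-Kendall-style trend test (Kendall's tau vs. time),
# and the Theil-Sen slope for a hypothetical quarterly record.
import numpy as np
from scipy import stats

time = np.arange(12)
conc = np.array([3.1, 3.4, 3.2, 3.8, 3.6, 4.0, 4.1, 3.9, 4.4, 4.3, 4.7, 4.6])

lr = stats.linregress(time, conc)                        # parametric slope and p-value
tau, mk_p = stats.kendalltau(time, conc)                 # non-parametric trend test
slope, intercept, lo, hi = stats.theilslopes(conc, time, 0.95)

print(f"Regression slope {lr.slope:.3f} (p = {lr.pvalue:.4f})")
print(f"Kendall tau {tau:.2f} (p = {mk_p:.4f}); Theil-Sen slope {slope:.3f} [{lo:.3f}, {hi:.3f}]")
```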

     Trend analyses  are primarily diagnostic tests, which should be applied to background data prior to
implementing formal detection  monitoring tests.  If a significant trend is uncovered, two options may
apply.   The particular monitoring constituent may be dropped in favor of alternate  constituents not
exhibiting non-stationary behavior.  Alternatively, prediction limit or control chart testing can make use
of stationary trend residuals for testing purposes. One limitation of the latter approach is that it assumes
the historical trend will continue into future monitoring periods.  In addition, future
data need to be de-trended prior to testing.  If a trend happened to be of limited duration, this
assumption may not be reasonable and could result in identifying a background exceedance when it does
not exist.  If a trend occurs in future data at a compliance well and prior background data were stationary,
other detection monitoring tests are likely to eventually identify it. Trend testing may also be applied to
newly collected data considered for a periodic background update, although the guidance primarily relies on
t-testing of historical and future groups to assess data suitability.

       At historically contaminated compliance wells, establishing a proper baseline for a prediction
limit or control chart is problematic,  since uncontaminated concentration  data cannot be collected.
Depending on the pattern of contamination, an intrawell  background may either have a stable  mean
concentration level or exhibit an increasing or decreasing trend. Particularly when intrawell background
concentrations are rising, the assumption of a static baseline population required by prediction limits and
control charts will be violated.

     As  an alternative, the Unified Guidance recommends a test for trend to measure the extent and
nature of the apparent increase. Trend testing can determine if there is a  statistically significant positive
trend over the period of monitoring and can also determine the magnitude (i.e., slope) of the trend. In
identifying a positive trend, it might be possible to  demonstrate that the level of contamination has
increased relative to historical behavior  and indicate how rapidly levels are increasing.

     Trend  analyses can  be  used  directly as  an alternative test against a GWPS in compliance and
corrective action monitoring.  For typical compliance monitoring, data collected at each compliance well
are used to generate a lower confidence limit compared to the fixed standard (Chapters 7, 21 and 22).
A similar situation occurs when corrective action is triggered, but making use  of an upper confidence
interval for comparison.  For compliance well  data  containing a trend, the  appropriate  confidence
interval is constructed around a linear regression trend line (or its non-parametric alternative) in order to
better estimate the most current concentration levels.   Instead of a single confidence limit for stationary
tests, the confidence limit (or band) estimate changes with time.

6.4.4 STATISTICAL INTERVALS

     Prediction limits, tolerance limits, control chart limits and confidence limits belong to the class of
methods  known as  statistical intervals.  The first three are used  to define their respective detection
monitoring test limits, while  the last is used in fixed standard compliance and corrective action tests.
When using a background GWPS, either approach is possible (see Section 7.5).  Intervals are generated
as a statistic from reference sample data, and represent a probable range of occurrence either for a future
sample statistic or some parameter of the population (in the case of confidence intervals) from which the
sample was drawn.   A future sample statistic might be one or more single values, as well as a future
mean or median of specific size, drawn from one or more  sample sets to be compared with the interval
(generally an upper limit).   Both the  reference  and  comparison  sample populations are  themselves
unknown, with the latter initially presumed to be identical to the reference  set population.  In the
groundwater monitoring context, the initial reference sample is the background data set.

      The key difference in confidence limits16 is that a statistical interval based on a single sample is
used to estimate the probable range of a population parameter like the true mean, median or variance.
The three detection monitoring tests use intervals to identify ranges of future sample statistics likely to
arise from the background population based on the initial sample, and are hence two- or multiple-sample
tests.

     Statistical intervals are  inherently two-sided,  since they represent a finite range in which the
desired  statistic or population parameter is expected to occur.  Formally, an interval is associated with a
level of confidence (1-α); by construction, the error rate α represents the remaining likelihood that the
interval does not contain the appropriate statistic or parameter. In a two-sided interval, the α-probability
is associated with ranges both above and below the statistical interval.  A one-sided upper interval is
designed to contain the desired statistic or parameter at the same (1-α) level of confidence, but the
remaining  error represents only the range above  the limit.   As a general  rule, detection monitoring
options discussed below use one-sided upper limits because of the nature of the test hypotheses.

     PREDICTION LIMITS

     Upper prediction limits (or intervals) are constructed to contain, with (1-α) probability, the next few
sample  value(s) or sample statistic(s) such as a mean from a background population.  Prediction limits
are exceptionally versatile, since they can be designed to accommodate a wide variety of potential site
monitoring  conditions.   They have  been  extensively  researched,  and  provide a  straightforward
interpretation of the test results.  Since this guidance strongly encourages use of a comprehensive design
strategy to account for both the cumulative SWFPR and effective power to identify real exceedances,
prediction  limit options offer a most  effective  means of accounting  for both criteria.  The guidance
provides test options in the form  of parametric normal  and non-parametric prediction limit methods.
Since a retesting strategy of some form is usually necessary to meet both criteria, prediction limit options
are constructed to formally include resampling as part of the overall tests.

     Chapters 18 and 19 provide nine parametric  normal prediction limit test options:   four tests of
future values (1-of-2, 1-of-3, 1-of-4 or a modified California plan) and five future mean options (1-of-1,
1-of-2, or 1-of-3 tests of mean size 2, and 1-of-1 or 1-of-2 tests of mean size 3).  Non-parametric
prediction limit options cover the same future value test options as the parametric versions, as well as
two median tests of size 3 (1-of-1 or 1-of-2 tests).  Appendix D tables provide the relevant K-factors for
each parametric normal test option, the achievable false positive rates for non-parametric tests, and a
categorical rating of relative test power for each set of input conditions. Prediction limits can be  used
both for interwell  and intrawell testing. Selecting from among these  options should allow the two site
design criteria to be addressed for most groundwater site conditions.

     The options provided in the guidance are based on a wider class  known in the statistical literature
as p-of-m prediction limit tests.  Except for the two modified California plan options, those selected are
1-of-m test varieties.  The number of future measurements to be predicted (i.e., contained) by the interval
is  also denoted in the Unified Guidance by m and can be as  small as m =  1.  To test for a release to
groundwater, compliance well measurements are  designated  as future observations.  Then a limit is
constructed on the background  sample, with the prediction limit formula based on the number  of m
future values or statistics to be tested against the limit.  As long as the compliance point measurements
are similar to background, the prediction limit should contain all m of the future values or statistics with
high probability (the level of confidence).  For a 1-of-m test, all m values must be larger than the
prediction limit to be declared an exceedance, as initial evidence that compliance point concentrations
are higher than background.

16 Confidence limits are further discussed in Chapters 7, 21 and 22 for use in compliance and corrective action testing.

     Prediction limits  with retesting are presented in Chapter  19.   When retesting is part of the
procedure, there are significant and instructive differences in statistical performance between parametric
and non-parametric prediction limits.

     Parametric prediction limits are constructed using the general formula: PL = x̄ + K·s, where x̄
and s are the background sample mean and standard deviation, and K is the specific multiplicative factor
for the type of test, background sample size, and the number of annual tests.  The number of tests made
against a common background is also an input factor for interwell comparison.  The  Appendix D K-
factors are specifically designed to meet the SWFPR objective, but power will vary. Larger background
sample sizes and higher order (m) tests afford greater power.
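
     A minimal sketch of this calculation is shown below; the background data and the K-factor are
purely illustrative stand-ins for values that would actually come from site data and the Appendix D
tables.

```python
# Parametric prediction limit PL = x_bar + K*s, with K assumed to have been
# looked up for the chosen retesting plan; all numbers are hypothetical.
import numpy as np

background = np.array([12.1, 11.4, 13.0, 12.6, 11.9, 12.8, 13.3, 12.2])
K = 2.41                                  # illustrative table value, not a real lookup

x_bar = background.mean()
s = background.std(ddof=1)                # sample standard deviation
pl = x_bar + K * s

new_values = np.array([13.1, 14.9])       # future compliance measurements
print(f"PL = {pl:.2f}; exceedances: {new_values[new_values > pl]}")
```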

     When  background data cannot be normalized, a non-parametric prediction limit can be used
instead. A non-parametric prediction limit test makes use of one or another of the largest sample values
from the background data set as the limit. For a given background sample size and test type, the level of
confidence of that maximal value is fixed.

     Using the absolute maximum of a background data set affords the highest confidence and lowest
single-test false positive error. However, even this confidence level may not be adequate to meet the
SWFPR objective, especially for lower order 1-of-m tests. A higher order future values test using the
same maximum and background sample size will provide greater confidence and hence a
lower false positive error rate. For a fixed background sample size, a 1-of-4 retesting scheme will have a
lower achievable significance level (α) than a 1-of-3 or 1-of-2 plan for any specific maximal value.  A
larger background sample  size using a fixed maximal value for any test also has a  higher confidence
level (lower α) than a smaller sample.
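
     For a single well and single constituent with continuous, independent data (no non-detect ties), the
per-test false positive rate of a 1-of-m plan set at the background maximum reduces to 1/C(n+m, m); the
sketch below uses that simplification only to illustrate the direction of the trade-off, since the Appendix D
tables additionally account for multiple wells tested against a common background.

```python
# Per-test false positive rate for 1-of-m plans using the background maximum,
# single-well, single-constituent simplification: alpha = 1 / C(n+m, m).
from math import comb

n = 24                                     # hypothetical background sample size
for m in (1, 2, 3, 4):
    alpha = 1 / comb(n + m, m)
    print(f"1-of-{m} at the background maximum: per-test alpha = {alpha:.5f}")
```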

     But for a fixed non-parametric limit of a given background sample size, the power decreases as the
test order increases. If the non-parametric prediction limit is set at the maximum, a 1-of-2 plan will be
more powerful than a 1-of-4 plan.  It is relatively easy to understand why this is the case. A verified
exceedance in a 1-of-2 test occurs only if two values exceed the limit, but would require four to exceed
for the 1-of-4 plan.  As a rule, even the highest order non-parametric test using some maximal
background value will be powerful enough to meet the ERPC power criteria, but achieving a sufficiently
low single-test error rate to meet the SWFPR is more problematic.

     If the SWFPR objective can be attained at a maximum value for higher order 1-of-m tests, it may
be possible to utilize lower maxima from a large background data base.  Lower maxima will have greater
power and a somewhat higher false positive rate.  Limited comparisons of this type can be made when
choosing between the largest or  second-largest order statistics in the  Unified  Guidance Appendix D
Tables 19-19 to 19-24.  A more useful and flexible comparison for 1-of-m future value plans can be
obtained using the EPA Region 8 Optimal Rank Values Calculator discussed in Chapter 19.  The
calculator identifies the lowest ranked maximal value of a background data set for 1-of-1 to 1-of-4 future
value non-parametric tests which can meet the SWFPR objective, while providing ERPC ratings and
fractional power estimates at 2, 3, and 4 standard deviations above background.

       TOLERANCE INTERVALS

     Tolerance intervals  are  presented  in Section  17.2.  A  tolerance interval is generated  from
background sample data to contain a pre-specified proportion of the underlying population (e.g., 99% of
all possible population measurements) at a certain level of confidence. Measurements falling outside the
tolerance interval can be judged to be statistically different from background.

     While tolerance intervals are  an acceptable  statistical technique under RCRA as  discussed in
Section 2.3, the Unified Guidance generally recommends prediction limits instead. Both  methods can
be used to  compare compliance point measurements to background in detection monitoring. The  same
general formula is used in both tests for constructing a parametric upper limit of comparison: x̄ + Ks.
For non-parametric upper limit tests, both prediction limits and tolerance intervals use an observed  order
statistic in  background  (often  the background maximum). But prediction limits  are ultimately  more
flexible and easier to interpret than tolerance intervals.

     Consider a parametric upper prediction limit test for the next two compliance point measurements
with 95% confidence. If either measurement exceeds the limit, one of two conditions is true: either the
compliance point distribution is significantly different and higher than background, or a false positive
has been observed and the two distributions are similar. False positives in this case are expected to occur
5% of the time. Using an upper tolerance interval is not so straightforward. The tolerance interval has an
extra statistical parameter that must be specified — the coverage (γ) — representing the fraction of
background to be contained beneath the upper limit. Since the confidence level (1-α) governs how often
a statistical interval contains its target population parameter (Section 7.4), the complement α does not
necessarily represent the false positive rate in this case.

     In fact, a tolerance interval constructed  with 95% confidence to cover 80%  of background is
designed so that as many as 20% of all background measurements will exceed the limit with  95%
probability. Here, α = 5% represents the probability that the true coverage will be less than 80%. But less
clear is the false positive rate  of a tolerance interval  test in which as many  as 1 in 5 background
measurements are expected to exceed the upper background limit.  Are compliance point values above
the tolerance interval indicative of contaminated groundwater  or merely representative of the upper
ranges of background?

     Besides a more confusing interpretation, there is an added concern.  Mathematically valid retesting
strategies can be computed for prediction limits, but not yet for tolerance intervals,  further limiting their
usefulness in groundwater testing. It is also  difficult to construct powerful intrawell tolerance intervals,
especially when the intrawell background sample size is small. Overall, there is little practical  need for
two similar (but not identical) methods in the Unified Guidance, at least in detection monitoring.

     If tolerance intervals are employed as an alternative to t-tests or ANOVA when performing
interwell tests, the RCRA regulations allow substantial flexibility in the choice of α.  This means that a
somewhat arbitrarily high confidence level (1-α) can be specified when constructing a tolerance interval.
However, unless the coverage coefficient (γ) is also set to a high value (e.g., > 95%), the test is likely to
incur a large risk of false positives despite a small α.


     One setting in which an upper tolerance interval is very appropriate is discussed in Section 7.5.
Some constituents  that must be evaluated under compliance/assessment or corrective action may not
have a fixed GWPS.  Existing background levels may also  exceed a fixed GWPS. In these cases, a
background standard can be constructed  using an upper tolerance interval on background with 95%
confidence and 95% coverage. The standard will then represent a reasonable upper bound on background
and an achievable target for compliance and remediation testing.
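
     As one hedged illustration, the sketch below computes a one-sided upper tolerance limit with 95%
confidence and 95% coverage using the standard noncentral-t expression for the normal tolerance factor;
the background data are hypothetical and the approach assumes the data are normal (or can be normalized).

```python
# One-sided upper tolerance limit x_bar + k*s for a normal sample, with
# k from the noncentral-t formulation of the tolerance factor.
import numpy as np
from scipy.stats import nct, norm

background = np.array([8.3, 9.1, 7.8, 8.9, 9.4, 8.1, 8.6, 9.0, 7.9, 8.8])
n = background.size
coverage, confidence = 0.95, 0.95

# k = t'_{confidence, n-1}(delta) / sqrt(n), with noncentrality delta = z_coverage * sqrt(n)
k = nct.ppf(confidence, df=n - 1, nc=norm.ppf(coverage) * np.sqrt(n)) / np.sqrt(n)
upper_tl = background.mean() + k * background.std(ddof=1)
print(f"k = {k:.2f}; upper tolerance limit = {upper_tl:.2f}")
```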

6.4.5 CONTROL CHARTS

     Control charts (Chapter 20) are a viable alternative to  prediction limits in detection monitoring.
One advantage of a control chart over a prediction limit is that control  charts allow compliance point
data to be viewed and assessed graphically over time. Trends and changes in concentration levels can be
easily seen, because the compliance measurements are consecutively plotted on the chart as they are
collected, giving the data analyst an historical overview of the concentration pattern. Standard prediction
limits allow only point-in-time comparisons between the most recent data and background, making long-
term trends more difficult to identify.

     The guidance recommends use of the combined Shewhart-CUSUM control chart.  The advantage
is   that  two  statistical quantities  are assessed  at  every  sampling event,  both the new individual
measurement and the cumulative sum [CUSUM] of past and current measurements. Prediction limits do
not incorporate a CUSUM, and this can give control charts comparatively greater sensitivity to gradual
(upward) trends  and shifts in concentration  levels.  To enhance  false positive error rate control and
power, retesting can also be incorporated into the Shewhart-CUSUM control chart.  Following the same
restrictions as for prediction limits, they may be applied either to interwell or intrawell testing.
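
     The mechanics of the combined chart can be sketched as below; the background statistics, the
CUSUM reference value k, and the control limits h and SCL are illustrative assumptions here, since in
practice the limits would be set (or simulated) to meet the site's false positive and power goals.

```python
# Combined Shewhart-CUSUM control chart: each new measurement is standardized
# against background (Shewhart part) and accumulated into a CUSUM of positive
# deviations; either statistic crossing its limit flags a potential change.
import numpy as np

bg_mean, bg_sd = 10.0, 1.5                 # from the background sample (hypothetical)
k, h, scl = 1.0, 5.0, 4.5                  # illustrative chart parameters

compliance = np.array([10.4, 9.8, 11.2, 12.4, 13.3, 14.5, 15.8])

cusum = 0.0
for i, x in enumerate(compliance, start=1):
    z = (x - bg_mean) / bg_sd              # standardized individual measurement
    cusum = max(0.0, z - k + cusum)        # cumulative sum of positive deviations
    flag = "EXCEEDANCE" if (z > scl or cusum > h) else "in control"
    print(f"event {i}: z = {z:.2f}, CUSUM = {cusum:.2f} -> {flag}")
```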

     A disadvantage in applying control charts to groundwater monitoring data is that less is understood
about their statistical performance, i.e., false positive rates and power. The control limit used to identify
potential  releases to groundwater is not based  on a formula incorporating a desired false positive rate (a).
Unlike prediction limits, the control limit cannot be precisely set to meet a pre-specified SWFPR, unless
the behavior of the control chart is modeled via Monte Carlo simulation. The same is true for assessing
statistical power. Control charts usually provide less flexibility than prediction limits in designing a
statistical monitoring program for a network.

     In addition, Shewhart-CUSUM control charts are a parametric procedure with no existing non-
parametric counterpart.  Non-parametric  prediction  limit tests are  still generally needed when the
background data on which the control chart is constructed cannot be normalized. Control charts are
mostly appropriate  for analytes with a reasonably high detection frequency in monitoring wells.  These
include inorganic constituents (e.g., detectable trace elements and geochemical monitoring parameters)
occurring naturally  in groundwater, and other persistently-found, site-specific chemicals.

6.5  SITE DESIGN  EXAMPLES

     Three  hypothetical design examples consider a  small, medium and large facility, illustrating the
principles discussed in this chapter.  In each example, the goal is to determine what statistical method or
methods  should be chosen and how those  methods can be implemented in light of the two fundamental
design criteria.  Further design details are covered in respective Part III detection monitoring test
chapters, although very detailed site design is beyond the  scope of the guidance.  More detailed
evaluations and examples of diagnostic tests are found in Part II of the guidance.

        EXAMPLE 6-5  SMALL FACILITY

     A municipal landfill has  3 upgradient wells and  8  downgradient wells. Semi-annual statistical
evaluations are required for five inorganic constituents.  So far, six observations have been collected at
each  well.  Exploratory  analysis  has  shown  that  the concentration  measurements appear  to be
approximately  normal  in distribution.  However, each of the five monitored  parameters exhibits
significant  levels of natural spatial variation from well to well.  What statistical approach should be
recommended at this landfill?

       SOLUTION
     Since the inorganic monitoring parameters are measurable and have significant spatial variability,
it is recommended that parametric intrawell rather than interwell tests be considered. Assuming
that none of the downgradient wells is recently contaminated, each well has n = 6 observations available
for its respective intrawell background.  Six background measurements may or may not be enough for a
sufficiently powerful test.

     To address the potential problem  of inadequate power, a one-way ANOVA should be run on the
combined set of wells  (including background locations). If the well-to-well variances are significantly
different, individual standard deviation  estimates should be made from the six observations at the eight
downgradient wells. If the variances are approximately equal, a pooled standard deviation estimate can
instead be  computed from the ANOVA table. With  11 total wells and 6 measurements per well, the
pooled standard deviation has df = 11 × 5 = 55 degrees of freedom, instead of df = 5 for each individual
well.

     Regardless of ANOVA results, the per-test false positive rate is approximately the design SWFPR
divided by the annual number of tests. For w = 8 compliance wells, c = 5 parameters monitored, and nE
= 2 statistical evaluations per year, the per-test false positive rate is approximately αtest =
SWFPR/(w × c × nE) = 0.00125.  Given normal distribution data, several different parametric prediction
limit retesting plans can be examined,17 using either the combined sample size of df + 1 = 56 or the per-
well sample size of n = 6.
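
     The arithmetic behind this per-test rate, assuming the guidance's design SWFPR target of 10% per
year, is simply:

```python
# Per-test false positive rate for Example 6-5 (assumed SWFPR target of 0.10 per year)
w, c, n_E = 8, 5, 2                 # compliance wells, constituents, evaluations per year
swfpr = 0.10
alpha_test = swfpr / (w * c * n_E)
print(alpha_test)                   # 0.00125
```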

     Explained in greater detail in Chapter 19, K-multiples and power ratings for each test type (using
the inputs w = 8 and n = 6 or 56) are obtained from the nine parametric Appendix D Intrawell tables
labeled '5 COC, Semi-Annual'.  The following K-factors were obtained for tests of future values at n = 6:
K = 3.46 (1-of-2 test); K = 2.41 (1-of-3); K = 1.81 (1-of-4); and K = 2.97 (modified California) plans. For
future means, the corresponding K-factors were:  K = 4.46 (1-of-1 mean size 2); K = 2.78 (1-of-2 mean
size 2); K = 2.06 (1-of-3 mean size 2); K = 3.85 (1-of-1 mean size 3); and K = 2.51 (1-of-2 mean size 3).
In these tables, K-factors reported in Bold have good power, those Italicized have acceptable power, and
Plain Text indicates low power. For single well intrawell tests, only 1-of-3 or 1-of-4 plans for future
values, 1-of-2 or 1-of-3 mean size 2, or 1-of-2 mean size 3 plans meet the ERPC criteria.
17 Intrawell control charts with retesting are also an option, though the control limits associated with each retesting scheme
  need to be simulated.


     Although each of these retesting plans is adequately powerful, a final choice would be made by
balancing 1) the cost of sampling and chemical analysis at the site; 2) the ability to collect statistically
independent samples should the sampling frequency be increased; and 3) a comparison of the actual
power curves of the three plans.  The last can be used to assess how differences in power might impact
the rapid identification of a groundwater release.  Since a 1-of-3 test for future observations has good
power, it is unnecessary to make use of a 1-of-4 test.  Similarly, the 1-of-3 test for mean size 2 and a 1-
of-2 test for mean size 3 might also be eliminated, since a 1-of-2 test of a mean size 2 is more than
adequate.  This leaves the 1-of-3 future values and 1-of-2 mean size 2 tests as the final prediction limit
options to consider.

     Though prediction limits around future means are more powerful than plans for observations, only
3 independent measurements might be required for a 1-of-3 test, while 4 might be necessary for the 1-of-
2 test for mean size 2.  For most tests at background, a single sample might suffice for the 1-of-3 test and
2 independent samples for the test using a 1-of-2 mean size 2.

     Much greater flexibility is afforded if the pooled intrawell standard deviation estimate can be used.
For this example, any of the nine parametric intrawell retesting plans is sufficiently powerful, including a
1-of-2 prediction limit test on observations and a 1-of-1 test of mean size 2.  In order to make this
assessment using the pooled-variance approach, a careful reading of Chapter 13, Section 13.3, is
necessary to generate comparative K-factors.

     Less overall sampling is needed with the 1-of-2 plan on observations, since only a single sample
may be needed for most background conditions.  Two observations are always required for the 1-of-1
mean size 2 test. More prediction limit testing options are generally available for a small facility. ◄

        EXAMPLE 6-6  MEDIUM FACILITY

     A  medium-sized hazardous waste  facility has 4 upgradient background wells and 20 downgradient
compliance wells. Ten  initial measurements have been  collected at each upgradient well and 8 at
downgradient wells. The permitted monitoring list includes  10 inorganic parameters and 30 VOCs.  No
VOCs have yet been detected in groundwater. The remaining 10 inorganic constituents are normal or can
be normalized, and five show evidence of significant spatial variation across  the  site.  Assume that
pooled-variances cannot be obtained  from the historical upgradient or downgradient well data.  If one
statistical  evaluation  must  be  conducted each  year,  what  statistical  method  and  approach  are
recommended?

       SOLUTION
     At this site, there are potentially 800 distinct  well-constituent pairs that might be tested.  But since
none of the VOCs has been detected in  groundwater in background wells, all 30 of the VOCs should be
handled using the double quantification  rule (Section 6.2.2).  A second confirmatory resample should be
analyzed at those compliance wells for any of  the 30  VOC constituents initially  detected.  Two
successive quantified  detections above the RL are considered  significant evidence of groundwater
contamination at that well and VOC constituent. To properly limit the SWFPR, the 30 VOC constituents
are excluded from further SWFPR calculations, which are now based on w × c × nE = 20 × 10 × 1 = 200
annual tests.

     The five inorganic constituent background data sets indicate insignificant spatial variation and can
be normalized.  The observations from the four upgradient wells can be pooled to form background data
sets with an n = 40 for each of these five constituents. Future samples from the 20 compliance wells are
then compared against the respective interwell background  data.  With one annual evaluation, c =  10
constituents, w = 20 wells and n = 40 background samples, the Interwell '10 COC, Annual' tables for
parametric prediction limits with retesting can be searched in Appendix D.  Alternatively, control chart
limits can be fit to this configuration via Monte Carlo simulations.  Even though only five constituents
will be tested this way, all of the legitimate constituents (c) affecting the SWFPR calculation are used in
applying the tables.

     Most  of the interwell prediction limit retesting plans, whether for observations or means, offer
good power relative to the annual evaluation ERPC.  The final choice of a plan may be resolved by a
consideration of sampling effort and cost, as well as perhaps a more detailed power comparison using
simulated curves.   For prediction limits, a l-of-2 test for observations (K = 2.18) and the 1-of-l
prediction limit for a mean of order 2 (K = 2.56) both offer good power.  These two plans also require the
least amount of sampling to identify a potential  release (as discussed in Example 6-6).  Beyond this
rationale, the more powerful 1-of-l test of a future mean size 2 might be selected.  Full power curves
could be constructed and overlaid for several competing plans.

     The remaining 5 inorganic constituents  must be managed using intrawell methods based on
individual compliance well sizes of n = 8.  For the same c, w, and nE inputs as above, the Appendix D
Intrawell '10 COC, Annual' tables should be used.  Only four of the higher order prediction limit tests
have acceptable or good power:  1-of-4 future values (K = 1.84); 1-of-2 mean size 2 (K = 2.68); 1-of-3
mean size 2 (K = 2.00); and 1-of-2 mean size 3 (K = 2.39) tests.  The 1-of-2 mean size 2 has only
acceptable power.  The first two tests require the fewest samples under most background conditions and
in total, with the 1-of-4 test having superior power. ◄

        EXAMPLE 6-7  LARGE FACILITY

     A larger solid waste facility must conduct two statistical evaluations per year at two background
wells and 30 compliance wells.  Parameters on the monitoring list include five trace metals with a high
percentage  of non-detect measurements, and five other inorganic  constituents.  While the inorganic
parameters  are either normal or can be normalized, a significant degree of spatial variation is present
from one well to the next.  If 12 observations  were collected from  each background well, but only 4
quarterly measurements from each compliance well, what statistical approach is recommended?

       SOLUTION
     Because the two groups of constituents evidence distinctly different statistical characteristics, each
needs to be separately considered.  Since the  trace metals have occasional  detections or  'hits,' they
cannot be excluded from the  SWFPR computation. Because of their high non-detect rates, parametric
prediction limits or control charts may not be appropriate or valid unless a non-detect adjustment such as
Kaplan-Meier or robust regression on order statistics is used (Chapter 15). Assuming for this example
that parametric tests cannot be applied, the trace  metals  should  be  analyzed using non-parametric
prediction limits. The presence  of frequent non-detects may substantially limit  the potential degree of
spatial  variation, making an interwell non-parametric test potentially feasible. The Kruskal-Wallis non-
parametric ANOVA (Chapter 17) could be used to test this assumption.


     In this case, the number of background measurements is n = 24, and this value along with w = 30
compliance wells would be used to examine possible non-parametric retesting plans in the Appendix D
tables  for  non-parametric prediction limits. As these tables  offer achievable per-evaluation, per-
constituent false positive rates for each configuration of compliance wells and background levels, the
target α level must be determined. Given semi-annual evaluations, the per-evaluation false positive rate
is approximately αE = 0.10/nE = 0.05.  Then, with 10 constituents altogether, the approximate per-
constituent false positive rate for each trace metal becomes αconst = 0.05/10 = 0.005.

     Only one retesting plan meets the target false positive rate, a 1-of-4 non-parametric prediction limit
using the maximum value in background  as the comparison  limit.  This plan has 'acceptable' power
relative to the ERPC. Other more powerful plans all have higher-than-targeted false positive rates.

     For the remaining 5 inorganic constituents, the presence of significant spatial variation and the fact
that the observations can be normalized suggest the use of parametric intrawell prediction or control
limits.  As in the previous Example 6-6, intrawell prediction limit tables in Appendix D are used by
identifying K multipliers and power ratings based on all 10 constituents subject to the SWFPR
calculations. This is true even though these parametric options only pertain to 5 constituents.  The total
number of well-constituent pair tests per year is equal to w × c × nE = 30 × 10 × 2 = 600 annual tests.

     Assuming none of the observed spatial variation is due to already contaminated compliance wells,
the number of measurements that can be used as intrawell  background per well is small (n = 4). A quick
scan of the intrawell prediction limit retesting plans in Appendix D '10 COC, Semi-Annual' tables
indicates that none of the plans offer even acceptable power for identifying a potential release. A one-
way ANOVA should be run  on the combined set of w = 30 compliance wells to determine if a pooled
intrawell standard deviation estimate can be used.

     If levels of variance across these wells are  roughly  the same, the pooled standard deviation will
have df = w(n − 1) = 30 × 3 = 90 degrees of freedom, making each intrawell prediction or control limit
much more powerful. Using the R script provided in Appendix C for intrawell prediction limits with a
pooled standard deviation estimate (see Section 13.3), based on n = 4 and df = 90,  all of the relevant
intrawell prediction limits are sufficiently powerful  compared  to the semi-annual ERPC.   With  the
exception of the 1-of-2 future values test at acceptable power, the other tests have good power.  The final
choice of retesting plan can be made by weighing the costs of required sampling versus perhaps a more
detailed comparison of the full power curves.  Plans with lower sampling requirements may be the most
attractive. ◄

                   CHAPTER  7.     STRATEGIES FOR
  COMPLIANCE/ASSESSMENT AND CORRECTIVE ACTION
       7.1   INTRODUCTION	7-1
       7.2   HYPOTHESIS TESTING STRUCTURES	7-3
        7.3   GROUNDWATER PROTECTION STANDARDS	7-6
        7.4   DESIGNING A STATISTICAL PROGRAM	7-9
          7.4.1  False Positives and Statistical Power in Compliance/Assessment	 7-9
          7.4.2  False Positives and Statistical Power in Corrective Action	 7-12
          7.4.3  Recommended Strategies	 7-13
          7.4.4  Accounting for Shifts and Trends	 7-14
          7.4.5  Impact of Sample Variability, Non-Detects, and Non-Normal Data	 7-17
       7.5   COMPARISONS TO BACKGROUND DATA	7-20
     This chapter covers the fundamental design principles for compliance/assessment and corrective
action statistical monitoring programs. One important difference between these programs and detection
monitoring is that a fixed external GWPS is often used in evaluating compliance. These GWPS can be
an MCL, risk-based or background limit as well as a remedial  action goal. Comparisons to a GWPS in
compliance/assessment and corrective action are generally one-sample tests as opposed to the two- or
multi-sample tests in detection monitoring. Depending on the  program design, two- or multiple-sample
detection  monitoring  strategies   can  be  used  with  well  constituents  subject to  background
compliance/corrective action testing.   While a general framework is presented in this chapter, specific
test applications  and strategies are presented  in Chapters 21 and 22 for fixed GWPS comparisons.
Sections 7.1 through 7.4 discuss comparisons to  fixed GWPSs, while Section 7.5 covers background
GWPS testing  (either as a  fixed limit or based on a background statistic). Discussions of regulatory
issues are generally limited to 40 CFR Part 264, although they also apply to corresponding sections of
the 40 CFR Part 258 solid waste rules.

7.1 INTRODUCTION

     The  RCRA regulatory structure for compliance/assessment and corrective action monitoring is
outlined in Chapter 2. In detection and compliance/assessment monitoring phases, a facility is presumed
not to be  'out  of compliance' until significant evidence of an impact or groundwater  release can be
identified. In corrective action monitoring, the presumption  is reversed since contamination of the
groundwater has already been identified and confirmed. The null hypothesis of onsite contamination is
rejected only when there is significant evidence that the clean-up or remediation  strategy has been
successful.

     Compliance/assessment monitoring is generally begun when statistically significant concentration
exceedances above background have been confirmed for one or more detection monitoring constituents.
Corrective action is undertaken when at least one exceedance of a hazardous constituent GWPS  has
been  identified  in compliance/assessment  monitoring.   The  suite   of constituents  subject  to
compliance/assessment monitoring is determined from Part 264 Appendix IX or Part 258 Appendix II
testing, along with prior hazardous constituent data evaluated  under the detection monitoring program.
Following a  compliance monitoring statistical  exceedance, only a few of these constituents may require
the change in hypothesis structure to corrective action monitoring.  This formal corrective action testing
will need to await completion of remedial activities, while continued monitoring can track progress in
meeting standards.

     The same general statistical  method  of confidence interval testing  against  a fixed  GWPS is
recommended in both compliance/assessment and corrective action programs.  As discussed more fully
below  and in Chapter  21, confidence intervals  provide a flexible and statistically accurate method to
test how a parameter estimated from a single sample compares to  a fixed numerical  limit.  Confidence
intervals explicitly account for variation and uncertainty in  the sample data used to construct them.

     Most decisions about  a statistical program under §264.98 detection  monitoring  are tailored to
facility conditions, other than selecting a target site-wide cumulative false positive rate and a scheme for
evaluating power.  Statistical design details are likely to be site-specific, depending on the available data,
observed distributions  and  the  scope  of the monitoring network.   For compliance/assessment and
corrective action testing under §264.99 and §264.100 or similar tests against fixed health-based or risk-
based  standards, the testing regimen is instead  likely to  be  determined in  advance by the regulatory
agency. The Regional Administrator or State Director is charged with defining the nature of the tests,
constituents to be  tested,  and the wells or  compliance points to be  evaluated.  Specific  decisions
concerning false positive rates and power may also need to  be defined at a regulatory program level.

     The  advantage of  a  consistent approach  for compliance/assessment  and  corrective  action
monitoring tests is that  it can be applied across all Regional or State facilities. Facility-specific input is
still needed, including  the observed  distributions of key constituents and  the  selection  of  statistical
power  and false positive criteria for permits. Because of  the asymmetric nature of the  risks  involved,
regulatory  agency and  facility perspectives  may  differ on  which statistical risks  are most critical.
Therefore,  we  recommend  that  the  following  issues be addressed for compliance/assessment  and
corrective action  monitoring (both §264.99 and §264.100),  as well  as  for other programs  involving
comparisons to fixed standards:

     *  What are the appropriate hypothesis testing structures for making comparisons to a fixed
        standard?
     *  What do fixed GWPS represent in statistical terms and which population parameter(s) should be
        tested against them?
     *  What is a desirable frequency of sampling and testing, which test(s), and for what constituents?
     *  What statistical power requirements should be included to ensure protection of health and the
        environment?
     *  What confidence level(s) should be selected to control false positive error rates, especially
        considering sites with multiple wells and/or constituents?
     Decisions regarding these five questions are  complex and interrelated, and have not been fully
addressed by previous RCRA guidance or existing regulations. This chapter addresses each of these
points  for both §264.99 and §264.100 testing. By developing answers at a regulatory program level,  the
necessity of re-evaluating the same questions at each specific site may be avoided.

7.2  HYPOTHESIS TESTING STRUCTURES

     Compliance testing under §264.99 specifically  requires a determination that one or more well
constituents  exceeds   a   permit-specific   GWPS.   The   correct   statistical   hypothesis  during
compliance/assessment monitoring is that groundwater concentrations are presumed not to exceed the
fixed  standard unless sampling data from one or more well constituents indicates  otherwise. The null
hypothesis, H0, assumes that downgradient well concentration levels are less than or equal to a standard,
while the alternative hypothesis, HA, is accepted only if the standard is significantly exceeded.  Formally,
for some parameter (Θ) estimated from sample data and representing a standard G, the relevant
hypotheses under §264.99 compliance monitoring are stated as:

                              H0: Θ ≤ G   vs.   HA: Θ > G                                   [7.1]

      Once a positive determination has  been made that  at least  one compliance well constituent
exceeds the fixed standard (i.e., the GWPS), the facility is subject to corrective action requirements under
§264.100.  At this point,  the regulations imply and statistical  principles  dictate  that the hypothesis
structure should be reversed (for  those compliance  wells  and constituents  indicating exceedances).
Other compliance constituents (i.e., those not exceeding their respective GWPSs)  may continue to be
tested using the equation [7.1] hypotheses.  For the exceeding well constituents, contamination equal to
or in excess of the GWPS is presumed to exist unless demonstrated otherwise.  A positive determination
that  groundwater concentrations  are below the  standard  is  necessary  to  demonstrate regulatory
compliance for any wells and constituents under remediation. In statistical terms, the  relevant hypotheses
for §264.100 are:

                              H0: Θ ≥ G   vs.   HA: Θ < G                                   [7.2]

     Non-RCRA programs seeking to use methods presented in the Unified Guidance may also presume
a different statistical hypothesis structure from that presented here. The primary goal is to ensure that the
statistical approach matches the  appropriate hypothesis framework. It is also allowable under RCRA
regulations to define GWPS based on background data, discussed further in Section 7.5.

     Whatever the population parameter (Θ) selected as representative of the GWPS, testing consists of
a confidence interval derived from the compliance point data at some choice of significance level (α),
and then compared to the standard G.  The confidence interval describes the probable distribution of the
sample statistic employed to estimate the true parameter Θ.  For testing under compliance/assessment
monitoring, a lower confidence limit around the true parameter — LCL(Θ) — is utilized.  If LCL(Θ)
exceeds the standard, there is statistically significant evidence in favor of the alternative hypothesis, HA:
Θ > G, that the compliance standard has been violated.  If not, the confidence limit test is inconclusive
and the null hypothesis is accepted.

     When the corrective action hypothesis of [7.2] is employed, an upper confidence limit UCL(Θ) is
generated from the compliance point data and compared to the standard G.  In this case, the UCL(Θ)
should lie below the standard to accept the alternative hypothesis that concentration levels are in
compliance, HA: Θ < G.
     Consider, as a simple illustration, a GWPS interpreted as a limit on the true mean concentration (μ)
of a single hazardous constituent.  The statistic used to estimate μ is the sample mean (x̄).  With this
statistic and normally-distributed data, the lower and upper confidence limits are symmetric:

                              LCL(1-α) = x̄ - t(1-α, n-1) · s/√n                             [7.3]

                              UCL(1-α) = x̄ + t(1-α, n-1) · s/√n                             [7.4]

for a selected significance level (α) and sample size n.  Note in these formulas that s is the sample
standard deviation, and t(1-α, n-1) is a central Student's t-value with n-1 degrees of freedom.

     The two hypothesis structures and tests are defined as follows:

    Case A. Test of non-compliance (§264.99) vs. a fixed standard (compliance/assessment monitoring):

       Test Hypothesis: H0: μ ≤ G   vs.   HA: μ > G

       Test Statistic: LCL(1-α) = x̄ - t(1-α, n-1) · s/√n

       Rejection Region: Reject the null hypothesis (H0) if LCL(1-α) > G; otherwise, accept the null hypothesis

    Case B. Test of compliance (§264.100) vs. a fixed standard (corrective action):

       Test Hypothesis: H0: μ ≥ G   vs.   HA: μ < G

       Test Statistic: UCL(1-α) = x̄ + t(1-α, n-1) · s/√n

       Rejection Region: Reject the null hypothesis (H0) if UCL(1-α) < G; otherwise, accept the null hypothesis
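
     As a concrete illustration of the two cases, the following sketch computes both confidence limits for
a small normal sample.  The measurements, GWPS, and significance level are hypothetical, and the
sketch is not a substitute for the detailed procedures of Chapters 21 and 22.

    # Case A and Case B confidence limit tests of a normal mean against a fixed GWPS.
    # All numerical inputs below are hypothetical.
    conc  <- c(6.2, 7.8, 5.9, 8.4, 7.1, 6.6)   # compliance-point measurements (ug/L)
    gwps  <- 7.0
    alpha <- 0.05

    n  <- length(conc)
    se <- sd(conc) / sqrt(n)
    t1 <- qt(1 - alpha, df = n - 1)

    lcl <- mean(conc) - t1 * se
    ucl <- mean(conc) + t1 * se

    c(LCL = lcl, UCL = ucl)
    lcl > gwps   # Case A: statistically significant exceedance of the GWPS?
    ucl < gwps   # Case B: demonstrated compliance with the GWPS?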

     For all  confidence  intervals and tests presented in Chapters 21 and 22, the test structures are
similar to those above. But not every pair of lower and upper confidence limits (i.e., LCL and UCL) will
be symmetric, particularly for skewed distributions and in non-parametric tests on upper percentiles. For
a non-parametric technique such as a confidence interval around the median, exact confidence levels will
depend  on the available sample size and which order statistics are  used to estimate  the  desired
population parameter. In these cases, an exact target confidence level may or may not be attainable.

     When calculating confidence intervals, assignment of the false positive error (α) differs between a
one-sided and two-sided confidence interval test.  The symmetric upper and lower confidence intervals
are shown in Figure 7-1 largely for illustration purposes.  If the lower confidence interval for some
tested parameter Θ is the critical limit, all of the α error is assigned to the region below the LCL(Θ).
Hence, a 1-α confidence level covers the range from the lower limit to positive infinity.  Similarly, all of
the α error for an upper confidence limit UCL(Θ) is assigned to the region above this value.
For a two-sided interval, the error rate is equally partitioned on both sides of the respective confidence
interval limits.  A 95% lower confidence limit implies a 5% chance of an error for values lying below
the limit.  In contrast, a two-sided 95% confidence interval implies a 2.5% chance of error above the
upper limit and a 2.5% chance below the lower limit.  Depending on how confidence intervals are
defined, the appropriate statistical adjustment (e.g., the t-value in Equations [7.3] and [7.4]) needs to
take this into account.
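
     A quick numerical check of this adjustment, using an arbitrary sample size of n = 8, is shown below.

    # t-multipliers for Equations [7.3] and [7.4] at a nominal 95% level; n = 8 is arbitrary.
    n <- 8
    qt(0.95,  df = n - 1)   # one-sided 95% confidence limit (alpha = 0.05 in one tail)
    qt(0.975, df = n - 1)   # two-sided 95% confidence interval (alpha split 0.025/0.025)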

       Figure 7-1. Confidence Interval on Mean vs. Fixed Upper Percentile Limit

       [Figure: distribution of population values relative to the GWPS, comparing a
        confidence interval on the 95th percentile with a confidence interval on the mean.]

7.3 GROUNDWATER PROTECTION STANDARDS

     A second essential design step is to identify the appropriate population parameter and its associated
statistical  estimate.   This  is primarily a  determination of what a given fixed GWPS represents in
statistical  terms. Not all fixed  concentration standards are meant  to  represent the  same statistical
quantities.  A distinction is drawn between 1) those central tendency standards designed to represent a
mean or  average concentration level and 2)  those which represent  either  an upper percentile or the
maximum of the concentration distribution. If the fixed standard represents an average concentration, it
is assumed in the Unified Guidance that the mean concentration (or possibly the median concentration)
in groundwater should not exceed the limit. When a fixed standard  represents  an upper percentile or
maximum, no more than a small, specified fraction of the individual concentration measurements should
exceed the limit.

     The  choice of confidence interval should be based on the type of fixed  standard to which the
groundwater data will be compared.  A fixed  limit best representing  an upper percentile concentration
(e.g., the upper 95th percentile) should not be compared to a confidence interval  constructed around the
arithmetic mean. Such an interval only estimates the location of the population mean, but says nothing
about the  specific upper percentile of the concentration distribution. The average concentration level
could be  substantially less  than the standard even though a significant fraction  of the individual
measurements exceeds the standard (see Figure 7-1).

     There are a variety of fixed standards to which different statistical measures apply.  Alternative
GWPSs based on Agency risk-assessment protocols are cited as an option in the solid waste regulations
at §258.55(i)(l).  Many of the risk-assessment procedures identified in the CERCLA program make use
of chronic, long-term exposure models for ingestion or inhalation. These procedures are identified in the
Risk Assessment Guidance for Superfund [RAGS] (EPA, 1989b) and the Supplemental Guidance for
Calculating the Concentration Term (EPA,  1992c), and serve as guidance for other EPA programs.  In
the  latter  document, the arithmetic mean  is  identified as  the appropriate  parameter for identifying
environmental exposure levels. The levels are intended to identify chronic, time-weighted concentration
averages based on lifetime exposure scenarios.

      The primary maximum contaminant levels [MCL] promulgated under the Safe Drinking Water Act
(SDWA) follow the same exposure evaluation principles. An MCL is typically based on 70-year risk-
exposure scenarios (for carcinogenic compounds), assuming an ingestion rate of 2 liters of water per day
at the average concentration over time.  Similarly, long-term risk periods (e.g., 6 years) are used for non-
carcinogenic constituents, assuming average exposure concentrations.  The promulgated levels also
contain a multiplicative safety factor and are applied at the end-user tap.  Calculations for ingestion
exposure risk to soil contaminants by an individual randomly traversing a contaminated site are based on
the  average estimated  soil concentration. It is expected that an exposed individual drinking  the water or
ingesting the soil is  not afforded any protection in the form of prior treatment.

     Other standards  which may represent a  population mean  include some RCRA site permits that
include comparisons against an alternate  concentration limit [ACL] based  on the average value of
background data. In addition,  some standards represent time-weighted averages used for carcinogenic
risk assessments such as the lifetime average daily dose [LADD].

     Fixed limits based explicitly on the median concentration include fish ingestion exposure factors,
used in testing fish tissue  for  certain  contaminants.  The  exposure  factors represent  the  allowable
concentration level  below which at  least half of the fish sample concentrations  should lie, the 50th
percentile of the  observed concentration distribution. If this distribution is  symmetric,  the mean and
median will be identical. For positively skewed populations, the mean concentration could exceed the
exposure factor even though the median (and hence, a majority of the individual concentrations) is below
the  limit. It would therefore not  be appropriate to compare such exposure factors  against a confidence
interval around the mean contaminant level, unless one could be certain the distribution was symmetric.

     Fixed standards  are sometimes based on upper percentiles. Scenarios of this  type include risk-
based standards designed to limit acute effects that result from short-term exposures to certain chemicals
(e.g., chlorine gas leaking from a rail car or tanker).  There is greater interest in possible acute effects or
transient exposures  having a significant short-term risk.  Such exposure events may not happen often, but
can be important to  track for monitoring and/or compliance purposes.

     When even short exposures can result  in deleterious health or environmental effects,  the fixed limit
can be  specified as a maximum allowable concentration.  From a statistical standpoint, the standard
identifies  a level which can  only  be exceeded  a small  fraction  of the time (e.g.,  the  upper 90th
percentile). If a larger than allowable fraction of the individual exposures exceeds the standard, action is
likely warranted, even if the average concentration level is below the standard. Certain MCLs are
interpreted in this same manner; the term 'maximum' in maximum contaminant level would be treated
statistically as an upper percentile limit. Examples include criteria for bacterial counts and nitrate/nitrite
concentrations, best regarded as upper percentile limits.

     As an example, exposure of infants to nitrate concentrations in excess of 10 mg/L (NO3- as N) in
drinking water is  a  case where  greater concern surrounds  acute effects  resulting  from short-term
exposure.  The flora in the intestinal tract of infant humans and animals does not fully develop until the
age of about six months.  This results in a lower acidity in the intestinal tract, which permits the growth
of nitrate  reducing  bacteria.  These  bacteria convert  nitrate  to nitrite.  When absorbed  into  the
bloodstream,  nitrite interferes with the absorption of oxygen.  Suffocation by oxygen starvation in this
manner produces a bluish skin discoloration — a condition known  as "blue  baby" syndrome (or
methemoglobinemia) — which can result in serious health problems, even  death.  In such a scenario,
suppose that acute effects resulting from short-term exposure above some critical level should normally
occur in no more than 10 percent of all exposure events. Then the critical level so identified would be
equivalent to the upper 90th percentile of all exposure events.

     Another example is the so-called 20-year flood recurrence interval for structural design. Flood
walls and  drainage culverts are designed to handle not just the average flood level, but also flood levels
that have a 1  in 20 chance of being equaled or exceeded in any single year. A 20-year flood recurrence
level is  essentially equivalent to estimating the upper 95th percentile of the  distribution of flood levels
(e.g., a flood of this magnitude is expected to occur only 5 times every 100 years).

     The  various limits identified as potential GWPS in Chapter 2 pose some interpretation problems.
§264.94 Table 1 values are identified as "Maximum Concentration[s] of Constituents for Groundwater
Protection" for 14 hazardous constituents,  originating  from  earlier Federal Water Pollution Control
Administration efforts.  While not  a definitive protocol for comparison, it was indicated that the limits
were intended to represent  a concentration level that should not be exceeded most of the time.  In an
early Water Quality Criteria report (USDI, 1968), the  language is as follows:

       "It is  clearly  not  possible to  apply  these  (drinking water) criteria solely as maximum single
       sample values. The criteria  should not be exceeded over substantial portions of time."

        Similarly, the more current MCLs promulgated under the SDWA are identified as "maximum
contaminant levels".  Even if the limits were derived from chronic, risk-based assessments, the same
implication is that these limits should not be exceeded.

       Individual EPA programs make sample data  comparisons to MCLs using  different approaches.
For small-facility systems  monitored  under the  SDWA, only one or  two samples a year might be
collected for comparison.  Anything other than direct comparison is generally not possible.  Some Clean Water Act
programs  use arithmetic comparisons (means or medians) rather than  a  fully statistical  approach.
CERCLA typically utilizes these standards in mean statistical comparisons, consistent with other chronic
health-based levels derived from their program risk assessment equations.  In  short, EPA nationwide
does not have a single operational definition or measure for assessing MCLs with sample data.

       The Unified Guidance cannot directly  resolve these issues.  Since the regulations promulgated
under   RCRA  presume  the use  of fully statistical measures  for groundwater  monitoring program
evaluations, the guidance provides a number of options for both centrality-based and upper limit tests.
It falls upon  State or Regional programs to determine which is  the most appropriate parameter for
comparison to a GWPS.  As indicated above, the guidance does recommend that any operational
definition of the appropriate parameter of comparison to GWPSs be applied uniformly across a program.

       If  a  mean- or  median-based centrality parameter  is chosen,  the guidance  offers fairly
straightforward confidence interval testing options.   For a parameter representing some infrequent level
of exceedance to address the "maximum"  or "most" criteria, the program would  need to identify a
specific upper proportion and confidence level that the GWPS represents.  Perhaps a proportion of 80 to
95% would be appropriate, at 90-95% confidence.  It is presumed that the same standard would apply to
both compliance and corrective action testing under §264.99 and  §264.100. If non-parametric upper
proportion tests must be used for certain data, very high proportions make for especially difficult tests to
determine a return to compliance (Chapter 22) because of the number of samples required.
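
     To illustrate the sample size burden, the minimum number of measurements required for a purely
non-parametric demonstration of compliance can be computed as follows.  This is only a rough sketch;
the proportion and confidence level shown are examples, not recommended values.

    # For a wholly non-parametric demonstration of compliance, all n measurements must fall
    # below the GWPS, with n large enough that p^n <= alpha, where p is the proportion the
    # GWPS is taken to represent.
    p     <- 0.95    # example: GWPS treated as the upper 95th percentile
    alpha <- 0.05    # 95% confidence
    ceiling(log(alpha) / log(p))   # minimum sample size (about 59 in this example)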

7.4  DESIGNING A STATISTICAL PROGRAM

7.4.1 FALSE POSITIVES AND STATISTICAL POWER IN  COMPLIANCE/ASSESSMENT

     As  discussed in Chapters 3 and 6,  the twin criteria in   designing an acceptable detection
monitoring statistical program are the site-wide false positive rate [SWFPR] and the effective power of
the testing regimen. Both statistical measures are  crucial  to  good statistical design, although from a
regulatory perspective, ensuring adequate power to detect  contaminated  groundwater is of primary
importance.

     In compliance/assessment monitoring, statistical power  is also of prime concern to EPA. There
should be a high probability that  the statistical test will  positively identify concentrations that have
exceeded a fixed, regulatory standard. In typical applications  where a confidence interval is compared
against a fixed standard,  a low false positive error rate (a) is chosen without respect to the power of the
test.  Partly this is due to a natural desire to have high statistical confidence in the test, where (1-α)
designates the confidence level of the interval. But statistical confidence is not the same as power. The
confidence level merely indicates how often — in repeated applications — the interval will contain the
true population parameter (Θ); not how often the test will indicate an exceedance of a fixed standard.  It
has historically been much easier to select a single value for the false positive rate (α) than to measure
power, especially since power is not a single number but a function of the level of contamination (as
discussed in Section 3.5).

     The power to detect increases above  a fixed standard using a lower confidence limit can be
negligible when contaminant variability is high, the sample size is small and especially when a high
degree of confidence has been selected. To remedy this problem, the Unified Guidance recommends
reversing the usual sequence: first select a desired level of power for the test (1-β), and then compute the
associated (maximum) false positive rate (α).  In this way, a pre-specified power can be maintained even
if the sample size is too low to simultaneously minimize the risks of both Type I and Type II errors (i.e.,
false positives and false negatives).
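
     A rough sketch of this power-first calculation for the normal-mean LCL test is shown below.  The
function name, sample size, effect size (expressed in units of the population standard deviation), and
target power are all hypothetical inputs; Chapter 22 presents the recommended procedures.

    # Back-calculate the individual test alpha that yields a target power for an LCL test
    # of a normal mean against a fixed standard G. All inputs are hypothetical.
    lcl_alpha_for_power <- function(n, effect_sd, target_power) {
      df  <- n - 1
      ncp <- effect_sd * sqrt(n)      # noncentrality when true mean = G + effect_sd * sigma
      # critical t-value that yields the desired power under the noncentral t
      t_crit <- qt(1 - target_power, df = df, ncp = ncp)
      # alpha is the tail area above that critical value under the central t
      1 - pt(t_crit, df = df)
    }

    # Example: n = 4 (quarterly sampling) and 80% power to detect a true mean
    # one standard deviation above the standard
    lcl_alpha_for_power(n = 4, effect_sd = 1, target_power = 0.80)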

     Specific methods for choosing power and computing false positive rates with confidence interval
tests are  presented in  Chapter 22. Detailed applications of confidence  interval tests are provided in
Chapter 21. The focus here is on setting a basic framework and consistent strategies.

     As noted above, selecting false positive error rates in compliance or assessment testing (§264.99)
has traditionally been accomplished under RCRA by choosing a fixed, individual test a. This strategy is
attractive if only for the sake of simplicity.  Individual test-wise false positive rates in the range of α =
.01 to α = .10 are traditional and easily understood.  In addition, the Part 264 regulations in §264.97(i)(2)
require a minimum individual false positive rate of α = .01 in both compliance and corrective action
testing against a  fixed standard, as well as in those tests not specifically exempted under detection
monitoring.1

     Given a fixed sample size and constant level  of variation, the statistical power of a  test method
drops as the false positive rate decreases. A low false positive rate is often associated with low power.
Since statistical power is of particular concern to EPA in compliance/assessment monitoring, somewhat
higher false positive rates than the minimum α = .01 RCRA requirement may be necessary to maintain a
pre-specified power over  the range of sample sizes and variability likely to be  encountered in RCRA
testing situations. The key is sample variability. When the true population coefficient of variation [CV]
is  no greater than 0.5 (whether the underlying distribution is normal or lognormal), almost all lower
confidence limit tests exhibit adequate power. When the variation is higher, the risk of false  negative
error is typically much greater (and thus the power is lower), which may necessitate setting a larger than
usual individual α.

     Based on the discussion regarding false positives  in detection monitoring in  Chapter  6,  some
might be concerned about the use of relatively high individual test-wise false positive rates (α) in order
to  meet a pre-specified power, especially when considering the cumulative false positive error rate across
multiple wells and/or constituents (i.e., SWFPR).  Given that  a  number of compliance wells and
constituents might need to be tested, the likelihood of occurrence of at least one false positive error
increases dramatically. However, several factors specific to compliance/assessment monitoring need to
be considered. Unlike detection monitoring where the number of tests is  easily  identified,  the issue is
less obvious for  compliance/assessment or corrective action testing.  The RCRA  regulations do  not
clearly specify which wells and constituents must be compared to the GWPS in compliance/assessment
monitoring other than wells at the 'compliance point.' In some situations, this has been interpreted to
mean all compliance wells; in other instances, only at those wells with a documented exceedance.

     While all hazardous constituents including additional  ones detected in Part  264 Appendix IX
monitoring are potentially subject to testing, many may still be at concentration levels insignificantly
different from  onsite background. Constituents without health-based limits may or may not  be  included
in  compliance testing. The latter would be tested  against background levels, using perhaps  an ACL
computed as a tolerance limit on background (see Section 7.5). This also tends to complicate derivation
of SWFPRs in compliance testing. It was also noted in Section 7.2 that the levels at which contaminants
are released bear  no necessary relationship to fixed, health-based standards. In a typical release,  some
constituent levels from a suite of analytical parameters may lie orders of magnitude below their GWPS,
while certain carcinogenic compounds may easily exceed their standards.
1  In  some  instances, a test with "reasonable confidence"  (that is, having adequate statistical power) for identifying
  compliance violations can be designed even if α < 0.01.  This is particularly the case when the sample coefficient of
  variation is quite low, indicating small degrees of sample variability.

     The  simple example below illustrates typical low-level aquifer concentrations following a release
of four common petrochemical facility hazardous organic constituents often detected together:
        Analyte          Aquifer Concentration (ug/L)        MCL (ug/L)
                             Mean             SD
        Benzene               20              10                     5
        Toluene               35              15                 1,000
        Ethylbenzene          40              20                   700
        Xylene               100              35                10,000

       While benzene as a carcinogen has a very low health standard, the remaining three constituents
have aquifer concentrations orders of magnitude lower than their respective MCLs. Realistically, only
benzene is likely to impact the cumulative false positive rate in LCL testing. Similar relationships occur
in releases measured by trace element and semi-volatile organic suites.

     Even  though the null hypotheses in detection and compliance/assessment monitoring are similar
(and compound)  in nature (see [7.1]), it is reasonable to presume  in detection monitoring that the
compliance wells have average concentrations no less than mean background levels.  Since it is these
background levels to which the compliance point data are compared in the absence of a release, the
compound null hypothesis in detection monitoring (H0: μC ≤ μBG) can be reformulated practically as (H0:
μC = μBG).  In this framework, individual concentration measurements are likely to occasionally exceed
the background average and at times cause false positives to be identified even when there has been no
change in average groundwater quality.

     In compliance/assessment  monitoring, the situation is  generally different.  The compound null
hypothesis (H0: μC ≤ GWPS) will include some wells and constituents where the sample mean equals or
nearly  equals the GWPS when testing begins. But many well-constituent  pairs may have true means
considerably less  than the standard, making false positives much less likely for those comparisons and
lowering the  overall SWFPR. How much  so  will depend on both the variability of each individual
constituent and the degree to which the true mean (or relevant statistical parameter 0) is lower than the
GWPS for that analyte.

     Because of this, determining the relevant number of comparisons with non-negligible false positive
error rates may be quite difficult. The SWFPR in this situation would be defined as the probability that at
least one or more lower confidence limits exceeded the fixed standard G,  when the true parameter 0
(usually the mean) was actually below the standard. However, the relevant number of comparisons will
depend on the nature and extent of the release.  For a more extensive release, there is greater likelihood
that the null hypothesis is no longer true at one or more wells. Instead of computing false positive rates,
the focus should shift to minimizing false negative errors  (i.e., the risk of missing  contamination above
the GWPS).
2 Note that background might consist of early intrawell measurements from compliance wells when substantial spatial
  variability exists.

     On  balance,  the  Unified  Guidance  considers  computation  of  cumulative  SWFPRs  in
compliance/assessment testing to be problematic, and  reliance  on individual test false positive rates
preferable.  The above arguments also suggest that flexibility in setting individual test-wise a levels may
be appropriate.

7.4.2 FALSE POSITIVES AND STATISTICAL POWER IN CORRECTIVE ACTION

     When contamination above a GWPS  is confirmed, corrective action is triggered.  Following a
period of remediation activity, formal statistical testing  will usually involve an upper confidence limit
around the mean or an upper percentile compared against a  GWPS. EPA's overriding  concern in
corrective action is that remediation efforts not be declared successful without sufficient statistical proof.
Since groundwater is now presumed to be  impacted at unacceptable levels, a facility should not exit
corrective action until there is sufficient evidence that contamination has been abated.

     Given the reversal of test hypotheses from compliance/assessment monitoring to corrective action
(i.e., comparing  equation [7.1] with [7.2]), there is an asymmetry in regulatory considerations of false
positive and false negative  rates  depending  on the stage of monitoring.  In  compliance/assessment
monitoring using tests of the lower confidence limit,  the principal regulatory concern is that a given test
has adequate statistical power to detect exceedances above the GWPS.

     Permitted RCRA monitoring is likely to involve small annual well sample sizes based on quarterly
or semi-annual sampling.  Meeting a pre-specified level of power by controlling the false negative rate
(β) necessitates varying the false positive rate (α) for individual tests.  Controlling a cumulative false
positive rate for these tests (using a criterion like the SWFPR) is usually not practical, because of the
ambiguity in identifying the relevant number of potential tests and the difficulty of properly assigning
individual fractions of a targeted SWFPR via the subdivision principle (Chapter 19).

     By contrast under corrective  action using an  upper confidence limit for testing,  the principal
regulatory and environmental concern is that one or more constituents might falsely be declared below a
GWPS in concentration.  Under the corrective action null hypothesis [7.2], this would be a false positive
error, implying that α should be minimized during this sort of testing, instead of β.  Specific methods for
accomplishing this goal are presented in Chapter 22.

     A remaining question is whether  SWFPRs should be controlled during corrective action. While
potentially  desirable, the number of well-constituent pairs exceeding their respective GWPS and subject
to corrective action testing is likely to be small relative to compliance testing. Not all compliance wells
or constituents may have been impacted, and some  may not be contaminated to levels exceeding the
GWPS, depending on the nature, extent, and intensity of the plume. Remediation efforts would focus on
those constituents exceeding their GWPS.

     As noted  in  Section   7.4.1, the tenuous  relationship  between  ambient  background  levels,
contaminant magnitudes, and risk-based health standards implies  that most GWPS exceedances are
likely to be carcinogens,  usually  representing  a  small portion of all monitored constituents. Some
exceedances may also be related compounds, for instance, chlorinated hydrocarbon daughter degradation
products.

     Statistically, the fact that some wells are contaminated while others may not be makes it  difficult
to define SWFPRs in corrective action. Instead, the Unified  Guidance attempts to limit the individual
test-wise a at those wells where exceedances have been confirmed and that are undergoing remediation.
Since the most important consideration is to ensure that the true population parameter (0) is actually
below the clean-up standard before declaring remediation a success, this guidance recommends the use
of a reasonably low, fixed test-wise false positive rate (e.g., a = .05 or .10). Under this framework, there
will be a 5% to 10% chance of incorrectly declaring any single well-constituent pair to be in
compliance when its concentrations are truly above the remedial standard.

     The regulatory position in corrective action concerning  statistical  power is  one of relative
indifference. Although power under [7.2]  represents the probability that the confidence interval test will
correctly identify concentrations to be below the regulatory standard when in fact they  are, the onus of
proof for demonstrating that remediation has succeeded (e.g., μC < GWPS) falls on the regulated facility.
As it is the facility's interest to demonstrate compliance, it may wish to develop statistical power criteria
which would enhance this possibility (including increasing test sample sizes).

7.4.3 RECOMMENDED STRATEGIES

     As noted in Section 7.1, the Unified Guidance recommends the use of confidence intervals in both
compliance/assessment and corrective action testing. In compliance/assessment, the lower confidence
limit is the appropriate statistic of interest, while in corrective action it is the upper confidence limit. In
either case, the confidence  limit is compared against a fixed,  regulatory standard as a one-sample test.
These recommendations are consistent with good statistical practice, as well as literature in the field,
such as Gibbons and Coleman (2001).

     The type of confidence  interval test will initially be determined by the choice of parameter(s) to
represent the GWPS (Section 7.2). While this discussion has  suggested that the mean may be the most
appropriate parameter for chronic, health-based limits, other choices are possible.  Chapter 21 identifies
potential test statistical tests of a mean, median or upper percentile as the most appropriate parameters
for comparison to a  GWPS.   In turn, data characteristics will determine whether parametric or non-
parametric test versions can be used.  Depending on whether normality can be assumed for the original
data  or following transformation, somewhat different approaches may be  needed.  Finally, the presence
of data trends affects how confidence interval testing can be applied.

     Some regulatory programs prefer to  compare each individual measurement against G, identifying a
well  as  out-of-compliance if  any  of the individual concentrations exceeds the standard. However, the
false positive rate associated with such strategies tends to be  quite high if the parameter choice has not
been clearly specified.  Using this individual comparison approach while assuming a mean as the
parameter of choice is of particular concern.  If the true mean is less than but close to the standard,
chances are very high that one or more  individual measurements will be greater than the limit even
though the hypothesis in [7.1] has not been violated. Corrective action could then be initiated on a false
premise. To evaluate whether a limited number of sample data exceed a standard, a lower confidence
interval test would need to be based on a pre-specified upper percentile assumed to be the appropriate
parameter for comparison to the GWPS.

     Small  individual   well  sample  sizes   and data  uncertainty   can rarely  be   avoided  in
compliance/assessment and corrective action. Given the nature of RCRA permits, sampling frequencies
in compliance/assessment or corrective action monitoring are likely to be  established  in  advance.
Relatively small sample sizes per well-constituent pair each year are likely to be the rule; the Unified
Guidance assumes that quarterly and semi-annual sampling will be very typical.

     For small and highly variable sample data sets, compliance/assessment monitoring and corrective
action tests will have low statistical power either to detect exceedances above fixed  standards or to
demonstrate compliance in corrective action. One way to both enhance statistical power and control false
positive error rates  is through incremental or sequential pooling of compliance point  data over time.
Adding  more  data  into a test of non-compliance or  compliance will  generally result in  narrower
confidence intervals and a clearer decision with respect to a compliance standard.

     The  Unified Guidance recommends accumulating compliance data over time at each  well, by
allowing construction of confidence limits on overlapping as opposed to distinct or mutually  exclusive
data sets.  If the lower confidence limit [LCL] exceeds the GWPS in compliance/assessment, a clear
exceedance can be  identified.  If the upper confidence limit [UCL] is below the GWPS in corrective
action, remediation  at that well can be declared a success.  If neither of these respective events occurs,
further sampling should continue. A confidence interval can be recomputed after each additional 1 or 2
measurements and  a determination made whether the  position of the confidence limit has changed
relative to the compliance standard.
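
     The mechanics of this accumulating comparison are straightforward to automate.  The sketch below
recomputes a normal-based LCL on the pooled compliance data as each new measurement arrives; the
measurements, GWPS, and confidence level are hypothetical, and the non-independence of the
successive comparisons discussed below should be kept in mind.

    # Recompute the lower confidence limit on accumulated compliance data after each
    # new measurement and compare it to the GWPS. All values are hypothetical.
    sequential_lcl <- function(x, gwps, conf = 0.95) {
      n   <- length(x)
      lcl <- mean(x) - qt(conf, df = n - 1) * sd(x) / sqrt(n)
      data.frame(n = n, mean = mean(x), lcl = lcl, exceeds_gwps = lcl > gwps)
    }

    obs  <- c(4.2, 5.1, 6.3, 5.8, 6.9, 7.4)   # hypothetical quarterly results (ug/L)
    gwps <- 5
    do.call(rbind, lapply(4:length(obs), function(k) sequential_lcl(obs[1:k], gwps)))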

     Tests constructed  in  this  way at each  successive evaluation  period will not  be  statistically
independent; instead, the proposed testing strategy falls into the realm of sequential analysis. But it
should help to minimize the possibility that a small group of spurious values will either push  a facility
into needless corrective action or prevent a successful remedial effort from being identified.

     One  caveat with this approach is that it must be reasonable to assume that the population parameter
© is stable over time. If a release has occurred and a contaminant plume is spreading through the aquifer,
concentration shifts in the form of increasing trends over time may be more likely at contaminated wells.
Likewise under active remediation, decreasing trends for a period of time may be more likely. Therefore,
it is recommended that the sequential testing approach be used after aquifer conditions have stabilized to
some degree. While concentration levels are actively changing with time, use of confidence  intervals
around a trend line should be pursued (see Section 7.4.4 and Chapter 21).

7.4.4 ACCOUNTING FOR SHIFTS AND TRENDS

     While accumulating compliance point data over time and successively re-computing confidence
limits is appropriate for  stable (i.e., stationary) populations, it can give misleading or false results when
the underlying population is  changing. Should a release  create an expanding contaminant plume within
the aquifer, concentration levels at some or all of the compliance wells will tend to shift upward,  either
in discrete jumps (as illustrated in Figure 7-2) or an increasing trend over time. In these cases, a  lower
confidence limit constructed on  accumulated data will be overly wide (due to high sample variability
caused by  combining pre- and post-shift data) and not be reflective of the more recent upward shift in the
contaminant distribution.

      Figure 7-2. Effect on Confidence Intervals of Stable Contamination Level

      [Figure: concentration time series by year (1990-1998), showing the GWPS and the
       point at which compliance monitoring begins.]

     A similar problem  can arise  with  corrective  action  data.  Aquifer modifications as part of
contaminant removals are likely to result in observable declines in constituent concentrations during the
active  treatment phase. At some point following cessation of remedial action, a new steady-state
equilibrium may be established (Figure 7-3).

                 Figure 7-3. Decreasing Trend During Corrective Action

                 [Figure: concentrations declining over roughly 24 months of sampling
                  during active remediation.]

     Until  then, it is inappropriate to use a confidence interval test  around the mean or an upper
percentile to evaluate remedial success with respect to a clean-up  standard.  During active treatment
phases and under non-steady state conditions, other forms of analysis such as confidence bands around a
trend (see below) are recommended and should be pursued.
     The Unified Guidance considers two basic types of non-stationary behavior: shifts and (linear)
trends. A shift refers to a significant mean concentration increase or decrease departing from a roughly
stable mean level.  A trend refers to  a series of consecutive measurements that evidence successively
increasing or decreasing concentration levels. More complicated non-random data patterns are also
possible, but beyond the  scope of this guidance.   With these two  basic scenarios, the strategy for
constructing an appropriate confidence interval differs.

     An important preliminary step is to track the individual compliance point measurements on a time
series plot (Chapter 9). If a discrete shift in concentration level is evident, a confidence limit should be
computed on the most recent stable measurements.  Limiting the observations in this fashion to a
specific time period is often termed a 'moving window.' The reduction in sample size will often be more
than offset by the gain in statistical power.  More recent measurements may exhibit less variation around
the shifted mean value, resulting in a shorter confidence interval (Figure 7-4). The sample size included
in  the  moving  window   should  be   sufficient  to   achieve   the  desired  statistical  power
(compliance/assessment)  or  false positive rate (corrective  action). However, measurements that are
clearly unrepresentative of the newly shifted distribution should not be included, even if the sample size
suffers.  Once a stable mean can be assumed, the strategy of sequential pooling can be used.

            Figure 7-4. Effect on Confidence Intervals of Concentration Shift

            [Figure: concentration time series by year (1989-1998), showing the GWPS, the
             point where compliance monitoring (CM) begins, and a confidence interval
             computed on all CM data.]

       If well concentration levels exhibit an increasing or decreasing trend over time (such as the
example in Figure 7-5) and the pattern is reasonably linear or monotone, the trend can be identified
using the methods detailed in Chapter 17. To measure compliance or non-compliance, a confidence
band can be constructed around the estimated trend line, as described in Chapter 21. A confidence band
is essentially a continuous series  of confidence intervals estimated along every point of the trend. Using
this technique, the appropriate upper or lower confidence limits at one or more points in the most recent
portion of the end of the sampling record can be compared against the fixed standard.  The lower band is
used to determine whether or not an exceedance has occurred in compliance/assessment, and an upper
confidence band to determine if remedial success has been achieved in corrective action.

                Figure 7-5. Rising Trend During Compliance Monitoring

                [Figure: sulfate concentrations rising by year (1989-1998), with a fitted
                 linear regression trend line.]

     By explicitly accounting for the trend, the confidence interval in Chapter 21 will adjust upward or
downward with the trend and thus more accurately estimate the current true concentration levels. Trend
techniques are not just used to track progress towards exceeding or meeting a fixed standard. Confidence
bands around the trend line can also provide an estimate of confidence  in the average concentration as it
changes over time. This subject is further covered in the Comprehensive Environmental Response,
Compensation, and Liability Act [CERCLA] guidance Methods for Evaluating the Attainment of
Cleanup Standards — Volume 2: Groundwater (EPA, 1992a).
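
     The sketch below illustrates one way such a band might be computed for a simple linear trend using
ordinary least squares.  The concentrations and GWPS are hypothetical, and Chapter 21 presents the
recommended procedures, including the one-sided adjustments discussed in Section 7.2.

    # Confidence band around a fitted linear trend, compared to a fixed standard at the
    # most recent sampling events. All values are hypothetical.
    month <- 1:12
    conc  <- c(480, 455, 430, 410, 390, 370, 355, 345, 330, 320, 310, 305)  # ug/L
    gwps  <- 350

    fit  <- lm(conc ~ month)
    band <- predict(fit, newdata = data.frame(month = month),
                    interval = "confidence", level = 0.95)   # two-sided 95% band

    tail(cbind(month, band), 3)          # fitted trend and band at the last 3 events
    all(tail(band[, "upr"], 3) < gwps)   # is the recent upper band below the GWPS?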

     A final determination of remedial success should not solely be a statistical decision.  In many
hydrologic settings, contaminant concentrations tend to  rise after groundwater pumping wells are turned
off due to changes in well  drawdown patterns.   Concentration levels may exhibit more complicated
behavior than the two situations considered above. Thus, on balance, it  is recommended that determining
achievement of corrective action goals be done in consultation with the site manager, geologist, and/or
remedial engineer.

7.4.5 IMPACT OF  SAMPLE VARIABILITY,  NON-DETECTS, AND NON-NORMAL DATA

     Selection of hazardous constituents to be monitored in compliance/assessment or corrective action
is largely determined  by permit decisions. Regulatory requirements (e.g., Part 264, Appendix IX)  may
also dictate the number of constituents.  As a  practical  matter,  the most  reliable  indicators  of
contamination should be favored. Occasionally, constituents subject to degradation and transformation in
the aquifer  (e.g., chlorinated  hydrocarbon suites) may result  in additional,  related constituents  of
concern.

      Since health-based considerations  are paramount in this type of monitoring, the most sensitive
constituents from a health risk standpoint could be selected. But even with population parameters (0),
sample sizes, and  constituents  determined,  selecting an  appropriate confidence interval test from
Chapter 21 can be problematic. For mildly variable sample data, measured  at relatively stable levels,
tests based on the normal distribution should be favored, whether constructed around a mean or an upper
percentile. With highly variable sample data,  selection of a test is less straightforward. If the observed
data happen to be lognormal, Land's confidence interval around the arithmetic mean is a valid option;
however, it has low  power to measure  compliance as the observations become more variable,  and
upward adjustment of the false positive rate (α) may be necessary to maintain sufficient power.

      In addition, the extreme variability of an upper confidence  limit using Land's technique  can
severely restrict its usage in tests of compliance during corrective action. Depending on the  data pattern
observed, degree of variability, and how  closely the sample mimics the lognormal model, consultation
with  a professional statistician should be considered to  resolve unusual  cases. When the lognormal
coefficient  of variation is quite high, one alternative is to construct an upper confidence limit around the
lognormal geometric mean (Chapter 21). Although such a confidence limit does not fully  account for
extreme  concentration values in the right-hand tail  of the lognormal distribution, a bound on the
geometric mean will account for the bulk of possible measurements. Nonetheless, use of  a geometric
mean  as a surrogate  for the population arithmetic  mean  leads to distinctly  different statistical  test
characteristics in terms of power and false positive rates.
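
     For reference, an upper confidence limit on the lognormal geometric mean is simply a
back-transformed normal limit computed on the log scale, as in the sketch below; the data are
hypothetical.

    # Upper confidence limit on the geometric mean of lognormal-looking data, obtained by
    # exponentiating a normal-based limit on the logged values. Data are hypothetical.
    conc <- c(3.1, 4.8, 2.2, 9.6, 5.4, 3.9, 12.3, 6.1)   # ug/L
    logs <- log(conc)
    n    <- length(logs)
    exp(mean(logs) + qt(0.95, df = n - 1) * sd(logs) / sqrt(n))   # 95% UCL on the geometric mean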

      In  sum,  excessive sample  variability  can   severely limit the  effectiveness of traditional
compliance/assessment  and  corrective action testing. On the other hand, if  excessive variability  is
primarily due to trends observable in the data,  confidence bands around a linear trend can be constructed
(Section 7.4.4).

       LEFT-CENSORED SAMPLES

      For  compliance point data  sets  containing  left-censored   measurements  (i.e., non-detects),
parametric  confidence intervals  cannot be computed  directly  without some  adjustment.  All of the
parametric confidence intervals described in Chapter 21 require estimates of the population mean μ and
standard deviation σ.  A number of adjustment strategies are presented in Chapter 15.  If the percentage
of non-detects is small — no more than 10-15% — simple substitution of half the reporting limit [RL]
for each non-detect will generally work to give an approximately correct confidence interval.
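
     A minimal sketch of this simple-substitution adjustment, assuming a single reporting limit and
hypothetical measurements, follows.

    # Replace each non-detect with half its reporting limit, then compute a normal-based
    # 95% lower confidence limit. Data and reporting limit are hypothetical.
    conc <- c(NA, 3.2, NA, 4.1, 5.0, 3.7, 4.4, 5.6)   # NA marks a non-detect
    rl   <- 2.0                                        # reporting limit (ug/L)
    x    <- ifelse(is.na(conc), rl / 2, conc)

    n <- length(x)
    mean(x) - qt(0.95, df = n - 1) * sd(x) / sqrt(n)   # approximate 95% LCL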

      For samples of at least  8-10 measurements and up to 50% non-detects, the Kaplan-Meier or robust
regression on order statistics [ROS] methods can be used.  The data should first be assessed via a censored
probability plot to determine whether the sample can be normalized.  If so, these techniques can be used
to compute estimates of the mean μ and standard deviation σ adjusted for the presence of left-censored
values.  These adjusted estimates can be used in place of the sample mean (x̄) and standard deviation (s) listed in the
confidence interval formulas of Chapter 21  around either a mean or upper percentile.

      If none of these adjustments is appropriate, non-parametric confidence intervals on either the
median or an upper percentile (Section 21.2) can be calculated. Larger sample  sizes are needed than with
parametric  confidence interval counterparts, especially for intervals around an upper percentile, to ensure
a high level of confidence and a sufficiently narrow interval. The principal advantage of non-parametric
intervals is their flexibility.  Not only can large fractions of non-detects be accommodated, but non-
parametric confidence intervals can also be applied to data sets which cannot be normalized.
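
     As an illustration of the non-parametric approach, the sketch below computes a lower confidence
limit on the median from an order statistic, together with its achieved confidence from the binomial
distribution; the sample values are hypothetical and Section 21.2 gives the full procedure.

    x <- sort(c(2.1, 3.4, 1.8, 4.2, 2.9, 3.7, 2.5, 3.1, 4.0, 2.2, 3.3, 2.8))   # hypothetical sample
    n <- length(x)
    k <- 3                                        # rank of the order statistic used as the lower limit
    lcl  <- x[k]                                  # non-parametric LCL on the population median
    conf <- 1 - pbinom(k - 1, n, 0.5)             # achieved confidence that the true median >= lcl
    c(LCL = lcl, achieved_confidence = round(conf, 3))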

     For heavily censored small data sets of 4-6 observations, the options are limited. One approach is
to replace each non-detect by half its RL and compute the confidence interval  as if the sample were
normal. Though the resulting interval will be approximate, it can provide a preliminary indication of the
well's compliance with the standard until further sampling data can be accumulated and the confidence
interval recomputed.

     Confidence bands around a trend can be constructed  with censored  data using a bootstrapped
Theil-Sen non-parametric trend line  (Section  21.3.2).  In this method,  the Theil-Sen trend  is first
computed using the sample data, accounting for the non-detects. Then a large number of bootstrap
resamples are drawn from the original sample, and an alternate Theil-Sen trend is estimated on each
bootstrap  sample. Variability in these alternate  trend estimates is then used to construct a confidence
band around the original trend.
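
     A simplified R sketch of this bootstrap strategy follows. It assumes non-detects have already been
replaced by a common surrogate value (for example, half the RL); Section 21.3.2 treats the censoring more
carefully, and all data values here are hypothetical.

    theil_sen <- function(x, y) {
      ij <- combn(length(y), 2)                          # all pairs of observations
      slopes <- (y[ij[2, ]] - y[ij[1, ]]) / (x[ij[2, ]] - x[ij[1, ]])
      slopes <- slopes[is.finite(slopes)]                # drop undefined slopes from duplicated times
      b <- median(slopes)                                # Theil-Sen slope
      a <- median(y) - b * median(x)                     # intercept through the medians
      c(intercept = a, slope = b)
    }
    set.seed(1)
    time <- 1:12                                         # hypothetical sampling events
    y    <- 5 + 0.4 * time + rnorm(12, sd = 1)           # hypothetical concentrations
    B    <- 1000
    fits <- replicate(B, { k <- sample(length(y), replace = TRUE); theil_sen(time[k], y[k]) })
    pred <- outer(time, 1:B, function(tt, b) fits["intercept", b] + fits["slope", b] * tt)
    band <- apply(pred, 1, quantile, probs = c(0.025, 0.975))   # pointwise 95% band around the trend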

       LOGNORMAL AND OTHER NORMALIZED DATA

     Lognormal data may require special treatment  when building a  confidence interval around the
mean. Land's method (Section  21.1.3) can offer a reasonable way to accommodate the transformation
bias associated  with  the  logarithm,  particularly when computing  a  lower  confidence  limit  as
recommended  in compliance/assessment monitoring. For data normalized by transformations other than
the logarithm,  one  option is to calculate a normal-based confidence interval  around the mean using the
transformed measurements, then  back-transform  the limits  to  the original concentration scale. The
resulting interval will not represent a confidence interval around the arithmetic mean of the original data,
but rather a confidence interval around the median and/or geometric mean.

     If the difference between the arithmetic mean and median is not considered important for a given
GWPS, this strategy will be the easiest to implement. A wide range of results can occur with Land's
method on highly skewed lognormal populations, especially when computing an upper confidence limit
around the arithmetic mean (Singh et al., 1997). It may be better to either construct a confidence interval
around the lognormal geometric mean (Section 21.1.2) or to use the technique of bootstrapping (Efron,
1979; Davison and Hinkley, 1997) to create a non-parametric interval around the arithmetic mean.3
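
     A minimal sketch of the bootstrap percentile interval for the arithmetic mean is given below, using
hypothetical right-skewed data; refinements such as bias-corrected (BCa) intervals are available in
specialized software.

    set.seed(42)
    x <- exp(rnorm(20, mean = 2, sd = 1))            # hypothetical right-skewed (lognormal-like) sample
    B <- 2000
    boot_means <- replicate(B, mean(sample(x, replace = TRUE)))
    quantile(boot_means, c(0.05, 0.95))              # two-sided 90% percentile interval for the mean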

     For confidence intervals  around an upper percentile, no bias is  induced by data that have been
normalized via a transformation. Whatever the transformation used (e.g., logarithm, square root, cube,
etc.), a confidence interval can be constructed on the transformed data.  The resulting limits can then be
back-transformed to provide confidence limits around the desired upper percentile in the concentration
domain.
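
     The following hedged R sketch (hypothetical data) illustrates the percentile case: a one-sided upper
confidence limit on the 95th percentile is computed from log-transformed values and back-transformed. The
factor comes from the non-central t distribution and is equivalent to a one-sided normal tolerance factor.

    x <- c(12, 15, 9, 22, 31, 18, 14, 27, 11, 16)     # hypothetical concentrations
    y <- log(x)                                       # normalizing transformation
    n <- length(y); p <- 0.95; conf <- 0.95
    K <- qt(conf, df = n - 1, ncp = qnorm(p) * sqrt(n)) / sqrt(n)
    exp(mean(y) + K * sd(y))                          # back-transformed UCL on the 95th percentile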
3 Bootstrapping is widely available in statistical software, including the open source R computing environment and EPA's
  free-of-charge ProUCL package. In some cases, setting up the procedure correctly may require professional statistical
  consultation.

7.5 COMPARISONS TO BACKGROUND  DATA

     Statistical tests in compliance/assessment and corrective action monitoring will often involve a
comparison between compliance point measurements and a promulgated fixed health-based limit or a
risk-based remedial action goal as the GWPS, described earlier. But a number of situations arise where
a GWPS must be based on a background limit.  The Part 264 regulations presume such a standard as one
of the options under §264.94(a); an ACL may also be determined from background under §264.94(b).
More recent Part 258 rules specify a background GWPS where a promulgated or risk-based standard is
not available or if the historical background is greater than an MCL [§258.55(h)(2) & (3)].

     Health-based risk standards bear no necessary relationship  to site-specific aquifer concentration
levels.  At many sites this poses no  problem,  since the observed  levels of many constituents may be
considerably  lower  than  their GWPS.   However,  either naturally-occurring or pre-existing  aquifer
concentrations of certain analytes can exceed promulgated standards.  Two commonly monitored trace
elements in particular- arsenic  and  selenium— are occasionally found at uncontaminated background
well  concentrations exceeding their respective MCLs. The regulations then provide that a GWPS based
on background levels is appropriate.

     A number of factors should be considered in designing a background-type GWPS testing program
for compliance/assessment or corrective action monitoring.  The most fundamental decision is whether
to base  such comparisons on two- (or multiple-) sample versus single-sample tests. For the first, many
of the design factors discussed  for detection monitoring in Chapter  6 will be appropriate; for single
sample comparisons to a fixed background GWPS, a confidence level approach similar to that discussed
earlier for testing fixed health standards in this Chapter 7 would be applied. This basic decision then
determines how the GWPS is defined, the appropriate test hypotheses,  types of statistical tests, what the
background GWPS represents in statistical  terms, and the relevance of individual test and  cumulative
false positive error rates. Such decisions may also be constrained by State groundwater anti-degradation
policies.  Other design factors to consider are the number of wells and constituents tested, interwell
versus  intrawell options, background sample sizes, and power. Unlike a single fixed standard like an
MCL, background GWPSs may be uniquely defined for a given monitoring well constituent by a
number of these factors.

     SINGLE- VERSUS TWO-SAMPLE TESTING

       One of two fundamental testing approaches can be used with site-specific background GWPSs.
Either 1) a GWPS is defined as the critical limit from a pre-selected detection-level statistical test (e.g., a
prediction limit) based on background measurements, or 2) background data are used to generate a fixed
GWPS somewhat elevated above current background levels. In both cases, the resulting GWPS will be
constituent- and possibly compliance well-specific. The first represents a two-sample test of two
distinct populations (or more if a multiple-sample test) similar  to those utilized in detection monitoring.
As such, the individual  test false positive rate, historical background  sample size, cumulative false
positive considerations, number of annual tests and desired future sample size will uniquely determine
the limit.  Whatever the critical value  for a selected background test, it becomes the  GWPS under
compliance/assessment or corrective action monitoring.

     The only allowable hypothesis test structure for the two-sample approach follows that of detection
and compliance monitoring [7.1].  Once exceeded and in corrective action, a return to compliance is
through evidence that future samples lie below the GWPS using the same hypothesis structure.

     The second option uses a fixed statistic from the background data as the GWPS in a single-sample
confidence interval test.   Samples from a single population are  compared to  the fixed limit.  In other
respects, the strategy follows that outlined in Chapter 7 for fixed health- or risk-based GWPS tests. The
compliance/assessment test hypothesis structure also follows [7.1], but the hypotheses are reversed as in
[7.2] for corrective action testing.

     The  choice of the  single-sample GWPS deserves careful  consideration. In the past, many such
standards were simply computed as multiples of the background sample average (i.e., GWPS = 2·x̄).
However, this approach may not fully account for natural variation in background levels and can lead to
higher than expected false positive rates. If the GWPS were to be set at the historical background
sample mean, even  higher false positive rates would occur during  compliance monitoring,  and
demonstrating corrective action compliance becomes almost impossible.

      In the recommendations which follow below, an upper tolerance limit based on both background
sample size  and sample variability is recommended for identifying the background GWPS at a suitably
high level above current background to allow for reversal of the test hypotheses.  Although a
somewhat arbitrary choice, a  GWPS based on this method allows for a variety of confidence interval
tests (e.g., a  one-way normal mean confidence interval identified in equations [7.3] and [7.4]).

      WHAT  A  BACKGROUND GWPS REPRESENTS

     If the testing protocol involves two-sample comparisons, the background GWPS is an upper limit
statistical  interval derived from a given set of background data  based on one or another detection
monitoring tests discussed in Chapter 6 and detailed in Part III. In these cases, the appropriate testing
parameter is the true mean for the parametric tests, and the true  median for non-parametric tests. This
would include 1-of-m prediction limit detection tests involving future values.  If a single-sample
comparison against a fixed background GWPS is used, the appropriate parameter will also depend upon
the type of confidence interval test to be used (Part IV).  Except for parametric or non-parametric upper
percentile  comparisons, the likely statistical parameter would again be a mean (arithmetic, logarithmic,
geometric) or  the median.  A background  GWPS could be defined  as  an upper percentile parameter,
making use  of normal test confidence  interval structures found in Section  21.1.4.  Non-parametric
percentile  options would likely require test sample sizes too  large for most applications. The Unified
Guidance  recommended approaches for defining single-sample GWPSs discussed later in this section
presume a central tendency test parameter like the mean or median.

     NUMBER  OF MONITORED WELLS AND CONSTITUENTS

     Compliance/assessment or corrective  action monitoring tests  against a fixed health- or risk-based
standard (including single-sample background GWPSs) are not affected in a significant manner by the
number of annual tests.   But this would not be true for two- or multiple-sample background GWPS
testing.   In  similar  fashion  to detection  monitoring, the  total number of tests is an  important
consideration in defining the appropriate false positive error test rate (αtest).  The total number of annual
tests is determined by how many compliance wells, constituents and evaluations occur per year.

     Regulatory  agency interpretations  will  determine  the number  and location  of compliance
monitoring wells. These can differ depending on whether the wells are unit-specific, and if a reasonable
subset can be shown to be affected by  a  release. Perhaps  only those compliance wells containing
detectable levels of a compliance monitoring constituent need be included. Formal tests are
generally required semi-annually, but other approaches may be applied.

     The number of constituents subject to two-sample background GWPS testing will also depend on
several factors. Only hazardous constituents not having a health- or risk-based standard are considered
here.   The basic criterion in interpreting required Part 264 Appendix IX or Part 258 Appendix II
analyses is to identify those hazardous constituents found in downgradient compliance wells.  Some
initially detected common laboratory or sampling contaminants might be eliminated following a repeat
scan. The remainder of the qualifying constituents will then require some form of background GWPSs.
Along with the number of wells and annual evaluations, the total annual number of background tests will
then be used in addressing an overall cumulative design false positive rate.

     In corrective action testing (for either the one- or two-sample approaches),  the number of
compliance wells and  constituents may differ.  Only those  wells and constituents showing a significant
compliance test exceedance might be used.  However, from  a standpoint of eventually demonstrating
compliance under corrective  action, it might be  appropriate to still use the compliance/assessment
GWPS for two-sample tests.  With single-sample tests, the  GWPS is compared individually by well and
constituent as described.

     BACKGROUND SAMPLE SIZES AND INTERWELL VS. INTRAWELL TESTING

     Some potential constituents may already have been monitored during the detection phase, and have
a reasonable background sample size. Others identified under Part 264 Appendix IX or Part 258 Appendix II
testing may have no historical background databases and require a period of background sampling.

     Historical constituent well data patterns and the results of this testing may help determine  if an
interwell  or intrawell  approach should be used  for a given  constituent.  For example, if arsenic and
selenium were historical constituents in detection monitoring, they might also be identified as candidates
for compliance background GWPS testing. There may already be indications that individual well spatial
differences will  need to be taken  into account  and an intrawell  approach followed.   In  this  case,
individual compliance well background GWPSs need to be established and tested. On the other hand,
certain hazardous  trace  elements and organics  may only be detected and  confirmed in one  or more
compliance wells with non-detects in background upgradient wells and possibly historical compliance
well data. Under the latter conditions, the simpler Double Quantification Rule (Section 6.2.2) might be
used with the GWPS set at a quantification limit. However, this could pose some interpretation
problems. Subsequent testing against the background GWPS at the same compliance well concentration
levels that caused the initial detection monitoring exceedance might very likely result in further excursions
above the background GWPS.  The more realistic option would be  to  collect and  use  additional
compliance well data to establish a specific minimum intrawell background,  and only apply the Double
Quantification Rule at other wells not exhibiting detections. Even this approach might be unnecessarily
stringent if a contaminant plume were to expand in size and gradually  affect other compliance wells
(now subject to GWPS testing).

     CUMULATIVE & INDIVIDUAL TEST FALSE POSITIVE RATES

      Each of the independent two-sample tests against background standards will have a roughly equal
probability of being  exceeded by chance alone.  Since an exceedance in the  compliance monitoring
mode based on background can result in a need for corrective action, it is  recommended that the
individual test false positive rate be set sufficiently low.  Much of the discussion in Chapter 6, Section
6.2.2 is relevant here. An a priori cumulative design error rate must first be identified. To allow for
application of the Unified Guidance detection monitoring strategies and Appendix D tables, it is
suggested that the 0.1 SWFPR value also be applied to two-sample background GWPS testing. In similar
fashion to Chapter 6 and Part III, this can be translated into individual test configurations.

     If  the single-sample confidence  interval  option  will  be used with an elevated  GWPS,  the
compliance level test will have a very low probability of being exceeded by truly background data.
Cumulative false positive error considerations are generally negligible. For testing
compliance/assessment or corrective action hypotheses, there is still a need to identify an appropriately
low single test false positive rate which meets the regulatory goals. Generally, a single test false
positive error rate of 0.1 to 0.05 will be suitable with the recommended approach for defining the
background GWPS.

     UNIFIED GUIDANCE RECOMMENDATIONS

      Two-Sample GWPS Definition and Testing

     As indicated above, any of the detection monitoring tests described in Chapter 6 might be selected
for two- or multiple- sample background compliance testing.  One highly recommended statistical test
approach is a prediction limit. Either a parametric prediction limit for a future mean (Section 18.2.2) or
a non-parametric prediction limit for a future  median (Section 18.3.2) can be used,  depending on the
constituent being  tested and  its statistical  and  distributional characteristics (e.g., detection  rate,
normality, etc.). It would be equally possible to utilize one of the 1-of-m future value prediction limit
tests, on an interwell  or intrawell basis.  Use of repeat samples as part of the selected test is appropriate,
although the expected number of annual compliance/corrective action  samples may dictate which tests
can apply.

      One parametric example is the 1-of-1 future mean test. If the background data can be normalized,
background observations are used to construct a parametric prediction limit with (1-α) confidence
around a mean of order p, using the equation:

                              PL = x̄ + t(1-α, n-1) · s · √(1/p + 1/n)                          [7.5]


       The next p measurements from each compliance well are averaged and the future mean compared
to the background  prediction limit, PL (considered the background GWPS). In compliance/assessment
monitoring, if any  of the means exceeds the limit, those well-constituent pairs are deemed to be out of
compliance. In corrective  action, if the future mean is no greater than  PL, it can be concluded that the
well-constituent pair is sufficiently similar to background to be within the remediation goal. In both
monitoring phases, the prediction limit is constructed to represent a reasonable upper limit on the
background  distribution.  Compliance point means above  this limit  are  statistically  different  from
background; means below it are similar to background.
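
     A minimal sketch of the computation in equation [7.5] is shown below. The background statistics are
those of Example 7-1, but the per-test α and the future values are illustrative assumptions only; in
practice α would come from the site-wide false positive rate design of Chapter 19 and Appendix D.

    # (1-alpha) prediction limit for a future mean of order p, from n background measurements
    pl_future_mean <- function(xbar, s, n, p, alpha) {
      xbar + qt(1 - alpha, df = n - 1) * s * sqrt(1 / p + 1 / n)
    }
    pl <- pl_future_mean(xbar = 37.0, s = 18.16, n = 8, p = 2, alpha = 0.05)
    future_pair <- c(45.2, 51.6)                  # hypothetical next p = 2 compliance measurements
    mean(future_pair) > pl                        # TRUE would flag the well-constituent pair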

     If the background sample cannot be normalized, perhaps due to a large fraction of non-detects, two-
sample non-parametric upper prediction limit detection monitoring tests (Chapters 18 & 19) can be
used. As an example, a maximal order statistic (often the highest or second-highest value) can be
selected from background as a non-parametric 1-of-1 upper prediction limit test of the median. Table
18-2 is used to guide the choice based on background sample size (n) and the achievable confidence
level (1-α). The median of the next 3 measurements from each compliance well is compared to the upper
prediction limit. As with the parametric case in compliance/assessment, if any of the medians exceeds
the limit, those well-constituent pairs would be considered out of compliance. In corrective action, well-
constituent pairs with  medians no greater than the background prediction limit would be considered as
having met the standard.
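
     A minimal sketch of this non-parametric comparison, using hypothetical data; Table 18-2 supplies the
confidence level actually achieved for a given background sample size.

    bg     <- c(5, 7, 4, 9, 6, 8, 5, 10, 7, 6, 8, 9)   # hypothetical background sample, n = 12
    future <- c(8, 12, 9)                              # hypothetical next three compliance measurements
    pl     <- max(bg)                                  # maximal order statistic as the upper prediction limit
    median(future) > pl                                # TRUE would flag the well-constituent pair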

       If background measurements for a particular constituent are all non-detect, the GWPS should be
set equal to the highest RL. In similar fashion to detection monitoring, 1-of-2 or 1-of-3 future value
prediction limit tests can be applied (Section 6.2.2, Double Quantification rule).

             Single-Sample GWPS  Definition and Testing

     For single-sample testing, the Unified  Guidance recommendation is to  define a fixed GWPS or
ACL based on a background upper tolerance limit with 95% confidence and 95% coverage (Chapter
17). For normal background, the appropriate formula for the GWPS would be the same as that given in
Section 17.2.1, namely:

                                   GWPS = x̄ + τ(n, .95, .95) · s                             [7.6]

where n = number of background measurements, x̄ and s represent the background sample mean and
standard deviation, and τ is a tolerance factor selected from Table 17-3. If the background sample is a
mixture of detects and non-detects, but the non-detect fraction is no more than 50%, a censored
estimation method such as Kaplan-Meier or robust regression on order statistics [ROS] (Chapter 15)
can be attempted to compute adjusted estimates of the background mean μ and standard deviation σ in
equation [7.6].
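
     The tolerance factor τ can be computed directly from the non-central t distribution, as in the
sketch below; the result is close to the tabled 3.187 used later in Example 7-1 and reproduces that
example's GWPS.

    tol_factor <- function(n, coverage = 0.95, conf = 0.95) {
      qt(conf, df = n - 1, ncp = qnorm(coverage) * sqrt(n)) / sqrt(n)
    }
    tau <- tol_factor(8)          # approximately 3.19 for n = 8
    37.0 + tau * 18.16            # approximately 94.9 ug/L, the Example 7-1 background GWPS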

     For  larger  fractions  of non-detects,  a non-parametric  tolerance limit  can  be  constructed, as
explained in Section 17.2.2. In this case, the GWPS will often be set to the largest or second-
largest observed value in background. Table 17-4 can be used to determine the achieved confidence
level (1-α) associated with a 95% coverage GWPS constructed in this way. Ideally, enough background
measurements  should  be used to set the tolerance limit as  close to the target of 95% coverage, 95%
confidence as possible. However, this could require very large background sample sizes (n > 60).
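
     The achieved confidence for such a limit follows from a standard binomial calculation, sketched
below for the case where the m-th largest background value is used as a 95% coverage limit.

    np_tol_conf <- function(n, m = 1, coverage = 0.95) pbinom(n - m, n, coverage)
    np_tol_conf(60)               # about 0.95 when the maximum of n = 60 is the limit
    np_tol_conf(20)               # considerably lower confidence for a smaller background sample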

     Multiple independent measurements are used to form either a mean or median confidence interval
for comparison  with the  background  GWPS. Preferably  at  least  4  distinct  compliance point
measurements  should  be used to define the  mean confidence interval  in the  parametric case, and 3-7
values should be used with a non-parametric median test. The guidance does not recommend retesting in
single-sample background GWPS compliance/assessment monitoring.  An implicit kind of retesting is
built into any test of a sample mean or median, as explained in Section 19.3.2.

      In essence, the background tolerance limit is used to set a somewhat higher mean target GWPS
which can  accommodate both compliance and corrective action testing under background conditions.
The GWPS in equation [7.6] can be interpreted as an approximation to the upper 95th percentile of the
background distribution. It is designed to be a reasonable maximum on the likely range of background
concentrations. It is high enough that compliance wells exceeding the GWPS via a confidence interval
test (i.e., LCL > GWPS) are probably impacted and not mere false positives. At the same time,
successful remedial efforts must show that concentrations at contaminated wells have decreased to levels
similar to background. The GWPS above represents an upper bound on background, but is not so low as
to make proof of remediation via an upper confidence limit [UCL] test against the GWPS impossible.

     To ensure  that the  GWPS in equation  [7.6] sets a reasonable target,  the Unified  Guidance
recommends that at least 8 to 10 background measurements (n) be utilized, and more if available.  If the
background sample is not normal, but can be normalized via a transformation, the tolerance limit should
be computed on the transformed measurements and the result back-transformed to obtain a limit in the
concentration scale (see Chapter 17 for further details).

     TRADEOFFS  IN BACKGROUND GWPS TESTING METHODS

     A two-sample GWPS approach offers a stricter test of background  exceedances.  There is also
greater flexibility in designing tests for  a variety of future comparison values (single with  repeat,  small
sample means, etc.).  The true test parameter is explicitly defined by the type of test chosen.   Non-
parametric upper prediction limit tests also allow for greater flexibility when data sets include significant
non-detect values or cannot be transformed to satisfy a normal distribution assumption. The approach suggested
in this section  accounts for the cumulative false positive error rate.

     One negative feature of two-sample GWPS testing is that the test hypotheses cannot be reversed
for corrective action monitoring. The trigger for compliance/assessment testing may also be quite small,
resulting in important consequences (the need to move to corrective action).  It may also be difficult to
demonstrate longer-term compliance following remedial activities, if the actual background is somewhat
elevated.

     Single-sample GWPS testing, by contrast, does allow for the reversal of test hypotheses. A
suitably defined, somewhat elevated GWPS takes into account background sample variability
and size. Cumulative false positive error rates for compliance or corrective action testing are not
considered, and standardized alpha error levels (0.1 or 0.05) can be used. Exceedances under compliance
monitoring also offer clear evidence of a considerable increase above background.

     But the somewhat arbitrary increase above background recommended for single-sample testing may
conflict with State anti-degradation policy. Defining the GWPS as a specific population parameter is
also somewhat arbitrary. Using the suggested guidance approach for defining the GWPS in equation
[7.6] above may result in very high values if the data are not normal (including logarithmic or non-
parametric  applications).  There is also less flexibility in identifying testing options,  especially with data
sets containing significant non-detect values.  Annual testing with quarterly sampling may be the only
realistic choice.

     A possible compromise might utilize both approaches.    That  is, initially apply the two-sample
approach for compliance/assessment testing. Then evaluate the single-sample approach with reversed
hypotheses. Some of the initially significant increases under the two-sample approach may also meet the
upper confidence limit when tested against the higher GWPS. Those well constituents that cannot
meet this limit can then be subjected to corrective action remediation and full post-treatment testing.
This implies that the background GWPS would be a range based on the two testing methods rather than
a single value.
     ►EXAMPLE 7-1

       A  facility has triggered a  significant increase under detection monitoring.   One hazardous
constituent (arsenic) was identified which must be tested against a background GWPS at six different
compliance wells, since background well levels were above the appropriate arsenic MCL of 10 µg/L.
Two semi-annual tests are required for compliance/assessment monitoring.  Assume that arsenic had
been detected in both background and downgradient wells, but was significantly higher in one  of the
compliance wells.   It must be determined whether any of the compliance wells have exceeded their
background GWPS,  and might require corrective action.

       Design a background GWPS monitoring system for the following arsenic data from the elevated
Well #1, consisting  of eight hypothetical historical intrawell background samples and four future annual
values  for two different simulated data distribution cases shown in the table below.  Sample means and
standard deviations are provided in the bottom row:

                         Compliance Well #1 Arsenic (µg/L)
                         Historical Well Data        Case 1        Case 2
                                 74.1                  61.5          95.0
                                 10.8                  58.7          73.4
                                 32.8                  76.8          73.3
                                 25.0                  81.3          90.0
                                 41.5
                                 41.0
                                 30.8
                                 40.0
                               x̄ = 37.0            x̄ = 69.58     x̄ = 82.93
                               s = 18.16            s = 11.15     s = 11.24

        Background values were randomly generated from a normal distribution with a true mean of μ =
40 and a population standard deviation of σ = 16. Case 1 future data were from a normal distribution
with a mean 1.5 times higher, while Case 2 data were from a normal distribution with a mean twice the
background true mean. Both cases used the same background population standard deviation. The intent
of these simulated values is to allow exploration  of both of the  Unified  Guidance recommended
background GWPS methods when background increases are relatively modest and sample sizes small.

       The two-sample background GWPS approach is first evaluated. Assume that the background
data are normal and stationary (no evidence of spatial or temporal variation and other forms of statistical
dependence).   Given a likely limit  of future quarterly sampling and required  semi-annual evaluations,
two guidance prediction limit options would seem appropriate: either a 1-of-2 future values test or a
1-of-1 future mean test of size 2, conducted twice a year. The 1-of-2 future values option is chosen.

        Since there are a total  of  6 compliance wells, one background constituent and two annual
evaluations, there are a total of 12 annual background tests to  be conducted.  Either the Unified
Guidance tables in Appendix D or R-script can be used to identify the appropriate prediction limit K-factor.
For the 1-of-2 future values test, K = 1.83 (found by interpolation from the second table on page
D-118), based on w = 6, COC = 1, and two tests per year. The calculated prediction limit using the
background data set statistics and K-factor is 70.2 µg/L, serving as the background GWPS.

       When the future values from the table above are tested against the GWPS, the following results
are obtained.  A "Pass"  indicates that the  compliance/assessment null hypothesis was achieved, while a
"Fail" indicates that the alternative hypothesis (the GWPS has been exceeded) is accepted.
                             Well #1 As Compliance Comparisons
                               1-of-2 Future Values Test (µg/L)

                         Case 1 (data)    Result      Case 2 (data)    Result
                             61.5                         95.0
                             58.7          Pass           73.4          Fail
                             76.8                         73.3
                             81.3          Fail           90.0          Fail

                                          GWPS = 70.2

        Both cases indicate at least one GWPS exceedance using the 1-of-2 future values tests. These
may be indications of a statistically significant increase above background, but the outcome for Case 1 is
somewhat troubling.  While a  50%  increase above background (based on the simulated population
parameters)  is potentially significant, more detailed power evaluations indicate that  such a detected
exceedance would only be expected about 24% of the time (using R-script power calculations with a Z-
value of 1.25 standard deviations above background for the 1-of-2 future values test). In contrast, the
2.5 Z-value  for Case 2 would be expected to be exceeded about 76% of the time.  In order to further
evaluate the  extent of significance of these results, the single-sample GWPS method is also considered.
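
     A minimal sketch of the computations behind this part of the example is given below; the K-factor
of 1.83 is taken from the Appendix D tables as described above rather than recomputed here.

    bg_mean <- 37.0; bg_sd <- 18.16; K <- 1.83
    gwps <- bg_mean + K * bg_sd                        # about 70.2 ug/L
    one_of_two <- function(pair, limit) {
      # The test fails only when the initial value and its resample both exceed the limit
      if (pair[1] <= limit || pair[2] <= limit) "Pass" else "Fail"
    }
    case1 <- list(c(61.5, 58.7), c(76.8, 81.3))        # two semi-annual events
    case2 <- list(c(95.0, 73.4), c(73.3, 90.0))
    sapply(case1, one_of_two, limit = gwps)            # "Pass" "Fail"
    sapply(case2, one_of_two, limit = gwps)            # "Fail" "Fail"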

       Following the guidance above, define the single-sample mean GWPS using equation [7.6] for the
upper 95%  confidence,  95% proportion tolerance limit.  Then apply upper  and lower normal mean
confidence interval tests of the Case 1 and 2 n = 4 sample data using equations [7.3] and [7.4].

        From Table 21-9 on page D-246, a τ-factor of 3.187 is used with the background mean and
standard deviation to generate the GWPS = 94.9. One-way upper and lower mean confidence levels are
evaluated at  90 or 95% confidence for the tests and compared to the fixed background GWPS.

        LCL test Pass/Fail results carry the same interpretation as in the two-sample compliance test above. However, a
"Pass" for the UCL test implies that the alternative hypothesis (less than the standard) is accepted while
a "Fail" implies greater than or equal to the GWPS under corrective action monitoring hypotheses:
            As Mean Confidence Interval Tests Against Background GWPS (µg/L)

                      90% LCL   Result    95% LCL   Result    90% UCL   Result    95% UCL   Result
     Case 1 Data        60.5     Pass       56.5     Pass       78.7     Pass       82.7     Pass
     Case 2 Data        73.7     Pass       69.7     Pass       92.1     Pass       96.2     Fail

                                          GWPS = 94.9

       For either chosen significance level, the Case 1 90% and 95% UCLs of 78.7 and 82.7 are below
the GWPS and the alternative corrective action hypothesis (the mean is less than the standard) can be
accepted.  For Case 2, the 90% UCL of 92.1 is below the GWPS, but the 95% UCL of 96.2 is above.  If
a higher level of test confidence is appropriate, the Case 2 arsenic values can be considered indicative of
the need for corrective action.
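
     A minimal sketch reproducing the confidence limits in the table above from the Case 1 and Case 2
sample statistics (n = 4) and comparing them to the single-sample GWPS:

    ci_limits <- function(x, conf) {
      n <- length(x); half <- qt(conf, n - 1) * sd(x) / sqrt(n)
      c(lcl = mean(x) - half, ucl = mean(x) + half)    # one-sided limits at the stated confidence
    }
    case1 <- c(61.5, 58.7, 76.8, 81.3)
    case2 <- c(95.0, 73.4, 73.3, 90.0)
    gwps  <- 94.9
    round(rbind(c1_90 = ci_limits(case1, 0.90), c1_95 = ci_limits(case1, 0.95),
                c2_90 = ci_limits(case2, 0.90), c2_95 = ci_limits(case2, 0.95)), 1)
    # Compliance test: LCL > gwps flags an exceedance; corrective action: UCL < gwps demonstrates
    # compliance (only the Case 2 95% UCL, about 96.2, fails that comparison)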

       If only the single-sample background GWPS approach were applied to the same data as above in
compliance/assessment monitoring tests, neither case's mean LCL would exceed the standard, and no
corrective action monitoring would be necessary.   However, it should be noted from the  example that
this approach does allow for a  significant increase  above the reference background level before any
action would be indicated. ◄

       The approaches provided above presume that well constituent data subject to background GWPS
testing are stationary over time.  If sampling data show evidence of a trend, the situation becomes more
complicated in making compliance or corrective action test decisions. Two- and single-sample stationary
scenarios for identifying standards  may not be appropriate.  Trend behavior  can be determined by
applying one of the methods provided in Chapter 17 (e.g., linear  regression or Mann-Kendall trend
tests) to historical data. A significant increasing slope can be  indicative of a background exceedance,
although it should be clear that the increase  is not  due to natural conditions.  A decreasing or non-
significant  slope can be considered evidence for  compliance  with  historical background.  The  most
problematic standard would be setting an eventual background  target for  compliance testing under
corrective action.  To a great extent, it will depend on site-specific conditions including the behavior of
the specific constituents subject to remediation. A background GWPS might be determined following the
period of remediation and monitoring when aquifer conditions have hopefully stabilized.
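
     A minimal sketch of the Mann-Kendall statistic for a single time-ordered series is shown below (no
adjustment for ties, non-detects, or seasonality; Chapter 17 gives the full procedure and exact critical
values for small n). The data are hypothetical.

    mann_kendall <- function(y) {
      n <- length(y)
      d <- sign(outer(y, y, "-"))                    # d[i, j] = sign(y[i] - y[j])
      S <- sum(d[lower.tri(d)])                      # sum over later-minus-earlier differences
      V <- n * (n - 1) * (2 * n + 5) / 18            # variance of S when there are no ties
      Z <- if (S > 0) (S - 1) / sqrt(V) else if (S < 0) (S + 1) / sqrt(V) else 0
      c(S = S, Z = Z, p_one_sided = 1 - pnorm(Z))    # small p suggests an increasing trend
    }
    y <- c(4.1, 4.6, 4.3, 5.0, 5.4, 5.1, 5.9, 6.2)   # hypothetical time-ordered concentrations
    mann_kendall(y)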

       Setting and applying background GWPSs have not received a great deal of attention in previous
guidance.  The discussions and example above help illustrate the somewhat difficult regulatory choices
that need to be made. A regulatory agency needs to determine what levels, if any, above background can
be considered acceptable.  A further consideration is the degree of importance  placed on background
GWPS  exceedances, particularly when tested along with constituents  having health-based limits.
Existing regulatory programs may have already developed procedures to deal with many  of the issues
discussed in this section.

   CHAPTER 8.   SUMMARY  OF  RECOMMENDED METHODS

       8.1   SELECTING THE RIGHT STATISTICAL METHODS 	 8-1
        8.2   TABLE 8-1 INVENTORY OF RECOMMENDED METHODS	8-4
       8.3   METHOD SUMMARIES	8-9
     This chapter provides a quick guide to the statistical procedures discussed within the Unified
Guidance. The first section is a basic road map designed to encourage the user to ask a series of key
questions. The other sections offer thumbnail sketches of each method and a matrix of options to help in
selecting the right procedure, depending on site-specific characteristics and constraints.

8.1 SELECTING THE RIGHT  STATISTICAL METHODS

     Choosing  appropriate statistical methods  is  important in developing a sound groundwater
monitoring statistical program. The statistical test(s) should be  selected  to match  basic site-specific
characteristics such as number and configuration of wells, the water quality constituents being measured,
and general hydrology. Statistical methods should  also be selected with reference to the statistical
characteristics of the monitored  parameters — proportion  of non-detects, type of concentration
distribution (e.g., normal, lognormal), presence or absence of spatial variability, etc.

     Because site conditions  and permit  requirements  vary  considerably,  no single  "cookbook"
approach is readily available to select the right statistical method. The best strategy is to consider site-
specific conditions and ask a series of questions. A table of recommended options (Table  8-1) and
summary descriptions is presented in Section 8.2 to help select an appropriate basic approach.

     The first question is: what stage of monitoring is required?  Detection monitoring is the first stage
of any  groundwater monitoring program and typically involves comparisons between measurements of
background and compliance point groundwater. Most of the methods described in this document (e.g.,
prediction limits, control charts, tests for  trend,  etc.) are designed  for facilities engaged in detection
monitoring. However, it must be determined whether an interwell (e.g., upgradient-to-downgradient) or
an intrawell test is warranted. This entails consideration of the site hydrology, constituent detection rates,
and deciding whether separate (upgradient) wells or past intrawell data serves  as the most appropriate
and representative background.

     Compliance/assessment monitoring is required for facilities  that no longer meet the requirements
of a detection monitoring program by  exhibiting statistically significant indications of a release to
groundwater. Once in compliance/assessment,  compliance point measurements are typically tested
against a fixed GWPS. Examples of fixed standards include Maximum Concentration Limits [MCL],
risk-derived limits  or a single limit derived from background data. The most appropriate statistical
method for tests against GWPS is a lower confidence limit. The type of confidence limit will depend on
whether the regulatory standard represents an average concentration; an absolute maximum, ceiling, or
upper percentile; or whether the compliance data exhibit a trend over time.

     In cases where no fixed GWPS is specified for  a particular constituent, compliance point data may
be directly compared against background data. In this situation, the most appropriate statistical method is
one or another detection monitoring two- or multiple-sample tests using the critical design limit as the
GWPS (discussed in Section 7.5).

     Corrective action is reserved for facilities where evidence of a groundwater release is confirmed
above a GWPS.  In these situations, the facility is required to submit an appropriate remediation plan to
the Regional Administrator and to institute steps to ensure adequate containment and/or clean-up of the
release. Remediation  of groundwater can be very costly and also difficult to measure. EPA has not
adopted a uniform approach in the setting of clean-up standards or how one should determine whether
those clean-up standards have been attained. Some guidance on this issue is given in the EPA document,
Methods for Evaluating the Attainment of Cleanup Standards, Volume II: Groundwater (EPA, 1992).

     The null  hypothesis  in corrective  action  testing  is reversed  from  that  of detection  and
compliance/assessment monitoring. Not only is it assumed that contamination is above the compliance
or clean-up standard, but corrective action should continue until the average concentration level is below
the clean-up limit for periods specified in the regulations. For any fixed-value standard (e.g., the GWPS
or a remediation goal) a  reasonable  and consistent statistical test for  corrective action is  an upper
confidence limit. The type of confidence limit will  depend on whether the data have a stable mean
concentration or exhibit a trend over time. For those well constituents requiring remediation,  there will
be a period of activity before formal testing can take place. A number of statistical techniques (e.g., trend
testing) can be applied to the data collected in this interim period to gauge prospects for eventual GWPS
compliance. Section 7.5 describes corrective action testing limitations involving a two-sample GWPS.

     Another  major question involves the statistical  distribution most appropriate to the observed
measurements. Parametric tests are those which assume the underlying population follows a known and
identifiable distribution, the most common examples in groundwater monitoring being the normal and
the lognormal. If a specific distribution cannot be determined, non-parametric test methods can be used.
Non-parametric tests do not require a known statistical distribution and can be helpful when the  data
contain a substantial  proportion of non-detects. All of the parametric tests described in the Unified
Guidance, except  for  control charts, have non-parametric counterparts that  can be used  when the
underlying distribution is uncertain or difficult to test.

     A special consideration in fitting distributions is the presence of non-detects, also known as  left-
censored measurements. As long as a sample contains a small fraction of non-detects  (i.e., no more than
10-15%), simple substitution of half the reporting limit [RL] is generally adequate. If the proportion of
non-detects is  substantial,  it may be difficult or impossible to determine whether a specific parametric
distributional model provides a good fit to the data. For some tests, such as the t-test, one can switch to a
non-parametric test with little loss of power or accuracy. Non-parametric interval tests, however,  such as
prediction and tolerance  limits,  require  substantially  more  data before providing statistical  power
equivalent to parametric intervals. Partly because of this  drawback,  the Unified Guidance  discusses
methods to  adjust datasets with  significant fractions of non-detects so that parametric distributional
models may still be used (Chapter 15).

     The Unified Guidance now recommends  a single, consistent Double Quantification rule approach
for handling constituents that have either never been detected or have not been recently detected. Such
constituents are not  included in cumulative annual  site-wide  false positive error rate  [SWFPR]
computations; and no special adjustment for non-detects is necessary. Any confirmed quantification (i.e.,
two  consecutive  detections above  the  RL)  at a  compliance point provides sufficient evidence of
groundwater contamination by that parameter.

     A key question when picking a test for detection monitoring is whether traditional background-to-
downgradient interwell or single-well intrawell tests are appropriate. If intrawell testing is  appropriate,
historical measurements form the individual compliance well's own background while future values are
tested  against these  data. Intrawell tests eliminate any  natural spatial differences among monitoring
wells.  They  can also be used when the groundwater flow gradient is uncertain  or unstable, since all
samples being tested  are collected from the same well.

     Possible disadvantages to intrawell tests also need to be considered. First, if the compliance well
has already been impacted, intrawell background will also be impacted.  Such contaminated  background
may provide  a skewed comparison to later data  from the same well, making it difficult to identify
contaminated groundwater in the  future. Secondly, if intrawell background is constructed from only a
few  early measurements, considerable  time  may be needed to accumulate a sufficient number of
background observations (via periodic updating) to run a statistically powerful test.

     If a compliance well has already been impacted by previous contamination,  trend testing can still
indicate whether conditions  have deteriorated since intrawell background was  collected.  For sites
historically contaminated above background, the only way to effectively monitor compliance wells may
be to establish an historical intrawell baseline and measure increases  above this baseline.

     Besides  trend  tests, techniques  recommended  for intrawell  comparisons  include  intrawell
prediction limits, control charts, and sometimes the Wilcoxon rank-sum test. The best choice between
these methods is not always clear.  Since  there is no non-parametric counterpart to control charts, the
choice will depend on whether the data are normal or can be normalized via a transformation. New
guidance for control  charts shows they also can be designed to incorporate retesting. For  sites with a
large number of well-constituent pairs, intrawell prediction  limits can incorporate retesting to  meet
specific site-wide false positive rate and statistical  power characteristics. Parametric intrawell prediction
limits  can  be used with  background that is  normal or transformable to normality; non-parametric
versions can also be applied for many other data sets.

     If interwell,  upgradient-to-downgradient tests are  appropriate, the choice  of  statistical method
depends primarily  on the number of compliance wells and  constituents being monitored, the number of
observations available from each of these wells, and the detection rates and distributional properties of
these parameters. If a very small number of comparisons must be tested (i.e., two or three  compliance
wells versus background, for one or two constituents), a t-test or Wilcoxon rank-sum test may be
appropriate if there are a sufficient number  of compliance measurements (i.e., at least two per well).

     For other  cases, the Unified  Guidance recommends a prediction limit or control chart constructed
from background. Whenever more than  a  few  statistical  tests  must be  run,  retesting should be
incorporated into the procedure. If multiple observations per compliance well can be collected during a
given evaluation period, either a prediction limit for 'future'  observations,  a prediction limit for means
or medians, or  a control chart can be considered, depending  on which option best achieves statistical
power  and SWFPR targets, while balancing the site-specific costs and feasibility of sampling. If only one
observation per compliance well can  be  collected per evaluation, the only practical  choices are a
prediction limit for individual observations or a control chart.

8.2 TABLE 8-1 INVENTORY OF RECOMMENDED METHODS

Chapter 9. Exploratory Tools

  Time Series Plot (§9.1): Plot of measurement levels over time; Useful for assessing trends, data
    inconsistencies, etc.
  Box Plot (§9.2): Graphical summary of sample distribution; Useful for comparing key statistical
    characteristics in multiple wells
  Histogram (§9.3): Graphical summary of sample distribution; Useful for assessing probability density of
    single data set
  Scatter Plot (§9.4): Diagnostic tool; Plot of one variable vs. another; Useful for exploring statistical
    associations
  Probability Plot (§9.5): Graphical fit to normality; Useful for raw or transformed data

Chapter 10. Fitting Distributions

  Skewness Coefficient (§10.4): Measures symmetry/asymmetry in distribution; Screening level test for
    plausibility of normal fit
  Coefficient of Variation (§10.4): Measures symmetry/asymmetry in distribution; Screening tool for
    plausibility of normal fit; Only for non-negative data
  Shapiro-Wilk Test (§10.5.1): Numerical normality test of a single sample; for n < 50
  Shapiro-Francia Test (§10.5.2): Numerical test of normality for a single sample; Supplement to
    Shapiro-Wilk; Use with n > 50
  Filliben's Probability Plot Correlation Coefficient (§10.6): Numerical test of normality for a single
    sample; Interchangeable with Shapiro-Wilk; Use with n < 100; Good supplement to probability plot
  Shapiro-Wilk Multiple Group Test (§10.7): Extension of Shapiro-Wilk test for multiple samples with
    possibly different means and/or variances; Good check to use with Welch's t-test

Chapter 11. Equality of Variance

  Box Plots (side-by-side) (§11.1): Graphical test of differences in population variances; Good screening
    tool for equal variance assumption in ANOVA
  Levene's Test (§11.2): Numerical, robust ANOVA-type test of equality of variance for > 2 populations;
    Useful for testing assumptions in ANOVA
  Mean-SD Scatter Plot (§11.3): Visual test of association between SD and mean levels across group of
    wells; Use to check for proportional effect or if variance-stabilizing transformation is needed

Chapter 12. Outliers

  Probability Plot (§12.1): Graphical fit of distribution to normality; Useful for identifying extreme
    points not coinciding with predicted tail of distribution
  Box Plot (§12.2): Graphical screening tool for outliers; quasi-non-parametric, only requires rough
    symmetry in distribution
  Dixon's Test (§12.3): Numerical test for single low or single high outlier; Use when n < 25
  Rosner's Test (§12.4): Numerical test for up to 5 outliers in single dataset; Recommended when n > 20;
    User must identify a specific number of possible outliers before running

Chapter 13. Spatial Variation

  Box Plots (side-by-side) (§13.2.1): Quick screen for spatial variability; Look for noticeably staggered
    boxes
  One-Way Analysis of Variance [ANOVA] for Spatial Variation (§13.2.2): Test to compare means of several
    populations; Use to identify spatial variability across a group of wells and to estimate pooled
    (background) standard deviation for use in intrawell tests; Data must be normal or normalized;
    Assumption of equal variances across populations

Chapter 14. Temporal Variability

  Time Series Plot (parallel) (§14.2.1): Quick screen for temporal (and/or spatial) variation; Look for
    parallel movement in the graph traces at several wells over time
  One-way ANOVA for Temporal Effects (§14.2.2): Test to compare means of distinct sampling events, in
    order to assess systematic temporal dependence across wells; Use to get better estimate of
    (background) variance and degrees of freedom in data with temporal patterns; Residuals from ANOVA
    also used to create stationary, adjusted data
  Sample Autocorrelation Function (§14.2.3): Plot of autocorrelation by lag between sampling events;
    Requires approximately normal data; Use to test for temporal correlation and/or to adjust sampling
    frequency
  Rank von Neumann Ratio (§14.2.4): Non-parametric numerical test of dependence in time-ordered data
    series; Use to test for first-order autocorrelation in data from single well or population
  Darcy Equation (§14.3.2): Method to approximate groundwater flow velocity; Use to determine sampling
    interval guaranteeing physical independence of consecutive groundwater samples; Does not ensure
    statistical independence
  Seasonal Adjustment (single well) (§14.3.3): Method to adjust single data series exhibiting seasonal
    correlations (i.e., cyclical fluctuations); At least 3 seasonal cycles must be evident on time series
    plot
  Temporally-Adjusted Data Using ANOVA (§14.3.3): Method to adjust multiple wells for a common temporal
    dependence; Use adjusted data in subsequent tests
  Seasonal Mann-Kendall Test (§14.3.4): Extension of Mann-Kendall trend test when seasonality is present;
    At least 3 seasonal cycles must be evident

Chapter 15. Managing Non-Detect Data

  Simple Substitution (§15.2): Simplest imputation scheme for non-detects; Useful when < 10-15% of
    dataset is non-detect
  Censored Probability Plot (§15.3): Probability plot for mixture of non-detects and detects; Use to check
    normality of left-censored sample
  Kaplan-Meier (§15.3): Method to estimate mean and standard deviation of left-censored sample; Use when
    < 50% of dataset is non-detect; Multiple detects and non-detects must originate from same distribution
  Robust Regression on Order Statistics (§15.4): Method to estimate mean and standard deviation of
    left-censored sample; Use when < 50% of dataset is non-detect; Multiple detects and non-detects must
    originate from same distribution
  Cohen's Method and Parametric Regression on Order Statistics (§15.5): Other methods to estimate mean
    and standard deviation of left-censored sample; Use when < 50% of dataset is non-detect; Detects and
    non-detects must originate from same distribution and there must be a single censoring limit

Chapter 16. Two-sample Tests

  Pooled Variance t-Test (§16.1.1): Test to compare means of two populations; Data must be normal or
    normalized, with no significant spatial variability; Useful at very small sites in
    upgradient-to-downgradient comparisons; Also useful for updating background; Population variances
    must be equal
  Welch's t-Test (§16.1.2): Test to compare means of two populations; Data must be normal or normalized,
    with no significant spatial variability; Useful at very small sites in interwell comparisons; Also
    useful for updating background; Population variances can differ
  Wilcoxon Rank-Sum Test (§16.2): Non-parametric test to compare medians of two populations; Data need
    not be normal; Some non-detects OK; Should have no significant spatial variability; Useful at very
    small sites in interwell comparisons and for certain intrawell comparisons; Also useful for updating
    background
  Tarone-Ware Test (§16.3): Extension of Wilcoxon rank-sum; non-parametric test to compare medians of two
    populations; Data need not be normal; Designed to accommodate left-censored data; Should have no
    significant spatial variability; Useful at very small sites in interwell comparisons and for certain
    intrawell comparisons; Also useful for updating background

Chapter 17. ANOVA, Tolerance Limits, & Trend Tests

  One-Way ANOVA (§17.1.1): Test to compare means across multiple populations; Data must be normal or
    normalized; Should have no significant spatial variability if used as interwell test; Assumes equal
    variances; Mandated in some permits, but generally superseded by other tests; Useful for identifying
    spatial variation; RMSE from ANOVA can be used to improve intrawell background limits
  Kruskal-Wallis Test (§17.1.2): Test to compare medians across multiple populations; Data need not be
    normal; some non-detects OK; Should have no significant spatial variability if used as interwell
    test; Useful alternative to ANOVA for identifying spatial variation
  Tolerance Limit (§17.2.1): Test to compare background vs. > 1 compliance well; Data must be normal or
    normalized; Should have no significant spatial variability if used as interwell test; Alternative to
    ANOVA; Mostly superseded by prediction limits; Useful for constructing alternate clean-up standard in
    corrective action
  Non-parametric Tolerance Limit (§17.2.2): Test to compare background vs. > 1 compliance well; Data need
    not be normal; Non-Detects OK; Should have no significant spatial variability if used as interwell
    test; Alternative to Kruskal-Wallis; Mostly superseded by prediction limits
  Linear Regression (§17.3.1): Parametric estimate of linear trend; Trend residuals must be normal or
    normalized; Useful for testing trends in background or at already contaminated wells; Can be used to
    estimate linear association between two random variables
  Mann-Kendall Trend Test (§17.3.2): Non-parametric test for linear trend; Non-detects OK; Useful for
    documenting upward trend at already contaminated wells or where trend already exists in background
  Theil-Sen Trend Line (§17.3.3): Non-parametric estimate of linear trend; Non-detects OK; Useful for
    estimating magnitude of an increasing trend in conjunction with Mann-Kendall test

Chapter 18. Prediction Limit Primer

Prediction Limit for m Future Values (§18.2.1): Test to compare m measurements from compliance well
   against background; Data must be normal or normalized; Useful in retesting schemes; Can be adapted
   to either intrawell or interwell tests; No significant spatial variability allowed if used as
   interwell test

Prediction Limit for Future Mean (§18.2.2): Test to compare mean of compliance well against
   background; Data must be normal or normalized; Useful alternative to traditional ANOVA; Can be
   useful in retesting schemes; Most useful for interwell (e.g., upgradient to downgradient)
   comparisons; No significant spatial variability allowed if used as interwell test

Non-Parametric Prediction Limit for m Future Values (§18.3.1): Non-parametric test to compare m
   measurements from compliance well against order statistics of background; Non-normal data and/or
   non-detects OK; Useful in non-parametric retesting schemes; Should have no significant spatial
   variability if used as interwell test

Non-Parametric Prediction Limit for Future Median (§18.3.2): Test to compare median of compliance
   well against order statistics of background; Non-normal data and/or non-detects OK; Useful in
   non-parametric retesting schemes; Most useful for interwell (e.g., upgradient to downgradient)
   comparisons; No significant spatial variability allowed if used as interwell test

Chapter 19. Prediction Limit Strategies with Retesting

Prediction Limits for Individual Observations With Retesting (§19.3.1): Tests individual compliance
   point measurements against background; Data must be normal or normalized; Assumes common population
   variance across wells; No significant spatial variability allowed if used as interwell test;
   Replacement for traditional ANOVA, extends Dunnett's multiple comparison with control (MCC)
   procedure; Allows control of SWFPR across multiple well-constituent pairs; Retesting explicitly
   incorporated; Useful at any size site

Prediction Limits for Means With Retesting (§19.3.2): Tests compliance point means against
   background; Data must be normal or normalized; Assumes common population variance across wells; No
   significant spatial variability allowed if used as interwell test; Replacement for traditional
   ANOVA, extends Dunnett's multiple comparison with control (MCC) procedure; More flexible than a
   series of intrawell t-tests if used as intrawell test; Allows control of SWFPR across multiple
   well-constituent pairs; Must be feasible to collect ≥ 2 resamples per evaluation period to
   incorporate retesting; 1-of-1 scheme does not require explicit retesting

Non-Parametric Prediction Limits for Individual Observations With Retesting (§19.4.1): Non-parametric
   test of individual compliance point observations against background; Non-normal data and/or
   non-detects OK; No significant spatial variability allowed if used as interwell test; Retesting
   explicitly incorporated; Large background sample size helpful

Non-Parametric Prediction Limits for Medians With Retesting (§19.4.2): Non-parametric test of
   compliance point medians against background; Non-normal data and/or non-detects OK; No significant
   spatial variability allowed if used as interwell test; Large background sample size helpful; Must
   be feasible to collect ≥ 3 resamples per evaluation period to incorporate retesting; 1-of-1 scheme
   does not require explicit retesting
Chapter 20. Control Charts

Shewhart-CUSUM Control Chart (§20.2): Graphical test of significant increase above background; Data
   must be normal or normalized; Some non-detects OK if left-censored adjustment made; At least 8
   background observations recommended; Viable alternative to prediction limits; Retesting can be
   explicitly incorporated; Control limits can be set via published literature or Monte Carlo
   simulation
Chapter 21. Confidence Intervals

Confidence Interval Around Normal Mean (§21.1.1): Data must be normal; Some non-detects OK if
   left-censored adjustment made; Used in compliance/assessment or corrective action to compare
   compliance well against fixed, mean-based groundwater standard; Should be no significant trend; 4
   or more observations recommended

Confidence Interval Around Lognormal Geometric Mean (§21.1.2): Data must be lognormal; Some
   non-detects OK if left-censored adjustment made; Used in compliance/assessment or corrective action
   to compare compliance well against fixed, mean-based groundwater standard; Should be no significant
   trend; 4 or more observations recommended; Geometric mean equivalent to lognormal median, smaller
   than lognormal mean

Confidence Interval Around Lognormal Arithmetic Mean (§21.1.3): Data must be lognormal; Some
   non-detects OK if left-censored adjustment made; Used in compliance/assessment or corrective action
   to compare compliance well against fixed, mean-based groundwater standard; Should be no significant
   trend; 4 or more observations recommended; Lognormal arithmetic mean larger than lognormal
   geometric mean

Confidence Interval Around Upper Percentile (§21.1.4): Data must be normal or normalized; Some
   non-detects OK if left-censored adjustment made; Used in compliance/assessment to compare
   compliance well against percentile-based or maximum groundwater standard; Should be no significant
   trend

Non-Parametric Confidence Interval Around Median (§21.2): For non-normal, non-lognormal data;
   Non-detects OK; Used in compliance/assessment or corrective action to compare compliance well
   against fixed, mean-based groundwater standard; Should be no significant trend; 7 or more
   observations recommended

Non-Parametric Confidence Interval Around Upper Percentile (§21.2): For non-normal, non-lognormal
   data; Non-detects OK; Used in compliance/assessment or corrective action to compare compliance well
   against percentile-based or maximum groundwater standard; Should be no significant trend; Large
   background sample size helpful

Confidence Band Around Linear Regression (§21.3.1): Use on data with significant trend; Trend
   residuals must be normal or normalized; Used in compliance/assessment or corrective action to
   compare compliance well against fixed groundwater standard; ≥ 8 observations recommended

Non-parametric Confidence Band Around Theil-Sen Line (§21.3.2): Use on data with significant trend;
   Non-normal data and/or non-detects OK; Used in compliance/assessment or corrective action to
   compare compliance well against fixed groundwater standard; Bootstrapping of Theil-Sen trend line
   used to construct confidence band

8.3  METHOD SUMMARIES

       TIME SERIES PLOT (SECTIONS 9.1 AND  14.2.1)
Basic purpose:  Diagnostic  and exploratory tool. It is  a  graphical technique to display changes in
   concentrations at one or more wells over a specified period of time or series of sampling events.

Hypothesis tested:  Not a formal statistical  test. Time series plots can be used to informally gauge the
   presence of temporal  and/or spatial variability in a collection of distinct wells sampled during the
   same time frame.

Underlying assumptions: None.

When to use: Given a collection of wells with several  sampling events recorded at each well, a time
   series plot can provide information not only on whether the mean concentration level changes from
   well to well (an indication of possible spatial variation),  but also on whether there exists time-related
   or temporal dependence in the data.  Such temporal dependence can be seen in parallel movement on
   the time series plot, that is, when several wells exhibit the same pattern of up-and-down fluctuations
   over time.

Steps involved:  1) For each well, make a plot of concentration against time or date of sampling for the
   sampling events that occurred during the specified time  period; 2) Make sure each well is identified
   on the plot with a distinct symbol and/or connected line pattern (or trace); 3) To observe possible
   spatial  variation,  look  for  well traces that are substantially  separated  from one  another in
   concentration level; 4) To look for temporal dependence,  look for well traces that rise and fall
   together in roughly the same (parallel) pattern; 5) To ensure that artificial trends due to changing
   reporting limits are not reported, plot any non-detects with a distinct symbol, color, and/or fill.

Advantages/Disadvantages: Time  series  plots are an excellent tool for examining the behavior of one
   or more samples over time. Although they do not offer the compact summary of distributional
   characteristics that, say, box plots do, time series plots display each and every data point and provide
   an excellent  initial indication  of temporal dependence. Since temporal  dependence affects the
   underlying variability  in the data, its identification is important so adjustments can be made to the
   estimated standard deviation.
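
   For illustration only, a time series plot of this kind can be generated with general-purpose
   plotting software. The sketch below uses Python with the matplotlib and numpy libraries; the well
   names, sampling events, concentrations, and non-detect flags are hypothetical and are not drawn
   from the guidance.

    # Sketch: multi-well time series plot; all data values below are hypothetical
    import matplotlib.pyplot as plt
    import numpy as np

    events = np.arange(1, 9)                                  # eight sampling events
    wells = {
        "MW-1": [10.2, 11.0, 9.8, 12.1, 11.5, 10.9, 12.4, 11.8],
        "MW-2": [14.6, 15.1, 13.9, 16.0, 15.2, 14.8, 16.3, 15.7],
        "MW-3": [8.1, 8.9, 7.7, 9.5, 9.0, 8.6, 10.1, 9.4],
    }
    nondetects = {"MW-3": [False, False, True, False, False, True, False, False]}

    fig, ax = plt.subplots()
    for well, conc in wells.items():
        ax.plot(events, conc, marker="o", label=well)         # one trace per well
        if well in nondetects:                                # flag non-detects distinctly
            nd = np.array(nondetects[well])
            ax.plot(events[nd], np.array(conc)[nd], "s", mfc="none", mec="red")
    ax.set_xlabel("Sampling event")
    ax.set_ylabel("Concentration (ug/L)")
    ax.legend()
    plt.show()

   Parallel up-and-down movement of the well traces would suggest temporal dependence, while
   consistently separated traces would suggest spatial variation.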

       Box PLOT (SECTIONS 9.2,  12.2, AND 13.2.1)
Basic purpose: Diagnostic and exploratory tool. Graphical summary of data distribution; gives compact
   picture of central tendency and dispersion.

Hypothesis tested: Although not a formal statistical test, a side-by-side box plot of multiple datasets can
   be used as  a rough  indicator of either  unequal   variances or  spatial  variation  (via  unequal
   means/medians). Also serves as a quasi-non-parametric screening tool for outliers in a symmetric
   population.

Underlying assumptions: When used to screen outliers,  underlying population should be approximately
   symmetric.

When to use: Can be  used as a  quick  screen  in  testing for unequal variances across multiple
   populations. Box lengths indicate the range of the central 50% of sample data values. Substantially
   different box  lengths suggest possibly  different  population  variances.  It is useful  as a rough
   indication of spatial variability across multiple well  locations. Since the median (and often the mean)
   are graphed on each box, significantly staggered medians and/or means on a multiple  side-by-side
   box plot can suggest possibly different population means at distinct well locations. Can  also be used
   to screen for outliers: values falling beyond the 'whiskers' on the box plot are labeled as potential
   outliers.

Steps  involved:  1)  Compute  the median,  mean,  lower and  upper  quartiles (i.e., 25th and 75th
   percentiles) of each dataset; 2) Graph each set of summary statistics side-by-side on the same set of
   axes.  Connect the lower and upper quartiles as the ends of a box, cut the box in two with a line at the
   median,  and use an  'X'  or other symbol  to represent the  mean.  3) Compute the  'whiskers' by
   extending lines below and above the box by an amount  equal to 1.5 times the interquartile range
   [IQR].

Advantages/Disadvantages:  The box plot is an excellent screening tool and  visual aid in diagnosing
   either unequal variances for testing the assumptions of ANOVA, the possible presence of spatial
   variability, or  potential outliers. It is not  a formal statistical test, however, and should  generally be
   used in conjunction with numerical test procedures.
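
   As a purely illustrative sketch, side-by-side box plots with 1.5 x IQR whiskers and plotted means
   can be produced in Python with matplotlib as follows; the well data are hypothetical.

    # Sketch: side-by-side box plots for several wells; data are hypothetical
    import matplotlib.pyplot as plt

    well_data = {
        "MW-1": [10.2, 11.0, 9.8, 12.1, 11.5, 10.9, 12.4, 11.8],
        "MW-2": [14.6, 15.1, 13.9, 16.0, 15.2, 14.8, 16.3, 15.7],
        "MW-3": [8.1, 8.9, 7.7, 9.5, 9.0, 8.6, 10.1, 9.4],
    }

    fig, ax = plt.subplots()
    ax.boxplot(list(well_data.values()),
               labels=list(well_data.keys()),
               whis=1.5,          # whiskers extend 1.5 times the interquartile range
               showmeans=True)    # mark the mean as well as the median
    ax.set_ylabel("Concentration (ug/L)")
    plt.show()

   Substantially different box lengths or staggered medians across the wells would then be followed
   up with formal procedures such as Levene's test or ANOVA.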

       HISTOGRAM (SECTION 9.3)
Basic purpose: Diagnostic and exploratory tool. It is a graphical summary of an entire data distribution.

Hypothesis tested: Not a formal statistical test.

Underlying assumptions: None.

When to use: Can be used as a rough estimate of the  probability density of a single sample. Shape of
   histogram helps determine whether the distribution is symmetric or  skewed.  For larger data sets,
   histogram can be visually compared to a normal distribution or other known model to assess whether
   the shapes are similar.

Steps involved: 1) Sort and bin the data set into non-overlapping concentration segments that span the
   range of measurement values; 2)  Create a bar chart of the bins created in Step 1: put the height of
   each bar equal to the number or fraction of values falling into each bin.

Advantages/Disadvantages:  The histogram is a good visual aid in exploring possible distributional
   models that might be appropriate. Since it is not a formal test, there is no way to judge possible
   models solely  on the basis of the histogram; however, it provides a visual 'feel' for a data set.

       SCATTER PLOT (SECTION 9.4)
Basic purpose: Diagnostic tool. It is a graphical method to explore the association between two random
   variables or two paired statistical samples.

Hypothesis tested: None



Underlying Assumptions: None.

When to use: Useful as an exploratory tool  for discovering or identifying statistical  relationships
   between pairs of variables. Graphically illustrates the degree of correlation or association between
   two quantities.

Steps involved: Using Cartesian pairs of the variables of interest, graph each pair on the scatter plot,
   using one symbol per pair.

Advantages/Disadvantages: A scatter plot is not a formal test, but rather an excellent exploratory tool.
   Helps identify statistical relationships.

       PROBABILITY PLOT (SECTIONS 9.5 AND 12.1)
Basic purpose: Diagnostic tool.  A graphical method to compare a dataset against a particular statistical
   distribution,  usually the normal. Designed  to  show how well the data match up to or 'fit' the
   hypothesized distribution.  An  absolutely straight line  fit indicates perfect  consistency with the
   hypothesized model.

Hypothesis tested: Although not a formal test,  the probability plot can be used to graphically indicate
   whether a dataset is normal. The straighter the plot, the more consistent the dataset with a  null
   hypothesis of normality; significant curves, bends, or other non-linear patterns suggest a rejection of
   the normal model as a poor fit.

Underlying Assumptions: All observations come from a single statistical population.

When to use: Can be used as a  graphical indication of normality on a set  of raw measurements or, by
   first making a transformation, as an indication of normality on  the transformed  scale.  It should
   generally be supplemented by a formal  numerical test of normality. It can be used  on  the residuals
   from  a one-way ANOVA to test the joint normality of the groups being compared. The test can  also
   be used to help identify  potential  outliers (i.e.,  individual  values not  part of  the  same basic
   underlying population).

Steps  involved:  1)  Order the dataset  and determine matching percentiles  (or quantiles)  from the
   hypothesized distribution (typically the standard normal); 2) Plot the ordered data values against the
   matching percentiles; 3) Examine the plot for a straight line fit.

Advantages/Disadvantages:  Not  a formal test of  normality;  however,  the probability plot is an
   excellent graphical supplement to any  goodness-of-fit test. Because each data value  is depicted,
   specific  departures from normality can be  identified (e.g.,  excessive  skewness, possible outliers,
   etc.).
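
   For illustration, a normal probability plot and its straight-line fit can be examined with
   Python's SciPy library as sketched below; the concentration values are hypothetical, and
   scipy.stats.probplot returns the correlation of the plotted points along with the matched normal
   quantiles.

    # Sketch: normal probability plot via SciPy; data values are hypothetical
    import numpy as np
    from scipy import stats

    data = np.array([3.1, 4.7, 2.8, 5.9, 4.2, 3.6, 7.4, 4.9, 3.3, 5.1])

    (quantiles, ordered), (slope, intercept, r) = stats.probplot(data, dist="norm")
    print("probability plot correlation:", round(r, 3))   # near 1 suggests a straight-line fit

    # Checking a log transformation the same way:
    (_, _), (_, _, r_log) = stats.probplot(np.log(data), dist="norm")
    print("correlation on log scale:", round(r_log, 3))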

       SKEWNESS COEFFICIENT (SECTION 10.4)
Basic purpose: Diagnostic tool. Sample statistic  designed  to measure the degree of symmetry  in a
   sample. Because the normal distribution is perfectly symmetric, the skewness coefficient can provide
   a quick indication of whether a given dataset is symmetric enough to be consistent with the normal
   model.  Skewness coefficients close to zero are consistent with normality;  skewness values large in
   absolute value suggest  the underlying population is asymmetric and non-normal.


Hypothesis tested: The skewness coefficient is used in groundwater monitoring as a screening tool
   rather than a formal hypothesis test. Still, it can be used to roughly test whether a given sample is
   normal  by using the following rule of thumb: if the skewness coefficient is no greater than one in
   absolute value, accept a null hypothesis of normality; if not, reject the normal model as ill-fitting.

Underlying Assumptions: None

Steps involved:  1) Compute skewness coefficient; 2) Compare to cutoff of 1;  3) If skewness is greater
   than 1, consider running a formal test of normality.

Advantages/Disadvantages: Fairly simple calculation, good screening tool. Skewness coefficient can
   be  positive or negative,  indicating positive  or negative  skewness in the dataset, respectively.
   Measures symmetry rather than normality,  per se; since other non-normal  distributions can also be
   symmetric, might give a misleading result.  Not as powerful or accurate a test of normality as either
   the Shapiro-Wilk or Filliben tests, but a more accurate indicator than  the coefficient of variation,
   particularly for data on a transformed scale.
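
   For illustration, the skewness coefficient and the rule of thumb above can be applied with SciPy
   as in the following sketch; the data values are hypothetical.

    # Sketch: skewness coefficient screen for normality; data values are hypothetical
    import numpy as np
    from scipy import stats

    data = np.array([3.1, 4.7, 2.8, 5.9, 4.2, 3.6, 7.4, 4.9, 3.3, 5.1])

    skewness = stats.skew(data, bias=False)       # sample skewness coefficient
    print("skewness coefficient:", round(skewness, 3))
    if abs(skewness) > 1:
        print("Absolute skewness exceeds 1; run a formal normality test.")
    else:
        print("Skewness is consistent with an approximately symmetric sample.")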

       COEFFICIENT OF VARIATION  [CV] (SECTION 10.4)
Basic purpose: Diagnostic tool. Sample statistic used to measure skewness in a sample of positively-
   valued measurements. Because the CV of positively-valued normal measurements must be close to
   zero,  the CV provides an easy indication  of whether a given sample  is symmetric enough to be
   normal. Coefficients  of variation close to zero are consistent with normality;  large CVs indicate a
   skewed, non-normal population.

Hypothesis tested: The coefficient of variation is not a formal hypothesis test. Still, it can be used to
   provide a 'quick and easy' gauge of non-normality: if the CV exceeds 0.5, the population is probably
   not normal.

Underlying Assumptions: Sample must be positively-valued for CV to have meaningful interpretation.

Steps involved:  1) Compute  sample mean and standard deviation; 2) Divide standard deviation by mean
   to get coefficient of variation.

Advantages/Disadvantages:  Simple calculation,  good  screening tool.  It  measures  skewness  and
   variability in positively-valued data. Not an accurate test of normality, especially if data have been
   transformed.

       SHAPIRO-WILK AND  SHAPIRO-FRANCIA TESTS (SECTION 10.5)
Basic purpose:  Diagnostic  tool and a formal numerical goodness-of-fit  test of normality.  Shapiro-
   Francia test is a close variant of the Shapiro-Wilk useful when  the sample size is larger than 50.
Hypothesis tested: H0 — the dataset being tested comes from an underlying normal population. HA —
   the underlying population is non-normal (note that the form of this alternative population is not
   specified).

Underlying assumptions: All observations come from a single normal population.

When to use: To test normality on a set of raw measurements or following transformation of the data. It
   can also be used with the residuals from  a one-way ANOVA to test the joint normality of the groups
   being compared.

Steps involved (for  Shapiro-Wilk):  1) Order the dataset and compute successive differences between
   pairs of extreme values (i.e., most extreme pair = maximum - minimum, next most extreme pair =
   2nd  largest - 2nd smallest, etc.);  2) Multiply the pair differences by the Shapiro-Wilk coefficients
   and  compute the  Shapiro-Wilk test statistic;  3) Compare the test statistic against an a-level critical
   point; 4) Values higher than the critical point are consistent with the null hypothesis of normality,
   while values  lower than the critical point suggest a non-normal  fit.

Advantages/Disadvantages: The Shapiro-Wilk procedure is considered  one of the very best tests of
   normality. It  is much more powerful than the skewness coefficient or chi-square goodness-of-fit test.
   The  Shapiro-Wilk and Shapiro-Francia test statistics will tend to be large (and more indicative of
   normality)  when a probability plot  of the  same data exhibits a  close-to-linear pattern. Special
   Shapiro-Wilk coefficients are available for sample  sizes up to 50.  For larger sample sizes, the
   Shapiro-Francia test does not require a table of special coefficients, just the ability to compute
   inverse normal probabilities.
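
   As an illustrative sketch, the Shapiro-Wilk test is available in SciPy; comparing its p-value to
   the chosen significance level is equivalent to comparing the W statistic against the critical
   point described above. The data values below are hypothetical.

    # Sketch: Shapiro-Wilk test of normality; data values are hypothetical
    import numpy as np
    from scipy import stats

    data = np.array([3.1, 4.7, 2.8, 5.9, 4.2, 3.6, 7.4, 4.9, 3.3, 5.1])

    w_stat, p_value = stats.shapiro(data)
    print(f"W = {w_stat:.3f}, p = {p_value:.3f}")
    if p_value < 0.05:
        print("Reject the null hypothesis of normality; consider a transformation.")
    else:
        print("Data are consistent with the null hypothesis of normality.")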

       FILLIBEN'S PROBABILITY PLOT CORRELATION COEFFICIENT TEST (SECTION 10.6)
Basic purpose: Diagnostic tool and a formal numerical goodness-of-fit procedure to test for normality.
Hypothesis tested: H0 — the dataset being tested comes from an underlying normal population. HA —
   the underlying population is non-normal (note that the form of this alternative population is not
   specified).

Underlying assumptions: All observations come from a single normal population.

When to use: To test normality  on a set of raw measurements or following transformation of the data on
   the transformed scale. It can also be used on the residuals from a one-way ANOVA to test the joint
   normality of the groups being compared.

Steps involved: 1) Construct a normal probability plot of the  dataset; 2) Calculate the correlation
   between the pairs on the probability plot; 3) Compare the test statistic against an a-level critical
   point; 4) Values higher than the critical point are consistent with the null hypothesis of normality,
   while values lower than the critical point suggest a non-normal  fit.

Advantages/Disadvantages: Filliben's procedure is an  excellent test of normality,  with very similar
   characteristics to the Shapiro-Wilk test. As a correlation on  a  probability plot, the Filliben's test
   statistic will tend to be close to one (and more indicative of normality) when a probability plot of the
   same data exhibits a close-to-linear pattern. Critical points for Filliben's test are available for sample
   sizes up to 100. A table of special coefficients is not needed to run Filliben's test,  only the ability to
   compute inverse normal probabilities.
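
   For illustration, Filliben's statistic is the correlation coefficient of the normal probability
   plot, which SciPy's probplot routine reports directly; the resulting value would then be compared
   to the published critical point for the sample size (not reproduced here). The data are
   hypothetical.

    # Sketch: probability plot correlation coefficient; data values are hypothetical
    import numpy as np
    from scipy import stats

    data = np.array([3.1, 4.7, 2.8, 5.9, 4.2, 3.6, 7.4, 4.9, 3.3, 5.1])

    (_, _), (_, _, r) = stats.probplot(data, dist="norm")
    print("probability plot correlation coefficient:", round(r, 3))
    # Values close to 1 support normality; values below the a-level critical point
    # for this sample size would indicate a non-normal fit.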

       SHAPIRO-WILK  MULTIPLE GROUP TEST (SECTION 10.7)
Basic purpose: Diagnostic tool and a formal normality goodness-of-fit test for multiple groups.


Hypothesis tested: H0 — datasets being tested all come from underlying normal populations, possibly
   with different means and/or variances. HA — at least one underlying population is non-normal (note
   that the form of this alternative population is not specified).

Underlying assumptions: The observations in each group all come from,  possibly different, normal
   populations.

When to use:  Can be used to test normality on multiple sets of raw measurements or, by first making a
   transformation, to test normality of the  data groups on  the transformed scale.  It is particularly
   helpful when used in conjunction with Welch's t-test.

Steps  involved:  1)  Compute  Shapiro-Wilk statistic (Section 10.5) on  each group  separately;  2)
   Transform  the Shapiro-Wilk  statistics  into z-scores  and combine into an omnibus z-score;  3)
   Compare the test  statistic against an a-level critical point; 4) Values higher than the critical point are
   consistent with the null hypothesis of normality for all the populations, while values lower than the
   critical point suggest a non-normal fit of one or more groups.

Advantages/Disadvantages:  As an extension of the Shapiro-Wilk test, the  multiple group test shares
   many  of its  desirable properties. Users  should  be careful, however, not to assume that a  result
   consistent  with the hypothesis of normality implies that  all  groups  follow the same  normal
   distribution.  The multiple group test does not  assume that all  groups have  the same means or
   variances.  Special coefficients are needed to convert Shapiro-Wilk statistics into z-scores, but once
   converted, no other special tables are needed to run the test besides a standard normal table.

       LEVENE'STEST (SECTION 11.2)
Basic purpose: Diagnostic tool. Levene's test is a formal numerical test of equality of variances across
   multiple populations.

Hypothesis tested: H0 — The population variances across all the datasets being tested are equal. HA —
   One or more pairs of population variances are unequal.

Underlying assumptions: The data set  from each  population is assumed to be roughly normal in
   distribution. Since Levene's test is designed to work well even with somewhat non-normal data (i.e.,
   it is fairly robust to non-normality), precise normality is not an overriding concern.

When to use:  Levene's method can be used to test the equal variance assumption underlying one-way
   ANOVA for a group of wells. Used in this way, the test is run on the absolute values of the residuals
   after first subtracting the  mean of each group being compared. If Levene's test is significant,  the
   original data may need to be transformed to stabilize the variances before running an ANOVA.

Steps involved: 1) Compute  the residuals of each group by subtracting the group mean; 2) conduct a
   one-way ANOVA on the absolute values of the residuals; and 3) if the  ANOVA F-statistic is
   significant  at the  5% a-level, conclude  the underlying population variances  are unequal. If not,
   conclude the data are consistent with the null hypothesis of equal variances.

Advantages/Disadvantages:  As a test of equal variances, Levene's test is  reasonably robust to non-
   normality.  It is much more so than for Bartlett's  test (recommended within the 1989 Interim  Final

   Guidance [IFG]). In addition, Levene's method uses the same basic equations as those needed to run
   a one-way ANOVA.
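
   For illustration, Levene's test as described above (residuals taken from the group means) can be
   run in SciPy by setting center="mean"; SciPy's default, center="median", is the Brown-Forsythe
   variant. The well data below are hypothetical.

    # Sketch: Levene's test of equal variances across wells; data are hypothetical
    from scipy import stats

    mw1 = [10.2, 11.0, 9.8, 12.1, 11.5, 10.9, 12.4, 11.8]
    mw2 = [14.6, 15.1, 13.9, 16.0, 15.2, 14.8, 16.3, 15.7]
    mw3 = [8.1, 8.9, 7.7, 9.5, 9.0, 8.6, 10.1, 9.4]

    f_stat, p_value = stats.levene(mw1, mw2, mw3, center="mean")
    print(f"Levene F = {f_stat:.3f}, p = {p_value:.3f}")
    if p_value < 0.05:
        print("Variances appear unequal; consider a transformation before ANOVA.")
    else:
        print("Data are consistent with equal population variances.")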

       MEAN-STANDARD DEVIATION SCATTER PLOT (SECTION  11.3)
Basic purpose: Diagnostic tool. It is a graphical method to examine degree of association between mean
   levels and standard deviations at a series of wells. Positive correlation or association between these
   quantities is known as a 'proportional effect' and is characteristic of skewed distributions such as the
   lognormal.

Hypothesis tested: Though not a formal test, the mean-standard deviation scatter plot provides a visual
   indication of whether variances are roughly equal from well to well, or whether the variance depends
   on the well mean.

Underlying Assumptions: None.

When to use: Useful  as a graphical indication of 1) equal variances or 2) proportional effects between
   the standard deviation and mean levels. A positive correlation between well means and standard
   deviations may signify that a transformation is needed to stabilize the variances.

Steps involved: 1) Compute the sample mean and  standard deviation for each well; 2) plot the mean-
   standard deviation pairs on a scatter plot; and 3) examine the plot for any association between the
   two quantities.

Advantages/Disadvantages: Not a formal test of homoscedasticity (i.e., equal variances). It is helpful in
   assessing whether a transformation might be warranted to stabilize unequal variances.

       DIXON'S TEST (SECTION  12.3)
Basic purpose: Diagnostic tool. It is used to identify (single) outliers within smaller datasets.

Hypothesis tested: H0 — Outlier(s) comes from same normal distribution as rest of the dataset. HA —
   Outlier(s) comes from different distribution than rest of the dataset.

Underlying assumptions: Data without  the suspected  outlier(s)  are normally distributed.  Test
   recommended only for sample  sizes up to 25.

When to use:  Try Dixon's test when one  value in a  dataset appears anomalously low or anomalously
   high when compared to the other data values. Be cautious about screening apparent high outliers in
   compliance  point wells. Even if found to be statistical  outliers,  such extreme concentrations may
   represent contamination events. A safer application of outlier tests is  with background or baseline
   samples. Even then, always try to establish a physical reason for the outlier if possible (e.g.,
   analytical error, transcription mistake, etc.).

Steps involved: 1) Remove the suspected  outlier and test remaining data for normality. If non-normal,
   try a transformation to achieve  normality; 2) Once remaining data are  normal, calculate Dixon's
   statistic, depending on the sample size n; 3) Compare Dixon's statistic against an a-level critical
   point;  and 4) If Dixon's statistic exceeds the critical  point, conclude the suspected value is  a
   statistical outlier. Investigate this measurement further.


Advantages/Disadvantages: Dixon's test is only recommended for sample sizes up to 25. Furthermore,
   if there is more than one outlier, Dixon's test may lead to masking (i.e., a non-significant result)
   where two or more outliers close in value 'hide' one another. If more than one outlier is suspected,
   always test the least extreme value first.
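
   As a purely illustrative sketch, the simplest form of Dixon's ratio (appropriate for roughly seven
   or fewer observations; the guidance's full procedure switches to related ratios as the sample size
   grows) can be computed as below. The data are hypothetical, and the a-level critical points must
   still be taken from published tables.

    # Sketch: Dixon's ratio for a single suspected high outlier; data are hypothetical
    import numpy as np

    data = np.array([3.1, 3.4, 2.9, 3.6, 3.2, 9.8])     # 9.8 appears anomalously high
    x = np.sort(data)
    n = len(x)

    ratio_high = (x[-1] - x[-2]) / (x[-1] - x[0])        # gap of the maximum over the range
    print(f"n = {n}, Dixon ratio for the maximum = {ratio_high:.3f}")
    # Compare the ratio to the published critical point for n at the chosen a-level;
    # a ratio exceeding that point flags the maximum as a statistical outlier.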

       ROSNER'S TEST (SECTION  12.4)
Basic purpose: Diagnostic tool. It is used to identify multiple outliers within larger datasets.

Hypothesis tested: H0 — Outliers come from same normal distribution as the rest of the dataset. HA —
   Outliers come from different distribution than the rest of the dataset.

Underlying assumptions:  Data  without the  suspected  outliers  are  normally distributed. Test
   recommended for sample sizes of at least 20.

When to use: Try Rosner's test when multiple values in a dataset appear anomalously low or
   anomalously high when compared to the other data values. As with Dixon's test, be cautious about
   screening apparent high outliers in  compliance point wells. Always try to establish a physical reason
   for an outlier if possible (e.g., analytical error, transcription mistake, etc.).

Steps involved: 1) Identify the maximum number of possible outliers (r0 ≤ 5) and the number of
   suspected outliers (r ≤ r0). Remove the suspected outliers and test the remaining data for normality.
   If non-normal, try a transformation to achieve normality; 2) Once remaining data are normal,
   successively compute the mean and standard deviation, removing the next most extreme value each
   time until r0 possible outliers have been removed; 3) Compute Rosner's statistic based on the
   number (r) of  suspected  outliers;  and 4) If Rosner's statistic exceeds an a-level critical point,
   conclude there are r statistical outliers. Investigate these measurements further.  If Rosner's statistic
   does  not  exceed the critical point, recompute the test for (r-1) possible  outliers, successively
   reducing r until either the critical point is exceeded or r = 0.

Advantages/Disadvantages: Rosner's test is only recommended for sample  sizes of 20 or more, but can
   be used to identify up to 5 outliers per use. It is more complicated to use than some other outlier
   tests, but does not require special tables other than to determine a-level critical points.
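
   For illustration, a common textbook form of Rosner's procedure is the generalized extreme
   Studentized deviate (ESD) test, sketched below in Python with SciPy supplying the t-distribution
   critical points. The data are hypothetical, and the exact statistic and critical points tabulated
   in Chapter 12 may differ in detail from this form.

    # Sketch: generalized ESD (Rosner-type) test for up to r0 outliers; data hypothetical
    import numpy as np
    from scipy import stats

    def generalized_esd(values, r0=5, alpha=0.05):
        """Return the number of outliers identified by the generalized ESD test."""
        x = np.asarray(values, dtype=float)
        n = len(x)
        work = x.copy()
        r_stats, crit = [], []
        for i in range(1, r0 + 1):
            mean, sd = work.mean(), work.std(ddof=1)
            idx = np.argmax(np.abs(work - mean))            # most extreme remaining value
            r_stats.append(abs(work[idx] - mean) / sd)      # R_i statistic
            work = np.delete(work, idx)
            p = 1 - alpha / (2 * (n - i + 1))               # critical value lambda_i
            t = stats.t.ppf(p, n - i - 1)
            crit.append((n - i) * t / np.sqrt((n - i - 1 + t**2) * (n - i + 1)))
        flagged = [i + 1 for i in range(r0) if r_stats[i] > crit[i]]
        return max(flagged) if flagged else 0               # largest i with R_i > lambda_i

    data = [3.1, 3.4, 2.9, 3.6, 3.2, 3.0, 3.5, 3.3, 2.8, 3.7,
            3.1, 3.4, 3.0, 3.6, 3.2, 3.3, 2.9, 3.5, 9.8, 11.2]
    print("outliers identified:", generalized_esd(data))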

       ONE-WAY ANALYSIS OF VARIANCE [ANOVA] FOR SPATIAL VARIATION  (SECTION 13.2.2)
Basic purpose: Diagnostic  tool. Test to compare population means at multiple wells, in order to gauge
   the presence of spatial variability.

Hypothesis tested: H0 — Population means across all tested wells are equal. HA — One or more pairs
   of population means are unequal.

Underlying assumptions:  1) ANOVA residuals at each  well  or group must be normally distributed
   using  the original data  or after transformation. Residuals should be tested for normality using a
   goodness-of-fit procedure; 2) population variances across  all wells must be equal. This assumption
   can be tested with box plots and Levene's test; and 3) each tested well should  have at least 3 to 4
   separate observations.

When to  use: The one-way ANOVA  procedure  can be used to identify significant spatial variation
   across a group of distinct well locations. The method is particularly useful for a group of multiple
   upgradient wells, to determine whether or not there are large average concentration differences from
   one location to the next due to natural groundwater fluctuations and/or differences in geochemistry.
   If downgradient wells are included in an ANOVA,  the downgradient groundwater should not be
   contaminated, at least if a test of natural spatial variation is  desired. Otherwise, a significant
   difference in population means could reflect the presence of either recent or historical contamination.

Steps involved:  1) Form the ANOVA residuals by subtracting from  each measurement its sample well
   mean;  2) test the ANOVA residuals for normality and equal variance. If either of these assumptions
   is violated, try a transformation of the data and retest the assumptions; 3)  compute the one-way
   ANOVA F-statistic;  4)  if  the F-statistic exceeds  an  a-level  critical  point,  conclude  the  null
   hypothesis of equal population means has been violated and that there is  some (perhaps substantial)
   degree of spatial variation; 5) if the F-statistic does not exceed the critical point, conclude that the
   well averages are close  enough to treat the combined data as  coming from the  same statistical
   population.

Advantages/Disadvantages: One-way ANOVA is an excellent technique for identifying differences in
   separate well  populations,  as long  as the assumptions are generally met. However, a finding  of
   significant spatial variability does not specify the reason for the well-to-well differences. Additional
   information or investigation  may be necessary to determine why the spatial differences exist. Be
   especially careful 1) when testing a combination of upgradient and downgradient wells, to confirm that
   downgradient contamination is not the source of the difference found with ANOVA; and 2) when
   ANOVA identifies significant spatial variation and intrawell tests are called for. In the latter case, the
   ANOVA results can sometimes be used to estimate more powerful intrawell  prediction and control
   limits. Such an adjustment comes directly from the ANOVA computations, requiring no additional
   calculation.
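
   For illustration, a one-way ANOVA across wells, together with the residual checks described above,
   can be run in Python with SciPy as sketched below; the well data are hypothetical.

    # Sketch: one-way ANOVA for spatial variation; well data are hypothetical
    import numpy as np
    from scipy import stats

    wells = {
        "MW-1": [10.2, 11.0, 9.8, 12.1, 11.5, 10.9],
        "MW-2": [14.6, 15.1, 13.9, 16.0, 15.2, 14.8],
        "MW-3": [8.1, 8.9, 7.7, 9.5, 9.0, 8.6],
    }
    groups = [np.array(v) for v in wells.values()]

    # Residuals: each value minus its own well mean; check normality and equal variance
    residuals = np.concatenate([g - g.mean() for g in groups])
    sw_stat, sw_p = stats.shapiro(residuals)
    lev_stat, lev_p = stats.levene(*groups, center="mean")
    print(f"Shapiro-Wilk on residuals: p = {sw_p:.3f}; Levene's test: p = {lev_p:.3f}")

    f_stat, p_value = stats.f_oneway(*groups)
    print(f"ANOVA F = {f_stat:.2f}, p = {p_value:.4f}")
    if p_value < 0.05:
        print("Significant spatial variation among the well means.")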

       ANALYSIS OF VARIANCE  [ANOVA] FOR TEMPORAL EFFECTS (SECTIONS 14.2.2 & 14.3.3)
Basic purpose: Diagnostic tool. It is a test to compare population means at multiple sampling events,
   after pooling the event data across wells. The test can also be used to adjust data across multiple wells
   for common temporal dependence.

Hypothesis tested: H0 — Population means across all sampling events are equal. HA — One or more
   pairs of population means are unequal.

Underlying assumptions: 1) ANOVA residuals from the population at each sampling event must be
   normal or normalized. These  should be tested for normality using  a goodness-of-fit procedure; 2) the
   population variances across all sampling events must be equal. Test this  assumption with box plots
   and Levene's test; and 3) each tested well should have at least  3 to 4  observations per sampling
   event.

When to  use: 1) The ANOVA procedure for temporal effects should be used  to identify significant
   temporal variation  over a series of  distinct sampling events.  The method assumes that spatial
   variation by well location is not a significant factor (this should have already  been tested). ANOVA
   for temporal  effects should  be used when a time series plot of  a group of wells exhibits  roughly
   parallel traces over time, indicating a time-related phenomenon affecting all the wells in a similar
   way on any given sampling event. If a significant temporal effect is found, the results of the ANOVA
   can  be employed to adjust the  standard deviation estimate and the degrees of freedom quantities
   needed for further upgradient-to-downgradient comparisons; 2) compliance wells can be included in
   ANOVA for temporal effects, since the temporal pattern is assumed to affect all the wells on-site,
   regardless of gradient; and 3) residuals from ANOVA for temporal effects can be used to create
   adjusted, temporally-stationary measurements in order to eliminate the temporal dependence.

Steps involved: 1) Compute the mean (across wells)  from data collected on  each  separate sampling
   event; 2)  form the ANOVA residuals by subtracting  from each measurement its sampling event
   mean; 3) test the ANOVA residuals for normality and equal variance. If either of these assumptions
   is violated,  try a transformation of the data and retest the assumptions; 4) compute the one-way
   ANOVA  F-statistic;  5)  if the F-statistic  exceeds an a-level critical point,  conclude the null
   hypothesis of equal population means has been violated and that there is some (perhaps substantial)
   degree  of temporal  dependence;  6) compute  the degrees  of freedom adjustment factor and the
   adjusted standard deviation for use in interwell comparisons; 7)  if the F-statistic does not exceed the
   critical point, conclude that the sampling event averages are close enough to treat the combined data
   as if there were no  temporal dependence; and use  the residuals, if necessary, to create adjusted,
   temporally-stationary measurements, regardless of the significance of the F-test (Section 14.3.3).

Advantages/Disadvantages:  1)  One-way ANOVA for  temporal effects is  a good technique for
   identifying time-related effects among a group  of wells. The procedure should be employed when a
   strong temporal dependence is indicated by parallel traces in time series plots; 2) if there is both
   temporal dependence and strong spatial variability, the ANOVA for temporal effects may be non-
   significant due to the added spatial variation.  A two-way ANOVA for temporal and spatial effects
   might be considered instead;  and 3) even if the ANOVA is non-significant, the ANOVA residuals
   can still be used to adjust data for apparent temporal  dependence.
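
   For illustration, the sketch below groups a small hypothetical dataset by sampling event, runs the
   one-way ANOVA in SciPy, and forms the adjusted, temporally-stationary values by subtracting each
   event mean and adding back the grand mean.

    # Sketch: ANOVA for temporal effects and residual adjustment; data are hypothetical
    import numpy as np
    from scipy import stats

    # rows = 4 wells, columns = 6 common sampling events
    data = np.array([
        [10.2, 12.4, 9.6, 11.8, 10.5, 12.9],
        [11.0, 13.1, 10.2, 12.6, 11.3, 13.4],
        [9.5, 11.9, 9.0, 11.2, 10.0, 12.2],
        [10.8, 12.8, 10.1, 12.3, 11.0, 13.1],
    ])

    event_groups = [data[:, j] for j in range(data.shape[1])]   # one group per event
    f_stat, p_value = stats.f_oneway(*event_groups)
    print(f"ANOVA for temporal effects: F = {f_stat:.2f}, p = {p_value:.4f}")

    # Adjusted values: subtract each sampling event mean, add back the grand mean
    adjusted = data - data.mean(axis=0) + data.mean()
    print("standard deviation before/after adjustment:",
          round(data.std(ddof=1), 3), "/", round(adjusted.std(ddof=1), 3))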

       SAMPLE AUTOCORRELATION FUNCTION (SECTION 14.2.3)
Basic purpose:  Diagnostic tool. This is a parametric  estimate and test of autocorrelation (i.e., time-
   related dependence) in a data series from a single population.

Hypothesis tested: H0 — Measurements from the population are independent of sampling events (i.e.,
   they  are not influenced  by  the  time  when the data were collected). HA  — The distribution of
   measurements is impacted by the time  of data collection.

Underlying assumptions:  Data should be  approximately  normal,  with few non-detects. Sampling
   events represented in the sample should be fairly regular and evenly spaced in time.

When to use: When testing a data series from a single  population (e.g., a single well),  the sample
   autocorrelation function (also known as the correlogram) can determine whether there is a significant
   temporal dependence in the data.

Steps involved: 1) Form overlapping ordered pairs from  the data series by pairing measurements
   'lagged' by  a certain number of sampling events (e.g.,  all  pairs  with measurements spaced by k = 2
   sampling events); 2) for each distinct lag (k), compute the sample autocorrelation; 3) plot the
   autocorrelations from Step  2 by lag (k) on a scatter plot;   and 4) count any autocorrelation as
   significantly different from zero if its absolute magnitude exceeds 2/√n, where n is the sample size.

Advantages/Disadvantages:  1) The  sample  autocorrelation function provides a  graphical test of
   temporal dependence. It can be used not only to identify autocorrelation, but also as a planning tool
   for adjusting the sampling interval between events. The smallest lag (k) at which the autocorrelation
   is insignificantly  different  from zero  is  the minimum sampling interval  ensuring  temporally
   uncorrelated data; 2) the test only applies to a single population at a time and cannot be used to
   identify temporal effects that span across groups of wells simultaneously. In that scenario, use a one-
   way ANOVA for temporal  effects; and 3) tests for significant autocorrelation depend on the data
   being approximately normal; use the rank von Neumann ratio for non-normal samples.
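
   For illustration, the lag-k sample autocorrelations and the approximate 2/√n significance band can
   be computed as in the sketch below; the single-well series is hypothetical and assumes evenly
   spaced sampling events.

    # Sketch: sample autocorrelation function for one well; series is hypothetical
    import numpy as np

    series = np.array([10.2, 10.8, 11.5, 11.1, 10.4, 10.0, 10.6, 11.3,
                       11.0, 10.5, 10.1, 10.7, 11.4, 11.2, 10.6, 10.2])
    n = len(series)
    dev = series - series.mean()
    denom = np.sum(dev ** 2)
    band = 2 / np.sqrt(n)                       # rough cutoff for a significant lag

    for k in range(1, 6):
        r_k = np.sum(dev[:-k] * dev[k:]) / denom
        flag = "significant" if abs(r_k) > band else "not significant"
        print(f"lag {k}: r = {r_k:+.3f} ({flag}; band = {band:.3f})")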

       RANK VON NEUMANN RATIO (SECTION 14.2.4)
Basic purpose: Diagnostic tool.  It is  a non-parametric test of first-order autocorrelation (i.e., time-
   related dependence) in a data series from a single population.

Hypothesis tested: H0 — Measurements from the population are independent of sampling events (i.e.,
   they are not influenced by the time when the data were collected). HA — The distribution of
   measurements is impacted by the time of data collection.

Underlying assumptions: Data need not be normally distributed. However, it is assumed that the data
   series can be uniquely ranked according to concentration level. Ties in the data (e.g., non-detects) are
   not technically allowed. Although a mid-rank procedure (as used in the Wilcoxon rank-sum test) to
   rank tied values might be considered, the available critical points for the rank von Neumann ratio
   statistic only directly apply to cases where a unique ranking is possible.

When to  use:  When testing a data series from a single population (e.g.,  a single well)  for use in,
   perhaps, an  intrawell prediction limit, control chart, or test of trend, the rank von Neumann ratio can
   determine whether there is a significant temporal dependence in the data. If the dependence is
   seasonal, the data may be adjusted using a seasonal correction (Section 14.3.3). If the dependence is
   a  linear trend,  remove the  estimated trend and re-run the  rank von Neumann ratio on the trend
   residuals before concluding there are additional time-related effects.  Complex dependence may
   require consultation with a professional statistician.

Steps involved: 1) Rank the measurements by concentration level, but then list the ranks in the order the
   samples were collected; 2)  using the ranks, compute the von Neumann ratio;  3) if the  rank von
   Neumann ratio exceeds an a-level critical  point,  conclude the data exhibit no significant temporal
   correlation.  Otherwise, conclude that a time-related pattern does exist. Check for seasonal cycles or
   linear trends using time series plots.  Consult a professional statistician regarding possible statistical
   adjustments if the pattern is more complex.

Advantages/Disadvantages:  The rank von Neumann ratio,  as opposed to other common  time  series
   methods for determining autocorrelation, is a non-parametric test based on using the ranks of the
   data instead of the actual concentration measurements.  The test is simple to compute and can be used
   as a formal  confirmation of temporal dependence, even if the autocorrelation appears fairly obvious
   on a time series plot. As a limiting feature, the test only applies to a single population at a time and

   cannot be used to identify temporal effects that span across groups of wells simultaneously. In that
   scenario, a one-way ANOVA for temporal effects is a better diagnostic tool. Because critical points
   for the rank von Neumann ratio have not been developed for the presence of ties, the test will not be
   useful for datasets with substantial portions of non-detects.
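
   For illustration, the ratio is straightforward to compute once the data are ranked, as in the
   sketch below (hypothetical series with no tied values); the result is then compared to a published
   a-level critical point, with values near 2 expected for independent data and small values
   suggesting positive autocorrelation.

    # Sketch: rank von Neumann ratio for a single well; series is hypothetical, no ties
    import numpy as np
    from scipy import stats

    series = np.array([10.2, 10.8, 11.5, 11.1, 10.4, 10.0,
                       10.6, 11.3, 11.0, 10.5, 10.1, 10.7])   # listed in sampling order
    n = len(series)
    ranks = stats.rankdata(series)              # ranks by concentration level

    numerator = np.sum(np.diff(ranks) ** 2)     # squared differences of successive ranks
    ratio = numerator / (n * (n ** 2 - 1) / 12.0)
    print(f"rank von Neumann ratio = {ratio:.3f}")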

       DARCY EQUATION (SECTION 14.3.2)
Basic purpose: Method  to determine a sampling  interval ensuring that distinct physical volumes of
   groundwater are sampled on any pair of consecutive events.

Hypothesis tested: Not a statistical test or formal procedure.

Underlying assumptions: Flow regime is one in which Darcy's equation is approximately valid.

When to use: Use Darcy's equation to gauge the minimum travel time necessary for distinct volumes of
   groundwater to pass through each well screen. Physical independence of samples does not guarantee
   statistical  independence, but it increases the likelihood of statistical independence. Use to design or
   plan for a site-specific sampling frequency, as well  as what formal statistical tests and retesting
   strategies  are possible given the amount of temporally-independent data that can be collected each
   evaluation period.

Steps involved: 1) Using knowledge of the site hydrogeology, calculate the horizontal and vertical
   components  of average groundwater velocity with  Darcy's equation; 2) Determine the minimum
   travel time needed  between field samples to ensure physical independence; 3) Specify a sampling
   interval during monitoring no less than the travel time obtained via the Darcy computation.

Advantages/Disadvantages:  Darcy's equation is  relatively  straightforward, but is not a statistical
   procedure. It is not  applicable to certain hydrologic environments. Further, it is not a substitute for a
   direct estimate of autocorrelation. Statistical independence is not assured using Darcy's equation, so
   caution is advised.
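
   For illustration, the calculation reduces to a few lines of arithmetic once the site-specific
   inputs are known; in the sketch below the hydraulic conductivity, gradient, effective porosity,
   and travel distance are all hypothetical values.

    # Sketch: Darcy-based estimate of a minimum sampling interval; inputs are hypothetical
    hydraulic_conductivity = 1.0      # K, ft/day
    hydraulic_gradient = 0.002        # i, dimensionless
    effective_porosity = 0.25         # n_e, dimensionless

    darcy_velocity = hydraulic_conductivity * hydraulic_gradient    # q = K * i (ft/day)
    seepage_velocity = darcy_velocity / effective_porosity          # average linear velocity

    travel_distance = 0.5             # ft of travel assumed to yield a distinct water volume
    min_interval_days = travel_distance / seepage_velocity
    print(f"seepage velocity = {seepage_velocity:.4f} ft/day")
    print(f"minimum sampling interval = {min_interval_days:.1f} days")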

       SEASONAL CORRECTION (SECTION 14.3.3)
Basic purpose: Method to adjust a longer data series from a single population for an obvious seasonal
   cycle or fluctuation pattern. By removing the seasonal pattern, the remaining residuals can be used in
   further statistical procedures (e.g., prediction limits, control charts) and treated as independent of the
   seasonal correlation.

Hypothesis tested: The  seasonal  correction is not a formal statistical test. Rather, it is a statistical
   adjustment to data for which a definite seasonal pattern has been identified.

Underlying assumptions:  There should be enough data so that  at least 3  full  seasonal cycles are
   displayed on a time series plot. It is also assumed that the seasonal component has a stationary (i.e.,
   stable) mean and variance during the period of data collection.

When to use: Use the seasonal correction when a  longer series of data must be examined, but a time
   series plot indicates a clearly recurring, seasonal fluctuation of concentration levels. If not removed,
   the seasonal dependence will tend to upwardly  bias the estimated variability and  could  lead to
   inaccurate or insufficiently powerful tests.


Steps involved: 1) Using a time series plot of the data series, separate the values into common sampling
   events for each year (e.g., all January measurements, all third quarter values, etc.); 2) compute the
   average of each subgroup and the overall mean of the dataset; and 3) adjust the data by removing the
   seasonal pattern.

Advantages/Disadvantages: The seasonal correction described in the Unified Guidance  is relatively
   simple to perform and offers more accurate standard deviation estimates compared to using
   unadjusted data. Removal of the seasonal  component may reveal other previously unnoticed features
   of the data, such  as  a slow-moving trend.   A fairly long data series is required to  confirm the
   presence of a recurring seasonal cycle. Furthermore, many complex time-related patterns cannot be
   handled by this simple correction. In such cases, consultation with a professional statistician may be
   necessary.
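
   For illustration, the adjustment amounts to subtracting each season's mean and adding back the
   overall mean, as in the sketch below using three years of hypothetical quarterly data.

    # Sketch: simple seasonal correction; quarterly data (rows = years) are hypothetical
    import numpy as np

    data = np.array([
        [12.1, 15.4, 18.2, 13.6],
        [12.8, 16.0, 19.1, 14.2],
        [13.3, 16.6, 19.8, 14.9],
    ])

    seasonal_means = data.mean(axis=0)             # mean of each quarter across years
    grand_mean = data.mean()
    adjusted = data - seasonal_means + grand_mean  # seasonally adjusted values

    print("seasonal means:", np.round(seasonal_means, 2))
    print("standard deviation before/after:",
          round(data.std(ddof=1), 2), "/", round(adjusted.std(ddof=1), 2))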

       SEASONAL MANN-KENDALL TEST FOR TREND (SECTION 14.3.4)
Basic purpose: Method  for detection monitoring. It is used to identify the presence  of a significant
   (upward) trend at a compliance point when data also exhibit seasonal fluctuations. It may also be
   used  in  compliance/assessment and corrective action  monitoring to track upward or downward
   trends.

Hypothesis tested: H0 — No discernible linear trend exists in the concentration data over time. HA — A
   non-zero (upward) linear component to the trend does exist.

Underlying assumptions: Since the seasonal Mann-Kendall trend  test is a non-parametric method, the
   underlying data need not be normal or follow a particular distribution. No special adjustment for ties
   is needed.

When to use: Use when  1) upgradient-to-downgradient comparisons are inappropriate so that intrawell
   tests are called for; 2) a control chart or intrawell prediction limit cannot be used because of possible
   trends in the intrawell background, and  3) the data also exhibit seasonality. A trend test can be
   particularly  helpful  at sites with  recent or historical contamination  where  it  is uncertain if
   background is already contaminated. An  upward trend in these cases will document the changing
   concentration levels more accurately than either a control chart  or intrawell prediction limit, both of
   which assume a stationary background mean concentration.

Steps involved:  1) Divide the data into separate groups representing common sampling events from
   each year; 2) compute the Mann-Kendall test statistic (S) and its standard deviation (SD[S]) on each
   group; 3) sum the separate  Mann-Kendall statistics into an overall test statistic; 4) compare this
   statistic against an a-level critical point;  and 5) if the statistic exceeds the critical  point, conclude
   that a significant upward trend exists. If not, conclude there is insufficient evidence for identifying a
   significant, non-zero trend.

Advantages/Disadvantages: 1) The seasonal Mann-Kendall test does not require any special treatment
   for non-detects, only  that all non-detects  be set to a common value lower than any of the detected
   values;  and 2) the test is easy to compute and reasonably efficient for detecting (upward) trends in
   the presence of seasonality.  Approximate critical  points  are  derived  from the standard  normal
   distribution.
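
   For illustration, the sketch below computes the Mann-Kendall S statistic and its variance within
   each season, sums them, and forms the approximate normal Z-score; the five years of quarterly data
   are hypothetical, and the no-ties variance formula is an assumption of this sketch.

    # Sketch: seasonal Mann-Kendall statistic; quarterly data (rows = years) are hypothetical
    import numpy as np
    from scipy import stats

    data = np.array([
        [10.1, 12.3, 14.0, 11.2],
        [10.6, 12.9, 14.8, 11.9],
        [11.2, 13.4, 15.5, 12.5],
        [11.9, 14.1, 16.1, 13.0],
        [12.4, 14.8, 16.9, 13.8],
    ])

    def mann_kendall_s(x):
        """Sum of signs of all pairwise later-minus-earlier differences."""
        n = len(x)
        return sum(np.sign(x[j] - x[i]) for i in range(n - 1) for j in range(i + 1, n))

    s_total, var_total = 0.0, 0.0
    for season in range(data.shape[1]):
        x = data[:, season]
        n = len(x)
        s_total += mann_kendall_s(x)
        var_total += n * (n - 1) * (2 * n + 5) / 18.0    # variance of S assuming no ties

    z = (s_total - np.sign(s_total)) / np.sqrt(var_total) if s_total != 0 else 0.0
    p_upward = 1 - stats.norm.cdf(z)                     # one-sided test for an upward trend
    print(f"S = {s_total:.0f}, Z = {z:.2f}, one-sided p = {p_upward:.4f}")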


       SIMPLE SUBSTITUTION (SECTION 15.2)
Basic purpose: A simple adjustment for non-detects in a dataset. One-half the reporting limit [RL] is
   substituted for  each  non-detect  to  provide a numerical  approximation to the  unknown true
   concentration.

Hypothesis tested: None.

Underlying assumptions: The true non-detect concentration is assumed to lie somewhere between zero
   and the reporting limit. Furthermore, that the probability of the true concentration being less than
   half the RL is about the same as the probability of it being greater than half the RL.

When to use:  In general, simple substitution should be used when the dataset contains a relatively small
   proportion of non-detects, say no more than 10-15%. Use with larger non-detect proportions can
   result in biased estimates, especially if most of the detected concentrations are recorded at low levels
   (e.g., at or near the RL).

Steps involved:  1) Determine the reporting limit; and 2) replace each non-detect with one-half RL as a
   numerical approximation.

Advantages/Disadvantages: Simple substitution of half the RL is the easiest adjustment available for
   non-detect data. However, it can lead to biased estimates of the mean and particularly the variance if
   employed when more than 10-15% of the data are non-detects.

       CENSORED PROBABILITY PLOT (SECTIONS 15.3 AND 15.4)
Basic purpose: Diagnostic tool. It is a graphical  fit to normality of a mixture of detected and non-detect
   measurements.  Adjustments are made to the plotting  positions of the detected data  under the
   assumption that all measurements come from a common distributional model.

Hypothesis tested: As a graphical tool, the censored  probability  plot is not a formal  statistical test.
   However, it can provide an indication as to whether a dataset is consistent with the hypothesis that
   the mixture of detects and non-detects come from the same distribution and that the non-detects
   make up the lower tail of that distribution.

Underlying assumptions: Dataset consists of a mixture of detects  and non-detects, all arising from a
   common distribution. Data must be normal or normalized.

When to use: Use the censored  probability plot to check the viability of the Kaplan-Meier or robust
   regression  on  order statistics [ROS] adjustments for non-detect measurements. If the plot is linear,
   the data are consistent with a model in which the unobserved non-detect concentrations comprise the
   lower tail of the underlying distribution.

Steps involved: 1) Using either Kaplan-Meier or ROS, construct a partial ranking of the detected values
   to account for the presence of non-detects; 2) determine  standard normal quantiles  that match the
   ranking of the detects; and 3) graph the detected values against their matched normal quantiles on a
   probability plot and examine for a linear fit.

Advantages/Disadvantages: The censored probability plot  offers a visual indication  of whether a
   mixture of detects and non-detects come from the same (normal) distribution. There are, however, no
   formal critical points to aid in deciding when the fit is 'linear enough.' Correlation coefficients can
   be computed to informally aid the assessment. Censored probability plots can also be constructed on
   transformed data to help select a normalizing transformation.

       KAPLAN-MEIER ADJUSTMENT (SECTION 15.3)
Basic purpose: Diagnostic tool.  It is used to adjust a mixture of detected and non-detect data for the
   unknown concentrations of  non-detect values.  The Kaplan-Meier procedure leads to adjusted
   estimates for the mean and standard deviation of the underlying population.

Hypothesis tested: As a statistical  adjustment procedure,  the Kaplan-Meier method  is not a formal
   statistical test.  Rather, it allows estimation of characteristics of the population by  assuming the
   combined group of detects and non-detects come from a common distribution.

Underlying assumptions: Dataset consists of a mixture of detects and non-detects, all  arising from the
   same distribution.  Data must be normal or normalized in the context of the Unified  Guidance.
   Kaplan-Meier should not be used when more than 50% of the data are non-detects.

When  to use: Since the Kaplan-Meier adjustment assumes all the measurements arise from the same
   statistical process, but that some of these measurements (i.e., the non-detects) are unobservable due
   to limitations in analytical technology,  Kaplan-Meier should be used when this model is the most
   realistic or reasonable choice.  In particular, when constructing prediction limits, confidence limits, or
   control  charts, the  mean and  standard deviation of the underlying population must be estimated. If
   non-detects  occur  in the dataset (but do not account for  more than half of the observations), the
   Kaplan-Meier adjustment can be used to determine these estimates, which in turn can be utilized in
   constructing the desired statistical test.

Steps involved: 1) Sort the detected values and compute the 'risk set'  associated with each detect; 2)
   using the risk  set, compute  the Kaplan-Meier cumulative distribution  function  [CDF] estimate
   associated with each detect;  3) calculate adjusted  estimates of the population mean and standard
   deviation using the Kaplan-Meier CDF values; and 4)  use these adjusted population estimates in
   place of the sample mean and standard  deviation in prediction limits, confidence limits, and control
   charts.
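
A simplified sketch of these steps is given below. It uses the common 'flipping' device so that the ordinary right-censored product-limit formula can be applied; the function name, the example data, and the convention of placing any leftover probability mass at the smallest observed value are illustrative assumptions rather than the guidance's own algorithm.

```python
import numpy as np

def km_mean_std(conc, detected):
    """Kaplan-Meier mean/std sketch for left-censored data.

    conc     : detected concentrations, or reporting limits for non-detects
    detected : boolean array, True where the value is a detection
    """
    conc = np.asarray(conc, float)
    detected = np.asarray(detected, bool)

    # Flip so non-detects become right-censored, then apply the ordinary
    # product-limit (Kaplan-Meier) formula to the flipped values.
    M = conc.max() + 1.0
    z = M - conc
    surv, x_vals, mass = 1.0, [], []
    for t in np.unique(z[detected]):              # ascending flipped detects
        n_risk = np.sum(z >= t)                   # observations still 'at risk'
        d = np.sum((z == t) & detected)           # detections at this level
        new_surv = surv * (1.0 - d / n_risk)
        x_vals.append(M - t)                      # back on concentration scale
        mass.append(surv - new_surv)              # CDF jump at this detect
        surv = new_surv

    if surv > 1e-12:                              # leftover mass below lowest detect;
        x_vals.append(conc.min())                 # one convention: place it at the minimum
        mass.append(surv)

    x_vals, mass = np.array(x_vals), np.array(mass)
    mean = np.sum(mass * x_vals)
    n = len(conc)
    # n/(n-1) is a small-sample adjustment (one convention among several)
    std = np.sqrt(np.sum(mass * (x_vals - mean) ** 2) * n / (n - 1))
    return mean, std

# Example: two detects (3.0, 5.0) and two non-detects (<1, <2)
print(km_mean_std([5.0, 3.0, 2.0, 1.0], [True, True, False, False]))
```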

Advantages/Disadvantages: Kaplan-Meier offers a way to adjust for significant fractions of non-detects
   without having to  know the actual non-detect concentration values. It is more difficult to use than
   simple substitution, but avoids the biases inherent in that method.

       ROBUST REGRESSION ON ORDER STATISTICS [ROS] (SECTION 15.4)
Basic purpose: Diagnostic tool. It is a method to adjust a mixture of detects and non-detects for the
   unknown concentrations of non-detect values. Robust ROS leads to adjusted estimates for the mean
   and standard deviation of the  underlying population by imputing a distinct estimated value for each
   non-detect.

Hypothesis tested: As a statistical  adjustment procedure, robust ROS is not a formal statistical test.
   Rather, it allows estimation of characteristics of the population by assuming the combined group of
   detects and non-detects come from a common distribution.

Underlying assumptions: Dataset consists of a mixture of detects and non-detects, all arising from the
   same distribution. Data must be normal or normalized in the context of the Unified Guidance.
   Robust ROS should not be used when more than 50% of the data are non-detects.

When to use: Since robust regression on order statistics assumes all the measurements arise from the
   same statistical process, robust ROS should be used when this model is reasonable.  In particular,
   when constructing prediction  limits, confidence limits, or control charts, the mean  and standard
   deviation of the underlying population must be estimated. If non-detects occur in the dataset (but do
   not account for more than half  of the observations), robust ROS can be used to determine these
   estimates, which in turn can be utilized to construct the desired statistical test.

Steps involved: 1) Sort the distinct reporting limits [RL] for non-detect values and compute 'exceedance
   probabilities'  associated with each RL; 2) using the exceedance probabilities, compute 'plotting
   positions' for the non-detects, essentially representing CDF estimates associated  with each RL; 3)
   impute values for individual non-detects based on their RLs and plotting positions; 4) compute
   adjusted mean and standard  deviation estimates via the sample mean and standard deviation of the
   combined set  of detects and imputed non-detects;  and 5) use these adjusted population estimates in
   place of the (unadjusted) sample mean and standard deviation in prediction limits,  confidence limits,
   and control charts.
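
A condensed sketch of these steps for the simplified case of a single reporting limit (with all detects at or above that limit) is shown below; the helper name, the Weibull-type plotting positions, and the lognormal model are assumptions made for illustration only.

```python
import numpy as np
from scipy import stats

def robust_ros_single_rl(detects, n_nd):
    """Robust ROS sketch, single reporting limit assumed.

    detects : detected concentrations (all at or above the reporting limit)
    n_nd    : number of non-detects censored at that limit
    """
    detects = np.sort(np.asarray(detects, float))
    n = len(detects) + n_nd

    # Plotting positions i/(n+1): non-detects occupy the lowest n_nd ranks
    pp_detects = np.arange(n_nd + 1, n + 1) / (n + 1.0)
    pp_nds = np.arange(1, n_nd + 1) / (n + 1.0)

    # Regress log(detects) on the normal quantiles of their plotting positions
    fit = stats.linregress(stats.norm.ppf(pp_detects), np.log(detects))

    # Impute each non-detect from the fitted line at its own plotting position
    imputed = np.exp(fit.intercept + fit.slope * stats.norm.ppf(pp_nds))

    # Adjusted estimates come from the combined detects plus imputed values
    combined = np.concatenate([imputed, detects])
    return combined.mean(), combined.std(ddof=1)

print(robust_ros_single_rl([1.2, 1.9, 2.7, 4.1, 5.5], n_nd=3))
```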

Advantages/Disadvantages: Robust ROS offers an alternative to Kaplan-Meier to adjust for significant
   fractions of non-detects without having to know the  actual non-detect  concentration values. It is
   more difficult to use than simple  substitution, but avoids the biases inherent in that  method.

       COHEN'S  METHOD AND PARAMETRIC ROS (SECTION  15.5)
Basic purpose: Diagnostic tools. These are additional methods for adjusting a mixture of detects and
   non-detects to obtain estimates of the unknown mean and standard deviation for the entire data set.

Hypothesis tested:   Neither technique is a formal statistical test. Rather, they allow  estimation of
   characteristics of the population  by assuming the combined group of detects and  non-detects come
   from a common distribution.

Underlying assumptions: Dataset consists of a mixture of detects and non-detects, all arising from the
   same distribution. Data must be normal or normalized in the context of the Unified Guidance.
   Neither should be used when more than 50% of the data  are non-detects  nor when data contain
   multiple non-detect levels.

When to  use:  Since these methods assume that all the  measurements arise from the  same statistical
   process, they should be used when  this model is  reasonable.  In  particular, when constructing
   prediction limits, confidence  limits, or control charts, the mean and  standard deviation  of the
   underlying population must be estimated. If non-detects occur in the dataset (but do not account for
   more than half of the observations), they can be used to determine these estimates, which in turn can
   be utilized to construct the desired statistical test.

Steps involved: Cohen's Method: 1) data are sorted into non-detect and detected portions; 2) detect
   mean and standard deviation estimates are calculated; 3) intermediate quantities of the ND% and a
   factor γ are calculated and used to locate the appropriate λ value from a table; and 4) full data set
   mean and standard deviation estimates are then obtained using formulas based on the detected mean,
   standard deviation, the detection limit, and λ. Parametric ROS: 1) detected data are sorted in
   ascending order; 2) standardized normal distribution Z-values are generated from the full set of
   ranked values, and those corresponding to the sorted detected values are retained; 3) the detected
   values are then regressed against the Z-values; and 4) the resulting regression intercept and slope are
   the estimates of the mean and standard deviation for the full data set.
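
For the parametric ROS portion, a condensed sketch follows (assuming a single censoring limit with all detects above it, and Blom-type plotting positions; names and data are illustrative). Cohen's method is not sketched here because it relies on a look-up table for λ.

```python
import numpy as np
from scipy import stats

def parametric_ros(detects, n_nd):
    """Parametric ROS sketch: regress the sorted detects on the normal
    scores of their ranks within the full data set (one censoring limit)."""
    detects = np.sort(np.asarray(detects, float))
    n = len(detects) + n_nd
    # Normal quantiles for the full ranked set; the detects occupy the
    # top ranks when there is a single censoring limit.
    probs = (np.arange(1, n + 1) - 0.375) / (n + 0.25)   # Blom-type positions
    z_det = stats.norm.ppf(probs)[n_nd:]
    fit = stats.linregress(z_det, detects)
    return fit.intercept, fit.slope   # estimated mean and standard deviation

print(parametric_ros([1.2, 1.9, 2.7, 4.1, 5.5], n_nd=3))
```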

Advantages/Disadvantages:   These two methods offer alternatives to Kaplan-Meier and robust ROS.
   The key limitation is  that only data containing  a single censoring limit can be used.   In some
   situations using logarithmic data, their application can lead to biased estimates  of the mean and
   standard deviation. Where appropriate, these methods are less computationally intensive than either
   Kaplan-Meier or robust ROS.

       POOLED VARIANCE T-TEST (SECTION 16.1.1)
Basic purpose: Method for detection monitoring. This test compares the means of two populations.

Hypothesis tested: H0 — Means of the two populations are equal; HA — Means of the two populations
   are unequal (for the usual one-sided alternative,  the hypothesis would state that the mean of the
   second  population is greater than the mean of the first population).

Underlying assumptions:  1) The data from  each population must be normal or normalized; 2) when
   used for interwell  tests, there  should be no significant spatial variability; 3)  at least 4 observations
   per well should be available before applying the test; and 4) the two group variances are equal.

When to use: The pooled variance t-test can be used to test for groundwater contamination at very small
   sites, those  consisting of maybe 3  or 4 wells  and monitoring for 1 or 2  constituents.  Site
   configurations with larger combinations of wells and constituents should employ a retesting scheme
   using either prediction limits or control charts. The pooled variance t-test can also be used to test
   proposed updates to intrawell background. A non-significant t-test in this latter case suggests the two
   sets of data are sufficiently similar to allow the initial background to be updated by augmenting with
   more recent measurements.

Steps involved: 1) Test the combined residuals from each population for normality. Make a data
   transformation if necessary; 2) test for equal variances, and if equal, compute a pooled variance
   estimate; 3) compute the pooled variance t-statistic and the degrees of freedom; 4) compare the
   t-statistic against a critical point based on both the α-level and the degrees of freedom; and 5) if the
   t-statistic exceeds the critical point, conclude the null hypothesis of equal means has been violated.
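
These steps can be carried out with standard statistical software; for example, a minimal sketch using scipy (hypothetical data; the one-sided alternative option requires a reasonably recent scipy release):

```python
from scipy import stats

# Hypothetical background and compliance-well samples (already normalized)
background = [4.2, 3.9, 4.5, 4.1, 4.3, 4.0]
compliance = [4.6, 4.9, 4.4, 4.8]

# Pooled-variance t-test; the one-sided alternative asks whether the
# compliance mean exceeds the background mean
t_stat, p_value = stats.ttest_ind(compliance, background,
                                  equal_var=True, alternative='greater')
print(t_stat, p_value)   # reject the null hypothesis if p_value < alpha
```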

Advantages/Disadvantages: 1) The pooled variance t-test is one of the easiest to compute t-test
   procedures, but requires an assumption of equal variances across both populations; 2) because the t-
   test is a well-understood statistical procedure, the Unified  Guidance recommends its  use at very
   small groundwater monitoring facilities.  For larger sites, however, repeated use  of the t-test at a
   given a-level will  lead to an unacceptably high  risk of false positive error; and  3) if substantial
   spatial variability exists, the use of any t-test for upgradient-to-downgradient comparisons may lead
   to inaccurate conclusions. A significant difference in the population averages could also indicate the
   presence of natural geochemical factors differentially affecting the concentration levels  at different
   wells. In these situations, consider an intrawell test instead.

       WELCH'S T-TEST (SECTION 16.1.2)
Basic purpose: Method for detection monitoring. This test compares the means of two populations.

Hypothesis tested: H0 — Means of the two populations are equal; HA — Means of the two populations
   are unequal  (for  the usual one-sided alternative, the hypothesis would state that the mean of the
   second population is greater than the mean of the first population).

Underlying assumptions:  1) The data from each population must be normal or normalized; 2) when
   used for interwell tests, there should be no significant spatial variability; and 3) at least 4
   observations per well should be available before applying the test.

When to use: Welch's t-test can be used to test for groundwater contamination at very small sites, those
   consisting of maybe 3 or 4 wells and monitoring for 1 or 2 constituents. Site configurations with
   larger  combinations of wells  and constituents  should employ  a retesting scheme using  either
   prediction limits or control  charts. Welch's t-test  can also  be used to test proposed  updates to
   intrawell background data. A non-significant t-test in this latter case suggests the two sets of data are
   sufficiently similar to allow the initial background to be updated by augmenting with the more recent
   measurements.

Steps involved:  1)  Test the combined  residuals from each population for normality. Make  a data
   transformation if necessary; 2) compute Welch's t-statistic and approximate degrees of freedom; 3)
   compare the t-statistic against a critical point based on both the α-level and the estimated degrees of
   freedom; and 4) if the t-statistic exceeds the critical point, conclude the null hypothesis of equal
   means has been violated.

Advantages/Disadvantages: 1) Welch's  t-test is slightly more difficult to compute than other common
   t-test procedures, but has the advantage of not  requiring equal variances across both populations.
   Furthermore, it has been shown to perform statistically as well or better than other t-tests; 2) it can be
   used at  very  small groundwater monitoring facilities, but should be avoided at larger sites. Repeated
   use of the t-test at a given α-level will lead to an unacceptably high risk of false positive error; and 3)
   if there is substantial  spatial  variability, use  of Welch's t-test  for interwell tests  may lead to
   inaccurate conclusions.  A significant difference  in the population averages may reflect the presence
   of natural geochemical  factors  differentially affecting the  concentration  levels  at different wells. In
   these situations, consider an intrawell test instead.

       WILCOXON RANK-SUM TEST (SECTION  16.2)
Basic purpose: Method for detection monitoring. This test compares the medians of two populations.

Hypothesis tested: H0 — Both populations have equal medians (and, in fact, are identical in
   distribution). HA — The two population medians are unequal (in the usual one-sided alternative, the
   hypothesis would state  that the median of the second population  is larger than the median of the
   first).

Underlying assumptions: 1) While the Wilcoxon rank-sum test does not require normal data, it does
   assume both populations have the same distributional form and that the variances are equal.  If the
   data are non-normal but there are at most a few non-detects, the equal variance assumption may be
   tested through the use of box plots and/or Levene's test. If non-detects make up a large fraction of
   the observations, equal variances may have to be assumed rather than formally verified; 2) use  of the
   Wilcoxon rank-sum procedure for interwell tests assumes there is no significant spatial variability.
   This is  more likely to be the case in precisely those circumstances where the Wilcoxon procedure
   might be used: when there are high fractions  of non-detects, so  that most  of the concentration
   measurements at any location are at low levels;  and 3) there  should be at least 4  background
   measurements and at least 2-4 compliance point values.

When to use: The Wilcoxon rank-sum test can be used to test for groundwater contamination at very
   small sites, those  consisting of maybe 3 or  4 wells and monitoring for 1 or 2  constituents. Site
   configurations with larger combinations of wells and constituents should employ a retesting scheme
   using non-parametric prediction limits. Note, however, that non-parametric prediction limits often
   require  large background sample sizes to be effective. The Wilcoxon rank-sum  can be useful when a
   high percentage of the data is non-detect, but the amount of available background data is limited.
   Indeed, an intrawell Wilcoxon procedure may be helpful in some situations where the false positive
   rate would otherwise be too high to run intrawell prediction limits.

Steps involved: 1) Rank the combined set of values from the two datasets, breaking ties if necessary by
   using midranks; 2) compute the sum of the ranks from the compliance point well and calculate the
   Wilcoxon test statistic; 3) compare the Wilcoxon test statistic against an α-level critical point; and 4)
   if the test statistic  exceeds the critical point, conclude that the null hypothesis of equal medians  has
   been violated.
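
A minimal sketch using scipy follows (hypothetical data; the Mann-Whitney U statistic is an equivalent formulation of the Wilcoxon rank-sum test, and ties such as common non-detect values are handled with midranks):

```python
from scipy import stats

background = [1.0, 1.0, 1.0, 2.3, 1.7, 1.0]   # hypothetical; ties from non-detects
compliance = [2.9, 1.0, 3.4, 2.2]

# One-sided test: is the compliance-well median larger than background?
u_stat, p_value = stats.mannwhitneyu(compliance, background, alternative='greater')
print(u_stat, p_value)
```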

Advantages/Disadvantages:  1) The  Wilcoxon rank-sum test is an excellent technique for small sites
   with constituent non-detect data. Compared to other possible methods such as the test of proportions
   or exact binomial  prediction limits, the Wilcoxon rank-sum  does a better job overall  of correctly
   identifying elevated groundwater concentrations while limiting false positive error; 2) because the
   Wilcoxon rank-sum is easy to compute and understand, the Unified Guidance recommends its  use at
   very small groundwater monitoring facilities. For larger sites, repeated use of the Wilcoxon rank-
   sum at a given α-level will lead to an unacceptably high risk of false positive error; and 3) if
   substantial spatial variability exists, the use of the Wilcoxon rank-sum for interwell tests may lead to
   inaccurate conclusions.  A significant difference in the population medians may signal the presence
   of natural geochemical  differences  rather  than  contaminated groundwater.  In  these situations,
   consider an intrawell test instead.

       TARONE-WARE TEST (SECTION  16.3)
Basic purpose: Non-parametric method for detection monitoring. This is an extension of the Wilcoxon
   rank-sum test, offering an alternative way to compare the medians of two populations when non-detects
   are prevalent.

Hypothesis tested: H0 — Both populations have equal medians (and, in fact, are identical in
   distribution). HA — The two population medians are unequal (in the usual one-sided alternative, the
   hypothesis would  state that the median of the second population is larger than the median of the
   first).	

Underlying assumptions: 1) The Tarone-Ware test does not require normal data, but does assume both
   populations have the same distributional form and that the variances are equal;  and 2) use of the
   Tarone-Ware procedure for interwell tests assumes there is no significant spatial variability. This is
   more likely to be the case when there are high fractions of non-detects, so that most of the
   concentration measurements at any location are at low and similar levels.

When to use: The Tarone-Ware test can be used to test for groundwater contamination at very small
   sites, those consisting  of perhaps 3 or  4  wells  and  monitoring  for  1  or 2  constituents. Site
   configurations with larger combinations of wells and  constituents should employ a retesting scheme
   using non-parametric prediction limits. Note, however, that non-parametric prediction limits often
   require  large background sample sizes to be effective. The Tarone-Ware test can be useful when a
   high percentage of the data is non-detect, but the amount of available background data is limited.
   The Tarone-Ware test is also an  alternative to the  Wilcoxon  rank-sum when there are multiple
   reporting limits and/or it is unclear how to fully rank the data as required by the Wilcoxon.

Steps involved: 1) Sort the distinct detected values in the combined data set; 2) count the  'risk set'
   associated  with each distinct value from Step 1 and compute the expected number of compliance
   point detections within each risk set; 3) form the Tarone-Ware test statistic from the expected counts
   in Step 2; 4) compare the test statistic against a standard normal α-level critical point; and 5) if the
   test statistic exceeds the critical  point, conclude that the null hypothesis of equal  medians has been
   violated.

Advantages/Disadvantages:  The  Tarone-Ware test is  an excellent technique  for  small sites  with
   constituent non-detect data having multiple reporting limits.  If substantial spatial variability exists,
   use of the  Tarone-Ware  test for interwell tests may  lead to inaccurate conclusions.  A significant
   difference  in the population medians may signal the presence  of natural geochemical differences
   rather than contaminated groundwater. In these situations, consider an intrawell test instead.

       ONE-WAY ANALYSIS OF VARIANCE [ANOVA] (SECTION 17.1.1)
Basic purpose: Formal interwell detection monitoring test and diagnostic tool.  It compares population
   means  at  multiple  wells,  in  order  to  detect  contaminated  groundwater when  tested against
   background.

Hypothesis tested: H0 — Population means across all tested wells are equal. HA — One or more pairs
   of population means are unequal.

Underlying assumptions: 1) ANOVA residuals at each well or population must be normally distributed
   or  transformable  to  normality. These  should be tested for  normality using  a goodness-of-fit
   procedure;  2) the population variances across all wells must be equal.  This assumption can be tested
   with box plots and Levene's test;  and  3) each tested well should have at least 3  to 4  separate
   observations.

When to use: The one-way ANOVA can sometimes be used to simultaneously test for
   contaminated groundwater across a group  of distinct  well locations. As an inherently interwell test,
   ANOVA should be utilized only on constituents exhibiting little to no spatial variation. Most uses of
   ANOVA have been superseded by prediction limits and control charts, although it is commonly
   employed to identify spatial variability or temporal dependence across a group of wells.

Steps involved: 1) Form the ANOVA residuals by subtracting from each measurement its sample well
   mean; 2) test the ANOVA residuals for normality and equal variance. If either of these assumptions
   is violated, try a transformation of the data and retest the assumptions; 3) compute the one-way
   ANOVA F-statistic; 4) if the F-statistic exceeds an α-level critical point, conclude the null
   hypothesis of equal population means has been violated and that at least  one pair of wells shows a
   significant difference in concentration levels;  and 5) test each compliance well individually to
   determine which one or more exceeds background.
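
A minimal sketch of the F-test portion of these steps (hypothetical data; the residual checks and the follow-up well-by-well comparisons are not shown):

```python
from scipy import stats

# Hypothetical measurements from background and three compliance wells
bg    = [3.1, 2.8, 3.4, 3.0]
well1 = [3.3, 3.0, 3.2, 3.1]
well2 = [4.8, 5.1, 4.6, 5.0]
well3 = [3.2, 2.9, 3.3, 3.1]

f_stat, p_value = stats.f_oneway(bg, well1, well2, well3)
print(f_stat, p_value)   # a small p-value flags at least one differing well mean;
                         # follow with individual well-versus-background contrasts
```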

Advantages/Disadvantages: ANOVA is likely to be used only infrequently to make upgradient-to-
   downgradient comparisons in formal detection monitoring testing. The regulatory restrictions for
   per-constituent α-levels using ANOVA make it difficult to adequately control site-wide false positive
   rates [SWFPR]. Even if spatial variability is not a significant problem, users are advised to consider
   interwell prediction limits or control charts, and to incorporate some form of retesting.

       KRUSKAL-WALLIS TEST (SECTION 17.1.2)
Basic purpose: Formal interwell  detection monitoring test and diagnostic tool. It compares population
   medians at multiple  wells,  in  order to detect  contaminated groundwater  when tested  against
   background.  It is  also  useful as a non-parametric alternative to  ANOVA for identifying spatial
   variability in constituents with non-detects or for data that cannot be normalized.

Hypothesis tested: H0 — Population medians across all tested wells are equal. HA — One or more pairs
   of population medians are unequal.

Underlying assumptions:  1) As a non-parametric alternative to ANOVA, data need not be normal; 2)
   the population variances across all wells must be equal. This assumption can be tested with box plots
   and Levene's test if the non-detect proportion is not too high; and 3) each tested well should have at
   least 3 to 4 separate observations.

When to use: The Kruskal-Wallis test can sometimes be used to simultaneously test for
   contaminated groundwater across a group of distinct well locations. As an inherently interwell test,
   Kruskal-Wallis  should  be utilized for this purpose  only  with constituents exhibiting little to  no
   spatial variation. Most  uses of the Kruskal-Wallis (similar to ANOVA) have been superseded by
   prediction limits, although it can be used to identify spatial variability and/or temporal dependence
   across a group of wells when the  sample data  are non-normal or  have higher proportions  of non-
   detects.

Steps involved: 1) Sort and form the ranks of the combined measurements; 2) compute the rank-based
   Kruskal-Wallis test statistic (H); 3) if the H-statistic exceeds an α-level critical point, conclude the
   null hypothesis of equal population medians has been violated and that at least one pair of wells
   shows a significant difference in concentration levels; and 4) test each compliance well individually
   to determine which one or more exceeds background.
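
A minimal sketch of the test itself (hypothetical data; the follow-up well-by-well comparisons are not shown):

```python
from scipy import stats

bg    = [1.0, 1.0, 1.4, 1.2]      # hypothetical data; non-detects set to the RL
well1 = [1.0, 1.3, 1.1, 1.0]
well2 = [2.4, 2.9, 1.0, 3.1]

h_stat, p_value = stats.kruskal(bg, well1, well2)
print(h_stat, p_value)   # a small p-value: at least one well's median differs
```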

Advantages/Disadvantages: 1) The Kruskal-Wallis test is likely to be used only infrequently to make
   upgradient-to-downgradient comparisons in formal detection monitoring testing. The regulatory
   restrictions for per-constituent α-levels using ANOVA make it difficult to adequately control the
   SWFPR. Even if spatial variability is not a significant problem, users are advised to consider
   interwell prediction limits, and to incorporate some form of retesting; and 2) the Kruskal-Wallis test
   can be used to test for spatial variability in constituents with significant fractions of non-detects.

       TOLERANCE LIMIT (SECTION 17.2.1)
Basic  purpose: Formal interwell detection monitoring test of background versus  one  or  more
   compliance wells. Tolerance limits can be used as  an alternative to one-way ANOVA. These can
   also be used in corrective action as an alternative clean-up limit.

Hypothesis tested: H0 — Population means across all tested wells are equal. HA — One or more pairs
   of population means are unequal.

Underlying assumptions: 1) Data should be normal or normalized;  2) the population variances across
   all wells are assumed to be equal.  This assumption can be difficult to test when comparing a single
   new observation from each compliance well against a tolerance limit based on background; and  3)
   there should be a minimum of 4 background measurements, preferably 8-10 or more.

When  to  use: A tolerance limit can  be used  in place of ANOVA  for  detecting contaminated
   groundwater. It  is more flexible  than ANOVA since 1) as few  as  one  new measurement per
   compliance well is needed to run  a tolerance limit  test, and 2) no post-hoc testing is necessary  to
   identify which compliance wells are elevated over background. Most uses of tolerance limits (similar
   to ANOVA) have been superseded by prediction limits, due to difficulty of incorporating retesting
   into tolerance limit schemes. If a  hazardous  constituent requires  a background-type standard  in
   compliance/assessment or corrective action, a tolerance limit can be computed on background and
   used as a fixed GWPS.

Steps involved: 1) Compute background sample mean and  standard deviation; 2) calculate upper
   tolerance limit on background with  high  confidence and  high  coverage;  3) collect one or more
   observations from each compliance well and test each against the tolerance limit;  and 4) identify a
   well as contaminated if any of its observations exceed the tolerance limit.
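
A sketch of steps 1 and 2 using the standard one-sided normal tolerance factor based on the noncentral t distribution (hypothetical data; the default coverage and confidence values shown are illustrative choices):

```python
import numpy as np
from scipy import stats

def upper_tolerance_limit(background, coverage=0.95, confidence=0.95):
    """Parametric upper tolerance limit sketch (background assumed normal)."""
    x = np.asarray(background, float)
    n = len(x)
    z_cov = stats.norm.ppf(coverage)
    # One-sided tolerance factor from the noncentral t distribution
    k = stats.nct.ppf(confidence, df=n - 1, nc=z_cov * np.sqrt(n)) / np.sqrt(n)
    return x.mean() + k * x.std(ddof=1)

print(upper_tolerance_limit([4.1, 3.8, 4.4, 4.0, 4.2, 3.9, 4.3, 4.1]))
```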

Advantages/Disadvantages: Tolerance limits are likely to be used only infrequently as either
   interwell or intrawell tests.  Prediction limits or control charts offer better control of false positive
   rates, and less is known about the impact of retesting on tolerance  limit performance.

       NON-PARAMETRIC TOLERANCE LIMIT (SECTION 17.2.2)
Basic  purpose: Formal interwell detection monitoring test of background versus  one  or  more
   compliance wells. Non-parametric tolerance limits can be used as an alternative to the Kruskal-
   Wallis  test. They may  also be used in  compliance/assessment or corrective action to define a
   background GWPS.

Hypothesis tested: H0 — Population medians across all tested wells are equal. HA — One or more pairs
   of population medians are unequal.

Underlying assumptions: 1) As a non-parametric  test, non-normal data with non-detects can be used;
   and 2) there should be a minimum of 8-10 background measurements and preferably more.

When to  use:  A non-parametric tolerance limit can be used in place of the Kruskal-Wallis test for
   detecting contaminated groundwater. It is more flexible than Kruskal-Wallis since 1) as few as one
   new measurement per compliance well is needed to run a tolerance limit test, and 2)  no post-hoc
   testing is necessary to identify which compliance wells are elevated over background. Most uses of
   tolerance limits have been superseded by prediction limits, due to difficulty of incorporating retesting
   into tolerance limit schemes. However, when a clean-up limit cannot or has not been  specified in
   corrective action,  a tolerance limit can be computed on background and used as  a  site-specific
   alternate concentration limit [ACL].

Steps involved: 1) Compute a large order statistic from background and set this value as the upper
   tolerance limit;  2) calculate the confidence and coverage associated with the tolerance limit; 3)
   collect one or more observations from each compliance well and test each against the tolerance limit;
   and 4) identify a well as contaminated if any of its observations exceed the tolerance limit.
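
Step 2 can be sketched as follows; the confidence that the r-th largest of n background values covers at least the stated fraction of the population follows from the binomial distribution (the function name and example values are illustrative):

```python
from scipy import stats

def np_tolerance_confidence(n, coverage=0.95, r=1):
    """Confidence that the r-th largest of n background values covers at
    least `coverage` of the population (r=1 uses the maximum). Sketch only."""
    return stats.binom.cdf(n - r, n, coverage)

# With n = 20 background values and the maximum used as the limit:
print(np_tolerance_confidence(20, coverage=0.95))   # about 0.64 confidence
```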

Advantages/Disadvantages: 1) Tolerance limits are likely to be used only infrequently as
   either  interwell  or intrawell tests.   Prediction limits or control charts offer better control of false
   positive rates, and less is known about the impact of retesting on tolerance limit performance; and 2)
   non-parametric tolerance limits have the added disadvantage of generally requiring large background
   samples to ensure adequate confidence and/or coverage. For this reason, it is strongly recommended
   that a parametric tolerance limit be constructed whenever possible.

       LINEAR  REGRESSION  (SECTION 14.4)
Basic purpose: Method for detection monitoring and diagnostic tool.  It is used to identify the presence
   of a significantly increasing trend at a compliance point or any trend in background data sets.

Hypothesis tested: H0 — No discernible linear trend exists in the concentration data over time. HA — A
   non-zero (upward) linear component to the trend does exist.

Underlying assumptions: Trend residuals should be  normal or normalized, equal  in variance, and
   statistically independent. If a small fraction of non-detects exists (<10-15%), use simple substitution
   to replace each non-detect by half the reporting limit [RL]. Test homoscedasticity of residuals with a
   scatter plot (Section 9.1).

When to use: Use a test for trend when 1) upgradient-to-downgradient comparisons are inappropriate so
   that intrawell tests are called for, and 2) a control chart or intrawell prediction limit cannot be used
   because of possible  trends in the intrawell background. A trend test can be particularly helpful at
   sites  with recent or  historical contamination  where it is  uncertain to what  degree  intrawell
   background is already contaminated. The presence of an upward trend in these cases will document
   the changing nature of the concentration data much more accurately than either a control chart or
   intrawell prediction limit, both of which assume a stable baseline concentration.

Steps involved: 1) If a linear  trend is evident on a time series plot, construct the linear regression
   equation; 2) subtract  the estimated trend  line  from each observation to form residuals;  3) test
   residuals for assumptions  listed  above;  and 4) test regression slope to  determine whether it is
   significantly different from  zero.  If so and the slope is positive, conclude there is evidence  of a
   significant upward trend.
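
A minimal sketch of these steps (hypothetical data; the one-sided p-value for an upward trend is derived from the two-sided value reported by scipy, and the residual checks are left to the diagnostics described above):

```python
import numpy as np
from scipy import stats

# Hypothetical sampling dates (decimal years) and concentrations
t = np.array([2004.0, 2004.5, 2005.0, 2005.5, 2006.0, 2006.5, 2007.0])
conc = np.array([3.1, 3.4, 3.3, 3.9, 4.2, 4.1, 4.6])

res = stats.linregress(t, conc)
residuals = conc - (res.intercept + res.slope * t)   # check these for normality
one_sided_p = res.pvalue / 2 if res.slope > 0 else 1 - res.pvalue / 2
print(res.slope, one_sided_p)   # small p with a positive slope: upward trend
```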

Advantages/Disadvantages: Linear regression is a standard statistical method for identifying trends and
   other linear associations between pairs of random variables.  However, it requires approximate
   normality  of  the  trend residuals.  Confidence bands  around regression trends  can be used in
   compliance/assessment and corrective action to determine compliance with fixed standards  even
   when concentration levels are actively changing (i.e., when a trend is apparent).

       MANN-KENDALL TEST  FOR TREND (SECTION 17.3.2)
Basic purpose: Method for detection monitoring and diagnostic tool. It is used to identify the presence
   of a significant (upward) trend at a compliance point or any trend in background data.

Hypothesis tested: H0 — No discernible linear trend exists in the concentration data over time. HA — A
   non-zero (upward) linear component to the trend does exist.

Underlying assumptions: Since the Mann-Kendall  trend  test  is  a  non-parametric method,  the
   underlying data need not be normal or follow any particular distribution. No special  adjustment for
   ties is needed.

When  to use: Use a test for trend when 1) interwell tests are inappropriate  so that intrawell tests are
   called for,  and 2)  a control chart or intrawell prediction limit cannot be  used because  of possible
   trends in intrawell background. A trend  test can be  particularly  helpful at sites with recent or
   historical contamination where it is uncertain if intrawell background is already contaminated. An
   upward trend in these cases documents changing concentration levels more accurately than either a
   control  chart or intrawell  prediction  limit, both of which assume a  stationary background mean
   concentration.

Steps involved:  1) Sort the data values by time of sampling/collection; 2) consider all possible pairs of
   measurements from different sampling events; 3) score each pair depending on whether the later data
   point is higher or lower in concentration than the earlier one, and sum the scores to get the Mann-
   Kendall statistic; 4) compare this statistic against an α-level critical point; and 5) if the statistic
   exceeds the critical point, conclude that a significant upward trend exists. If not, conclude there is
   insufficient evidence for identifying a significant, non-zero trend.
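
A minimal sketch of these steps using the large-sample normal approximation (illustrative data; the tie correction to the variance and the exact small-sample critical points are omitted for brevity):

```python
import numpy as np
from scipy import stats

def mann_kendall(conc):
    """Mann-Kendall S statistic with a normal approximation (sketch)."""
    x = np.asarray(conc, float)
    n = len(x)
    # Score every pair of sampling events: +1 if the later value is higher
    s = sum(np.sign(x[j] - x[i]) for i in range(n - 1) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    z = (s - np.sign(s)) / np.sqrt(var_s)       # continuity-corrected
    p_one_sided = 1 - stats.norm.cdf(z)         # test for an upward trend
    return s, p_one_sided

# Data must be ordered by sampling date
print(mann_kendall([1.2, 1.5, 1.4, 1.9, 2.1, 2.0, 2.6, 2.8]))
```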

Advantages/Disadvantages: The Mann-Kendall test does not require any special treatment for  non-
   detects, only that all non-detects can be set to a common value lower than any of the detects.   The
   test is easy to compute and reasonably efficient for detecting (upward) trends. Exact critical points
   are provided in the Unified Guidance for n < 20; a normal approximation can be used for n > 20.
   A version of the Mann-Kendall test (the seasonal Mann-Kendall, Section 14.3.4) can be used to test
   for trends in data that exhibit seasonality.

       THEIL-SEN TREND LINE (SECTION 17.3.3)
Basic  purpose: Method  for  detection monitoring.  This is  a non-parametric  alternative to linear
   regression for estimating a linear trend.

Hypothesis tested: As  presented in the Unified Guidance,  the Theil-Sen trend line is not a formal
   hypothesis test but rather an estimation procedure. The algorithm can be  modified to formally test
   whether the true slope is significantly different from zero, but this question will already be answered
   if used in conjunction with the Mann-Kendall procedure.
                                              8^32                                    March 2009

-------
Chapter 8. Methods Summary	Unified Guidance

Underlying assumptions: Like the Mann-Kendall trend test, the Theil-Sen trend line is non-parametric,
   so the underlying data need not be normal or follow a particular distribution. Furthermore, data ranks
   are not used, so no special adjustment for ties is needed.

When to use: It is particularly helpful when used in conjunction with the Mann-Kendall test for trend.
   The latter test offers information about whether a trend exists, but does not estimate the trend line
   itself. Once a  trend is identified, the Theil-Sen procedure indicates how quickly the concentration
   level is  changing with time.

Steps involved: 1) Sort the data set by date/time of sampling; 2)  for each pair of distinct sampling
   events,  compute the simple pairwise slope; 3) sort the list of pairwise slopes and set the overall slope
   estimate (Q) as the median slope in this list; 4) compute the median concentration and  the median
   date/time of sampling; and 5) construct the Theil-Sen trend as the line passing through  the median
   scatter point from Step 4 with slope Q.
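
A minimal sketch of these steps (illustrative helper and data; scipy.stats.theilslopes performs an equivalent built-in calculation):

```python
import numpy as np

def theil_sen(times, conc):
    """Theil-Sen trend line sketch: median pairwise slope, with the line
    passing through the median time/median concentration point."""
    t = np.asarray(times, float)
    x = np.asarray(conc, float)
    slopes = [(x[j] - x[i]) / (t[j] - t[i])
              for i in range(len(t) - 1) for j in range(i + 1, len(t))
              if t[j] != t[i]]
    q = np.median(slopes)
    intercept = np.median(x) - q * np.median(t)
    return q, intercept   # trend line: conc = intercept + q * time

print(theil_sen([0, 1, 2, 3, 4, 5], [1.0, 1.4, 1.2, 1.9, 2.3, 2.1]))
```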

Advantages/Disadvantages: Although non-parametric, the Theil-Sen slope estimator does not use data
   ranks but rather the concentrations themselves. The method is non-parametric because  the median
   pairwise slope is utilized, thus ignoring extreme values that might otherwise skew the slope estimate.
   The Theil-Sen trend line is as easy to compute as the Mann-Kendall test and does not require any
   special adjustment for ties (e.g., non-detects).

       PREDICTION LIMIT FOR M FUTURE VALUES (SECTION 18.2.1)
Basic purpose: Method for detection monitoring. This technique estimates numerical bound(s) on a
   series of m independent future values. The prediction limit(s) can be used to test whether the means of
   one or more compliance well populations are equal to the mean of a background population.

Hypothesis tested: H0 — The true mean of m future observations arises from the same population as the
   mean of measurements used to construct the prediction limit. HA — The m future observations come
   from a  distribution with a different mean than the population  of measurements. Since an upper
   prediction limit is of interest in detection monitoring, the alternative hypothesis would state that the
   future observations are distributed with a larger mean than the background population.

Underlying assumptions: 1) Data used to construct the prediction limit must be normal or normalized.
   Adjustments for small to moderate fractions of non-detects can be made, perhaps using Kaplan-
   Meier or robust ROS; 2) although the variances of both populations (background and future values)
   are assumed to be equal, rarely will there be  enough data from the future population to verify this
   assumption except during periodic  updates to background; and 3) if used for  upgradient-to-
   downgradient comparisons, there should be no significant spatial variability.

When to use: Prediction  limits  on individual observations can be  used as an alternative in detection
   monitoring  to either  one-way  ANOVA  or Dunnett's  multiple comparison with control  [MCC]
   procedure. Assuming there is  insignificant natural spatial variability, an interwell prediction limit can
   be  constructed using  upgradient or other representative background data.  The  number of future
   samples (m) should be chosen to reflect a single new observation collected from each downgradient
   or compliance well prior to the next statistical evaluation, plus a fixed number (m-1) of possible
   resamples. The  initial  future observation at each compliance point is then  compared  against the
   prediction limit. If it  exceeds the  prediction  limit,  one or more resamples are collected from the
    'triggered' well and also tested against the prediction limit. If substantial spatial variability exists,
    prediction  limits for individual values can be constructed on a well-specific basis using intrawell
    background. The larger the intrawell background size, the better. To incorporate retesting, it must be
   feasible to collect up to (m-1) additional, but independent, resamples from each well.

Steps involved: 1)  Compute the estimated mean and standard deviation of the background data; 2)
    considering the type of prediction limit (i.e., interwell or intrawell), the number of future samples m,
    the desired site-wide false  positive rate, and the number of wells and monitoring parameters,
    determine the prediction limit multiplier (K); 3) compute the prediction limit as the background mean
    plus K times the background standard deviation; and 4)  compare each initial future observation
    against  the prediction limit. If both the  initial  measurement  and  resample(s) exceed the limit,
    conclude the null hypothesis of equal means has been violated.
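
For the simplest case of a single future observation with no retesting, the prediction limit formula can be sketched as below (hypothetical data; the K multipliers for 1-of-m retesting plans and site-wide error-rate targets come from the guidance's tables rather than this basic textbook formula):

```python
import numpy as np
from scipy import stats

def prediction_limit_single(background, alpha=0.01):
    """Upper prediction limit for one future observation (no retesting)."""
    x = np.asarray(background, float)
    n = len(x)
    k = stats.t.ppf(1 - alpha, df=n - 1) * np.sqrt(1 + 1.0 / n)
    return x.mean() + k * x.std(ddof=1)

print(prediction_limit_single([4.1, 3.8, 4.4, 4.0, 4.2, 3.9, 4.3, 4.1]))
```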

Advantages/Disadvantages: Prediction  limits for individual values offer several advantages compared
    to the traditional one-way  ANOVA and Dunnett's multiple comparison  with control  [MCC]
    procedures. Prediction limits are not bound to a minimum 5% per-constituent false positive rate and
    can be  constructed  to meet a target site-wide  false positive rate  [SWFPR]  while  maintaining
    acceptable statistical power. Unlike the one-way ANOVA F-test, only the comparisons of interest
    (i.e., each  compliance point against  background) are tested.  This gives the prediction limit more
    statistical power.  Prediction limits can be designed for intrawell as well as interwell comparisons.

       PREDICTION LIMIT FOR  FUTURE MEAN (SECTION 18.2.2)
Basic purpose: Method for  detection monitoring or  compliance  monitoring. It  is used to  estimate
   numerical limit(s) on an independent mean constructed from p future values. The prediction limit(s)
    can be  used to test whether the  mean  of one population is equal to  the  mean  of a  separate
    (background) population.

Hypothesis tested: H0 — The true mean of p future observations arises from the same population as the
   mean of measurements used to construct the prediction limit. HA — The p future observations come
    from a distribution with a different mean than the population of background measurements.  Since an
    upper prediction limit is of interest in both detection and compliance  monitoring, the alternative
    hypothesis would state that the future observations are distributed with a larger mean than that of the
    background population.

Underlying assumptions: 1) Data used to construct the prediction  limit must be normal or normalized.
    Adjustments for small to moderate fractions of  non-detects can be made, perhaps using Kaplan-
    Meier or robust ROS; 2) although the variances of both populations (background and future values)
    are assumed to be equal, rarely will  there  be  enough data  from the future population to verify this
    assumption; and 3)  if used  for  upgradient-to-downgradient  comparisons, there should be  no
    significant spatial variability.

When to use:  Prediction limits on means can be used as an alternative in detection  monitoring to either
    one-way ANOVA or Dunnett's multiple comparison with control [MCC] procedure. Assuming there
    is insignificant natural spatial  variability, an interwell prediction limit can be constructed using
    upgradient or  other representative background data. The  number of future samples p should be
    chosen to reflect the number of samples that will be collected at each compliance well prior to the
   next statistical evaluation (e.g., 2, 4, etc.). The average of these p observations at each compliance
   point is then compared against the prediction limit. If it is feasible to collect at least p additional, but
   independent, resamples from each well, retesting can be incorporated into the procedure by using
   independent mean(s) of p samples as confirmation value(s).

    If substantial spatial variability exists, prediction limits for means can be constructed on a well-
   specific basis using intrawell background. At least  two future values must be available per well.
   Larger intrawell background sizes are preferable. To incorporate retesting, it must be feasible to
   collect at least p independent resamples from each well, in addition to the initial set of p samples. A
   prediction limit can also be used in some compliance monitoring settings when a fixed compliance
   health-based limit cannot be used and the compliance point data must be compared directly to a
   background GWPS. In this case, the compliance point mean concentration is tested against an upper
   prediction limit computed from background.  No retesting would be employed for this latter kind of
   test.

Steps involved: 1) Compute the background sample mean and standard deviation; 2) considering the
   type of prediction limit (i.e., interwell or intrawell), the number of future samples p, use of retesting,
   the desired site-wide false positive rate, and the number of wells and monitoring parameters,
   determine the prediction limit multiplier (K); 3) compute the prediction limit as the background mean
   plus K times the background standard deviation; 4) compare each future mean of order p (i.e., a mean
   constructed from p values) against the prediction limit; and 5) if the future mean exceeds the limit
   and retesting is not feasible (or if used for compliance monitoring), conclude the null hypothesis of
   equal means has been violated. If retesting is feasible, conclude the null hypothesis has been violated
   only when the resampled mean(s) of order p also exceeds the prediction limit.

Advantages/Disadvantages:  Prediction limits  on means  offer several  advantages compared to the
   traditional  one-way ANOVA and Dunnett's multiple comparison with control  [MCC] procedure:
   Prediction  limits are not  bound to a minimum 5% per-constituent  false  positive rate.  As such,
   prediction limits can be constructed to meet a target SWFPR, while maintaining acceptable statistical
   power. Unlike the one-way F-test, only the comparisons of interest (i.e., each compliance point
   against background) are tested,  giving the prediction limit more statistical power. Prediction limits
   can be designed  for intrawell  as well  as  interwell  comparisons. One slight  disadvantage is that
   ANOVA combines compliance point data with background  to  give  a somewhat better  per-well
   estimate of variability. But  even this disadvantage can be  overcome when using  an interwell
   prediction limit by first running ANOVA on the combined background and compliance point data to
   generate a  better variance estimate with a larger degree of freedom.  A disadvantage compared to
   prediction limits on individual future values is that two  or more new  compliance point observations
   per well must be available to run the prediction  limit on means.  If only one new measurement per
   evaluation  period can be collected, the user should instead construct a prediction  limit on individual
   values.

       NON-PARAMETRIC PREDICTION LIMIT FOR M FUTURE VALUES (SECTION 18.3.1)
Basic purpose: Method for detection monitoring. It is a  non-parametric technique to estimate numerical
   limit(s) on a series of m independent future values. The prediction limit(s) can be used to test
   whether two samples are drawn  from the same or different populations.

Hypothesis tested: H0 — The m future observations come from the same distribution as the
   measurements used to construct the prediction limit.  HA — The m future observations  come from a
   different distribution than the population of measurements used to build the prediction limit. Since
   an upper prediction limit is of interest in detection monitoring, the alternative hypothesis is that the
   future observations are distributed with a larger median than the background population.

Underlying assumptions: 1) The  data  used  to construct the  prediction limit need not be normal;
   however, the forms of both the background distribution and the future distribution are assumed to
   be  the  same. Since the non-parametric prediction  limit is constructed  as an order statistic of
   background,  high fractions  of non-detects  are  acceptable; 2)  although  the variances  of both
   populations (background and future values) are assumed to be equal, rarely will there be enough data
   from the future population to verify this assumption; and 3)  if used for upgradient-to-downgradient
   comparisons, there should be no significant spatial variability. Spatial variation is less likely to be
   significant in many cases where  constituent data are primarily non-detect, allowing the use of a non-
   parametric interwell prediction limit test.

When  to use: Prediction  limits on individual values can be used as a non-parametric alternative in
   detection  monitoring to either one-way ANOVA or Dunnett's multiple comparison  with  control
   [MCC]  procedure. Assuming there is  insignificant natural spatial variability, an interwell prediction
   limit can be  constructed using upgradient or other representative background data. The number of
   future samples  m should  be chosen to  reflect  a  single  new  observation collected from each
   compliance well prior to the next statistical evaluation, plus a fixed number (m-1) of possible
   resamples. The  initial  future  observation at each compliance point  is then compared against the
   prediction  limit. If it exceeds the prediction limit, one or more  resamples  are collected from the
   'triggered' well and also compared to the prediction limit.

Steps involved:  1) Determine the maximum, second-largest, or other highly ranked value in background
   and set the non-parametric prediction limit equal  to this level; 2) considering the  number of future
   samples m, and the number of wells and monitoring parameters, determine the achievable site-wide
   false positive rate  [SWFPR]. If the error rate is  not acceptable, consider possibly enlarging the pool
   of background data used to  construct the limit or increasing the number of future samples m; 3)
   compare each  initial future  observation against the prediction  limit;  and 4) if both the initial
   measurement and  resample(s) exceed the limit, conclude the null hypothesis of equal distributions
   has been violated.

Advantages/Disadvantages: Non-parametric  prediction  limits on  individual  values  offer  distinct
   advantages compared to the Kruskal-Wallis non-parametric ANOVA test.  Prediction limits are not
   bound to  a minimum  5% per-constituent false  positive rate. As such, prediction limits  can be
   constructed to  meet a target  SWFPR, while maintaining acceptable  statistical power. Unlike the
   Kruskal-Wallis test, only the comparisons of interest (i.e., each compliance point against
   background)  are tested,  giving the prediction limit more statistical power.  Non-parametric prediction
   limits have the disadvantage of generally requiring fairly large background samples to effectively
   control false positive error and ensure  adequate power.

       PREDICTION LIMIT FOR FUTURE MEDIAN (SECTION 18.3.2)
Basic purpose: Method for detection monitoring and compliance monitoring.  This is a non-parametric
   technique to estimate numerical limit(s) on the median of p independent future values. The
   prediction limit(s) is used to test whether the median of one or more compliance well populations is
   equal to the median of the background population.

Hypothesis tested: H0 — The true median of p future observations arises from the same population as
   the median of measurements used to construct the prediction limit. HA — The p future observations
   come from a distribution with a different median than the background population of measurements.
   Since  an upper prediction limit  is of interest  in  both  detection  monitoring  and compliance
   monitoring, the alternative hypothesis is that the future observations are distributed with a larger
   median than the background population.

Underlying assumptions: 1) The data used to construct the  prediction limit need  not be normal;
   however, the forms of both the background distribution and the future distribution are assumed to
   be  the  same. Since  the non-parametric prediction  limit is constructed as  an  order statistic of
   background, high fractions of non-detects are acceptable; 2) although the variances of both
   populations (background and future values) are assumed to be equal, rarely will there be enough data
   from the future  population to verify this assumption; and 3)  if used for upgradient-to-downgradient
   comparisons, there should be no significant spatial variability.

When  to use: Prediction limits on medians can be used as a non-parametric alternative in detection
   monitoring to  either one-way  ANOVA or Dunnett's multiple comparison with  control  [MCC]
   procedure. Assuming there is insignificant natural spatial variability, an interwell  prediction limit
   can be constructed  using upgradient or other representative background data. The number of future
   samples p should be odd and chosen to reflect the number of samples that will be collected at each
   compliance well prior to the next statistical evaluation (e.g., 3).  The median of these p observations
   at each compliance point is then compared against the prediction limit. If it is feasible to collect at
   least p additional, but independent, resamples from each well, retesting can be incorporated into the
   procedure by using independent median(s) of p samples as confirmation value(s). A prediction limit
   for a compliance point median can also be constructed in certain compliance monitoring settings,
   when no fixed health-based compliance limit can be used and  the compliance point data must be
   directly  compared  against  a  background GWPS.  In this  case,  the  compliance point  median
   concentration is compared to an upper prediction limit computed from background. No retesting is
   employed for this latter kind of test.

Steps involved:  1) Determine the maximum, second-largest, or other highly ranked value in background
   and set the non-parametric prediction limit equal  to this level; 2) considering the number of future
   samples p, whether or not retesting will be incorporated, and the number of wells and monitoring
   parameters, determine the achievable SWFPR. If the error rate is  not acceptable,  increase the
   background sample size or  consider a non-parametric prediction limit on individual future values
   instead; 3) compare each future median of order p (i.e., a median of p values) against the prediction
   limit; and 4) if the future median exceeds the limit and retesting is not feasible (or if the test is used
   for compliance  monitoring), conclude the null hypothesis of equal medians has been violated. If
   retesting is feasible,  conclude the null  hypothesis  has been violated  only  when the  resampled
   median(s) of order p also exceeds the prediction limit.

Advantages/Disadvantages: Non-parametric prediction  limits  on medians offer distinct advantages
   compared to the Kruskal-Wallis test (a non-parametric one-way ANOVA). Prediction limits are not
   bound to a minimum 5% per-constituent false positive rate.  As such, prediction limits  can be
   constructed to meet a target SWFPR, while  maintaining acceptable  statistical power.  Unlike the
   Kruskal-Wallis  test,  only  the  comparisons  of  interest  (i.e., each  compliance point  against
   background)  are tested, giving the  prediction limit  more  statistical power.  A disadvantage in
   detection monitoring compared to non-parametric prediction limits on individual future values is that
   at least three new compliance point observations per well must be available to run the prediction
   limit on medians. If only one new observation per evaluation period can be collected,  construct
   instead a non-parametric prediction limit for individual values. All non-parametric prediction limits
   have the disadvantage of usually requiring fairly large background  samples  to effectively  control
   false positive error and ensure adequate power.

       SHEWHART-CUSUM CONTROL CHART (SECTION 20.2)
Basic purpose: Method for detection monitoring. Control charts are used to quantitatively and visually
   track concentrations at a given well over time to determine whether they exceed a critical threshold
   (i.e., control limit), thus implying a significant increase above background conditions.

Hypothesis tested: H0 — Data plotted on the control chart follow the same distribution as the
   background data used  to compute the baseline chart parameters. HA —  Data plotted  on the chart
   follow a different distribution with higher mean level than the baseline data.

Underlying assumptions: Data used to construct the  control chart must be approximately normal or
   normalized. Adjustments for small to moderate  fractions of non-detects, perhaps using Kaplan-Meier
   or ROS, can be acceptable. There should be no discernible trend in the baseline data used to calculate
   the control limit.

When to use: Use control charts as an alternative to  parametric prediction limits, when 1) there are
   enough uncontaminated baseline data to compute an accurate control limit, and 2) there are no trends
   in intrawell background. Retesting can be incorporated into  control charts by judicious choice of
   control limit. This may need to be estimated using Monte Carlo simulations.

Steps  involved: 1)  Compute the intrawell  baseline  mean  and standard deviation;  2)  calculate  an
   appropriate control limit from these baseline parameters, the desired retesting strategy and number of
   well-constituent pairs  in  the network; 3)  construct the chart,  plotting the control  limit,  the
   compliance point observations, and the cumulative  sums [CUSUM];  and  4) determine that the null
   hypothesis is violated when either an individual concentration measurement or the  cumulative sum
   exceeds the control limit.
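
   As a sketch only, the following Python code (assuming numpy) illustrates Steps 1, 3 and 4: each
   compliance measurement is standardized against the baseline mean and standard deviation, a one-sided
   CUSUM is accumulated, and an exceedance is flagged when either statistic crosses the control limit. The
   reference value k = 1 and control limit h = 4.5 are illustrative placeholders only; site-specific limits
   should come from Step 2 (e.g., via Monte Carlo simulation as noted above).

import numpy as np

def shewhart_cusum(baseline, new_values, k=1.0, h=4.5):
    # Step 1: baseline mean and standard deviation
    mean_b, sd_b = np.mean(baseline), np.std(baseline, ddof=1)
    cusum, results = 0.0, []
    for x in new_values:
        z = (x - mean_b) / sd_b                 # standardized Shewhart value
        cusum = max(0.0, z - k + cusum)         # one-sided cumulative sum (Step 3)
        results.append((x, z, cusum, z > h or cusum > h))   # Step 4: exceedance flag
    return results

baseline = np.array([4.1, 3.8, 5.0, 4.6, 4.3, 4.9, 3.9, 4.4])   # hypothetical baseline data
for x, z, s, flag in shewhart_cusum(baseline, [4.7, 5.4, 6.8, 7.2]):
    print(f"value={x:.1f}  z={z:.2f}  CUSUM={s:.2f}  exceedance={flag}")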

Advantages/Disadvantages: Unlike prediction limits, control charts offer an explicit visual tracking of
   compliance point values over time and provide  a method to judge whether these concentrations have
   exceeded  a critical threshold. The Shewhart portion of the  chart is especially good at  detecting
   sudden concentration increases, while the CUSUM  portion is preferred for detecting slower, steady
   increases over time. No non-parametric version of the combined Shewhart-CUSUM control chart
   exists, so non-parametric prediction limits should be considered if the data cannot be normalized.

       CONFIDENCE INTERVAL AROUND NORMAL MEAN (SECTION 21.1.1)
Basic purpose: Method for compliance/assessment monitoring or corrective action. This is a technique
   for estimating a range of concentration values from  sample data, in which the true mean of a  normal
   population is expected to occur at a certain probability.

Hypothesis tested: In compliance monitoring, H0 — True mean concentration at the compliance point is
   no greater than the  predetermined groundwater protection standard [GWPS]. HA — True mean
   concentration is greater than the GWPS. In corrective action, H0 — True mean concentration at the
   compliance point is greater than the fixed GWPS. HA — True mean concentration is less
   than or equal to the fixed standard.

Underlying  assumptions:  1) Compliance  point  data  are approximately  normal  in distribution.
   Adjustments for small to moderate fractions of non-detects, perhaps using Kaplan-Meier or ROS, are
   encouraged;  2)  data do not exhibit any significant trend over time;  3) there are a minimum of 4
   observations for testing. Generally, at least 8 to 10 measurements are recommended; and 4) the fixed
   GWPS is assumed to represent a true mean average concentration, rather than a maximum or upper
   percentile.

When to use: A mean confidence interval can be used for normal data  to determine whether there is
   statistically significant evidence that the average is either  above a  fixed GWPS (in compliance
   monitoring) or below the fixed standard (in corrective action). In either case, the null hypothesis is
   rejected only when the entire confidence interval lies on one or the  other side of the GWPS.  The key
   determinant in compliance monitoring is whether the lower confidence limit exceeds the GWPS,
   while in  corrective action  the upper confidence limit lies below the clean-up standard. Because of
   bias introduced by transformations when estimating a mean, this  approach should not be used for
   highly-skewed or non-normal  data. Instead consider a confidence interval around a lognormal mean
   or a non-parametric confidence interval. It is also not recommended for use when the data exhibit a
   significant trend. In that case,  the estimate  of variability  will likely be too high,  leading to an
   unnecessarily wide interval and possibly little chance of deciding the hypothesis. When a trend is
   present, consider instead a confidence interval around a trend line.

Steps involved: 1) Compute the sample mean and  standard deviation; 2) based on the sample size and
   choice of a confidence level (1-α), calculate either the lower confidence limit (for use in compliance
   monitoring) or the upper confidence limit (for use in corrective action); 3) compare the confidence
   limit against the GWPS or clean-up standard; and 4) if the lower confidence limit exceeds the GWPS
   in compliance monitoring or the upper confidence limit is below the clean-up standard, conclude that
   the null hypothesis should be rejected.
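
   The calculation in Steps 1-3 reduces to a one-sided Student's t interval, as in the Python sketch below
   (assuming numpy and scipy). The data values, the 95% confidence level, and the GWPS shown are
   hypothetical and for illustration only.

import numpy as np
from scipy import stats

def mean_confidence_limits(data, conf=0.95):
    # Step 1: sample mean and standard deviation; Step 2: one-sided t limits
    x = np.asarray(data, dtype=float)
    n, xbar, s = len(x), x.mean(), x.std(ddof=1)
    half = stats.t.ppf(conf, df=n - 1) * s / np.sqrt(n)
    return xbar - half, xbar + half    # (lower limit, upper limit)

data = [9.5, 11.2, 10.4, 12.1, 9.9, 10.8, 11.5, 10.1]   # hypothetical concentrations, mg/L
gwps = 10.0
lcl, ucl = mean_confidence_limits(data, conf=0.95)
print(f"LCL = {lcl:.2f}; exceeds GWPS? {lcl > gwps}")     # compliance monitoring (Steps 3-4)
print(f"UCL = {ucl:.2f}; below standard? {ucl < gwps}")   # corrective action (Steps 3-4)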

Advantages/Disadvantages: Use of a confidence interval instead of simply the sample mean for
   comparison to a fixed standard accounts for both the level of statistical variation in the data and the
   desired or targeted  confidence level.  The same basic  test  can  be used  both  to document
   contamination above the compliance standard in compliance/assessment  and to show a sufficient
   decrease in concentration levels below the clean-up standard in corrective action.

       CONFIDENCE INTERVAL ON  LOGNORMAL  GEOMETRIC MEAN (SECTION 21.1.2)
Basic purpose: Method for compliance/assessment monitoring or corrective action. It is  a technique to
   estimate the range  of concentration values from sample data, in which the  true geometric mean of a
   lognormal population is expected to occur at a certain probability.

Hypothesis tested: In compliance monitoring, H0 — True mean concentration at the compliance point is
   no greater than the fixed compliance or groundwater protection standard [GWPS]. HA — True mean
   concentration is greater than the GWPS. In corrective action, H0 — True mean concentration at the
   compliance point is greater than the fixed compliance or clean-up standard. HA — True mean
   concentration is less than or equal to the fixed standard.

Underlying assumptions:  1)  Compliance  point  data  are approximately lognormal  in  distribution.
   Adjustments for small to moderate fractions of non-detects, perhaps using Kaplan-Meier or ROS, are
   encouraged; 2) data do not exhibit any  significant trend over time;  3) there are a minimum of 4
   observations. Generally, at least 8 to 10 measurements are recommended; and 4) the fixed GWPS is
   assumed to  represent a  true  geometric mean  average  concentration  following  a  lognormal
   distribution, rather than a maximum or upper percentile. The GWPS also represents the true median.

When to use: A confidence interval on the geometric mean can be used for lognormal data to determine
   whether there  is statistically significant evidence that the geometric average is either above a fixed
   numerical standard (in compliance monitoring) or below a fixed standard (in corrective  action). In
   either case, the null hypothesis is rejected only when the entire confidence interval is to one side of
   the compliance or clean-up standard. Because of this fact, the key question in compliance monitoring
   is whether the lower confidence limit exceeds the GWPS, while in corrective action the user must
   determine whether the upper confidence  limit is below  the clean-up standard. Because of bias
   introduced  by  transformations  when estimating the arithmetic  lognormal  mean, and  the  often
   unreasonably  high upper confidence limits  generated by Land's  method for lognormal  mean
   confidence intervals (see below), this approach is recommended as an alternative for lognormal data. One
   could also consider a non-parametric confidence interval. It is also not recommended for use when
   data exhibit a significant trend. In that case, the estimate of variability will likely be too high, leading
   to an unnecessarily wide interval and possibly little chance of deciding the hypothesis. When a trend
   is present, consider instead a confidence interval around a trend line.

Steps involved: 1) Compute the  sample log-mean and log-standard deviation; 2) based on the sample
   size and choice of confidence level (1-α), calculate either the lower confidence limit (for use in
   compliance monitoring) or the upper confidence limit (for use in corrective action) using the logged
   measurements and exponentiate the result; 3) compare the confidence limit against the  GWPS or
   clean-up standard;  and 4)  if the lower confidence limit exceeds the GWPS in compliance monitoring
   or the upper confidence  limit is below the clean-up standard, conclude that the  null hypothesis
   should be rejected.
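
   A minimal Python sketch of Steps 1-3 appears below (assuming numpy and scipy): the t-based interval is
   computed on the logged measurements and exponentiated back to the concentration scale. The data values,
   confidence level, and GWPS are hypothetical.

import numpy as np
from scipy import stats

def geomean_confidence_limits(data, conf=0.95):
    # Step 1: log-mean and log-standard deviation; Step 2: t limits on the log scale,
    # then exponentiate the results back to concentration units
    y = np.log(np.asarray(data, dtype=float))
    n, ybar, sy = len(y), y.mean(), y.std(ddof=1)
    half = stats.t.ppf(conf, df=n - 1) * sy / np.sqrt(n)
    return np.exp(ybar - half), np.exp(ybar + half)   # (LCL, UCL) in original units

data = [3.2, 7.5, 4.1, 12.8, 5.6, 9.3, 6.4, 4.9]   # hypothetical concentrations, ug/L
gwps = 5.0
lcl, ucl = geomean_confidence_limits(data)
print(f"Geometric-mean LCL = {lcl:.2f}, UCL = {ucl:.2f}")
print(f"Compliance exceedance indicated: {lcl > gwps}")   # Steps 3-4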

Advantages/Disadvantages: Use of a confidence interval instead of simply the sample geometric mean
   for comparison to a fixed standard accounts for both statistical variation in the data and the targeted
   confidence level.  The  same basic test  can  be used both to document contamination  above the
   compliance  standard in compliance/assessment and to  show a sufficient decrease in concentration
   levels below the clean-up  standard in corrective action.

       CONFIDENCE INTERVAL ON LOGNORMAL ARITHMETIC MEAN (SECTION 21.1.3)
Basic purpose: Test for compliance/assessment monitoring or corrective action. This  is a method by
   Land (1971) used to estimate  the range of concentration values from sample data, in which the true
   arithmetic mean of a lognormal population is expected to occur at a certain probability.

Hypothesis tested: In compliance monitoring, H0 — True mean concentration at the compliance point is
   no greater than the fixed compliance or groundwater protection standard [GWPS]. HA — True mean
   concentration is greater than the GWPS. In corrective action, H0 — True mean concentration at the
   compliance point is greater than the fixed compliance or clean-up standard. HA — True mean
   concentration is less than or equal to the fixed standard.

Underlying assumptions:  1) Compliance point data  are  approximately  lognormal in distribution.
   Adjustments for small to moderate fractions of non-detects, perhaps using Kaplan-Meier or ROS, are
   encouraged; 2) data do not exhibit any significant trend over time; 3)  there are a minimum of 4
   observations. Generally, at least 8 to 10 measurements are strongly recommended; and 4) the fixed
   GWPS  is  assumed  to  represent  the true  arithmetic  mean  average  concentration,  rather than a
   maximum or upper percentile.

When to use: Land's confidence interval procedure can be used for lognormally-distributed data to
   determine whether there is statistically significant evidence that the average is either above a fixed
   numerical standard (in compliance monitoring) or below a fixed standard (in corrective action). In
   either case, the null hypothesis is rejected only when the entire confidence interval is to one side of
   the compliance or clean-up standard. Because of this fact, the key question in compliance monitoring
   is whether the lower confidence limit exceeds the GWPS, while in corrective action the user must
   determine  whether the upper  confidence  limit is  below  the  clean-up  standard.   Because the
   lognormal distribution can have a highly skewed upper tail, this approach should only be used when
   the data fit the lognormal model rather closely, especially if used in corrective action. Consider
   instead  a confidence interval around the lognormal geometric mean or a non-parametric confidence
   interval if this is not the case. It is also not recommended for data that exhibit a significant trend. In
   that situation,  the  estimate of variability will likely  be too high, leading to an unnecessarily wide
   interval and possibly little chance of deciding the hypothesis. When a trend is  present,  consider
   instead  a confidence interval around a trend line.

Steps involved: 1) Compute the sample log-mean and log-standard deviation; 2) based on the sample
   size, magnitude of the log-standard deviation and choice of confidence level (1-α), determine Land's
   adjustment  factor;  3)  then calculate either the lower  confidence limit (for use in compliance
   monitoring) or the upper confidence limit (for use in corrective action);  4) compare the confidence
   limit against the GWPS  or clean-up standard; and 5) if the lower confidence limit exceeds the GWPS
   in compliance monitoring or the upper confidence limit is below the clean-up standard, conclude that
   the null hypothesis should be rejected.
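
   The following Python sketch (assuming numpy) shows only the arithmetic of Steps 1-3; Land's adjustment
   factor H must still be obtained from Land's tables or software such as Pro-UCL for the observed sample
   size, log-standard deviation, and confidence level, so the value used below is a placeholder rather
   than a correct table entry. The data values and function name are also hypothetical.

import numpy as np

def land_confidence_limit(data, H):
    # Steps 1-3: Land's limit on a lognormal arithmetic mean,
    # exp(ybar + 0.5*sy**2 + sy*H/sqrt(n-1)), with H supplied from Land's tables
    y = np.log(np.asarray(data, dtype=float))
    n, ybar, sy = len(y), y.mean(), y.std(ddof=1)
    return np.exp(ybar + 0.5 * sy**2 + sy * H / np.sqrt(n - 1))

data = [2.1, 5.4, 3.7, 11.2, 6.8, 4.4, 9.1, 3.0]   # hypothetical concentrations, ug/L
H_upper = 3.0   # placeholder only -- look up H_(1-alpha) for this n and log-standard deviation
print(f"Land UCL on the arithmetic mean = {land_confidence_limit(data, H_upper):.2f}")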

Advantages/Disadvantages:  Use of a  confidence interval instead  of simply the  sample mean for
   comparison to a fixed standard accounts for both statistical  variation in the data and the targeted
   confidence level.  The same  basic test can be  used both to document contamination above the
   compliance standard in compliance/assessment and to show a sufficient decrease in concentration
   levels below the clean-up standard  in  corrective action. Since the upper confidence limit on a
   lognormal mean can be extremely high for some populations, the user may need to consider a non-
   parametric upper confidence limit on the median concentration as an alternative  or use a program
   such as  Pro-UCL to determine an alternate upper confidence limit.

       CONFIDENCE INTERVAL ON UPPER PERCENTILE (SECTION 21.1.4)
Basic purpose: Method for compliance monitoring. It is  used to estimate  the range of concentration
   values from sample data in which a pre-specified true proportion of a normal population is expected
   to occur at a certain probability.  The test can also be used to identify the range of a true proportion
   or percentile (e.g., the 95th) in population data which can be normalized.

Hypothesis tested: H0 — True upper percentile concentration at the compliance point is no greater than
   the fixed  compliance or groundwater protection standard [GWPS]. HA — True upper percentile
   concentration is greater than the fixed GWPS.

Underlying  assumptions:  1)  Compliance  point  data  are  either  normal in distribution or can be
   normalized. Adjustments for small to moderate fractions of non-detects, perhaps using Kaplan-Meier
   or ROS, are encouraged;  2) data do not exhibit any significant trend over  time; 3) there  are a
   minimum of 8 to 10 measurements; and 4) the fixed GWPS is assumed to represent a
   maximum or upper percentile, rather than an average concentration.

When to use:  A confidence interval around an upper percentile can be used to determine whether there
   is statistically significant evidence that the percentile is above a fixed numerical standard.  The null
   hypothesis is rejected only when the entire confidence interval  is greater than the compliance
   standard. Because of this fact, the  key  question in compliance monitoring is whether the lower
   confidence limit exceeds the GWPS. This  approach is  not recommended for use when  the data
   exhibit a significant trend. The  estimate of variability will likely be  too  high, leading to an
   unnecessarily wide interval and possibly little chance of deciding the hypothesis.

Steps involved: 1) Compute the sample mean and standard deviation; 2) based on the sample size, pre-
   determined true proportion and test confidence level (1-α), calculate the lower confidence limit; 3)
   compare the confidence limit against the GWPS; and 4) if the lower confidence limit exceeds the
   GWPS, conclude that the true upper percentile is larger than the compliance standard.
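
   Steps 1-2 can be carried out with the noncentral t distribution, as in the Python sketch below
   (assuming numpy and scipy): the lower confidence limit on the Pth percentile is
   xbar + (s/sqrt(n))*t(alpha; n-1, delta), with noncentrality delta = z_P*sqrt(n). The data values, the
   95th percentile, the 95% confidence level, and the GWPS are hypothetical.

import numpy as np
from scipy import stats

def percentile_lower_confidence_limit(data, pctl=0.95, conf=0.95):
    # Step 1: sample mean and standard deviation
    x = np.asarray(data, dtype=float)
    n, xbar, s = len(x), x.mean(), x.std(ddof=1)
    # Step 2: lower limit from the noncentral t distribution
    delta = stats.norm.ppf(pctl) * np.sqrt(n)
    t_alpha = stats.nct.ppf(1.0 - conf, df=n - 1, nc=delta)
    return xbar + t_alpha * s / np.sqrt(n)

data = [41, 55, 48, 62, 50, 58, 45, 53, 60, 47]   # hypothetical concentrations, ug/L
gwps = 50.0
lcl = percentile_lower_confidence_limit(data, pctl=0.95, conf=0.95)
print(f"95% LCL on the 95th percentile = {lcl:.1f}; exceeds GWPS? {lcl > gwps}")   # Steps 3-4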

Advantages/Disadvantages: If a fixed GWPS is intended to represent a 'not-to-be-exceeded' maximum
   or an upper percentile, statistical comparison requires the prior definition of a true or expected upper
   percentile against which sample data can be compared. Some standards may explicitly identify the
   expected percentile.  The appropriate test then must estimate the confidence interval in which this
   true proportion is expected to lie.    Either an upper or  lower confidence limit can be generated,
   depending on whether compliance or corrective action hypothesis testing is appropriate. Whatever
   the interpretation of a given limit used as a GWPS, it should be determined in advance what a  given
   standard represents before choosing which type of confidence interval to construct.

       NON-PARAMETRIC CONFIDENCE INTERVAL ON MEDIAN (SECTION 21.2)
Basic purpose: Test for compliance/assessment monitoring or corrective action.  It is a non-parametric
   method used to estimate the range of concentration values from sample data in  which the true
   median of a population is expected to occur at a certain probability.

Hypothesis tested: In compliance monitoring, H0 — True median concentration at the compliance point
   is no greater than the fixed compliance or groundwater protection standard [GWPS]. HA — True
   median concentration is greater than the GWPS. In corrective action, H0 — True median
   concentration at the compliance point is greater than the fixed compliance or clean-up standard. HA
   — True median concentration is less than or equal to the fixed standard.

Underlying  assumptions: 1) Compliance data need not  be normal  in distribution; up to 50% non-
   detects are acceptable; 2) data do not exhibit any significant trend over time; 3) there are a minimum
   of 7 measurements; and 4) the fixed GWPS is assumed to represent a true median average
   concentration, rather than a maximum or upper percentile.

When to use:  A confidence interval on the median can be used for non-normal data (e.g., samples with
   non-detects)  to determine whether there is statistically significant evidence that the average (i.e.,
   median) is either above a fixed numerical standard  (in compliance monitoring) or below a fixed
   standard (in  corrective action). In either case, the null hypothesis is rejected only when the  entire
   confidence interval is to one side of the compliance  or clean-up  standard.  Because of this fact, the
   key question in compliance monitoring is whether the lower confidence limit exceeds the GWPS,
   while in corrective  action the user must determine whether the upper confidence limit is below the
   clean-up standard.  This approach is not recommended for use when data exhibit a significant trend.
   In that case, the variation in the data will likely be too high, leading to an unnecessarily wide interval
   and possibly little chance of deciding the hypothesis.  It is also possible that the apparent trend is an
   artifact of differing detection  or reporting  limits that have changed over time.  The trend may
   disappear if  all non-detects are imputed at a common value or RL. If a trend is still present after
   investigating this possibility, but a significant portion of the data are non-detect,  consultation with a
   professional statistician is recommended.

Steps involved: 1)  Order and rank the data values; 2)  pick tentative interval endpoints close to the
   estimated median concentration; 3) using the selected endpoints, compute the achieved confidence
   level of the lower confidence limit for use in compliance monitoring or that of the upper confidence
   limit  for corrective action;  4) iteratively expand the interval until either the  selected endpoints
   achieve the targeted confidence level  or  the maximum or minimum  data value is chosen as the
   confidence limit; and 5) compare the confidence limit against the GWPS or clean-up standard. If the
   lower confidence limit exceeds the GWPS in compliance monitoring or the upper confidence limit is
   below the clean-up standard, conclude that the null hypothesis should be rejected.
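
   Steps 2-4 amount to a binomial calculation on the order statistics. The Python sketch below (assuming
   numpy and scipy) searches downward from the sample median for the largest rank whose achieved one-sided
   confidence meets the target; only the lower limit used in compliance monitoring is shown, and the
   function name, data, and GWPS are hypothetical.

import numpy as np
from scipy import stats

def median_lcl_nonparametric(data, conf=0.95):
    # Achieved one-sided confidence of the l-th order statistic as a lower limit
    # on the median: 1 - BinomCDF(l-1; n, 0.5)
    x = np.sort(np.asarray(data, dtype=float))
    n = len(x)
    for l in range(n // 2, 0, -1):                        # ranks counted from 1
        achieved = 1.0 - stats.binom.cdf(l - 1, n, 0.5)
        if achieved >= conf:
            return x[l - 1], achieved                     # tightest limit meeting the target
    return x[0], 1.0 - stats.binom.cdf(0, n, 0.5)         # target not met; report the minimum

data = [1.2, 0.8, 2.5, 1.9, 3.4, 1.1, 2.2, 0.9, 1.6, 2.8]   # hypothetical concentrations, mg/L
gwps = 1.0
lcl, achieved = median_lcl_nonparametric(data)
print(f"Lower limit = {lcl}, achieved confidence = {achieved:.3f}, exceeds GWPS? {lcl > gwps}")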

Advantages/Disadvantages: Use of a confidence  interval instead  of simply the  sample median for
   comparison to a fixed  limit accounts  for both statistical variation in the  data and the targeted
   confidence level. The same  basic test can be used both to document contamination  above the
   compliance standard in compliance/assessment and to show a sufficient decrease in concentration
   levels below the clean-up standard in corrective action. By not requiring normal  or normalized data,
   the non-parametric confidence interval  can accommodate a substantial fraction of non-detects.  A
   minor disadvantage is that a non-parametric confidence interval estimates the location of the median,
   instead of the mean. For symmetric populations, these quantities will be the same, but for skewed
   distributions  they will differ. So if the compliance or clean-up standard is designed to represent a
   mean concentration, the non-parametric interval around the median may not provide a completely
   fair and/or accurate comparison. In some cases, the non-parametric confidence limit will not achieve
   the desired confidence level even if set to the maximum or minimum data value, leading to a higher
   risk of false positive error.

       NON-PARAMETRIC CONFIDENCE INTERVAL ON UPPER PERCENTILE (SECTION 21.2)
Basic purpose: Non-parametric method for compliance  monitoring.  It is used to estimate the range of
   concentration values from sample data in which a pre-specified  true proportion of a population is
   expected to occur at a certain probability. Exact probabilities will depend upon sample data ranks.

Hypothesis tested: H0 — True upper percentile concentration at the compliance point is no greater than
   the fixed compliance or groundwater  protection standard [GWPS]. HA — True  upper percentile
   concentration is greater than the GWPS.

Underlying assumptions: 1) Compliance point data need not be normal; large fractions of non-detects
   can be acceptable; 2) data do not exhibit any significant trend over time; 3) there are a minimum of
   8 to 10 measurements; and 4) the fixed GWPS is assumed to represent a true upper percentile
   of the population, rather than an average concentration.

When to use: A confidence interval on an upper percentile can be used to determine whether there is
   statistically significant evidence that the percentile is above a fixed numerical standard. The null
   hypothesis is  rejected only when the entire confidence interval is greater than the compliance
   standard. Because of this fact, the key determinant in compliance/assessment monitoring is whether
   the lower confidence limit exceeds the GWPS. This approach is not recommended for use when data
   exhibit a significant trend. In that case, the estimate of variability will likely be too high, leading to
   an unnecessarily wide interval and possibly little chance of deciding the hypothesis.

Steps involved: 1)  Order and rank the data values; 2) select tentative  interval endpoints close to the
   estimated upper percentile concentration; 3)  using the  selected endpoints, compute the achieved
   confidence level  of the lower confidence limit; 4) iteratively  expand the interval  until either the
   selected lower endpoint achieves the targeted confidence level or the minimum data value is chosen
   as the  confidence limit; and 5) compare the  confidence limit against the GWPS. If the lower
   confidence limit exceeds the GWPS, conclude that the population upper percentile is larger than the
   compliance standard.
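
   The same binomial logic applies here with the target percentile P in place of 0.5. The Python sketch
   below (assuming numpy and scipy) is illustrative only; the data, the 95th percentile, the 90%
   confidence level, and the GWPS are hypothetical placeholders.

import numpy as np
from scipy import stats

def percentile_lcl_nonparametric(data, pctl=0.95, conf=0.95):
    # Achieved confidence of the l-th order statistic as a lower bound on the
    # p-th percentile: 1 - BinomCDF(l-1; n, p)
    x = np.sort(np.asarray(data, dtype=float))
    n = len(x)
    for l in range(int(np.ceil(n * pctl)), 0, -1):
        achieved = 1.0 - stats.binom.cdf(l - 1, n, pctl)
        if achieved >= conf:
            return x[l - 1], achieved
    return x[0], 1.0 - stats.binom.cdf(0, n, pctl)   # minimum; target may not be achievable

data = [12, 9, 15, 22, 11, 18, 14, 25, 10, 13, 17, 20]   # hypothetical concentrations, ug/L
gwps = 30.0
lcl, achieved = percentile_lcl_nonparametric(data, pctl=0.95, conf=0.90)
print(f"Lower limit on the 95th percentile = {lcl}, achieved confidence = {achieved:.3f}")
print(f"Exceeds GWPS? {lcl > gwps}")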

Advantages/Disadvantages: If a fixed GWPS is intended to represent a 'not-to-be-exceeded'  maximum
   or an upper percentile, statistical comparison requires the prior definition of a true or expected upper
   percentile against which sample data can be compared. Some standards may explicitly identify the
   expected percentile.  The appropriate test then must  estimate the confidence interval in which this
   true  proportion is expected to lie.   Either an  upper or lower confidence limit can be generated,
   depending  on whether compliance or corrective action hypothesis testing is appropriate.  Whatever
   the interpretation of a given limit used as a GWPS, it should be determined in advance what a given
   standard represents  before choosing which type of confidence interval  to construct.   However,
   precise non-parametric estimation of upper percentiles often requires much larger sample sizes than
   the parametric option (Section 21.1.4). For this reason, a parametric confidence interval for upper
   percentile tests is  recommended whenever possible,  especially if a  suitable transformation can be
   found or adjustments made for non-detect values.

      CONFIDENCE BAND AROUND LINEAR REGRESSION (SECTION 21.3.1)
Basic purpose: Method for compliance/assessment monitoring or  corrective action when  stationarity
   cannot be assumed.  It is used to estimate ranges of concentration values  from sample data around
   each point  of a predicted linear regression line at a specified probability. The prediction line (based
   on regression of concentration values against time) represents the best estimate of gradually changing
   true mean levels over the time period.

Hypothesis tested: In compliance monitoring, H0 — True mean concentration at the compliance point is
   no greater than the fixed compliance or groundwater protection standard [GWPS]. HA — True mean
   concentration is greater than the GWPS. In corrective action, H0 — True mean concentration at the
   compliance point  is greater  than the fixed compliance or clean-up standard. HA — True mean
   concentration is less than or equal to the fixed standard.

Underlying assumptions: 1) Compliance point values exhibit a linear trend with time, with normally
   distributed residuals. Use simple substitution with small (<10-15%) fractions of non-detects. Non-
   detect adjustment methods  are not recommended;  2)  there are a minimum of 4  observations.
   Generally,  at least 8 to 10 measurements are recommended; and 3) the fixed GWPS is assumed to
   represent an average concentration, rather than a maximum or upper percentile.

When to use:  A confidence interval around a trend line  should be used in cases where a linear trend is
   apparent on a time series  plot of the compliance point data. Even if observed well concentrations are
   either increasing under  compliance monitoring  or  decreasing  in corrective action,  it does  not
   necessarily imply that the true mean concentration at the current time is either above or below the
   fixed GWPS. While the trend line properly accounts for the fact that the mean is  changing with
   time, the null hypothesis is rejected only when the entire confidence interval is to  one side of the
   compliance or  clean-up  standard at the most recent point(s) in time. The key  determinant in
   compliance monitoring is whether the lower confidence limit at a specified point in time exceeds the
   GWPS, while in corrective action the upper confidence limit at a specific time must lie below the
   clean-up standard to be considered in compliance.

Steps involved:  1) Check for presence of a trend on a time series plot; 2) estimate the coefficients of the
   best-fitting linear regression line; 3) compute the  trend line residuals and check for normality; 4) if
   data are non-normal, try  re-computing the regression and residuals after transforming the  data; 5)
   compute the lower confidence limit band around the trend line for compliance monitoring or the
   upper confidence  limit band around the trend line for corrective action;  and 6) compare  the
   confidence limit  at each sampling event  against the GWPS  or clean-up standard. If the lower
   confidence limit exceeds the GWPS in compliance/assessment or the upper confidence limit is below
   the clean-up standard on one  or more  recent sampling  events, conclude  that the null hypothesis
   should be rejected.
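
   For illustration, the Python sketch below (assuming numpy and scipy) carries out Steps 2-5 on
   hypothetical data: the regression is fit, the residuals give the standard error, and the one-sided band
   at each sampling time is yhat(t) -/+ t(conf; n-2)*se*sqrt(1/n + (t - tbar)^2/Sxx). The simulated
   concentrations, confidence level, and GWPS are placeholders.

import numpy as np
from scipy import stats

def regression_confidence_band(t, y, conf=0.95):
    t, y = np.asarray(t, float), np.asarray(y, float)
    n = len(t)
    slope, intercept = np.polyfit(t, y, 1)                # Step 2: fit the regression line
    resid = y - (intercept + slope * t)                   # Step 3: residuals
    se = np.sqrt(np.sum(resid**2) / (n - 2))
    sxx = np.sum((t - t.mean())**2)
    half = stats.t.ppf(conf, df=n - 2) * se * np.sqrt(1.0 / n + (t - t.mean())**2 / sxx)
    yhat = intercept + slope * t
    return yhat - half, yhat + half                       # Step 5: lower band, upper band

years = np.arange(10)                                     # hypothetical sampling times
concs = 2.0 + 0.4 * years + stats.norm.rvs(scale=0.5, size=10, random_state=7)
lower, upper = regression_confidence_band(years, concs)
gwps = 4.0
print(f"Lower band at the most recent event = {lower[-1]:.2f}; exceeds GWPS? {lower[-1] > gwps}")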

Advantages/Disadvantages: Use of a confidence interval around the trend line instead of simply the
   regression line itself for comparison to a fixed standard accounts for both statistical  variation in the
   data  and the targeted confidence level. The same basic  test can  be used both to document
   contamination above the compliance standard in compliance/assessment and to  show a sufficient
   decrease in concentration levels below the clean-up standard in corrective action.  By estimating the
   trend line first and then using the residuals to construct the confidence interval, variation due to the
   trend itself is removed, providing a more powerful test (via a narrower interval) of whether or not the
   true mean is on one side of the fixed standard.  This technique can only be used when the identified
   trend is reasonably linear  and the trend residuals are approximately normal.

       NON-PARAMETRIC CONFIDENCE BAND AROUND THEIL-SEN TREND (SECTION 21.3.1)
Basic  purpose:  Non-parametric method  for  compliance/assessment   or  corrective action  when
   stationarity cannot be assumed. It is used to estimate ranges of concentration values from sample
   data around each point of a predicted Theil-Sen trend line at a specified probability.  The prediction
   line represents the best estimate of gradually changing true median levels over the time period.

Hypothesis tested: In compliance monitoring, H0 — True mean concentration at the compliance point is
   no greater than the fixed compliance or groundwater protection standard [GWPS]. HA — True mean
   concentration is greater than the GWPS. In corrective action, H0 — True mean concentration at the
   compliance point is  greater than the fixed compliance or  clean-up standard. HA — True  mean
   concentration is less than or equal to the fixed standard.

Underlying assumptions: 1) Compliance point values exhibit a linear trend with time; 2) non-normal
   data and substantial levels of non-detects up to 50% are acceptable; 3) there are a minimum of 8-10
   observations available to construct the confidence band; and 4) the fixed GWPS is assumed to
   represent a median average concentration, rather than a maximum or upper percentile.

When to use:  A confidence interval around a trend line should be used in cases where a linear trend is
   apparent on a time series plot of the compliance point data. Even if observed well concentrations are
   either  increasing under compliance  monitoring  or decreasing  in  corrective  action,  it does  not
   necessarily imply that the true mean  concentration at  the current time is either above or below the
   fixed GWPS.  While the trend line properly accounts for the fact that the mean is changing with
   time, the null  hypothesis is rejected only when the entire confidence interval is to one side of the
   compliance or clean-up standard  at the  most recent point(s)  in time. The  key determinant in
   compliance monitoring is whether the lower confidence limit at a specified point in time exceeds the
   GWPS, while  in corrective action the upper confidence limit at  a specific time must lie below the
   clean-up standard to be considered in compliance.

Steps involved: 1) Check for presence of a trend on a time series plot; 2) construct a Theil-Sen trend
   line; 3) use bootstrapping to create a large number of simulated Theil-Sen trends on the sample data;
   4) construct a confidence band by selecting lower and upper percentiles from the set of bootstrapped
   Theil-Sen trend estimates; and 5) compare the confidence band at each  sampling event against the
   GWPS  or clean-up   standard.   If  the   lower  confidence  band   exceeds   the  GWPS   in
   compliance/assessment or the upper confidence band is below the clean-up standard on one  or more
   recent sampling events, conclude that the null hypothesis should be rejected.
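
   The sketch below (Python, assuming numpy and scipy) illustrates Steps 2-4 using one simple bootstrap
   scheme, resampling (time, concentration) pairs with replacement and refitting the Theil-Sen line each
   time. The number of bootstrap iterations, the resampling scheme, the data, and the GWPS are
   illustrative assumptions and may differ from the exact procedure detailed in the chapter cited above.

import numpy as np
from scipy import stats

def theil_sen_band(t, y, conf=0.95, n_boot=2000, seed=0):
    t, y = np.asarray(t, float), np.asarray(y, float)
    rng = np.random.default_rng(seed)
    n = len(t)
    fits = np.empty((n_boot, n))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)                   # Step 3: resample pairs with replacement
        slope, intercept, _, _ = stats.theilslopes(y[idx], t[idx])
        fits[b] = intercept + slope * t
    lower = np.percentile(fits, 100 * (1.0 - conf), axis=0)   # Step 4: one-sided lower band
    upper = np.percentile(fits, 100 * conf, axis=0)           # Step 4: one-sided upper band
    return lower, upper

t = np.arange(12)
y = np.array([1.1, 1.4, 1.2, 1.8, 1.7, 2.3, 2.1, 2.6, 2.4, 3.0, 2.9, 3.3])   # hypothetical data
lower, upper = theil_sen_band(t, y)
gwps = 2.0
print(f"Lower band at the latest event = {lower[-1]:.2f}; exceeds GWPS? {lower[-1] > gwps}")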

Advantages/Disadvantages: Use of a confidence band around the trend line instead of simply the Theil-
   Sen  trend line itself for comparison to a fixed standard accounts for both statistical variation in the
   data  and  the  targeted  confidence  level.   The  same  basic  test  can   be  used both   in
   compliance/assessment and in corrective action.  By estimating  the trend  line  first and then  using
   bootstrapping  to  construct the confidence band, variation due to the trend itself is removed,
   providing a more powerful  test (via a narrower interval) of whether or not the  true mean is on  one
   side of the fixed standard.  This technique  can only be used when the identified trend is reasonably
   linear.  The Theil-Sen trend estimates the change in median level rather than the mean. For roughly
   symmetric  populations, this will make little difference; for highly skewed populations, the  trend in
   the median may not accurately reflect  changes in mean concentration levels.
      PART II: DIAGNOSTIC METHODS AND TESTING
      Part II covers diagnostic evaluations of historical facility data for checking key assumptions
implicit in the recommended statistical tests and for making appropriate adjustments to
the data (e.g., consideration of outliers, seasonal autocorrelation, or non-detects). Also included is a
discussion of groundwater sampling and how hydrologic factors such as flow and gradient can
impact the sampling program.

      Chapter 9 provides a number of exploratory data tools and examples, which can generally be
used in data evaluations. Approaches for fitting data sets to normal and other parametric distributions
follow in Chapter 10. The importance of the normal distribution and its potential uses is also
discussed. Chapter 11 provides methods for assessing the equality of variance necessary for some
formal testing. The subject of outliers and means of testing for them is covered in Chapter 12.
Chapter 13 addresses spatial variability, with particular emphasis on ANOVA means testing. In
Chapter 14, a number of topics concerning temporal variation are provided. In addition to providing
tests for identifying the presence of temporal variation, specific adjustments for certain types of temporal
dependence are covered. The final Chapter 15 of Part II discusses non-detect data and offers several
methods for estimating missing data. In particular, methods are provided to deal with data containing
multiple non-detection limits.
          CHAPTER  9.   COMMON  EXPLORATORY TOOLS

       9.1   TIME SERIES PLOTS	9-1
        9.2   BOX PLOTS	9-5
       9.3   HISTOGRAMS	9-8
       9.4   SCATTER PLOTS	9-13
       9.5   PROBABILITY PLOTS	9-16
     Graphs are an important tool for exploring and understanding patterns in any data set.  Plotting the
data visually depicts the structure and helps unmask possible relationships between variables affecting
the data set. Data plots which accompany quantitative statistical tests can better demonstrate the reasons
for the results of a formal test. For example, a Shapiro-Wilk test may conclude that data are not normally
distributed. A probability plot or histogram of the data can confirm this conclusion graphically to show
why the data are not normally distributed (e.g., heavy skewness, bimodality, a single outlier, etc.).

     Several  common exploratory  tools are presented in Chapter 9. These graphical techniques are
discussed in statistical texts,  but are presented here in detail for easy reference for the data analyst. An
example data set is used to demonstrate how each of the following plots is created.

     *  Time series plots (Section 9.1)
     *  Box plots (Section 9.2)
     *  Histograms (Section 9.3)
     *  Scatter plots (Section 9.4)
     *  Probability plots (Section 9.5)

9.1 TIME SERIES PLOTS

     Data  collected over specific time intervals (e.g., monthly, biweekly, or hourly) have a temporal
component. For example, air monitoring measurements of a pollutant may be collected once a minute or
once a day. Water quality monitoring measurements may be collected weekly  or monthly.  Typically,
groundwater sample data are collected quarterly from the same monitoring wells, either for detection
monitoring testing or demonstrating compliance to a  GWPS. An analyst examining temporal data may
be  interested in the  trends  over time, correlation  among time  periods,  or  cyclical patterns.  Some
graphical techniques specific to temporal data are  the time plot, lag plot,  correlogram, and variogram.
The degree to which some of these techniques can be used will depend in part on the frequency and
number of data collected over time.

     A data sequence collected at regular time intervals is called a time series. More sophisticated time
series data  analyses are beyond  the scope of this guidance. If needed, the interested user should consult
with a statistician or appropriate statistical texts. The graphical representations presented in this section
are recommended for any data set that includes a temporal component. Techniques described below will
help identify temporal patterns that need to be accounted for in any analysis  of the data. The analyst
examining  temporal environmental  data may be interested in seasonal trends,  directional trends, serial
correlation, or stationarity. Seasonal trends are patterns in the data that repeat over time, i.e., the data
rise and fall regularly over one or more time periods. Seasonal trends may occur over long periods of
time (large scale), such as a yearly cycle where the data show the same pattern of rising and falling from
year to year, or the trends may be over a relatively short period of time (small scale), such as a daily
cycle. Examples of seasonal trends  are quarterly seasons (winter, spring, summer and fall),  monthly
seasons, or even hourly (e.g., air temperature rising and falling over the course of a  day). Directional
trends are increasing or decreasing patterns over time in monitored constituent data,  which may be of
importance in assessing the levels of contaminants. Serial correlation is a measure of the strength in the
linear relationship of successive observations. If successive observations are related, statistical quantities
calculated without accounting for the serial correlation may be biased. A time series is stationary if there
is no systematic change in the mean (i.e., no trend) and variance across time. Stationary data look the
same over all time periods except for random behavior. Directional trends or a change in the variability
in the data imply non-stationarity.

       A time series plot of concentration data versus time makes it easy to identify lack of randomness,
changes in location, change  in scale, small scale trends, or large-scale trends over time.  Small-scale
trends are displayed as fluctuations over smaller time periods. For example, ozone levels over the course
of one day typically rise until the afternoon, then decrease, and this process is repeated  every day. Larger
scale trends such as seasonal fluctuations  appear as regular rises and drops in the graph.  Ozone levels
tend to be higher in the summer than in the winter, so ozone data tend to show both a daily trend and a
seasonal trend. A time plot can also show directional trends or changing variability over time.

       A time plot is constructed by plotting the measurements on the vertical axis  versus the actual
time of observation or the  order of observation on the  horizontal  axis. The points plotted may be
connected by lines, but this may create an unfounded sense of continuity. It is important to use the actual
date, time or number at which the observation was made. This can create discontinuities in the plot, but
these gaps are informative: data that should have been collected appear as "missing values" without
disturbing the integrity of the plot. Plotting the data at equally spaced intervals when in reality there were different
time periods between observations is  not advised.

       For environmental data, it is also important to use a different symbol or color to distinguish non-
detects from detected data. Non-detects are often reported by the analytical laboratory with  a "U" or "<"
analytical  qualifier  associated with the reporting limit  [RL]. In statistical terminology, they  are left-
censored data, meaning the actual concentration of the chemical is known only to be below the RL. Non-
detects contrast with detected data, where the laboratory reports the result as a known concentration that
is statistically  higher than the analytical limit of detection.  For example, the laboratory may report a
trichloroethene concentration in groundwater of "5 U" or "< 5" µg/L, meaning the actual trichloroethene
concentration is unknown, but is bounded between zero and 5 µg/L. This result is different than a
detected concentration of 5 µg/L which is unqualified by the laboratory or data validator. Non-detects
are handled differently than detected data when  calculating summary statistics. A statistician should be
consulted on the proper use of non-detects in statistical analysis. For radionuclides, negative and zero
concentrations should be plotted as reported by the laboratory, showing the detection status.

       The scaling of the vertical axis of a time plot is of some  importance. A wider scale tends to
emphasize large-scale trends, whereas a narrower scale  tends to emphasize small-scale trends. A wide
scale would emphasize the  seasonal component of the data, whereas a  smaller scale would tend to
emphasize the daily fluctuations.  The scale needs to contain the full range of the data.  Directions for
constructing a time plot are contained in Example 9-1 and Figure 9-1.

     ►EXAMPLE 9-1

     Construct a time series plot using trichloroethene groundwater data in Table 9-1  for each well.
Examine the time series for seasonality, directional trends and stationarity.

               Table 9-1. Trichloroethene (TCE) Groundwater Concentrations

                                  Well 1                      Well 2
      Date            TCE (mg/L)    Data          TCE (mg/L)    Data
      Collected                     Qualifier                   Qualifier
      1/2/2005        0.005         U             0.10          U
      4/7/2005        0.005         U             0.12
      7/13/2005       0.004         J             0.125
      10/24/2005      0.006                       0.107
      1/7/2006        0.004         U             0.099         U
      3/30/2006       0.009                       0.11
      6/28/2006       0.017                       0.13
      10/2/2006       0.045                       0.109
      10/17/2006      0.05                        NA
      1/15/2007       0.07                        0.10          U
      4/10/2007       0.12                        0.115
      7/9/2007        0.10                        0.14
      10/5/2007       NA                          0.17
      10/29/2007      0.20                        NA
      12/30/2007      0.25                        0.11

                    NA = Not available (missing data).
                    U denotes a non-detect.
                    J denotes an estimated detected concentration.

       SOLUTION
Step 1.   Import the data into data analysis software capable of producing graphics.

Step 2.   Sort the data by date collected.

Step 3.   Determine the range of the data by calculating the minimum and maximum concentrations for
         each well, shown in the table below:
                         Well 1                      Well 2
               TCE (mg/L)    Data          TCE (mg/L)    Data
                             Qualifier                   Qualifier
      Min      0.004         U             0.099         U
      Max      0.25                        0.17
Step 4.   Plot the data using a scale from 0 to 0.25 if data from both wells are plotted together on the
         same time series plot. Use separate symbols for non-detects and detected concentrations.  One
         suggestion is to use "open"  symbols (whose centers are white) for non-detects and "closed"
         symbols for detects.

Step 5.   Examine each  series  for directional trends, seasonality  and stationarity.  Note that  Well 1
         demonstrates a positive directional trend across time, while Well 2 shows seasonality within
         each year. Neither well exhibits stationarity.

Step 6.   Examine each  series  for missing values. Inquire from the project laboratory why data are
         missing or collected at unequal time intervals. A response from the laboratory for this  data set
         noted that on 10/5/2007 the sample was accidentally broken in the laboratory from Well 1, so
         Well 1 was resampled on 10/29/2007. Well 1 was  resampled on 10/17/2006  to confirm the
         historically high concentration collected on 10/2/2006. Well 2 was not sampled on 10/17/2006
          because the data collected on 10/2/2006 from Well 2 did not merit a resample, unlike Well 1.

Step 7.   Examine each series for elevated detection limits. Inquire why the detection limits for Well 2
         are much larger than detection limits for Well 1. A reason may be that different laboratories
         analyzed the samples from the two wells. The laboratory analyzing samples from Well 1 used
          lower detection limits than did the laboratory analyzing samples from Well 2. ◄
         Figure 9-1. Time Series Plot of Trichloroethene Groundwater for Wells 1 and
                                          2 from 2005-2007.

         [Figure: TCE concentration (mg/L), scaled 0 to 0.25, plotted against sampling date (January 2005
         through January 2008) for Well 1 and Well 2. Open symbols denote non-detects. Closed symbols
         denote detected concentrations.]
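
     For readers constructing such a plot in software, the following Python sketch (assuming the
matplotlib library) reproduces the open/closed-symbol convention of Steps 4-5 for a subset of the Well 1
results in Table 9-1; the exact styling is illustrative only.

import datetime as dt
import matplotlib.pyplot as plt

# Subset of the Well 1 results from Table 9-1 (mg/L); True marks a non-detect
records = [
    (dt.date(2005, 1, 2), 0.005, True),   (dt.date(2005, 4, 7), 0.005, True),
    (dt.date(2005, 7, 13), 0.004, False), (dt.date(2005, 10, 24), 0.006, False),
    (dt.date(2006, 3, 30), 0.009, False), (dt.date(2006, 6, 28), 0.017, False),
    (dt.date(2006, 10, 2), 0.045, False), (dt.date(2007, 4, 10), 0.12, False),
    (dt.date(2007, 10, 29), 0.20, False), (dt.date(2007, 12, 30), 0.25, False),
]

fig, ax = plt.subplots()
for nondetect, face in [(False, "black"), (True, "white")]:
    pts = [(d, c) for d, c, nd in records if nd == nondetect]
    if pts:
        dates, concs = zip(*pts)
        label = "Well 1 non-detect" if nondetect else "Well 1 detect"
        ax.plot(dates, concs, "o", color="black", mfc=face, label=label)   # open vs. closed symbols
ax.set_ylim(0, 0.25)
ax.set_xlabel("Date collected")
ax.set_ylabel("TCE (mg/L)")
ax.legend()
plt.show()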
9.2  BOX PLOTS

      Box plots (also known as Box and Whisker plots) are useful in situations where a picture of the
distribution is desired, but it is not necessary or feasible to portray all the details of the data. A box plot
displays several percentiles of the data set. It is  a simple plot, yet provides insight into the location,
shape, and spread of the data and underlying distribution. A simple box plot contains only the 0th
(minimum data value), 25th, 50th, 75th and 100th (maximum data value) percentiles. A box plot divides
the data into 4 sections, each containing 25% of the data. Whiskers are the lines drawn to the minimum
and maximum data values from the 25th and 75th percentiles. The box shows the interquartile range
(IQR) which is defined as the difference between the 75th and the 25th  percentiles. The length of the
central box indicates the spread of the data (the central 50%), while the length of the whiskers shows the
breadth of the tails of the distribution.  The 50th percentile (median) is the line  within the box. In
addition, the mean and the 95% confidence  limits around the  mean are shown. Potential outliers are
categorized into two groups:

          *  data points between 1.5 and 3 times the IQR above the 75th percentile or between 1.5 and 3
             times the IQR below the 25th percentile, and
          *  data points that exceed 3 times the IQR above the 75th percentile or exceed 3 times the IQR
             below the 25th percentile.

       The mean is shown as a star, while the lower and upper 95% confidence limits around the mean
are shown as bars. Individual data points between 1.5 and 3 times the IQR above the 75th percentile or
below the 25th percentile are shown as circles. Individual data points at least 3 times the IQR above the
75th percentile  or below the 25th percentile are shown as squares.

       Information from box plots can assist in identifying potential data distributions. If the upper box
and whisker are approximately the same length as the lower box and whisker, with the mean and median
approximately  equal, then the data are distributed symmetrically.  The normal  distribution  is one of a
number that is  symmetric. If the upper box and whisker are longer than the lower box and whisker, with
the mean  greater than the median, then the data are right-skewed (such  as lognormal or square root
normal  distributions in original units). Conversely, if the upper box and  whisker are shorter than the
lower box and  whisker with the mean less than the median, then the data are left-skewed.

       A box  plot showing a normal distribution will have the following characteristics: the mean and
median will be in the center of the box, whiskers to the minimum and maximum values are the same
length, and there would be no potential outliers. A box plot showing a lognormal distribution (in original
units) typical of environmental applications will have the following characteristics: the mean will  be
larger than the median, the whisker above the 75th percentile will be longer than the whisker below the
25th percentile, and extreme upper values may be indicated as potential outliers. Once the data have been
logarithmically transformed, the pattern  should follow that described for a normal distribution.  Other
right-skewed distributions transformable to normality would indicate similar patterns.

       It is often helpful to show box plots of different sets of data  side by side to show  differences
between monitoring stations (see Figure 9-2).  This allows a simple method to compare the locations,
spreads and shapes of several data sets or different groups within a single  data set. In this situation, the
width of the box can be proportional to the sample size of each data set. If the data will be compared to a
standard, such as a preliminary remediation goal (PRG) or maximum contaminant level (MCL), a line on
the graph can be drawn to show if any results exceed the criteria.

       It is important to plot the  data as reported by the laboratory  for non-detects  or negative
radionuclide data. Proxy values for non-detects should  not  be plotted  since  we want  to  see the
distribution of the original data. Different symbols can be used to display non-detects, such as the open
symbols described in Section 9.1. The mean will be biased high if using  the RL of non-detects  in the
calculation, but the purpose of the box plot is to assess the distribution of the data, not to provide a
precise estimate of an unbiased mean. Displaying the frequency of detection (number of detected values /
number of total samples) under the station name is also useful. Unlike time series plots, box plots cannot
use missing data, so missing data should be removed before producing a box plot.

       Directions for generating a box plot are contained in Example 9-2, and an example is shown in
Figure 9-2. It  is important to remove lab and field duplicates from the data before calculating summary
statistics such  as the  mean and UCL  since these statistics assume  independent data. The box plot
assumes the data are statistically independent.

     ►EXAMPLE 9-2

     Construct a box plot using the trichloroethene groundwater data in Table 9-1 for each  well.
Examine the box plot to assess how each well is distributed (normal, lognormal, skewed, symmetric,
etc.). Identify possible outliers.

     SOLUTION

Step 1.   Import the data into data analysis software capable of producing box plots.

Step 2.   Sort the data from smallest to largest results by well.

Step3.   Compute the  0th (minimum value), 25th, 50th (median), 75th and 100th  (maximum value)
         percentiles by well.

Step 4.   Plot these points vertically. Draw a box around the 25th and 75th percentiles and add a line
          through the box at the 50th percentile. Optionally, make the width of the box proportional to
         the sample size. Narrow boxes reflect smaller sample sizes, while wider boxes reflect larger
         sample sizes.

Step 5.   Compute the mean and the lower and upper 95% confidence limits. Denote the mean with a
          star and the confidence limits as bars. Also, identify potential outliers between 1.5×IQR and
          3×IQR (shown as circles) and beyond 3×IQR (shown as squares), following the symbol
          conventions described above.
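
     As an illustration of the quantities computed in Steps 3-5, the short Python sketch below (assuming
numpy) returns the quartiles, IQR fences, and candidate outliers for the Well 1 data of Table 9-1; the
drawing of the box, whiskers, mean symbol, and confidence limits is left to the graphics package, and the
function name is hypothetical.

import numpy as np

def box_plot_statistics(data):
    # Steps 3-5: percentiles, IQR, 1.5x and 3x IQR fences, and candidate outliers
    x = np.sort(np.asarray(data, dtype=float))
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)
    mild = x[((x < inner[0]) | (x > inner[1])) & (x >= outer[0]) & (x <= outer[1])]
    extreme = x[(x < outer[0]) | (x > outer[1])]
    return {"min": x[0], "q1": q1, "median": median, "q3": q3, "max": x[-1],
            "mean": x.mean(), "mild_outliers": mild, "extreme_outliers": extreme}

# Well 1 reported TCE values from Table 9-1 (mg/L), non-detects plotted at the RL, NA excluded
well1 = [0.005, 0.005, 0.004, 0.006, 0.004, 0.009, 0.017, 0.045,
         0.05, 0.07, 0.12, 0.10, 0.20, 0.25]
print(box_plot_statistics(well1))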
            Figure 9-2. Box Plots of Trichloroethene Data for Wells 1 & 2

            [Figure: side-by-side box plots of TCE concentration (mg/L), scaled 0 to 0.25, for Well 1
            (FOD 11/14) and Well 2 (FOD 10/13), with a reference line at the PRG of 0.23 mg/L. The legend
            identifies outliers >3×IQR, outliers >1.5×IQR, the mean, and the 95% LCL and UCL.]
9.3  HISTOGRAMS

      A histogram is a visual representation of the data collected into groups. This graphical technique
provides a visual method of identifying the underlying distribution of the data. The data range is divided
into several bins or classes and the data is sorted into the bins. A histogram is a bar graph conveying the
bins and the frequency of data points in  each bin. Other forms of the histogram use a normalization of
the bin frequencies for the heights of the bars.  The two  most common normalizations  are relative
frequencies (frequencies  divided by sample  size) and densities (relative frequency  divided by the bin
width). Figure 9-3 is an example of a histogram using frequencies and Figure 9-4 is a histogram of
densities. Histograms provide a visual method of assessing location, shape and spread of the data. Also,
extreme values and multiple modes can be identified.  The details  of the data  are lost, but an overall
picture of the data is obtained. A stem and leaf plot offers the same insights into the data as a histogram,
but the data values are retained.

      The visual impression of a histogram is sensitive to the number of bins selected. A large number of
bins will increase data detail, while fewer bins will increase the smoothness of the  histogram. A good
starting point when choosing the number of bins is the square root of the sample size n, with a minimum
of 4 bins for any histogram. Another factor in choosing bins is the choice of endpoints. When feasible,
using simple bin endpoints can improve the readability of the histogram. Simple bin endpoints include
multiples of 5×10^k units for some integer k (e.g., 0 to <5, 5 to <10, etc., or 1 to <1.5, 1.5 to <2, etc.).
Finally, when plotting a histogram for a continuous variable (e.g.,
concentration), it is necessary to decide on an endpoint convention; that is, what to do with data points
that fall on the boundary of a bin. Also, use the data as reported by the laboratory for non-detects and
eliminate any missing values, since histograms cannot include  missing data.  With discrete variables,
(e.g., family size) the intervals can be centered in between the variables. For the family size data, the
intervals can span between 1.5 and 2.5, 2.5 and 3.5, and so on. Then the whole numbers that relate to the
family size can be centered within the box. Directions for generating a histogram are contained in
Example 9-3.
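
     Before turning to that example, the bin selection described above can be sketched in a few lines.
The sketch below assumes Python with numpy and matplotlib, and the data values are placeholders; it
applies the square-root rule for the number of bins, enforces the minimum of 4 bins, and draws a
frequency histogram.

    # Square-root rule for the number of histogram bins (placeholder data)
    import numpy as np
    import matplotlib.pyplot as plt

    data = np.array([0.004, 0.009, 0.017, 0.045, 0.05, 0.06, 0.094,
                     0.10, 0.12, 0.13, 0.15, 0.17, 0.23, 0.25])        # substitute the site data

    n_bins = max(4, int(round(np.sqrt(len(data)))))     # roughly sqrt(n), but never fewer than 4 bins
    edges = np.linspace(data.min(), data.max(), n_bins + 1)

    plt.hist(data, bins=edges)                 # frequencies; use density=True for a density histogram
    plt.xlabel('Concentration (mg/L)')
    plt.ylabel('Frequency')
    plt.show()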

     ^EXAMPLE 9-3

     Construct a histogram using the  trichloroethene groundwater data in Table 9-1  for each well.
Examine the histogram to assess how each well is distributed (normal, lognormal, skewed, symmetric,
etc.).

     SOLUTION

Step 1.   Import the data into data analysis software capable of producing histograms.

Step 2.   Sort the data from smallest to  largest results by well.

Step 3.   With n = 14 concentrations for Well 1, a rough estimate of the number of bins is √14 = 3.74,
          or 4 bins. Since the data from Well 1 range from 0.004 to 0.25, the suggested bin width is
          calculated as (maximum concentration - minimum concentration) / number of bins = (0.25 -
          0.004) / 4 = 0.0615. Therefore, the bins for Well 1 are 0.004 to <0.0655, 0.0655 to <0.127,
          0.127 to <0.1885, and 0.1885 to 0.25 mg/L.

          Similarly, with n = 13 concentrations for Well 2, the number of bins is √13 = 3.61, or 4 bins.
          Since the data from Well 2 range from 0.099 to 0.17, the suggested bin width is calculated as
          (maximum concentration - minimum concentration) / number of bins = (0.17 - 0.099) / 4 =
          0.01775. Therefore, the bins for Well 2 are 0.099 to <0.11675, 0.11675 to <0.1345, 0.1345 to
          <0.15225, and 0.15225 to 0.17 mg/L.

Step 4.   Construct a frequency table using the bins defined in Step 3. Table 9-2 shows the frequency or
         number of observations within each bin defined in  Step 3 for Wells 1 and 2. The third column
         shows the relative frequency  which is the  frequency divided by the  sample size n. The  final
         column of the table gives the densities or the relative frequencies divided by  the bin widths
         calculated in Step 3.

Step 5.   The horizontal axis for the data is from 0.004 to 0.25 mg/L for Well 1 and 0.099 to 0.17 for
         Well 2. The vertical axis for the histogram of frequencies is from  0 to 9  and the  vertical axis
         for the histogram of relative frequencies is from 0% - 70%.

Step 6.   The histograms  of frequencies  are shown in Figure 9-3. The  histograms  of  relative
         frequencies or densities are shown in Figure 9-4. Note that frequency,  relative frequency and
         density histograms all show the same shape since the  scale of the vertical axis is divided  by
        the sample size  or the bin width. These histograms confirm the  data are not  normally
        distributed for either well, but are closer to lognormal.
              Table 9-2. Histogram Bins for Trichloroethene Groundwater Data

                                                        Relative
              Bin                       Frequency    Frequency (%)    Density
                                         Well 1
              0.0040 to <0.0655 mg/L        9            64.3           10.5
              0.0655 to <0.1270 mg/L        3            21.4            3.5
              0.1270 to <0.1885 mg/L        0             0              0
              0.1885 to 0.2500 mg/L         2            14.3            2.3
                                         Well 2
              0.099 to <0.11675 mg/L        8            61.5           34.7
              0.11675 to <0.1345 mg/L       3            23.1           13.0
              0.1345 to <0.15225 mg/L       1             7.7            4.3
              0.15225 to 0.17 mg/L          1             7.7            4.3
               Figure 9-3. Frequency Histograms of Trichloroethene by Well.

       [Figure: frequency histograms (vertical axis 0 to 9) of trichloroethene for Well 1,
       with bin edges at 0.004, 0.0655, 0.127, 0.1885, and 0.25 mg/L, and for Well 2, with
       bin edges at 0.099, 0.11675, 0.1345, 0.15225, and 0.17 mg/L.]

          Figure 9-4.  Relative Frequency Histograms of Trichloroethene by Well.

       [Figure: relative frequency histograms (vertical axis 0% to 70%) of trichloroethene
       for Well 1 and Well 2, using the same bins as Figure 9-3.]

9.4  SCATTER PLOTS

      For data sets consisting of multiple observations per sampling point, a scatter plot is one of the
most powerful graphical tools for analyzing the relationship between two or more variables. Scatter plots
are easy to construct for two variables, and many software packages can construct 3-dimensional scatter
plots. A scatter plot  can clearly show the  relationship between two variables if the data range is
sufficiently large. Truly linear relationships can always be identified in scatter plots, but truly nonlinear
relationships may appear linear (or some other form) if the data range is relatively small. Scatter plots of
linearly correlated variables cluster about a straight line.

      As an  example of a nonlinear relationship,  consider two  variables  where one variable is
approximately equal to the square of the other. With an adequate range in the data, a scatter plot of this
data would display a partial parabolic curve. Other important modeling relationships that may appear are
exponential or logarithmic. Two additional uses of scatter plots are the identification of potential outliers
for a single variable or for the paired variables and the identification of clustering in the data. Directions
for generating a scatter plot are contained in Example 9-4.
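
     The basic construction is straightforward in most software. A minimal sketch, assuming Python
with matplotlib and using a few illustrative concentration pairs rather than the complete Table 9-3
records, is shown below.

    # Basic two-variable scatter plot (illustrative pairs; substitute the paired Table 9-3 results)
    import matplotlib.pyplot as plt

    arsenic = [0.04, 0.01, 0.05, 0.09, 0.07, 0.15, 0.12, 0.10, 0.30, 0.25]   # mg/L
    mercury = [0.06, 0.02, 0.07, 0.10, 0.08, 0.11, 0.08, 0.07, 0.29, 0.23]   # mg/L

    plt.scatter(arsenic, mercury)
    plt.xlabel('Arsenic (mg/L)')
    plt.ylabel('Mercury (mg/L)')
    plt.show()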

      ^EXAMPLE 9-4

     Construct a scatter plot using the groundwater data in Table 9-3 for arsenic and mercury from a
single well collected approximately quarterly across time. Examine the scatter plot for linear or quadratic
relationships between  arsenic and mercury, correlation, and for potential outliers.

                     Table 9-3. Groundwater Concentrations from Well 3
       Date          Arsenic              Mercury              Strontium
       Collected     Conc.     Data       Conc.     Data       Conc.     Data
                     (mg/L)    Qualifier  (mg/L)    Qualifier  (mg/L)    Qualifier
       1/2/2005       0.01       U         0.02       U         0.10
       4/7/2005       0.01       U         0.03                 0.02       U
       7/13/2005      0.02                 0.04       U         0.05       U
       10/24/2005     0.04                 0.06                 0.11
       1/7/2006       0.01                 0.02                 0.05
       3/30/2006      0.05                 0.07                 0.07
       6/28/2006      0.09                 0.10                 0.03
       10/2/2006      0.07                 0.08                 0.04
       10/17/2006     0.10                  NA                  0.02       U
       1/15/2007      0.02       U         0.03       U         0.15
       4/10/2007      0.15                 0.11                 0.03
       7/9/2007       0.12                 0.08                 0.10
       10/5/2007      0.10                 0.07                 0.09
       10/29/2007     0.30                 0.29                 0.05
       12/30/2007     0.25                 0.23                 0.22
         NA = Not available (missing data).
         U denotes a non-detect.
       SOLUTION
Step 1.   Import the data into data analysis software capable of producing scatter plots.

Step 2.   Sort the data by date collected.

Step 3.   Calculate the range of concentrations for each constituent. If the ranges of the two constituents
          are similar, then scale both the X and Y axes from the minimum to the maximum concentrations
          of both constituents. If the ranges are very different (e.g., differing by two or more orders of
          magnitude), then the scales for both axes should probably be logarithmic (log10). The data
          will be plotted as pairs from (X1, Y1) to (Xn, Yn) for each sampling date, where n = number of
          samples.

Step 4.   Use separate symbols to distinguish detected from non-detected concentrations. Note that the
          concentration for one constituent may be detected, while the concentration for the other
          constituent may not be detected for the same sampling date. If the concentration for one
          constituent is missing, then the pair (Xi, Yi) cannot be plotted since both concentrations are
          required. Figure 9-5 shows a linear correlation between arsenic and mercury with two
          possible outliers. The Pearson correlation coefficient is 0.97, indicating a significantly high
          correlation. The linear regression line is displayed to show the linear correlation between
          arsenic and mercury (see the sketch below). -4
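
     The correlation and regression line reported in Step 4 can be reproduced with standard statistical
software. The sketch below assumes Python with scipy and uses only the arsenic-mercury pairs from
Table 9-3 for which both results are detected; these pairs give a Pearson correlation of roughly 0.97.

    # Pearson correlation and regression line for the detected arsenic-mercury pairs
    from scipy import stats

    arsenic = [0.04, 0.01, 0.05, 0.09, 0.07, 0.15, 0.12, 0.10, 0.30, 0.25]   # detected pairs only
    mercury = [0.06, 0.02, 0.07, 0.10, 0.08, 0.11, 0.08, 0.07, 0.29, 0.23]

    r, p_value = stats.pearsonr(arsenic, mercury)       # correlation coefficient (about 0.97 here)
    fit = stats.linregress(arsenic, mercury)            # slope and intercept of the plotted line
    print(round(r, 2), round(fit.slope, 2), round(fit.intercept, 3))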

                Figure 9-5. Scatter  Plot of Arsenic with Mercury from Well 3

       [Figure: scatter plot of mercury (y-axis, 0.05 to 0.30 mg/L) versus arsenic (x-axis,
       0.05 to 0.30 mg/L), with separate symbols for 'both detected', 'both non-detects',
       'arsenic non-detect only', and 'mercury non-detect only', and the fitted linear
       regression line.]


      Many software packages can extend the 2-dimensional scatter plot by constructing a 3-dimensional
scatter plot for 3  constituents. However, with more than 3 variables, it is difficult to construct and
interpret a  scatter plot. Therefore, several graphical representations have been developed that extend the
idea of a scatter plot for  data consisting of more than 2  variables.  The simplest of these graphical
techniques is a coded scatter plot.  All possible two-way combinations  are given a symbol and the pairs
of data are  plotted  on  one  2-dimensional scatter plot.  The  coded scatter plot does not  provide
information on three  way  or higher interactions between the variables since  only two dimensions are
plotted. If the data ranges for the variables are comparable,  then a single set of axes may suffice. If the
data ranges are too dissimilar (e.g., at least two orders of magnitude), different scales may be required.
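
     A coded scatter plot can be built by simply overlaying the two-way pairings with different symbols.
A minimal sketch, assuming Python with matplotlib and using an illustrative subset of the Table 9-3
results, follows.

    # Coded scatter plot: every two-way pairing on one set of axes (illustrative arrays)
    import matplotlib.pyplot as plt

    arsenic   = [0.04, 0.01, 0.05, 0.09, 0.07, 0.15, 0.12, 0.10, 0.30, 0.25]
    mercury   = [0.06, 0.02, 0.07, 0.10, 0.08, 0.11, 0.08, 0.07, 0.29, 0.23]
    strontium = [0.11, 0.05, 0.07, 0.03, 0.04, 0.03, 0.10, 0.09, 0.05, 0.22]

    plt.scatter(arsenic, mercury,   marker='o', label='Arsenic (X) vs. Mercury (Y)')
    plt.scatter(arsenic, strontium, marker='s', label='Arsenic (X) vs. Strontium (Y)')
    plt.scatter(mercury, strontium, marker='^', label='Mercury (X) vs. Strontium (Y)')
    plt.xlabel('mg/L')
    plt.ylabel('mg/L')
    plt.legend()
    plt.show()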

      ^EXAMPLE 9-5

      Construct a coded scatter plot using the groundwater data in Table 9-3 for arsenic, mercury, and
strontium from Well 3 collected approximately quarterly across time. Examine the scatter plot for linear
or quadratic relationships between the three inorganics, correlation, and for potential  outliers.

      SOLUTION

Step 1.   Import the data into data analysis software capable of producing scatter plots.

Step 2.   Sort the data by date collected.

Step 3.   Calculate the range of concentrations for each constituent. If the ranges of the three constituents
          are similar, then scale both the X and Y axes from the minimum to the maximum
          concentrations of all three constituents. Since the ranges of concentrations are very similar, the
          minimum to the maximum concentrations of all three constituents will be used for both axes.

Step 4.   Let each arsenic concentration be denoted by Xi, each mercury concentration be denoted by
          Yi, and each strontium concentration be denoted by Zi. Plot the arsenic-mercury pairs (Xi, Yi),
          the arsenic-strontium pairs (Xi, Zi), and the mercury-strontium pairs (Yi, Zi) for i = 1 to n on
          a single set of axes, using a distinct symbol for each pairing as identified in the legend of
          Figure 9-6. Examine the coded scatter plot for relationships among the three constituents and
          for potential outliers.
      Figure 9-6.  Coded Scatter Plot of Well 3 Arsenic, Mercury, and  Strontium

       [Figure: coded scatter plot with both axes in mg/L (0.05 to 0.30), using separate
       symbols for Arsenic (X) vs. Mercury (Y), Arsenic (X) vs. Strontium (Y), and
       Mercury (X) vs. Strontium (Y).]
9.5  PROBABILITY PLOTS

     A simple, but extremely useful visual assessment of normality is to graph the data as a probability
plot. The y-axis is scaled to represent quantiles or z-scores from a standard normal distribution and the
concentration measurements are arranged in increasing order along the x-axis. As each observed value is
plotted on the x-axis, the z-score corresponding to the proportion of observations less than or equal to
that measurement is plotted as the y-coordinate. Often, the y-coordinate is computed by the following
formula:

                                  yi = Φ⁻¹[ i / (n + 1) ]                                  [9.1]

where Φ⁻¹ denotes the inverse of the cumulative standard normal distribution, n represents the sample
size, and i represents the rank position of the ith ordered concentration. The plot is constructed so that, if
the data are normal, the points when plotted will lie on a straight line. Visually apparent curves or bends
indicate that the data do not follow a normal distribution.
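
     Equation [9.1] is easy to implement directly. The sketch below assumes Python with numpy, scipy,
and matplotlib; it computes the plotting positions i/(n + 1), converts them to z-scores with the inverse
standard normal function, and plots them against the ordered data. The nickel concentrations of
Example 9-6 (Table 9-4) are used for illustration; plotting np.log(x) instead of x gives the corresponding
check of lognormality.

    # Normal probability plot built directly from equation [9.1]
    import numpy as np
    from scipy import stats
    import matplotlib.pyplot as plt

    x = np.sort(np.array([1.0, 3.1, 8.7, 10.0, 14.0, 19.0, 21.4, 27.0, 39.0, 56.0,
                          58.8, 64.4, 81.5, 85.6, 151.0, 262.0, 331.0, 578.0, 637.0, 942.0]))  # ppb
    n = len(x)
    ranks = np.arange(1, n + 1)
    z = stats.norm.ppf(ranks / (n + 1))       # z-scores for cumulative probabilities i/(n+1)

    plt.plot(x, z, 'o')                       # replace x with np.log(x) to check lognormality
    plt.xlabel('Concentration (ppb)')
    plt.ylabel('Normal quantile (z-score)')
    plt.show()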

     Probability plots are particularly useful for spotting irregularities within the data when compared to
a specific distributional  model (usually, but not always, the normal). It is easy to determine whether
departures from normality are occurring more or less in the middle ranges of the data or in the extreme
tails. Probability plots can also indicate the presence of possible outlier values that do not follow the
basic pattern of the data and can show the presence of significant positive or negative skewness.

     If a (normal) probability plot is constructed on the combined data from several wells and normality
is  accepted, it suggests — but  does not prove — that  all of the  data came from the same normal
distribution. Consequently, each subgroup of the  data  set (e.g.,  observations from distinct wells)
probably has the same mean and  standard deviation. If  a  probability  plot is constructed on the data
residuals (each value minus its  subgroup mean) and is not a straight  line, the interpretation  is more
complicated. In this case, either the residuals are not normally-distributed, or there is a subgroup of the
data with a normal distribution but a different mean or standard deviation than the other subgroups. The
probability plot will indicate  a deviation from the  underlying assumption of  a  common  normal
distribution in either case. It would be prudent to examine normal probability plots by well  on the same
plot if the ranges of the data are  similar.  This would show how the  data are distributed by well to
determine which wells may depart from normality.

     The  same probability plot technique may be used to investigate whether a set of data or residuals
follows a lognormal distribution. The procedure is generally the same, except that one first replaces each
observation by its natural logarithm. After the data have been transformed to their natural logarithms, the
probability plot  is constructed as before.  The only difference  is that the natural logarithms of the
observations  are used  on the x-axis. If the data are  lognormal, the  probability plot of the logged
observations will approximate a straight line.

     ^EXAMPLE 9-6

     Determine whether the dataset in Table 9-4 is normal by using a probability plot.

     SOLUTION

Step 1.   After combining the data into a single group, list the measured nickel concentrations in order
         from lowest to highest.

Step 2.   The cumulative probabilities, representing for each observation (xi) the proportion of values
          less than or equal to xi, are given in the third column of the table below. These are computed
          as i/(n + 1) where n is the total number of samples (n = 20).

Step 3.   Determine the quantiles or z-scores from the standard normal distribution corresponding to the
         cumulative probabilities in Step 2. These can be  found by successively letting P equal each
         cumulative probability and then looking  up  the  entry in Table  10-1  (Appendix  D)
         corresponding to P.  Since the standard  normal  distribution is  symmetric about zero,  for
         cumulative probabilities P < 0.50, look up the entry for (1-P) and give this value a negative
         sign.

Step 4.   Plot the normal quantile (z-score) versus the ordered concentration for each sample, as in the
         plot below (Figure 9-7).  The curvature found  in  the probability plot indicates that  there is
         evidence of non-normality in the data. -4
                     Table 9-4.  Nickel Concentrations from a Single Well

              Nickel                          Cumulative        Normal
              Concentration      Order        Probability       Quantile
              (ppb)              (i)          [i/(n+1)]         (z-score)
                  1.0              1             0.048            -1.668
                  3.1              2             0.095            -1.309
                  8.7              3             0.143            -1.068
                 10.0              4             0.190            -0.876
                 14.0              5             0.238            -0.712
                 19.0              6             0.286            -0.566
                 21.4              7             0.333            -0.431
                 27.0              8             0.381            -0.303
                 39.0              9             0.429            -0.180
                 56.0             10             0.476            -0.060
                 58.8             11             0.524             0.060
                 64.4             12             0.571             0.180
                 81.5             13             0.619             0.303
                 85.6             14             0.667             0.431
                151.0             15             0.714             0.566
                262.0             16             0.762             0.712
                331.0             17             0.810             0.876
                578.0             18             0.857             1.068
                637.0             19             0.905             1.309
                942.0             20             0.952             1.668
       PROBABILITY PLOTS FOR LOG TRANSFORMED DATA

Step 1.   List the natural logarithms of the measured nickel concentrations in Table 9-4 in order from
         lowest to highest. These are shown in Table 9-5.

Step 2.   The cumulative probabilities, representing the proportion of values less than or equal to xi for
          each observation (xi), are given in the third column of Table 9-4. These are computed as i / (n
          + 1) where n is the total number of samples (n = 20).

Step 3.   Determine the quantiles or z-scores from the standard normal distribution corresponding to the
         cumulative probabilities in  Step 2. These can be  found by successively letting P equal each
          cumulative probability and then looking up the entry in Table 10-1 (Appendix D)
          corresponding to P. Since the standard normal distribution is symmetric about zero, for
          cumulative probabilities P < 0.50, look up the entry for (1-P) and give this value a negative
         sign.
                   Table 9-5. Nickel Log Concentrations from a Single Well

                               Log Nickel         Cumulative        Normal
              Order            Concentration      Probability       Quantile
              (i)              log(ppb)           [i/(n+1)]         (z-score)
                1                  0.00              0.048            -1.668
                2                  1.13              0.095            -1.309
                3                  2.16              0.143            -1.068
                4                  2.30              0.190            -0.876
                5                  2.64              0.238            -0.712
                6                  2.94              0.286            -0.566
                7                  3.06              0.333            -0.431
                8                  3.30              0.381            -0.303
                9                  3.66              0.429            -0.180
               10                  4.03              0.476            -0.060
               11                  4.07              0.524             0.060
               12                  4.17              0.571             0.180
               13                  4.40              0.619             0.303
               14                  4.45              0.667             0.431
               15                  5.02              0.714             0.566
               16                  5.57              0.762             0.712
               17                  5.80              0.810             0.876
               18                  6.36              0.857             1.068
               19                  6.46              0.905             1.309
               20                  6.85              0.952             1.668
Step 4.   Plot the normal quantile (z-score) versus the ordered logged concentration for each sample, as
         in the plot below (Figure 9-8). The reasonably linear trend found in the probability  plot
         indicates that the log-scale data closely follow a normal pattern, further suggesting that the
         original data closely follow a lognormal distribution.
                      Figure 9-7. Nickel Normal Probability Plot

       [Figure: normal quantile (z-score, -2 to 2) versus nickel concentration (0 to
       1000 ppb); the plotted points show pronounced curvature.]

             Figure 9-8. Probability Plot of Log Transformed Nickel Data

       [Figure: normal quantile (z-score, -2 to 2) versus log(nickel) concentration (0 to
       8 log(ppb)); the plotted points fall close to a straight line.]


             CHAPTER  10.   FITTING  DISTRIBUTIONS
       10.1   IMPORTANCE OF DISTRIBUTIONAL MODELS	10-1
       10.2   TRANSFORMATIONS TO NORMALITY	10-3
       10.3   USING THE NORMAL DISTRIBUTION AS A DEFAULT	10-5
       10.4   COEFFICIENT OF VARIATION AND COEFFICIENT OF SKEWNESS	10-9
       10.5   SHAPIRO-WILK AND SHAPIRO-FRANCIA NORMALITY TESTS	10-13
          10.5.1   Shapiro-Wilk Test (n< 50)	10-13
          10.5.2   Shapiro-Francia Test (n > 50)	10-15
       10.6   PROBABILITY PLOT CORRELATION COEFFICIENT	10-16
       10.7   SHAPIRO-WILK MULTIPLE GROUP TEST OF NORMALITY	10-18
     Because a statistical or mathematical model is at best an approximation of reality, all statistical
tests and procedures require certain assumptions for the methods to be used correctly and for the results
to be properly interpreted. Many tests make an assumption regarding the underlying distribution of the
observed data; in particular, that the original or transformed sample  measurements follow a normal
distribution. Data transformations are discussed in Section 10.2 while considerations as to whether the
normal distribution should be used as a 'default'  are explored in Section 10.3. Several techniques for
assessing normality are also examined, including:

     •  The skewness coefficient (Section 10.4)
     •  The Shapiro-Wilk test of normality and its close variant, the Shapiro-Francia test (Section 10.5)
     •  Filliben's probability plot correlation coefficient test (Section 10.6)
     •  The Shapiro-Wilk multiple group test of normality (Section 10.7)
10.1 IMPORTANCE OF DISTRIBUTIONAL  MODELS

     As introduced in Chapter 3, all statistical testing relies on the critical assumption that the sample
data are representative of the population from which they are selected. The statistical distribution of the
sample is  assumed to be similar to the distribution of the mostly unobserved population  of possible
measurements. Many parametric testing methods make a further assumption: that the form or type of the
underlying population is at least approximately known or can be identified through diagnostic testing.
Most  of these parametric tests assume that the population is normal in distribution; the validity or
accuracy of the test results may be in question if that assumption is violated.

     Consequently, an important facet of choosing among appropriate test  methods is determining
whether a commonly-used statistical distribution, such as the normal, adequately models the observed
sample data. A large variety of possible distributional models exist in the statistical literature; most are
not typically applied to groundwater measurements and  often introduce  additional  statistical or
mathematical complexity in working with them.  So groundwater statistical models  are usually confined
to the gamma distribution, the Weibull distribution, or distributions that are normal or can be normalized
via a transformation (e.g., the logarithmic or square  root).

     Although the Unified Guidance will occasionally reference procedures that assume an underlying
gamma or Weibull distribution, the presentation in this guidance will focus on distributions that can be
normalized  and diagnostic tools  for  assessing  normality.  The  principal reasons for limiting  the
discussion in this manner are: 1) the same tools useful for testing normality can be utilized with  any
distribution that can be normalized; the only change needed is to perform the normality test after first
making a data transformation; 2) if no transformation works to adequately  normalize the sample data, a
non-parametric test can often be used as an alternative statistical approach; and  3) addressing more
complicated scenarios  is outside the scope of the guidance and  may require  professional statistical
consultation.

     Understanding the statistical behavior of groundwater measurements can be very challenging. The
constituents of interest may occur  at  relatively  low  concentrations  and frequently be left-censored
because of current  analytical  method limitations.  Sample data are  often positively skewed  and
asymmetrical in distributional pattern, perhaps due to the presence of outliers, inhomogeneous mixing of
contaminants in the  subsurface, or spatially variable  soils deposition affecting the local groundwater
geochemistry.  For some constituents, the  distribution  in groundwater  is not stationary over time (e.g.,
due to linear or seasonal trends) or not stationary across space (due to spatial variability in mean levels
from well to well). A set of these measurements pooled over time and/or space may appear highly non-
normal, even if the underlying population at any fixed point in time or space is normal.

     Because of these complexities, fitting a distributional model to a set of sample data cannot be done
in isolation from checks of other key  statistical assumptions. The data must  also be evaluated for outliers
(Chapter 12), since the presence of even one extreme outlier may prevent an otherwise recognizable
distribution from being correctly identified. For data grouped across wells, the possible presence of
spatial variability must be considered (Chapter 13). If identified, the Shapiro-Wilk multiple group  test
of normality may  be needed to account for differing means and/or variances  at  distinct wells. Data
pooled across  sampling events (i.e., over time) must be examined for the presence of trends or seasonal
patterns (Chapter 14). A clearly identified pattern may need to be removed and the data residuals tested
for normality, instead of the raw measurements.

     A frequently encountered problem involves testing normality on data sets containing non-detect
values. The best goodness-of-fit tests attempt to assess whether the  sample data closely resemble the
tails of the candidate distributional model. Since non-detects represent left-censored observations where
the  exact concentrations are unknown for the lower tail of the sample distribution, standard normality
tests cannot be run without some estimate or imputation of these unknown values. For a small fraction of
non-detects in a sample (10-15% or less) censored at a single reporting limit, it may be possible to apply
a normality test by simply replacing each non-detect with an imputed value of half the RL. However,
more complicated situations arise when there  is  a  combination of multiple  RLs (detected values
intermingled with different non-detect levels), or the proportion of non-detects  is larger. The Unified
Guidance recommends different strategies in these circumstances.
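
     For the simple single-RL case just described, the half-RL substitution followed by a normality test
might be sketched as follows (a minimal illustration assuming Python with numpy and scipy; the
reporting limit and measured values are hypothetical):

    # Half-RL substitution for a small fraction of non-detects, then a normality test
    import numpy as np
    from scipy import stats

    rl = 1.0                                                    # single reporting limit (assumed)
    detects = np.array([1.4, 2.2, 2.9, 3.5, 4.1, 5.0, 6.3, 7.8, 9.4])   # hypothetical detected values
    n_nd = 1                                                    # one non-detect, about 10% of the sample
    sample = np.concatenate([detects, np.full(n_nd, rl / 2.0)]) # impute half the RL for each non-detect

    w_stat, p_value = stats.shapiro(sample)                     # Shapiro-Wilk test (Section 10.5)
    print(round(w_stat, 3), round(p_value, 3))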

     Properly ordering  the sample observations  (i.e.,  from least  to  greatest) is critical to any
distributional goodness-of-fit test. Because the concentration of a non-detect measurement is only known
to be in the range from zero to the  RL, it is generally impossible to construct a  full  ordering of the
sample.1 There  are  methods, however,  to  construct partial orderings of the data that  allow the
assignment of relative rankings to each of the detected measurements and which account for the
presence of censored values. In turn, a partial ordering enables construction of an approximate normality
test. This subject is covered in Chapter 15.

10.2 TRANSFORMATIONS TO NORMALITY

     Guidance users will often encounter data sets indicating significant evidence of non-normality.
Due to the presumption of most parametric  tests that the underlying population is normal, a common
statistical strategy for apparently non-normal observations is to search for a normalizing mathematical
transformation. Because of the complexities associated with interpreting statistical results from data that
have been transformed to another scale, some  care must be taken in applying statistical procedures to
transformed measurements. In questionable or  disputable  circumstances,  it may be wise to analyze the
same  data with an equivalent non-parametric version of the same test (if it exists) to see if the same
general  conclusion is reached. If not,  the data transformation and its interpretation may need further
scrutiny.

     Particularly with prediction limits, control charts, and some of the confidence intervals described in
Chapters 18, 20, and 21, the parametric versions of these procedures are especially advantageous. Here,
a transformation may be warranted to  approximately normalize the statistical sample. Transformations
are also often useful when combining or pooling intrawell background from several wells in  order to
increase the degrees  of freedom available for intrawell  testing (Chapter 13).  Slight differences in the
distributional  pattern from well to well  can skew  the  resulting pooled  dataset,  necessitating  a
transformation to bring about approximate normality and to equalize the variances.

     The  interpretation  of transformed data is straightforward  in  the  case  of prediction  limits for
individual observations or when building a confidence interval around an upper percentile. An interval
with limits constructed from the transformed data and then re-transformed (or back-transformed) to the
original measurement domain will retain its original probabilistic interpretation. For instance, if the data
are approximately normal under a square  root transformation  and a 95% confidence prediction limit is
constructed on the square roots of the original measurements, squaring the resulting prediction limit
allows for a 95% confidence level when applied to the original  data.
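
     As a concrete illustration of this back-transformation, the sketch below assumes Python with numpy
and scipy, uses hypothetical data, and applies the usual normal prediction limit for a single future value,
mean + t·s·√(1 + 1/n), purely for demonstration. The limit is computed on the square-root scale and then
squared to return to concentration units.

    # Prediction limit computed on square-root transformed data, then back-transformed
    import numpy as np
    from scipy import stats

    x = np.array([2.1, 3.4, 4.0, 5.6, 6.1, 7.9, 9.2, 11.5])     # hypothetical concentrations
    y = np.sqrt(x)                                              # transformed (approximately normal) scale

    n = len(y)
    t_95 = stats.t.ppf(0.95, df=n - 1)
    pl_sqrt = y.mean() + t_95 * y.std(ddof=1) * np.sqrt(1 + 1.0 / n)   # 95% limit on the sqrt scale
    print(round(pl_sqrt ** 2, 2))                               # squared limit applies to the raw data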

     The  same ease of interpretation does not apply to prediction limits for a future arithmetic mean
(Chapter  18) or to confidence intervals around  an arithmetic mean  compared  to  a  fixed  GWPS
(Chapter 21). A back-transformed confidence  interval constructed around the mean of log-transformed
data (i.e., the log-mean) corresponds to a confidence interval around the geometric mean of the raw
(untransformed) data. For the lognormal distribution, the geometric mean is equal  to the median, but it is
not the same as the arithmetic mean. Using this back-transformation to bracket the location of the true
arithmetic population mean will result in an incorrect interval.

     For these particular applications, a similar  problem  of scale  bias occurs with other potential
normality transformations. Care is needed when applying  and interpreting transformations to a data set
1  Even when all the non-detects represent the lowest values in the sample, there is still no way to determine how this subset is
  internally ordered.


for which either a confidence interval around the mean or a prediction limit for a future mean is desired.
The interpretation depends on which statistical parameter is being estimated or predicted. The geometric
mean or median in some situations may be a satisfactory alternative as a central tendency parameter,
although that decision must be weighed carefully when making comparisons against a GWPS.

     Common normalizing transformations include the natural logarithm, the square root, the cube root,
the square, the cube, and the reciprocal functions, as well as a few others.  More generally, one might
consider the "ladder of powers" (Helsel and Hirsch, 2002) technically known  as the set of Box-Cox
transformations (Box and Cox,  1964). The heart of these transformations is a power transformation of
the original data, expressed by the equations:
                                y = (x^λ - 1) / λ      for λ ≠ 0
                                                                                          [10.1]
                                y = log(x)             for λ = 0

     The goal of a Box-Cox analysis is to find the value λ that best transforms the data to approximate
normality, using a procedure such as maximum likelihood. Such algorithms are beyond the scope of this
guidance, although  an excellent discussion  can be found in Helsel and Hirsch (2002).  In  practice,
slightly different equation formulations can be used:


                                y = x^λ                for λ ≠ 0
                                                                                          [10.2]
                                y = log(x)             for λ = 0

where the parameter λ can generally be limited to the choices 0, -1, 1/4, 1/3, 1/2, 1, 2, 3, and 4, except
for unusual cases of more extreme powers.

     As  noted in  Section  10.1, checking  normality  with  transformed data does not  require any
additional tools. Standard normality tests can be applied using the transformed  scale measurements.
Only the interpretation of the test changes. A goodness-of-fit test can assess the  normality of the raw
measurements. Under a transformation, the same test checks for normality on the transformed scale. The
data will still follow the non-normal distribution in the original  concentration domain. So if a cube root
transformation is attempted and the transformed data are found to be approximately normal, the original
data are not normal but rather cube-root normal in distribution. If a log transformation is  successfully
used, the original measurements  are not normal but lognormal  instead. In sum, a  series of non-normal
distributions can be  fitted to data with the goodness-of-fit tests described in this chapter without needing
specific tests for other potential distributions.

     Finding a reasonable transformation in practice amounts to systematically 'climbing' the "ladder of
powers" described above. In other words, different choices of the power parameter λ are attempted,
beginning with λ = 0 and working through the other candidate powers toward more extreme
transformations, until a specific λ normalizes the data or all choices have been attempted. If no
transformation seems to work, the user should instead consider a non-parametric test alternative.
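
     One simple way to automate this search is to step through the candidate powers and check each
transformed data set with a normality test such as the Shapiro-Wilk test of Section 10.5. A minimal
sketch, assuming Python with numpy and scipy and using the positive-valued nickel data of Table 9-4
for illustration, follows.

    # Climbing the ladder of powers: test each candidate lambda for approximate normality
    import numpy as np
    from scipy import stats

    x = np.array([1.0, 3.1, 8.7, 10.0, 14.0, 19.0, 21.4, 27.0, 39.0, 56.0,
                  58.8, 64.4, 81.5, 85.6, 151.0, 262.0, 331.0, 578.0, 637.0, 942.0])  # positive values

    for lam in [-1, 0, 1/4, 1/3, 1/2, 1, 2, 3, 4]:              # candidate powers from equation [10.2]
        y = np.log(x) if lam == 0 else x ** lam
        w, p = stats.shapiro(y)
        print(f"lambda = {lam}: Shapiro-Wilk p-value = {p:.3f}")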
10.3 USING THE  NORMAL DISTRIBUTION AS A DEFAULT

     Normal and lognormal distributions are frequently applied models in groundwater data because of
their general utility.  One or the other of these models might be chosen as a default distribution when
designing a statistical approach, particularly when relatively little data has been collected at a site. Since
the statistical behavior of these two models is  very different  and can lead to substantially different
conclusions, the choice  is not arbitrary. The type of test involved, the monitoring program, and the
sample  size can all affect the decision. For many data sets  and  situations, however, the normal
distribution can be  assumed as a default unless and until a  better model can be pinpointed through
specific goodness-of-fit testing provided in this chapter.

     Assumptions of normality are most easily made with regard to naturally-occurring and measurable
inorganic parameters, particularly under background conditions. Many ionic and other inorganic water
quality analyte measurements exhibit decent symmetry and low variability within a given well data set,
making these data amenable to assumptions of normality.  Less frequently detected analytes (e.g., certain
colloidal trace elements) may be better fit either by a site-wide lognormal or another distribution that can
be normalized, as well as evaluated with non-parametric methods.

     Where contamination in groundwater is known to exist a priori (whether in background or
compliance wells),  default distributional assumptions become more  problematic.  At a given well,
organic or inorganic contaminants may exhibit high or low variability,  depending on local hydrogeologic
conditions, the pattern of release from the source, the degree of solid phase absorption, degradability of a
given constituent, and the variation in groundwater flow direction and depths. Non-steady state releases
may result in a historical, occasionally non-linear pattern of trend increases or decreases. Such data
might be fit by an apparent lognormal distribution, although removal of the trend may lead to normally-
distributed residuals.

     Sample size is also a consideration. With fewer than 8 samples in a data set, formal goodness-of-fit
tests are often of limited value. Where larger sample sizes are available, goodness-of-fit tests should be
conducted. The Shapiro-Wilk multiple group well test (Section 10.7) — even with small sample sizes —
can sometimes be used to identify individual anomalous  wells which might otherwise be presumed to
meet the criterion of normality. Under compliance/assessment or corrective action monitoring, one might
anticipate only four samples per well in the first year after  instituting such monitoring. Under these
conditions, a default assumption of normality for  testing of the mean against a fixed standard is probably
necessary. Aggregation of multi-year data when conducting compliance tests (see Chapter 7) may allow
large enough sample sizes to warrant formal goodness-of-fit testing. With 8 (or more) samples,  it may be
possible to determine that a lognormal distribution is an appropriate fit for the  data. Even in this latter
approach, caution may be needed in applying Land's confidence interval  for a lognormal mean (Chapter
21) if the sample  variability is  large and especially if the upper  confidence  limit is used in the
comparison (i.e., in corrective action monitoring).

     The normal distribution may also serve as a reasonable default when it is not critical to ensure that
sample data closely follow a specific distribution. For example, statistical tests on the mean are  generally
considered more robust with respect to departures from normality than procedures which involve upper
or lower limits of an assumed distribution. Even  if the data are not quite normal, tests on the mean such

as a Student's t-test will often still provide a valid result. However, one might need to consider
transformations of the  data for other reasons. Analysis of variance [ANOVA]  can be run with small
individual well samples (e.g., n = 4),  and as a comparison of means, it is fairly robust to departures from
normality. A logarithmic or other transformation may be needed to stabilize or equalize the well-to-well
variability (i.e., achieve homoscedasticity), a separate and more critical assumption of the test.

     Given their importance in statistical testing and the risks that sometimes occur in trying to interpret
tests on other data transformation  possibilities, it is  useful to briefly  consider the logarithmic
transformation in more detail. As noted in Section 10.1, groundwater data can frequently be normalized
using a logarithmic distribution model.  Despite this, objections  are  sometimes  raised that  the  log
transformation is merely used to "make large numbers look smaller."

     To better understand the log transformation, it should be recognized that logarithms are, in fact,
exponents to some unit base. Given a concentration-scale variable x, re-expressed as x = 10^y or x = e^y,
the logarithm y is the exponent of that base (10 or the natural base e). It is the behavior of the resultant y
values that is assessed when data are log-transformed. When data relationships are multiplicative in the
original arithmetic domain (x1 × x2), the relationships between exponents (i.e., logarithms) are additive
(y1 + y2). Since the logarithmic distribution by mathematical definition is normal in a log-transformed
domain, working with the logarithms instead of the original concentration measurements may offer a
sample distribution much closer to normal.

     Similar to  a  unit scale transformation (ppm to ppb or  Fahrenheit to Centigrade), the  relative
ordering of log-transformed measurements does not change.  When  non-parametric tests based on ranks
(e.g., the Wilcoxon rank-sum test)  are applied to  data transformed  either to a different unit scale or by
logarithms, the outcomes are identical. However, other relationships among the log-transformed data do
change, so that the log-scale numerical 'spacing' between lower values is more similar to the log-scale
spacing between higher values. While parametric tests like prediction limits, t-tests, etc., are not affected
by unit scale  transformations,  these tests may have different outcomes depending on whether raw
concentrations or log-transformed measurements are used. The justification for utilizing log-transformed
data is that the transformation helps to normalize the data so that these tests can be properly applied.

     There is  also a plausible physical explanation as to why pollutant concentrations  often follow a
logarithmic pattern  (Ott, 1990).  In Ott's  model, pollutant sources  are randomly  dispersed through the
subsurface or  atmosphere in a  multiplicative fashion through repeated  dilutions when mixing with
volumes of (uncontaminated) water  or air,  depending on  the  medium.  Such random and repeated
dilutions can mathematically lead to a lognormal distribution. In particular, if a final concentration (c0)
is the product of several random dilutions (c1, c2, ..., ck), as suggested by the following equation:

                                c0 = c1 × c2 × ... × ck                                   [10.3]

the logarithm of this concentration is equivalent to the sum of the logarithms of the individual dilutions:

                        log(c0) = log(c1) + log(c2) + ... + log(ck)                       [10.4]

     The Central Limit Theorem (Chapter 3) can be applied to conclude that the logged concentration
in equation [10.4] should be approximately normal, implying that the original concentration (c0) should
be approximately lognormal in distribution. Contaminant fate-and-transport models more or less follow
this  same approach, using  successive multiplicative  dilutions (while accounting for absorption  and
degradation effects) across grids in time and space.

     Despite the mathematical elegance of the Ott model, experience with groundwater monitoring data
has shown that the lognormal model alone is not adequate to account for observed distribution patterns.
While contaminant modeling might predict a lognormal contaminant distribution in space (and often in
time at a fixed point during transient phases), individual well location points fixed in space  and at rough
contaminant equilibrium are more likely to be  subject to a variety of local hydrologic and other factors,
and the observed distributions can be almost limitless in form. Since most of the tests within the Unified
Guidance presume a stationary population over time at a given well location (subject to identification
and removal of trends),  the resultant distributions may  be other than lognormal in character. Individual
constituents may also exhibit varying aquifer-related distributional characteristics.

     A practical issue  in  selecting a default transformation is ease of  use. Distributions like the
lognormal usually entail more  complicated  statistical adjustments or calculations than the normal
distribution.  A confidence interval  around the  arithmetic  mean of a lognormal distribution utilizes
Land's H-factor, which is a function of both log sample data variability and sample size, and is only
readily available for specific confidence levels. By contrast, a normal confidence interval around the
sample mean based on the t-statistic can easily be defined for virtually any confidence level. As noted
earlier, correct use  of these confidence intervals depends on  selecting the appropriate  parameter and
statistical measure (arithmetic mean versus the geometric mean).

     While a transformation does not always necessitate using a different statistical  formula to ensure
unbiased results, use of a transformation does assume that the  underlying population is  non-normal.
Since the true population will almost never be  known with certainty, it may not be advantageous to
simply default to a lognormal assumption for a variety of reasons. Under detection monitoring, the
presumption is made that a statistically significant increase above background concentrations will trigger
a monitoring  exceedance.  But the larger the prediction limit computed  from background, the  less
statistical power the test will have for detecting true increases.  An important question to answer is what
the consequences are  when incorrectly applying statistical techniques based  on one distributional
assumption  (normal or lognormal), when  the underlying distribution is in fact the  other.  More
specifically,  what is the impact  on statistical  power and accuracy  of assuming  the wrong underlying
distribution? The general effects of violating underlying test assumptions can be measured in terms of
false positive and negative  error rates (and therefore power). These questions are particularly pertinent
for prediction limit and control  chart tests in  detection monitoring. Similar questions could be  raised
regarding the application of confidence  interval tests on  the  mean when compared against fixed
standards.

     To  answer these questions, a series of Monte Carlo  simulations was generated for the Unified
Guidance to evaluate the impacts on prediction limit false positive error rates and statistical power of
using normal and lognormal   distributions   (correctly and  incorrectly  applied to the underlying
distributions). Detailed results of this study are provided in Appendix C, Section C.I.
     The conclusions of the Monte Carlo study are summarized as follows:

     •  If an underlying population is truly normal, treating the sample data as lognormal in
       constructing a  prediction limit  can have  significant  consequences. With  no retesting, the
       lognormal prediction limits were in every case considerably larger and thus less powerful than
       the normal prediction limits.  Further, the lognormal limits consistently exhibited less than the
       expected (nominal) false positive rate, while the normal prediction limits tended to have slightly
       higher than nominal error rates.
     •  When retesting was added to the procedure, both types of prediction limits improved. While
       power uniformly improved compared to no retest, the normal limits were still on average about
       13% shorter than the lognormal limits, leading again to  a measurable loss of statistical power in
       the lognormal case.
     •  On balance, misapplication of logarithmic prediction limits to normally-distributed data
       consistently resulted in (often  considerably) lower power and false positive rates that were lower
       than expected. The results argue against presuming the underlying data to be lognormal without
       specific goodness-of-fit testing.
     •  The highest penalties from misapplying lognormal prediction limits occurred for smaller
       background sizes. Since goodness-of-fit tests  are least able to  distinguish between normal and
       lognormal data with small samples, small background  samples should not be  presumed to be
       lognormal  as a default  unless other evidence from  the  site suggests  otherwise. For  larger
       samples, goodness-of-fit  tests  have  much  better  discriminatory power,  enabling a  better
       indication of which model to use.
     •  If the underlying population is truly lognormal but the sample data are treated as normal, the
       penalty in overall statistical performance is substantial only if no retesting is conducted. With no
       retesting, the false positive rates of normal-based limits were often substantially higher than the
       expected rate.   Under conditions  of no  retesting, misapplying normal prediction  limits to
       lognormal data would result in an excessive site-wide false positive rate (SWFPR).
     •  If at least one retest was added, the achieved false positive rates for the misapplied normal limits
       tended to be less than the expected rates, especially for  moderate to larger sample sizes. Except
       for highly skewed lognormal  distributions, the power of the normal limits was comparable or
       greater than the power of the lognormal limits.
     Overall, the Monte Carlo study  indicated that adding a retest to the  testing procedure significantly
minimized the penalty of misapplying normal prediction limits to lognormal data, as long as the sample
size was at least 8 and the distribution was not too skewed. Consequently, there is less penalty associated
with making a  default assumption of normality than in making a default assumption  of lognormality
under most situations. With highly skewed  data, goodness-of-fit tests tend to better discriminate between
the normal and lognormal models. The Unified Guidance therefore recommends that  such diagnostic
testing be done explicitly rather than simply assuming the data to be normal or lognormal.

     The most problematic cases in the study occurred for very small background sample sizes, where a
misapplication of prediction limits in either direction often resulted in poorer statistical performance,
even with retesting. In  some situations, compliance testing may need to be conducted on an interim
basis until enough  data has been collected to accurately identify  a distributional model.  The Unified
Guidance does not recommend an automatic default assumption of lognormality.
     In summary, during detection and compliance/assessment monitoring, data sets should be treated
initially as normal in distribution unless a better model can be pinpointed through specific testing. The
normal distribution is a fairly  safe assumption for background distributions,  particularly for naturally
occurring, measurable constituents and when sample sizes are small. Goodness-of-fit tests provided in
this chapter can be used to more closely identify the appropriate distributions for larger sample sizes. If
the initial assumption of normality is not rejected, further statistical analyses should be performed on the
raw observations. If the normal distribution is rejected by a goodness-of-fit test, one should generally test
the normality of the logged data,  in order to check for lognormality of the  original observations. If this
test also fails, one can  either  look for an alternate transformation  to achieve approximate normality
(Section 10.2) or use a non-parametric technique.

     Since tests of normality have low power for rejecting the null hypothesis when the data are really
lognormal but the sample size and degree of skewness are small, it is reassuring that a "wrong" default
assumption of normality will infrequently lead to an incorrect statistical conclusion. In fact, the statistical
power for detecting real concentration increases will generally be better than if the data were assumed to
be lognormal. If the data are truly lognormal, there is a risk of greater-than-expected  site-wide false
positive error rates.

     When the population is more skewed,  normality tests in the Unified Guidance have much greater
power for correctly rejecting the normal model in favor of the lognormal distribution. Consequently, an
initial assumption of normality will not,  in most cases, lead to an incorrect final conclusion, since the
presumed normal model will tend to be rejected before further testing is conducted.

     These recommendations do not apply to corrective action monitoring or other programs where it is
either known or reasonable to presume that groundwater is already impacted or has a non-normal
distribution. In such settings,  a  default  presumption of lognormality could  be  made,  or a series of
normalizing transformations could be attempted until a suitable fit is determined. Furthermore, even in
detection monitoring,  there  are situations that often require the use of alternate transformations, for
instance when pooling intrawell  background across  several wells to increase the degrees  of freedom
available for intrawell testing (Chapter 13).

     Whatever the  circumstance, the Unified  Guidance recommends whenever possible that site-
specific data be used to test the distributional  presumption. If no data are initially available to do this,
"referencing"  may be  employed to justify the use of, say, a  normal or lognormal  assumption in
developing statistical tests at a particular site. Referencing involves the use of historical data  or data
from sites in similar hydrologic  settings to  justify the assumptions  applied to the proposed statistical
regimen. These initial  assumptions should be checked when  data from the  site become available, using
the procedures described in the Unified Guidance. Subsequent changes to the initial assumptions should
be made if goodness-of-fit testing contradicts the initial hypothesis.

10.4 COEFFICIENT OF VARIATION AND COEFFICIENT  OF SKEWNESS

       PURPOSE AND BACKGROUND

     Because the normal  distribution has  a  symmetric 'bell-shape,'  the normal mean and median
coincide and random observations drawn from a normal population are just as likely to occur below the
mean as  above it. More generally, in any symmetric distribution the distributional pattern below the


mean is a mirror-image of the pattern above the mean. By definition, such distributions have no degree
of skewness or asymmetry.

      Since the normal distribution has zero skewness, one way to look for non-normality is to estimate
the  degree  of skewness. Non-zero values of this measure imply that the population is asymmetric and
therefore something different from normal.  Two exploratory screening tools useful for this task are the
coefficient of variation and the coefficient of skewness.

      The coefficient of variation [CV] is  extremely easy to compute, but only indirectly  offers an
estimate of skewness and hence normality/non-normality. A more direct estimate can be determined via
the  coefficient of skewness. Furthermore, better, formal tests can be used instead of either coefficient to
directly assess normality. Nevertheless, the CV provides a  measure of intrinsic variability in positive-
valued data sets. Although approximate, CVs can  indicate the relative  variability of  certain  data,
especially with small sample sizes and in the absence of other formal tests (e.g., see Chapter 22, when
comparing  confidence limits on the mean to  a fixed standard in compliance monitoring).

     The CV is also a valid measure of the multiplicative relationship between the population mean and
the standard deviation for positively-valued random variables. Using sample statistics for the mean (\bar{x})
and standard deviation (s), the true CV for non-negative normal populations can be reasonably estimated
as:

                                        CV = s / \bar{x}                                       [10.5]

     In lognormal populations, the CV is also used in evaluations of statistical power. In this latter
case, the population CV works out to be:

                                        CV = \sqrt{\exp(\sigma_y^2) - 1}                       [10.6]

where \sigma_y is the population log-standard deviation. Instead of a ratio between the original-scale standard
deviation and the mean, the lognormal CV is estimated with the equation:

                                        CV = \sqrt{\exp(s_y^2) - 1}                            [10.7]

where s_y is the sample log-standard deviation. The estimate in equation [10.7] is usually more accurate
than the simple CV ratio of the arithmetic standard deviation-to-mean, especially when the underlying
population coefficient of variation is high. Like the normal CV, the lognormal CV estimator in equation
[10.7] has little relevance as a formal test of lognormality, and using it for that purpose is not
recommended in the Unified Guidance. But it can provide a sense of how variable a data set is and
whether a lognormal assumption might need to be tested.

     While others have reported a ratio CV on logged measurements as CV = s_y / \bar{y} for the
transformation y = \log x, the result is essentially meaningless. The actual logarithmic CV in equations
[10.6] and [10.7] is solely determined by the logarithmic variability \sigma_y (or its estimate s_y). Negative
logarithmic mean values are always possible, and the log ratio statistic is not invariant under a unit scale
transformation (e.g., ppb to ppm or ppt). Similar problems in interpretation occur when CV estimators
are applied to any variable which can be negatively valued, such as following a z-transformation to a
standard normal distribution.  This log ratio  statistic is not recommended for any  application in the
guidance.

     The coefficient of skewness (γ1) directly indicates to what degree a dataset is skewed or
asymmetric with respect to the mean.  Sample data from a normal distribution will have a  skewness
coefficient  near zero, while data from an  asymmetric distribution  will have a positive or negative
skewness depending on whether the right- or left-hand tail of the distribution is longer and  skinnier than
the opposite tail.

     Since groundwater monitoring concentrations are inherently non-negative, such data  often exhibit
skewness.  A small degree  of skewness is not  likely to affect the results of statistical tests that  assume
normality. However, if the  skewness coefficient is larger than 1 (in absolute value) and the sample size is
small (e.g., n  < 25), past  research has shown that standard normal theory-based tests are much less
powerful than when the absolute skewness is less than 1 (Gayen,  1949).

     Calculating the skewness coefficient is useful  and only slightly more difficult than computing the
CV. It provides a quick indication of whether the skewness is minimal enough to assume that the data
are roughly symmetric and hopefully normal in distribution. If the original data exhibit a high skewness
coefficient, the normal distribution will provide a poor approximation to the dataset. In that case, and
unlike the CV, γ1 can be computed on the log-transformed data to test for symmetry of the logged
measurements, or similarly for other transformations.

       PROCEDURE

     The CV is calculated simply by taking the ratio of the sample standard deviation to the sample
mean, CV = s / \bar{x}, or its corresponding logarithmic version, CV = \sqrt{\exp(s_y^2) - 1}.

     The skewness coefficient may be computed using the following equation:

               \gamma_1 = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left(\frac{n-1}{n}\right)^{3/2} s^3}                    [10.8]

where the numerator represents the average cubed residual after subtracting the sample mean.
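     These statistics are simple enough to script. The following Python sketch is illustrative only (the
function names are not from any standard package); it computes the CV of equation [10.5], the
logarithmic CV of equation [10.7], and the skewness coefficient of equation [10.8] using numpy:

    import numpy as np

    def sample_cv(x):
        """Arithmetic coefficient of variation, CV = s / x-bar (equation [10.5])."""
        x = np.asarray(x, dtype=float)
        return x.std(ddof=1) / x.mean()

    def lognormal_cv(x):
        """Logarithmic CV estimate, sqrt(exp(s_y^2) - 1) (equation [10.7])."""
        s_y = np.log(np.asarray(x, dtype=float)).std(ddof=1)
        return np.sqrt(np.exp(s_y ** 2) - 1.0)

    def skewness_coeff(x):
        """Coefficient of skewness, gamma_1 (equation [10.8])."""
        x = np.asarray(x, dtype=float)
        n = len(x)
        s = x.std(ddof=1)
        return np.sqrt(n) * np.sum((x - x.mean()) ** 3) / ((n - 1) ** 1.5 * s ** 3)

    # Nickel data of Example 10-1 (below); expected results are roughly
    # CV = 1.53 and gamma_1 = 1.84 on the raw data, and a logarithmic CV of
    # about 4.97 with gamma_1 of about -0.245 on the logged data.
    nickel = [58.8, 1.0, 262, 56, 8.7, 19, 81.5, 331, 14, 64.4,
              39, 151, 27, 21.4, 578, 3.1, 942, 85.6, 10, 637]
    print(sample_cv(nickel), skewness_coeff(nickel))
    print(lognormal_cv(nickel), skewness_coeff(np.log(nickel)))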

       ^EXAMPLE 10-1

     Using the following data, compute the CVs and the coefficient of skewness to test for approximate
symmetry.
                               Nickel Concentration (ppb)
        Month        Year 1        Year 2        Year 3        Year 4
        Jan            58.8          19            39             3.1
        Mar             1.0          81.5         151           942
        Jun           262           331            27            85.6
        Aug            56            14            21.4          10
        Oct             8.7          64.4         578           637
       SOLUTION
Step 1.   Compute the mean, standard deviation (s), and sum of the cubed residuals for the nickel
          concentrations:

               \bar{x} = \frac{1}{20}(58.8 + 1.0 + \ldots + 637) = 169.52 ppb

               s = \sqrt{\frac{1}{19}\left[(58.8-169.52)^2 + (1.0-169.52)^2 + \ldots + (637-169.52)^2\right]} = 259.7175 ppb

               \sum_{i=1}^{n}(x_i - \bar{x})^3 = (58.8-169.52)^3 + \ldots + (637-169.52)^3 = 5.97845791 \times 10^8 ppb^3

Step 2.   Compute the arithmetic normal coefficient of variation following equation [10.5]:
          CV = 259.7175/169.52 = 1.53

Step 3.   Calculate the coefficient of skewness using equation [10.8]:

               \gamma_1 = \frac{\sqrt{20}\,(5.97845791 \times 10^8)}{(19)^{3/2}(259.7175)^3} = 1.84

         Both the CV and the coefficient of skewness are much larger than 1, so the data appear to be
         significantly positively skewed. Do not assume that the underlying population is normal.

Step 4.   Since the  original  data evidence a high degree of skewness, one can instead compute the
         skewness coefficient and corresponding  sample CV with equation [10.7] on the logged nickel
         concentrations.   The  logarithmic CV  equals 4.97,  a  much more variable  data  set than
          suggested by the arithmetic CV. The skewness coefficient works out to be |γ1| = 0.24 < 1,
         indicating that the logged data values are slightly skewed but not enough to clearly reject an
         assumption of normality in the logged data. In other words, the original nickel values may be
         lognormally distributed. -^

10.5 SHAPIRO-WILK AND SHAPIRO-FRANCIA  NORMALITY TESTS

       10.5.1  SHAPIRO-WILK TEST (N <  50)

       PURPOSE AND BACKGROUND

     The Shapiro-Wilk test is based on the premise that if a data set is normally distributed, the ordered
values should be highly correlated with corresponding quantiles (z-scores) taken from  a normal
distribution (Shapiro  and Wilk, 1965).  In particular, the  Shapiro-Wilk test gives substantial weight to
evidence of non-normality in the tails of a distribution, where the robustness of statistical tests based on
the normality assumption is most severely affected.  A variant of  this test, the  Shapiro-Francia test, is
useful for sample sizes greater than 50 (see Section 10.5.2).

     The Shapiro-Wilk test statistic (SW) will tend to be large  when a probability plot of the data
indicates a nearly straight line. Only when the plotted data show significant bends or curves will the test
statistic be  small. The  Shapiro-Wilk test is considered  one  of the best tests of normality available
(Miller, 1986; Madansky, 1988).

       PROCEDURE

Step 1.   Order and rank the dataset from least to greatest, labeling the observations x_i for ranks i =
          1...n. Using the notation x_{(i)}, let the ith rank statistic of the data set represent the ith smallest
          value.

Step 2.   Compute the differences x_{(n-i+1)} - x_{(i)} for each i = 1...n. Then determine k as the greatest
          integer less than or equal to (n/2).

Step 3.   Use Table 10-2 in Appendix D to determine the Shapiro-Wilk coefficients, a_{n-i+1}, for i =
          1...k. Note that while these coefficients depend only on the sample size (n), the order of the
          coefficients must be preserved when used in Step 4. The coefficients can be determined for
          any sample size from n = 3 up to n = 50.

Step 4.   Compute the quantity b given by the following equation:

               b = \sum_{i=1}^{k} b_i = \sum_{i=1}^{k} a_{n-i+1}\left(x_{(n-i+1)} - x_{(i)}\right)                    [10.9]

          Note that the values b_i are simply intermediate quantities represented by the terms in the sum
          on the right-hand side of equation [10.9].

Step 5.   Calculate the standard deviation (s) of the dataset. Then compute the Shapiro-Wilk test
          statistic using the equation:

               SW = \left[\frac{b}{s\sqrt{n-1}}\right]^2                                                             [10.10]
Step 6.   Given the significance level (α) of the test, determine the critical point of the Shapiro-Wilk
          test with n observations using Table 10-3 in Appendix D. To maximize the utility and power
          of the test, choose α = .10 for very small data sets (n < 10), α = .05 for moderately sized data
          sets (10 < n < 20), and α = .01 for larger sized data sets (n > 20). Compare SW against the
          critical point (swc). If the test statistic exceeds the critical point, accept normality as a
          reasonable model for the underlying population. However, if SW < swc, reject the null
          hypothesis of normality at the α-level and decide that another distributional model might
          provide a better fit.
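     For reference, the Shapiro-Wilk statistic is also available in common statistical software. The
Python sketch below uses scipy.stats.shapiro, which returns the SW statistic along with an approximate
p-value; it is a cross-check only and does not replace the tabled coefficients and critical points of
Appendix D (the computed statistic may differ slightly from the hand calculation because the software
approximates the coefficients):

    from scipy import stats

    # Nickel data of Example 10-1 (ppb)
    nickel = [58.8, 1.0, 262, 56, 8.7, 19, 81.5, 331, 14, 64.4,
              39, 151, 27, 21.4, 578, 3.1, 942, 85.6, 10, 637]

    sw_stat, p_value = stats.shapiro(nickel)   # SW should be close to the hand-computed 0.679
    alpha = 0.01                               # per Step 6 for a data set of n = 20
    if p_value < alpha:
        print(f"SW = {sw_stat:.3f}: reject normality at the {alpha} level")
    else:
        print(f"SW = {sw_stat:.3f}: no significant evidence of non-normality")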

       ^EXAMPLE 10-2

     Use the nickel data of Example 10-1 to compute the Shapiro-Wilk test of normality.

     SOLUTION
Step 1.   Order the data from smallest to largest, rank in ascending order and list, as shown in columns
         1 and 2  of the table below. Next list the data in reverse order in a third column.
     i      x(i)     x(n-i+1)   x(n-i+1) - x(i)   a(n-i+1)       b_i
     1       1.0       942.0          941.0         .4734      445.47
     2       3.1       637.0          633.9         .3211      203.55
     3       8.7       578.0          569.3         .2565      146.03
     4      10.0       331.0          321.0         .2085       66.93
     5      14.0       262.0          248.0         .1686       41.81
     6      19.0       151.0          132.0         .1334       17.61
     7      21.4        85.6           64.2         .1013        6.50
     8      27.0        81.5           54.5         .0711        3.87
     9      39.0        64.4           25.4         .0422        1.07
    10      56.0        58.8            2.8         .0140        0.04
    11      58.8        56.0           -2.8
    12      64.4        39.0          -25.4
    13      81.5        27.0          -54.5
    14      85.6        21.4          -64.2
    15     151.0        19.0         -132.0
    16     262.0        14.0         -248.0
    17     331.0        10.0         -321.0
    18     578.0         8.7         -569.3
    19     637.0         3.1         -633.9
    20     942.0         1.0         -941.0
                                                           b = 932.88

Step 2.   Compute the differences x_{(n-i+1)} - x_{(i)} in column 4 of the table by subtracting column 2 from
          column 3. Since the total sample size is n = 20, the largest integer less than or equal to (n/2) is
          k = 10.

Step 3.   Look up the coefficients a_{n-i+1} from Table 10-2 in Appendix D and list in column 5.
Step 4.   Multiply the differences in column 4 by the coefficients in column 5 and add the first k
          products (b_i) to get the quantity b, using equation [10.9]:

                      b = [.4734(941.0) + .3211(633.9) + ... + .0140(2.8)] = 932.88

Step 5.   Compute the standard deviation of the sample, s = 259.72. Then use equation [10.10] to
          calculate the SW test statistic:

               SW = \left[\frac{932.88}{259.72\sqrt{19}}\right]^2 = 0.679
Step 6.   Use Table 10-3 in Appendix D to determine the 0.01-level critical point for the Shapiro-Wilk
         test when n = 20. This gives swc = 0.868. Then compare the observed value of SW= 0.679 to
         the 1% critical point.  Since SW < 0.868,  the sample  shows significant evidence  of non-
         normality by the Shapiro-Wilk test. The data should be transformed using logarithms or
         another transformation  on the ladder of powers and re-checked using the Shapiro-Wilk test
         before proceeding with further statistical analysis. -^

       10.5.2   SHAPIRO-FRANCIA TEST (N > 50)

     The Shapiro-Wilk test of normality can be used  for sample sizes up to 50. When n is larger than
50, a slight modification of the procedure called the Shapiro-Francia test (Shapiro  and Francia,  1972)
can be used instead. Like the Shapiro-Wilk test, the Shapiro-Francia test statistic (SF) will tend to be
large when a probability plot of the data indicates a nearly straight line. Only when the plotted data show
significant bends or curves will the test statistic be small.

     To calculate the test statistic SF, one can use the following equation:

               SF = \frac{\left[\sum_{i=1}^{n} m_i x_{(i)}\right]^2}{(n-1)\, s^2 \sum_{i=1}^{n} m_i^2}               [10.11]

where x_{(i)} represents the ith ranked value of the sample and where m_i denotes the approximate expected
value of the ith rank normal quantile (or z-score). The values for m_i are approximately equal to:

               m_i = \Phi^{-1}\!\left(\frac{i}{n+1}\right)                                                           [10.12]

where \Phi^{-1} denotes the inverse of the standard normal distribution with zero mean and unit variance.
These values can be computed by hand using the normal distribution in Table 10-1 of Appendix D or
via simple commands found in many statistical computer packages.

     Normality of the data should be rejected if the Shapiro-Francia statistic is too low when compared
to the critical points provided in Table 10-4 of Appendix D. Otherwise one can assume the data are
approximately normal for purposes of further statistical analysis.
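     Because the Shapiro-Francia statistic is not always reported directly by statistical packages,
equations [10.11] and [10.12] can be coded in a few lines. The Python sketch below is illustrative only
(the function name is not from any standard library):

    import numpy as np
    from scipy import stats

    def shapiro_francia(x):
        """Shapiro-Francia statistic per equations [10.11] and [10.12]."""
        x = np.sort(np.asarray(x, dtype=float))
        n = len(x)
        m = stats.norm.ppf(np.arange(1, n + 1) / (n + 1.0))   # approximate expected z-scores
        return np.dot(m, x) ** 2 / ((n - 1) * x.var(ddof=1) * np.sum(m ** 2))

    # The resulting SF value is then compared against the critical points
    # in Table 10-4 of Appendix D.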

10.6 PROBABILITY PLOT CORRELATION  COEFFICIENT

       BACKGROUND AND PURPOSE

     Another test for normality that is essentially equivalent to the Shapiro-Wilk and Shapiro-Francia
tests is the probability plot correlation coefficient test described by Filliben (1975). This test meshes
perfectly with the use of probability plots, because the  essence of the test is to compute the usual
correlation coefficient for points on a probability plot. Since the correlation coefficient is a measure of
the linearity of the points on a scatterplot,  the probability plot correlation coefficient, like the SW test
statistic, will be high when the plotted points fall along a straight line and low when there are significant
bends  and curves in the probability plot.  Comparison of  the  Shapiro-Wilk and  probability  plot
correlation coefficient tests has indicated  very similar statistical power for detecting  non-normality
(Ryan and Joiner, 1990).

     It should be noted that although some statistical software may not compute Filliben's test directly,
the usual Pearson's correlation coefficient computed on the data pairs used to construct a probability plot
will  provide  a  very close approximation  to the Filliben  statistic.  Some users  may find this latter
correlation easier to compute or more accessible in their software.

       PROCEDURE

Step 1.   List the observations in order from smallest to largest, denoting x_{(i)} as the ith smallest rank
          statistic in the data set. Then let n = sample size and compute the sample mean (\bar{x}) and the
          standard deviation (s).

Step 2.   Consider a random sample drawn from a standard normal distribution. The ith rank statistic of
          this sample is fixed once the sample is drawn, but beforehand it can be considered a random
          variable, denoted X_{(i)}. Likewise, by considering all possible datasets of size n that might be
          drawn from the normal distribution, one can think of the sampling distribution of the statistic
          X_{(i)}. This sampling distribution has its own mean and variance, and, of importance to the
          probability plot correlation coefficient, its own median, which can be denoted M_i.

          To compute the median of the ith rank statistic, first compute intermediate probabilities m_i for
          i = 1...n using the equation:

               m_i = 1 - (0.5)^{1/n}              for i = 1
               m_i = (i - 0.3175)/(n + 0.365)     for 1 < i < n                                [10.13]
               m_i = (0.5)^{1/n}                  for i = n

          Then compute the medians M_i as the standard normal quantiles or z-scores associated with the
          intermediate probabilities m_i. These can be determined from Table 10-1 in Appendix D or
          computed according to the following equation, where \Phi^{-1} represents the inverse of the
          standard normal distribution:

               M_i = \Phi^{-1}(m_i)                                                            [10.14]


Step 3.   With the rank statistic medians in hand, calculate the arithmetic mean of the M_i's, denoted \bar{M},
          and the intermediate quantity C_n, given by the equation:

               C_n = \sqrt{\sum_{i=1}^{n} M_i^2 - n\bar{M}^2}                                  [10.15]

          Note that when the dataset is "complete" (meaning it contains no non-detects, ties, or censored
          values), the mean of the order statistic medians reduces to \bar{M} = 0. This in turn reduces the
          calculation of C_n to:

               C_n = \sqrt{\sum_{i=1}^{n} M_i^2}                                               [10.16]

Step 4.   Finally compute Filliben's probability plot correlation coefficient:

               r = \frac{\sum_{i=1}^{n} x_{(i)} M_i - n\,\bar{x}\,\bar{M}}{C_n\, s\sqrt{n-1}}   [10.17]

          When the dataset is complete, the equation for the probability plot correlation coefficient also
          has a simplified form:

               r = \frac{\sum_{i=1}^{n} x_{(i)} M_i}{C_n\, s\sqrt{n-1}}                         [10.18]

Step 5.   Given the level of significance (α), determine the critical point (r_cp) for Filliben's test with
          sample size n from Table 10-5 in Appendix D. Compare the probability plot correlation
          coefficient (r) against the critical point (r_cp). If r > r_cp, conclude that normality is a reasonable
          model for the underlying population at the α-level of significance. If, however, r < r_cp, reject
          the null hypothesis and conclude that another distributional model would provide a better fit.
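     The probability plot correlation coefficient can also be computed directly from equations [10.13]
through [10.18]; equivalently, it is the Pearson correlation between the ordered data and the rank statistic
medians. The short Python sketch below is illustrative only (scipy.stats.probplot reports an essentially
equivalent correlation as part of its least-squares fit output):

    import numpy as np
    from scipy import stats

    def filliben_r(x):
        """Filliben probability plot correlation coefficient (equations [10.13]-[10.17])."""
        x = np.sort(np.asarray(x, dtype=float))
        n = len(x)
        m = (np.arange(1, n + 1) - 0.3175) / (n + 0.365)   # intermediate probabilities [10.13]
        m[0] = 1.0 - 0.5 ** (1.0 / n)
        m[-1] = 0.5 ** (1.0 / n)
        M = stats.norm.ppf(m)                              # rank statistic medians [10.14]
        return np.corrcoef(x, M)[0, 1]                     # Pearson correlation = equation [10.17]

    # For the nickel data of Example 10-1 this returns roughly r = 0.82, to be compared
    # against the critical points in Table 10-5 of Appendix D.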

       ^EXAMPLE 10-3

     Use the data of Example 10-1 to compute Filliben's probability plot correlation coefficient test at
the α = .01 level of significance.

       SOLUTION
Step 1.   Order and rank the nickel data from smallest to largest and list, as in the table below. The
          sample size is n = 20, with sample mean \bar{x} = 169.52 and standard deviation s = 259.72.

Step 2.   Compute the intermediate probabilities m_i from equation [10.13] for each i in column 3 and
          the rank statistic medians, M_i, in column 4 by applying the inverse normal transformation to
          column 3 using equation [10.14] and Table 10-1 of Appendix D.
Step 3.   Since this sample contains no non-detects or ties, the simplified equations for C_n in equation
          [10.16] and for r in equation [10.18] may be used. First compute C_n using the squared order
          statistic medians in column 5:

               C_n = \sqrt{3.328 + 1.926 + \ldots + 3.328} = 4.138

Step 4.   Next compute the products x_{(i)} \times M_i in column 6 and sum to get the numerator of the
          correlation coefficient (equal to 3,836.81 in this case). Then compute the final correlation
          coefficient:

               r = 3,836.81 / \left[4.138 \times 259.72\sqrt{19}\right] = 0.819
     i      x(i)       m_i        M_i       (M_i)^2    x(i) x M_i
     1       1.0     .03406    -1.8242      3.328        -1.824
     2       3.1     .08262    -1.3877      1.926        -4.302
     3       8.7     .13172    -1.1183      1.251        -9.729
     4      10.0     .18082    -0.9122      0.832        -9.122
     5      14.0     .22993    -0.7391      0.546       -10.347
     6      19.0     .27903    -0.5857      0.343       -11.129
     7      21.4     .32814    -0.4451      0.198        -9.524
     8      27.0     .37724    -0.3127      0.098        -8.444
     9      39.0     .42634    -0.1857      0.034        -7.242
    10      56.0     .47545    -0.0616      0.004        -3.448
    11      58.8     .52455     0.0616      0.004         3.621
    12      64.4     .57366     0.1857      0.034        11.959
    13      81.5     .62276     0.3127      0.098        25.488
    14      85.6     .67186     0.4451      0.198        38.097
    15     151.0     .72097     0.5857      0.343        88.445
    16     262.0     .77007     0.7391      0.546       193.638
    17     331.0     .81918     0.9122      0.832       301.953
    18     578.0     .86828     1.1183      1.251       646.376
    19     637.0     .91738     1.3877      1.926       883.941
    20     942.0     .96594     1.8242      3.328      1718.408
Step 5.   Compare Filliben's test statistic of r = 0.819 to the 1% critical point for a sample of size 20 in
          Table 10-5 of Appendix D, namely r_cp = 0.925. Since r < 0.925, the sample shows significant
          evidence of non-normality by the probability plot correlation coefficient. The data should be
          transformed and the correlation coefficient re-calculated before proceeding with further
          statistical analysis. ◄
10.7  SHAPIRO-WILK MULTIPLE GROUP TEST OF NORMALITY

       BACKGROUND AND PURPOSE

     The main purpose for including the multiple group test of normality (Wilk and Shapiro, 1968) in the
Unified Guidance is to serve as a check of normality when using a Student's t-test (Chapter 16) or

when assessing the joint normality of multiple intrawell data sets. The multiple group test is an extension
of the Shapiro-Wilk procedure for assessing the joint normality of several independent samples. Each
sample may have a different mean and/or variance, but as long as the underlying distribution of each
group is normal, the multiple group test statistic will tend to be non-significant. Conversely, the multiple
group test is designed to identify when at least one of the groups being tested is definitely non-normal.

     This test extends the  Shapiro-Wilk procedure for a single sample, using individual SW test
statistics computed separately  for each  group  or sample.  Then  the  individual SW  statistics  are
transformed and combined into an overall or "omnibus" statistic (G). Like the single sample procedure
— where non-normality is indicated when the test statistic SW is too low — non-normality in one or
more groups is indicated when G is too low. However, instead of a special table of critical points, G is
constructed to follow a standard normal distribution under the null hypothesis of normality. The value of
G can simply be compared to an α-level z-score or normal quantile to decide whether the null or
alternative hypothesis is better supported.

     Since it may be unclear which one or more of the groups is actually non-normal when the G
statistic is significant, Wilk and Shapiro recommend that a probability plot (Chapter 9) be examined on
the intermediate quantities, G_i (at least for the case where several groups are being simultaneously
tested). One of these statistics is computed for each separate sample/group and is designed to follow a
standard normal distribution under the null hypothesis. Because of this, the G_i statistics for non-normal
groups will tend to look like outliers on a normal probability plot (see Chapter 12).

     The multiple group test can also be used to check normality when performing Welch's t-test, a
two-sample procedure in which the underlying data of both groups are assumed to be normal, but no
assumption is  made that the means or variances are the same. This is different from either the pooled
variance t-test or the one-way analysis of variance [ANOVA], both of which assume homoscedasticity
(i.e., equal variances across groups). If the  group variances can be shown to be equal, the  single sample
Shapiro-Wilk test can be run on the combined residuals, where  the residuals of each group are formed by
subtracting off the  group mean from each  of the individual measurements. However,  if the group
variances are possibly different, testing the  residuals as  a single group using the SW statistic may give an
inaccurate or misleading result. Consequently, since a test of homoscedasticity is not required for
Welch's t-test, it is suggested to first use the multiple group test to check normality.

     Although the Shapiro-Wilk multiple  group method is an attractive procedure for  accommodating
several groups of data at once, the user is cautioned against indiscriminate use.  While many of the
methods described in the Unified  Guidance  assume  underlying  normality,  they  also  assume
homoscedasticity. Other parametric multi-sample methods recommended for detection monitoring —
prediction limits in Chapter 18 and control charts in Chapter  20 — all assume  that each  group has the
same variance. Even if normality of the joint data can be demonstrated using the Shapiro-Wilk multiple
group test, it says nothing about whether the assumption of equal variances is also satisfied. Generally
speaking, except for Welch's t-test, a separate test of homoscedasticity may also be needed. Such tests
are described in Chapter  11.

       PROCEDURE

Step 1.   Assuming there are K groups to be tested, let the sample size of the ith group be denoted n_i.
          Then compute the SW_i test statistic for each of the K groups using equation [10.10].


Step 2.   Transform the SW_i statistics to the intermediate quantities (G_i). If the sample size (n_i) of the
          ith group is at least 7, compute G_i with the equation:

               G_i = \gamma + \delta \ln\!\left(\frac{SW_i - \varepsilon}{1 - SW_i}\right)                 [10.19]

          where the quantities \gamma, \delta, and \varepsilon can be found in Table 10-6 of Appendix D for 7 ≤ n_i ≤ 50. If
          the sample size (n_i) is less than 7, determine G_i directly from Table 10-7 in Appendix D by
          first computing the intermediate value

               u_i = \ln\!\left(\frac{SW_i - \varepsilon}{1 - SW_i}\right)                                 [10.20]

          (obtaining \varepsilon from the top of Table 10-7), and then using linear interpolation to find the closest
          value G_i associated with u_i.

Step 3.   Once the G_i statistics are derived, compute the Shapiro-Wilk multiple group statistic with the
          equation:

               G = \frac{1}{\sqrt{K}} \sum_{i=1}^{K} G_i                                                   [10.21]

Step 4.   Under the null hypothesis that all K groups are normally-distributed, G will follow a standard
          normal distribution. Given the significance level (α), determine an α-level critical point from
          Table 10-1 of Appendix D as the lower α × 100th normal quantile (z_α). Then compare G to
          z_α. If G < z_α, there is significant evidence of non-normality at the α level. Otherwise, the
          hypothesis of normality cannot be rejected.
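     When every group has at least seven observations, Steps 1 through 4 are easy to script once the γ,
δ, and ε constants of Table 10-6 in Appendix D are available. The Python sketch below is illustrative
only: the constants are supplied by the user (they are not built into any standard package), and the
per-group SW statistics are taken from scipy.stats.shapiro.

    import numpy as np
    from scipy import stats

    def multiple_group_G(groups, coeffs, alpha=0.05):
        """Shapiro-Wilk multiple group statistic (equations [10.19] and [10.21]), n_i >= 7.

        groups : list of 1-D samples
        coeffs : dict mapping sample size n_i to (gamma, delta, epsilon) from Table 10-6
        """
        G_i = []
        for x in groups:
            sw, _ = stats.shapiro(x)                      # individual SW_i statistic
            gamma, delta, eps = coeffs[len(x)]            # constants for this sample size
            G_i.append(gamma + delta * np.log((sw - eps) / (1.0 - sw)))   # equation [10.19]
        G = np.sum(G_i) / np.sqrt(len(G_i))               # equation [10.21]
        z_alpha = stats.norm.ppf(alpha)                   # lower alpha x 100th normal quantile
        return G, G < z_alpha                             # True indicates evidence of non-normality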

       ^EXAMPLE 10-4

     The previous examples in this chapter pooled the data of Example  10-1 into a single group before
testing for normality. This time, treat each well separately and compute the Shapiro-Wilk multiple group
test of normality at the α = .05 level.

       SOLUTION
Step 1.   The nickel data in Example 10-1 come from K = 4 wells with n_i = 5 observations per well.
          Using equation [10.10], the individual well SW_i test statistics are calculated as:

              Well 1:       SW_1 = 0.6062

              Well 2:       SW_2 = 0.5917

              Well 3:       SW_3 = 0.5652

              Well 4:       SW_4 = 0.6519
Step 2.   Since n_i = 5 for each well, use Table 10-7 of Appendix D to find \varepsilon = .5521. First calculate
          u_1 with equation [10.20]:

               u_1 = \ln\!\left(\frac{.6062 - .5521}{1 - .6062}\right) = -1.985

          Then, performing this step for each well group and using linear interpolation on u in Table 10-7,
          the approximate G_i statistics are:

              Well 1:       u_1 = -1.985    G_1 = -3.238

              Well 2:       u_2 = -2.333    G_2 = -3.488

              Well 3:       u_3 = -3.502    G_3 = -4.013  (taking the last and closest entry)

              Well 4:       u_4 = -1.249    G_4 = -2.755

Step 3.   Compute the multiple group test statistic from the G_i values using equation [10.21]:

               G = \frac{1}{\sqrt{4}}\left[(-3.238) + (-3.488) + (-4.013) + (-2.755)\right] = -6.747

Step 4.   Since α = 0.05, the lower α × 100th critical point from the standard normal distribution in
          Table 10-1 of Appendix D is z_{.05} = -1.645. Clearly, G < z_{.05}; in fact G is smaller than just
          about any α-level critical point that might be selected. Thus, there is significant evidence of
          non-normality in at least one of these wells (and probably all of them). ◄

       ^EXAMPLE 10-5

     The data in Example 10-1 showed significant evidence of non-normality. In this example, use the
same nickel data applying the coefficient of skewness,  Shapiro-Wilk and the Probability Plot Correlation
Coefficient tests to determine whether the combined well measurements  better follow a  lognormal
distribution by first log-transforming the measurements.  Computing the  natural logarithms of the data
gives the table below:
                       Logged Nickel Concentrations log(ppb)
        Month       Well 1       Well 2       Well 3       Well 4
        1             4.07         2.94         3.66         1.13
        2             0.00         4.40         5.02         6.85
        3             5.57         5.80         3.30         4.45
        4             4.03         2.64         3.06         2.30
        5             2.16         4.17         6.36         6.46

       SOLUTION

             METHOD 1.  COEFFICIENT OF SKEWNESS

Step 1.   Compute the log-mean (\bar{y}), log-standard deviation (s_y), and sum of the cubed residuals for
          the logged nickel concentrations (y_i):

               \bar{y} = \frac{1}{20}(4.07 + 0.00 + \ldots + 6.46) = 3.918 \log(ppb)

               s_y = \sqrt{\frac{1}{19}\left[(4.07-3.918)^2 + (0.00-3.918)^2 + \ldots + (6.46-3.918)^2\right]} = 1.8014 \log(ppb)

               \sum_{i=1}^{n}(y_i - \bar{y})^3 = (4.07-3.918)^3 + \ldots + (6.46-3.918)^3 = -26.528 \log^3(ppb)

Step 2.   Calculate the coefficient of skewness using equation [10.8] with the Step 1 values:

               \gamma_1 = \frac{\sqrt{20}\,(-26.528)}{(19)^{3/2}(1.8014)^3} = -0.245
         Since the  absolute value of the skewness is less than 1, the data do not show evidence of
         significant skewness.  Applying  a normal  distribution  to  the  log-transformed  data  may
         therefore be appropriate, but this model should be further checked. The logarithmic CV of
         4.97 computed in Example  10-1 was also suggestive of a highly skewed distribution, but can
         be  difficult to  interpret in determining if measurements,  in  fact, follow  a logarithmic
         distribution.

              METHOD 2.  SHAPIRO-WILK TEST

Step 1.   Order and rank the data from smallest to largest and list, as in the table below. List the data in
          reverse order alongside the first column. Denote the ith logged observation by y_i = \log(x_i).

Step 2.   Compute the differences y_{(n-i+1)} - y_{(i)} in column 4 of the table by subtracting column 2 from
          column 3. Since n = 20, the largest integer less than or equal to (n/2) is k = 10.

Step 3.   Look up the coefficients a_{n-i+1} from Table 10-2 of Appendix D and list in column 5.

Step 4.   Multiply the differences in column 4 by the coefficients in column 5 and add the first k
          products (b_i) to get the quantity b, using equation [10.9]:

                      b = [.4734(6.85) + .3211(5.33) + ... + .0140(.04)] = 7.77

     i      y(i)     y(n-i+1)   y(n-i+1) - y(i)   a(n-i+1)      b_i
     1      0.00       6.85           6.85          .4734       3.24
     2      1.13       6.46           5.33          .3211       1.71
     3      2.16       6.36           4.20          .2565       1.08
     4      2.30       5.80           3.50          .2085       0.73
     5      2.64       5.57           2.93          .1686       0.49
     6      2.94       5.02           2.08          .1334       0.28
     7      3.06       4.45           1.39          .1013       0.14
     8      3.30       4.40           1.10          .0711       0.08
     9      3.66       4.17           0.51          .0422       0.02
    10      4.03       4.07           0.04          .0140       0.00
    11      4.07       4.03          -0.04
    12      4.17       3.66          -0.51
    13      4.40       3.30          -1.10
    14      4.45       3.06          -1.39
    15      5.02       2.94          -2.08
    16      5.57       2.64          -2.93
    17      5.80       2.30          -3.50
    18      6.36       2.16          -4.20
    19      6.46       1.13          -5.33
    20      6.85       0.00          -6.85
                                                          b = 7.77

Step 5.   Compute the log-standard deviation of the sample, s_y = 1.8014. Then use equation [10.10] to
          calculate the SW test statistic:

               SW = \left[\frac{7.77}{1.8014\sqrt{19}}\right]^2 = 0.979
Step 6.   Use Table 10-3 of Appendix D to determine the .01-level critical point for the Shapiro-Wilk
         test when n = 20. This gives swcp = 0.868. Then compare the observed value of SW= 0.979 to
         the 1% critical point. Since SW> 0.868, the  sample shows no significant evidence of non-
         normality by the Shapiro-Wilk test.  Proceed  with further  statistical analysis using the log-
         transformed data or by assuming the underlying population is lognormal.

             METHOD 3. PROBABILITY PLOT CORRELATION COEFFICIENT

Step 1.   Order and rank the logged nickel data from smallest to largest and list, as in the table below.
          Again let the ith logged value be denoted by y_i = \log(x_i). The sample size is n = 20, the log-
          mean is \bar{y} = 3.918, and the log-standard deviation is s_y = 1.8014.

Step 2.   Compute the intermediate probabilities m_i from equation [10.13] for each i in column 3 and
          the rank statistic medians, M_i, in column 4 by applying the inverse normal transformation to
          column 3 using equation [10.14] and Table 10-1 of Appendix D.
     i      y(i)       m_i        M_i       (M_i)^2    y(i) x M_i
     1      0.00     .03406    -1.8242      3.328         0.000
     2      1.13     .08262    -1.3877      1.926        -1.568
     3      2.16     .13172    -1.1183      1.251        -2.416
     4      2.30     .18082    -0.9122      0.832        -2.098
     5      2.64     .22993    -0.7391      0.546        -1.951
     6      2.94     .27903    -0.5857      0.343        -1.722
     7      3.06     .32814    -0.4451      0.198        -1.362
     8      3.30     .37724    -0.3127      0.098        -1.032
     9      3.66     .42634    -0.1857      0.034        -0.680
    10      4.03     .47545    -0.0616      0.004        -0.248
    11      4.07     .52455     0.0616      0.004         0.251
    12      4.17     .57366     0.1857      0.034         0.774
    13      4.40     .62276     0.3127      0.098         1.376
    14      4.45     .67186     0.4451      0.198         1.981
    15      5.02     .72097     0.5857      0.343         2.940
    16      5.57     .77007     0.7391      0.546         4.117
    17      5.80     .81918     0.9122      0.832         5.291
    18      6.36     .86828     1.1183      1.251         7.112
    19      6.46     .91738     1.3877      1.926         8.965
    20      6.85     .96594     1.8242      3.328        12.496
Step 3.   Since this sample contains no non-detects or ties, the simplified equations for C_n in [10.16]
          and for r in [10.18] may be used. First compute C_n using the squared order statistic medians in
          column 5:

               C_n = \sqrt{3.328 + 1.926 + \ldots + 3.328} = 4.138

Step 4.   Next compute the products y_{(i)} \times M_i in column 6 and sum to get the numerator of the
          correlation coefficient (equal to 32.226 in this case). Then compute the final correlation
          coefficient:

               r = 32.226 / \left[4.138 \times 1.8014\sqrt{19}\right] = 0.992

Step 5.   Compare Filliben's test statistic of r = 0.992 to the 1% critical point for a sample of size 20
          in Table 10-5 in Appendix D, namely r_cp = 0.925. Since r > 0.925, the sample shows no
          significant evidence of non-normality by the probability plot correlation coefficient test.
          Therefore, lognormality of the original data can be assumed in subsequent statistical
          procedures.

         Note: the  Shapiro-Wilk and  Filliben's Probability  Plot  Correlation Coefficient  tests for
         normality on a single data set perform quite comparably. Only one of these tests need be run in
         routine applications. -^

       CHAPTER 11.  TESTING EQUALITY OF VARIANCE
       11.1   BOX PLOTS
       11.2   LEVENE'S TEST
       11.3   MEAN-STANDARD DEVIATION SCATTER PLOT
     Many of the methods described in the Unified Guidance assume that the different groups under
comparison  have the  same variance (i.e., are homoscedastic). This chapter covers  procedures for
assessing homoscedasticity and its counterpart, heteroscedasticity (i.e., unequal variances). Equality of
variance  is  assumed, for instance,  when  using  prediction limits  to  make either upgradient-to-
downgradient or intrawell comparisons. In the former case,  the method  assumes that the upgradient
variance is equal to the variance in each downgradient well. In the latter case, the presumption is that the
well variance is  stable over time (i.e., stationary) when comparing intrawell background versus  more
recent measurements.

     If a prediction limit is constructed on a single new measurement at each downgradient well, it isn't
feasible to test the variance equality assumption prior to each statistical evaluation. Homoscedasticity
can  be tested after several new rounds  of compliance sampling  by pooling collected compliance
measurements within a well. The Unified Guidance recommends periodic testing of the presumption of
equal variances by comparing newer data to historical background (Chapter 6).

     Equality of variance  between  different  groups  (e.g., different wells) is also an important
assumption for an analysis of variance [ANOVA]. If equality of variance does not hold, the power of the
F-test (its ability to detect differences among the group means) is reduced. Mild differences in variance
are generally acceptable. But the  effect  becomes noticeable when the largest and  smallest group
variances differ by a ratio of about 4, and becomes quite severe when the  ratio is  10 or more (Milliken
and Johnson, 1984).

     Three procedures for assessing or testing homogeneity of variance are described in the Unified
Guidance, two of which are more robust to departures from normality (i.e., less sensitive to non-
normality). These include:

  1.    The box plot (Chapter 9), a graphical method useful not only for checking equality of variance
       but also as an exploratory tool for visualizing the basic statistical characteristics of data sets.  It
       can also provide a rough indication of differences in mean or median concentration levels across
       several wells;
  2.    Levene's test (Section  11.2), a formal ANOVA-type procedure for testing variance inequality;
       and
  3.    The mean-standard deviation scatter plot (Chapter  9  and Section  11.3), a  visual  tool for
       assessing whether the degree of variability in a set of data groups or wells is correlated with the
       mean levels for those groups.  This could potentially indicate whether a variance stabilizing
       transformation might be needed.


11.1 BOX PLOTS

       PURPOSE AND BACKGROUND

     Box plots are described in Chapter 9. In the context of variance testing, one can construct a box
plot for each well group and compare the boxes to see if the assumption of equal variances is reasonable.
The comparison  is not a formal  test procedure, but is easier to perform and is often sufficient for
checking the group variance assumption.

     Box plots  for each  data group  simultaneously graphed  side-by-side  provide a direct visual
comparison of the dispersion in each group. As a rule of thumb, if the box length for each group is less
than 1.5-2 times the length of the shortest box, the sample variances  may be close enough to assume
equal  group variances. If the box length for any group is greater than 1.5-2 times the length of the box
for another group, the variances may be significantly different. A  formal test such as Levene's might be
needed to more accurately decide. Sample data sets with unequal variances may need a variance
stabilizing transformation, i.e., one in which the transformed measurements have approximately equal
variances.

     Most statistical software packages will calculate the statistics needed to draw a box plot, and many
will construct side-by-side box plots directly. Usually a box plot will also be shown with two "whiskers"
extending from the edges of the box. These lines indicate either  the positions of extreme minimum or
maximum values in the data set.  In Tukey's original formulation (Tukey, 1977), they indicate the most
extreme lower and upper data points outside  the box but falling within a distance of 1.5 times the
interquartile range (that is, the length of the box) from either edge. The whiskers should generally not be
used to approximate the overall variance under either formulation.

     A convenient tactic when using box plots to screen for heteroscedasticity is to plot the residuals of
each data group  rather than the measurements themselves.  This will  line the boxes up at roughly a
common level (close to zero), so that a visual comparison of box lengths is easier.

       REQUIREMENTS AND ASSUMPTIONS

     The requirements and assumptions for box plots are discussed in Section 9.2.

       PROCEDURE

Step 1.   For each of the j = 1...J wells or data groups, compute the sample mean of that group, \bar{x}_j.
          Then compute the residuals (r_{ij}) for each group by subtracting the group mean from each
          individual measurement:  r_{ij} = x_{ij} - \bar{x}_j.

Step 2.   Use the procedure outlined in Section 9.2  to create side-by-side box plots of the residuals
         formed in Step 1. Then compare the box lengths to check for possibly unequal variances.
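     Constructing the residuals and the side-by-side box plots takes only a few lines in most statistical
software. A minimal Python sketch using matplotlib (the data are the arsenic measurements of Example
11-1 below; the layout is illustrative) is:

    import numpy as np
    import matplotlib.pyplot as plt

    # Arsenic data of Example 11-1 (ppb), one list per well
    wells = {
        "Well 1": [22.9, 3.1, 35.7, 4.2],
        "Well 2": [2.0, 1.2, 7.8, 52],
        "Well 3": [2.0, 109.4, 4.5, 2.5],
        "Well 4": [7.8, 9.3, 25.9, 2.0],
        "Well 5": [24.9, 1.3, 0.8, 27],
        "Well 6": [0.3, 4.8, 2.8, 1.2],
    }

    # Step 1: residuals = measurements minus the well mean
    residuals = {w: np.asarray(x) - np.mean(x) for w, x in wells.items()}

    # Step 2: side-by-side box plots of the residuals
    plt.boxplot(list(residuals.values()), labels=list(residuals.keys()))
    plt.ylabel("Arsenic residual (ppb)")
    plt.show()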

       ^EXAMPLE 11-1

     Construct  box plots  on the  residuals  for  each of the  following  well  groups  to check for
homoscedasticity.

                              Arsenic Concentration (ppb)
        Month     Well 1    Well 2    Well 3    Well 4    Well 5    Well 6
        1           22.9       2.0       2.0       7.8      24.9       0.3
        2            3.1       1.2     109.4       9.3       1.3       4.8
        3           35.7       7.8       4.5      25.9       0.8       2.8
        4            4.2      52         2.5       2.0      27         1.2
       SOLUTION
Step 1.   Form the residuals for each well by subtracting the sample well mean from each observation,
         as shown in the table below.
                                Arsenic Residuals (ppb)
        Month     Well 1    Well 2    Well 3    Well 4    Well 5    Well 6
        1           6.43    -13.75     -27.6     -3.45      11.4     -1.98
        2         -13.38    -14.55      79.8     -1.95     -12.2      2.52
        3          19.22     -7.95     -25.1     14.65     -12.7      0.52
        4         -12.28     36.25     -27.1     -9.25      13.5     -1.08
        Mean       16.48     15.75      29.6     11.25      13.5      2.28
Step 2.   Follow the procedure in Section 9.2 to compute a box plot of the residuals for each well. Line
         these up side by side on the same graph, as in Figure 11-1.

Step 3.   Compare the box lengths. Since the box length for Well 3 is more than three times the box
         lengths of Wells 4 and 6, there is informal evidence that the population group variances may
         be different. These data should be further checked using a formal test and perhaps a variance
         stabilizing transformation attempted. -^
                Figure 11-1. Side-by-Side Box Plots of Arsenic Residuals
                          (one box per well, Wells 1 through 6)
 11.2 LEVENE'S TEST

       PURPOSE AND  BACKGROUND

     Levene's test is a formal procedure for testing homogeneity of variance that is fairly robust (i.e.,
not overly sensitive) to non-normality in the data. It is based on computing the new variables:

               z_{ij} = \left| x_{ij} - \bar{x}_{i\cdot} \right|                               [11.1]

where x_{ij} represents the jth sample value from the ith group (e.g., well) and \bar{x}_{i\cdot} is the ith group
sample mean. The symbol (\cdot) in the notation for the group sample mean represents averaging over
the subscript j. The values z_{ij} then represent the absolute values of the residuals. Levene's test involves
running a standard one-way ANOVA (Chapter 17) on the variables z_{ij}. If the F-test is significant, reject
the hypothesis of equal group variances and perhaps seek a variance stabilizing transformation.
Otherwise, proceed with analysis of the original x_{ij}'s.
     Levene's test is based on a one-way ANOVA and contrasts the means of the groups being tested.
This implies a comparison between averages of the form:
               \bar{z}_{i\cdot} = \frac{1}{n_i}\sum_{j=1}^{n_i} z_{ij}                         [11.2]

Such averages of the z_{ij}'s are very similar to the standard deviations of the original data groups, given
by the formula:

               s_i = \sqrt{\frac{1}{n_i - 1}\sum_{j=1}^{n_i}\left(x_{ij} - \bar{x}_{i\cdot}\right)^2}          [11.3]

       In both cases, the statistics are akin to an average absolute residual. Therefore, the comparison of
means in Levene's test is closely related to a direct comparison of the group standard deviations, the
underlying aim of any test of variance equality.

       REQUIREMENTS  AND ASSUMPTIONS

     The requirements and assumptions  for  Levene's test are essentially the  same as the one-way
ANOVA in Section 17.1, but applied to the absolute residuals instead of the raw measurements.

       PROCEDURE

Step 1.   Suppose there are p data groups to be compared. Because there may be different numbers of
          observations per well, denote the sample size of the ith group by n_i and the total number of
          data points across all groups by N.

          Denote the observations in the ith group by x_{ij} for i = 1...p and j = 1...n_i. The first subscript
          designates the well, while the second denotes the jth value in the ith well. After computing the
          sample mean (\bar{x}_{i\cdot}) for each group, calculate the absolute residuals (z_{ij}) using
          equation [11.1].

Step 2.   Utilizing the absolute residuals (not the original data), compute the mean of each group along
          with the overall (grand) mean of the combined data set using the formula:

               \bar{z}_{\cdot\cdot} = \frac{1}{N}\sum_{i=1}^{p}\sum_{j=1}^{n_i} z_{ij}         [11.4]

Step 3.   Compute the sum of squares of differences between the group means and the grand mean,
          denoted SS_grps:

               SS_{grps} = \sum_{i=1}^{p} n_i\left(\bar{z}_{i\cdot} - \bar{z}_{\cdot\cdot}\right)^2 = \sum_{i=1}^{p} n_i \bar{z}_{i\cdot}^{\,2} - N\bar{z}_{\cdot\cdot}^{\,2}          [11.5]

          The formula on the far right is usually the most convenient for calculation. This sum of
          squares has (p-1) degrees of freedom associated with it and is a measure of the variability
          between groups. It constitutes the numerator of the F-statistic.

Step 4.   Compute the corrected total sum of squares, denoted SS_total:

               SS_{total} = \sum_{i=1}^{p}\sum_{j=1}^{n_i}\left(z_{ij} - \bar{z}_{\cdot\cdot}\right)^2 = \sum_{i=1}^{p}\sum_{j=1}^{n_i} z_{ij}^{\,2} - N\bar{z}_{\cdot\cdot}^{\,2}          [11.6]

          Again, the formula on the far right is usually the most computationally convenient. This sum
          of squares has (N-1) associated degrees of freedom.

Step 5.   Compute the sum of squares of differences between the absolute residuals and the group
          means. This is known as the within-groups component of the total sum of squares or,
          equivalently, as the sum of squares due to error. It is easiest to obtain by subtracting SS_grps
          from SS_total and is denoted SS_error:

               SS_{error} = \sum_{i=1}^{p}\sum_{j=1}^{n_i}\left(z_{ij} - \bar{z}_{i\cdot}\right)^2 = SS_{total} - SS_{grps} = \sum_{i=1}^{p}\sum_{j=1}^{n_i} z_{ij}^{\,2} - \sum_{i=1}^{p} n_i\bar{z}_{i\cdot}^{\,2}          [11.7]

          SS_error is associated with (N-p) degrees of freedom and is a measure of the variability within
          groups. This quantity goes into the denominator of the F-statistic.

Step 6.   Compute the mean sum of squares for both the between-groups and within-groups
          components of the total sum of squares, denoted MS_grps and MS_error. These quantities are
          obtained by dividing each sum of squares by its corresponding degrees of freedom:

               MS_{grps} = SS_{grps}/(p-1)                                                     [11.8]

               MS_{error} = SS_{error}/(N-p)                                                   [11.9]

Step 7.   Compute the F-statistic by forming the ratio between the mean sum of squares for wells and
          the mean sum of squares due to error, as in Figure 11-2 below. This layout is known as the
          one-way parametric ANOVA table and illustrates each sum of squares component of the total
          variability, along with the corresponding degrees of freedom, the mean squares components,
          and the final F-statistic calculated as F = MS_grps/MS_error. Note that the first two rows of the
          one-way table sum to the last row.

Step 8.   Figure 11-2 is a generalized ANOVA table for Levene's test. To test the hypothesis of equal
          variances across all p well groups, compare the F-statistic in Figure 11-2 to the α-level critical
          point found from the F-distribution with (p-1) and (N-p) degrees of freedom in Appendix D
          Table 17-1. When testing variance equality, only severe levels of difference typically impact
          test performance in a substantial way. For this reason, the Unified Guidance recommends
          setting α = .01 when screening multiple wells and/or constituents using Levene's test. In that
          case, the needed critical point equals the upper 99th percentage point of the F-distribution. If
          the observed F-statistic exceeds the critical point (F_{.99, p-1, N-p}), reject the hypothesis of
          equal group population variances. Otherwise, conclude that there is insufficient evidence of a
          significant difference between the variances.
                      Figure 11-2. ANOVA Table for Levene's Test

  Source of Variation     Sums of Squares   Degrees of Freedom   Mean Squares                 F-Statistic
  Between Wells           SS_grps           p - 1                MS_grps = SS_grps/(p-1)      F = MS_grps/MS_error
  Error (within wells)    SS_error          N - p                MS_error = SS_error/(N-p)
  Total                   SS_total          N - 1
       ^EXAMPLE 11-2

     Use the data from Example 11-1 to conduct Levene's test of equal variances at the α = 0.01 level
of significance.

       SOLUTION
Step 1.   Calculate the group arsenic mean for each well (\bar{x}_{i\cdot}):

              Well 1 mean = 16.47 ppb          Well 4 mean = 11.26 ppb

              Well 2 mean = 15.76 ppb          Well 5 mean = 13.49 ppb

              Well 3 mean = 29.60 ppb          Well 6 mean = 2.29 ppb

          Then compute the absolute residuals z_{ij} in each well using equation [11.1] as in the table
          below.
                           Absolute Arsenic Residuals (z_ij)
        Month          Well 1    Well 2    Well 3    Well 4    Well 5    Well 6
        1                6.43     13.76      27.6      3.42     11.41      1.95
        2               13.38     14.51      79.8      1.96     12.19      2.49
        3               19.23      7.96      25.1     14.64     12.74      0.56
        4               12.29     36.24      27.1      9.26     13.51      1.09
        Mean            12.83     18.12      39.9      7.32     12.46      1.52

        Overall mean (all wells) = 15.36
Step 2.   Compute the mean absolute residual (\bar{z}_{i\cdot}) in each well and then the overall grand mean
          using equation [11.4]. These results are listed above.

Step 3.   Compute the between-groups sum of squares for the absolute residuals using equation [11.5]:


               SS_{grps} = \left[4(12.83)^2 + 4(18.12)^2 + \ldots + 4(1.52)^2\right] - 24(15.36)^2 = 3,522.90


Step 4.   Compute the corrected total sum of squares using equation [11.6]:

               SS_{total} = \left[(6.43)^2 + (13.38)^2 + \ldots + (1.09)^2\right] - 24(15.36)^2 = 6,300.89

Step 5.   Compute the within-groups or error sum of squares using equation [11.7]:

               SS_{error} = 6,300.89 - 3,522.90 = 2,777.99

Step 6.   Given that the number of groups is p = 6 and the total sample size is N = 24, calculate the
          mean squares for the between-groups and error components using formulas [11.8] and [11.9]:

               MS_{grps} = 3,522.90/(6-1) = 704.58

               MS_{error} = 2,777.99/(24-6) = 154.33

Step 7.   Construct an ANOVA table following Figure 11-2 to calculate the F-statistic. The numerator
          degrees of freedom [df] is computed as (p-1) = 5, while the denominator df is equal to (N-p) =
          18.
  Source of Variation      Sums of Squares    Degrees of Freedom    Mean Squares    F-Statistic
  Between Well Grps             3,522.90              5                 704.58          4.56
  Error (within grps)           2,777.99             18                 154.33
  Total                         6,300.89             23
Step 8.   Determine the .01-level critical point for the F-test with 5 and 18 degrees of freedom from
          Table 17-1. This gives F_{.99,5,18} = 4.25. Since the F-statistic of 4.56 exceeds the critical point,
          the assumption of equal variances should be rejected. Since the original concentration data are
          used in this example, a transformation such as the natural logarithm might be tried and the
          transformed data retested. ◄
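     Levene's test is implemented in most statistical packages. As a cross-check on the hand calculation
above, the Python sketch below uses scipy.stats.levene with center='mean', which performs the classical
mean-based version of the test described in this section (the scipy default, center='median', is the
Brown-Forsythe variant):

    from scipy import stats

    # Arsenic data of Example 11-1 (ppb), one list per well
    wells = [
        [22.9, 3.1, 35.7, 4.2],      # Well 1
        [2.0, 1.2, 7.8, 52],         # Well 2
        [2.0, 109.4, 4.5, 2.5],      # Well 3
        [7.8, 9.3, 25.9, 2.0],       # Well 4
        [24.9, 1.3, 0.8, 27],        # Well 5
        [0.3, 4.8, 2.8, 1.2],        # Well 6
    ]

    F_stat, p_value = stats.levene(*wells, center='mean')
    print(F_stat, p_value)   # F should be close to the 4.56 computed by hand in Example 11-2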
11.3 MEAN-STANDARD DEVIATION  SCATTER PLOT

       BACKGROUND AND PURPOSE

     The mean-standard deviation scatter plot is described in Chapter 9. It is useful as an exploratory
tool for multiple groups of data (e.g., wells) to aid in identifying relationships between mean levels and
variability.  It is also helpful in providing  a visual assessment of variance homogeneity across data
groups. Like side-by-side box plots, the mean-standard  deviation scatter plot graphs a measure  of
variability for each well. In the latter, however,  the  standard deviation is plotted  rather  than the
interquartile range, so a more  direct assessment of variance equality can be made.  Since standard

deviations (and consequently variances) are often positively correlated with sample mean levels in
skewed populations, the observed pattern on the mean-standard deviation scatter plot can offer valuable
clues as to what sort of variance-stabilizing transformation, if any, might work.

      REQUIREMENTS AND ASSUMPTIONS

     The requirements for the mean-standard deviation scatter plot are listed in Section 9.4.

      PROCEDURE

     See Section 9.4.

      ^EXAMPLE 11-3

     Use the data from Example 11-1 to construct a mean-standard deviation scatter plot.

      SOLUTION
Step 1.   First compute the sample mean (x ) and standard deviation (s) of each well, as listed below.
 Well      Mean      Std Dev
  1       16.468     15.718
  2       15.762     24.335
  3       29.600     53.211
  4       11.260     10.257
  5       13.488     14.418
  6        2.292      1.958
Step 2.   Plot the well means versus the standard deviations as in Figure 11-3 below. Note the roughly
         linear relationship between the magnitude of the standard deviations and their corresponding
         means. The data suggest unequal variances among the wells, as indicated by the large range in
         the standard deviations.
                  Figure 11-3. Arsenic Mean-Standard Deviation Plot
                  [scatter plot of well standard deviations versus mean arsenic (ppb)]
Step 3.   Because lognormal  data groups will tend to show a linear association between the sample
         means  and  standard  deviations,  apply  a log  transformation  to the  original  arsenic
         measurements  and  reconstruct the mean-standard deviation  scatter plot  on the log scale.
         Computing the log-means and log-standard deviations and then re-plotting gives Figure 11-4.
         Now the apparent trend between the means and standard deviations is gone. Further, on the
         log scale, the standard deviations are much more similar in magnitude, all with values between
         1 and 2. The log transformation thus appears to roughly stabilize the arsenic variances. -4
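
     A brief sketch of the computation behind Figures 11-3 and 11-4 is given below (Python, assuming
NumPy and Matplotlib). The wells dictionary is a placeholder for the Example 11-1 arsenic
measurements; each plotted point is one well's (mean, standard deviation) pair, computed on the raw
scale and again after a log transformation.

     import numpy as np
     import matplotlib.pyplot as plt

     wells = {"Well 1": [1.0, 2.0, 4.0, 8.0],        # placeholder measurements
              "Well 2": [3.0, 6.0, 12.0, 24.0]}      # substitute the actual well data

     fig, axes = plt.subplots(1, 2, figsize=(9, 4))
     for ax, transform, label in [(axes[0], np.asarray, "raw"), (axes[1], np.log, "log")]:
         vals = [transform(np.array(v, dtype=float)) for v in wells.values()]
         ax.scatter([v.mean() for v in vals], [v.std(ddof=1) for v in vals])
         ax.set_xlabel(f"Mean ({label} scale)")
         ax.set_ylabel(f"Standard deviation ({label} scale)")
     plt.tight_layout()
     plt.show()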
                  Figure 11-4. Log(Arsenic) Mean-Standard Deviation Plot

              CHAPTER 12.   IDENTIFYING  OUTLIERS
       12.1   SCREENING WITH PROBABILITY PLOTS ...................................................... 12-1
       12.2   SCREENING WITH BOX PLOTS .................................................................... 12-5
       12.3   DIXON'S TEST ............................................................................................. 12-8
       12.4   ROSNER'S TEST ........................................................................................ 12-10
     This chapter discusses  screening tools and formal tests for identifying statistical  outliers.  Two
screening tools are first presented: probability plots (Section 12.1) and box plots (Section 12.2). These
are followed by two formal outlier tests:

    •  Dixon's test (Section 12.3) for a single outlier in smaller data sets, and
    •  Rosner's test (Section 12.4) for up to five separate outliers in larger data sets.
     A statistical  determination  of one or  more  statistical  outliers  does   not  indicate  why the
measurements are discrepant from the rest of the data set. The Unified Guidance does not recommend
that  outliers  be  removed solely on a statistical  basis.   The  outlier  tests can provide supportive
information, but  generally a reasonable rationale needs to be identified  for removal of suspect outlier
values  (usually limited to background data).  At the same time there must be some level of confidence
that the data are representative of ground water quality. A number  of factors and considerations in
removing outliers from potential background data are discussed in Section 5.2.3.

12.1  SCREENING WITH PROBABILITY  PLOTS

       BACKGROUND AND PURPOSE

     Probability plots (Chapter 9) are helpful in identifying outliers in at least two ways. First, since the
straightness of the plot indicates how closely the data fit the pattern of a normal distribution, values that
appear "out of line" with the remaining data can be visually identified as possible outliers. Secondly, the
two formal outlier tests presented in the Unified Guidance assume that the underlying population minus
the suspected outlier(s) is  normal. Probability plots can provide visual evidence  for this assumption.
Data that appear non-normal after the suspected outliers have been removed from the probability plot
may need to be transformed (e.g.,  via the natural logarithm) and re-examined on the transformed scale to
see if potential outliers are still apparent.

     As an aid to the interpretation of a given probability plot, the Unified Guidance recommends
computation of the probability plot correlation coefficient,  using either Filliben's  procedure (Chapter
10) or  the simple (Pearson) correlation (Chapter 3) between the numerical pairs plotted on the graph.
The higher the correlation, the more linear the pattern is on the probability plot and therefore a better fit
to normality. Note that while the Filliben correlation coefficient can be compared to  critical points
derived for that test of normality (Chapter 10), a low correlation may be related  to other causes of non-
normality besides the presence of outliers. The correlation coefficient is not a  substitute for a formal
outlier test, but can be useful as a screening tool.

       REQUIREMENTS AND ASSUMPTIONS

     Probability plots are primarily a tool to assess normality, and not to identify outliers per se. It is
critical that the remaining data without potential outliers is  either normal in distribution  or  can be
normalized via a transformation. Otherwise, the probability plot may appear non-linear and non-normal
for reasons unrelated to the presence of outliers. Right-skewed lognormal distributions can  appear to
have one or more outliers on a probability plot unless the original data are first log-transformed. As a
general rule, probability plots should be constructed on the original (or raw) measurements and one or
more transformed data sets (e.g., log or square root), in order to avoid mistaking inherent data skewness
for outliers.

     If the raw and transformed-data probability plots both indicate one or more values inconsistent
with the pattern of the remaining values, continue with a second level of screening by temporarily
removing the  suspected  outlier(s) and  re-constructing the probability  plots.  If the raw-scale  plot is
reasonably linear, consider running  a formal  outlier test on the original measurements. On the  other
hand, if the raw-scale  plot is skewed but the transformed-scale plot is linear, consider conducting a
formal outlier test on the transformed measurements.

     A related difficulty occurs when sample data includes  censored or non-detect values.  If simple
substitution is used to estimate a value for each non-detect prior to plotting, the resulting probability plot
may appear non-linear simply because the censored observations were not properly handled. In this case,
a censored probability plot (Chapter 15) should be constructed instead of an uncensored,  complete
sample plot (Chapter 9). The same caveats apply to normalizing the sample data, perhaps by attempting
at least  one transformation.  The  only  difference is that  each  probability plot  constructed  must
appropriately account for the observed censoring in the sample.

     PROCEDURE

Step 1.   After identifying one or more possible outliers (e.g., values much higher in concentration than
         the remaining measurements), construct  a probability plot on the  entire  sample  using  the
         procedure described in Section 9.5.  Construct a censored probability plot from Section 15.3
         if the sample contains  non-detects. If the data including the suspected outlier(s) follow a
         reasonably linear pattern, a formal outlier test is probably unnecessary. However, if one or
         more values are out of line compared to the pattern of the remaining data, construct a similar
         probability plot after applying one or more transformations. If one or more suspected outliers
         is still inconsistent, proceed to Step 2.

Step 2.   Compute  a probability plot correlation coefficient for each plot constructed in Step 1. Use
         these correlations as an aid to interpreting the degree of linearity in each probability plot.

Step 3.   Reconstruct the  probability plots  from  Step  1  after removing  the  suspected   outlier(s).
         Recompute the correlation coefficients from Step 2 on this reduced sample.

Step 4.   If the 'outlier-deleted' probability plot on the raw concentration scale indicates a linear pattern
         with  high correlation, consider running a formal outlier test on the original measurements.
         When the pattern  is distinctly  non-linear but the corresponding  probability  plot  on  the
         transformed-scale is fairly linear (and  higher in correlation),  conduct the outlier test on  the
         transformed values.
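
     The screening steps above are easy to script. The sketch below (Python with SciPy; the sample
shown is the combined carbon tetrachloride data from Example 12-1, which follows) computes the
probability plot correlation coefficient on the raw and log scales, before and after temporarily deleting
the suspected outlier. The reported values should be close to, though not necessarily identical with, the
correlations quoted in the example, since plotting positions vary slightly among procedures.

     import numpy as np
     from scipy import stats

     def pp_corr(x):
         """Correlation between the ordered sample and its normal plotting positions."""
         (_osm, _osr), (_slope, _intercept, r) = stats.probplot(np.asarray(x), dist="norm")
         return r

     data = np.array([1.7, 3.2, 7.3, 12.1, 302.0, 35.1, 15.6, 13.7, 16.2, 7066.0,
                      350.0, 70.1, 199.0, 41.6, 75.4, 57.9, 275.0, 6.5, 59.7, 68.4])
     reduced = data[data != data.max()]      # Step 3: temporarily delete the suspect value

     for label, x in [("raw, full sample", data), ("log, full sample", np.log(data)),
                      ("raw, outlier-deleted", reduced), ("log, outlier-deleted", np.log(reduced))]:
         print(label, round(pp_corr(x), 3))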
       ^EXAMPLE 12-1

     The table below contains data from five background wells measured over a four month period. The
value 7,066 is found in the second month at Well 3. Use probability plots on the combined sample to
determine whether or not a formal outlier test is warranted.
                      Carbon Tetrachloride Concentrations (ppb)
       Well 1        Well 2        Well 3        Well 4        Well 5
         1.7          302           16.2          199           275
         3.2           35.1        7066            41.6           6.5
         7.3           15.6         350            75.4          59.7
        12.1           13.7          70.1          57.9          68.4
       SOLUTION
Step 1.   Examine the probability plots of the entire sample first using the raw measurements and then
         log-transformed values (Figures 12-1 and 12-2). Both these plots indicate that the suspected
         outlier does not follow the pattern of the remaining observations, but seems 'out of line.' The
         Pearson correlation coefficients for these probability  plots are, respectively, r = 0.513 and
         0.975,  indicating that  the fit  to normality overall is much  closer using log-transformed
         measurements.

             Figure 12-1.  Probability Plot on Raw Concentrations (r = .513)
             [x-axis: Carbon tetrachloride (ppb)]

Step 2.   Next remove the suspected outlier and reconstruct the probability plots on both the original
          and logged observations (Figures 12-3 and 12-4). The plot on the original scale indicates
          heavy positive (or right-) skewness and a non-linear pattern, while the plot on the log-scale
          exhibits a fairly linear pattern. The respective correlation coefficients now become r = 0.854
          and 0.985, again favoring the log-transformed sample. On the basis of these plots, the
          underlying data should be modeled as lognormal and the observations logged prior to running
          a formal outlier test. ◄


           Figure 12-2. Probability Plot on Logged Observations (r = .975)
           [x-axis: Log(carbon tetrachloride) log(ppb)]
      Figure 12-3. Outlier-Deleted Probability Plot on Original Scale (r = .854)
      [x-axis: Carbon Tetrachloride (ppb)]
    Figure 12-4. Outlier-Deleted Probability Plot on Logarithmic Scale (r = .985)
    [x-axis: Log(carbon tetrachloride) log(ppb)]
12.2 SCREENING WITH BOX PLOTS

       BACKGROUND AND PURPOSE

     Probability plots as described  in  Section  12.1  require the  remaining  observations following
removal of one or  more suspected  outliers to be  either approximately normal  or  normalized  via
transformation.  Box plots (Chapter  9) provide an alternate method to perform outlier screening, one
not dependent on normality of the underlying measurement population. Instead of looking for points
inconsistent with a linear pattern on a probability plot, the box plot flags as possible outliers values that
are located in either or both of the extreme tails of the sample.

     To define the extreme tails, Tukey (1977) proposed the concept of 'hinges' that would 'swing' off
either end of a box plot, defining the  range of concentrations consistent with the bulk of the data. Data
points outside this concentration range could then be identified as potential outliers. Tukey defined the
hinges, i.e., the lower and upper edges of the box plot, essentially as the lower and upper quartiles of the
data set. Then multiples of the interquartile range [IQR]  (i.e., the range represented by the middle half of
the sample)  were added  to  or subtracted  from these hinges as  potential outlier boundaries. Any
observation from 1.5 x IQR to 3 x IQR below the lower edge of the box plot was labeled a 'mild' low
outlier; any value more than 3 x IQR below the lower edge of the box plot was labeled an 'extreme' low
outlier. Similarly, values greater than the upper edge of the box plot in the range of 1.5 to 3 times the
IQR were labeled 'mild' higher outliers, and  'extreme' high outliers if more than 3  times the IQR
beyond the upper box plot edge.

       REQUIREMENTS AND ASSUMPTIONS

     By using hinges and multiples of the interquartile range, Tukey's box plot method utilizes statistics
(i.e., the lower and upper quartiles) that are generally not or minimally affected by one or a few outliers
in the sample. Consequently, it isn't necessary to first delete possible outliers before constructing the
box plot.

     Screening for outliers with box plots is a very simple technique. Since no assumption of normality
is needed, Tukey's procedure can be considered quasi-non-parametric. But note that rough symmetry of
the underlying distribution  is  implicitly  assumed. Legitimate observations from  highly  skewed
distributions could be flagged as potential outliers on a box plot if no transformation of the data is first
attempted. It may be necessary to first conduct multiple  data  transformations in order  to  achieve
approximate symmetry before applying and evaluating potential outliers with box plots.

PROCEDURE

Step 1.   Construct a box plot on the sample using the method given in Section 9.2. Using the IQR from
          that calculation, along with the lower and upper quartiles (x.25 and x.75), compute the first pair
          of lower and upper boundaries as:

                                     LB1 = x.25 - 1.5 × IQR                                (12.1)

                                     UB1 = x.75 + 1.5 × IQR                                (12.2)

Step 2.   Construct the second pair of lower and upper boundaries as:

                                     LB2 = x.25 - 3 × IQR                                  (12.3)

                                     UB2 = x.75 + 3 × IQR                                  (12.4)
Step 3.   Label any sample measurement lower than the first lower boundary (LB1) but no less than the
          second lower boundary (LB2) as a mild low outlier. Label any measurement greater than the
          first upper boundary (UB1) but no greater than the second upper boundary (UB2) as a mild high
          outlier.

Step 4.   Label any sample measurement lower than the second lower boundary (LB2) as an extreme
          low outlier. Label any value higher than the second upper boundary (UB2) as an extreme high
          outlier.
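
     A minimal sketch of the fence calculations in Steps 1 through 4 follows (Python with NumPy).
Note that quartile conventions differ slightly among software packages and the Section 9.2 procedure,
so the computed IQR and fences may not match a hand calculation exactly.

     import numpy as np

     def tukey_screen(x):
         """Return (mild, extreme) outliers flagged by the fences of equations 12.1-12.4."""
         x = np.asarray(x, dtype=float)
         q25, q75 = np.percentile(x, [25, 75])
         iqr = q75 - q25
         lb1, ub1 = q25 - 1.5 * iqr, q75 + 1.5 * iqr      # mild-outlier boundaries
         lb2, ub2 = q25 - 3.0 * iqr, q75 + 3.0 * iqr      # extreme-outlier boundaries
         mild = x[((x < lb1) & (x >= lb2)) | ((x > ub1) & (x <= ub2))]
         extreme = x[(x < lb2) | (x > ub2)]
         return mild, extreme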

       ^EXAMPLE 12-2

     Use the carbon tetrachloride data from Example 12-1 to screen for possible outliers using Tukey's
box plot.

       SOLUTION
Step 1.   Using the procedure described in Section 9.2, the upper and lower quartiles of the carbon
          tetrachloride sample are found to be x.25 = 12.9 and x.75 = 137.2, leading to an IQR = 124.3.

Step 2.   Compute the two pairs  of lower and upper boundaries using equations (12.1), (12.2), (12.3),
         and (12.4):
                               LB1 = 12.9 - 1.5 × 124.3 = -173.55
                               UB1 = 137.2 + 1.5 × 124.3 = 323.65
                               LB2 = 12.9 - 3 × 124.3 = -360.0
                               UB2 = 137.2 + 3 × 124.3 = 510.1

Step 3.   Scan the list of carbon tetrachloride measurements and compare against the boundaries of
          Step 2. It can be seen that the value of 350 from Well 3 is greater than UB1 but lower than
          UB2, thus qualifying as a mild high outlier. Also, the measurement 7,066 from the same well
          is higher than UB2 and so qualifies as an extreme high outlier.

Step 4.   Because the box plot outlier screening method assumes roughly symmetric data, recompute
         the  box plot on the log-transformed  measurements (as shown in Figure 12-5 alongside a
         similar box plot of the raw concentrations). Transforming the sample to the log-scale does
         result in much greater symmetry compared to the original  measurement scale.   This can be
         seen in the close similarity between the mean and median  on the log-scale box plot. With a
         more symmetric data set, the mild high outlier from Step 3 disappears, but the  extreme high
         value is still classified as an outlier. -4

    Figure 12-5. Comparative Carbon Tetrachloride Box Plots Indicating Outliers
    [side-by-side box plots on the raw and log scales; flagged outliers marked with #]



12.3 DIXON'S TEST

       BACKGROUND AND PURPOSE

     Dixon's test is helpful for documenting statistical outliers in smaller data sets (i.e., n < 25). The
test is particularly designed for cases where there is only a single high or low outlier, although it can also
be adapted to test for multiple outliers. The test falls in the general class of tests for discordancy (Barnett
and  Lewis,  1994).  The test statistic for such procedures is  generally a ratio: the numerator is  the
difference between the suspected outlier and  some  summary  statistic  of the data set, while  the
denominator is always a measure of spread within the data. In this version of Dixon's test, the  summary
statistic in the numerator is an order statistic nearby to the potential outlier (e.g., the second or third most
extreme value). The measure of spread is essentially the observed sample range.

     If there is more than  one  outlier in the data set, Dixon's test can be vulnerable to masking, at least
for very small samples.  Masking in the statistical literature refers to the problem  of an extreme outlier
being missed because one  or more additional extreme outliers  are also present. For instance, if the data
consist of the values {2, 4, 10, 12, 15, 18, 19, 22, 200, 202}, identification of the maximum value (202)
as an outlier might fail since the maximum by itself is not extreme with respect to the next highest value
(200). However, both of these values are clearly much higher than the rest of the data set and might
jointly be considered outliers.

     If more than one outlier is suspected, the user is encouraged to consider Rosner's test (Section
12.4) as an alternative to Dixon's test, at least if the sample size is 20 or more. If the data set is smaller,
Dixon's test should be modified so that the least extreme of the suspected outliers is tested first. This
will  help avoid the risk  of masking. The same equations given below can be used, but the data set and
sample size  should be temporarily reduced to exclude any suspected outliers that are more extreme than
the one being tested. If a less extreme value is found to be an outlier, then that observation and any more
extreme values can also be regarded as outliers. Otherwise, add back the next most extreme value and
test it in the  same way.

       REQUIREMENTS AND  ASSUMPTIONS

     Dixon's test is only recommended for sample  sizes n < 25.  It assumes that the data set (minus the
suspected outlier) is normally-distributed. This assumption should be checked prior to running Dixon's
test using a goodness-of-fit technique such as  the probability plots described in Section 12.2.

       PROCEDURE

Step 1.   Order the data set and label the ordered values x(1) ≤ x(2) ≤ ... ≤ x(n).

Step 2.   If a "low" outlier is suspected (i.e., the minimum x(1)), compute the test statistic C using the
          appropriate equation [12.5] depending on the sample size (n):
               C = [x(2) - x(1)] / [x(n) - x(1)]          for 3 ≤ n ≤ 7
               C = [x(2) - x(1)] / [x(n-1) - x(1)]        for 8 ≤ n ≤ 10                   [12.5]
               C = [x(3) - x(1)] / [x(n-1) - x(1)]        for 11 ≤ n ≤ 13
               C = [x(3) - x(1)] / [x(n-2) - x(1)]        for 14 ≤ n ≤ 25

Step 3.   If a "high" outlier is suspected (i.e., the maximum x(n)), and again depending on the sample
          size (n), compute the test statistic C using the appropriate equation [12.6] as:

               C = [x(n) - x(n-1)] / [x(n) - x(1)]        for 3 ≤ n ≤ 7
               C = [x(n) - x(n-1)] / [x(n) - x(2)]        for 8 ≤ n ≤ 10                   [12.6]
               C = [x(n) - x(n-2)] / [x(n) - x(2)]        for 11 ≤ n ≤ 13
               C = [x(n) - x(n-2)] / [x(n) - x(3)]        for 14 ≤ n ≤ 25
Step 4.   In either case, given the significance level (a), determine a critical point for Dixon's test with
          n observations from Table 12-1 in Appendix D. If C exceeds this critical point, the suspected
          value should be declared a statistical outlier and investigated further (see discussion in
          Chapter 6).
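
     The sketch below (Python with NumPy) computes the Dixon statistic of equations [12.5] and
[12.6], using the sample-size breakpoints shown above; the critical point must still be taken from
Table 12-1 in Appendix D.

     import numpy as np

     def dixon_statistic(sample, high=True):
         """Dixon's C for a single suspected high (or low) outlier, 3 <= n <= 25."""
         x = np.sort(np.asarray(sample, dtype=float))
         n = len(x)
         if not 3 <= n <= 25:
             raise ValueError("Dixon's test is intended for small samples (n <= 25)")
         if n <= 7:
             j, k = 1, 0        # numerator neighbor depth j, denominator trim k
         elif n <= 10:
             j, k = 1, 1
         elif n <= 13:
             j, k = 2, 1
         else:
             j, k = 2, 2
         if high:               # equation [12.6]: test the maximum
             return (x[-1] - x[-1 - j]) / (x[-1] - x[k])
         else:                  # equation [12.5]: test the minimum
             return (x[j] - x[0]) / (x[-1 - k] - x[0])

     # For the logged carbon tetrachloride data of Example 12-3 below (n = 20), this
     # returns roughly 0.45, to be compared with the 0.450 critical point at a = 0.05.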
       ^EXAMPLE 12-3

     Use the data from Example 12-1 in Dixon's test to determine if the anomalous high value is a
statistical outlier at an a = 0.05 level of significance.

       SOLUTION
Step 1.   In Example 12-1, probability plots of the carbon tetrachloride data indicated that the highest
         value might be an  outlier, but that the distribution of the measurements was more nearly
         lognormal than normal. Since the sample size n = 20, Dixon's test can be used on the logged
         observations. Logging the values and ordering them leads to the following table:
 Order    Concentration (ppb)    Logged Concentration
   1            1.7                    0.531
   2            3.2                    1.163
   3            6.5                    1.872
   4            7.3                    1.988
   5           12.1                    2.493
   6           13.7                    2.617
   7           15.6                    2.747
   8           16.2                    2.785
   9           35.1                    3.558
  10           41.6                    3.728
  11           57.9                    4.059
  12           59.7                    4.089
  13           68.4                    4.225
  14           70.1                    4.250
  15           75.4                    4.323
  16          199.0                    5.293
  17          275.0                    5.617
  18          302.0                    5.710
  19          350.0                    5.878
  20         7066.0                    8.863
Step 2.   Because a high outlier is suspected and n = 20, use the last option of equation [12.6] to
          calculate the test statistic C:

                          C = (8.863 - 5.710) / (8.863 - 1.872) = 0.451

Step 3.   With n = 20 and a = .05, the critical point from Table 12-1 in Appendix D is equal to 0.450.
         Since the test statistic C exceeds this critical point, the extreme high value can be declared a
         statistical outlier.  Before excluding this value from  further  analysis, however,  a valid
         explanation for this unusually high value should be sought. Otherwise, the outlier may need to
         be treated as an extreme but valid concentration measurement. -^
12.4 ROSNER'S TEST

       BACKGROUND AND PURPOSE

     Rosner's test (Rosner,  1975) is a useful method for identifying multiple outliers in moderate to
large-sized data sets. The approach developed in Rosner's method is known as a block-style test. Instead
of testing for outliers one-by-one in a consecutive manner from most extreme to least extreme (i.e., most
to least suspicious), the data are examined first to identify the total number of possible outliers, k. Once k
is determined, the set of possible outliers is tested together as a block. If the test is significant, all k
measurements are regarded as statistical outliers. If not, the set of possible outliers is reduced by one and
the test repeated on the smaller block. This procedure is iterated until either a set of outliers is identified
or none of the observations are labeled an outlier. By testing outliers in blocks instead of one-by-one,
Rosner's test largely avoids the problem of masking of one outlier by another (as discussed in Section
12.3 regarding Dixon's test).

     Although Rosner's test avoids the problem of masking when multiple outliers are present in the
same data set, it is not immune to the related problem of swamping.   A good discussion is found in
Barnett and Lewis, 1994, Outliers in Statistical Data (3rd Edition), p. 236.  Swamping refers to a block
of measurements all being labeled as  outliers even though only some  of the observations are actually
outliers. This can occur with Rosner's test especially if all the outliers tend to be at one end of the data
set (e.g., as upper extremes). The difficulty is in properly identifying the total number of possible outliers
(k), which can be low outliers, high outliers, or some combination of the two extremes. If k is made too
large, swamping may occur.  Again, the  user  is reminded to always  do a preliminary screening for
outliers via box plots (Section 12.2) and probability plots (Section 12.1).

       REQUIREMENTS AND ASSUMPTIONS

     Rosner's test is recommended when the sample size (n) is 20 or larger. The critical points provided
in Table 12-2 in Appendix D can  be used  to identify from 2 to 5 outliers in a given data set. Like
Dixon's test,  Rosner's  method assumes the  underlying data set (minus any outliers)  is  normally
distributed. If a probability plot of the  data exhibits significant bends or curves,  the data should first be
transformed (e.g.,  via a logarithm)  and then re-plotted. The formal  test for outliers should only be
performed on (outlier-deleted) data sets that have been approximately normalized.

     A potential drawback of Rosner's test is that the user must first identify the maximum number of
potential outliers (k) prior to running the test. Therefore, this requirement makes the test ill-advised as an
automatic outlier screening tool, and somewhat reliant on the user to identify candidate outliers.

       PROCEDURE

Step 1.   Order the data set and denote the ordered values x(i). Then by simple inspection, identify the
          maximum number of possible outliers, r0.

Step 2.   Compute the sample mean and standard deviation of all the data; denote these values by x̄(0)
          and s(0). Then determine the measurement furthest from x̄(0) and denote it y(0). Note that y(0)
          could be either a potentially low or a high outlier.

Step 3.   Delete y(0) from the data set and compute the sample mean and standard deviation from the
          remaining observations. Label these new values x̄(1) and s(1). Again find the value in this
          reduced data set furthest from x̄(1) and label it y(1).

Step 4.   Delete y(1), recompute the mean and standard deviation, and continue this process until all r0
          potential outliers have been removed. At this point, the following set of statistics will be
          available:

               (x̄(0), s(0), y(0)), (x̄(1), s(1), y(1)), ..., (x̄(r0-1), s(r0-1), y(r0-1))            [12.7]

Step 5.   Now test for r outliers (where r ≤ r0) by iteratively computing the test statistic:

               R(r-1) = | y(r-1) - x̄(r-1) | / s(r-1)                                        [12.8]

          First test for r0 outliers. If the test statistic R(r-1) in equation [12.8] exceeds the first critical
          point from Table 12-2 in Appendix D based on sample size (n) and the Type I error (a),
          conclude there are r0 outliers. If not, test for r0-1 outliers in the same fashion using the next
          critical point, continuing until a certain number of outliers have either been identified or
          Rosner's test finds no outliers at all.
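
     A compact sketch of Steps 1 through 5 is shown below (Python with NumPy). It returns the
successive statistics of equation [12.8]; each would then be compared against the appropriate critical
point from Table 12-2 in Appendix D, starting with the block of r0 suspected outliers.

     import numpy as np

     def rosner_statistics(sample, r0):
         """Return [(R_{r-1}, deleted value) for r = 1..r0]; entry r-1 tests for r outliers."""
         data = list(np.asarray(sample, dtype=float))
         results = []
         for _ in range(r0):
             mean, sd = np.mean(data), np.std(data, ddof=1)
             y = max(data, key=lambda v: abs(v - mean))   # observation furthest from the mean
             results.append((abs(y - mean) / sd, y))
             data.remove(y)
         return results

     # For the naphthalene data of Example 12-4 with r0 = 2, the second entry reproduces
     # R1 = |23.23 - 5.23| / 4.326, or about 4.16.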

       ^EXAMPLE  12-4

     Consider the following series of 25 background naphthalene measurements (in ppb). Use Rosner's
test to determine whether any of the values should be deemed statistical outliers.
                        Naphthalene Concentrations (ppb)
 Qtr        BW-1       BW-2       BW-3       BW-4       BW-5
  1         3.34       5.59       1.91       6.12       8.64
  2         5.39       5.96       1.74       6.05       5.34
  3         5.74       1.47      23.23       5.18       5.53
  4         6.88       2.57       1.82       4.43       4.42
  5         5.85       5.39       2.02       1.00      35.45
       SOLUTION
Step 1 .   Screening with probability plots of the combined data indicates a less than linear fit with both
         the raw measurements  and log-transformed  data (see Figures 12-6  and 12-7);  two points
         appear rather discrepant from the rest. Correlation coefficients for these plots are 0.740 on the
         concentration  scale and 0.951 on the log-scale. Re-plotting after removing the two possible
         outliers gives a substantially improved correlation on the concentration  scale of 0.958 but
         reduces the log-scale correlation to 0.929. Normality appears to be a slightly better default
         distribution for the outlier-deleted data set. Run Rosner's test on the original data with k = 2
         possible outliers.

Step 2.   Compute the  mean and  standard deviation  of the  complete data  set.  Then identify the
         observation farthest from the mean. These results are listed, along with the ordered data, in the
         table below. After removing the farthest value (35.45), recompute the mean and standard
         deviation on the remaining values and again identify the most discrepant observation (23.23).
         Repeat  this process one more time so that both suspected outliers have been removed (see
         table below).
Step 3.   Now test for 2 joint outliers by computing Rosner's statistic on subset SS(k-1) = SS1 using
          equation [12.8]:

                          R1 = | 23.23 - 5.23 | / 4.326 = 4.16
                      Figure 12-6. Naphthalene Probability Plot
                      [x-axis: Naphthalene (ppb)]
                     Figure 12-7. Log Naphthalene Probability Plot
                     [x-axis: Log Naphthalene log(ppb)]
                  Successive Naphthalene Subsets (SSj)
        SS0            SS1            SS2
        1.00           1.00           1.00
        1.47           1.47           1.47
        1.74           1.74           1.74
        1.82           1.82           1.82
        1.91           1.91           1.91
        2.02           2.02           2.02
        2.57           2.57           2.57
        3.34           3.34           3.34
        4.42           4.42           4.42
        4.43           4.43           4.43
        5.18           5.18           5.18
        5.34           5.34           5.34
        5.39           5.39           5.39
        5.39           5.39           5.39
        5.53           5.53           5.53
        5.59           5.59           5.59
        5.74           5.74           5.74
        5.85           5.85           5.85
        5.96           5.96           5.96
        6.05           6.05           6.05
        6.12           6.12           6.12
        6.88           6.88           6.88
        8.64           8.64           8.64
       23.23          23.23
       35.45

    x̄0 = 6.44      x̄1 = 5.23      x̄2 = 4.45
    s0 = 7.379     s1 = 4.326     s2 = 2.050
    y0 = 35.45     y1 = 23.23     y2 = 8.64
Step 4.   Given a = 0.05, a sample size of n = 25, and k = 2, the first critical point in Table 12-2 in
          Appendix D equals 2.83 for n = 20 and 3.05 for n = 30. The value R1 in Step 3 is larger than
          either of these critical points, so both suspected values may be declared statistical outliers by
          Rosner's test at the 5% significance level. Before excluding these values from further analysis,
          however, a valid explanation for them should be found. Otherwise, treat the outliers as
          extreme but valid concentration measurements.

          Note: had R1 been less than these values, a test could still be run for a single outlier using the
          second critical point for each sample size (or a critical point interpolated between them). ◄

       The guidance considers Dixon's and Rosner's outlier evaluation methods preferable for
groundwater monitoring data situations, when assumptions of normality are reasonable and data are
quantified.  We did not include the older method found in the 1989 guidance based on ASTM paper
E178-75, which can still be used as an alternative.  Where data do not appear to be fit by a normal or
transformably normal distribution, other robust outlier evaluation methods can be considered from the
wider statistical literature.  The literature will also need to be consulted when data contain non-detect
values along with potential outliers.
                CHAPTER  13.  SPATIAL VARIABILITY
        13.1   INTRODUCTION TO SPATIAL VARIATION	13-1
        13.2   IDENTIFYING SPATIAL VARIABILITY	13-2
          13.2.1  Side-by-Side Box Plots	13-2
          13.2.2  One-Way Analysis of Variance for Spatial Variability	13-5
        13.3   USING ANOVA TO IMPROVE PARAMETRIC INTRAWELL TESTS	13-8
     This chapter discusses a type of statistical dependence in groundwater monitoring data known as
spatial  variability.  Spatial  variability  exists when  the distribution  or pattern of  concentration
measurements changes from well location to well  location (most typically in the form of differing mean
levels).  Such variation may be natural or synthetic, depending on whether it is caused by natural or
anthropogenic factors. Methods for identifying spatial variation  are detailed  via the use of box  plots
(Section 13.2.1) and analysis of variance [ANOVA] (Section 13.2.2). Once identified, ANOVA can
sometimes be employed to construct more powerful intrawell background limits. This topic is addressed
in Section 13.3.

13.1 INTRODUCTION  TO  SPATIAL  VARIATION

     Spatial dependence,  spatial variation or variability, and spatial correlation are closely related
concepts. All refer to the notion of measurement levels that vary in a structured way as a function of the
location  of sampling.  Although  spatial  variation  can apply to any statistical  characteristic of the
underlying population  (including the  population variance or upper percentiles),  the usual sense in
groundwater monitoring is that mean levels of a given constituent vary from one well to the next.

     Standard geostatistical models posit that an area exhibits positive spatial  correlation  if any two
sampling locations share a greater similarity in concentration level the closer the distance between them,
and more dissimilarity the further apart they are.  Such models have been applied to both groundwater
and soil  sampling problems,  but are  not applicable  in  all  geological configurations. It may be, for
instance, that mean concentration levels differ across wells but vary in a seemingly random way with no
apparent connection to the distance between the  sampling points. In that  case, the concentrations
between  pairs of wells are not correlated with distance, yet the measurements  within  each well are
strongly  associated  with the mean level  at that  particular location,  whether due to  a change in soil
composition, hydrological characteristics  or some other  factor. In other words,  spatial variation may
exist even when spatial correlation does not.

     Spatial variation is  important in the guidance  context since substantial  differences in mean
concentration  levels between  different wells can invalidate  interwell,  upgradient-to-downgradient
comparisons and point instead toward intrawell tests (Chapter 6).  Not all spatial variability is natural.
Average concentration levels can vary from well to well for a variety of reasons.

     In this guidance, a  distinction is occasionally made between natural versus  synthetic spatial
variation. Natural  spatial  variability  refers to a pattern of changing mean levels  in groundwater
associated with normal geochemical behavior unaffected by human activities. Natural spatial variability
is not an indication of groundwater contamination, even if concentrations at one or more compliance
wells exceed (upgradient) background.  In contrast, synthetic spatial variability is  related  to human
activity. Sources can include recent releases affecting compliance wells, migration of contaminants from
off-site sources,  or historic contamination  at certain wells due to past industrial activity or pre-RCRA
waste disposal.  Whether natural or synthetic, techniques  and  test methods for dealing with  spatial
variation will still be identical from a purely statistical standpoint. It is interpreting the testing outcomes
which will necessitate a consideration of why the spatial variation occurs.

     The goal of groundwater analysis is not simply to identify significant concentration differences
among monitoring wells  at compliance point locations.  It is also to determine why those differences
exist. Especially with prior groundwater contamination, regulatory decisions outside the scope  of this
guidance need to address the problem. In  some cases, compliance/assessment monitoring or remedial
action may be warranted.  In other cases, chronic contamination from offsite sources may simply have to
be considered as the current background condition at a given location. At least the ability to attribute
certain mean differences to natural spatial variation allows the range of potential  concerns to be
somewhat narrowed. Of course, deciding that an observed pattern of spatial variation is natural and not
synthetic may not be easy. Ultimately, expert judgment and knowledge concerning  site  hydrology,
geology and geochemistry are important in providing more definitive answers.

     One statistical approach to use when a site has multiple,  non-impacted background wells is to
conduct a one-way ANOVA for inorganic constituents on those wells. Substantial differences among the
mean levels  at a set of uncontaminated sampling locations are suggestive of natural spatial variability. At
a true 'greenfield' site, ANOVA can be run on all the wells — both background and compliance — after
a few preliminary sampling rounds have been collected.

     The Unified Guidance offers two basic tools to explore and test for spatial correlation. The first,
side-by-side box plots (Section 13.2.1), provides a quick screen for possible spatial variation. When
multiple well data are plotted on the same concentration axis, noticeably staggered boxes are often an
indication of significantly different mean levels.

     A more formal test of spatial variation is the one-way ANOVA (Section 13.2.2). When significant
spatial variation exists and an intrawell test strategy is pursued, one-way ANOVA can also be used to
adjust the standard deviation estimate used in forming intrawell prediction and control chart limits, and
to increase the effective sample size of the test, via the degrees of freedom. This is discussed in Section
13.3.

13.2 IDENTIFYING  SPATIAL VARIABILITY

13.2.1       SIDE-BY-SIDE  BOX  PLOTS

       BACKGROUND AND PURPOSE

     Box plots  for graphing side-by-side  statistical summaries  of multiple wells were introduced in
Chapter 9.  They are also discussed in Chapter 11 as an initial screen for differences in population
variances and as a tool to check the assumption of equal variances in ANOVA. They can  further be
employed to screen for possible spatial variation in mean levels. While variability in a sample from a
given well is roughly indicated by the length of the box, the average concentration level is indicated by
the  position of the box relative to the concentration axis. Many standard box plot  software routines
display both  the sample median value and  the sample mean  on each box, so these values may be
compared from well to well. A high degree of staggering in the box positions is then indicative of
potentially significant spatial variation.

     Since side-by-side box plots provide a picture of the variability at each well, the extent to which
apparent  differences in mean levels seem  to be real rather than chance fluctuations  can be examined. If
the boxes are staggered but there is substantial overlap between them, the degree of spatial variability
may not be significant. A more formal ANOVA might still be warranted as a follow-up test, but side-by-
side box plots will offer an initial sense of how spatially variable the groundwater data appear.

       REQUIREMENTS,  ASSUMPTIONS AND PROCEDURE

     Requirements, assumptions and the procedure for box plots are outlined in Chapter 9, Section 9.2.

       ^EXAMPLE 13-1

     Quarterly dissolved iron concentrations measured at each of six upgradient wells are listed below.
Construct side-by-side box plots to initially screen for the presence of spatial variability.
                              Iron Concentrations (ppm)
 Date         Well 1     Well 2     Well 3     Well 4     Well 5     Well 6
 Jan 1997      57.97      46.06     100.48      34.12      60.95      83.10
 Apr 1997      54.05      76.71     170.72      93.69      72.97     183.09
 Jul 1997      29.96      32.14      39.25      70.81     244.69     198.34
 Oct 1997      46.06      68.03      52.98      83.10     202.35     160.77
 Mean          47.01      55.71      90.86      70.43     145.24     156.32
 Median        55.06      57.04      76.73      76.96     137.66     171.93
       SOLUTION
Step 1.   Determine the median, mean, lower and upper quartiles of each well. Then plot these against a
          concentration axis to form side-by-side box plots (Figure 13-1) using the procedure in
          Section 9.2.

Step 2.   From this plot, the means and  medians at the last two  wells (Wells 5 and 6) appear elevated
         above the rest. This is a possible indication of spatial variation. However, the variances as
         represented by the box lengths also appear to differ, with the highest means associated with
         the largest boxes.  A transformation of the data should be  attempted and the  data re-plotted.
         Spatial variability is only a significant problem if it is apparent on the scale of the data actually
         used for statistical analysis.

Step 3.   Take  the logarithm of each measurement as in the table below. Recompute the mean, median,
         lower and upper quartiles, and then re-construct the box plot as in Figure 13-2.
                           Log Iron Concentrations log(ppm)
 Date         Well 1     Well 2     Well 3     Well 4     Well 5     Well 6
 Jan 1997      4.06       3.83       4.61       3.53       4.11       4.42
 Apr 1997      3.99       4.34       5.14       4.54       4.29       5.21
 Jul 1997      3.40       3.47       3.67       4.26       5.50       5.29
 Oct 1997      3.83       4.22       3.97       4.42       5.31       5.08
 Mean          3.82       3.96       4.35       4.19       4.80       5.00
 Median        3.70       4.02       4.29       4.34       4.80       5.14
                      Figure 13-1. Side-by-Side Iron Box Plots
                      [box plots of iron (ppm) for Wells 1 through 6]
                     Figure 13-2. Side-by-Side Log(Iron) Box Plots
                     [box plots of log iron, log(ppm), for Wells 1 through 6]
Step 4.   While more nearly similar on the log-scale, the means and medians are still elevated in Wells
         5 and  6.  Since  the  differences in  box lengths  are much  less on  the  log-scale, the log
         transformation has worked to  somewhat stabilize  the variances. These data should be tested
         formally for significant spatial  variation using an ANOVA, probably on the log-scale. -^
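
     The box plot screen in this example is straightforward to reproduce; a sketch assuming pandas and
Matplotlib is shown below, using the iron measurements tabulated above.

     import numpy as np
     import pandas as pd
     import matplotlib.pyplot as plt

     iron = pd.DataFrame({
         "Well 1": [57.97, 54.05, 29.96, 46.06],   "Well 2": [46.06, 76.71, 32.14, 68.03],
         "Well 3": [100.48, 170.72, 39.25, 52.98], "Well 4": [34.12, 93.69, 70.81, 83.10],
         "Well 5": [60.95, 72.97, 244.69, 202.35], "Well 6": [83.10, 183.09, 198.34, 160.77],
     })

     fig, axes = plt.subplots(1, 2, figsize=(10, 4))
     iron.boxplot(ax=axes[0])                    # raw scale (compare Figure 13-1)
     axes[0].set_ylabel("Iron (ppm)")
     np.log(iron).boxplot(ax=axes[1])            # log scale (compare Figure 13-2)
     axes[1].set_ylabel("Log iron log(ppm)")
     plt.tight_layout()
     plt.show()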

13.2.2      ONE-WAY ANALYSIS OF  VARIANCE FOR SPATIAL VARIABILITY

       PURPOSE AND BACKGROUND

     Chapter 17 presents Analysis of Variance [ANOVA] in greater  detail.  When using ANOVA to
check for spatial variability,  the observations from each well  are taken as a single group. Significant
differences between data groups represent monitoring wells with different mean concentration levels.
The lack of significant well mean  differences may afford  an opportunity to pool the data  for larger
background sizes and conduct interwell detection monitoring tests.

     ANOVA used for this purpose should be performed either on a set of  multiple non-impacted
upgradient  wells,  or  on historically uncontaminated  compliance and upgradient background wells. If
significant mean differences exist among naturally occurring  constituent data at upgradient wells, natural
spatial variability is the likely reason. Synthetic constituents in upgradient wells might also exhibit
spatial differences if affected by an off-site plume. Presumably, if the flow gradient has been correctly
assessed and no migration of contaminants from off-site has occurred, differences in mean levels across
upgradient wells ought to signal the influence of factors not attributable to  a  monitored release. A
similar, but  potentially  weaker,  argument  can be  made  if  spatial  differences  exist  between
uncontaminated  historical  data  at compliance  wells.  The  lack of  spatial  differences  between
uncontaminated compliance and upgradient background well data may again allow for even larger
background sample sizes.

       REQUIREMENTS AND ASSUMPTIONS

     The basic assumptions and data requirements for one-way ANOVA are presented in Section 17.1.
If the assumption that the observations are statistically independent over time is not met, both identifying
spatial  variability using ANOVA as well as improving intrawell prediction limits and control charts can
be impacted.  It is usually difficult to verify that the measurements are temporally independent with only
a limited number of observations per well.  This potential problem can be somewhat minimized by
collecting  samples far  enough apart in time to guard  against autocorrelation.  Another option is to
construct a parallel time series plot (Chapter 14) to  look for time-related effects or dependencies
occurring simultaneously across the  set of wells.

     If a  significant temporal dependence or autocorrelation  exists, the one-way ANOVA can still
identify well-to-well mean level differences. But the power of the test to do so is lessened. If a parallel
time series plot indicates a potentially strong time-related effect, a two-way ANOVA including temporal
effects  can be performed to test and correct for a significant  temporal  factor. This slightly  more
complicated procedure is discussed in Davis (1994).

     Another key  assumption of  parametric ANOVA is  that the residuals are normal or  can be
normalized. If a normalizing transformation cannot be found, a test for spatial variability can be made
using the Kruskal-Wallis non-parametric ANOVA (Chapter 17).  As long as the measurements can be
ranked, average ranks that differ significantly across wells provide evidence of spatial variation.

       PROCEDURE

Step 1.   Assuming there are p distinct wells to test, designate the measurements from each well as a
         separate group for purposes of computing the ANOVA. Then follow Steps 1 through 7 of the
         procedure in Section 17.1.1  to  compute the overall F-statistic and  the  quantities of the
         ANOVA table in Figure 13-3 below.
                    Figure 13-3.  One-Way Parametric ANOVA Table

 Source of Variation    Sums of Squares   Degrees of Freedom   Mean Squares                 F-Statistic
 Between Wells          SSwells                 p - 1          MSwells = SSwells/(p-1)      F = MSwells/MSerror
 Error (within wells)   SSerror                 n - p          MSerror = SSerror/(n-p)
 Total                  SStotal                 n - 1

Step 2.   To test the hypothesis of equal means for all p wells, compare the F-statistic from Step 1 to the
          a-level critical point found from the F-distribution with (p-1) and (n-p) degrees of freedom in
          Table 17-1 of Appendix D. Usually a is taken to be 5%, so that the needed comparison value
          equals the upper 95th percentage point of the F-distribution. If the observed F-statistic exceeds
          the critical point (F.95,p-1,n-p), reject the hypothesis of equal well population means and
          conclude there is significant spatial variability. Otherwise, the evidence is insufficient to
          conclude there are significant differences between the means at the p wells.

       ^EXAMPLE  13-2

     The iron concentrations in Example 13-1 show evidence of spatial variability in side-by-side box
plots. Tested for equal variances and normality, these same data are best fit by a lognormal distribution.
The  statistics for natural logarithms of the iron measurements are shown below;  individual log data are
provided in the Example 13-1 second table. Compute  a one-way parametric ANOVA to  determine
whether there is significant spatial variation at the a = .05 significance level.
                      Log Iron Concentration Statistics log(ppm)
            Well 1     Well 2     Well 3     Well 4     Well 5     Well 6
 N             4          4          4          4          4          4
 Mean        3.820      3.965      4.348      4.188      4.802      5.000
 SD          0.296      0.395      0.658      0.453      0.704      0.396

                               Grand Mean = 4.354

       SOLUTION
Step 1.   With 6 wells and 4 observations per well, ni = 4 for all the wells. The total sample size is n =
          24 and p = 6. Compute the (overall) grand mean and the sample mean concentrations in each
          of the well groups using equations [17.1] and [17.2]. These values are listed (along with each
          group's standard deviation) in the above table.

Step 2.   Compute the sum of squares due to well-to-well differences using equation [17.3]:

             SSwells = [4(3.820)² + 4(3.965)² + ... + 4(5.000)²] - 24(4.354)² = 4.294

         This quantity has (6 - 1) = 5 degrees of freedom.

Step 3.   Compute the corrected total sum of squares using equation [17.4] with («-!) = 23 df:

                     SStotal = [(4.06)² + ... + (5.08)²] - 24(4.354)² = 8.934

Step 4.   Obtain the within-well or error sum of squares by subtraction using equation [17.5]:

                                  SSerror = 8.934 - 4.294 = 4.640

         This quantity has (n -p) = 24-6 = 18 degrees of freedom.

Step 5.   Compute the well and error mean sum of squares using equations [17.6] and [17.7]:

                                   MSwells = 4.294/5 = 0.859

                                   MSerror = 4.640/18 = 0.258

Step 6.   Construct the F-statistic and the one-way ANOVA table, using Figure 13-3 as a guide:
 Source of Variation    Sums of Squares   Degrees of Freedom   Mean Squares   F-Statistic
 Between Wells               4.294                5                0.859      F = 0.859/0.258 = 3.33
 Error (within wells)        4.640               18                0.258
 Total                       8.934               23
Step 7.   Compare the observed F-statistic of 3.33 against the critical point taken as the upper 95th
          percentage point from the F-distribution with 5 and 18 degrees of freedom. Using Table 17-1
          of Appendix D, this gives a value of F.95,5,18 = 2.77. Since the F-statistic exceeds the critical
         point, the null hypothesis of equal  well means can be rejected, suggesting the presence of
         significant spatial variation. -^
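
     The ANOVA in Example 13-2 can be reproduced with a few lines of Python (assuming SciPy),
using the logged iron data from Example 13-1:

     import numpy as np
     from scipy import stats

     log_iron = [                       # natural logs of the Example 13-1 iron data, by well
         [4.06, 3.99, 3.40, 3.83], [3.83, 4.34, 3.47, 4.22], [4.61, 5.14, 3.67, 3.97],
         [3.53, 4.54, 4.26, 4.42], [4.11, 4.29, 5.50, 5.31], [4.42, 5.21, 5.29, 5.08],
     ]
     F, p_value = stats.f_oneway(*log_iron)
     n = sum(len(g) for g in log_iron)
     p = len(log_iron)
     crit = stats.f.ppf(0.95, p - 1, n - p)       # upper 95th percentage point, F(5, 18)
     print(round(F, 2), round(crit, 2))           # approximately 3.33 and 2.77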

13.3 USING ANOVA TO IMPROVE PARAMETRIC  INTRAWELL TESTS

     BACKGROUND AND PURPOSE

     Constituents that exhibit  significant spatial variability usually should be formally tested with
intrawell procedures such as a prediction limit or control chart. Historical data from each compliance
well are used  as background for these tests  instead of from upgradient wells. At an early stage of
intrawell testing,  there may only be  a few  measurements  per well which can be designated as
background. Depending on the number of  statistical tests that need to be performed across the
monitoring network, available intrawell background at individual compliance wells may not provide
sufficient statistical power or meet the false positive rate criteria (Chapter 19).

     One remedy first suggested by Davis  (1998) can increase the degrees  of freedom of the test by
using one-way ANOVA results (Section 13.2) from a number of wells to  provide an alternate estimate
of the  average intrawell variance.  In constructing a parametric intrawell  prediction limit for a single
compliance well, the intrawell background of sample size n is used to compute a well-specific sample
mean (x̄). The intrawell standard deviation (s) is replaced by the root mean squared error [RMSE]
component from an ANOVA of the intrawell background associated with a series of compliance and/or
background wells.1 This raises the degrees of freedom from (n−1) to (N−p), where N is the total sample
size across the group of wells input to the ANOVA and p is the number of distinct wells.

1  RMSE is another name for the square root of the mean error sum of squares (MS_error) in the ANOVA table of Figure 13-3.
                                                                                   March 2009

-------
Chapter 13.  Spatial Variability	Unified Guidance

     As an example of the difference this adjustment can make, consider a site with 6 upgradient wells
and 15 compliance wells. Assuming n =  6 observations per well that have been collected over the last
year, a total of 36 potential background measurements are available to construct an interwell test. If there
is significant natural spatial variation in the mean levels from well to well, an interwell test is probably
not appropriate.  Switching to an  intrawell  method is  the next  best solution, but  with only  six
observations  per compliance  well,  either  the  power of an intrawell test  to identify contaminated
groundwater is likely to be quite low (even with retesting) or the site-wide false positive rate [SWFPR]
will exceed the recommended target.

     If the six upgradient wells were tested for spatial variability using a one-way ANOVA (presuming
that the equal variance assumption is met), the degrees of freedom [df] associated with the mean error
sum of squares term is (6 wells × 5 df per well) = 30 df (see Section 13.2). Thus by substituting the
RMSE in place of each compliance well's intrawell standard deviation (s), the degrees of freedom for
the modified intrawell prediction or control chart limit is 30 instead of 5.

     ANOVA can  be usefully employed in this  manner since the RMSE is very close to being a
weighted average of the individual well  sample standard deviations. As  such, it  can be  considered a
measure of average within-well variability across the wells input to the ANOVA. Substituting the RMSE
for s at  an individual well consequently provides a better estimate of the typical within-well variation,
since the RMSE is based on levels of fluctuation averaged across several wells. In addition, the number
of observations used to construct the RMSE is much greater than  the n values used to  compute  the
intrawell sample  standard deviation (s).  Since  both statistical measures  are estimates of within-well
variation, the  RMSE with its larger degrees of freedom is generally a superior estimate if certain
assumptions are met.

     REQUIREMENTS AND ASSUMPTIONS

     Using ANOVA to bolster parametric intrawell prediction or control chart limits will not work at
every site or for every constituent. Replacement of the well-specific, intrawell sample standard deviation
(s) by the RMSE from ANOVA assumes  that the true within-well variability is approximately the same
at  all the  wells for which  an intrawell  background limit (i.e., prediction  or control chart) will  be
constructed, and not just those wells tested in the ANOVA procedure. This last assumption can be
difficult to verify if the ANOVA includes only background or upgradient wells. But to the extent that
uncontaminated intrawell background measurements from compliance point wells can be  included,  the
ANOVA should be run on all  or a substantial fraction of the site's wells (excluding those which might
already  be  contaminated). Whatever mix of upgradient  and downgradient wells are  included  in  the
ANOVA, the  purpose of the  procedure  is not to identify groundwater contamination, but rather to
compute a better and more powerful estimate of the  average intrawell standard deviation.

     For the ANOVA to be valid and the  RMSE to be  a reasonable estimate of average within-well
variability,  a formal check of the equal variance assumption should be conducted using Chapter 11
methods. A spatially variable constituent  will often  exhibit well-specific  standard deviations that
increase with the well-specific mean concentration.  Equalizing the variances in these cases will require a
data transformation, with an ANOVA conducted on the transformed data. Ultimately, any transformation
applied to the wells in the ANOVA also needs to be applied to intrawell background before computing
intrawell prediction or control chart limits. The same transformation has to be appropriate for both sets
of data (i.e., wells included in ANOVA and intrawell background at wells for which background limits
are desired).

     Even when the  ANOVA procedure described in this section is utilized, the resulting intrawell
limits  should also  be designed  to incorporate retesting. When intrawell background  is employed to
estimate both  a well-specific background  mean  (x)  and well-specific  standard deviation (s), the
Appendix D tables associated with Chapters 19 and 20 can be used to look up the intrawell sample size
(n) and number of wells (w) in the network in order to find a prediction or control  chart multiplier that
meets the targeted SWFPR and has acceptable statistical power. However, these tables implicitly assume
that the degrees of freedom [df] associated with the test equal (n−1). The ANOVA method of this
section results in a much larger df, and more importantly, in a df that does not 'match' the intrawell
sample size (n).

     Consequently, the parametric multipliers in the Appendix D tables cannot be directly used when
constructing  prediction or control chart limits with retesting. Instead, a multiplier must be computed for
the specific  combination of n  and df computed  as a  result of the ANOVA.  Tabulating  all  such
possibilities would  be prohibitive.  For prediction limits, the Unified Guidance recommends the free-of-
charge, open source  R statistical  computing  environment.   A pre-scripted program  is included in
Appendix C that can  be run in R to calculate appropriate prediction limit multipliers, once the user has
supplied an intrawell sample size (n), network size (w), and type of retesting scheme.

     If guidance users are unable to utilize the R-script approach, the following approximation for the
well-specific prediction limit K-factors is suggested based on EPA Region 8 Monte Carlo evaluations.
Given a per-test confidence level of 1−α, r total tests of w·c well-constituents, an individual well sample
size n_i, a pooled variance sample size of n_df = df + 1, and K_ndf,1−α obtained from annual intrawell Unified
Guidance tables, the individual well K_ni,1−α factor can be estimated using the following equation:
                                    ~    Kndf,\-a
                                                                       m*=-
                                                                           A.^
     where μ = 1 for future 1:m observations or μ is the size of a future mean.  The value of m* is
specific to each of the nine parametric prediction limit tests and is a function of the three coefficients A,
b and c, individual well sample size n_i and r tests. For a 1:1 test of future means or observations, the
equation is exact; for higher order 1:m tests, the results are approximate.2 The equation is also useful in
gauging R-script method results.  Another virtue of this equation is that it can be readily applied to
different individual well sample sizes based on the common K_ndf,1−α for pooled variance data.

2  For each of the nine prediction limit tests, the following coefficients (A, b & c) are recommended:  a 1:2 future
values test (1.01, .0524 & .0158); a 1:3 test (1.63, .108 & .0407); a 1:4 test (2.41, .157 & .0668); the modified California
plan (1.36, .103 & .0182); a 1:1 mean size 2 test (.5, 0 & 0); a 1:2 mean size 2 test (.898, .0856 & .0172); a 1:3 mean size 2
test (1.27, .168 & .0363); a 1:1 mean size 3 test (.5, 0 & 0); and a 1:2 mean size 3 test (.817, .108 & .0158). The
coefficients were obtained from regression analysis; approximation values were compared with R-script values for K-factors.
In 1260 comparisons of the seven tests using repeat values (m > 1), 86% of the approximations lay within or equal to ±1% of
the true value and 96% within or equal to ±2%. The 1:4 test had the greatest variability, but all values lay within ±4%.  81%
of the values lay within or equal to ±.01 K-units and 93% less than or equal to ±.02 units.

     A less elegant solution is available for intrawell control charts. Currently, an appropriate multiplier
needs to be  simulated via Monte Carlo  methods.  The approach is to simulate  separate normally-
distributed  data sets for the background mean based on n measurements, and the background standard
deviation based on df + 1 measurements. Statistical independence of the sample mean (x̄) and standard
deviation (s) for normal populations allows this to work. With the background mean and standard
deviation available, a series of possible multipliers (h) can be investigated in simulations of control chart
performance.  The multiplier which meets the targeted SWFPR and provides acceptable power should be
selected. Further detail is presented in Chapter 20.  R can also be used to conduct these simulations.
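
     A deliberately simplified sketch of that simulation idea is given below. It is not the full control
chart evaluation: it checks only a Shewhart-type exceedance of the limit x̄ + h·s over a hypothetical run
of eight future sampling events, ignoring the CUSUM portion and any retesting rules, and the values of
n, df, h, and the number of simulations are arbitrary placeholders.

      # Highly simplified sketch of the Monte Carlo idea: draw the background mean from n values
      # and the background SD from df + 1 values (independently), then see how often a candidate
      # multiplier h is exceeded by future data; n, df, h, and the 8 future events are hypothetical
      set.seed(100)
      n <- 6; df <- 30; h <- 4.5; nsim <- 10000
      exceed <- replicate(nsim, {
        xbar <- mean(rnorm(n))              # background mean based on n measurements
        s    <- sd(rnorm(df + 1))           # background SD based on df + 1 measurements
        any(rnorm(8) > xbar + h * s)        # Shewhart-type exceedance over 8 future events
      })
      mean(exceed)                          # simulated per-well false positive rate for this h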

       ►EXAMPLE 13-3

     The  logged iron concentrations from  Example  13-2  showed  significant evidence of spatial
variability.  Use the results of the one-way ANOVA to compute adjusted intrawell prediction limits
(without retesting) for each of the wells in that example and compare them to the unadjusted prediction
limits.

     SOLUTION
Step 1.   Summary statistics by well for the logged iron measurements are listed in the table below.
          With n = 4 measurements per well, use equation [13.1] and t_1−α,n−1 = t.99,3 = 4.541 from Table
          16-1 in Appendix D to compute at each well an unadjusted 99% intrawell prediction limit for
          the next single measurement, based on lognormal data:

                            PL  =  exp[ x̄ + t.99,3 · s · √(1 + 1/n) ]                         [13.1]


                          Unadjusted 99% Prediction Limits for Iron (ppm)

                      Well 1     Well 2     Well 3     Well 4     Well 5     Well 6
        Log-mean       3.820      3.965      4.348      4.188      4.802      5.000
        Log-SD         0.296      0.395      0.658      0.453      0.704      0.396
        n                  4          4          4          4          4          4
        t.99,3         4.541      4.541      4.541      4.541      4.541      4.541
        99% PL         204.9      391.6     2183.0      657.0     4341.5     1108.1

Step 2.   Use the RMSE (i.e., square root of the mean error sum of squares [MS_error] component) of the
          ANOVA in Example 13-2 as an estimate of the adjusted, pooled standard deviation, giving
          √MS_error = √0.258 = 0.5079. The degrees of freedom (df) associated with this pooled standard
          deviation is p(n − 1) = 6(3) = 18, the same as listed in the ANOVA table of Example 13-2.
Step 3.   Use equation [13.2], along with the adjusted pooled standard deviation and its associated df, to
          compute an adjusted 99% prediction limit for each well, as given in the table below. Note that
          the adjusted t-value based on the larger df is t_1−α,df = t.99,18 = 2.552.

                          PL  =  exp[ x̄ + t.99,18 · RMSE · √(1 + 1/n) ]                       [13.2]

                           Adjusted 99% Prediction Limits for Iron (ppm)

                      Well 1     Well 2     Well 3     Well 4     Well 5     Well 6
        Log-mean       3.820      3.965      4.348      4.188      4.802      5.000
        RMSE          0.5079     0.5079     0.5079     0.5079     0.5079     0.5079
        df                18         18         18         18         18         18
        t.99,18        2.552      2.552      2.552      2.552      2.552      2.552
        99% PL         194.3      224.6      329.4      280.7      518.7      632.3

Step 4.   Compare the adjusted and unadjusted lognormal prediction limits. By estimating the average
          intrawell standard deviation using ANOVA, the adjusted prediction limits are significantly
          lower and thus more powerful than the unadjusted limits, especially at Wells 3, 5, and 6.

          In this example, use of the R-script approach was unnecessary, since the corresponding K-
          multiple used in 1-of-1 prediction limit tests can be directly derived analytically. ◄
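
     A brief R sketch of this comparison is shown below, using the prediction limit form given in
equations [13.1] and [13.2] and the rounded summary statistics from the tables above; because the
inputs are rounded, the computed limits can differ slightly from the tabulated values.

      # Sketch: unadjusted vs. ANOVA-adjusted 99% lognormal prediction limits (Example 13-3 inputs)
      logmean <- c(3.820, 3.965, 4.348, 4.188, 4.802, 5.000)  # well log-means
      logsd   <- c(0.296, 0.395, 0.658, 0.453, 0.704, 0.396)  # well log-SDs
      n       <- 4                                            # intrawell background size
      rmse    <- sqrt(0.258)                                  # pooled SD = sqrt(MS_error)
      df.pool <- 6 * (n - 1)                                  # p(n - 1) = 18

      pl.unadj <- exp(logmean + qt(0.99, n - 1)   * logsd * sqrt(1 + 1/n))   # eq. [13.1]
      pl.adj   <- exp(logmean + qt(0.99, df.pool) * rmse  * sqrt(1 + 1/n))   # eq. [13.2]
      round(rbind(unadjusted = pl.unadj, adjusted = pl.adj), 1)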


               CHAPTER 14.   TEMPORAL VARIABILITY

        14.1   TEMPORAL DEPENDENCE	14-1
        14.2   IDENTIFYING TEMPORAL EFFECTS AND CORRELATION	14-3
          14.2.1   Parallel Time Series Plots	14-3
          14.2.2   One-Way Analysis of Variance for Temporal Effects	14-6
          14.2.3   Sample Autocorrelation Function	14-12
          14.2.4   Rank von Neumann Ratio Test	14-16
        14.3   CORRECTING FOR TEMPORAL EFFECTS AND CORRELATION	14-19
          14.3.1   Adjusting the Sampling Frequency and/or Test Method.	14-19
          14.3.2   Choosing a Sampling Interval Via Darcy's Equation	14-20
          14.3.3   Creating Adjusted, Stationary Measurements	14-28
          14.3.4   Identifying Linear Trends Amidst Seasonality: Seasonal Mann-Kendall Test	14-37
     This chapter discusses the importance of statistical independence in groundwater monitoring data
with respect to temporal variability. Temporal variability exists when the distribution of measurements
varies with the times at which sampling or analytical measurement occurs. This variation can be caused
by seasonal fluctuations in  the groundwater itself,  changes in the analytical  method used, the re-
calibration of instruments, anomalies in sampling method, etc.

     Methods to identify temporal variability are discussed for both groups of wells (parallel time series
plots; one-way analysis of variance  [ANOVA] for temporal effects) and single  data series  (sample
autocorrelation function; rank von Neumann ratio).  Procedures are also presented for  correcting or
accommodating temporal effects. These include guidance on adjusting the sampling frequency to avoid
temporal correlation, choosing a sampling interval using the Darcy  equation,  removing seasonality or
other temporal dependence, and finally testing for trends with seasonal data.

14.1 TEMPORAL DEPENDENCE

     A key assumption underlying most statistical tests is that the sample data are independent  and
identically distributed [i.i.d.]  (Chapter 3). In part, this means that measurements collected over a period
of time should not exhibit a clear time dependence  or  significant autocorrelation.  Time dependence
refers to the presence of trends or cyclical patterns when the observations are graphed on a time series
plot. The  closely related concept of autocorrelation  is essentially the degree  to which measurements
collected later in a series can be predicted from previous measurements. Strongly  autocorrelated data are
highly predictable from one value to the next.  Statistically independent values vary in a random,
unpredictable fashion.

     While temporal independence is a complex topic, there are several common  types of temporal
dependence. Some of these include: 1) correlation across wells over time in the concentration pattern of
a single constituent (i.e., concentrations tending to jointly rise or fall at each of the  wells on common
sampling events); 2) correlation across multiple constituents over time in their  concentration patterns
(i.e., a parallel rise or fall in concentration across several parameters on common sampling events); 3)
seasonal cycles; 4) trends, linear or otherwise; and 5) serial  dependence or autocorrelation (i.e., greater
correlation between sampling events more closely spaced in time).

     Any of these patterns can invalidate or weaken the results of statistical testing. In some cases, a
statistical  method  can be chosen that specifically accounts for temporal dependence (e.g.,  seasonal
Mann-Kendall trend test). In other instances, the sample data need to be adjusted for the dependence.
Future data might also need to be collected in a manner that avoids temporal correlation. The goal of this
chapter is to present straightforward tools that can be used to first identify temporal dependence and then
to adjust for this correlation.

     To better understand  why  most statistical  tests  depend  on  the  assumption of  statistical
independence, consider a hypothetical series of groundwater measurements exhibiting an obvious pattern
of seasonal fluctuation (Figure 14-1).  These data demonstrate regular and repeated cycles of higher and
lower values. Even though the measurements fluctuate predictably and are highly dependent, the characteristics of the entire
groundwater population will be observed over a long period of monitoring. This provides an estimate of
the full range of concentrations and an accurate  gauge of total variability.

     The same is not true for data collected from the same population over a much shorter span, say in
five to six months. A much narrower range of sample concentrations would be  observed due to  the
cyclical pattern. Depending on when the sampling was conducted, the average concentration level would
either be much higher or much lower than the overall  average; no single sampling period is likely to
accurately estimate either the true population mean or its variance.

     From this example, an important lesson can be drawn about temporally dependent data. Variance
estimates in a sample of dependent, positively autocorrelated data are likely to be biased low. This is
important  because the guidance methods require and assume that an accurate and unbiased estimate of
the sample standard deviation be available. A case in point was the practice of using aliquot replicates of
a single physical sample for comparison with other combined replicate aliquot samples from a number of
individual physical water quality samples (e.g., in a Student-t test).  Aliquot replicate values are much
more similar to each other than to measurements made on physically discrete groundwater samples.
Consequently, the estimate of variance was too low and the t-test frequently registered false positives.

     Using physically discrete  samples is not always sufficient. If the sampling interval ensures that
discrete volumes of groundwater are being sampled on  consecutive sampling events, the observations
can be described as physically independent. However, they are not necessarily statistically independent.
Statistical  independence is based not on the physical characteristics of the sample data, but rather on the
statistical pattern of measurements.

     Temporally dependent and autocorrelated data generally contain both a truly random and non-
random component. The relative strength of the latter effect is measured by one or more correlation
techniques. The degree of correlation among dependent sample measurements lies on a continuum.
Sample pairs  can  be  mildly correlated or strongly correlated. Only strong  correlations are likely to
substantially impact the results of further statistical testing.

                            Figure 14-1. Seasonal Fluctuations
                   [time series plot of concentration versus date]

14.2 IDENTIFYING TEMPORAL EFFECTS AND CORRELATION

14.2.1       PARALLEL TIME SERIES PLOTS

       BACKGROUND AND  PURPOSE

     Time series plots were introduced in Chapter 9. A time series plot such as Figure 14-1 is a simple
graph of concentration versus time of sample collection. Such plots are useful for identifying a variety of
temporal patterns. These include identifying a trend over time, one or more sampling events that may
signal contaminant releases, measurement outliers resulting in anomalous 'spikes' due to field handling
or analytical problems, cyclical and seasonal fluctuations, as well as the presence of other time-related
dependencies.

     Time  series plots can be used in two basic ways to identify temporal dependence. By graphing
single constituent data from multiple wells together on a time series plot, potentially significant temporal
components of variability can be  identified.  For example, seasonal fluctuations can cause the mean
concentration levels at a number of wells to vary with the time of sampling events. This dependency will
show up in the time series plot as a pattern of parallel traces, in which the individual wells will tend to
rise and fall together across the sequence of sampling dates. The parallel pattern may be the result of the
measurement process such as mid-stream changes in field handling or sample collection procedures,
periodic re-calibration of analytical instrumentation, and changes in laboratory or analytical methods. It
could also be the result of significant autocorrelation present in the groundwater population itself.
Hydrologic  factors such as drought, recharge patterns or regular (e.g., seasonal) water table fluctuations
may be responsible.  In these cases, it may be useful to test for the presence of a significant temporal
effect by first constructing a parallel time series plot and then running a formal one-way ANOVA for
temporal effects (Section 14.2.2).

     The second way time series plots can be helpful is by plotting multiple constituents over time for
the same well, or averaging values for each constituent across wells on each sampling event and then
plotting the averages over time. In either case, the plot can signify whether the general concentration
pattern over time is simultaneously observed for different constituents. If so, it may indicate that a group
of constituents is highly correlated in groundwater or that the same  artifacts of sampling and/or lab
analysis impacted the results of several monitoring parameters.

       REQUIREMENTS AND ASSUMPTIONS

     The requirements for time  series plots were discussed in  Chapter  9.  Two  very  useful
recommendations follow from that discussion.  First, a different plot symbol should be used to display
any non-detect measurements (e.g.,  solid symbols for detected values, hollow symbols for non-detects).
This can help prevent mistaking a change over time in reporting limits as a trend, since detected and
non-detected data are clearly distinguished on the plot. It also allows one to determine whether  non-
detects are more prevalent during certain portions of the sample record  and less prevalent at other times.
Secondly, when multiple constituents are plotted on the same graph, it  may be necessary to standardize
each constituent prior to plotting to avoid trying to simultaneously visualize high-valued and low-valued
traces on the same y-axis (i.e., concentration axis).  The goal of such a plot is to identify parallel
concentration patterns over time. This can be done most readily by subtracting each constituent's sample
mean (x̄) from the measurements for that constituent and dividing by the standard deviation (s), so that
every constituent is plotted on roughly the same scale.

       PROCEDURE  FOR MULTIPLE WELLS, ONE CONSTITUENT

Step 1.   For each well to be  plotted, form data pairs by matching each concentration value with its
         sampling date.

Step 2.   Graph the data pairs for each well on the same set of axes, the horizontal axis representing
         time and the  vertical axis representing concentration. Connect the points for each  individual
         well to form a 'trace' for that well.

Step 3.   Look for parallel movement in the traces across the wells. Even if all the well concentrations
         tend  to rise on a given sampling event, but not to the same magnitude  or degree,  this is
         evidence of a possible temporal effect.

       PROCEDURE  FOR MULTIPLE CONSTITUENTS, ONE OR MANY WELLS

Step 1.   For each constituent to be plotted, compute the constituent-specific sample mean (x̄) and
          standard deviation (s). Form standardized measurements (z_i) by subtracting the mean from
          each concentration (x_i) and dividing by the standard deviation, using the equation:

                                      z_i  =  (x_i − x̄) / s                                  [14.1]


         Form data pairs by matching each standardized concentration with its sampling event.

                                             14-4                                  March  2009

-------
Chapter 14. Temporal Variability
                                        Unified Guidance
Step 2.   If correlation is suspected in a group of wells, average the standardized concentrations for each
         given constituent across wells for each specific sampling event. Otherwise, form a multi-
         constituent time series plot separately for each well.

Step 3.   Graph  the data  pairs for each constituent on  the  same  set of axes,  the  horizontal axis
         representing time and the vertical  axis representing standardized concentrations. Connect the
         points for each constituent to form  a trace for that parameter.

Step 4.   Look for parallel movement  in  the  traces across  the constituents.  A  strong  degree  of
         parallelism indicates a high degree  of correlation among the monitoring parameters.
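
     A minimal R sketch of the multi-constituent procedure follows. The constituent names and
concentrations are simulated placeholders; the point is simply the standardization of each trace per
equation [14.1] before overlaying the traces on one set of axes.

      # Sketch: standardize each constituent (eq. [14.1]) and overlay the traces;
      # the matrix below is simulated purely for illustration
      set.seed(1)
      events <- 1:8
      conc <- cbind(ConstA = 30  + 2 * sin(events)   + rnorm(8, 0, 0.5),
                    ConstB = 300 + 25 * sin(events)  + rnorm(8, 0, 6),
                    ConstC = 5   + 0.4 * sin(events) + rnorm(8, 0, 0.1))
      z <- scale(conc)                      # subtract each column mean, divide by its SD
      matplot(events, z, type = "b", pch = 16, lty = 1, col = 1:3,
              xlab = "Sampling Event", ylab = "Standardized Concentration")
      legend("topleft", legend = colnames(conc), col = 1:3, pch = 16, lty = 1)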

       ►EXAMPLE 14-1

     The following sets of manganese measurements from four background wells were collected over a
two-year period. Construct a time series plot of these data to check for possible temporal effects.

                         Manganese Concentrations (ppm)
               Qtr      BW-1      BW-2      BW-3      BW-4
                1       28.14     31.41     27.15     30.46
                2       29.33     30.27     30.24     30.60
                3       30.45     32.57     29.14     30.96
                4       32.42     32.77     30.59     30.70
                5       34.37     33.03     34.88     32.71
                6       33.25     32.18     30.53     31.76
                7       31.02     28.85     30.33     31.85
                8       28.50     32.88     30.42     29.58
       SOLUTION
Step 1.   Graph each well's concentrations versus sampling event on the same set of axes to construct
         the following time series plot (Figure 14-2).

                    Figure 14-2. Manganese Parallel Time Series Plot
         [manganese concentration (ppm) versus sampling event, one trace per well]

Step 2.   Examining the traces on the plot, there is some degree of parallelism in the pattern over time.
         Particularly for the fifth quarter, there is an across-the-board increase in the manganese level,
         followed by a general decline the next two quarterly events. Note, however, that there is little
          evidence of differences in mean levels by well location. ◄
14.2.2      ONE-WAY ANALYSIS OF VARIANCE FOR TEMPORAL EFFECTS

       PURPOSE AND BACKGROUND

     Parametric ANOVA is a comparison of means among a set of populations. The one-way ANOVA
for temporal effects is no exception. A one-way ANOVA for spatial variation (Chapter 13) uses well
data  sets to represent locations as the statistical factor of interest.  In contrast, a one-way ANOVA for
temporal effects considers multiple well  data sets  for individual sampling events or seasons  as  the
relevant statistical factor. A significant temporal factor implies that the average concentration depends to
some degree on when sampling takes place.

     Three common examples of temporal factors include: 1) an irregular, but consistent shift of
average concentrations  over  time  perhaps  due to changes in laboratories  or analytical method
interferences; 2) cyclical seasonal patterns; or 3) parallel upward or downward trends.  These can occur
in both upgradient and downgradient well data.

     If event-specific analytical differences or seasonality appear to be an important temporal factor, the
one-way ANOVA for temporal effects can be  used  to formally identify seasonality, parallel trends, or
changes in lab performance that affect other temporal effects. Results of the ANOVA can also be used to
create temporally stationary residuals, where the temporal effect has been 'subtracted from' the original
measurements.  These stationary residuals may be  used to replace  the original  data in  subsequent
statistical testing.

     The one-way ANOVA for a temporal factor described below can be used for an additional purpose
when interwell testing is appropriate. For this situation, there can be no significant spatial variability. If
a group of upgradient or other background wells indicates a significant temporal effect, an interwell
prediction limit can be designed  which  properly accounts for this  temporal  dependence.  A more
powerful interwell test  of upgradient-to-downgradient differences can be developed than otherwise
would  be possible.  This can occur because the ANOVA  separates  or 'decomposes'  the overall  data
variation into two sources: a) temporal effects and b) random variation or statistical error. It  also
estimates how  the background mean is  changing  from one sampling  event to the next. The final
prediction limit is formed by computing the background mean, using the separate structural and random
variation components of the ANOVA to better estimate the standard  deviation, and then adjusting the
effective sample size (via the degrees of freedom) to account for these factors.

       REQUIREMENTS AND ASSUMPTIONS

     Like the one-way ANOVA for spatial variation (Chapter 13), the one-way ANOVA for temporal
effects assumes that the data groups are normally-distributed with constant variance. This requirement
means  that the  group residuals should be tested for normality (Chapter 10) and  also for  equality of

variance (Chapter 11). It is also assumed that for each of a series of background wells, measurements
are collected at each well on sampling events or dates common to all the wells.

     Two variations in the basic procedure are described below. For cases of temporal effects excluding
seasonality, each sampling event is treated as a separate population. The ANOVA residuals are grouped
and tested by sampling event to test for equality of variance. In cases of apparent seasonality,  each
season is treated as a distinct population. The difference is that seasons contain multiple sampling events
across a span of multiple years, with sampling events collected at the same time of year assigned to one
of the seasons (e.g., all January or first quarter measurements). Here, the ANOVA residuals are grouped
by season to test for homoscedasticity.

     If the assumption of equal variances or normal residuals is violated, a data transformation should
be considered. This should be followed by testing of the assumptions on the transformed scale. The one-
way  ANOVA for a non-seasonal effect should include a minimum  of four wells  and at least  4
observations (i.e., distinct sampling dates) per well. In the seasonal case, there should be a minimum of
3-4 sampling events per distinct season,  with the events thus spanning at least three years (i.e., one per
year per season). Larger numbers of both wells and observations are preferable. Sampling dates should
also be approximately the same for each well if a temporal effect is to be tested.

     If the data cannot be normalized, a similar test for a temporal or seasonal effect can be performed
using the Kruskal-Wallis test (Chapter 17). The only difference from the procedure outlined in Section
17.1.2 is that the roles of wells/groups and sampling events have to be reversed. That is, each sampling
event should  be treated as a separate 'well,' while each well is treated as a separate 'sampling event.'
Then the same equations  can be  applied to the reversed data set  to test for a significant temporal
dependence. If testing  for a seasonal  effect, the wells in the notation  of  Section 17.1.2 become the
groups of common sampling events from different years, while the sampling events are again the distinct
wells.
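
     A minimal sketch of this role reversal in R is given below; the well network, sampling events, and
concentrations are simulated stand-ins for real data.

      # Sketch: Kruskal-Wallis check for a temporal effect when the data cannot be normalized,
      # with sampling event as the grouping factor (the role reversal described above)
      set.seed(7)
      event <- rep(1:8, times = 4)                       # 8 common events at 4 wells
      conc  <- rlnorm(32, meanlog = 3 + 0.3 * sin(event), sdlog = 0.4)
      kruskal.test(conc ~ factor(event))                 # small p-value indicates a temporal effect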

     Even when a temporal effect exists and is apparent on a time series plot, the variation between well
locations (i.e., spatial variability) may overshadow the temporal variability.   This could result in a non-
significant one-way ANOVA finding for the temporal factor. In these cases, a two-way ANOVA can be
considered where both well location and sampling event/season are treated as statistical factors.  This
procedure is described  in Davis (1994).  Evidence for a temporal effect can be documented using this
last technique, although the  two-way  ANOVA isn't  necessary if the goal is simply to  construct
temporally  stationary  residuals.  That  can  be  accomplished with  a  one-way  ANOVA even  when
significant spatial variability exists.

       PROCEDURE

Step 1.   Given a set of W wells and measurements from each of T sampling events at each well on each
          of K years, label the observations as x_ijk, for i = 1 to W, j = 1 to T, and k = 1 to K. Then x_ijk
          represents the measurement from the ith well on the jth sampling event during the kth year.

Step 2.   When testing for a non-seasonal temporal effect, form the set of event means (x̄.jk) and the
          grand mean (x̄...) using equations [14.2] and [14.3] respectively:

                    x̄.jk  =  (1/W) Σ_{i=1..W} x_ijk      for j = 1 to T and k = 1 to K          [14.2]

                    x̄...  =  Σ_{i=1..W} Σ_{j=1..T} Σ_{k=1..K} x_ijk / (WTK)                     [14.3]
Step 2a.  If testing for a seasonal effect common to all wells, form the seasonal means (x̄.j.) instead of
          the event means of Step 2, using the equation:

                    x̄.j.  =  Σ_{i=1..W} Σ_{k=1..K} x_ijk / (WK)      for j = 1 to T             [14.4]
Step 3.   Compute the set of residuals for each sampling event or season using either equation [14.5] or
          equation [14.6] respectively:

                    r_ijk  =  x_ijk − x̄.jk      for i = 1 to W                                  [14.5]

                    r_ijk  =  x_ijk − x̄.j.      for i = 1 to W and k = 1 to K                   [14.6]
Step 4.   Test the residuals for normality (Chapter 10). If significant non-normality is evident, consider
         transforming the data and re-doing the computations in Steps 1 through 4 on the transformed
         scale.

Step 5.   Test the sets of residuals grouped  either  by sampling event or season for equal variance
         (Chapter 11). If the variances are significantly  different, consider transforming the data and
         re-doing the computations in Steps 1 through 5 on the transformed data.

Step 6.   If testing for a non-seasonal temporal effect, compute the mean error sum of squares term
          (MS_E) using equation:

                    MS_E  =  Σ_{i=1..W} Σ_{j=1..T} Σ_{k=1..K} ( x_ijk − x̄.jk )² / [TK(W−1)]     [14.7]

          This term is associated with TK(W−1) degrees of freedom. Also compute the mean sum of
          squares for the temporal effect (MS_T) with degrees of freedom (TK−1), using equation:

                    MS_T  =  W · Σ_{j=1..T} Σ_{k=1..K} ( x̄.jk − x̄... )² / (TK−1)                [14.8]

Step 6a.  If testing for a seasonal effect, compute the mean error sum of squares (MS_E) using equation:

                    MS_E  =  Σ_{i=1..W} Σ_{j=1..T} Σ_{k=1..K} ( x_ijk − x̄.j. )² / [T(WK−1)]     [14.9]

          This term is associated with T(WK−1) degrees of freedom. Also compute the mean sum of
          squares for the seasonal effect (MS_T) with degrees of freedom (T−1), using equation:

                    MS_T  =  WK · Σ_{j=1..T} ( x̄.j. − x̄... )² / (T−1)                           [14.10]


Step 7.   Test for a significant event-to-event or seasonal effect by computing the ratio of the mean sum
          of squares for time and the mean error sum of squares:

                                         F_T = MS_T / MS_E                                  [14.11]

Step 8.   If testing for a non-seasonal temporal effect, the test statistic F_T under the null hypothesis (i.e.,
          of no significant time-related variability among the sampling events) will follow an F-
          distribution with (TK−1) and TK(W−1) degrees of freedom. Therefore, using a significance
          level of α = 0.05, compare F_T against the critical point F.95, TK−1, TK(W−1) taken from the F-
          distribution in Table 17-1 in Appendix D. If the critical point is exceeded, conclude there is a
          significant temporal effect.

Step 8a.  If testing for a seasonal effect, the test statistic F_T under the null hypothesis (i.e., of no
          seasonal pattern) will follow an F-distribution with (T−1) and T(WK−1) degrees of freedom.
          Therefore, using a significance level of α = 0.05, compare F_T against the critical point
          F.95, T−1, T(WK−1) taken from the F-distribution in Table 17-1 of Appendix D. If the critical
          point is exceeded, conclude there is a significant seasonal pattern.

Step 9.   If there is no spatial variability but a significant temporal effect exists among a set of
          background wells, compute an appropriate interwell prediction or control chart limit as
          follows. First replace the background sample standard deviation (s) with the following
          estimate built from the one-way ANOVA table:

                         σ̂  =  √{ [ MS_T + (W−1)·MS_E ] / W }                                  [14.12]

          Then calculate the effective sample size for the prediction limit as:

                n*  =  1 + { TK(TK−1)·[F_T + (W−1)]² } / { TK·F_T² + (TK−1)(W−1) }              [14.13]
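
     These two adjustments can be wrapped in a short R function, as sketched below using equations
[14.12] and [14.13] as presented above (the function and argument names are illustrative). Applied to
the ANOVA results of Example 14-2 further on, it returns an adjusted standard deviation of roughly
1.81 ppm and an effective sample size of about 19.

      # Sketch: adjusted standard deviation and effective sample size from the one-way ANOVA
      # for temporal effects, following equations [14.12] and [14.13] above
      adjust.interwell <- function(ms.t, ms.e, W, TK) {   # W wells, TK total sampling events
        sd.adj <- sqrt((ms.t + (W - 1) * ms.e) / W)                      # eq. [14.12]
        f.t    <- ms.t / ms.e
        n.eff  <- 1 + (TK * (TK - 1) * (f.t + W - 1)^2) /
                      (TK * f.t^2 + (TK - 1) * (W - 1))                  # eq. [14.13]
        c(sd.adj = sd.adj, n.eff = n.eff)
      }
      adjust.interwell(ms.t = 7.55, ms.e = 1.87, W = 4, TK = 8)          # Example 14-2 values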

       ►EXAMPLE 14-2

     Some parallelism was found in the time series plot of Example 14-1. Test those same manganese
data for a significant, non-seasonal temporal effect using a one-way ANOVA at the 5% significance
level.

                          Manganese Concentrations (ppm)
          Qtr     Event Mean     BW-1      BW-2      BW-3      BW-4
           1        29.290       28.14     31.41     27.15     30.46
           2        30.110       29.33     30.27     30.24     30.60
           3        30.780       30.45     32.57     29.14     30.96
           4        31.620       32.42     32.77     30.59     30.70
           5        33.747       34.37     33.03     34.88     32.71
           6        31.930       33.25     32.18     30.53     31.76
           7        30.513       31.02     28.85     30.33     31.85
           8        30.345       28.50     32.88     30.42     29.58

                              Grand mean = 31.042

       SOLUTION
Step 1.   First compute the means for each sampling event and the grand mean of all the data. These
          values are listed in the table above. With four wells and eight quarterly events per well, W = 4,
          T = 4, and K = 2.

Step 2.   Determine the residuals for each sampling event by subtracting off the event mean. These
         values are listed in the table below.
                         Manganese Event Residuals (ppm)
          Qtr      BW-1       BW-2       BW-3       BW-4
           1      -1.150      2.120     -2.140      1.170
           2      -0.780      0.160      0.130      0.490
           3      -0.330      1.790     -1.640      0.180
           4       0.800      1.150     -1.030     -0.920
           5       0.622     -0.718      1.132     -1.038
           6       1.320      0.250     -1.400     -0.170
           7       0.508     -1.662     -0.182      1.338
           8      -1.845      2.535      0.075     -0.765
Step 3.   Test the residuals for normality. A probability plot of these residuals is given in Figure 14-3.
         An adequate fit to normality is suggested by Filliben's probability plot correlation coefficient
         test.

         Figure 14-3.  Probability Plot of Manganese Sampling Event Residuals
              [normal z-scores versus event mean residuals (ppm)]
Step 4.   Next, test the groups of residuals for equal variance across sampling events.  Levene's test
          (Chapter 11) gives an F-statistic of 1.30, well below the 5% critical point with 7 and 24
          degrees of freedom of F.95,7,24 = 2.42. Therefore, the group variances test out as adequately
          homogeneous.

Step 5.   Compute the mean error sum of squares term using equation [14.7]:

            MS_E  =  [(−1.150)² + (−0.780)² + ... + (1.338)² + (−0.765)²] / [(4 · 2)(4 − 1)]  =  1.87

Step 6.   Compute the mean sum of squares term for the time effect using equation [14.8]:

       MS_T  =  4[(29.290 − 31.042)² + (30.110 − 31.042)² + ... + (30.345 − 31.042)²] / 7  =  7.55

Step 7.   Test for a significant temporal effect, computing the F-statistic in equation [14.11]:

                                      F_T  =  7.55/1.87  =  4.04

          The degrees of freedom associated with the numerator and denominator respectively are
          (TK−1) = 7 and TK(W−1) = 24. Just as with Levene's test run earlier, the 5% level critical
          point for the test is F.95,7,24 = 2.42. Since F_T exceeds this value, there is evidence of a
          significant temporal effect in the manganese background data.

Step 8.   Assuming a lack of spatial variation, the presence of a temporal effect can be used to compute
          a standard deviation estimate and effective background sample size appropriate for an
          interwell prediction limit test, using equations [14.12] and [14.13] respectively. The adjusted
          standard deviation becomes:

                            σ̂  =  √{ [7.55 + 3(1.87)] / 4 }  =  1.81 ppm

          while the effective sample size is:

                n*  =  1 + [8 · 7 · (4.04 + 4 − 1)²] / [8 · (4.04)² + 7 · 3]  =  19.31  ≈  19

          If the background data had simply been pooled together and the sample standard deviation
          computed, s = 1.776 ppm with a sample size of n = 32.  So the adjustments based on the
          temporal effect alter the final prediction limit by enlarging it to account for the additional
          component of variation. ◄
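
     For comparison, the same one-way ANOVA for a temporal effect can be run directly in R on the
Example 14-2 data, treating the sampling event as the factor of interest; the mean squares and F-statistic
match the hand calculations above up to rounding.

      # Sketch: one-way ANOVA for a sampling-event (temporal) effect on the Example 14-2 data
      mn <- c(28.14, 29.33, 30.45, 32.42, 34.37, 33.25, 31.02, 28.50,   # BW-1
              31.41, 30.27, 32.57, 32.77, 33.03, 32.18, 28.85, 32.88,   # BW-2
              27.15, 30.24, 29.14, 30.59, 34.88, 30.53, 30.33, 30.42,   # BW-3
              30.46, 30.60, 30.96, 30.70, 32.71, 31.76, 31.85, 29.58)   # BW-4
      event <- factor(rep(1:8, times = 4))        # quarterly sampling event as the factor
      summary(aov(mn ~ event))                    # MS_T ~ 7.55, MS_E ~ 1.87, F ~ 4.04
      qf(0.95, 7, 24)                             # 5% critical point, ~2.42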


14.2.3      SAMPLE AUTOCORRELATION FUNCTION

       BACKGROUND AND  PURPOSE

     The  sample autocorrelation function enables a test of temporal autocorrelation in a single data
series (e.g., from a single  well over time). When  a time-related  dependency affects  several wells
simultaneously, parallel  time series plots (Section 14.2.1) and one-way ANOVA  for temporal effects
(Section 14.2.2) should be considered. But when a longer data series is to be used for an intrawell test
such as a  prediction  limit or control chart,  the autocorrelation function does an excellent job  of
identifying temporal dependence.

     Given a sequence of consecutively-collected measurements, x_1, x_2,..., x_n, form the set of
overlapping pairs (x_i, x_i+1) for i = 1,..., n−1. The approximate first-order sample autocorrelation
coefficient is then computed from these pairs as (Chatfield, 2004):

                r_1  =  Σ_{i=1..n−1} (x_i − x̄)(x_i+1 − x̄)  /  Σ_{i=1..n} (x_i − x̄)²            [14.14]

Equation [14.14] estimates the first-order autocorrelation, that is, the correlation between pairs of
sample measurements collected one event apart (i.e., consecutive events). The number of sampling
events separating each pair is called the lag, representing the temporal distance between the paired
measurements.

     Autocorrelation can also be computed at other lags. The general approximate equation for the kth
lag is given by:

                r_k  =  Σ_{i=1..n−k} (x_i − x̄)(x_i+k − x̄)  /  Σ_{i=1..n} (x_i − x̄)²            [14.15]

which estimates the kth-order autocorrelation for pairs of measurements separated in time by k sampling
events. Note that the number of pairs used to compute r_k decreases with increasing k due to the fact that
fewer and fewer sample pairs can be formed which are separated by that many lags.

     By computing the first few sample autocorrelation coefficients and defining r_0 = 1, the sample
autocorrelation function can be formed by plotting r_k against k. Since the autocorrelation coefficients are
approximately normal in distribution with zero mean and variance equal to 1/n, a test of significant
autocorrelation at approximately the 95% confidence level can be made by examining the sample
autocorrelation function to see if any coefficients exceed 2/√n in absolute value (±2/√n represent
approximate upper and lower confidence limits).

     The sample autocorrelation function is a valuable visual tool for assessing different types of
autocorrelation (Chatfield, 2004). For instance, a stationary (i.e., stable, non-trending) but non-random
series of measurements will often exhibit a large value of r_1 followed by perhaps one or two other
significantly non-zero coefficients. The remaining coefficients will be progressively smaller and smaller,
tending towards zero. A series with a clear seasonal pattern will exhibit a seasonal (i.e., approximately
sinusoidal) autocorrelation function. If the series tends to alternate between high and low values, the
autocorrelation function will also alternate, with r_1 being negative to reflect that consecutive
measurements tend to be on 'opposite sides' of the sample mean. Finally, if the series contains a trend,
the sample autocorrelation function will not drop to zero as the lag k increases. Rather, there will be a
persistent autocorrelation even at very large lags.

       REQUIREMENTS AND ASSUMPTIONS

     The approximate distribution of the sample autocorrelation coefficients is predicated on the sample
measurements following  a normal distribution. A test for significant autocorrelation may therefore be
inaccurate unless the sample measurements are roughly normal. Non-normal data series should be tested
for temporal autocorrelation using the non-parametric rank von Neumann ratio (Section 14.2.4).

     Outliers  can drastically affect the sample  autocorrelation function (Chatfield, 2004). Before
assessing  autocorrelation, check the sample for possible outliers, removing those that are identified. A
series of  at least  10-12  measurements is minimally recommended to construct the autocorrelation
function.  Otherwise, the number of lagged data pairs  will be too small  to reliably estimate the
correlation, especially for larger lags. Sampling events should be regularly spaced so that pairs lagged by
the same number of events (k) represent the same approximate time interval.

       PROCEDURE

Step 1.    Given a series of n measurements, x_1,..., x_n, form sets of lagged data pairs (x_i, x_i+k), i = 1,...,
           n−k, for k ≤ [n/3], where the notation [c] represents the largest integer no greater than c. For
           longer series, computing lags to a maximum of k = 15 is generally sufficient.


Step 2.   For each set of lagged pairs from Step 1, compute the sample autocorrelation coefficient, r_k,
          using equation [14.15]. Also define r_0 = 1.

Step 3.   Graph the sample autocorrelation function by plotting r_k versus k for k = 0,..., [n/3], generally
          up to a maximum lag of 15. Also plot horizontal lines at levels equal to ±2/√n.

Step 4.   Examine the sample autocorrelation function. If any coefficient r_k exceeds 2/√n in absolute
          value, conclude that the sample has significant autocorrelation.
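
     A short R sketch of this procedure is shown below. The data series is simulated purely for
illustration; with real data, the vector x would simply hold the regularly spaced measurements in
collection order.

      # Sketch: sample autocorrelation function with the +/- 2/sqrt(n) limits described above
      set.seed(42)
      x <- 1800 + 500 * sin(2 * pi * (1:54) / 12) + rnorm(54, sd = 300)  # simulated seasonal series
      r <- acf(x, lag.max = floor(length(x) / 3))     # plots r_k versus lag k
      abline(h = c(-2, 2) / sqrt(length(x)), lty = 2) # approximate 95% limits, +/- 2/sqrt(n)
      r$acf[2]                                        # lag-1 coefficient r_1 (element 1 is lag 0)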

       ►EXAMPLE 14-3

     The following series of monthly total alkalinity  measurements were collected from  leachate at a
solid waste landfill during a four and a half year period. Use the sample autocorrelation function to test
for significant temporal dependence in this  series.
              Total                          Total                          Total
              Alkalinity                     Alkalinity                     Alkalinity
   Date       (mg/L)          Date           (mg/L)          Date           (mg/L)
  01/26/96      1400         07/01/97          2400         01/15/99          1350
  02/20/96      1700         08/15/97          3500         02/02/99          1560
  03/19/96      1900         09/15/97          3100         03/02/99          1220
  04/22/96      1800         10/15/97          3300         04/15/99          1390
  05/22/96      1300         11/15/97          2100         05/04/99          1940
  06/24/96      2000         12/15/97          2100         06/02/99          2160
  07/15/96      2300         01/15/98          1500         07/07/99          1990
  08/21/96      2500         02/15/98           710         08/03/99          2540
  09/15/96      1700         03/15/98          1100         09/02/99          2250
  10/15/96      1600         04/15/98          1900         10/07/99          1630
  11/11/96      1400         05/08/98          2100         11/02/99          1710
  12/10/96      1600         06/15/98          2000         12/07/99          1210
  01/22/97      1800         07/15/98          2500         01/06/00          1170
  02/11/97      1000         08/15/98          2700         02/02/00          1330
  03/04/97       720         09/02/98          2400         03/02/00          1540
  04/07/97      1400         10/06/98          3000         04/04/00          1670
  05/01/97      1600         11/03/98          2700         05/02/00          1520
  06/09/97       990         12/15/98          2680         06/06/00          2080
       SOLUTION
Step 1.   Create a time series plot of the n = 54 alkalinity measurements, as in Figure 14-4. The series
         indicates an apparent seasonal fluctuation.

Step 2.   Form lagged data pairs from the alkalinity series for each lag k = 1,..., [n/3] = 18. The first
         two pairs for k = 1 (i.e., first order lag) are (1400,  1700) and (1700, 1900). For k = 2, the first
         two pairs are (1400, 1900) and (1700, 1800), etc.

Step 3.   At each lag (k), compute the sample autocorrelation coefficient using equation [14.15]. Note
          that the denominator of this equation equals (n−1)s². For the alkalinity data, the sample mean
          and variance are x̄ = 1865.93 and s² = 392349.1 respectively. The lag-1 autocorrelation is thus:

    r_1  =  [(1400 − 1865.93)(1700 − 1865.93) + ... + (1520 − 1865.93)(2080 − 1865.93)] / [(54 − 1)(392349.1)]  =  0.64

         Other lags are computed similarly.

Step 4.   Plot the sample autocorrelation function as in Figure 14-5. Overlay the plot with 95%
          confidence limits (dotted lines) shown at ±2/√n = ±2/√54 = ±0.27.

Step 5.   The autocorrelation function indicates coefficients at several lags that lie outside the 95%
          confidence limits, confirming the presence of temporal dependence.  Further, the shape of the
          autocorrelation function is sinusoidal, suggesting a strong seasonal fluctuation in the alkalinity
          levels. ◄

                Figure 14-4.  Time Series Plot of  Total Alkalinity (mg/L)
              [total alkalinity (mg/L) versus sampling date, 1996 through 2000]

           Figure 14-5. Sample Autocorrelation Function for Total Alkalinity
              [autocorrelation coefficient r_k versus lag k (0 to 15), with ±2/√n limits]
14.2.4       RANK VON NEUMANN RATIO TEST
       BACKGROUND AND PURPOSE

     The rank von Neumann ratio is a non-parametric test of first-order temporal autocorrelation in a
single data series (e.g., from  a single well over time). It can be used as an alternative to the sample
autocorrelation function (Section 14.2.3) for non-normal data, and is both easily computed and effective.

     The rank von Neumann ratio is based on the idea that a truly independent series of data will vary in
an unpredictable fashion as the list is examined sequentially. The first order or lag-1 autocorrelation will
be approximately zero. By  contrast, the  first-order autocorrelation in dependent data will tend to be
positive (or negative), implying that lag-1 data pairs in the series will tend to be more similar (or
dissimilar) in magnitude than would be expected by chance.

     Not only will the concentrations of lag-1 data pairs tend to be similar (or dissimilar) when the
series is  autocorrelated,  but the ranks of lag-1  data pairs will  share that  similarity or  dissimilarity.
Although the test is non-parametric and rank-based, the ranks of non-independent data  still follow a
discernible pattern. Therefore, the rank von Neumann ratio is constructed from the sum of differences
between the ranks of lag-1 data pairs. When these differences are small, the ranks of consecutive data
measurements  need to  be  fairly similar,  implying that the pattern  of observations  is  somewhat
predictable.  Given the relative position and magnitude of one observation, the approximate relative
position and magnitude of the next sample  measurement can be predicted. Low values of the rank von
Neumann ratio are therefore indicative of temporally dependent data  series.

     Compared to other tests of statistical independence, the rank von Neumann ratio has been shown to
be more powerful than non-parametric methods such as the Runs up-and-down test (Madansky,  1988). It
is also a reasonable test when the data follow a normal distribution. In that case, the efficiency of the test
is always close to 90 percent when compared to the von Neumann ratio computed on concentrations
instead of the ranks. Thus, very little effectiveness is lost by using the ranks in place of the original
measurements. The rank von Neumann ratio will correctly detect dependent data and do so over a variety
of underlying data distributions.  The rank von Neumann ratio is also fairly robust to departures from
normality, such as when the data derive from a skewed distribution like the lognormal.

       REQUIREMENTS AND ASSUMPTIONS

     An unresolved problem with the  rank von Neumann ratio test is  the presence of a substantial
fraction of tied  observations.  Like the Wilcoxon  rank-sum test (Chapter 16), Bartels (1982)
recommends replacing each tied value by its mid-rank (i.e., the average of all the ranks that would have
been  assigned to that  set of ties).  However,  no explicit adjustment  of the ratio for ties has been
developed.  The rank von Neumann critical points may not be appropriate (or at  best very approximate)
when a large portion of the data consists of non-detects or other tied values. Especially in the case of
frequent non-detects, too much information is lost regarding the pattern of variability to use the rank von
Neumann ratio as an accurate indication of autocorrelation. In fact, no test  is likely to provide a good
estimate of temporal correlation, whether non-parametric or parametric.

     While the rank von Neumann ratio test is recommended in the Unified Guidance for its ease of use
and robustness when applied to either normal or non-normal  distributions, the literature on time series
analysis and temporal correlation is  extensive with respect to other potential tests. Many other tests of
autocorrelation are available, especially when either the original measurements  or the residuals  of the
data are normally distributed after a trend has been removed. Chatfield (2004) and Madansky (1988) are
two good references for some of these alternate tests.

       PROCEDURE

Step 1 .   Order the sample from least to greatest and assign a unique rank to each measurement. If some
         data values are tied, replace tied values with their mid-ranks as in the Wilcoxon rank-sum test
         (Chapter 16). Then list the observations and their corresponding ranks in the order that they
         were collected (i.e., by sampling event or time order).

Step 2.   Using the list of ranks, Ri, for the sampling events i = 1...n, compute the rank von Neumann
          ratio with the equation:

                         v = [ Σ(i=2 to n) (Ri − Ri−1)² ] / [ n(n² − 1)/12 ]                  [14.16]
Step 3.   Given the sample size (n) and desired significance level (α), find the lower critical point of the
          rank von Neumann ratio in Table 14-1 of Appendix D. In most cases, a choice of α = .01
          should be sufficient, since only substantial non-independence is likely to affect subsequent
          statistical testing. If the computed ratio, v, is smaller than this critical point, conclude that the
          data series is strongly autocorrelated. If not, there is insufficient evidence to reject the
         hypothesis of independence; treat the data as temporally independent in subsequent statistical
         testing.
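
     For readers who want to automate Steps 1 and 2, the short Python sketch below is offered as an
illustration only (it is not part of the guidance). It implements equation [14.16], assigning mid-ranks to
ties; the function name is arbitrary and the numpy and scipy packages are assumed to be available.

    # Sketch of the rank von Neumann ratio, equation [14.16]; illustrative only.
    import numpy as np
    from scipy.stats import rankdata

    def rank_von_neumann(series):
        """Return the rank von Neumann ratio v for a time-ordered data series."""
        x = np.asarray(series, dtype=float)
        n = len(x)
        ranks = rankdata(x, method='average')     # mid-ranks assigned to any tied values
        numerator = np.sum(np.diff(ranks) ** 2)   # sum of squared lag-1 rank differences
        denominator = n * (n**2 - 1) / 12.0       # scaling term in equation [14.16]
        return numerator / denominator

    # The arsenic series of Example 14-4 gives a ratio of about 1.67.
    arsenic = [4.0, 7.2, 3.1, 3.5, 4.4, 5.1, 2.2, 6.3,
               6.5, 7.5, 5.8, 5.9, 5.7, 4.1, 3.8, 4.3]
    print(round(rank_von_neumann(arsenic), 2))    # 1.67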

       ►EXAMPLE 14-4

     Use the rank von Neumann ratio test on the following series of 16 quarterly measurements  of
arsenic (ppb) to determine whether or not the data set should be treated as temporally independent  in
subsequent tests. Compute the test at the α = .01 level of significance.
        Sample Date       Arsenic (ppb)       Rank (Ri)
        Jan 1990               4.0                5
        Apr 1990               7.2               15
        Jul 1990               3.1                2
        Oct 1990               3.5                3
        Jan 1991               4.4                8
        Apr 1991               5.1                9
        Jul 1991               2.2                1
        Oct 1991               6.3               13
        Jan 1992               6.5               14
        Apr 1992               7.5               16
        Jul 1992               5.8               11
        Oct 1992               5.9               12
        Jan 1993               5.7               10
        Apr 1993               4.1                6
        Jul 1993               3.8                4
        Oct 1993               4.3                7
       SOLUTION
Step 1.   Assign ranks to the data values as in the table above. Then list the data in chronological order
         so that each rank value occurs in the order sampled.
Step 2.   Compute the rank von Neumann ratio using the set of ranks in column 3 and equation [14.16],
          being sure to take squared differences of successive, overlapping pairs of rank values:

              v = [(15 − 5)² + (2 − 15)² + ... + (7 − 4)²] / [16(16² − 1)/12] = 568/340 = 1.67

Step 3.   Look up the lower critical point (vcp) for the rank von Neumann ratio in Table 14-1 of
          Appendix D. For n = 16 and α = .01, the lower critical point is equal to 0.93. Since the test
          statistic v is larger than vcp, there is insufficient evidence of autocorrelation at the α = .01 level
          of significance. Therefore, treat these data as statistically independent in subsequent testing. ◄

14.3 CORRECTING FOR TEMPORAL EFFECTS AND CORRELATION

14.3.1      ADJUSTING THE SAMPLING FREQUENCY AND/OR TEST METHOD

     If a data series is temporally correlated, a simple remedy (if allowable under program rules) is to
change the sampling frequency and/or statistical method used to analyze the data.  In  some cases,
increasing the sampling  interval will effectively eliminate the statistical dependence exhibited by the
series. This may happen  because the longer time between  sampling events allows more groundwater to
flow  through  the  well  screen,  further differentiating  measurements of  consecutive  volumes  of
groundwater and lessening the impact of seasonal fluctuations or other time-dependent patterns in the
underlying concentration distribution.

     Many authors including Gibbons (1994a) and ASTM (1994) have recommended that sampling be
conducted no more  often than  quarterly to avoid temporal dependence. If the sampling  frequency is
reduced, there are obviously fewer measurements available for statistical  analysis  during any  given
evaluation period. A t-test or ANOVA cannot realistically be run with fewer than four measurements per
well. A prediction limit for a future mean requires at least two new observations, and a prediction limit
for a future median requires at least three measurements, not counting any resamples. Depending on the
length of the evaluation period (i.e., quarterly, semi-annual, annual), a change of statistical  method may
also be necessary when groundwater measurements are autocorrelated.

     When sufficient background data have been collected over a longer period of time,  a prediction
limit test for future values can be run with as few as one or two new measurements per compliance well.
The same is true for control  charts. Therefore, if a low groundwater flow velocity and/or evidence of
statistical dependence suggest a reduction in sampling frequency, certain prediction  limits and control
charts should be strongly considered as alternate statistical procedures.

       RUNNING A PILOT STUDY

     An optional approach to adjusting the sampling frequency is to run a site-specific pilot study of
autocorrelation.  Such a study can be conducted in several ways, but perhaps the easiest is to pick two or
three wells from the network (perhaps one background well and one or two compliance wells) and then
conduct weekly sampling at these wells over a one year period. For each well in the study, construct the
sample autocorrelation function (Section 14.2.3) for a variety of constituents, and determine from these
graphs the  smallest lagged  interval at which the autocorrelation  coefficient becomes insignificantly
different from zero for most of the study constituents.

     Since an autocorrelation of zero  is equivalent to temporal independence for practical purposes,
finding the smallest lag between sampling events with no correlation indicates the minimum sampling
frequency  needed to approximately ensure statistical independence.  If  the sample autocorrelation
function does not drop down to zero with increasing lag (k), there may be a strong seasonal component
or a trend involved.  In these circumstances, lengthening the sampling frequency may do little to lessen
the temporal dependence. A seasonal pattern may need to be estimated instead and regularly removed
from the data prior to  statistical  testing.  Likewise, any  apparent trends  should be investigated to
determine if there is evidence of increasing concentration levels indicative of a possible release.
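
     Should a user wish to script this screening, the sketch below is one possible illustration (not a
prescribed method). It computes a sample autocorrelation function for a weekly series and reports the
smallest lag whose coefficient falls inside an approximate two-standard-error band around zero; the
function names and the default maximum lag are the author's assumptions.

    import numpy as np

    def sample_acf(x, max_lag):
        """Sample autocorrelation coefficients r_k for lags k = 1..max_lag."""
        x = np.asarray(x, dtype=float)
        n, xbar = len(x), x.mean()
        denom = np.sum((x - xbar) ** 2)
        return np.array([np.sum((x[:n - k] - xbar) * (x[k:] - xbar)) / denom
                         for k in range(1, max_lag + 1)])

    def smallest_uncorrelated_lag(x, max_lag=26):
        """First lag whose autocorrelation lies within an approximate +/- 2/sqrt(n) band."""
        band = 2.0 / np.sqrt(len(x))
        for k, r_k in enumerate(sample_acf(x, max_lag), start=1):
            if abs(r_k) < band:
                return k          # sampling every k weeks is roughly temporally independent
        return None               # no such lag; a seasonal component or trend may be present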

14.3.2      CHOOSING A SAMPLING INTERVAL VIA DARCY'S EQUATION

     Another strategy for determining an appropriate sampling interval is to use Darcy's equation. The
goal of this approach is to  calculate groundwater flow velocity and the time needed to ensure that
physically independent or distinct volumes of groundwater are collected on each sampling trip. As noted
in Chapter 6, physical independence does not guarantee statistical independence. However, statistical
independence may be more likely if the same general volume of groundwater is not re-sampled on
multiple occasions.

     This  section discusses the  important hydrological  parameters  to  consider  when choosing a
sampling interval. The Darcy equation is used to determine the horizontal component of the average
linear velocity  of ground water  for  confined,  semi-confined,  and unconfined aquifers. This value
provides a good estimate of travel  time for most soluble constituents in groundwater, and can be used to
determine a minimal sampling interval. Example calculations are provided to further assist the reader.
Alternative methods should be employed to determine a sampling interval in groundwater environments
where Darcy's law is invalid. Karst, cavernous basalt, fractured rocks, and other 'pseudo-karst' terranes
usually require specialized monitoring approaches.

     Section 264.97(g) of 40 CFR Part 264 Subpart F allows the owner or operator of a RCRA facility
to choose a sampling procedure that will reflect site-specific concerns. It specifies that the owner or
operator shall obtain a sequence of at least four samples from each well  collected at least semi-annually.
The interval  is  determined after evaluating the  uppermost aquifer's  effective  porosity,  hydraulic
conductivity, and hydraulic gradient, and the fate and transport characteristics of potential contaminants.
The intent of this  provision is to set a sampling frequency that allows sufficient time between sampling
events to ensure, to the greatest extent technically feasible, that independent groundwater observations
are taken from each well.

     The  sampling frequency required  in Part 264  Subpart  F  can be based on estimates using the
average linear  velocity  of  ground water.  Two  forms of the Darcy equation stated  below relate
groundwater velocity (V) to effective porosity (Ne), hydraulic gradient (i), and hydraulic conductivity
(K):
                                         Vh = (Kh × i)/Ne                                  [14.17]

                                         Vv = (Kv × i)/Ne                                  [14.18]

where  Vh and Vv are the  horizontal and  vertical  components  of the average  linear velocity  of
groundwater,  respectively; Kh  and  Kv are  the  horizontal and  vertical components  of hydraulic
conductivity, respectively; i is the hydraulic (head) gradient; and Ne is the effective porosity.

     In applying these equations to ground-water monitoring, the horizontal component of the average
linear  velocity (Vh)  can  be used to determine  an appropriate  sampling interval. Usually, field
investigations  will yield  bulk values for  hydraulic conductivity.  In most  cases, the bulk hydraulic
conductivity determined by a pump test, tracer test, or a slug test will be sufficient for these calculations.
The vertical component (Vv), however, should be considered in estimating flow velocities in areas with
significant components of vertical velocity such as recharge and discharge zones.

     To apply the Darcy equation to groundwater monitoring, the parameters K, i, and Ne need to be
determined. The hydraulic conductivity, K, is the volume of water at the existing kinematic viscosity that
will move in unit time under a unit hydraulic gradient through a unit area measured at right angles to the
direction of flow. "[E]xisting kinematic viscosity" refers to the fact that hydraulic conductivity is not
only determined by the media  (aquifer),  but  also by  fluid properties (groundwater or potential
contaminants). Thus, it is possible to have several hydraulic conductivity values for different chemical
substances present in the same aquifer. The lowest velocity value calculated using  the Darcy equation
should be used to determine sampling intervals, ensuring physical independence  of  consecutive sample
measurements.

                 Figure 14-6.  Hydraulic Conductivity of  Selected  Rocks
     [Chart showing typical ranges of hydraulic conductivity (in ft/d and gal/d/ft²) for unfractured and
     fractured igneous and metamorphic rocks, basalt, sandstone, shale, and carbonate rocks, and for
     clay, silt/loess, silty sand, clean sand, glacial till, and gravel.]

     Source: Heath, R.C. (1983). Basic Ground-Water Hydrology. U.S. Geological Survey, Water Supply
     Paper 2220, 84 pp.
     A range of hydraulic conductivities (the transmitted fluid is water) for various aquifer materials is
given in Figures  14-6 and  14-7. The conductivities are given in several  units.  Figure  14-8 lists
conversion factors to change between various permeability and hydraulic conductivity units.

     The hydraulic gradient, i, is the change in hydraulic head per unit of distance in a given direction. It
can be determined by dividing the difference in head between two points on a potentiometric surface
map by the orthogonal distance between those two points (see calculation in Example 14-5). Water level
measurements are normally used to determine the natural hydraulic gradient at a facility. However, the
effects of mounding in the event  of a release  may produce a steeper local hydraulic gradient in the
vicinity of the monitoring well. These local changes in hydraulic gradient should be accounted for in the
velocity calculations.
       Figure  14-7.  Range of Values of Hydraulic Conductivity and Permeability
     [Chart comparing ranges of permeability, k (darcy, cm²), and hydraulic conductivity, K (cm/s, m/s,
     gal/day/ft²), for unconsolidated deposits and various rock types.]
 Figure 14-8. Conversion Factors for Permeability and Hydraulic Conductivity Units

                               Permeability, k*                       Hydraulic conductivity, K
                  cm²           ft²           darcy         m/s           ft/s          gal/day/ft²
   cm²            1             1.08×10⁻³     1.01×10⁸      9.80×10²      3.22×10³      1.85×10⁹
   ft²            9.29×10²      1             9.42×10¹⁰     9.11×10⁵      2.99×10⁶      1.71×10¹²
   darcy          9.87×10⁻⁹     1.06×10⁻¹¹    1             9.66×10⁻⁶     3.17×10⁻⁵     1.82×10¹
   m/s            1.02×10⁻³     1.10×10⁻⁶     1.04×10⁵      1             3.28          2.12×10⁶
   ft/s           3.11×10⁻⁴     3.35×10⁻⁷     3.15×10⁴      3.05×10⁻¹     1             6.46×10⁵
   gal/day/ft²    5.42×10⁻¹⁰    5.83×10⁻¹³    5.49×10⁻²     4.72×10⁻⁷     1.55×10⁻⁶     1

     *To obtain k in ft², multiply k in cm² by 1.08×10⁻³.

     Source: Freeze, R.A., and J.A. Cherry (1979). Ground Water. Prentice Hall, Inc., Englewood Cliffs,
     New Jersey, p. 29.
     The effective porosity, Ne, is the ratio, usually expressed as a percentage, of the total volume of
voids available for fluid transmission  to the total volume of the porous medium de-watered.  It can be
estimated during a pump test by dividing the volume of water removed from an aquifer by the total
volume of aquifer dewatered (see calculation  in Example 14-5). Figure 14-9 presents approximate
effective porosity values for a  variety of aquifer materials. In cases where the effective porosity is
unknown, specific yield may be substituted into the equation. Specific yields of selected rock units are
given in Figure  14-10. In the absence  of measured  values,  drainable porosity  is often  used to
approximate effective porosity. Figure 14-11 illustrates representative values of drainable porosity and
total porosity as a function of aquifer particle size. If available, field measurements of effective porosity
are preferred.
    Figure 14-9.  Default Values of Effective Porosity (Ne) For Travel Time Analyses

        Soil textural classes                                        Effective porosity of saturation (a)

        Unified soil classification system
             GW, GP, GM, GC, SW, SP, SM, SC                          0.20 (20%)
             ML, MH                                                  0.15 (15%)
             CL, OL, CH, OH, PT                                      0.01 (1%) (b)

        USDA soil textural classes
             Clays, silty clays, sandy clays                         0.01 (1%) (b)
             Silts, silt loams, silty clay loams                     0.10 (10%)
             All others                                              0.20 (20%)

        Rock units (all)
             Porous media (non-fractured rocks such as sandstone
             and some carbonates)                                    0.15 (15%)
             Fractured rocks (most carbonates, shales, granites, etc.)   0.0001 (0.01%)

        Source: Barari, A., and L.S. Hedges (1985). Movement of Water in Glacial Till. Proceedings of
        the 17th International Congress of the International Association of Hydrogeologists, pp. 129-
        134.

        (a) These values are estimates and there may be differences between similar units. For example,
        recent studies indicate that weathered and unweathered glacial till may have markedly
        different effective porosities (Barari and Hedges, 1985; Bradbury et al., 1985).

        (b) Assumes de minimus secondary porosity. If fractures or soil structure are present, effective
        porosity should be 0.001 (0.1%).
               Figure 14-10. Specific Yield Values for Selected Rock Types

                    Rock Type                           Specific Yield (%)
        Clay                                                    2
        Sand                                                   22
        Gravel                                                 19
        Limestone                                              18
        Sandstone (semi-consolidated)                           6
        Granite                                                 0.09
        Basalt (young)                                          8

Source: Heath, R.C. (1983). Basic Ground-Water Hydrology. U.S. Geological Survey, Water Supply Paper
2220, 84 pp.
     Once the values for K, i, and Ne are determined, the horizontal component of average linear
groundwater velocity  can be calculated. Using the Darcy equation [14.17],  the  time required for
groundwater to pass through the complete monitoring well diameter can be determined by dividing the
well diameter by the horizontal component of the average linear groundwater velocity.  If considerable
exchange of water occurs during well purging, the diameter of the filter pack may be used rather than the
well diameter. This value represents  the minimum time  interval  required  between sampling  events
yielding  a physically  independent (i.e., distinct) ground-water sample.  Note that three-dimensional
mixing of groundwater in the vicinity of the monitoring well is likely to occur when the well is purged
before sampling. Partly for that reason, this method can only provide an estimated travel time.

  Figure 14-11. Total  Porosity and Drainable Porosity for Typical Geologic Materials
     [Chart of total porosity and drainable porosity as a function of aquifer particle size, from clay
     through gravel.]


       ►EXAMPLE 14-5

     Compute the effective porosity, Ne, expressed as a percent (%), using results obtained during a
pump test.

       SOLUTION
Step 1 .    Compute the effective porosity using the following equation:

                Ne = 100% x volume of water removed/ volume of aquifer dewatered         [14.19]

Step 2.    Based on a pumping rate of 50 gal/min and a pumping  duration  of 30  min, compute the
         volume of water removed as:

                   volume of water removed = 50 gal/min x 30 min = 1,500 gal

Step 3.   To calculate the volume of aquifer de-watered, use the equation:

                                        V = (1/3) π r² h                                   [14.20]

          where r is the radius (in ft) of the area affected by pumping and h (ft) is the drop in the water
          level. If, for example, h = 3 ft and r = 18 ft, then:

                               V = (1/3)(3.14 × 3 × 18²) = 1,018 ft³

         Next, converting cubic feet of water to gallons of water,

                             V = 1,018 ft3 x 7.48  gal/ft3 = 7,615 gal

Step 4.   Finally, substitute the two volumes from Step 3 into equation [14.19] to obtain the effective
         porosity:

                          Ne = 100% × (1,500 gal/7,615 gal) = 19.7%   ◄
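
     The arithmetic of Example 14-5 can be scripted directly. The sketch below is illustrative only; it
follows the example in treating the dewatered volume as a cone of depression (equation [14.20]), and the
function and argument names are arbitrary.

    import math

    def effective_porosity_percent(pump_rate_gpm, duration_min, radius_ft, drawdown_ft):
        """Effective porosity (%) from pump test results, equations [14.19] and [14.20]."""
        water_removed_gal = pump_rate_gpm * duration_min                  # 50 x 30 = 1,500 gal
        aquifer_volume_ft3 = (1.0 / 3.0) * math.pi * radius_ft ** 2 * drawdown_ft
        aquifer_volume_gal = aquifer_volume_ft3 * 7.48                    # 7.48 gal per cubic foot
        return 100.0 * water_removed_gal / aquifer_volume_gal

    print(round(effective_porosity_percent(50, 30, 18, 3), 1))            # approximately 19.7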

       ►EXAMPLE 14-6

     Determine the hydraulic gradient, i, from a potentiometric surface map.

       SOLUTION
Step 1.   Consider the potentiometric surface map in Figure 14-12. The hydraulic gradient can be
          constructed as i = Δh/l, where Δh is the difference in head measured at piezometers Pz1 and
          Pz2, and l is the orthogonal distance between the two piezometers.
  Figure 14-12.  Potentiometric Surface Map for Computation of Hydraulic Gradient
     [Map sketch showing two piezometers, Pz1 and Pz2, located 100 ft apart along the direction of
     flow, with potentiometric contours of 29.1 ft and 29.2 ft and a north arrow.]
Step 2.   Using the values given in Figure 14-12, the hydraulic gradient is computed as:

                          i = Δh/l = (29.2 ft − 29.1 ft)/100 ft = 0.001 ft/ft

Step 3.   Note that this method provides only a very general estimate of the natural hydraulic gradient
         existing in the vicinity of the two piezometers. Chemical gradients are known to exist and may
         override the effects of the hydraulic gradient. A  detailed study of the effects of multiple
         chemical contaminants may be necessary to determine the actual average linear groundwater
          velocity (horizontal component) in the vicinity of the monitoring wells. ◄

       ►EXAMPLE 14-7

     Determine the horizontal component of the average linear groundwater velocity (Vh) at  a land
disposal facility which has monitoring wells screened in an unconfined silty sand aquifer.

       SOLUTION
Step 1.   Slug tests, pump tests, and tracer tests conducted during a hydrologic site investigation have
         revealed that the  aquifer has a horizontal hydraulic conductivity (Kh) of 15  ft/day and an
         effective porosity (Ne) of 15%. Using a potentiometric map (as in Example 14-6), the regional
          hydraulic gradient (i) has been determined to be 0.003 ft/ft.

Step 2.   To estimate the minimum time interval between sampling events enabling the collection of
         physically independent  samples of ground water, calculate the horizontal component of the
          average linear groundwater velocity (Vh) using Darcy's equation [14.17]. With Kh = 15 ft/day,
          Ne = 15%, and i = 0.003 ft/ft, the velocity becomes:

                    Vh = (15 ft/day × 0.003 ft/ft)/15% = 0.3 ft/day, or 3.6 in/day

Step 3.   Based on  these calculations,  the horizontal component of the average linear groundwater
         velocity, Vh, is  equal to 3.6 in/day.  Since monitoring well  diameters at this particular facility
         are  4 inches, the minimum time interval between sampling events enabling a physically
         independent groundwater sample can be computed by dividing the horizontal component into
         the monitoring well diameter:

                       Minimum time interval = (4 in)/(3.6 in/day) = 1.1 days

         As a result, the facility could theoretically  sample  every other day. However, this may be
         unwise because velocity can seasonally vary with recharge rates. It is also emphasized that
         physical independence does not guarantee statistical independence. Figure 14-13 gives results
         for common  situations. The  overriding point  is that it may not  be necessary to set the
         minimum sampling frequency to quarterly at every site.  Some hydrologic environments may
          allow for more frequent sampling, some less. ◄
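
     The calculation chain of Example 14-7 is summarized in the sketch below, provided as an
illustration only. It applies equation [14.17] and the well-diameter division described above; the
function and variable names are the author's.

    def horizontal_velocity_ft_per_day(k_h_ft_per_day, gradient, effective_porosity):
        """Horizontal component of average linear velocity, Vh = (Kh x i)/Ne, equation [14.17]."""
        return k_h_ft_per_day * gradient / effective_porosity

    def minimum_sampling_interval_days(well_diameter_in, k_h_ft_per_day, gradient,
                                       effective_porosity):
        """Days for groundwater to traverse the full well (or filter pack) diameter."""
        v_h_in_per_day = 12.0 * horizontal_velocity_ft_per_day(
            k_h_ft_per_day, gradient, effective_porosity)
        return well_diameter_in / v_h_in_per_day

    # Example 14-7: Kh = 15 ft/day, i = 0.003 ft/ft, Ne = 0.15, 4-inch well --> about 1.1 days
    print(round(minimum_sampling_interval_days(4.0, 15.0, 0.003, 0.15), 1))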

  Figure 14-13. Typical Darcy Equation Results in Determining a Sampling Interval

   Unit                               Kh (ft/day)    Ne (%)    Vh (in/mo)     Sampling Interval
   Gravel                               10⁴            19       9.6×10⁴         Daily
   Sand                                 10²            22       8.3×10²         Daily
   Silty Sand                           10             14       1.3×10²         Weekly
   Till                                 10⁻³            2       9.1×10⁻²        Monthly
   Silty Sand (semi-consolidated)        1              6       30              Weekly
   Basalt                               10⁻¹            8       2.28            Monthly

14.3.3      CREATING ADJUSTED, STATIONARY MEASUREMENTS
     When an  existing  data set exhibits temporal  correlation or other variability, it is sometimes
possible to remove the temporal pattern and thereby create a set of adjusted data which are uncorrelated
and stationary over time in mean level. As  long  as the same temporal pattern seems to  affect both
background and the compliance point data to be tested, the effect (e.g., regular seasonal fluctuation) can
be estimated  and removed from  both data sets prior to statistical testing. Testing the adjusted data
instead of the  raw measurements in this way results in a more powerful and accurate test. An extraneous
source of variation not related to identifying a contaminant  release has been removed from the sample
data.

     The general topic of stationary, adjusted data is  complex, contained within the extensive literature
on time series. The Unified Guidance discusses two  simple cases below: removing a seasonal pattern
from a single well and creating adjusted data from a one-way ANOVA for temporal effects across
several wells.  More complicated situations may need professional consultation.

14.3.3.1    CORRECTING FOR SEASONAL  PATTERN IN A SINGLE WELL
       BACKGROUND AND PURPOSE

     Sometimes an obvious cyclical seasonal pattern can be seen in a time series plot. Such data are not
statistically independent.  They do not fluctuate randomly but rather in a predictable way from one
sampling event to the next. Data from such patterns can be adjusted to correct for or remove the seasonal
fluctuation, but only if a longer series of data is available. This is also  known  as deseasonalizing the
data. Seasonal correction should be done both to minimize the chance of mistaking a seasonal effect for
evidence of contaminated groundwater, and also to build  more powerful background-to-compliance
point tests.

     Problems can arise, for instance, from measurement variations associated with changing recharge
rates during different seasons. Compliance point concentrations can exceed a groundwater protection
standard [GWPS] for a portion of the year, but on average lie below.  If the long-term  average is of
regulatory concern, the data should first be de-seasonalized before comparing it against a GWPS.

     If point-in-time,  interwell  comparisons  are  being  made  between  simultaneously collected
background and downgradient data, a correction may not be necessary even when seasonal fluctuations
exist. A temporal cycle may  cover a period of  several  years  so that  both the background and
downgradient values are observed on essentially the same parts of the overall cycle.  In  this case, the
short-term  averages  in both  data  sets  will be  fairly stable and the seasonal  or  cyclical  effect  may
equivalently impact both sets of data.

     For intrawell tests, the data need to be collected sequentially at each well, with background formed
from the earliest measurements in the series. The  point-in-time  argument would not apply and the
presence of seasonality should be checked and accounted for.

     Even with interwell testing, it is sometimes difficult to verify whether or not a seasonal pattern is
impacting upgradient and compliance point wells similarly.  If the groundwater velocity is low, the lag
between the time groundwater passes through a background well screen and then travels downgradient
may create a noticeable shift  as to when corresponding portions of the  seasonal cycle are observed in
compliance point locations. It also may be the case that differences in geochemistry from well to well
may cause the  same seasonal pattern  to  differentially impact  concentration levels at distinct  wells
(Figure 14-14).
    Figure 14-14.  Differential  Seasonal Effects:  Background vs.  Compliance Wells
     [Time series plot of concentration versus week (0 to 100) showing seasonal cycles at a background
     well and a compliance well that differ in timing and magnitude.]
     If the timing of the cycle and the relative magnitude of the concentration swings are essentially the
same in upgradient and downgradient wells, both data sets should be deseasonalized prior to statistical
analysis. If the seasonal effects are ignored, real differences in mean levels between upgradient and
downgradient well data may not be observed, simply because the short-term seasonal fluctuations add
variability that can mask the difference being tested. In this case, the  non-independent nature of the
seasonal pattern adds unwanted noise to the observations, obscuring statistical evidence of groundwater
contamination.

       REQUIREMENTS AND ASSUMPTIONS

     Seasonal correction is only appropriate for wells where a cyclical pattern is clearly present and very
regular  over time. Many approaches to deseasonalizing data exist.  If the  seasonal pattern is highly
regular, it may be modeled with a sine or cosine function.  Often, moving averages and/or lag-based
differences (of order 12 for monthly data, for example) are used. General time series models may include
these and other more complicated methods for deseasonalizing the data.

     The  simple method  described in the Unified Guidance has  the advantage  of being  easy  to
understand and apply, and of providing natural estimates of the monthly or quarterly seasonal effects via
the monthly or quarterly  means.  The method can be applied to any seasonal or recurring cycle— perhaps
an annual cycle  for monthly  or  quarterly  data or a longer cycle for certain kinds  of geologic
environments.  In some cases, recharge rates are linked to  drought cycles that may be  on the order of
several years long. For these situations, the 'seasonal' cycle may not correspond to typical fluctuations
over the course of a single year.

     Corrections for  seasonality  should  be used cautiously,  as they  represent extrapolation into the
future. There  should  be a good  physical  explanation for the seasonal  fluctuation  as  well  as  good
empirical evidence for seasonality before corrections are made. Higher than average rainfall for two or
three Augusts in a row does not justify the belief that there will never  be a drought in August, and this
idea extends directly  to groundwater quality.  At least three  complete cycles  of the seasonal pattern
should be observed on a time series plot  before  attempting the adjustment below. If seasonality is
suspected but the pattern is complicated, the user should seek the help of a professional statistician.

       PROCEDURE

Step 1.   If a time series plot clearly shows at least 3 full cycles of the seasonal pattern, determine the
         length of time to complete one full cycle.  Since the correction  presumes a regular sampling
          schedule, determine the number of observations (k) in each full cycle (this number should be
          the same for each cycle). Then, assuming that N complete cycles of data are available, let xij
          denote the raw, unadjusted measurement for the ith sampling event during the jth complete
         cycle. Note  that this could represent monthly data over an annual cycle, quarterly data over a
         biennial cycle, semi-annual data over a 10-year cycle, etc.

Step 2.   Compute the mean concentration for sampling event i over the N-cycle period:

                                 x̄i  =  (xi1 + xi2 + ... + xiN)/N                          [14.21]

         This is the average of all observations taken in different cycles, but during the same sampling
         event.  For instance, with monthly data over an annual cycle, one would calculate the mean
         concentrations for all Januarys, the  mean for all Februarys, and so on for each of the 12
         months.

Step 3.   Calculate the grand mean, x̄, of all N×k observations:

                            x̄  =  (1/(N·k)) Σ(i=1 to k) Σ(j=1 to N) xij                    [14.22]

Step 4.   Compute seasonally-corrected, adjusted concentrations using the equation:

                                         zij = xij − x̄i + x̄                                [14.23]

          Computing xij − x̄i removes the average seasonal effect of sampling event i from the data
          series. Adding back the overall mean, x̄, gives the adjusted zij values the same mean as the
          raw, unadjusted data. Thus, the overall mean of the corrected values, z̄, equals the overall
          mean value, x̄.
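
     The seasonal correction of equations [14.21] through [14.23] amounts to a few lines of code. The
sketch below is an illustration only; it assumes the series is complete and in time order, and the function
name is arbitrary.

    import numpy as np

    def deseasonalize(series, k):
        """Remove a regular seasonal pattern per equations [14.21]-[14.23].

        series : measurements in time order, length N*k (N complete cycles of k events each).
        Returns the adjusted values z_ij = x_ij - xbar_i + grand mean, in the original order.
        """
        x = np.asarray(series, dtype=float).reshape(-1, k)   # rows = cycles, columns = events
        event_means = x.mean(axis=0)                          # xbar_i, equation [14.21]
        grand_mean = x.mean()                                 # equation [14.22]
        return (x - event_means + grand_mean).ravel()         # equation [14.23]

    # For Example 14-8, series would hold the 36 monthly values in time order and k = 12.
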
       ►EXAMPLE 14-8

     Consider the monthly unadjusted  concentrations of an analyte over a 3-year period graphed in
Figure 14-15 and listed in the table below. Given the clear and regular seasonal pattern, use the above
method to produce a deseasonalized data set.
                    Unadjusted Concentrations      Monthly       Adjusted Concentrations
                     1983     1984     1985        Average        1983     1984     1985
 January              1.99     2.01     2.15         2.05          2.11     2.13     2.27
 February             2.10     2.10     2.17         2.12          2.14     2.14     2.21
 March                2.12     2.17     2.27         2.19          2.10     2.15     2.25
 April                2.12     2.13     2.23         2.16          2.13     2.14     2.24
 May                  2.11     2.13     2.24         2.16          2.12     2.14     2.25
 June                 2.15     2.18     2.26         2.20          2.12     2.15     2.23
 July                 2.19     2.25     2.31         2.25          2.11     2.17     2.23
 August               2.18     2.24     2.32         2.25          2.10     2.16     2.24
 September            2.16     2.22     2.28         2.22          2.11     2.17     2.23
 October              2.08     2.13     2.22         2.14          2.10     2.15     2.24
 November             2.05     2.08     2.19         2.11          2.11     2.14     2.25
 December             2.08     2.16     2.22         2.16          2.09     2.17     2.23

 Overall 3-year average = 2.17

       SOLUTION
Step 1.   From Figure 14-15, there are N = 3 full cycles represented, each lasting approximately a year.
         With monthly data, the number of sampling events per cycle is k = 12.

Step 2.   Compute the monthly averages across  the 3 years for each of the 12 months using equation
         [14.21]. These values are shown in the fifth column of the table above.

Step 3.   Calculate the grand mean over the 3-year period using equation [14.22]:

                   x̄  =  (1/(3×12)) (1.99 + 2.01 + 2.15 + 2.10 + ... + 2.22)  =  2.17

Step 4.   Within each month and year,  subtract the average monthly concentration for that month and
         add-in the grand mean, using equation [14.23]. As an example, for January 1983, the adjusted
         concentration becomes:

                                  z11 = 1.99 − 2.05 + 2.17 = 2.11
            Figure 14-15. Seasonal Time Series Over a Three-Year Period
              [Plot of the unadjusted and adjusted monthly concentrations from Jan-83 through Sep-85,
              with the 3-year mean shown for reference.]
         The  adjusted concentrations are shown in the last three columns  of the table  above. The
         average of all 36  adjusted concentrations equals  2.17, the same  as the mean unadjusted
         concentration. Figure 14-15 shows the adjusted data superimposed on the unadjusted data.
         The raw data exhibit seasonality, as well as an upward trend. The adjusted data, on the other
         hand, no longer exhibit a seasonal pattern, although the upward trend still remains. From a
         statistical standpoint, the trend is much more easily identified by a trend test on the adjusted
          data than with the raw data. ◄
14.3.3.2    CORRECTING FOR A TEMPORAL EFFECT ACROSS SEVERAL WELLS
       BACKGROUND AND PURPOSE

     When a significant temporal dependence or correlation is identified across a group of wells using
one-way ANOVA for temporal effects (Section 14.2.2), results of the ANOVA can be used to create
stationary adjusted data similar to the seasonal correction described in Section 14.3.3.1. The difference
is that the adjustment is not applied to a data series at a single well, but rather simultaneously to several
well sets.

     The adjustment works in the same way as a correction for seasonality. First, the mean for each
sampling event or season (averaged across wells) is computed along with the grand mean. Then each
individual measurement is adjusted by subtracting off the event/seasonal mean and adding the overall or
grand mean. In practice, this process is identical to  adding the one-way ANOVA residual to the grand
mean, so the already-computed results of the ANOVA can be used.  By  removing or correcting for a
significant temporal effect, the adjusted  data will have a temporally stationary mean and less  overall
variation. This allows for more powerful  and accurate detection monitoring tests.

     Temporal  dependence (e.g., seasonality) is sometimes observed as parallel traces on a time series
plot across multiple wells (Section 14.2.1), even though the one-way ANOVA for temporal effects is non-
significant.  This can occur due to the simultaneous presence of strong spatial variability (Chapter 13).
Differences in mean levels from well to well can be large enough to 'swamp' the added variation due to
the temporal dependence. The one-way ANOVA for temporal effects will not identify the dependence
because the mean error sum  of squares will then include the spatial variation component and not just
random error.

     Two remedies are possible when the ANOVA for temporal effects  is non-significant. First, if a
strong parallelism is evident on time series plots, the residuals from the ANOVA can still be used to
create a set of adjusted, temporally-stationary measurements.  The adjustment will not eliminate  or
remove any  existing spatial variation, but it may not matter. Intrawell  tests are needed anyway when
such spatial variability is evident,  and those tests assume temporal independence of the measurements
collected at each well.

     A second remedy is to perform a two-way  ANOVA, testing for both spatial variation and temporal
effects.  This procedure is discussed in Davis (1994). Not only will a two-way ANOVA more  readily
identify a significant temporal effect even when there is simultaneous spatial variability,  but  the F-
statistic used to test for the temporal  dependence can be utilized to further adjust the appropriate degrees
of freedom in intrawell background limits, such as prediction limits and control charts.

       REQUIREMENTS AND ASSUMPTIONS

     The key requirement to correct for a temporal effect using ANOVA is that the same effect must be
present  in all wells to which the adjustment is applied. Otherwise, the adjustment will tend to skew or
bias measurements at wells with no observable temporal dependence. Parallel time series plots (Section
14.2.1)  should be examined  to determine whether all the wells under  consideration exhibit a  similar
temporal pattern.

     The parametric one-way ANOVA assumes the data are normal or can be normalized. If the data
cannot be normalized,  a Kruskal-Wallis non-parametric ANOVA can be conducted to  test for  the
presence of a temporal dependence. In this case, no residuals can be computed since the Kruskal-Wallis
test employs ranks of the data rather than the measurements themselves.  So the adjustment presented
below is only applicable for data sets that can be normalized.

     PROCEDURE

Step 1.   Given a set of W wells and measurements from each of T sampling events at each well in each
          of K years, label the observations as xijk, for i = 1 to W, j = 1 to T, and k = 1 to K. Then xijk
          represents the measurement from the ith well on the jth sampling event during the kth year.
Step 2.   Using the one-way ANOVA for temporal effects (Section 14.2.2), compute the sampling
         event or seasonal means (whichever is appropriate), along with the grand (overall) mean. Also
         construct the ANOVA residuals using either equation [14.5] or [14.6].

Step 3.   Add each residual to the grand mean to form adjusted values zijk = x̄ + rijk. Use these
          adjusted values in subsequent statistical testing instead of the original measurements.
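
     A minimal sketch of this adjustment is given below for illustration; it assumes the residual for each
observation is its value minus the corresponding sampling-event mean across wells (consistent with
Example 14-9), and the function name is arbitrary.

    import numpy as np

    def adjust_for_temporal_effect(data):
        """Adjusted values z = grand mean + residual, per Section 14.3.3.2.

        data : 2-D array with one row per sampling event and one column per well.
        """
        x = np.asarray(data, dtype=float)
        event_means = x.mean(axis=1, keepdims=True)   # mean of each sampling event across wells
        residuals = x - event_means                   # one-way ANOVA residuals (event factor)
        return x.mean() + residuals                   # add back the grand mean

     Applied to the 8-event by 4-well table of manganese measurements underlying Example 14-9, this
should reproduce the adjusted concentrations listed in that example.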

       ►EXAMPLE 14-9

     The manganese data of Examples  14-1 and  14-2 were found to have a significant temporal
dependence using ANOVA for temporal effects. Adjust these data to remove the temporal pattern.
                                  Manganese Residuals (ppm)
   Qtr     Event Mean        BW-1         BW-2         BW-3         BW-4
    1        29.290         -1.15         2.12        -2.14         1.17
    2        30.110         -0.78         0.16         0.13         0.49
    3        30.780         -0.33         1.79        -1.64         0.18
    4        31.620          0.80         1.15        -1.03        -0.92
    5        33.747          0.6225      -0.7175       1.1325      -1.0375
    6        31.930          1.32         0.25        -1.40        -0.17
    7        30.513          0.5075      -1.6625      -0.1825       1.3375
    8        30.345         -1.845        2.535        0.075       -0.765

                                  Grand mean = 31.042

       SOLUTION
Step 1.   The mean of each sampling event taken across the four background wells was computed in
         Example 14-2, along with the grand mean. These results are listed in the table above, along
         with the individual residuals which were also computed in that example.

Step 2.   Add the grand mean to each residual to form the adjusted manganese concentrations, as in the
         table below.
                                  Adjusted Manganese (ppm)
   Qtr     Event Mean        BW-1         BW-2         BW-3         BW-4
    1        29.290          29.89        33.16        28.90        32.21
    2        30.110          30.26        31.20        31.17        31.53
    3        30.780          30.71        32.83        29.40        31.22
    4        31.620          31.84        32.19        30.01        30.12
    5        33.747          31.66        30.32        32.17        30.00
    6        31.930          32.36        31.29        29.64        30.87
    7        30.513          31.55        29.38        30.86        32.38
    8        30.345          29.20        33.58        31.12        30.28

                                  Grand mean = 31.042
Step 3.   Plot a time  series of the adjusted manganese values, as in  Figure 14-16. The 'hump-like'
         temporal pattern evident in Figure 14-2 is no longer apparent. Instead, the overall mean is
          stationary across the 8 quarters. ◄
   Figure 14-16. Parallel Time Series Plot of Adjusted Manganese Concentrations
           [Time series plot of the adjusted manganese concentrations at wells BW-1 through BW-4
           across the 8 quarters.]
14.3.3.3    CORRECTING FOR LINEAR TRENDS

     If a data series exhibits a linear trend, the sample will exhibit temporal dependence when tested via
the sample autocorrelation function (Section  14.2.3), the rank von Neumann ratio (Section 14.2.4), or
similar procedure.  These data can be de-trended, much like the  data in the previous  example were
deseasonalized.  Probably the easiest way to de-trend observations with a linear trend is to compute a
linear regression on the data (Section 17.3.1) and then use the regression residuals instead of the original
measurements in subsequent statistical analysis.
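
     Where de-trending is judged appropriate (see the caution below), the operation itself is simple. The
following sketch is an illustration only; it fits an ordinary least-squares line and returns the residuals,
and the function name is arbitrary.

    import numpy as np

    def detrend_linear(times, values):
        """Residuals from an ordinary least-squares linear fit (the Section 17.3.1 approach)."""
        t = np.asarray(times, dtype=float)
        y = np.asarray(values, dtype=float)
        slope, intercept = np.polyfit(t, y, deg=1)    # least-squares trend line
        return y - (intercept + slope * t)            # de-trended residuals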

     But no matter how tempting it may be to  automatically  de-trend data of this sort, the user is
strongly cautioned to consider what a linear trend may represent. Often, an upward trend is indicative of
changing groundwater conditions at a site, perhaps due to the increasing presence of contaminants
during a gradual waste release. The trend in this case may itself be statistically significant evidence of
groundwater contamination,   particularly  if  it occurs  at  compliance  wells but  not  at  upgradient
background wells. The  trend tests of Chapter 17 are useful for such determinations. Trends in
background may  signal  other important factors, including migration of contaminants from off-site
sources, changes in the regional aquifer, or possible groundwater mounding.

     The overriding point is that data should be deseasonalized when a cyclical pattern might obscure
the random deviations around an otherwise stable average concentration level, or to more clearly identify
an existing trend. However, a linear trend is inherently indicative of a changing mean level. Such data
should not be de-trended before it is determined what the trend likely represents, and whether or not it is
itself prima facie evidence of possible groundwater contamination.

     A similar trend both in direction and slope may be exhibited by background wells and compliance
wells, perhaps suggestive of sitewide changes in natural groundwater conditions. Residuals from a one-
way  ANOVA for temporal effects (Section 14.2.2) can be used to simultaneously create adjusted values
across the well network (Section 14.3.3.2). Linear trends are just as easily identified and adjusted in this
way  as are parallel seasonal fluctuations or other temporal effects.
14.3.4      IDENTIFYING  LINEAR  TRENDS  AMIDST  SEASONALITY:  SEASONAL
       MANN-KENDALL TEST

       BACKGROUND AND PURPOSE

     Corrections for seasonality or other cyclical patterns over time in a single well  are discussed in
Section 14.3.3.1. These adjustments work best when the long-term mean at the well is stationary. In
cases where a test for trend is desired and there are also seasonal fluctuations, Chapter 17 tests may not
be sensitive enough to detect a real trend due to the added seasonal variation.

     One possible remedy is to use the seasonal  correction in Section  14.3.3.1 and illustrated in
Example 14-8.  The seasonal component of the trend is removed prior to conducting a formal trend test.
A second option is the seasonal Mann-Kendall test (Gilbert, 1987).

     The seasonal Mann-Kendall is a simple modification to the  Mann-Kendall test for trend (Section
17.3.2) that accounts for apparent seasonal fluctuations. The basic idea is to divide a longer multi-year
data series  into subsets, each  subset representing the measurements collected on a common sampling
event (e.g., all January events or all fourth quarter events). These subsets then represent different points
along the regular seasonal cycle, some associated with peaks and others with troughs. The usual Mann-
Kendall test is performed on each subset separately and a Mann-Kendall test statistic Si is formed for
each. Then the separate Si statistics are summed to get an overall Mann-Kendall statistic S.

     Assuming that the same basic trend impacts each subset,  the combined statistic S will be powerful
enough to identify a trend despite the seasonal fluctuations.

       REQUIREMENTS AND ASSUMPTIONS

     The basic requirements of the Mann-Kendall trend test are discussed in Section  17.3.2. The only
differences with the seasonal Mann-Kendall test are that  1) the  sample should be a multi-year series with
an observable seasonal  pattern each year; 2) each 'season' or subset of the overall series should include
at least three measurements in order to  compute the  Mann-Kendall  statistic;  and 3) a  normal
approximation to the overall Mann-Kendall test statistic  must be tenable. This will generally be the case
if the series has at least  10-12 measurements.


       PROCEDURE

Step 1.   Given a series of measurements from each of T sampling events in each of K years, label the
          observations as xij for i = 1 to T, and j = 1 to K. Then xij represents the measurement from the
          ith sampling event during the jth year.

Step 2.   For each distinct sampling event (i), form a seasonal subset by grouping together observations
          xi1, xi2, ..., xiK. This results in T separate seasons.

Step 3.   For each seasonal subset, use the procedure in Section 17.3.2 to compute the Mann-Kendall
          statistic Si and its standard deviation SD[Si]. Form the overall seasonal Mann-Kendall statistic
          (S) and its standard deviation with the equations:

                                        S = Σ(i=1 to T) Si                                  [14.24]

                                 SD[S] = sqrt[ Σ(i=1 to T) SD[Si]² ]                        [14.25]

Step 4.   Compute the normal approximation to the seasonal Mann-Kendall statistic using the equation:

               Z = (S − 1)/SD[S] if S > 0;   Z = 0 if S = 0;   Z = (S + 1)/SD[S] if S < 0   [14.26]
Step 5.   Given a significance level, α, determine the critical point zcp from the standard normal
          distribution in Table 10-1 of Appendix D. Compare Z against this critical point. If Z > zcp,
          conclude there is statistically significant evidence at the α-level of an increasing trend. If
          Z < −zcp, conclude there is statistically significant evidence of a decreasing trend. If neither,
          conclude that the sample evidence is insufficient to identify a trend.
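
     The procedure above can be scripted as follows. The sketch is an illustration only; it assumes the
series is complete and in time order, uses the standard Mann-Kendall tie correction, and the function
names are arbitrary.

    import numpy as np
    from itertools import combinations

    def mann_kendall(x):
        """Mann-Kendall statistic S_i and its standard deviation SD[S_i] for one seasonal subset."""
        x = np.asarray(x, dtype=float)
        n = len(x)
        s = sum(np.sign(b - a) for a, b in combinations(x, 2))
        _, counts = np.unique(x, return_counts=True)             # tie correction
        ties = sum(t * (t - 1) * (2 * t + 5) for t in counts if t > 1)
        sd = np.sqrt((n * (n - 1) * (2 * n + 5) - ties) / 18.0)
        return s, sd

    def seasonal_mann_kendall_z(series, n_seasons):
        """Normal approximation Z for the seasonal Mann-Kendall test, equations [14.24]-[14.26]."""
        x = np.asarray(series, dtype=float)
        results = [mann_kendall(x[i::n_seasons]) for i in range(n_seasons)]
        s_total = sum(s for s, _ in results)                     # equation [14.24]
        sd_total = np.sqrt(sum(sd ** 2 for _, sd in results))    # equation [14.25]
        if s_total > 0:
            return (s_total - 1) / sd_total                      # equation [14.26]
        if s_total < 0:
            return (s_total + 1) / sd_total
        return 0.0

     Applied to the 36 monthly values of Example 14-10 in time order with n_seasons = 12, this yields
S = 35, SD[S] of about 6.56, and Z of about 5.2, matching the hand calculation below.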

       ►EXAMPLE 14-10

     The data set in Example  14-8  replicated below indicated both clear seasonality and an apparent
increasing trend. Use the seasonal Mann-Kendall procedure to test for a significant trend at the α = 0.01
level of significance.
                               Analyte Concentrations
                             1983      1984      1985         Si        SD[Si]
            January          1.99      2.01      2.15          3        1.915
            February         2.10      2.10      2.17          2        1.633
            March            2.12      2.17      2.27          3        1.915
            April            2.12      2.13      2.23          3        1.915
            May              2.11      2.13      2.24          3        1.915
            June             2.15      2.18      2.26          3        1.915
            July             2.19      2.25      2.31          3        1.915
            August           2.18      2.24      2.32          3        1.915
            September        2.16      2.22      2.28          3        1.915
            October          2.08      2.13      2.22          3        1.915
            November         2.05      2.08      2.19          3        1.915
            December         2.08      2.16      2.22          3        1.915
                                                             S = 35     SD[S] = 6.558
       SOLUTION
Step 1.   Form a seasonal subset for each month by grouping all the January measurements, all the
         February measurements, and so on, across the 3 years of sampling. This gives 12 seasonal
         subsets with n  = 3 measurements  per season. Note there  are no  tied values  in any of the
         seasons except for February.

Step 2.   Use  equations [17.30] and [17.31]  in  Section 17.3.2 to compute the Mann-Kendall statistic
         (Si) for each subset. These values are listed in the table above. Also compute their sum to form
         the overall seasonal Mann-Kendall statistic, giving S= 35.

Step 3.   Use equation [17.28] from Section 17.3.2 for all months but February to compute the standard
         deviation of Si. Since n = 3 for each of these subsets, this gives
                  SD[Si] = sqrt[ n(n − 1)(2n + 5)/18 ] = sqrt[ (3)(2)(11)/18 ] = 1.915
         For the month of February, one pair of tied values exists. Use equation [17.27] to compute the
         standard deviation for this subset:
          SD[Si] = sqrt{ [ n(n − 1)(2n + 5) − t(t − 1)(2t + 5) ] / 18 } = sqrt[ (66 − 18)/18 ] = 1.633
         List  all the subset  standard deviations  in  the table above. Then  use  equation  [14.25] to
         compute the overall  standard deviation:
                        SD[S] = sqrt[ 11 × (1.915)² + (1.633)² ] = 6.558



Step 4.   Compute a normal approximation to S with equation [17.29]:

                                    Z = (35 − 1)/6.558 = 5.18

Step 5.   Compare Z against the 1% critical point from the standard normal distribution in Table 10-1
          of Appendix D, z.01 = 2.33. Since Z is clearly larger than z.01, the increasing trend evident in
          Figure 14-15 is highly significant. ◄

     CHAPTER  15.  MANAGING  NON-DETECT DATA
       15.1   GENERAL CONSIDERATIONS FOR NON-DETECT DATA	 15-1
       15.2   IMPUTING NON-DETECT VALUES BY SIMPLE SUBSTITUTION	15-3
       15.3   ESTIMATION BY KAPLAN-MEIER	15-7
       15.4   ROBUST REGRESSION ON ORDER STATISTICS	15-13
       15.5   OTHER METHODS FOR A SINGLE CENSORING LIMIT	 15-21
             15.5.1 COHEN'S METHOD	15-21
             15.5.2 PARAMETRIC REGRESSION ON ORDER STATISTICS	15-23
        15.6   USE OF THE 15%/50% NON-DETECTS RULE	15-24
     This chapter considers strategies  for accommodating non-detect measurements in groundwater
data analysis. Five particular  methods are described  for incorporating non-detects into  parametric
statistical procedures. These include:

     »  Simple substitution (Section 15.2);
     »  Kaplan-Meier (Section 15.3);
     »  Robust Regression on Order Statistics (Section 15.4);
     »  Cohen's Method (Section 15.5.1); and
     »  Parametric Regression on Order Statistics (Section 15.5.2).
15.1 GENERAL CONSIDERATIONS FOR NON-DETECT DATA

     Non-detects  commonly reported  in groundwater monitoring are statistically known  as  "left-
censored" measurements, because the concentration of any non-detect either cannot be estimated or is
not reported directly. Rather, it is known or assumed only to fall within a certain range of concentration
values (e.g., between zero and the quantitation limit [QL]).  The direct estimate has been censored by
the limitations of the measurement process or analytical technique, and is deemed too uncertain to be
considered  reliable.  Groundwater non-detect data  are censored on the low or left end of a sample
concentration range. Other kinds of threshold data,  particularly survival rates in the medical literature,
are often reported as right-censored values.

     Historically, there has been inconsistent treatment of non-detects in groundwater analysis.  Often,
easily applied techniques have been favored over more sophisticated methods of handling non-detects.
This may primarily be  due to the lack of familiarity and difficulties with software that can incorporate
such methods. Even at present, most statistical  packages include  analysis routines for right-censored
values but not left-censored ones (Helsel, 2005).  Left-censored data needs to be converted to right-
censored data  for analysis and then back  again.   Despite these  limitations, the more sophisticated
methods are almost always superior to the methods of simple substitution.
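
     As a concrete illustration of the conversion just mentioned (a sketch only, not a specific software
recommendation), left-censored values can be "flipped" into right-censored ones by subtracting each
value from a constant larger than the sample maximum, analyzed with right-censoring routines, and
flipped back afterward; the constant and function name below are arbitrary.

    import numpy as np

    def flip_left_to_right(values, censored, flip_constant=None):
        """Convert left-censored data to right-censored form via y = M - x.

        values   : reported concentrations (the reporting limit for each non-detect)
        censored : boolean array, True where the value is a non-detect (left-censored)
        Returns the flipped values, the unchanged censoring flags (now right-censored),
        and the constant M needed to transform results back with x = M - y.
        """
        x = np.asarray(values, dtype=float)
        M = flip_constant if flip_constant is not None else x.max() + 1.0
        return M - x, np.asarray(censored, dtype=bool), M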

     The past twenty  years has seen considerable research on statistical aspects  of  non-detect data
analysis.   Helsel  (2005)  provides  a detailed summary  of available methods for non-detects, and

concludes that simple substitution usually leads to greater statistical bias and inaccuracy than with better
technical methods.  Gibbons (1994b) and Gibbons & Coleman (2001) offer a broad review of some of
the same research, not all of it directly relating to groundwater data.  Both Gibbons and McNichols &
Davis (1988) note that most of the existing studies focus  on an estimation of parameters such as the
mean and variance of an underlying population  from which the censored and detected data originate.
For these tasks, simple substitution methods tend  to perform poorly, especially when the non-detect
percentages are high (Gilliom & Helsel, 1986).

     Much less attention has been given to how left-censored data impact the results of statistical tests,
the actual data-based conclusions that are drawn when using detection, compliance, or corrective action
monitoring tests.  Closely  estimating  the  true  mean  and  variance  of the underlying background
population may be important, but does not directly answer how well a given test performs (in achieving
the nominal false positive error rate and correctly identifying true significant differences). McNichols &
Davis (1988) performed a limited study to address these concerns. They found that simple substitution
methods were among the best performers in simulated prediction limit tests even with fairly high rates of
censoring, so long as the prediction limit procedure incorporated a verification resample.

     Gibbons (1994b; also Gibbons and Coleman,  2001) conducted a similar limited simulation of
prediction limit testing performance incorporating a verification resample. They, too, found that a type
of simple substitution was one of the best performers when an average of either 20% or 50% of the data
was non-detect. The Gibbons study concluded that substituting zero for each non-detect worked better at
keeping the false positive rate low than substituting half the method detection limit [MDL].

     Both studies  primarily focused on the achievable false positive rate when censored  data are
present, rather than the statistical power of these tests to identify contaminated groundwater. In addition,
both only considered parametric prediction limits.  For data sets with fairly low  detection frequencies
(e.g., <50%), parametric prediction limits may not accurately accommodate left-censored measurements,
with or without retesting. The McNichols & Davis study in particular found that none of the simpler
methods for handling non-detects did well when the  underlying data came from a skewed distribution
and the non-detect percentage was over 50%.

     On balance, there are four general  strategies for handling non-detects:  1)  employing a  test
specifically designed to accommodate non-detects,  such as the Tarone-Ware two-sample alternative to
the t-test (Section 16.3);  2) using a rank-based, non-parametric test such  as the Kruskal-Wallis
alternative to analysis of variance [ANOVA] (Section 17.1.2) when the non-detects and detects can be
jointly sorted and ordered  (except for tied values); 3)  estimating the  mean and standard deviation of
samples  containing non-detects  by means of a censored estimation technique; and  4)  imputing an
estimated value for each non-detect prior to further statistical manipulation.
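
     For illustration, a minimal sketch of the third and fourth strategies is given below. The values are
hypothetical, the half-QL substitution and the normal maximum likelihood fit are only examples of the
general approaches, and the censored fit assumes SciPy 1.11 or later.

     import numpy as np
     from scipy import stats

     # Hypothetical sample: non-detects are recorded at their quantitation limit [QL].
     conc = np.array([5.0, 2.1, 5.0, 3.4, 7.8, 5.0, 4.2, 6.1])
     nondetect = np.array([True, False, True, False, False, True, False, False])

     # Strategy 4: impute a value for each non-detect (here, half its QL) and then
     # compute the sample mean and standard deviation from the completed data set.
     imputed = np.where(nondetect, conc / 2.0, conc)
     mean_imp, sd_imp = imputed.mean(), imputed.std(ddof=1)

     # Strategy 3: censored estimation -- a maximum likelihood fit that treats each
     # non-detect as known only to lie below its QL (requires SciPy >= 1.11).
     censored = stats.CensoredData.left_censored(conc, nondetect)
     mean_mle, sd_mle = stats.norm.fit(censored)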

     The first two strategies mentioned  above are discussed in Chapters 16 and 17 of the Unified
Guidance as alternative testing procedures for evaluating left-censored data when parametric distribution
assumptions cannot be made. Tests that can accommodate non-detects are typically non-parametric and
thus carry  both the advantages  and disadvantages of  non-parametric methods. The third and fourth
strategies — presented in  this chapter —  are often  employed as an intermediate step in parametric
analyses. Estimates of the background mean and standard deviation are needed to construct parametric
prediction and control chart limits, as well  as confidence intervals. Imputed values for individual non-
detects can be used as an alternative way to construct mean and standard deviation estimates, which are
needed to update the cumulative sum [CUSUM] portion of control charts or to compute the means of
order p that get compared against prediction limits.
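
     As a simplified illustration of how such estimates are used, the sketch below shows the general form
of a parametric prediction limit for a future mean of order p computed from a background mean and
standard deviation (however the non-detects were handled). The values are hypothetical, and the exact
equations and retesting adjustments are given in the prediction limit chapters of this guidance.

     import numpy as np
     from scipy.stats import t

     def upper_prediction_limit_for_mean(xbar, s, n, p, alpha=0.01):
         # General form of an upper prediction limit for the mean of p future
         # measurements, based on n background values with sample mean xbar and
         # sample standard deviation s: xbar + t * s * sqrt(1/p + 1/n).
         return xbar + t.ppf(1 - alpha, n - 1) * s * np.sqrt(1.0 / p + 1.0 / n)

     # Hypothetical comparison of a compliance-point mean of order p = 3 to the limit
     limit = upper_prediction_limit_for_mean(xbar=4.6, s=1.2, n=12, p=3)
     new_mean = np.mean([4.8, 6.2, 5.5])
     print(new_mean > limit)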

     The guidance generally favors the use of the more sophisticated Kaplan-Meier or Robust ROS
methods, which can address the problem of multiple detection limits. Two older techniques, Cohen's
method and parametric ROS, are also included as somewhat easier methods that can work in some
circumstances.  Applying any of the four estimation techniques as well as simple substitution does rel