United States Environmental Protection Agency
Office of Environmental Information
Washington, DC 20460
EPA/600/R-96/084
July 2000

        Guidance for
        Data Quality Assessment

        Practical Methods for
        Data Analysis

        EPA QA/G-9

        QA00 UPDATE

-------
                                     FOREWORD

       This document is the 2000 (QA00) version of the Guidance for Data Quality Assessment
which provides general guidance to organizations on assessing data quality criteria and
performance specifications for decision making. The Environmental Protection Agency (EPA) has
developed the Data Quality Assessment (DQA) Process for project
managers and planners to determine whether the type, quantity, and quality of data needed to
support Agency decisions have been achieved. This guidance is the culmination of experiences in
the design and statistical analyses of environmental data in different Program Offices at the EPA.
Many elements of prior guidance, statistics, and scientific planning have been incorporated into
this document.

       This document is distinctly different from other guidance documents; it is not intended to
be read in a linear or continuous fashion.  The intent of the document is for it to be used as a
"tool-box" of useful techniques in assessing the quality of data. The  overall structure  of the
document will  enable the analyst to investigate many different problems using a systematic
methodology.

       This document is one of a series of quality management guidance documents that the EPA
Quality Staff has prepared to assist users in implementing the Agency-wide Quality System. Other
related documents include:

       EPA QA/G-4        Guidance for the Data Quality Objectives Process

       EPA QA/G-4D       DEFT Software for the Data Quality Objectives Process

       EPA QA/G-4HW     Guidance for the Data Quality Objectives Process for Hazardous
                           Waste Site Investigations

       EPA QA/G-9D       Data Quality Evaluation Statistical Toolbox (DataQUEST)

       This document is intended to be a  "living document" that will be updated periodically to
incorporate new topics and revisions or refinements to existing procedures. Comments received
on this 2000 version will be considered for inclusion in subsequent versions. Please send your
written comments on Guidance for Data Quality Assessment to:

              Quality Staff (2811R)
              Office of Environmental Information
              U.S. Environmental Protection Agency
              1200 Pennsylvania Avenue, NW
              Washington, DC 20460
              Phone:  (202) 564-6830
              Fax:  (202) 565-2441
              E-mail:  quality@epa.gov

EPA QA/G-9                                                                         Final
QA00 Version                                i                                    July 2000


-------
                            TABLE OF CONTENTS

                                                                          Page

INTRODUCTION	0-1
      0.1    PURPOSE AND OVERVIEW	0-1
      0.2    DQA AND THE DATA LIFE CYCLE  	0-2
      0.3    THE 5 STEPS OF DQA	0-2
      0.4    INTENDED AUDIENCE  	0-4
      0.5    ORGANIZATION	0-4
      0.6    SUPPLEMENTAL SOURCES  	0-4

STEP 1: REVIEW DQOs AND THE SAMPLING DESIGN	1-1
      1.1    OVERVIEW AND ACTIVITIES  	 1-3
            1.1.1  Review Study Objectives	 1-4
            1.1.2  Translate Objectives into Statistical Hypotheses 	 1-4
            1.1.3  Develop Limits on Decision Errors	 1-5
            1.1.4  Review Sampling Design	 1-7
      1.2    DEVELOPING THE STATEMENT OF HYPOTHESES 	 1-9
      1.3    DESIGNS FOR SAMPLING ENVIRONMENTAL MEDIA	 1-11
            1.3.1  Authoritative Sampling	 1-11
            1.3.2  Probability Sampling	 1-13
                  1.3.2.1   Simple Random Sampling	 1-13
                  1.3.2.2  Sequential Random Sampling	 1-13
                  1.3.2.3   Systematic Samples  	 1-14
                  1.3.2.4  Stratified Samples	 1-14
                  1.3.2.5   Compositing Physical Samples 	 1-15
                  1.3.2.6  Other Sampling Designs	 1-15

STEP 2: CONDUCT A PRELIMINARY DATA REVIEW	2-1
      2.1    OVERVIEW AND ACTIVITIES	2-3
            2.1.1  Review Quality Assurance Reports	2-3
            2.1.2  Calculate Basic Statistical Quantities	2-4
            2.1.3  Graph the Data	2-4
      2.2    STATISTICAL QUANTITIES  	2-5
            2.2.1  Measures of Relative Standing  	2-5
            2.2.2  Measures of Central Tendency  	2-6
            2.2.3  Measures of Dispersion	2-8
            2.2.4  Measures of Association	2-8
                  2.2.4.1   Pearson's Correlation Coefficient	2-8
                  2.2.4.2  Spearman's Rank Correlation Coefficient	2-11
                  2.2.4.3   Serial Correlation Coefficient	2-11

-------
                                                                               Page

      2.3    GRAPHICAL REPRESENTATIONS  	2-13
             2.3.1  Histogram/Frequency Plots	2-13
             2.3.2  Stem-and-Leaf Plot	2-15
             2.3.3  Box and Whisker Plot	2-17
             2.3.4  Ranked Data Plot	2-17
             2.3.5  Quantile Plot	2-21
             2.3.6  Normal Probability Plot (Quantile-Quantile Plot)	2-22
             2.3.7  Plots for Two or More Variables	2-26
                   2.3.7.1 Plots for Individual Data Points 	2-26
                   2.3.7.2  Scatter Plot  	2-27
                   2.3.7.3  Extensions of the Scatter Plot	2-27
                   2.3.7.4  Empirical Quantile-Quantile Plot	2-30
             2.3.8  Plots for Temporal Data	2-30
                   2.3.8.1  Time Plot	2-32
                   2.3.8.2  Plot of the Autocorrelation Function (Correlogram)	2-33
                   2.3.8.3  Multiple Observations Per Time Period	2-35
             2.3.9  Plots for Spatial Data  	2-36
                   2.3.9.1  Posting Plots  	2-37
                   2.3.9.2  Symbol Plots  	2-37
                   2.3.9.3  Other Spatial Graphical Representations	2-39
      2.4    Probability Distributions	2-39
             2.4.1  The Normal Distribution	2-39
             2.4.2  The t-Distribution	2-40
             2.4.3  The Lognormal Distribution 	2-40
             2.4.4  Central Limit Theorem  	2-41

STEP 3:  SELECT THE STATISTICAL TEST 	3-1
      3.1    OVERVIEW AND ACTIVITIES	3-3
             3.1.1  Select Statistical Hypothesis Test  	3-3
             3.1.2  Identify Assumptions Underlying the Statistical Test	3-3
      3.2    TESTS OF HYPOTHESES ABOUT A SINGLE POPULATION  	3-4
             3.2.1  Tests for a Mean	3-4
                   3.2.1.1  The One-Sample t-Test	3-5
                   3.2.1.2  The Wilcoxon Signed Rank (One-Sample) Test 	3-11
                   3.2.1.3 The Chen Test	3-15
             3.2.2  Tests for a Proportion or Percentile	3-16
                   3.2.2.1  The One-Sample Proportion Test	3-18
             3.2.3  Tests for a Median  	3-18
             3.2.4  Confidence Intervals	3-20
      3.3    TESTS FOR COMPARING TWO POPULATIONS  	3-21

-------
                                                                              Page

             3.3.1  Comparing Two Means	3-22
                   3.3.1.1  Student's Two-Sample t-Test (Equal Variances)	3-23
                   3.3.1.2  Satterthwaite's Two-Sample t-Test (Unequal Variances) . . 3-23
             3.3.2  Comparing Two Proportions or Percentiles	3-27
                   3.3.2.1  Two-Sample Test for Proportions	3-28
             3.3.3  Nonparametric Comparisons of Two Populations	3-31
                   3.3.3.1  The Wilcoxon Rank Sum Test  	3-31
                   3.3.3.2  The Quantile Test	3-35
             3.3.4  Comparing Two Medians  	3-36
      3.4    Tests for Comparing Several Populations	3-37
             3.4.1  Tests for Comparing Several Means 	3-37
                   3.4.1.1 Dunnett's Test	3-38

STEP 4:  VERIFY THE ASSUMPTIONS OF THE STATISTICAL TEST  	4-1
      4.1    OVERVIEW AND ACTIVITIES	4-3
             4.1.1  Determine Approach for Verifying Assumptions	4-3
             4.1.2  Perform Tests of Assumptions  	4-4
             4.1.3  Determine Corrective Actions  	4-5
      4.2    TESTS FOR DISTRIBUTIONAL ASSUMPTIONS  	4-5
             4.2.1  Graphical Methods	4-7
             4.2.2  Shapiro-Wilk Test for Normality (the W test)	4-8
             4.2.3  Extensions of the Shapiro-Wilk Test (Filliben's Statistic)  	4-8
             4.2.4  Coefficient of Variation	4-8
             4.2.5  Coefficient of Skewness/Coefficient of Kurtosis Tests	4-9
             4.2.6  Range Tests  	4-10
             4.2.7  Goodness-of-Fit Tests	4-12
             4.2.8  Recommendations	4-13
      4.3    TESTS FOR TRENDS	4-13
             4.3.1  Introduction  	4-13
             4.3.2  Regression-Based Methods for Estimating and Testing for Trends  .4-14
                   4.3.2.1  Estimating a Trend Using the Slope of the Regression Line 4-14
                   4.3.2.2  Testing for Trends Using Regression Methods  	4-15
             4.3.3  General Trend Estimation Methods	4-16
                   4.3.3.1  Sen's Slope Estimator	4-16
                   4.3.3.2  Seasonal Kendall Slope Estimator	4-16
             4.3.4  Hypothesis Tests for Detecting Trends  	4-16
                   4.3.4.1  One Observation per Time Period for
                          One Sampling Location	4-16
                   4.3.4.2  Multiple Observations per Time Period
                          for One Sampling Location 	4-19

-------
                                                                              Page

                   4.3.4.3  Multiple Sampling Locations with Multiple Observations .4-20
                   4.3.4.4  One Observation for One Station with Multiple Seasons  .4-22
             4.3.5  A Discussion on Tests for Trends	4-23
             4.3.6  Testing for Trends in Sequences of Data	4-24
      4.4    OUTLIERS  	4-24
             4.4.1  Background  	4-24
             4.4.2  Selection of a Statistical Test 	4-27
             4.4.3  Extreme Value Test (Dixon's Test)	4-27
             4.4.4  Discordance Test 	4-29
             4.4.5  Rosner's Test  	4-30
             4.4.6  Walsh's Test	4-32
             4.4.7  Multivariate Outliers	4-32
      4.5    TESTS FOR DISPERSIONS	4-33
             4.5.1  Confidence Intervals for a Single Variance 	4-33
             4.5.2  The F-Test for the Equality of Two Variances	4-33
             4.5.3  Bartlett's Test for the Equality of Two or More Variances	4-33
             4.5.4  Levene's Test for the Equality of Two or More Variances	4-35
      4.6    TRANSFORMATIONS	4-39
             4.6.1  Types of Data Transformations	4-39
             4.6.2  Reasons for Transforming Data	4-41
      4.7    VALUES BELOW DETECTION LIMITS  	4-42
             4.7.1  Less than 15% Nondetects - Substitution Methods 	4-43
             4.7.2  Between 15-50% Nondetects 	4-43
                   4.7.2.1  Cohen's Method	4-43
                   4.7.2.2  Trimmed Mean	4-45
                   4.7.2.3  Winsorized Mean and Standard Deviation	4-45
                    4.7.2.4  Aitchison's Method	4-46
                    4.7.2.5  Selecting Between Aitchison's Method or Cohen's Method . 4-49
             4.7.3  Greater than 50% Nondetects - Test of Proportions	4-50
             4.7.4  Recommendations	4-50
      4.8    INDEPENDENCE	4-51

STEP 5:  DRAW CONCLUSIONS FROM THE DATA	5-1
      5.1    OVERVIEW AND ACTIVITIES	5-3
             5.1.1  Perform the Statistical Hypothesis Test	5-3
             5.1.2  Draw Study Conclusions	5-3
             5.1.3  Evaluate Performance of the Sampling Design	5-5
      5.2    INTERPRETING AND COMMUNICATING THE TEST RESULTS	5-6
             5.2.1  Interpretation of p-Values	5-7
             5.2.2  "Accepting" vs. "Failing to Reject" the Null Hypothesis	5-7

-------
             5.2.3   Statistical Significance vs. Practical Significance
             5.2.4   Impact of Bias on Test Results  	
             5.2.5   Quantity vs. Quality of Data	
             5.2.6   "Proof of Safety" vs. "Proof of Hazard" 	
APPENDIX A: STATISTICAL TABLES	A - 1

APPENDIX B: REFERENCES	B - 2


-------
                                    INTRODUCTION

0.1    PURPOSE AND OVERVIEW

       Data Quality Assessment (DQA) is the scientific and statistical evaluation of data to
determine if data obtained from environmental data operations are of the right type, quality, and
quantity to support their intended use.  This guidance demonstrates how to use DQA in
evaluating environmental data sets and illustrates how to apply some graphical and statistical tools
for performing DQA.  The guidance focuses primarily on using DQA in environmental decision
making; however, the tools presented for preliminary data review and verifying statistical
assumptions are useful whenever environmental data are used, regardless of whether the data are
used for decision making.

       DQA is built on a fundamental premise: data quality, as a concept, is meaningful only
when it relates to the intended use of the data. Data quality does not exist in a vacuum; one must
know in what context a data set is to be used in order to establish a relevant yardstick for judging
whether or not the data set is adequate. By using the DQA, one can answer two fundamental
questions:

1.      Can the decision (or estimate) be made with the desired confidence, given the quality of
       the data set?

2.      How well can the sampling design be expected to perform over a wide range of possible
       outcomes?  If the same sampling design strategy is used again for a similar study, would
       the data be expected to support the same intended use with the desired level of
       confidence, particularly if the measurement results turned out to be higher or lower than
       those observed in the current study?

       The first question addresses the data user's immediate needs. For example, if the data
provide evidence strongly in favor of one course of action over another, then the decision maker
can proceed knowing that the decision will be supported by unambiguous data.  If, however, the
data do not show sufficiently strong evidence to favor one alternative, then the data analysis alerts
the decision maker to this uncertainty.  The decision maker now is in a position to make an
informed choice about how to proceed (such as collect more or different data before making the
decision, or proceed with the decision despite the relatively high, but acceptable, probability of
drawing an erroneous conclusion).

       The second question addresses the data user's potential future needs. For example, if
investigators decide to use a certain sampling design at a different location from where the design
was first used, they should determine how well the design can be expected to perform given that
the outcomes and environmental conditions of this  sampling event will be different from those of
the original event. Because environmental conditions will vary from one location or time to
another, the adequacy of the sampling design approach should be evaluated over a broad range of
possible outcomes and conditions.


-------
0.2    DQA AND THE DATA LIFE CYCLE

       The data life cycle (depicted in Figure 0-1) comprises three steps: planning,
implementation, and assessment. During the planning phase, the Data Quality Objectives (DQO)
Process (or some other systematic planning procedure) is used to define quantitative and
qualitative criteria for determining when, where, and how many samples (measurements) to
collect and a desired level of confidence.  This information, along with the sampling methods,
analytical procedures, and appropriate quality assurance (QA) and quality control (QC)
procedures, are documented in the QA Project Plan. Data are then collected following the QA
Project Plan specifications. DQA completes the data life cycle by providing the assessment
needed to determine if the planning objectives were achieved. During the assessment phase, the
data are validated and verified to ensure that the sampling and analysis protocols specified in the
QA Project Plan were followed, and that the measurement systems performed in accordance with
the criteria specified in the QA Project Plan. DQA then proceeds using the validated data set to
determine if the quality of the data is satisfactory.

[Figure 0-1 depicts the data life cycle:  PLANNING (Data Quality Objectives Process;
Quality Assurance Project Plan Development), followed by IMPLEMENTATION (Field Data
Collection and Associated Quality Assurance/Quality Control Activities), followed by
ASSESSMENT (Data Validation/Verification; Data Quality Assessment).  Within the Quality
Assurance Assessment, routine data and QC/performance evaluation data are the inputs to
DATA VALIDATION/VERIFICATION (verify measurement performance; verify measurement
procedures and reporting requirements), whose output is the validated/verified data.
That data set is the input to DATA QUALITY ASSESSMENT (review DQOs and design; select
statistical test; verify assumptions; draw conclusions), whose output is the conclusions
drawn from the data.]

                   Figure 0-1.  DQA in the Context of the Data Life Cycle

0.3    THE 5 STEPS OF DQA

       The DQA involves five steps that begin with a review of the planning documentation and
end with an answer to the question posed during the planning phase of the study. These steps
roughly parallel the actions of an environmental statistician when analyzing a set of data. The five
steps, which are described in detail in the remaining chapters of this guidance, are briefly
summarized as follows:

1.     Review  the Data Quality Objectives (DQOs) and Sampling Design:  Review the DQO
       outputs to assure that they are still applicable.  If DQOs have not been developed, specify
       DQOs before evaluating the data (e.g., for environmental decisions, define the statistical
       hypothesis and specify tolerable limits on decision errors; for estimation problems, define
       an acceptable confidence or probability interval width). Review the sampling design and
       data collection documentation for consistency with the DQOs.

2.     Conduct a Preliminary Data Review: Review QA reports, calculate basic statistics, and
       generate graphs of the data. Use this information to learn about the structure of the data
       and identify patterns, relationships, or potential anomalies.

3.     Select the Statistical Test: Select the most appropriate procedure for summarizing and
       analyzing the data, based on the review of the DQOs, the sampling design, and the
       preliminary data review. Identify the key underlying assumptions that must hold for the
       statistical procedures to be valid.

4.     Verify the Assumptions of the Statistical Test:  Evaluate whether the underlying
       assumptions hold, or whether departures are acceptable, given the actual data and other
       information about the study.

5.     Draw Conclusions from the Data: Perform the calculations required for the statistical
       test and document the inferences drawn as  a result of these calculations.  If the design is to
       be used  again, evaluate the performance of the sampling design.
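As a hypothetical illustration, the five steps above can be sketched in Python for one common situation: testing whether a site mean is below an action level.  All data values, the action level, and the critical value below are invented for this sketch and are not taken from this guidance; a real DQA would apply the formal procedures described in the later chapters.

```python
import statistics as stats

# Hypothetical sketch of the five DQA steps; numbers are invented
# for illustration only.

def dqa_sketch(data, action_level):
    # Step 1: Review DQOs -- here the baseline (null) condition is
    # "true mean >= action_level", with a tolerable false rejection
    # rate of 0.05 taken as a DQO output.
    n = len(data)

    # Step 2: Preliminary data review -- basic statistical quantities.
    mean = stats.fmean(data)
    sd = stats.stdev(data)

    # Step 3: Select the statistical test -- a one-sample t statistic.
    t = (mean - action_level) / (sd / n ** 0.5)

    # Step 4: Verify assumptions -- a crude mean-vs-median symmetry
    # check stands in for a formal normality test (e.g., Shapiro-Wilk).
    roughly_symmetric = abs(mean - stats.median(data)) < sd

    # Step 5: Draw conclusions -- reject the baseline if t falls below
    # the lower-tail critical value t(0.05, n-1); for n = 10 this is
    # approximately -1.833.
    reject_null = t < -1.833
    return {"mean": mean, "t": t, "roughly_symmetric": roughly_symmetric,
            "reject_null": reject_null}

result = dqa_sketch([82, 75, 90, 71, 68, 77, 85, 73, 80, 76],
                    action_level=100)
```

For these ten invented measurements, which lie well below the action level, the sketch rejects the baseline condition.  Note that the hard-coded critical value applies only to n = 10 at the 0.05 significance level; the one-sample t-test is treated fully in Step 3 of this guidance.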

These five steps are presented in a linear sequence, but the DQA is by its very nature iterative.
For example, if the preliminary data review reveals patterns or anomalies in the data set that are
inconsistent with the DQOs, then  some aspects of the  study planning may have to be reconsidered
in Step 1. Likewise, if the underlying assumptions of the statistical test are not supported by the
data, then previous steps of the DQA may have to  be revisited. The strength of the DQA is that it
is designed to promote an understanding of how well the data satisfy their intended use by
progressing in a logical and efficient manner.

       Nevertheless, it should be emphasized that the DQA cannot absolutely prove that one has
or has not achieved the DQOs set forth during the planning phase of a study.  This situation
occurs because a decision maker can never know the true value of the item of interest.  Data
collection provides investigators with only an estimate of the true value, not the true value itself.  Further,
because analytical methods are not perfect, they too can only provide an estimate of the true value
of an environmental sample.  Because investigators make a decision based on estimated and not
true values, they run the risk of making a wrong decision (decision error) about the item of
interest.
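This risk of decision error can be illustrated with a small simulation (all numbers invented for this sketch): even with unbiased measurements, a naive decision rule applied when the true mean sits exactly at the action level errs about half the time, which is why tolerable limits on decision errors and formal statistical tests are specified during planning.

```python
import random
import statistics as stats

# Hypothetical illustration (invented numbers): decisions made from
# sample estimates can be wrong even when measurements are unbiased.
# Here the true mean equals the action level, so the naive rule
# "declare the site clean if the sample mean is below the action level"
# errs roughly half the time.
random.seed(1)

true_mean = action_level = 100.0     # boundary case
measurement_sd, n, trials = 10.0, 8, 10_000

errors = 0
for _ in range(trials):
    sample = [random.gauss(true_mean, measurement_sd) for _ in range(n)]
    if stats.fmean(sample) < action_level:   # naive decision rule
        errors += 1                          # wrongly declares the site clean

error_rate = errors / trials                 # close to 0.5 at the boundary
```

A statistical test with stated Type I and Type II error limits replaces this naive rule precisely so that the probability of such wrong decisions is controlled rather than left near 50% at the boundary.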

-------
0.4    INTENDED AUDIENCE

       This guidance is written for a broad audience of potential data users, data analysts, and
data generators. Data users (such as project managers, risk assessors, or principal investigators
who are responsible for making decisions or producing estimates regarding environmental
characteristics based on environmental data) should find this guidance useful for understanding
and directing the technical work of others who produce and analyze data. Data analysts (such as
quality assurance specialists, or any technical professional who is responsible for evaluating the
quality of environmental  data) should find this guidance to be a convenient compendium of basic
assessment tools. Data generators (such as analytical chemists, field sampling specialists, or
technical support staff responsible for collecting and analyzing environmental samples and
reporting the resulting data values) should find this guidance useful for understanding how their
work will be used and for providing a foundation for improving the efficiency and effectiveness of
the data generation process.

0.5    ORGANIZATION

       This guidance presents background information and statistical tools for performing DQA.
Each chapter corresponds to a step in the DQA and begins with an overview of the activities to be
performed for that step. Following the overviews in Chapters 1, 2, 3, and 4, specific graphical or
statistical tools are described and step-by-step procedures are provided along with examples.

0.6    SUPPLEMENTAL SOURCES

       Many of the graphical and statistical tools presented in this guidance are also implemented
in a user-friendly, personal computer software program called Data Quality Evaluation Statistical
Toolbox (DataQUEST) (G-9D) (EPA, 1996).  DataQUEST simplifies the implementation of DQA
by automating many of the recommended statistical tools. DataQUEST runs on most IBM-
compatible personal computers using the DOS operating system; see the DataQUEST User's
Guide for complete information on the minimum computer requirements.

-------
                                            CHAPTER 1
                  STEP 1: REVIEW DQOs AND THE SAMPLING DESIGN
THE DATA QUALITY ASSESSMENT PROCESS
              Review DQOs and Sampling Design
              Conduct Preliminary Data Review
                 Select the Statistical Test
                  Verify the Assumptions
              Draw Conclusions From the Data
           REVIEW DQOs AND SAMPLING DESIGN

           Purpose

           Review the DQO outputs, the sampling design, and
           any data collection documentation for consistency. If
           DQOs have not been developed, define the statistical
           hypothesis and specify tolerable limits on decision errors.


           Activities

           • Review Study Objectives
            • Translate Objectives into Statistical Hypotheses
           • Develop Limits on Decision Errors
           • Review Sampling Design


           Tools

           • Statements of hypotheses
           • Sampling design concepts
                             Step 1: Review DQOs and Sampling Design

              Review the objectives of the study.
              •   If DQOs have not been developed, review Section 1.1.1 and define these objectives.
              •   If DQOs were developed, review the outputs from the DQO Process.

              Translate the data user's objectives into a statement of the primary statistical hypothesis.
              •   If DQOs have not been developed, review Sections 1.1.2 and 1.2, and Box 1-1,
                  then develop a statement of the hypothesis based on the data user's objectives.
              •   If DQOs were developed, translate them into a statement of the primary hypothesis.

              Translate the data user's objectives into limits on Type I or Type II decision errors.
              •   If DQOs have not been developed, review Section 1.1.3 and document the data
                  user's tolerable limits on decision errors.
              •   If DQOs were developed, confirm the limits on decision errors.

              Review the sampling design and note any special features or potential problems.
              •   Review the sampling design for any deviations (Sections 1.1.4 and 1.3).

-------
                                     List of Boxes
                                                                                 Page
Box 1-1:  Example Applying the DQO Process Retrospectively	 1-8
                                    List of Tables
                                                                                 Page
Table 1-1. Choosing a Parameter of Interest	 1-6
Table 1-2. Commonly Used Statements of Statistical Hypotheses	 1-12
                                    List of Figures
                                                                                 Page
Figure 1-1.  The Data Quality Objectives Process   	 1-3

-------
                                         CHAPTER 1
                 STEP 1:  REVIEW DQOs AND THE SAMPLING DESIGN

1.1     OVERVIEW AND ACTIVITIES

        DQA begins by reviewing the key outputs from the planning phase of the data life cycle:
the Data Quality Objectives (DQOs), the Quality Assurance (QA) Project Plan, and any
associated documents.  The DQOs provide the context for understanding the purpose of the data
collection effort and establish the qualitative and quantitative criteria for assessing the quality of
the data set for the intended use. The sampling design (documented in the QA Project Plan)
provides important information about how to interpret the data. By studying the sampling design,
the analyst can gain an understanding of the assumptions under which the design was developed,
as well as the relationship between these assumptions and the DQOs.  By reviewing the methods
by which the samples were collected, measured, and reported, the analyst prepares for the
preliminary data review and subsequent steps of DQA.

        Careful planning improves the
representativeness and overall quality of a sampling
design, the effectiveness and efficiency with which
the sampling and analysis plan is implemented, and
the usefulness of subsequent DQA efforts.  Given
the benefits of planning, the Agency has developed
the DQO Process which is a logical, systematic
planning procedure based on the scientific method.
The DQO Process emphasizes the planning and
development of a sampling design to collect the
right type, quality, and quantity of data needed to
support the decision.  Using both the DQO Process
and the DQA will help to ensure that the  decisions
are supported by data of adequate quality; the
DQO Process does so prospectively and the DQA
does so retrospectively.

        When DQOs have not been developed
during the planning phase of the study, it is
necessary to develop statements of the data user's
objectives prior to conducting DQA. The primary
purpose of stating the data user's objectives prior
to analyzing the data is to establish appropriate
criteria for evaluating the quality of the data with
respect to their intended use.  Analysts who are not
familiar with the DQO Process should refer to the
Guidance for the Data Quality Objectives Process
(QA/G-4) (1994), a book on statistical  decision
making using tests of hypothesis, or consult a statistician.  The seven steps of the DQO Process
are illustrated in Figure 1-1.

         Step 1. State the Problem
               Define the problem; identify the planning team; examine budget, schedule.
         Step 2. Identify the Decision
               State decision; identify study question; define alternative actions.
         Step 3. Identify the Inputs to the Decision
               Identify information needed for the decision (information sources, basis
               for Action Level, sampling/analysis method).
         Step 4. Define the Boundaries of the Study
               Specify sample characteristics; define spatial/temporal limits, units of
               decision making.
         Step 5. Develop a Decision Rule
               Define statistical parameter (mean, median); specify Action Level;
               develop logic for action.
         Step 6. Specify Tolerable Limits on Decision Errors
               Set acceptable limits for decision errors relative to consequences
               (health effects, costs).
         Step 7. Optimize the Design for Obtaining Data
               Select resource-effective sampling and analysis plan that meets the
               performance criteria.

       Figure 1-1. The Data Quality Objectives Process

       The remainder of this chapter addresses recommended activities for performing this step
of DQA and technical considerations that support these activities.  The remainder of this section
describes the recommended activities, the first three of which will differ depending on whether
DQOs have already been developed for the study.  Section 1.2 describes how to select the null
and alternative hypotheses, and Section 1.3 presents a brief overview of different types of sampling
designs.

1.1.1  Review Study Objectives

       In this activity, the objectives of the study are reviewed to provide context for analyzing
the data. If a planning process has been implemented before the data are collected, then this step
reduces to reviewing the documentation on the  study objectives. If no planning process was used,
the data user should:

•      Develop a concise definition of the problem (DQO Process Step 1) and the decision (DQO
       Process Step 2) for which the data were collected. This should provide the fundamental
       reason for collecting the environmental data and identify all potential actions that could
       result from the data analysis.

•      Identify if any essential information is missing (DQO Process Step 3). If so, either collect
       the missing information before proceeding, or select a different approach to resolving the
       decision.

•      Specify the scale of decision making (any subpopulations of interest) and any boundaries
       on the study (DQO Process Step 4) based on the sampling design. The scale of decision
       making is the smallest area  or time period to which the decision will apply. The sampling
       design and implementation  may restrict how small or how large this scale of decision
       making can be.

1.1.2  Translate Objectives into Statistical Hypotheses

       In this activity, the data user's objectives are used to develop  a precise statement of the
primary1 hypotheses to be tested  using environmental data. A statement of the primary statistical
hypotheses includes a null  hypothesis, which is  a "baseline condition" that is presumed to be true
in the absence  of strong evidence to the contrary, and an alternative hypothesis, which bears the
burden of proof.  In other words, the baseline condition will be retained unless the alternative
       1  Throughout this document, the term "primary hypotheses" refers to the statistical hypotheses that correspond to the
data user's decision.  Other statistical hypotheses can be formulated to formally test the assumptions that underlie the specific
calculations used to test the primary hypotheses. See Chapter 3 for examples of assumptions underlying primary hypotheses
and Chapter 4 for examples of how to test these underlying assumptions.
condition (the alternative hypothesis) is thought to be true due to the preponderance of evidence.
In general, such hypotheses consist of the following elements:

•      a population parameter of interest, which describes the feature of the environment that the
       data user is investigating;

•      a numerical value to which the parameter will be compared, such as a regulatory  or risk-
       based threshold or a similar parameter from another place (e.g., comparison to a  reference
       site) or time (e.g., comparison to a prior time); and

•      the relation (such as "is equal to" or "is greater than") that specifies precisely how the
       parameter will be compared to the numerical value.

To help the analyst decide which parameter should be investigated, Table 1-1 compares the
merits of the mean, upper proportion (percentile), and median.  If DQOs were developed, the
statement of hypotheses should already be documented in the outputs of Step 6 of the DQO
Process.  If DQOs have not been developed, then the analyst should consult with the data user to
develop hypotheses that address the data user's concerns. Section 1.2 describes in detail how to
develop the statement of hypotheses and includes a list of commonly encountered hypotheses for
environmental decisions.
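The statement of hypotheses can be made concrete with a short sketch. The following is purely illustrative and is not part of the guidance: the effluent data, the 5.0 mg/L limit, and the use of a large-sample normal approximation (rather than an exact t-test) are all invented assumptions.

```python
import math
from statistics import NormalDist, mean, stdev

def one_sample_test(data, threshold, alternative="less"):
    """Large-sample test of a population mean against a fixed threshold.
    alternative="less":    H0: mean >= threshold  vs  HA: mean < threshold
    alternative="greater": H0: mean <= threshold  vs  HA: mean > threshold
    """
    z = (mean(data) - threshold) / (stdev(data) / math.sqrt(len(data)))
    if alternative == "less":
        p = NormalDist().cdf(z)          # small p favors HA: mean < threshold
    else:
        p = 1 - NormalDist().cdf(z)      # small p favors HA: mean > threshold
    return z, p

# Hypothetical effluent concentrations (mg/L) tested against a 5.0 mg/L limit,
# with the baseline condition (null hypothesis) "mean >= 5.0 mg/L"
z, p = one_sample_test([4.2, 4.8, 5.1, 4.5, 4.9, 4.4, 4.7, 4.6], 5.0, "less")
```

A small p-value here is the "strong evidence" that lets the analyst reject the baseline condition in favor of the alternative.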

1.1.3  Develop Limits on Decision Errors

       The goal of this activity is to develop numerical probability limits that express the data
user's tolerance for committing false rejection (Type I) or false acceptance (Type II) decision
errors as a result of uncertainty in the data.  A false rejection  error occurs when the null
hypothesis is rejected when it is true. A false acceptance decision error occurs when the null
hypothesis is not rejected when it is false. These are the statistical definitions of false rejection
and false acceptance decision errors.  Other commonly used phrases include "level of significance"
which is equal to the Type I Error (false rejection) and "complement of power" equal to  the Type
II Error (false  acceptance). If tolerable decision error rates were not established prior to data
collection, then the data user should:

•      Specify the gray region where the consequences of a false acceptance decision error are
       relatively minor (DQO Process Step 6).  The gray region is bounded on one side  by the
       threshold value and on the other side by that parameter value where the consequences of
       making a false acceptance decision error begin to be significant.  Establish this boundary
       by evaluating the consequences of not rejecting the null hypothesis when it is false and
       then place the edge of the gray region where these consequences are severe enough to set
       a limit on the magnitude of this false acceptance decision error. The gray region  is the
       area between this parameter value and the threshold value.

       The width of the gray region represents one important aspect of the decision maker's
concern for decision errors.  A narrower gray region implies a desire to detect
                          Table 1-1.  Choosing a Parameter of Interest

 Mean - Points to Consider:
 1.  Easy to calculate and estimate a confidence interval.
 2.  Useful when the standard has been based on consideration of health effects or long-term
     average exposure.
 3.  Useful when the data have little variation from sample to sample or season to season.
 4.  If the data have a large coefficient of variation (greater than about 1.5), testing the mean can
     require more samples than testing an upper percentile in order to provide the same protection
     to human health and the environment.
 5.  Can have high false rejection rates with small sample sizes and highly skewed data, i.e., when
     the contamination levels are generally low with only occasional short periods of high
     contamination.
 6.  Not as powerful for testing attainment when there is a large proportion of less-than-
     detection-limit values.
 7.  Is adversely affected by outliers or errors in a few data values.

 Upper Proportion (Percentile) - Points to Consider:
 1.  Requiring that an upper percentile be less than a standard can limit the occurrence of samples
     with high concentrations, depending on the selected percentile.
 2.  Unaffected by less-than-detection-limit values, as long as the detection limit is less than the
     cleanup standard.
 3.  If the health effects of the contaminant are acute, extreme concentrations are of concern and
     are best tested by ensuring that a large portion of the measurements are below a standard.
 4.  The proportion of the samples that must be below the standard must be chosen.
 5.  For highly variable or skewed data, can provide similar protection of human health and the
     environment with a smaller sample size than when testing the mean.
 6.  Is relatively unaffected by a small number of outliers.

 Median - Points to Consider:
 1.  Has benefits over the mean because it is not as heavily influenced by outliers and highly
     variable data, and can be used with a large number of less-than-detection-limit values.
 2.  Has many of the positive features of the mean, in particular its usefulness for evaluating
     standards based on health effects and long-term average exposure.
 3.  For positively skewed data, the median is lower than the mean; therefore, testing the median
     provides less protection for human health and the environment than testing the mean.
 4.  Retains some negative features of the mean in that testing the median will not limit the
     occurrence of extreme values.
       conclusively the condition when the true parameter value is close to the threshold value
       ("close" relative to the variability in the data).

       Specify tolerable limits on the probability of committing false rejection and false
       acceptance decision errors (DQO Process Step 6) that reflect the decision maker's
       tolerable limits for making an incorrect decision.  Select a possible value of the parameter;
       then, choose a probability limit based on an evaluation of the seriousness of the potential
       consequences of making the decision error if the true parameter value is located at that
       point.  At a minimum, the decision maker should specify a false rejection decision error
       limit at the threshold value (α), and a false acceptance decision error limit at the other
       edge of the gray region (β).

An example of the gray region and limits on the probability of committing both false rejection and
false acceptance decision errors is contained in Box 1-1.

       If DQOs were developed for the study, the tolerable limits on decision errors will already
have been developed.  These values can be transferred directly as outputs for this activity. In this
case, the action level is the threshold value; the false rejection error rate at the action level is the
Type I error rate or α; and the false acceptance error rate at the other bound of the gray region is
the Type II error rate or β.
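The trade-off between the width of the gray region and the amount of data needed can be sketched with a simplified sample size formula for a one-sample test of a mean (a normal approximation with known standard deviation; the DQO guidance documents give more complete formulas, so treat this only as an illustration of the direction of the effect):

```python
import math
from statistics import NormalDist

def sample_size(sigma, action_level, gray_edge, alpha=0.05, beta=0.20):
    """Approximate n so that the false rejection rate at the action level is
    alpha and the false acceptance rate at the other edge of the gray
    region is beta (one-sample test for a mean, known sigma)."""
    z_a = NormalDist().inv_cdf(1 - alpha)
    z_b = NormalDist().inv_cdf(1 - beta)
    width = abs(action_level - gray_edge)        # width of the gray region
    return math.ceil(((z_a + z_b) * sigma / width) ** 2)

# Narrowing the gray region from 1.0 to 0.5 units roughly quadruples n
n_wide = sample_size(sigma=1.0, action_level=10.0, gray_edge=9.0)
n_narrow = sample_size(sigma=1.0, action_level=10.0, gray_edge=9.5)
```

The quadratic dependence on the gray region width is why detecting small departures from the threshold is expensive.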

1.1.4  Review Sampling Design

       The goal of this activity is to familiarize the analyst with the main features of the sampling
design that was used to generate the environmental data. The overall type of sampling design and
the manner in which samples were collected or measurements were taken will place conditions
and constraints on how the data must be used and interpreted. Section 1.3 provides additional
information about several different types of sampling designs that are commonly used in
environmental studies.

       Review the sampling design documentation with the data user's objectives in mind.  Look
for design features that support or contradict those objectives. For example, if the data user is
interested in making a decision about the mean level of contamination in an effluent stream over
time, then composite samples may be an appropriate sampling approach. On the other hand, if the
data user is looking for hot spots of contamination at a hazardous waste site, compositing should
only be used with caution, to avoid "averaging away" hot spots.  Also, look for potential
problems in the implementation of the sampling  design. For example, verify that each point in
space (or time) had an equal  probability of being selected for a simple random  sampling design.
Small deviations from a sampling plan may have minimal effect on the conclusions drawn from the
data set.  Significant or substantial deviations should be flagged and their potential effect carefully
considered throughout the entire DQA.
                      Box 1-1: Example Applying the DQO Process Retrospectively
     A waste incineration company was concerned that waste fly ash could contain hazardous levels of
     cadmium and should be disposed of in a RCRA landfill. As a result, eight composite samples each
     consisting of eight grab samples were taken from each load of waste. The TCLP leachate from these
     samples was then analyzed using a method specified in 40 CFR, Pt. 261, App. II.  DQOs were not
     developed for this problem; therefore, study objectives (Sections  1.1.1 through 1.1.3) should be
     developed before the data are analyzed.

     1.1.1   Review Study Objectives

     •  Develop a concise definition of the problem - The problem is defined above.

     •  Identify if any essential information is missing - It does not appear that any essential information is
        missing.

     •  Specify the scale of decision making - Each waste load is sampled separately and decisions need to
        be made for each load. Therefore, the scale of decision making is an individual load.

     1.1.2   Translate Objectives into Statistical Hypotheses

     Since composite samples were taken, the parameter of interest is the mean cadmium concentration. The
     RCRA regulatory standard for cadmium in TCLP leachate is 1.0 mg/L.  Therefore, the two hypotheses
     are "mean cadmium ≥ 1.0 mg/L" and "mean cadmium < 1.0 mg/L."

     There are two possible decision errors: 1) to decide the waste is hazardous ("mean ≥ 1.0") when it truly is
     not ("mean < 1.0"), and 2) to decide the waste is not hazardous ("mean < 1.0") when it truly is ("mean ≥
     1.0"). The risk of deciding the fly ash is not hazardous when it truly is hazardous is more severe since
     potential consequences of this decision error include risk to human health and the environment.
     Therefore, this error will be labeled the false rejection error and the other error will be the false
     acceptance error. As a result of this decision, the null hypothesis will be that the waste is hazardous
     ("mean cadmium ≥ 1.0 mg/L") and the alternative hypothesis will be that the waste is not hazardous
     ("mean cadmium < 1.0 mg/L"). (See Section 1.2 for more information on developing the null and
     alternative hypotheses.)

     1.1.3   Develop Limits on Decision Errors
     [Figure: decision performance goal diagram showing the gray region and the tolerable
     decision error limits for the fly ash example.]
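The decision error rates for the Box 1-1 setup can be explored with a small simulation. This is a purely illustrative continuation: the measurement standard deviation of 0.15 mg/L, the known-sigma z-test, and α = 0.05 are invented assumptions, not values from the guidance.

```python
import math
import random
from statistics import NormalDist, mean

def prob_decide_not_hazardous(true_mean, sigma=0.15, n=8, limit=1.0,
                              alpha=0.05, trials=20000, seed=1):
    """Monte Carlo estimate of how often the null hypothesis
    'mean cadmium >= limit' is rejected (waste declared not hazardous)
    for a given true mean, using a known-sigma z-test."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(alpha)         # reject H0 when z < z_crit
    rejections = 0
    for _ in range(trials):
        xbar = mean(rng.gauss(true_mean, sigma) for _ in range(n))
        z = (xbar - limit) / (sigma / math.sqrt(n))
        if z < z_crit:
            rejections += 1
    return rejections / trials

# When the true mean sits exactly at the 1.0 mg/L limit, the chance of
# falsely declaring the waste not hazardous should be close to alpha
p_at_limit = prob_decide_not_hazardous(1.0)
```

Evaluating the function over a range of true means traces out the kind of decision performance curve a DQO planning team would examine.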
1.2    DEVELOPING THE STATEMENT OF HYPOTHESES

       The full statement of the statistical hypotheses has two major parts: the null hypothesis
(H0) and the alternative hypothesis (HA). In both parts, a population parameter is compared to
either a fixed value (for a one-sample test) or another population parameter (for a two-sample
test).  The population parameter is a quantitative characteristic of the population that the data
user wants to estimate using the data. In other words, the parameter describes that feature of the
population that the data user will evaluate when making the decision.  Table 1-1 describes  several
common statistical parameters.

       If the data user is interested in drawing inferences about only one population, then the null
and alternative hypotheses will be stated in terms that relate the true value of the parameter to
some fixed threshold value. A common example of this one-sample problem in environmental
studies is when pollutant levels in an effluent stream are compared to a regulatory limit.  If the
data user is interested in comparing two populations,  then the null and alternative hypotheses will
be stated in terms that compare the true value of one population parameter to the corresponding
true parameter value of the other population.  A common example of this two-sample problem in
environmental studies is when a potentially contaminated waste site is being compared to a
reference area using samples collected from the respective areas.  In this situation, the hypotheses
often will be stated in terms of the difference between the two parameters.

       The decision on what should constitute the null hypothesis and what should be the
alternative is sometimes difficult to ascertain. In many cases, this problem does not arise because
the null and  alternative hypotheses are determined by specific regulation.  However, when the null
hypothesis is not specified by regulation, it is necessary to make this determination. The test of
hypothesis procedure prescribes that the null hypothesis is only rejected in favor of the alternative,
provided there is overwhelming evidence from the data that the null hypothesis is false.  In other
words, the null hypothesis is considered to be true unless  the data show conclusively that this is
not so.  Therefore, it is sometimes useful to choose the null and alternative hypotheses in light of
the consequences of possibly making an incorrect decision between the null and alternative
hypotheses.  The true condition that occurs with the more severe decision error (not what would
be decided in error based on the data) should be defined as the null  hypothesis. For example,
consider the two decision errors:  "decide a company does not comply with environmental
regulations when it truly does" and "decide a company does comply with environmental
regulations when it truly does not." If the first decision error is considered the more severe
decision error, then the true condition of this error, "the company does comply with the
regulations"  should be defined as the null hypothesis.  If the second decision error is considered
the more severe decision error, then  the true condition of this error, "the company does  not
comply with the regulations" should be defined as the null hypothesis.

       An alternative method for defining the null hypothesis is based on historical information.
If a large amount of information exists suggesting that one hypothesis is extremely likely, then this
hypothesis should be defined as the alternative hypothesis. In this case, a  large amount of data
may not be necessary to provide overwhelming evidence that the other (null) hypothesis is false.

For example, if the waste from an incinerator was previously hazardous and the waste process has
not changed, it may be more cost-effective to define the alternative hypothesis as "the waste is
hazardous" and the null hypothesis as "the waste is not hazardous."

       Consider a data user who wants to know whether the true mean concentration (μ) of
atrazine in ground water at a hazardous waste site is greater than a fixed threshold value C.  If the
data user presumes from prior information that the true mean concentration is at least C, due
possibly to some contamination incident, then the data must provide compelling evidence to reject
that presumption, and the hypotheses can be stated as follows:

       Null Hypothesis (Baseline Condition):  The true mean concentration of atrazine in
       ground water is greater than or equal to the threshold value C (H0: μ ≥ C); versus

       Alternative Hypothesis:  The true mean concentration of atrazine in ground water is
       less than the threshold value C (HA: μ < C).

On the other hand, if the data user presumes from prior information that the true mean
concentration is less than C, due possibly to the fact that the ground water has not been
contaminated in the past, then the data must provide compelling evidence to reject that
presumption, and the hypotheses can be stated as follows:

       Null Hypothesis (Baseline Condition):  The true mean concentration of atrazine in
       ground water is less than or equal to the threshold value C (H0: μ ≤ C); versus

       Alternative Hypothesis:  The true mean concentration of atrazine in ground water is
       greater than the threshold value C (HA: μ > C).
       In stating the primary hypotheses, it is convenient to use standard statistical notation, as
shown throughout this document. However, the logic underlying the hypothesis always
corresponds to the decision of interest to the data user.

       Table 1-2 summarizes common environmental decisions and the corresponding
hypotheses.  In Table 1-2, the parameter is denoted using the symbol "Θ," and the difference
between two parameters is denoted using "Θ₁ - Θ₂" where Θ₁ represents the parameter of the first
population and Θ₂ represents the parameter of the second population.  The use of "Θ" is to avoid
using the terms "population mean" or "population median" repeatedly because the structure of the
hypothesis test remains the same regardless of the population parameter. The fixed threshold
value is denoted "C," and the difference between two parameters is denoted "δ₀" (often the null
hypothesis is defined such that δ₀ = 0).

       For the first problem in Table 1-2, only estimates of Θ that exceed C can cast doubt on the
null hypothesis. This is called a one-tailed hypothesis test, because only parameter estimates on
one side of the threshold value can lead to rejection of the null hypothesis.  The second, fourth,
and fifth rows of Table 1-2 are also examples of one-tailed hypothesis tests. The third and sixth
rows of Table 1-2 are examples of two-tailed tests, because estimates falling both below and
above the null-hypothesis parameter value can lead to rejection of the null hypothesis. Most
hypotheses connected with environmental monitoring are one-tailed because high pollutant levels
can harm humans or ecosystems.
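The one-tailed versus two-tailed distinction shows up directly in how a p-value is computed from a test statistic. A minimal sketch for a standard normal (z) statistic:

```python
from statistics import NormalDist

def p_value(z, tails):
    """p-value of a z statistic under a one- or two-tailed alternative."""
    nd = NormalDist()
    if tails == "greater":                 # HA: parameter > threshold
        return 1 - nd.cdf(z)
    if tails == "less":                    # HA: parameter < threshold
        return nd.cdf(z)
    return 2 * (1 - nd.cdf(abs(z)))        # HA: parameter != threshold
```

For the same statistic, the two-tailed p-value is twice the matching one-tailed value, so a one-tailed test rejects more readily when the departure lies in the anticipated direction.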

1.3    DESIGNS FOR SAMPLING ENVIRONMENTAL MEDIA

       Sampling designs provide the basis for how a set of samples may be analyzed.  Different
sampling designs require different analysis techniques and different assessment procedures. There
are two primary types of sampling designs:  authoritative (judgment) sampling and probability
sampling. This section describes some of the most common sampling designs.

1.3.1   Authoritative Sampling

       With authoritative (judgment) sampling, an expert having knowledge of the site (or
process) designates where and when samples are to be taken. This type of sampling should only
be considered when the objectives of the investigation are not of a statistical nature, for example,
when the objective of a study is to identify specific locations of leaks, or when the study is
focused solely on the sampling locations themselves. Generally, conclusions drawn from
authoritative samples apply only to the individual samples and  aggregation  may result in severe
bias and lead to highly erroneous conclusions.  Judgmental sampling also precludes the use of the
sample for any purpose other than the original one.  Thus if the data may be used in further
studies (e.g., for an estimate of variability in a later study), a probabilistic design should be used.

       When the study objectives involve estimation or decision making, some form of
probability sampling should be selected. As described below, this does not preclude use of the
expert's knowledge of the site or process in designing a probability-based sampling plan; however,

              Table 1-2. Commonly Used Statements of Statistical Hypotheses

  Compare environmental conditions to a fixed threshold value, such as a regulatory
  standard or acceptable risk level; presume that the true condition is less than the
  threshold value.
         Null Hypothesis:  H0: Θ ≤ C          Alternative Hypothesis:  HA: Θ > C

  Compare environmental conditions to a fixed threshold value; presume that the true
  condition is greater than the threshold value.
         Null Hypothesis:  H0: Θ ≥ C          Alternative Hypothesis:  HA: Θ < C

  Compare environmental conditions to a fixed threshold value; presume that the true
  condition is equal to the threshold value.
         Null Hypothesis:  H0: Θ = C          Alternative Hypothesis:  HA: Θ ≠ C

  Compare environmental conditions associated with two different populations to a fixed
  threshold value (δ₀) such as a regulatory standard or acceptable risk level; presume that
  the true condition is less than the threshold value. If it is presumed that conditions
  associated with the two populations are the same, the threshold value is 0.
         Null Hypothesis:  H0: Θ₁ - Θ₂ ≤ δ₀   (H0: Θ₁ - Θ₂ ≤ 0)
         Alternative Hypothesis:  HA: Θ₁ - Θ₂ > δ₀   (HA: Θ₁ - Θ₂ > 0)

  Compare environmental conditions associated with two different populations to a fixed
  threshold value (δ₀) such as a regulatory standard or acceptable risk level; presume that
  the true condition is greater than the threshold value. If it is presumed that conditions
  associated with the two populations are the same, the threshold value is 0.
         Null Hypothesis:  H0: Θ₁ - Θ₂ ≥ δ₀   (H0: Θ₁ - Θ₂ ≥ 0)
         Alternative Hypothesis:  HA: Θ₁ - Θ₂ < δ₀   (HA: Θ₁ - Θ₂ < 0)

  Compare environmental conditions associated with two different populations to a fixed
  threshold value (δ₀) such as a regulatory standard or acceptable risk level; presume that
  the true condition is equal to the threshold value. If it is presumed that conditions
  associated with the two populations are the same, the threshold value is 0.
         Null Hypothesis:  H0: Θ₁ - Θ₂ = δ₀   (H0: Θ₁ - Θ₂ = 0)
         Alternative Hypothesis:  HA: Θ₁ - Θ₂ ≠ δ₀   (HA: Θ₁ - Θ₂ ≠ 0)
valid statistical inferences require that the plan incorporate some form of randomization in
choosing the sampling locations or sampling times. For example, to determine maximum SO2
emission from a boiler, the sampling plan would reasonably focus, or put most of the weight on,
periods of maximum or near-maximum boiler operation.  Similarly, if a residential lot is being
evaluated for contamination, then the sampling plan can take into consideration prior knowledge
of contaminated areas, by weighting such areas more heavily in the sample selection and data
analysis.

1.3.2   Probability Sampling

       Probability samples are  samples in which every member of the target population (i.e.,
every potential sampling unit) has a known probability of being included in the sample.
Probability  samples can be of various types, but in some way, they all make use of randomization,
which allows valid probability statements to be made about the quality of estimates  or hypothesis
tests that are derived from the resultant data.

       One common misconception of probability sampling procedures is that these procedures
preclude  the use of important prior information. Indeed, just the opposite is true. An efficient
sampling design is one that uses all available prior information to stratify the region and set
appropriate probabilities of selection.  Another common misconception is that using a probability
sampling design means allowing the possibility that the sample points will not be distributed
appropriately across the region. However, if there is no prior information regarding the areas
most likely to be contaminated, a grid sampling scheme (a type of stratified design) is usually
recommended to ensure that the sampling points are dispersed across the region.

       1.3.2.1        Simple Random Sampling

       The simplest type of probability sample is the simple random sample where every possible
sampling unit in the target population has an equal chance of being selected. Simple random
samples, like other samples, can be taken in time and/or space and are often
appropriate at an early stage of an investigation in which little is known about systematic variation
within the site or process. All of the sampling units should  have equal volume or mass, and
ideally be of the same shape if applicable. With a simple random sample, the term "random"
should not be interpreted to mean haphazard; rather, it has the explicit meaning of equiprobable
selection. Simple random samples are generally developed through use of a random number table
or through computer generation of pseudo-random numbers.
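Equiprobable selection is straightforward to implement with pseudo-random numbers. A sketch with a hypothetical 10 x 10 grid of candidate sampling locations:

```python
import random

def simple_random_sample(units, n, seed=None):
    """Each unit has an equal probability of selection (no replacement)."""
    rng = random.Random(seed)
    return rng.sample(units, n)

# Hypothetical 10 x 10 grid of candidate sampling locations; choose 10
cells = [(row, col) for row in range(10) for col in range(10)]
chosen = simple_random_sample(cells, 10, seed=42)
```

Seeding the generator makes the selection reproducible, which helps document how the sampling locations were chosen.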

       1.3.2.2        Sequential Random Sampling

       Usually, simple random samples have a fixed sample size, but some  alternative approaches
are available, such as sequential random sampling, where the sample sizes are not fixed a priori.
Rather, a statistical test is performed after each specimen's analysis (or after some minimum
number have been analyzed). This strategy  could be applicable when sampling and/or analysis is
quite expensive, when information concerning sampling and/or measurement variability is lacking,
when the characteristics of interest are stable over the time frame of the sampling effort, or when
the objective of the sampling effort is to test a single specific hypothesis.
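The sequential idea can be sketched as a loop that recomputes a confidence interval after each new specimen and stops once the interval clears the threshold. The stopping rule below is a simplification for illustration; a formal sequential procedure (such as a sequential probability ratio test) controls the error rates more rigorously.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def sequential_test(draw_specimen, threshold, alpha=0.05, min_n=4, max_n=50):
    """Analyze one specimen at a time; stop as soon as a one-sided
    confidence bound is conclusively above or below the threshold."""
    z = NormalDist().inv_cdf(1 - alpha)
    data = []
    while len(data) < max_n:
        data.append(draw_specimen())
        if len(data) < min_n:
            continue
        se = stdev(data) / math.sqrt(len(data))
        if mean(data) + z * se < threshold:
            return "below threshold", len(data)
        if mean(data) - z * se > threshold:
            return "above threshold", len(data)
    return "inconclusive", len(data)

# Hypothetical analysis stream: true mean 2.0 mg/L, threshold 5.0 mg/L
rng = random.Random(0)
decision, n_used = sequential_test(lambda: rng.gauss(2.0, 0.1), threshold=5.0)
```

When the true condition is far from the threshold, the loop stops after only a few specimens, which is the cost saving that motivates sequential designs.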

       1.3.2.3        Systematic Samples

       In the case of spatial sampling, systematic sampling involves establishing a two-
dimensional (or in some cases a three-dimensional) spatial grid and selecting a random starting
location within one of the cells.  Sampling points in the other cells are located in a deterministic
way  relative to that starting point. In addition, the orientation of the grid is sometimes chosen
randomly and various types of systematic samples are possible. For example, points may be
arranged in a pattern of squares (rectangular grid sampling) or a pattern of equilateral triangles
(triangular grid sampling). The result of either approach is a simple pattern of equally spaced
points at which sampling is to be performed.
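A rectangular-grid systematic sample can be sketched as a random starting point within the first cell followed by deterministic, equally spaced offsets. The site dimensions and spacing below are hypothetical:

```python
import random

def rectangular_grid(x_max, y_max, spacing, seed=None):
    """Square-grid systematic sample: one random start in the first cell,
    then equally spaced points across the site."""
    rng = random.Random(seed)
    x0, y0 = rng.uniform(0, spacing), rng.uniform(0, spacing)
    points = []
    y = y0
    while y < y_max:
        x = x0
        while x < x_max:
            points.append((x, y))
            x += spacing
        y += spacing
    return points

# 100 m x 100 m site with 10 m spacing -> a 10 x 10 pattern of points
points = rectangular_grid(100, 100, 10, seed=1)
```

Only the starting point is random; every other location follows deterministically, which is exactly why a cyclical pattern in the site can bias the results.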

       Systematic sampling designs have several advantages over random sampling and some of
the other types of probability sampling; for example, they are generally easier to implement.
They are also preferred when one of the objectives is to locate "hot spots" within a site or
otherwise map the pattern of concentrations over a site. On the other hand, they should be used
with caution whenever there is a possibility of some type of cyclical pattern in the waste site or
process.  Such a situation,  combined with the uniform pattern of sampling points, could very
readily lead to biased results.

       1.3.2.4        Stratified Samples

       Another type of probability sample is the stratified random sample, in which the site or
process is divided into two or more non-overlapping strata, sampling units are defined for each
stratum, and separate simple random samples are employed to select the units in each stratum. (If
a systematic sample were employed  within each stratum, then the design would be referred to as a
stratified systematic sample.)  Strata should be defined so that physical samples within a stratum
are more similar to each other than to samples from other strata. If so, a stratified random sample
should result in more precise estimates of the overall population parameter than those that would
be obtained from a simple  random sample with the same number of sampling units.

       Stratification is a way to incorporate prior knowledge and professional judgment into a
probabilistic sampling design. Generally, units that  are "alike" or anticipated to be "alike" are
placed together in the same stratum. Units that are contiguous in space (e.g., similar depths) or
time are often grouped together into the same stratum,  but characteristics other than  spatial or
temporal proximity can also be employed. Media, terrain characteristics, concentration levels,
previous cleanup attempts, and confounding contaminants can be used to create strata.

       Advantages of stratified samples over random samples include their ability to ensure more
uniform coverage of the entire target population and, as noted above, their potential for achieving
greater precision in certain estimation problems.  Even when imperfect information is used to
form strata, the stratified random sample will generally be more cost-effective than a simple
random sample.  A stratified design can also be useful when there is interest in estimating or
testing characteristics for subsets of the target population. Because different sampling rates can
be used in different strata, one can oversample in strata containing those subareas of particular
interest to ensure that they are represented in the sample.  In general, statistical calculations for
data generated via stratified samples are more complex than for random samples, and certain
types of tests cannot be performed when stratified samples are employed.  Therefore, a
statistician should be consulted when stratified sampling is used.
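As a sketch of the weighted arithmetic average mentioned above, the overall mean from a stratified random sample can be estimated by weighting each stratum mean by the fraction of population units in that stratum. The stratum sizes and data values below are invented for illustration:

```python
def stratified_mean(strata):
    """Weighted mean across strata.

    `strata` is a list of (N_h, data_h) pairs, where N_h is the number of
    population units in stratum h and data_h is the simple random sample
    drawn from that stratum.  Each stratum mean is weighted by W_h = N_h / N.
    """
    total_units = sum(n for n, _ in strata)
    return sum((n / total_units) * (sum(xs) / len(xs)) for n, xs in strata)

# Two hypothetical strata: a small "hot spot" stratum that was oversampled,
# and a larger background stratum.
strata = [(100, [12.0, 15.0, 11.0, 14.0]),   # hot spot: N = 100, n = 4
          (900, [2.0, 3.0, 2.5, 2.5])]       # background: N = 900, n = 4
print(round(stratified_mean(strata), 2))     # 3.55
```

Note that the unweighted average of all eight measurements would overstate the overall mean, because the hot spot was deliberately oversampled.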

       1.3.2.5        Compositing Physical Samples

       When analysis costs are large relative to sampling costs, cost-effective plans can
sometimes be achieved by compositing physical samples or specimens prior to analysis, assuming
that there are no safety hazards or potential biases (for example, the loss of volatile organic
compounds from a matrix) associated with such  compositing. For the  same total cost,
compositing in this situation would allow a larger number of sampling units to be selected than
would be the case if compositing were not used.  Composite samples reflect a physical rather than
a mathematical mechanism for averaging.  Therefore, compositing should generally be avoided if
population parameters other than a mean are of interest (e.g., percentiles or standard deviations).

       Composite sampling is also useful when the analyses of composited samples are to be used
in a two-staged approach in which the composite-sample analyses are used solely as a screening
mechanism to identify if additional, separate analyses need to be performed. This situation might
occur during an early stage  of a study that seeks to locate those areas that deserve increased
attention due to potentially high levels of one or  more contaminants.

       1.3.2.6        Other Sampling Designs

       Adaptive sampling involves taking a sample and using the resulting information to design
the next stage of sampling.  The process may continue through several  additional rounds of
sampling and analysis. A common application of adaptive sampling to  environmental problems
involves subdividing the region of interest into smaller units, taking a probability sample of these
units, then sampling all units that border on any unit with a concentration level greater than some
specified level  C. This process is continued until all newly sampled units are below C.  The field
of adaptive sampling is currently undergoing active development and can be expected to have a
significant impact on environmental sampling.
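The neighbor-expansion rule described above can be sketched on a one-dimensional transect of units; the concentrations, threshold C, and initial probability sample below are hypothetical:

```python
def adaptive_sample(conc, initial_units, C):
    """Return the sorted unit indices sampled under the adaptive rule.

    Starting from a probability sample `initial_units`, any sampled unit whose
    concentration exceeds C triggers sampling of its bordering units; the
    process repeats until no newly sampled unit exceeds C.
    """
    sampled = set(initial_units)
    frontier = [u for u in initial_units if conc[u] > C]
    while frontier:
        u = frontier.pop()
        for v in (u - 1, u + 1):                 # bordering units
            if 0 <= v < len(conc) and v not in sampled:
                sampled.add(v)
                if conc[v] > C:
                    frontier.append(v)
    return sorted(sampled)

conc = [1, 2, 9, 8, 7, 1, 1, 2, 1, 6]            # hypothetical unit concentrations
print(adaptive_sample(conc, initial_units=[3], C=5))   # [1, 2, 3, 4, 5]
```

Sampling spreads outward from the initial unit until the boundary of the contaminated patch is delineated; the isolated elevated unit at index 9 is not found because no initial sample landed near it.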

       Ranked set sampling (RSS) uses the availability of an inexpensive  surrogate measurement
when it is correlated with the more expensive measurement of interest. The method exploits this
correlation to obtain a sample that is more representative of the population than one obtained by
simple random sampling, thereby leading to more precise estimates of population parameters.
RSS consists of creating n groups, each of
size n (for a total of n² initial samples), then ranking the surrogate within
each group.  One sample from each group is then selected according to a specified procedure and
these n samples are analyzed for the more expensive measurement of interest.
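The grouping-and-ranking step can be sketched as follows. The document leaves the selection procedure unspecified, so this sketch assumes the common balanced rule in which the i-th ranked unit (ranking ascending here) is taken from the i-th group; the surrogate values are simulated:

```python
import random

def ranked_set_sample(units, n, rank_key):
    """Form n groups of size n, rank each group by the cheap surrogate
    `rank_key`, and keep the i-th ranked unit from group i (i = 1..n).
    Only these n units go on to the expensive analysis."""
    assert len(units) >= n * n
    chosen = []
    for i in range(n):
        group = sorted(units[i * n:(i + 1) * n], key=rank_key)
        chosen.append(group[i])      # i-th smallest from group i
    return chosen

random.seed(1)
units = [random.uniform(0, 10) for _ in range(9)]   # n^2 = 9 surrogate values
sample = ranked_set_sample(units, n=3, rank_key=lambda x: x)
print(sample)   # 3 units selected for the expensive measurement
```

Because the selected units span the low, middle, and high ends of each group, the resulting sample tends to cover the population's range better than a simple random sample of the same size.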


-------
                                             CHAPTER 2
                   STEP 2:  CONDUCT A PRELIMINARY DATA REVIEW
THE DATA QUALITY ASSESSMENT PROCESS

       [Flowchart: the five steps of the DQA Process (Review DQOs and Sampling Design;
       Conduct Preliminary Data Review; Select the Statistical Test; Verify the Assumptions;
       Draw Conclusions From the Data), with Step 2, Conduct Preliminary Data Review,
       highlighted.]

           Purpose

           Generate statistical quantities and graphical
           representations that describe the data.  Use this
           information to learn about the structure of the data
           and identify any patterns or relationships.


           Activities

           • Review Quality Assurance Reports
           • Calculate Basic Statistical Quantities
           • Graph the Data


           Tools

           • Statistical quantities
           • Graphical representations
                             Step 2:  Conduct a Preliminary Data Review

              Review quality assurance reports.
              •    Look for problems or anomalies in the implementation of the sample collection and
                   analysis procedures.
              •    Examine QC data for information to verify assumptions underlying the Data Quality
                   Objectives, the Sampling and Analysis Plan, and the QA Project Plans.

              Calculate the statistical quantities.
              •    Consider calculating appropriate percentiles (Section 2.2.1).
              •    Select measures of central tendency (Section 2.2.2) and dispersion (Section 2.2.3).
              •    If the data involve two variables, calculate the correlation coefficient (Section 2.2.4).

              Display the data using graphical representations.
              •    Select graphical representations (Section 2.3) that illuminate the structure of the data
                   set and highlight assumptions underlying the Data Quality Objectives, the Sampling
                   and Analysis Plan, and the QA Project Plans.
              •    Use a variety of graphical representations that examine different features of the set.

-------
                                        Boxes
Box 2-1:  Directions for Calculating the Measure of Relative Standing (Percentiles) 	2-6
Box 2-2:  Directions for Calculating the Measures of Central Tendency	2-7
Box 2-3:  Example Calculations of the Measures of Central Tendency	2-7
Box 2-4:  Directions for Calculating the Measures of Dispersion	2-9
Box 2-5:  Example Calculations of the Measures of Dispersion 	2-9
Box 2-6:  Directions for Calculating Pearson's Correlation Coefficient  	2-10
Box 2-7:  Directions for Calculating Spearman's Correlation	2-12
Box 2-8:  Directions for Estimating the Serial Correlation Coefficient with an Example .... 2-13
Box 2-9:  Directions for Generating a Histogram and a Frequency Plot	2-14
Box 2-10: Example of Generating a Histogram and a Frequency Plot	2-15
Box 2-11: Directions for Generating a Stem and Leaf Plot 	2-16
Box 2-12: Example of Generating a Stem and Leaf Plot	2-16
Box 2-13: Directions for Generating a Box and Whiskers Plot 	2-18
Box 2-14: Example of a Box and Whiskers Plot	2-18
Box 2-15: Directions for Generating a Ranked Data Plot  	2-19
Box 2-16: Example of Generating a Ranked Data Plot 	2-19
Box 2-17: Directions for Generating a Quantile Plot	2-22
Box 2-18: Example of Generating a Quantile Plot	2-22
Box 2-19: Directions for Constructing a Normal Probability Plot 	2-23
Box 2-20: Example of Normal Probability Plot	2-24
Box 2-21: Directions for Generating a Scatter Plot and an Example  	2-28
Box 2-22: Directions for Constructing an Empirical Q-Q Plot with an Example  	2-31

                                       Figures
Figure 2-1. Example of a Frequency Plot 	2-13
Figure 2-2. Example of a Histogram	2-14
Figure 2-3. Example of a Box and Whisker Plot	2-17
Figure 2-4. Example of a Ranked Data Plot  	2-20
Figure 2-5. Example of a Quantile Plot of Skewed Data	2-21
Figure 2-6. Normal Probability Paper	2-25
Figure 2-7. Example of Graphical Representations of Multiple Variables	2-26
Figure 2-8. Example of a Scatter Plot	2-27
Figure 2-9. Example of a Coded Scatter Plot  	2-28
Figure 2-10.  Example of a Parallel Coordinates Plot	2-29
Figure 2-11.  Example of a Matrix Scatter Plot  	2-29
Figure 2-12.  Example of a Time Plot Showing a Slight Downward Trend  	2-33
Figure 2-13.  Example of a Correlogram 	2-34
Figure 2-14.  Example of a Posting Plot	2-37
Figure 2-15.  Example of a Symbol Plot	2-38
Figure 2-16.  The Normal Distribution	2-40
Figure 2-17. The Standard Normal Curve (Z-Curve)	2-40

-------
                                       CHAPTER 2
                STEP 2:  CONDUCT A PRELIMINARY DATA REVIEW

2.1    OVERVIEW AND ACTIVITIES

       In this step of DQA, the analyst conducts a preliminary evaluation of the data set,
calculates some basic statistical quantities, and examines the data using graphical representations.
A preliminary data review should be performed whenever data are used, regardless of whether
they are used to support a decision, estimate a population parameter, or answer exploratory
research questions. By reviewing the data both numerically and graphically, one can learn the
"structure" of the data and thereby identify appropriate approaches and limitations for using the
data.  The DQA software Data Quality Evaluation Statistical Tools (DataQUEST) (G-9D) (EPA,
1996) will perform all of these functions as well as more sophisticated statistical tests.

       There are two main elements of preliminary data review:  (1) basic statistical quantities
(summary statistics); and  (2) graphical representations of the data.  Statistical quantities are
functions of the data that numerically describe the data set.  Examples include a mean, median,
percentile, range, and standard deviation. They can be used to provide a mental picture of the
data and are useful for making inferences concerning the population from which the data were
drawn. Graphical representations are used to identify patterns and relationships within the data,
confirm or disprove hypotheses, and identify potential problems.  For example, a normal
probability plot may allow an analyst to quickly discard an assumption of normality and may
identify potential outliers.

       The preliminary data review step is designed to make the analyst familiar with the data.
The review should identify anomalies that could indicate unexpected events that may influence the
analysis of the data.  The  analyst may know what to look for based on the anticipated use of the
data documented in the DQO Process, the QA Project Plan,  and any associated documents. The
results of the review are then used to select a procedure for testing a statistical hypothesis to
support the data user's decision.

2.1.1  Review Quality Assurance Reports

       The first activity in conducting a preliminary data review is to review any relevant QA
reports that describe the data collection and reporting process as it actually was implemented.
These QA reports provide valuable information about potential problems or anomalies in the data
set.  Specific items that may be helpful include:

       •       Data validation reports that document the sample collection, handling, analysis,
              data reduction, and reporting procedures used;

       •      Quality control reports from laboratories or field stations that document
              measurement system performance, including data from check samples, split
              samples, spiked samples, or any other internal QC measures; and


-------
       •      Technical systems reviews, performance evaluation audits, and audits of data
              quality, including data from performance evaluation samples.

       When reviewing QA reports, particular attention should be paid to information that can be
used to check assumptions made in the DQO Process. Of great importance are apparent
anomalies in recorded data, missing values, deviations from standard operating procedures,  and
the use of nonstandard data collection methodologies.

2.1.2  Calculate Basic Statistical Quantities

       The goal of this activity is to summarize some basic quantitative characteristics  of the data
set using common statistical quantities. Some statistical quantities that are useful to the analyst
include: number of observations; measures of central tendency, such as a mean, median, or mode;
measures of dispersion, such as range, variance, standard deviation, coefficient of variation, or
interquartile range; measures of relative standing, such as percentiles; measures of distribution
symmetry or shape; and measures of association between two or more variables, such as
correlation.  These measures can then be used for description, communication, and for testing
hypotheses regarding the population from which the data were drawn.  Section 2.2 provides
detailed descriptions and examples of these statistical quantities.

       The sample design may influence how the statistical quantities are computed. The
formulas given in this chapter are for simple random sampling, simple random sampling with
composite samples, and randomized systematic sampling.  If a more complex design is used, such
as a stratified design, then the formulas may need to be adjusted.

2.1.3  Graph the Data

       The goal of this step is to identify patterns and trends in the data that might go unnoticed
using purely numerical methods. Graphs can be used to identify these patterns and trends, to
quickly confirm  or disprove hypotheses, to discover new phenomena, to identify potential
problems, and to suggest corrective measures.  In addition, some graphical representations can be
used to record and store data compactly or to convey information to others. Graphical
representations include displays of individual data points, statistical quantities, temporal data,
spatial data, and two or more variables. Since no single graphical representation will provide a
complete picture of the data set, the analyst should choose different graphical techniques to
illuminate different features of the data. Section 2.3 provides descriptions and examples of
common graphical representations.

       At a minimum, the analyst should choose a graphical representation of the individual data
points and a graphical representation of the statistical quantities. If the data set has a spatial or
temporal component, select graphical representations specific to temporal or spatial data in
addition to those that do not.  If the data set consists of more than one variable, treat each
variable individually before developing graphical representations for the multiple variables.  If the


-------
sampling plan or suggested analysis methods rely on any critical assumptions, consider whether a
particular type of graph might shed light on the validity of that assumption.  For example, if a
small-sample study is strongly dependent on the assumption of normality, then a normal
probability plot would be useful (Section 2.3.6).

       The sampling design may influence what data may be included in each representation.
Usually, the graphical representations should be applied to each complete unit of randomization
separately or each unit of randomization should be represented with a different symbol. For
example, the analyst could generate box plots for each stratum instead of generating one box plot
with all the strata.

2.2    STATISTICAL QUANTITIES

2.2.1  Measures of Relative Standing

       Sometimes the analyst is interested in knowing the relative position of one of several
observations in relation to all of the observations. Percentiles are one such measure of relative
standing that may also be useful for summarizing data. A percentile is the data value that is
greater than or equal to a given percentage of the data values.  Stated in mathematical terms, the
pth percentile is the data value that is greater than or equal to p% of the data values and is less
than or equal to (100-p)% of the data values. Therefore, if x is the pth percentile, then p% of the
values in the data set are less than or equal to x, and (100-p)% of the values are greater than or
equal to x.  A sample percentile may fall between a pair of observations. For example, the  75th
percentile of a data set of 10 observations is not uniquely defined.  Therefore, there are several
methods for computing sample percentiles, the  most  common of which is described in Box 2-1.

       Important percentiles usually reviewed  are the quartiles of the data, the 25th, 50th, and 75th
percentiles. The 50th percentile is also called the sample median (Section 2.2.2), and the 25th and
75th percentile are used to estimate the dispersion of a data  set (Section 2.2.3). Also important for
environmental data are the 90th, 95th, and 99th percentile where a decision  maker would like to be
sure that 90%, 95%, or 99% of the contamination levels are below a fixed risk level.

       A quantile is similar in concept to a percentile; however, a percentile represents a
percentage whereas a quantile represents a fraction.  If x is the pth percentile, then at least p% of
the values in the data set lie at or below x, and at least (100-p)% of the values lie at or above x,
whereas if x is the p/100 quantile of the data, then the fraction p/100 of the data values lie at or
below x and the fraction (100-p)/100 of the data values lie at or above x.  For example, the .95
quantile has the property that .95 of the observations lie at or below x and .05 of the data lie at or
above x.  For the example in Box 2-1, 9 ppm would be the .90 quantile and 10 ppm would be the
.95 quantile of the data.

-------
           Box 2-1: Directions for Calculating the Measure of Relative Standing (Percentiles)
                                        with an Example

    Let X1, X2, ..., Xn represent the n data points. To compute the pth percentile, y(p), first list the data from
    smallest to largest and label these points X(1), X(2), ..., X(n) (so that X(1) is the smallest, X(2) is the
    second smallest, and X(n) is the largest). Let t = p/100, and multiply the sample size n by t. Divide the
    result into the integer part and the fractional part, i.e., let nt = j + g where j is the integer part and g is the
    fractional part. Then the pth percentile, y(p), is calculated by:

           If g = 0,            y(p) = (X(j) + X(j+1))/2

           otherwise,           y(p) = X(j+1)

    Example: The 90th and 95th percentile will be computed for the following 10 data points (ordered from
    smallest to largest): 4, 4, 4, 5, 5, 6, 7, 7, 8, and 10 ppm.

    For the 95th percentile, t = p/100 = 95/100 = .95 and nt = (10)(.95) = 9.5 = 9 + .5. Therefore, j = 9 and
    g = .5. Because g = .5 ≠ 0, y(95) = X(j+1) = X(9+1) = X(10) = 10 ppm. Therefore, 10 ppm is the 95th
    percentile of the above data.

    For the 90th percentile, t = p/100 = 90/100 = .9 and nt = (10)(.9) = 9. Therefore, j = 9 and g = 0. Since g =
    0, y(90) = (X(9) + X(10))/2 = (8 + 10)/2 = 9 ppm.
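The Box 2-1 convention can be expressed as a short function. This is a sketch of that specific rule (not of any standard library's percentile definition, which may interpolate differently), reproducing the example's results:

```python
def percentile(data, p):
    """p-th percentile y(p) of `data` using the Box 2-1 convention."""
    x = sorted(data)                     # X(1), ..., X(n) as x[0], ..., x[n-1]
    n = len(x)
    nt = n * p / 100.0
    j = int(nt)                          # integer part of nt
    g = nt - j                           # fractional part of nt
    if g == 0:
        return (x[j - 1] + x[j]) / 2     # (X(j) + X(j+1)) / 2 in 1-based labels
    return x[j]                          # X(j+1) in 1-based labels

data = [4, 4, 4, 5, 5, 6, 7, 7, 8, 10]   # ppm, from Box 2-1
print(percentile(data, 95))   # 10  (g = .5, so y(95) = X(10))
print(percentile(data, 90))   # 9.0 ((X(9) + X(10)) / 2 = (8 + 10) / 2)
```

The same function yields the quartiles used later for the interquartile range: y(25) = 4 ppm and y(75) = 7 ppm for these data.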
2.2.2  Measures of Central Tendency

       Measures of central tendency characterize the center of a sample of data points. The three
most common estimates are the mean, median, and the mode.  Directions for calculating these
quantities are contained in Box 2-2; examples are provided in Box 2-3.

       The most commonly used measure of the center of a sample is the sample mean, denoted
by x. This estimate of the center of a sample can be thought of as the "center of gravity" of the
sample.  The sample mean is an arithmetic average for simple sampling designs; however, for
complex sampling designs, such as stratification, the sample mean is a weighted arithmetic
average. The sample mean is influenced by extreme values (large or small) and nondetects (see
Section 4.7).

       The sample median (x̃) is the second most popular measure of the center of the data.  This
value falls directly in the middle of the data when the measurements are ranked in order from
smallest to largest.  This means that 1/2 of the data are smaller than the sample median and 1/2 of
the data are larger than the sample median.  The median is another name for the 50th percentile
(Section 2.2.1).  The median is not influenced by extreme values and can easily be used in the case
of censored data (nondetects).

       The third method of measuring the center of the data is the mode. The sample mode  is the
value of the sample that occurs with the  greatest frequency.  Since this value  may not always
exist, or if it does it may not be unique, this value is the least commonly used. However, the
mode is useful for qualitative  data.

-------
                  Box 2-2: Directions for Calculating the Measures of Central Tendency

     Let X1, X2, ..., Xn represent the n data points.

     Sample Mean: The sample mean x̄ is the sum of all the data points divided by the total number of data
     points (n):

                               x̄ = (X1 + X2 + ... + Xn) / n

     Sample Median: The sample median (x̃) is the center of the data when the measurements are ranked in
     order from smallest to largest. To compute the sample median, list the data from smallest to largest and
     label these points X(1), X(2), ..., X(n) (so that X(1) is the smallest, X(2) is the second smallest, and X(n) is
     the largest).

     If the number of data points is odd, then x̃ = X([n+1]/2)

     If the number of data points is even, then x̃ = (X(n/2) + X([n/2]+1)) / 2

     Sample Mode: The mode is the value of the sample that occurs with the greatest frequency. The mode
     may not exist, or if it does, it may not be unique. To find the mode, count the number of times each
     value occurs. The sample mode is the value that occurs most frequently.
                  Box 2-3: Example Calculations of the Measures of Central Tendency

     Using the directions in Box 2-2 and the following 10 data points (in ppm): 4, 5, 6, 7, 4, 10, 4, 5, 7, and 8,
     the following is an example of computing the sample mean, median, and mode.

     Sample mean:

        x̄ = (4 + 5 + 6 + 7 + 4 + 10 + 4 + 5 + 7 + 8) / 10 = 60/10 = 6 ppm

     Therefore, the sample mean is 6 ppm.

     Sample median: The ordered data are: 4, 4, 4, 5, 5, 6, 7, 7, 8, and 10. Since n = 10 is even, the sample
     median is

        x̃ = (X(10/2) + X([10/2]+1)) / 2 = (X(5) + X(6)) / 2 = (5 + 6)/2 = 5.5 ppm

     Thus, the sample median is 5.5 ppm.

     Sample mode: Computing the number of times each value occurs yields:

       4 appears 3 times; 5 appears 2 times; 6 appears 1 time; 7 appears 2 times; 8 appears 1 time; and 10
       appears 1 time.

     Because the value of 4 ppm appears the most times, it is the mode of this data set.
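For illustration, the Box 2-3 results can be reproduced with the Python standard library's statistics module. This assumes a simple random sample, where the sample mean is a plain arithmetic average; a stratified design would need the weighted version discussed in the text:

```python
from statistics import mean, median, mode

data = [4, 5, 6, 7, 4, 10, 4, 5, 7, 8]   # ppm, from Box 2-3

print(mean(data))     # 6    (sum of 60 over 10 points)
print(median(data))   # 5.5  (average of the two middle ordered values)
print(mode(data))     # 4    (occurs three times)
```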

-------
2.2.3  Measures of Dispersion

       Measures of central tendency are more meaningful if accompanied by information on how
the data spread out from the center. Measures of dispersion in a data set include the range,
variance, sample standard deviation, coefficient of variation, and the interquartile range.
Directions for computing these measures are given in Box 2-4; examples are given in Box 2-5.

       The easiest measure of dispersion to compute is the  sample range. For small samples, the
range is easy to interpret and may adequately represent the  dispersion of the data.  For large
samples, the range is not very informative because it only considers (and therefore is greatly
influenced by) extreme values.

       The sample variance measures the dispersion from the mean of a data set. A large sample
variance implies that there is a large spread among the data so that the data are not clustered
around the mean.  A small sample variance implies that there is little spread among the data so
that most of the data are near the mean.  The sample variance is affected by extreme values and by
a large number of nondetects. The sample standard deviation is the square root of the sample
variance and has the same unit of measure as the data.

       The coefficient of variation (CV) is a unitless measure that allows the comparison of
dispersion across several sets of data.  The CV is often used in environmental applications
because variability (expressed as a standard deviation) is often proportional to the mean.

       When extreme values are present, the interquartile range may be more representative of
the dispersion of the data than the standard deviation.  This statistical quantity does not depend on
extreme values and is therefore useful when the data include a large number of nondetects.

2.2.4  Measures of Association

       Data often include measurements of several  characteristics (variables) for each sample
point and there may be interest in knowing the relationship  or level of association between two or
more of these variables.  One  of the most common measures of association is the correlation
coefficient. The correlation coefficient measures the relationship between two variables, such as a
linear relationship between two sets of measurements. However, the correlation coefficient does
not imply cause and effect.  The analyst may say that the correlation between two variables is high
and the relationship is strong,  but may not say that one variable causes the other variable to
increase or decrease without further evidence and strong statistical controls.

       2.2.4.1        Pearson's Correlation Coefficient

       The Pearson correlation coefficient measures a linear relationship between two variables.
A linear association implies that as one variable increases so does the other linearly, or as one

-------
                     Box 2-4: Directions for Calculating the Measures of Dispersion

     Let X1, X2, ..., Xn represent the n data points.

     Sample Range: The sample range (R) is the difference between the largest value and the smallest value
     of the sample, i.e., R = maximum - minimum.

     Sample Variance: To compute the sample variance (s²), compute:

                     s² = [ Σ Xi² - (Σ Xi)²/n ] / (n - 1)

     where both sums run from i = 1 to n.

     Sample Standard Deviation: The sample standard deviation (s) is the square root of the sample
     variance, i.e.,

                     s = √s²

     Coefficient of Variation: The coefficient of variation (CV) is the standard deviation divided by the sample
     mean (Section 2.2.2), i.e., CV = s/x̄. The CV is often expressed as a percentage.

     Interquartile Range: Use the directions in Section 2.2.1 to compute the 25th and 75th percentiles of the
     data (y(25) and y(75) respectively). The interquartile range (IQR) is the difference between these values,

     IQR = y(75) - y(25).
                      Box 2-5: Example Calculations of the Measures of Dispersion

     In this box, the directions in Box 2-4 and the following 10 data points (in ppm): 4, 5, 6, 7, 4, 10, 4, 5, 7,
     and 8, are used to calculate the measures of dispersion. From Box 2-2, x̄ = 6 ppm.

     Sample Range: R = maximum - minimum = 10 - 4 = 6 ppm

     Sample Variance:

        s² = [ (4² + 5² + ... + 7² + 8²) - (4 + 5 + ... + 7 + 8)²/10 ] / (10 - 1)
           = [ 396 - 3600/10 ] / 9 = 36/9 = 4 ppm²

     Sample Standard Deviation: s = √s² = √4 = 2 ppm

     Coefficient of Variation: CV = s/x̄ = 2 ppm / 6 ppm ≈ 0.33, i.e., about 33%

     Interquartile Range: Using the directions in Section 2.2.1, y(25) = 4 ppm and y(75) = 7 ppm, so
     IQR = y(75) - y(25) = 7 - 4 = 3 ppm
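As a check on the Box 2-5 arithmetic, the same measures can be computed with the Python standard library; statistics.variance and statistics.stdev use the same (n - 1) denominator as the sample variance in Box 2-4:

```python
from statistics import stdev, variance

data = [4, 5, 6, 7, 4, 10, 4, 5, 7, 8]   # ppm, from Box 2-5

r = max(data) - min(data)                # sample range: 6 ppm
s2 = variance(data)                      # sample variance: 4 ppm^2
s = stdev(data)                          # sample standard deviation: 2 ppm
cv = s / (sum(data) / len(data))         # coefficient of variation: s divided by the mean

# The interquartile range under the Box 2-1 percentile convention is
# y(75) - y(25) = 7 - 4 = 3 ppm for these data.
print(r, s2, s, round(cv, 2))
```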
-------
variable decreases the other increases linearly.  Values of the correlation coefficient close to +1
(positive correlation) imply that as one variable increases so does the other; the reverse holds for
values close to -1.  A value of +1 implies a perfect positive linear correlation, i.e., all the data
pairs lie on a straight line with a positive slope.  A value of -1 implies perfect negative linear
correlation. Values close to 0 imply little correlation between the variables. Directions and an
example for calculating Pearson's correlation coefficient are contained in Box 2-6.
          Box 2-6: Directions for Calculating Pearson's Correlation Coefficient with an Example

    Let X1, X2, ..., Xn represent one variable of the n data points and let Y1, Y2, ..., Yn represent a second
    variable of the n data points. The Pearson correlation coefficient, r, between X and Y is computed by:

            r = [ Σ XiYi - (Σ Xi)(Σ Yi)/n ] / { [ Σ Xi² - (Σ Xi)²/n ] [ Σ Yi² - (Σ Yi)²/n ] }^(1/2)

    where all sums run from i = 1 to n.

    Example: Consider the following data set (in ppb): Sample 1 - arsenic (X) = 8.0, lead (Y) = 8.0;
    Sample 2 - arsenic = 6.0, lead = 7.0; Sample 3 - arsenic = 2.0, lead = 7.0; and Sample 4 - arsenic = 1.0,
    lead = 6.0.

       Σ Xi = 17,  Σ Yi = 28,  Σ Xi² = 105,  Σ Yi² = 198,  Σ XiYi = (8×8) + ... + (1×6) = 126.

    and  r = [ 126 - (17)(28)/4 ] / { [ 105 - (17)²/4 ] [ 198 - (28)²/4 ] }^(1/2) = 0.865

    Since r is close to 1, there is a strong linear relationship between these two contaminants.
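The Box 2-6 computational formula can be coded directly; this sketch reproduces the arsenic/lead result:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient via the computational formula in Box 2-6."""
    n = len(x)
    sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    sxx = sum(a * a for a in x) - sum(x) ** 2 / n
    syy = sum(b * b for b in y) - sum(y) ** 2 / n
    return sxy / math.sqrt(sxx * syy)

arsenic = [8.0, 6.0, 2.0, 1.0]   # ppb, X values from Box 2-6
lead = [8.0, 7.0, 7.0, 6.0]      # ppb, Y values from Box 2-6
print(round(pearson_r(arsenic, lead), 3))   # 0.865
```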
       The correlation coefficient does not detect nonlinear relationships so it should be used
only in conjunction with a scatter plot (Section 2.3.7.2).  A scatter plot can be used to determine
if the correlation coefficient is meaningful or if some measure of nonlinear relationships should be
used. The correlation coefficient can be significantly changed by extreme values so a scatter plot
should be used first to identify such values.

       An important property of the correlation coefficient is that it is unaffected by changes in
location of the data (adding or subtracting a constant from all of the X measurements and/or the
Y measurements) and by changes in scale of the data (multiplying all of the X and/or Y values by
a positive constant).  Thus linear transformations on the Xs and Ys do not affect the correlation
of the measurements.  This is reasonable since the correlation reflects the degree to which
linearity between the X and Y measurements occurs, and the degree of linearity is unaffected by
changes in location or scale. For


-------
example, if a variable was temperature in Celsius, then the correlation should not change if Celsius
was converted to Fahrenheit.
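This invariance can be demonstrated numerically. The sketch below uses illustrative values (not from the guidance), converts the X variable from Celsius to Fahrenheit, and confirms the correlation is unchanged:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation via the corrected-sums formula of Box 2-6."""
    n = len(x)
    sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    sxx = sum(a * a for a in x) - sum(x) ** 2 / n
    syy = sum(b * b for b in y) - sum(y) ** 2 / n
    return sxy / sqrt(sxx * syy)

celsius = [8.0, 6.0, 2.0, 1.0]        # treated as temperatures for illustration
response = [8.0, 7.0, 7.0, 6.0]
fahrenheit = [9 / 5 * c + 32 for c in celsius]   # a linear transformation

r_c = pearson(celsius, response)
r_f = pearson(fahrenheit, response)   # equal to r_c (up to rounding error)
```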

       On the other hand, if nonlinear transformations of the X and/or Y measurements are made,
then the Pearson correlation between the transformed values will differ from the correlation of the
original measurements. For example, if X and Y, respectively, represent PCB and dioxin
concentrations in soil, and x=log (X) and y=log(Y), then the Pearson correlations between X and
Y, X and y, x and Y, and x and y, will all be different, in general, since the logarithmic
transformation is a nonlinear transformation.

       Pearson's correlation may be sensitive to the presence of one or two extreme values,
especially when sample sizes are small. Such values may result in a high correlation, suggesting a
strong linear trend, when only moderate trend is present. This may happen, for instance, if a
single (X,Y) pair has very high values for both X and Y while the remaining data values are
uncorrelated. Extreme values may also lead to low correlations between X and Y, thus tending to
mask a strong linear trend.  This may happen if all the (X, Y) pairs except one (or two) tend to
cluster tightly about a straight line, and the exceptional point has a very large X value paired with
a moderate or small Y value (or vice versa). Because of the influences of extreme values, it is
wise to use a scatter plot (Section 2.3.7.2) in conjunction with a correlation coefficient.

       2.2.4.2        Spearman's Rank Correlation  Coefficient

       An alternative to the Pearson correlation is Spearman's rank correlation coefficient.  It is
calculated by first replacing each X value by its rank (i.e., 1 for the smallest X value, 2 for the
second smallest, etc.) and each Y value by its rank. These pairs of ranks are then treated as the
(X,Y) data and Spearman's rank correlation is calculated using the same formulae as for
Pearson's correlation (Box 2-6).  Directions and an example for calculating a correlation
coefficient are contained in  Box 2-7.

       Since meaningful (i.e., monotonic increasing) transformations of the data will not alter the
ranks of the respective variables (e.g., the ranks for log(X) will be the same as the ranks for X),
Spearman's  correlation will not be altered by nonlinear increasing transformations of the Xs or the
Ys. For instance, the Spearman correlation between PCB and dioxin concentrations (X and Y) in
soil will be the same  as the correlations between their logarithms (x and y).  This desirable
property and the fact that Spearman's correlation is less sensitive to extreme values than
Pearson's correlation make Spearman's correlation an attractive alternative to, or complement of,
Pearson's correlation coefficient. There are some important theoretical differences between
Pearson's and Spearman's correlation but a full discussion is beyond this guidance. In general,
Pearson's correlation has a higher statistical power than Spearman's, although the latter has some
more varied applications.
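The rank-then-correlate procedure can be sketched directly, reusing the arsenic/lead example data; the midrank rule for ties is the one described in Box 2-7:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation via the corrected-sums formula of Box 2-6."""
    n = len(x)
    sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    sxx = sum(a * a for a in x) - sum(x) ** 2 / n
    syy = sum(b * b for b in y) - sum(y) ** 2 / n
    return sxy / sqrt(sxx * syy)

def ranks(values):
    # Midranks: tied values share the average of the ranks they would occupy.
    return [sum(w < v for w in values) + (sum(w == v for w in values) + 1) / 2
            for v in values]

arsenic = [8.0, 6.0, 2.0, 1.0]   # ppb, from the Box 2-6 example
lead = [8.0, 7.0, 7.0, 6.0]

# Spearman's rank correlation = Pearson's correlation of the rank pairs.
rho = pearson(ranks(arsenic), ranks(lead))   # approximately 0.949
```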

       2.2.4.3        Serial Correlation Coefficient

       When data are truly independent, the correlation between data points is zero.  For a
sequence of data points taken serially in time, or one-by-one in a row, the serial  correlation
          Box 2-7:  Directions for Calculating Spearman's Correlation Coefficient with an Example

      Let X1, X2, ..., Xn represent the ranks of the n data points of one data set and let Y1, Y2, ..., Yn
      represent the ranks of a second variable for the same n data points.  The Spearman correlation
      coefficient, r, between X and Y is computed by:

          r = [ΣXiYi - (ΣXi)(ΣYi)/n] / {[ΣXi² - (ΣXi)²/n] [ΣYi² - (ΣYi)²/n]}^(1/2)          (15)

      (all sums over i = 1, ..., n)
     Example:  Consider the following data set (in ppb): Sample 1 — arsenic (X) = 8.0, lead (Y) = 8.0;
     Sample 2 - arsenic = 6.0, lead = 7.0; Sample 3 - arsenic = 2.0, lead = 7.0; and Sample 4 - arsenic = 1.0,
     lead = 6.0.
     Using arsenic, rank the data smallest to largest:

        Sample No.         4      3      2      1
        Arsenic           1.0    2.0    6.0    8.0
        Lead              6.0    7.0    7.0    8.0

     Convert the raw data to ranks, any ties being assigned the average of the ranks they would
     otherwise have received.

        Sample No.           4      3      2      1
        Arsenic Rank (X)     1      2      3      4
        Lead Rank (Y)        1     2.5    2.5     4

     Note how 7.0 (two lead observations) was converted to the average rank (i.e., ranks 2 and 3, therefore
     2.5 each).

        ΣXi = 10,   ΣYi = 10,   ΣXi² = 30,   ΣYi² = 29.5,   ΣXiYi = (1×1) + ... + (4×4) = 29.5,

     and  r = [29.5 - (10)(10)/4] / {[30 - (10)²/4] [29.5 - (10)²/4]}^(1/2) = 0.948

     Since r is close to 1, there is a strong monotonic relationship between these two contaminants.
coefficient can be calculated by replacing the sequencing variable by the numbers 1 through n and
calculating Pearson's correlation coefficient, with X being the actual data values and Y being the
numbers 1 through n.  For example, for a sequence of data collected at a waste site along a
straight transit line, the distances on the transit line of the data points are replaced by the numbers
1 through n, e.g., the first 10-foot sample point = 1, the 20-foot sample point = 2, the 30-foot sample
point = 3, etc., for samples taken at 10-foot intervals.  Directions for the Serial correlation
coefficient, along with an example, are given in Box 2-8.
                               Box 2-8:  Directions for Estimating the
                             Serial Correlation Coefficient with an Example
  Directions:
  Let X1, X2, ..., Xn represent the data values collected in sequence over equally spaced periods of
  time. Label the periods of time 1, 2, ..., n to match the data values. Use the directions in Box 2-6 to calculate
  the Pearson's Correlation Coefficient between the data X and the time periods Y.

 Example: The following are hourly readings from a discharge monitor. Notice how the actual 24-hour times are
 replaced by the numbers 1 through 13.
  Time         12:00  13:00  14:00  15:00  16:00  17:00  18:00  19:00  20:00  21:00  22:00  23:00  24:00
  Reading       6.5    6.6    6.7    6.4    6.3    6.4    6.2    6.2    6.3    6.6    6.8    6.9    7.0
  Time Period    1      2      3      4      5      6      7      8      9     10     11     12     13
 Using Box 2-6, with the readings being the X values and the Time Periods being the Y values gives a serial
 correlation coefficient of 0.432.
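The serial correlation in Box 2-8 can be reproduced with a short sketch (the formula is simply the Box 2-6 Pearson computation with the time periods as the second variable):

```python
from math import sqrt

# Hourly discharge-monitor readings from Box 2-8; the clock times are
# replaced by the period numbers 1 through 13.
readings = [6.5, 6.6, 6.7, 6.4, 6.3, 6.4, 6.2, 6.2, 6.3, 6.6, 6.8, 6.9, 7.0]
periods = list(range(1, len(readings) + 1))

n = len(readings)
sxy = sum(a * b for a, b in zip(readings, periods)) - sum(readings) * sum(periods) / n
sxx = sum(a * a for a in readings) - sum(readings) ** 2 / n
syy = sum(b * b for b in periods) - sum(periods) ** 2 / n

r = sxy / sqrt(sxx * syy)   # serial correlation, approximately 0.432
```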
2.3    GRAPHICAL REPRESENTATIONS
2.3.1  Histogram/Frequency Plots
       Two of the oldest methods for summarizing data distributions are the frequency plot
(Figure 2-1) and the histogram (Figure 2-2).
Both the histogram and the frequency plot use
the same basic principles to display the data:
dividing the data range into units, counting the
number of points within the units, and
displaying the data as the height or area within
a bar graph.  There are slight differences
between the histogram and the frequency plot.
In the frequency plot, the relative height of the
bars represents the relative density of the data.
In a histogram, the area within the bar
represents the relative density  of the data. The
difference between the two plots becomes
more distinct when unequal box sizes are used.
                                                 Figure 2-1. Example of a Frequency Plot
        The histogram and frequency plot
provide a means of assessing the symmetry and
variability of the data. If the data are symmetric,
then the structure of these plots will be
symmetric around a central point such as a
mean.  The histogram and frequency plots will
generally indicate if the data are skewed and the
direction of the skewness.

        Directions for generating a histogram
and a frequency plot are contained in Box 2-9
and an  example is contained in Box 2-10.  When
plotting a histogram for a continuous variable
(e.g., concentration),  it is necessary to decide on
an endpoint convention; that is, what to do with cases that fall on the boundary of a box. With
discrete variables (e.g., family size), the intervals can be centered on the values themselves.  For
the family size data, the intervals can span between 1.5 and 2.5, 2.5 and 3.5, and so on, so that the
whole numbers that relate to the family size can be centered within the box.  The visual
impression conveyed  by a histogram or a frequency plot can be quite sensitive to the choice of
interval width.  The choice of the number of intervals determines whether the histogram shows
more detail for small  sections of the data or whether the data will be displayed more simply as a
smooth overview of the distribution.
    Figure 2-2.  Example of a Histogram
                 Box 2-9:  Directions for Generating a Histogram and a Frequency Plot

    Let X1, X2, ..., Xn represent the n data points.  To develop a histogram or a frequency plot:

    STEP 1:  Select intervals that cover the range of observations. If possible, these intervals should have
             equal widths. A rule of thumb is to have between 7 and 11 intervals.  If necessary, specify an
             endpoint convention, i.e., what to do with cases that fall on interval endpoints.

    STEP 2:  Compute the number of observations within each interval. For a frequency plot with equal
             interval sizes, the number of observations represents the height of the boxes on the frequency
             plot.

    STEP 3:  Determine the  horizontal axis based on the range of the data. The vertical axis for a frequency
             plot is the number of observations. The vertical axis of the histogram is based on percentages.

    STEP 4:  For a histogram, compute the percentage of observations within each interval by dividing the
             number of observations within each interval (Step 2) by the total number of observations.

    STEP 5:  For a histogram, select a common  unit that corresponds to the x-axis.  Compute the number of
             common units  in each interval and  divide the percentage of observations within each interval
             (Step 4) by this number.  This step is only necessary when the intervals (Step 1) are not of
             equal widths.

    STEP 6:  Using  boxes, plot the intervals against the results of Step 5 for a histogram or the intervals
             against the number of observations in an interval (Step 2) for a frequency plot.

                 Box 2-10: Example of Generating a Histogram and a Frequency Plot

    Consider the following 22 samples of a contaminant concentration (in ppm): 17.7, 17.4, 22.8, 35.5, 28.6,
    17.2, 19.1, <4, 7.7, <4, 15.2, 14.7, 14.9, 10.9, 12.4, 12.4, 11.6, 14.7, 10.2, 5.2, 16.5, and 8.9.

    STEP 1:  This data spans 0 - 40 ppm. Equally sized intervals of 5 ppm will be used: 0 - 5 ppm; 5 - 10
             ppm; etc. The endpoint convention will be that values are placed in the highest interval
             containing the value. For example, a value of 5 ppm will be placed in the interval 5 - 10 ppm
             instead of 0 - 5 ppm.

    STEP 2:  The table below shows the number of observations within each interval defined in Step 1.

    STEP 3:  The horizontal axis for the data is from 0 to 40 ppm. The vertical axis for the frequency plot is
             from 0 - 10 and the vertical axis for the histogram is from 0% - 10%.

    STEP 4:  There are 22 observations total, so the number of observations shown in the table below will be
             divided by 22. The results are shown in column 3 of the table below.

    STEP 5:  A common unit for this data is 1 ppm. In each interval there are 5 common units so the
             percentage of observations (column 3 of the table below) should be divided by 5 (column 4).

    STEP 6:  The frequency plot is shown in Figure 2-1 and the histogram is shown in Figure 2-2.

                       # of Obs      % of Obs      % of Obs
        Interval       in Interval   in Interval   per ppm
        0 - 5 ppm           2            9.09         1.8
        5 - 10 ppm          3           13.64         2.7
        10 - 15 ppm         8           36.36         7.3
        15 - 20 ppm         6           27.27         5.5
        20 - 25 ppm         1            4.55         0.9
        25 - 30 ppm         1            4.55         0.9
        30 - 35 ppm         0            0.00         0.0
        35 - 40 ppm         1            4.55         0.9
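The binning in Steps 1 through 5 can be sketched in a few lines of Python. This is illustrative only; the two "<4" nondetects are represented by hypothetical numeric stand-ins (4.0 and 4.9) so that they fall in the 0 - 5 ppm interval:

```python
# 22 samples (ppm); 4.0 and 4.9 are assumed stand-ins for the two "<4" nondetects.
data = [17.7, 17.4, 22.8, 35.5, 28.6, 17.2, 19.1, 4.0, 7.7, 4.9, 15.2, 14.7,
        14.9, 10.9, 12.4, 12.4, 11.6, 14.7, 10.2, 5.2, 16.5, 8.9]

width = 5               # 5-ppm intervals covering 0 - 40 ppm
counts = [0] * 8
for v in data:
    # Endpoint convention: a value on a boundary goes in the higher interval.
    counts[int(v // width)] += 1

# Column 3: percent of observations per interval; column 4: percent per ppm.
pct = [round(100 * c / len(data), 2) for c in counts]
pct_per_ppm = [round(p / width, 1) for p in pct]
```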
2.3.2   Stem-and-Leaf Plot

        The stem-and-leaf plot is used to show both the numerical values themselves and
information about the distribution of the data. It is a useful method for storing data in a compact
form while, at the same time, sorting the data from smallest to largest. A stem-and-leaf plot can
be more useful in analyzing data than a histogram because it not only allows a visualization of the
data distribution, but enables the data to be reconstructed and lists the observations in the order of
magnitude.  However, the stem-and-leaf plot is one of the more subjective visualization
techniques because it requires the analyst to make some arbitrary choices regarding a partitioning
of the data.  Therefore, this technique may require some practice or trial and error before a useful
plot can be created.  As a result, the stem-and-leaf plot should only be used to develop a picture
of the data and its characteristics. Directions for constructing a stem-and-leaf plot are given in
Box 2-11 and an example is  contained in Box 2-12.

        Each observation in the stem-and-leaf plot consists of two parts: the stem of the
observation and the leaf.  The stem  is generally made up of the leading digit of the numerical
values while the leaf is made up of trailing digits in the order that corresponds to the order of
magnitude from left to right. The stem is displayed on the vertical axis and the data points make

                       Box 2-11: Directions for Generating a Stem and Leaf Plot

  Let X1, X2, ..., Xn represent the n data points. To develop a stem-and-leaf plot, complete the following steps:

  STEP 1:   Arrange the observations in ascending order.  The ordered data are usually labeled (from smallest
            to largest) X(1), X(2), ..., X(n).

  STEP 2:   Choose either one or more of the leading digits to be the stem values. As an example, for the
            value 16, 1 could be used as the stem as it is the leading digit.

  STEP 3:   List the stem values from smallest to largest at the left (along a vertical axis). Enter the leaf (the
            remaining digits) values in order from lowest to highest to the right of the stem.  Using the value
            16 as an example, if the 1 is the stem then the 6 will be the leaf.
                        Box 2-12: Example of Generating a Stem and Leaf Plot

  Consider the following 22 samples of trifluorine (in ppm): 17.7, 17.4, 22.8, 35.5, 28.6, 17.2, 19.1, <4, 7.7,
  <4, 15.2, 14.7, 14.9, 10.9, 12.4, 12.4, 11.6, 14.7, 10.2, 5.2, 16.5, and 8.9.

  STEP 1:   Arrange the observations in ascending order: <4, <4, 5.2, 7.7, 8.9, 10.2, 10.9, 11.6, 12.4, 12.4,
            14.7, 14.7,  14.9,  15.2, 16.5, 17.2, 17.4, 17.7, 19.1, 22.8, 28.6, 35.5.

  STEP 2:   Choose either one or more of the leading digits to be the stem values. For the above data, using
            the first digit as the stem does not provide enough detail for analysis. Therefore, the first digit
            will be  used as a stem; however, each stem will have two rows, one for the leaves 0-4, the other
            for the  leaves 5-9.

  STEP 3:   List the stem values at the left (along a vertical axis) from smallest to largest. Enter the leaf (the
            remaining digits) values in order from lowest to highest to the right of the stem.  The first digit of
            the data was used as the stem values; however, each stem value has two leaf rows.

                  0(0,1,2,3,4)   | <4   <4
                  0(5, 6, 7,  8,  9)   | 5.2 7.7 8.9
                  1 (0, 1, 2,  3,  4)   | 0.2 0.9 1.6 2.4 2.4 4.7 4.7 4.9
                  1 (5, 6, 7,  8,  9)   | 5.2 6.5 7.2 7.4 7.7 9.1
                  2(0,1,2,3,4)   | 2.8
                  2 (5, 6, 7,  8,  9)   | 8.6
                  3(0,1,2,3,4)   |
                  3(5,6,7,8,9)   | 5.5

  Note: If nondetects are present, place them first in the ordered list, using a symbol such as <4.
up the leaves. Changing the stem can be accomplished by increasing or decreasing the
digits that are used, dividing the groupings of one stem (i.e., all numbers which start with the
numeral 6 can be divided into smaller groupings), or multiplying the data by a constant factor (i.e.,
multiply the data by 10 or 100). Nondetects can be placed in a single stem.

       A stem-and-leaf plot roughly displays the distribution of the data. For example, the stem-
and-leaf plot of normally distributed data is approximately bell shaped.  Since the stem-and-leaf
roughly displays the distribution of the data, the plot may be used to evaluate whether the data are
skewed or symmetric.  The top half of the stem-and-leaf plot will be a mirror image of the bottom
half of the stem-and-leaf plot for symmetric data. Data that are skewed to the left will have the
bulk of data in the top of the plot and less data spread out over the bottom of the plot.

2.3.3  Box and Whisker Plot
       A box and whisker plot or box plot (Figure 2-3) is a schematic diagram useful for
visualizing important statistical quantities of the data.  Box plots are useful in situations where it is
not necessary or feasible to portray all the details of a distribution.  Directions for generating a
box and whiskers plot are contained in Box 2-13, and an example is contained in Box 2-14.
       A box and whiskers plot is composed of a central box divided by a
line and two lines extending out from the box called whiskers. The length of
the central box indicates the spread of the bulk of the data (the central 50%)
while the length of the whiskers show how stretched the tails of the
distribution are. The width of the box has no particular meaning; the plot
can be made quite narrow without affecting its visual impact. The sample
median is displayed as a line through the box and the sample mean is
displayed using a '+' sign. Any unusually small or large data points are
displayed by a '*' on the  plot.  A box and whiskers plot can be used to
assess the symmetry of the data.  If the distribution is symmetrical, then the
box is divided in two equal halves by the median, the whiskers will be the
same length and the number of extreme data points will be distributed
equally on either end of the plot.
                                                                         Figure 2-3.
                                                                         Example of a Box
                                                                         and Whisker Plot

2.3.4  Ranked Data Plot
       A ranked data plot is a useful graphical representation that is easy to
construct, easy to interpret, and makes no assumptions about a model for the data.  The analyst
does not have to make any arbitrary choices regarding the data to construct a ranked data plot
(such as cell sizes for a histogram).  In addition, a ranked data plot displays every data point;
therefore, it is a graphical representation of the data instead of a summary of the data. Directions
for developing a ranked data plot are given in Box 2-15 and an example is given in Box 2-16.
                      Box 2-13: Directions for Generating a Box and Whiskers Plot

  STEP 1:    Set the vertical scale of the plot based on the maximum and minimum values of the data set.
             Select a width for the box plot keeping in mind that the width is only a visualization tool. Label the
             width w; the horizontal scale then ranges from -1/2w to 1/2w.

  STEP 2:    Compute the upper quartile (Q(.75), the 75th percentile) and the lower quartile (Q(.25), the 25th
             percentile) using Box 2-1. Compute the sample mean and median using Box 2-2. Then, compute
             the interquartile range (IQR) where IQR = Q(.75) - Q(.25).

  STEP 3:    Draw a box through points ( -1/2w, Q(.75) ), ( -1/2w, Q(.25) ), ( 1/2w, Q(.25) ) and ( 1/2w, Q(.75) ).
             Draw a line from ( -1/2w, Q(.5) ) to ( 1/2w, Q(.5) ) and mark the point (0, mean) with (+).

  STEP 4:    Compute the upper end of the top whisker by finding the largest data value X less than
             Q(.75) + 1.5( Q(.75) - Q(.25) ).  Draw a line from (0, Q(.75)) to (0, X).

             Compute the lower end of the bottom whisker by finding the smallest data value Y greater than
             Q(.25) - 1.5( Q(.75) - Q(.25) ).  Draw a line from (0, Q(.25)) to (0, Y).

  STEP 5:    For all points X* > X,  place an asterisk (*) at the point (0, X*).

             For all points Y* < Y,  place an asterisk (*) at the point (0, Y*).
                             Box 2-14:  Example of a Box and Whiskers Plot

  Consider the following 22 samples of trifluorine (in ppm) listed  in order from smallest to largest:  4.0, 6.1, 9.8,
  10.7, 10.8, 11.5,  11.6, 12.4, 12.4, 14.6, 14.7, 14.7, 16.5, 17, 17.5, 20.6, 20.8, 25.7, 25.9, 26.5, 32.0, and 35.5.

  STEP 1:    The data ranges from 4.0 to 35.5 ppm.  This is the range of the vertical axis.  Arbitrarily, a width of
             4 will be used for the horizontal axis.

  STEP 2:    Using the formulas in Box 2-2, the sample mean = 16.87 and the median = 14.70. Using Box 2-1,
             Q(.75) = 20.8 and Q(.25) = 11.5. Therefore, IQR = 20.8 - 11.5 = 9.3.

  STEP 3:    In the figure, a box has been drawn through points (-2, 20.8),
             (-2, 11.5), ( 2, 11.5 ), (2, 20.8). A line has been drawn from (-2 ,
             14.7 ) to ( 2,  14.7 ), and the point (0, 16.87) has been marked with
             a '+' sign.

  STEP 4:    Q(.75) +  1.5(9.3) = 34.75. The closest data value to this number,
             but less than it, is 32.0.  Therefore, a line has been drawn in  the
             figure from ( 0, 20.8) to (0, 32.0).

             Q(.25) -1.5( 9.3 ) = -2.45.  The closest data value  to this number,
             but greater than it, is 4.0.  Therefore, a line has been drawn in the
             figure from ( 0, 4) to ( 0, 11.5).

  STEP 5:    There is only 1 data value greater than 32.0 which  is 35.5.
             Therefore, the point ( 0, 35.5) has been marked with an asterisk.
             There are no data values less than 4.0.
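The quantities in Box 2-14 can be recomputed with a short sketch. The quantile rule below is an assumption chosen to match the numbers Box 2-1 produces here (average the two middle order statistics when n·p is an integer; otherwise round n·p up):

```python
def quantile(xs, p):
    """p-th quantile, assumed convention: if n*p is an integer k, average the
    k-th and (k+1)-th order statistics; otherwise take the ceiling(n*p)-th."""
    xs = sorted(xs)
    np_ = len(xs) * p
    k = int(np_)
    if np_ == k:                       # n*p is an integer
        return (xs[k - 1] + xs[k]) / 2
    return xs[k]                       # ceiling(n*p)-th value (1-based)

# 22 trifluorine samples (ppm) from Box 2-14, already sorted.
data = [4.0, 6.1, 9.8, 10.7, 10.8, 11.5, 11.6, 12.4, 12.4, 14.6, 14.7,
        14.7, 16.5, 17.0, 17.5, 20.6, 20.8, 25.7, 25.9, 26.5, 32.0, 35.5]

q1, med, q3 = quantile(data, 0.25), quantile(data, 0.5), quantile(data, 0.75)
iqr = q3 - q1                                        # approximately 9.3
upper_fence = q3 + 1.5 * iqr                         # approximately 34.75
lower_fence = q1 - 1.5 * iqr                         # approximately -2.45
whisker_top = max(x for x in data if x <= upper_fence)
whisker_bottom = min(x for x in data if x >= lower_fence)
outliers = [x for x in data if x > upper_fence or x < lower_fence]
```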
                         Box 2-15:  Directions for Generating a Ranked Data Plot

                       Let X1, X2, ..., Xn represent the n data points. Let X(i), for i = 1 to
                       n, be the data listed in order from smallest to largest so that X(1)
                       (i = 1) is the smallest, X(2) (i = 2) is the second smallest, and X(n)
                       (i = n) is the largest. To generate a ranked data plot, plot the
                       ordered X values at equally spaced intervals along the horizontal
                       axis.
                    Box 2-16: Example of Generating a Ranked Data Plot

    Consider the following 22 samples of trifluorine (in ppm): 17.7, 17.4, 22.8, 35.5, 28.6, 17.2,
    19.1, 4.9, 7.7, 4.0, 15.2, 14.7, 14.9, 10.9, 12.4, 12.4, 11.6, 14.7, 10.2, 5.2, 16.5, and 8.9.  The
    data listed in order from smallest to largest X(i), along with the ordered number of the
    observation (i), are:

          i     X(i)         i     X(i)
          1     4.0         12     14.7
          2     4.9         13     14.9
          3     5.2         14     15.2
          4     7.7         15     16.5
          5     8.9         16     17.2
          6     10.2        17     17.4
          7     10.9        18     17.7
          8     11.6        19     19.1
          9     12.4        20     22.8
         10     12.4        21     28.6
         11     14.7        22     35.5

    A ranked data plot of this data is a plot of the pairs ( i, X(i) ). This plot is shown below:
    [Ranked data plot: the ordered data values (ppm), plotted from smallest to largest]
       A ranked data plot is a plot of the data from smallest to largest at evenly spaced intervals
(Figure 2-4). This graphical representation is very similar to the quantile plot described in Section
2.3.5.  A ranked data plot is marginally easier to generate than a quantile plot; however, a ranked
data plot does not contain as much information as a quantile plot.  Both plots can be used to
determine the density of the data points and the skewness of the data; however, a quantile plot
contains information on the quartiles of the data whereas a ranked data plot does not.
                     Figure 2-4. Example of a Ranked Data Plot

       A ranked data plot can be used to determine the density of the data values, i.e., if all the
data values are close to the center of the data with relatively few values in the tails or if there is a
large amount of values in one tail with the rest evenly distributed.  The density of the data is
displayed through the slope of the graph.  A region with many closely spaced data values has a flat
slope, i.e., the graph rises slowly.  A region with few data values has a steep slope, i.e., the graph rises quickly.
Thus the analyst can determine where the data lie, either evenly distributed or in large clusters of
points. In Figure 2-4, the data rises slowly up to a point where the slope increases and the graph
rises relatively quickly.  This means that there is a large amount of small data values and relatively
few large data values.

       A ranked data plot can be used to determine if the data are skewed or if they are
symmetric.  A ranked data plot of data that are skewed to the right extends more sharply at the
top giving the graph a convex shape.  A ranked data plot of data that are skewed to the left
increases sharply near the bottom giving the graph a concave shape.  If the data are symmetric,
then the top portion of the graph will stretch to the upper right corner in the same way the bottom
portion of the graph stretches to the lower left, creating an s-shape. Figure 2-4 shows a ranked data
plot of data that are skewed to the right.
2.3.5  Quantile Plot

       A quantile plot (Figure 2-5) is a graphical representation of the data that is easy to
construct, easy to interpret, and makes no assumptions about a model for the data. The analyst
does not have to make any arbitrary choices regarding the data to construct a quantile plot (such
as cell sizes for a histogram).  In addition, a quantile plot displays every data point; therefore, it is
a graphical representation of the data instead of a summary of the data.
                   Figure 2-5.  Example of a Quantile Plot of Skewed Data

       A quantile plot is a graph of the quantiles (Section 2.2.1) of the data.  The basic quantile
plot is visually identical to a ranked data plot except its horizontal axis varies from 0.0 to 1.0, with
each point plotted according to the  fraction of the points it exceeds.  This allows the addition of
vertical lines indicating the quartiles or any other quantiles of interest. Directions for developing
a quantile plot are given in Box 2-17 and an example is given in Box 2-18.

       A quantile plot can be used  to read the quantile information such as the median, quartiles,
and the interquartile range. In addition, the plot can be used to determine the density of the data
points, e.g.,  are all the data values close to the center with relatively few values in the tails or are
there a large amount of values in one tail with the rest evenly distributed? The density of the data
is displayed through the slope of the graph.  A region with many closely spaced data values has a
flat slope, i.e., the graph rises slowly.  A region with few data values has a steep slope, i.e., the graph rises
quickly.  A quantile plot can be used to determine if the data are skewed or if they are symmetric.
A quantile plot of data that are skewed to the right is steeper at the top right than the bottom left,
as in Figure 2-5.  A quantile plot of data that are skewed to the left increases sharply near the
bottom left of the graph. If the data are symmetric then the top portion of the graph will stretch
to the upper right corner in the same way the bottom portion of the graph stretches to the lower
left, creating an s-shape.
                          Box 2-17:  Directions for Generating a Quantile Plot

                Let X1, X2, ..., Xn represent the n data points. To obtain a quantile plot, let X(i),
                for i = 1 to n, be the data listed in order from smallest to largest so that X(1)
                (i = 1) is the smallest, X(2) (i = 2) is the second smallest, and X(n) (i = n) is the
                largest.  For each i, compute the fraction fi = (i - 0.5)/n.  The quantile plot is a
                plot of the pairs (fi, X(i)), with straight lines connecting consecutive points.
                           Box 2-18:  Example of Generating a Quantile Plot

     Consider the following 10 data points: 4 ppm, 5 ppm, 6 ppm, 7 ppm, 4 ppm, 10 ppm, 4 ppm, 5 ppm, 7
     ppm, and 8 ppm. The data ordered from smallest to largest, X(i), are shown in the first column of the
     table below and the ordered number for each observation, i, is shown in the second column. The third
     column displays the values fi for each i, where fi = (i - 0.5)/n.

          X(i)    i     fi          X(i)    i     fi
           4      1    0.05          6      6    0.55
           4      2    0.15          7      7    0.65
           4      3    0.25          7      8    0.75
           5      4    0.35          8      9    0.85
           5      5    0.45         10     10    0.95

     The pairs (fi, X(i)) are then plotted to yield the following quantile plot:
                         10
                       a.
                       a.
                       ro
                       •%
                       Q  6
                           0        0.2       0.4       0.6       0.8        1
                                        Fraction of Data (f-values)

     Note that the graph curves upward; therefore, the data appear to be skewed to the right.
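The computations of Boxes 2-17 and 2-18 are easy to script.  The following Python sketch (an illustration, not part of the original guidance) reproduces the ordered data and f-values of the example above:

```python
# Quantile plot pairs (f_i, X_(i)) per Box 2-17, using the Box 2-18 data (ppm).
data = [4, 5, 6, 7, 4, 10, 4, 5, 7, 8]

ordered = sorted(data)                         # X_(1) <= ... <= X_(n)
n = len(ordered)
f = [(i - 0.5) / n for i in range(1, n + 1)]   # f_i = (i - 0.5)/n
pairs = list(zip(f, ordered))                  # plot, joining consecutive points
```

The resulting pairs match the table above, e.g., (0.05, 4) for the smallest value and (0.95, 10) for the largest.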
2.3.6   Normal Probability Plot (Quantile-Quantile Plot)

        There are two types of quantile-quantile plots or q-q plots.  The first type, an empirical
quantile-quantile plot (Section 2.3.7.4), involves plotting the quantiles of two data variables
against each other. The second type of quantile-quantile plot, a theoretical quantile-quantile
plot, involves graphing the quantiles of a set of data against the quantiles of a specific distribution.
The following discussion will  focus on the most common of these plots for environmental data,
the normal probability plot (the normal q-q plot); however, the discussion holds for other q-q
plots.  The normal probability plot is used to roughly determine how well the data set is modeled
by a normal distribution. Formal tests are contained in Chapter 4, Section 2. Directions for
developing a normal probability plot are given in Box 2-19 and an example is given in Box 2-20.
A discussion of the normal distribution is contained in Section 2.4.

       A normal probability plot is the graph of the quantiles of a data set against the quantiles of
the normal distribution using normal probability graph paper (Figure 2-6). If the graph is linear,
the data may be normally distributed. If the graph is not linear, the departures from linearity give
important information about how the data distribution deviates from a normal  distribution.

       If the graph of the normal probability plot is not linear, the graph may be used to
determine the degree of symmetry (or asymmetry) displayed by the data. If the data in the upper
tail fall above and the data in the lower tail fall below the quartile line, the data are too slender to
be well modeled by a normal distribution, i.e., there are fewer values in the tails of the data set
than what is expected from a normal distribution.  If the data in the upper tail fall below and the
data in the lower tail fall above the quartile line, then the tails of the data are too heavy to be well
modeled using a normal distribution, i.e., there are more values in the tails of the data than what is
expected from a normal  distribution. A normal probability plot can be used to identify potential
outliers.  A data value (or a few data values) much larger or much smaller than the rest will cause
the other data values to be compressed into the middle of the graph, ruining the resolution.
                    Box 2-19: Directions for Constructing a Normal Probability Plot

      Let X1, X2, ..., Xn represent the n data points.

      STEP 1:  For each data value, compute the absolute frequency, AFi.  The absolute frequency is the
               number of times each value occurs.  For distinct values, the absolute frequency is 1.  For
               non-distinct observations, count the number of times an observation occurs.  For example,
               consider the data 1, 2, 3, 3.  The absolute frequency of value 1 is 1 and the absolute
               frequency of value 2 is 1.  The absolute frequency of value 3 is 2 since 3 appears 2 times
               in the data set.

      STEP 2:  Compute the cumulative frequencies, CFi.  The cumulative frequency is the number of data
               points that are less than or equal to Xi, i.e., CFi = AF1 + AF2 + ... + AFi.  Using the data
               given in Step 1, the cumulative frequency for value 1 is 1, the cumulative frequency for
               value 2 is 2 (1 + 1), and the cumulative frequency for value 3 is 4 (1 + 1 + 2).

      STEP 3:  Compute Yi = 100 x CFi / (n + 1) and plot the pairs (Yi, Xi) using normal probability paper
               (Figure 2-6).  If the graph of these pairs approximately forms a straight line, then the data
               are probably normally distributed.  Otherwise, the data may not be normally distributed.
                             Box 2-20: Example of Normal Probability Plot

  Consider the following 15 data points: 5, 5, 6, 6, 8, 8, 9, 10, 10, 10, 10, 10, 12, 14, and 15.

  STEP 1:   Because the value 5 appears 2 times, its absolute frequency is 2.  Similarly, the absolute frequency
            of 6 is 2, of 8 is 2, of 9 is 1, of 10 is 5, etc.  These values are shown in the second column of the
            table below.

  STEP 2:   The cumulative frequency of the data value 8 is 6 because there are 2 values of 5, 2 values of 6, and
            2 values of 8.  The cumulative frequencies are shown in the 3rd column of the table.

  STEP 3:   The values Yi = 100 x CFi/(n + 1) for each data point are shown in column 4 of the table below.  A
            plot of these pairs (Yi, Xi) using normal probability paper is also shown below.
       i    Individual    Absolute        Cumulative       Yi
               Xi         Frequency AFi   Frequency CFi
       1        5              2               2          12.50
       2        6              2               4          25.00
       3        8              2               6          37.50
       4        9              1               7          43.75
       5       10              5              12          75.00
       6       12              1              13          81.25
       7       14              1              14          87.50
       8       15              1              15          93.75
     [Normal probability plot of the pairs (Yi, Xi): X from about 4 to 16 on the vertical axis
     against Y from 20 to 80 (normal probability scale) on the horizontal axis.]
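The frequency computations of Boxes 2-19 and 2-20 can be sketched in Python (an illustration, not part of the original guidance); the variables mirror AFi, CFi, and Yi:

```python
from collections import Counter

# Steps 1-3 of Box 2-19 applied to the 15 values of Box 2-20:
# AF_i (absolute frequency), CF_i (cumulative frequency), Y_i = 100*CF_i/(n+1).
data = [5, 5, 6, 6, 8, 8, 9, 10, 10, 10, 10, 10, 12, 14, 15]
n = len(data)

af = Counter(data)                   # absolute frequency of each distinct value
table, cf = [], 0
for x in sorted(af):
    cf += af[x]                      # CF_i = AF_1 + ... + AF_i
    table.append((x, af[x], cf, 100 * cf / (n + 1)))
# each row is (X_i, AF_i, CF_i, Y_i); plot (Y_i, X_i) on normal probability paper
```

The rows of `table` reproduce the Box 2-20 table, e.g., (5, 2, 2, 12.50) and (10, 5, 12, 75.00).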
                          Figure 2-6.  Normal Probability Paper
2.3.7  Plots for Two or More Variables

       Data often consist of measurements of several characteristics (variables) for each sample
point in the data set.  For example, a data set may consist of measurements of weight, sex, and
age for each animal in a sample or may consist of daily temperature readings for several cities. In
this case, graphs may be used to compare and contrast different variables.  For example, the
analyst may wish to compare and contrast the temperature readings for different cities, or
different sample points (each containing several variables), such as the height, weight, and sex
across individuals in a study.

       To compare and contrast individual data points, some special plots have been developed
to display multiple variables.  These plots are discussed in Section 2.3.7.1. To compare and
contrast several variables, collections of the single variable displays  described in previous sections
are useful. For example, the analyst may generate box and whisker plots or histograms for each
variable using the same axis for all of the variables.  Separate plots for each variable may be
overlaid on one graph,  such as overlaying quantile plots for each variable on one graph.  Another
useful technique for comparing two variables is to place the  stem and leaf plots back to back.  In
addition, some special plots have been developed to display two or more variables. These plots
are described in Sections 2.3.7.2 through 2.3.7.4.

       2.3.7.1  Plots for Individual Data Points

       Since it is difficult to visualize data in more  than 2 or 3 dimensions, most of the plots
developed to display  multiple  variables  for individual data points involve representing each
variable as a distinct piece of a two-dimensional figure.  Some such  plots include Profiles, Glyphs,
and Stars (Figure 2-7).  These graphical representations start with a specific symbol to represent
each data point, then modify the various features  of the symbol in proportion to the magnitude of
each variable.  The proportion of the magnitude is determined by letting the minimum value for
each variable have length 0 and the maximum have length 1.  The remaining values of each variable
are then proportioned based on the magnitude of each value in relation to the maximum and
minimum.
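The proportioning just described is a min-max rescaling.  A minimal Python sketch (an illustration, not code from this guidance):

```python
# Rescale a variable so its minimum maps to 0 and its maximum to 1, as used to
# size the features of profile, glyph, and star plots (assumes max > min).
def rescale(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]
```

For example, `rescale([2, 4, 6])` returns `[0.0, 0.5, 1.0]`.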
                            Profile Plot
                                               Glyph Plot
                                                                 Star Plot
                      Figure 2-7.  Example of Graphical Representations of
                      Multiple Variables
       A profile plot starts with a line segment of a fixed length. Then lines spaced an equal
distance apart and extended perpendicular to the line segment represent each variable.  A glyph
plot uses a circle of fixed radius. From the perimeter, parallel rays whose sizes are proportional to
the magnitude of the variable extend from the top half of the circle.  A star plot starts with a point
where rays spaced evenly around the circle represent each variable and a polygon is then drawn
around the outside edge of the rays.
       2.3.7.2  Scatter Plot
       For data sets consisting of paired observations where two or more continuous variables
are measured for each sampling point, a scatter plot is one of the most powerful tools for
analyzing the relationship between two or more variables. Scatter plots are easy to construct for
two variables (Figure 2-8) and many computer graphics packages can construct 3-dimensional
scatter plots. Directions for constructing a scatter plot for two variables are given in Box 2-21
along with an example.
     [Scatter plot: PCE (ppb), 0 to 40, on the vertical axis against Chromium VI (ppb),
     0 to 8, on the horizontal axis.]
                         Figure 2-8.  Example of a Scatter Plot

       A scatter plot clearly shows the relationship between two variables.  Both potential
outliers from a single variable and potential outliers from the paired variables may be identified on
this plot. A scatter plot also displays the correlation between the two variables. Scatter plots of
highly linearly correlated variables cluster compactly  around a straight line.  In addition, nonlinear
patterns may be obvious on a scatter plot.  For example, consider two variables where one
variable is approximately equal to the square of the other.  A scatter plot of this data would
display a u-shaped (parabolic) curve.  Another important feature that can be detected using a
scatter plot is any clustering effect among the data.
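As a numerical companion to the visual impression of correlation, the sample (Pearson) correlation coefficient can be computed.  This Python sketch (an illustration, not part of the original guidance) uses the 25 PCE / Chromium VI pairs from Box 2-21:

```python
from math import sqrt

# Sample (Pearson) correlation of the paired PCE and Chromium VI readings
# plotted in Figure 2-8 (data from Box 2-21).
pce = [14.49, 37.21, 10.78, 18.62, 7.44, 37.84, 13.59, 4.31,
       2.23, 3.51, 6.42, 2.98, 3.04, 12.60, 3.56, 7.72,
       4.14, 3.26, 5.22, 4.02, 6.30, 8.22, 1.32, 7.73, 5.88]
cr = [3.76, 6.92, 1.05, 6.30, 1.43, 6.38, 5.07, 3.56,
      0.77, 1.24, 3.48, 1.02, 1.15, 5.44, 2.49, 3.01,
      2.36, 0.68, 0.65, 0.68, 1.93, 3.48, 2.73, 1.61, 1.42]

n = len(pce)
mx, my = sum(pce) / n, sum(cr) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(pce, cr))
sxx = sum((x - mx) ** 2 for x in pce)
syy = sum((y - my) ** 2 for y in cr)
r = sxy / sqrt(sxx * syy)   # Pearson correlation coefficient, -1 <= r <= 1
```

A value of r near +1 corresponds to points clustering compactly around an upward-sloping straight line, as in Figure 2-8.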
       2.3.7.3  Extensions of the Scatter Plot
       It is easy to construct a 2-dimensional scatter plot by hand and many software packages
can construct a useful 3-dimensional scatter plot. However, with more than 3 variables, it is
difficult to construct and interpret a scatter plot.  Therefore, several graphical representations
have been developed that extend the idea of a scatter plot for data consisting of 2 or more
variables.
Box 2-21: Directions for Generating a Scatter Plot and an Example
Let X1, X2, ..., Xn represent one variable of the n data points and let Y1, Y2, ..., Yn represent a second variable of
the n data points.  The paired data can be written as (Xi, Yi) for i = 1, ..., n.  To construct a scatter plot, plot the
first variable along the horizontal axis and the second variable along the vertical axis.  It does not matter which
variable is placed on which axis.
Example: A scatter plot will be developed for the data below.  PCE values are displayed on the vertical axis
and Chromium VI values are displayed on the horizontal axis of Figure 2-8.

      PCE      Chromium        PCE      Chromium        PCE      Chromium
     (ppb)     VI (ppb)       (ppb)     VI (ppb)       (ppb)     VI (ppb)
     14.49       3.76          2.23       0.77          4.14       2.36
     37.21       6.92          3.51       1.24          3.26       0.68
     10.78       1.05          6.42       3.48          5.22       0.65
     18.62       6.30          2.98       1.02          4.02       0.68
      7.44       1.43          3.04       1.15          6.30       1.93
     37.84       6.38         12.60       5.44          8.22       3.48
     13.59       5.07          3.56       2.49          1.32       2.73
      4.31       3.56          7.72       3.01          7.73       1.61
                                                        5.88       1.42
       The simplest of these graphical representations is a coded scatter plot.  In this case, all
possible pairs of data are given a code and plotted on one scatter plot.  For example, consider a
data set of 3 variables: variable A, variable B, and variable C.  Using the first variable to
designate the horizontal axis, the analyst may choose to display the pairs (A, B) using an X, the
pairs (A, C) using a Y, and the pairs (B, C) using a Z on one scatter plot.  All of the information
described above for a scatter plot is also available on a coded scatter plot.  However, this method
assumes that the ranges of the three variables are comparable and does not provide information
on three-way or higher interactions between the variables.  An example of a coded scatter plot is
given in Figure 2-9.

     [Coded scatter plot with three symbol codes: Chromium vs. PCE, Atrazine vs. PCE, and
     Atrazine vs. Chromium VI.]

                                     Figure 2-9. Example of a Coded Scatter Plot
       A parallel coordinate plot also extends the idea of a scatter plot to higher dimensions. The
parallel coordinates method employs a scheme where coordinate axes are drawn in parallel
(instead of perpendicular).  Consider a sample point X consisting of values X1 for variable 1, X2
for variable 2, and so on up to Xp for variable p.  A parallel coordinate plot is constructed by
placing an axis for each of the p variables parallel to each other and plotting X1 on axis 1, X2 on
axis 2, and so on through Xp on axis p and joining these points with a broken line.  This method
contains all of the information available on a scatter plot in addition to information on 3-way
and higher interactions (e.g., clustering among three variables).  However, for p variables one
must construct (p+1)/2 parallel coordinate plots in order to display all possible pairs of
variables.  An example of a parallel coordinate plot is given in Figure 2-10.

       A scatter plot matrix is
another useful method of
extending scatter plots to higher
dimensions.  In this case, a scatter plot is developed for all possible pairs of the variables which
are then displayed in a matrix format.  This method is easy to implement and provides a concise
method of displaying the individual scatter plots. However, this method does not contain
information on 3-way or higher interactions between variables. An example of a scatter plot
matrix is contained in Figure 2-11.
Figure 2-10. Example of a Parallel Coordinates Plot
     [Matrix of scatter plots for each pair of the variables Chromium VI (ppb), Atrazine (ppb),
     and PCE (ppb).]
              Figure 2-11. Example of a Matrix Scatter Plot
2.3.7.4        Empirical Quantile-Quantile Plot

       An empirical quantile-quantile (q-q) plot involves plotting the quantiles (Section 2.2.1) of
two data variables against each other. This plot is used to compare distributions of two or more
variables;  for example, the analyst may wish to compare the distribution of lead and iron samples
from a drinking water well.  This plot is similar in concept to the theoretical quantile-quantile plot
and yields similar information in regard to the distribution of two variables instead of the
distribution of one variable in relation to a fixed distribution. Directions for constructing an
empirical q-q plot with an example are given in Box 2-22.

       An empirical q-q plot is the graph of the quantiles of one variable of a data set against the
quantiles of another variable of the data set.  This plot is used to determine how well the
distributions of the two variables match.  If the distributions are roughly the same, the graph is
linear or close to linear.  If the distributions are not the same, then the graph is not linear.  Even if
the graph is not linear, the departures from  linearity give important information about how the
two data distributions differ. For example, a  q-q plot can be used to compare the tails of the two
data distributions in the same manner a normal probability plot was used to compare the tails of
the data to the tails of a normal distribution. In addition, potential outliers (from the paired data)
may be identified on this graph.

2.3.8   Plots for Temporal Data

       Data collected over specific time intervals (e.g., monthly, biweekly, or hourly) have a
temporal component. For example, air monitoring measurements of a pollutant may be collected
once a minute or once a day; water quality monitoring measurements of a contaminant level may
be collected weekly or monthly. An analyst examining temporal data may be interested in the
trends over time, correlation among time periods, and  cyclical patterns.  Some graphical
representations specific to temporal data are the time plot, correlogram, and variogram.

       Data collected at regular time intervals are called time series.  Time series data may be
analyzed using Box-Jenkins modeling and spectral analysis. Both of these methods require a large
amount of data collected at regular intervals and are beyond the  scope of this guidance. It is
recommended that the interested reader consult a statistician.

       The graphical representations presented in this section are recommended for all data that
have a temporal component regardless of whether formal statistical time series  analysis will be
used to analyze the data. If the analyst uses a time series  methodology, the graphical
representations presented below will play an important role in this analysis. If the analyst decides
not to use time series methodologies, the graphical representations described below will help
identify temporal patterns that need to be accounted for in the analysis of the data.
               Box 2-22:  Directions for Constructing an Empirical Q-Q Plot with an Example

     Let X1, X2, ..., Xn represent n data points of one variable and let Y1, Y2, ..., Ym represent a second variable
     of m data points.  Let X(i), for i = 1 to n, be the first variable listed in order from smallest to largest so that
     X(1) (i = 1) is the smallest, X(2) (i = 2) is the second smallest, and X(n) (i = n) is the largest.  Let Y(i), for i
     = 1 to m, be the second variable listed in order from smallest to largest so that Y(1) (i = 1) is the smallest,
     Y(2) (i = 2) is the second smallest, and Y(m) (i = m) is the largest.

     If m = n:  If the two variables have the same number of observations, then an empirical q-q plot of the
     two variables is simply a plot of the ordered values of the variables.  Since n = m, replace m by n.  A plot
     of the pairs (X(1), Y(1)), (X(2), Y(2)), ..., (X(n), Y(n)) is an empirical quantile-quantile plot.

     If n > m:  If the two variables have a different number of observations, then the empirical quantile-quantile
     plot will consist of m (the smaller number) pairs.  The empirical q-q plot will then be a plot of the ordered
     Y values against the interpolated X values.  For i = 1, i = 2, ..., i = m, let v = (n/m)(i - 0.5) + 0.5 and
     separate the result into the integer part and the fractional part, i.e., let v = j + g where j is the integer part
     and g is the fractional part.  If g = 0, plot the pair (Y(i), X(j)).  Otherwise, plot the pair (Y(i), (1-g)X(j) +
     gX(j+1)).  A plot of these pairs is an empirical quantile-quantile plot.

     Example:  Consider two sets of contaminant readings from two separate drinking water wells at the same
     site.  The data from well 1 are: 1.32, 3.26, 3.56, 4.02, 4.14, 5.22, 6.30, 7.72, 7.73, and 8.22.  The data
     from well 2 are: 0.65, 0.68, 0.68, 1.42, 1.61, 1.93, 2.36, 2.49, 2.73, 3.01, 3.48, and 5.44.  An empirical
     q-q plot will be used to compare the distributions of these two wells.  Since there are 10 observations in
     well 1, and 12 observations in well 2, the case for n > m will be used.  Therefore, for i = 1, 2, ..., 10,
     compute:

     i = 1:  v = (12/10)(1 - 0.5) + 0.5 = 1.1, so j = 1 and g = 0.1.  Since g ≠ 0,
             plot (1.32, (0.9)(0.65) + (0.1)(0.68)) = (1.32, 0.653).

     i = 2:  v = (12/10)(2 - 0.5) + 0.5 = 2.3, so j = 2 and g = 0.3.  Since g ≠ 0,
             plot (3.26, (0.7)(0.68) + (0.3)(0.68)) = (3.26, 0.68).

     i = 3:  v = (12/10)(3 - 0.5) + 0.5 = 3.5, so j = 3 and g = 0.5.  Since g ≠ 0,
             plot (3.56, (0.5)(0.68) + (0.5)(1.42)) = (3.56, 1.05).

     Continue this process for i = 4, 5, 6, 7, 8, 9, and 10 to yield the following 10 data pairs: (1.32, 0.653),
     (3.26, 0.68), (3.56, 1.05), (4.02, 1.553), (4.14, 1.898), (5.22, 2.373), (6.30, 2.562), (7.72, 2.87), (7.73,
     3.339), and (8.22, 5.244).  These pairs are plotted below, along with the best fitting regression line.
     [Empirical q-q plot: Quantiles of Well 2 on the vertical axis against Quantiles of Well 1,
     0 to 10, on the horizontal axis, with the best fitting regression line.]
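The interpolation rule of Box 2-22 can be sketched in Python (an illustration, not part of the original guidance), reproducing the ten pairs computed in the example:

```python
# Empirical q-q pairs for the n > m case of Box 2-22: well 1 (m = 10 ordered
# values, plotted as Y) against interpolated quantiles of well 2 (n = 12, as X).
well1 = sorted([1.32, 3.26, 3.56, 4.02, 4.14, 5.22, 6.30, 7.72, 7.73, 8.22])
well2 = sorted([0.65, 0.68, 0.68, 1.42, 1.61, 1.93,
                2.36, 2.49, 2.73, 3.01, 3.48, 5.44])

n, m = len(well2), len(well1)
pairs = []
for i in range(1, m + 1):
    v = (n / m) * (i - 0.5) + 0.5
    j, g = int(v), v - int(v)                        # integer, fractional parts
    if g == 0:
        x = well2[j - 1]                             # exactly the j-th ordered value
    else:
        x = (1 - g) * well2[j - 1] + g * well2[j]    # linear interpolation
    pairs.append((well1[i - 1], x))
```

The first and last pairs come out as (1.32, 0.653) and (8.22, 5.244), matching the worked example.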
       The analyst examining temporal environmental data may be interested in seasonal trends,
directional trends, serial correlation, and stationarity.  Seasonal trends are patterns in the data that
repeat over time, i.e., the data rise and fall regularly over one or more time periods.  Seasonal
trends may be large scale, such as a yearly trend where the data show the same pattern of rising
and falling over each year, or the trends may be small scale, such as a daily trend where the data
show the same pattern for each day.  Directional trends are downward or upward trends in the
data, which are of importance in environmental applications where contaminant levels may be
increasing or decreasing.  Serial correlation is a measure of the extent to which successive
observations are related. If successive observations are related, statistical quantities  calculated
without accounting for serial correlation may be biased. Finally, another item of interest for
temporal data is stationarity (cyclical patterns). Stationary data look  the same over all time
periods. Directional trends and increasing (or decreasing) variability  among the data imply that
the data are not stationary.

       Temporal data are sometimes used in environmental applications in conjunction with a
statistical hypothesis test to determine if contaminant levels have changed.  If the hypothesis test
does not account for temporal trends or seasonal variations, the data  must achieve a "steady state"
before the hypothesis test may be performed. Therefore, the data must be essentially the same for
comparable periods of time both before and after the hypothesized time of change.

       Sometimes multiple observations are taken in each time period. For example, the
sampling design may specify selecting 5 samples every Monday for 3 months. If this is the case,
the time plot described in Section 2.3.8.1 may be used to display the data, display the mean
weekly level, display a confidence interval for each mean, or display a confidence interval for each
mean with the individual data values. A time plot of all the data can be used to determine if the
variability for the different time periods changes.  A time plot of the means can be used to
determine if the means are possibly changing between time periods. In addition, each time period
may be treated as a distinct variable and the methods of Section 2.3.7 may be applied.

       2.3.8.1        Time Plot

       One of the simplest plots to generate that provides a large amount of information is a time
plot.  A time plot is a plot of the data over time.  This plot makes it easy to identify large-scale
and small-scale trends over time. Small-scale trends show up on a time plot as fluctuations in
smaller time periods. For example, ozone levels over the course of one day typically rise until the
afternoon, then decrease, and this process is repeated every day. Larger scale trends, such as
seasonal fluctuations, appear as regular rises and drops in the graph.  For example, ozone levels
tend to be higher in the summer than in the winter so ozone data tend to show both a daily trend
and a seasonal trend. A time plot can also show directional trends  and increased variability over
time.  Possible outliers may also be easily identified using a time plot.

       A time plot (Figure 2-12) is constructed by numbering the observations in order by time.
The time ordering is plotted on the horizontal axis and the corresponding observation is plotted
on the vertical axis.  The points plotted on a time plot may be joined by lines; however, it is

recommended that the plotted points not be connected to avoid creating a false sense of
continuity.  The scaling of the vertical axis of a time plot is of some importance. A wider scale
tends to emphasize large-scale trends, whereas a smaller scale tends to emphasize small-scale
trends. Using the ozone example above, a wide scale would emphasize the seasonal  component
of the data, whereas a smaller scale would tend to emphasize the daily fluctuations. Directions for
constructing a time plot are contained in Box 2-23 along with an example.
     [Time plot: Data Values, 0 to 20, on the vertical axis against Time, 0 to 50, on the
     horizontal axis.]
           Figure 2-12.  Example of a Time Plot Showing a Slight Downward Trend
                   Box 2-23: Directions for Generating a Time Plot and an Example

  Let X1, X2, ..., Xn represent n data points listed in order by time, i.e., the subscript represents the ordered time
  interval.  A plot of the pairs (i, Xi) is a time plot of this data.

  Example:  Consider the following 50 daily observations (listed in order by day):  10.05, 11.22, 15.9, 11.15,
  10.53, 13.33, 11.81, 14.78, 10.93, 10.31, 7.95, 10.11, 10.27, 14.25, 8.6, 9.18, 12.2, 9.52, 7.59, 10.33, 12.13,
  11.31, 10.13, 7.11, 6.72, 8.97, 10.11, 7.72, 9.57, 6.23, 7.25, 8.89, 9.14, 12.34, 9.99, 11.26, 5.57, 9.55, 8.91,
  7.11, 6.04, 8.67, 5.62, 5.99, 5.78, 8.66, 5.8, 6.9, 7.7, 8.87.  By labeling day 1 as 1, day 2 as 2, and so on, a
  time plot is constructed by plotting the pairs (i, Xi) where i represents the number of the day and Xi represents
  the concentration level.  A time plot of this data is shown in Figure 2-12.
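A least-squares slope over the day index gives a quick numerical check of the slight downward trend visible in Figure 2-12.  This Python sketch (an illustration, not part of the original guidance) uses the 50 observations of Box 2-23:

```python
# Fit a least-squares line to the time plot pairs (i, X_i); a negative slope
# indicates a downward directional trend (data from Box 2-23).
data = [10.05, 11.22, 15.9, 11.15, 10.53, 13.33, 11.81, 14.78, 10.93, 10.31,
        7.95, 10.11, 10.27, 14.25, 8.6, 9.18, 12.2, 9.52, 7.59, 10.33,
        12.13, 11.31, 10.13, 7.11, 6.72, 8.97, 10.11, 7.72, 9.57, 6.23,
        7.25, 8.89, 9.14, 12.34, 9.99, 11.26, 5.57, 9.55, 8.91, 7.11,
        6.04, 8.67, 5.62, 5.99, 5.78, 8.66, 5.8, 6.9, 7.7, 8.87]

n = len(data)
t = list(range(1, n + 1))                  # day index i for the pairs (i, X_i)
tbar, xbar = sum(t) / n, sum(data) / n
slope = (sum((ti - tbar) * (xi - xbar) for ti, xi in zip(t, data))
         / sum((ti - tbar) ** 2 for ti in t))
```

The slope comes out negative, consistent with the slight downward trend shown in Figure 2-12.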
       2.3.8.2        Plot of the Autocorrelation Function (Correlogram)
       Serial correlation is a measure of the extent to which successive observations are related.
If successive observations are related, either the data must be transformed or this relationship
must be accounted for in the analysis of the data. The correlogram is a plot that is used to display
serial correlation when the data are collected at equally spaced time intervals.  The autocorrelation
function is a summary of the serial correlations of data.  The 1st autocorrelation coefficient (r1) is
the correlation between points that are 1 time unit apart (k = 1); the 2nd autocorrelation coefficient
(r2) is the correlation between points that are 2 time units apart (k = 2); etc.  A correlogram (Figure
2-13) is a plot of the sample autocorrelation coefficients in which the values of k versus the values
of rk are displayed. Directions for constructing a correlogram are contained in Box 2-24; example
calculations are contained in Box 2-25.  For large sample sizes, a correlogram is tedious to
construct by hand; therefore, software like Data Quality Evaluation Statistical Tools
(DataQUEST) (G-9D) should be used.

     [Correlogram: rk, from -0.5 to 1, on the vertical axis against k, 0 to 30, on the horizontal
     axis, with dashed horizontal lines at ±2/√n.]
                      Figure 2-13.  Example of a Correlogram
                        Box 2-24: Directions for Constructing a Correlogram

  Let X1, X2, ..., Xn represent the data points ordered by time for equally spaced time points, i.e., X1 was collected
  at time 1, X2 was collected at time 2, and so on.  To construct a correlogram, first compute the sample
  autocorrelation coefficients.  So for k = 0, 1, ..., compute rk where

                rk = gk / g0   and   gk = (1/n) x [sum from t = k+1 to n of (Xt - Xbar)(Xt-k - Xbar)].

  Once the rk have been computed, a correlogram is the graph (k, rk) for k = 0, 1, ..., and so on.  As an
  approximation, compute up to approximately k = n/6.  Also, note that r0 = 1.  Finally, place horizontal lines at
  ±2/√n.
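The autocorrelation formulas of Box 2-24 can be sketched in Python (an illustration, not part of the original guidance), using the four hourly values of Box 2-25:

```python
# Sample autocorrelation coefficients r_k = g_k / g_0 per Box 2-24,
# using the four hourly data points of Box 2-25 (sample mean is 3.0).
data = [4.5, 3.5, 2.5, 1.5]
n = len(data)
xbar = sum(data) / n

def g(k):
    # g_k = (1/n) * sum over t = k+1..n of (X_t - xbar)(X_{t-k} - xbar)
    return sum((data[t] - xbar) * (data[t - k] - xbar)
               for t in range(k, n)) / n

r = [g(k) / g(0) for k in range(n)]   # r_0 = 1 by construction
```

For these data the coefficients are r0 = 1, r1 = 0.25, r2 = -0.3, and r3 = -0.45.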
       The correlogram is used for modeling time series data and may be used to determine if
serial correlation is large enough to create problems in the analysis of temporal data using other
methodologies besides formal time series methodologies.  A quick method for determining if serial
correlation is large is to place horizontal lines at ±2/√n on the correlogram (shown as dashed lines
on Figure 2-13).  Autocorrelation coefficients that exceed this value require further investigation.
       In application, the correlogram is only useful for data at equally spaced intervals.  To relax
this restriction, a variogram may be used instead.  The variogram displays the same information as
a correlogram except that the data may be based on unequally spaced time intervals.  For more
information on the construction and uses of the variogram, consult a statistician.
       2.3.8.3        Multiple Observations Per Time Period

       Sometimes in environmental data collection, multiple observations are taken for each time
period.  For example, the data collection design may specify collecting and analyzing 5 samples
from a drinking water well every Wednesday for three months. In this case, the time plot
described in Section 2.3.8.1 may be used to display the data, display the mean weekly level,
display a confidence interval for each mean, or display a confidence interval for each mean with
the individual data values.  A time plot of all the data will allow the analyst to determine whether the
variability differs among the collection periods. A time plot of the means will allow the
analyst to determine whether the means may be changing between the collection periods. In
addition, each collection period may be treated as a distinct variable and the methods described in
Section 2.3.7 may be applied.
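Computing the per-period mean and a confidence interval for it can be sketched as below. This is an editor's sketch, not part of the guidance; the sample values are hypothetical, and the critical value 2.776 (t for 95% confidence with 4 degrees of freedom, matching 5 samples per period) would be looked up in a t-table for other sample sizes:

```python
import statistics

def mean_and_ci(samples, t_crit=2.776):
    # t_crit = t_{0.975, n-1}; 2.776 corresponds to n = 5 samples per period
    n = len(samples)
    m = statistics.mean(samples)
    half_width = t_crit * statistics.stdev(samples) / n ** 0.5
    return m, m - half_width, m + half_width

# Hypothetical weekly well measurements (5 samples each Wednesday):
weeks = [
    [4.2, 4.8, 5.1, 4.5, 4.9],
    [5.0, 5.4, 4.7, 5.2, 5.6],
]
for i, obs in enumerate(weeks, start=1):
    m, lo, hi = mean_and_ci(obs)
    print(f"week {i}: mean = {m:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

A time plot of the weekly means with these intervals is one of the displays described above.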
                    Box 2-25: Example Calculations for Generating a Correlogram

  A correlogram will be constructed using the following four hourly data points:  hour 1: 4.5, hour 2: 3.5, hour 3:
  2.5, and hour 4: 1.5. Only four data points are used so that all computations may be shown. Therefore, the
  guideline that no more than n/6 autocorrelation coefficients should be computed will be broken for illustrative
  purposes.  The first step in constructing a correlogram is to compute the sample mean (Box 2-2), which is 3 for
  the 4 points.  Then,

       g0 = (1/4) Σ (yt - 3)²  =  [(4.5-3)² + (3.5-3)² + (2.5-3)² + (1.5-3)²] / 4  =  5/4  =  1.25

       g1 = [(3.5-3)(4.5-3) + (2.5-3)(3.5-3) + (1.5-3)(2.5-3)] / 4  =  1.25/4  =  0.3125

       g2 = [(2.5-3)(4.5-3) + (1.5-3)(3.5-3)] / 4  =  -1.5/4  =  -0.375

       g3 = [(1.5-3)(4.5-3)] / 4  =  -2.25/4  =  -0.5625
               Box 2-25:  Example Calculations for Generating a Correlogram - Continued

  So r1 = g1/g0 = 0.3125/1.25 = 0.25,  r2 = g2/g0 = -0.375/1.25 = -0.3,  and  r3 = g3/g0 = -0.5625/1.25 = -0.45.

  Remember r0 = 1.  Thus, the correlogram of these data is a plot of (0, 1), (1, 0.25), (2, -0.3) and (3, -0.45) with
  two horizontal lines at ±2/√4 (±1.0). This graph is shown below.

  In this case, it appears that the observations are not serially correlated because all of the correlogram points
  are within the bounds of ±2/√4 (±1.0).  In Figure 2-13, if k represents months, then the correlogram shows a
  yearly correlation between data points since the points at k = 12 and k = 24 are outside the bounds of ±2/√n. This
  correlation will need to be accounted for when the data are analyzed.
  [Plot of the example correlogram: the points (0, 1), (1, 0.25), (2, -0.3), and (3, -0.45) against k
  (hours), with horizontal reference lines at +1.0 and -1.0.]
2.3.9  Plots for Spatial Data

       The graphical representations of the preceding sections may be useful for exploring spatial
data. However, an analyst examining spatial data may be interested in the location of extreme
values, overall spatial trends, and the degree of continuity among neighboring locations.
Graphical representations for spatial data include postings,  symbol plots, correlograms, h-scatter
plots, and contour plots.
       The graphical representations presented in this section are recommended for all spatial
data regardless of whether or not geostatistical methods will be used to analyze the data.  The
graphical representations described below will help identify spatial patterns that need to be
accounted for in the analysis of the data.  If the analyst uses geostatistical methods such as kriging
to analyze the data, the graphical representations presented below will play an important role in
geostatistical analysis.
       2.3.9.1        Posting Plots
       A posting plot (Figure 2-14) is a map of data locations along with corresponding data
values.  Data posting may reveal obvious errors in data location and identify data values that may
be in error.  The graph of the sampling locations gives the analyst an idea of how the data were
collected (i.e., the sampling design), areas that may have been inaccessible, and areas of special
interest to the decision maker which may have been heavily sampled. It is often useful to mark
the highest and lowest values of the data to see if there are any obvious trends. If all of the
highest concentrations fall in one region of the plot, the analyst may consider some method such
as post-stratifying the data (stratification after the data are collected and analyzed) to account for
this fact in the analysis. Directions for generating a posting of the data (a posting plot) are
contained in Box 2-26.
  [Figure 2-14 posts each of the example concentration values at its sampling location on a map of the site.]

                                   Figure 2-14. Example of a Posting Plot

       2.3.9.2        Symbol Plots
       For large amounts of data, a posting plot may not be feasible and a symbol plot (Figure 2-
15) may be used.  A symbol plot is basically the same as a posting plot of the data, except that
instead of posting individual data values, symbols are posted for ranges of the data values. For
example, the symbol '0' could represent all concentration levels less than 100 ppm, the symbol '1'
could represent all concentration levels between 100 ppm and 200 ppm, etc. Directions for
generating a symbol plot are contained in Box 2-26.
  [Figure 2-15 posts the symbol code (0-7) for each example value at its sampling location.]

                       Figure 2-15. Example of a Symbol Plot
               Box 2-26: Directions for Generating a Posting Plot and a Symbol Plot
                                        with an Example

  On a map of the site, plot the location of each sample. At each location, either indicate the value of the
  data point (a posting plot) or indicate by an appropriate symbol (a symbol plot) the data range within
  which the value of the data point falls for that location, using one unique symbol per data range.

  Example: The spatial data displayed in the table below contain both a location (Northing and Easting)
  and a concentration level ([c]). The data range from 4.0 to 35.5, so units of 5 were chosen to group the
  data:

          Range       Symbol           Range       Symbol
       0.0 -  4.9        0          20.0 - 24.9       4
       5.0 -  9.9        1          25.0 - 29.9       5
      10.0 - 14.9        2          30.0 - 34.9       6
      15.0 - 19.9        3          35.0 - 39.9       7

  The data values with corresponding symbols then become:

      Northing   Easting    [c]    Symbol       Northing   Easting    [c]    Symbol
        25.0        0.0      4.0      0           15.0       15.0     16.5      3
        25.0        5.0     11.6      2           10.0        0.0      8.9      1
        25.0       10.0     14.9      2           10.0        5.0     14.7      2
        25.0       15.0     17.4      3           10.0       10.0     10.9      2
        20.0        0.0     17.7      3           10.0       15.0     12.4      2
        20.0        5.0     12.4      2            5.0        0.0     22.8      4
        20.0       10.0     28.6      5            5.0        5.0     19.1      3
        20.0       15.0      7.7      1            5.0       10.0     10.2      2
        15.0        0.0     15.2      3            5.0       15.0      5.2      1
        15.0        5.0     35.5      7            0.0        5.0      4.9      0
        15.0       10.0     14.7      2            0.0       15.0     17.2      3

  The posting plot of this data is displayed in Figure 2-14 and the symbol plot is displayed in Figure 2-15.
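The grouping rule in Box 2-26 (ranges in units of 5, each mapped to one symbol) can be sketched as a one-line function. This is an editor's illustration, not part of the guidance; the function name and the default bin width are assumptions matching the example:

```python
def symbol_for(concentration, bin_width=5.0):
    # Box 2-26 groups the data in units of 5: 0.0-4.9 -> 0, 5.0-9.9 -> 1, ...
    return int(concentration // bin_width)

# A few of the example values from Box 2-26:
for c in (4.0, 11.6, 28.6, 35.5):
    print(c, "->", symbol_for(c))
```

Posting `symbol_for(c)` instead of `c` at each sampling location turns a posting plot into a symbol plot.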
       2.3.9.3        Other Spatial Graphical Representations

       The two plots described in Sections 2.3.9.1 and 2.3.9.2 provide information on the
location of extreme values and spatial trends. The graphs below provide another item of interest
to the data analyst, continuity of the spatial data.  The graphical representations are not described
in detail because they are used more for preliminary geostatistical analysis. These graphical
representations can be difficult to develop and interpret. For more information on these
representations, consult a statistician.

       An h-scatterplot is a plot of all possible pairs of data whose locations are separated by a
fixed distance in a fixed direction (indexed by h).  For example, an h-scatter plot could be based on
all the pairs whose locations are 1 meter apart in a southerly direction.  An h-scatter plot is similar
in appearance to a scatter plot (Section 2.3.7.2).  The shape of the spread of the data in an h-
scatter plot indicates the degree of continuity among data values a certain distance apart in a
particular direction.  If all the plotted values fall close to a fixed line, then the data values at
locations separated by a fixed distance in a fixed direction are very similar.  As data values become
less and less similar, the spread of the data around the fixed line increases outward. The data
analyst may construct several h-scatter plots with different distances to evaluate the change in
continuity in a fixed direction.
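Extracting the pairs for an h-scatter plot can be sketched as follows. This is an editor's sketch under the assumption that the data are stored in a dictionary keyed by (northing, easting) location; the function name and data layout are hypothetical:

```python
def h_scatter_pairs(data, h):
    """Pairs of values whose locations are separated by the vector h.
    data: dict mapping (northing, easting) -> measured value
    h:    (delta_northing, delta_easting), e.g. (-1.0, 0.0) for
          1 meter apart in a southerly direction"""
    dn, de = h
    return [(value, data[(n + dn, e + de)])
            for (n, e), value in data.items()
            if (n + dn, e + de) in data]
```

Plotting the returned pairs as a scatter plot gives the h-scatter plot; repeating for several values of h shows how continuity changes with separation distance.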
       A correlogram is a plot of the correlations of the h-scatter plots. Because the h-scatter
plot only displays the correlation between the pairs of data whose  locations are separated  by a
fixed distance in a fixed direction, it is useful to have a graphical representation of how these
correlations change for different separation distances in a fixed direction.  The correlogram is such
a plot which allows the analyst to evaluate the change in continuity in a fixed direction as  a
function of the  distance  between two points. A spatial correlogram is similar in appearance to a
temporal correlogram (Section 2.3.8.2).  The correlogram spans opposite directions so that the
correlogram with a fixed distance of due  north is identical to the correlogram with a fixed distance
of due south.

       Contour plots are used to reveal overall spatial trends in the data by interpolating data
values between sample locations. Most contour procedures depend on the density of the grid
covering the sampling area (higher density grids usually provide more information than lower
densities).  A contour plot gives one of the best overall pictures of the important spatial features.
However, contouring often requires that the actual fluctuations in the data values be smoothed, so
that many spatial features of the data may not be visible. The contour map should be used with
other graphical representations of the data and requires expert judgement to adequately interpret
the findings.

2.4    Probability Distributions

2.4.1  The Normal Distribution

       Data, especially  measurements, occur in natural patterns that can be considered to be a
distribution of values. In most instances the data values will be grouped around some measure of

central tendency such as the mean or median.  The spread of the data (as determined by the average
of the squared distances from the data points to the mean) is called the variance (the square root of
this is called the standard deviation). A distribution with a large variance will be more spread out
than one with a small variance (Figure 2-16). When the data values fall in a symmetric pattern
around the mean and then taper off rapidly to the tails, it is often a normal distribution or bell-
shaped curve.
             Figure 2-16.  The Normal
             Distribution
      Figure 2-17. The Standard
      Normal Curve (Z-Curve)
       The characteristics of a normal distribution are well known mathematically and, when
referred to, usually written as "data are distributed N(μ, σ²)" where the first characteristic is the
mean (μ) and the second, the variance (σ²). It may be shown that any normal distribution can be
transformed to a standard normal distribution, N(0,1), and this standard normal is referred to as
simply Z (Figure 2-17).  It is frequently necessary to refer to the percentiles of a standard normal
and in this guidance document, the subscript to a quoted Z-value will denote the percentile (or
area under the curve, cumulative from the left), see Figure 2-17.
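The standardizing transformation and the cumulative percentile of a Z-value can be sketched directly. This is an editor's illustration (function names are assumptions); it uses the exact relation Φ(z) = (1 + erf(z/√2))/2 rather than a printed table such as Table A-1:

```python
import math

def standardize(x, mu, sigma):
    # transform an N(mu, sigma^2) value onto the standard normal (Z) scale
    return (x - mu) / sigma

def z_cumulative(z):
    # area under the standard normal curve to the left of z
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

For example, `z_cumulative(1.645)` is approximately 0.95, which is why Z0.95 = 1.645 appears throughout this guidance.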

2.4.2   The t-Distribution
       The standard normal curve is used when exact information on the mean and variance is
available, but when only estimates from a sample are available, a different distribution applies.
When only information from a random sample on sample mean and sample variance is known for
decision making purposes, a Student's t distribution is appropriate. It resembles a standard
normal but is lower in the center and fatter in the tails. The degree of fatness in the tails is a
function of the degrees of freedom available, which in turn is related to sample size.
2.4.3  The Lognormal Distribution

       A commonly met distribution in environmental work is
the lognormal distribution which has a more skewed (lopsided)
shape than a normal, see Figure 2-18. The lognormal is
bounded by zero and has a fatter tail than the normal.  It is
related to the normal by the simple relationship: if X is
distributed lognormally, then Y = In (X) is distributed normally.
It is common practice to transform  data (and any standard
being tested against) to achieve approximate normality prior to
conducting statistical tests.
                  Figure 2-18. Three Different
                  Lognormal Distributions
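The relationship between the lognormal and the normal can be illustrated with a short simulation. This is an editor's sketch only; the seed, parameters, and sample size are arbitrary choices, not part of the guidance:

```python
import math
import random
import statistics

random.seed(7)

# Simulate lognormal data: if Y ~ N(0.5, 1), then X = exp(Y) is lognormal.
x = [math.exp(random.gauss(0.5, 1.0)) for _ in range(2000)]

# The log-transform recovers approximate normality: Y = ln(X).
y = [math.log(v) for v in x]
print("mean of ln(X):", statistics.mean(y))
print("sd   of ln(X):", statistics.stdev(y))
```

The mean and standard deviation of ln(X) come back close to the 0.5 and 1.0 used to generate the data, which is the basis for transforming lognormal data before applying normal-theory tests.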
2.4.4  Central Limit Theorem

       In nearly all estimation situations in environmental work, the focus of the investigation
centers on the mean of a random sample of observations or measurements. It is rare that true
normality of the observations can be assumed, and therefore the question arises as to whether
statistical tests based on normality may be used.

       In many cases, normally-based statistical tests are not overly affected by a lack of
normality, as these tests are very robust (sturdy) and perform tolerably well unless gross non-normality
is present.  In addition, the tests become increasingly tolerant of deviations from normality as the
number of individual samples constituting the sample mean increases.  In simple terms, as the size
of the sample increases, the mean of that sample acts increasingly as if it came from a normal
distribution regardless of the true distribution of the individual values. The taking of large
samples "stabilizes" the mean, so it then acts as if normality were present and the statistical tests
remain valid.
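This stabilizing behavior can be demonstrated with a short simulation. An editor's illustration only; the exponential population, seed, and trial count are arbitrary choices, not part of the guidance:

```python
import random
import statistics

random.seed(1)

def sample_means(n, trials=2000):
    # means of samples of size n drawn from a heavily skewed
    # (exponential) population -- far from normal for individual values
    return [statistics.mean(random.expovariate(1.0) for _ in range(n))
            for _ in range(trials)]

# As n grows, the spread of the sample means shrinks (roughly as 1/sqrt(n))
# and their distribution becomes increasingly symmetric and bell-shaped.
for n in (1, 5, 30):
    means = sample_means(n)
    print(f"n = {n:2d}: sd of sample means = {statistics.stdev(means):.3f}")
```

A histogram of the means for n = 30 is nearly symmetric even though the individual exponential values are strongly skewed, which is the Central Limit Theorem at work.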
                                            CHAPTER 3
                         STEP 3:  SELECT THE STATISTICAL TEST
THE DATA QUALITY ASSESSMENT PROCESS
              Review DQOs and Sampling Design
              Conduct Preliminary Data Review
                 Select the Statistical Test
                  Verify the Assumptions
              Draw Conclusions From the Data
            SELECT THE STATISTICAL TEST

           Purpose

           Select an appropriate procedure for analyzing
           data based on the preliminary data review.



           Activities

           • Select Statistical Hypothesis Test
           • Identify Assumptions Underlying Test


           Tools

           • Hypothesis tests for a single population
           • Hypothesis tests for comparing two populations
                                   Step 3:  Select the Statistical Test

             Select the statistical hypothesis test based on the data user's objectives and the results of the
             preliminary data review.
             •   If the problem involves comparing study results to a fixed threshold, such as a regulatory
                 standard, consider the hypothesis tests in Section 3.2.
             •   If the problem involves comparing two populations, such as comparing data from two
                 different locations or processes, then consider the hypothesis tests in Section 3.3.

             Identify the assumptions underlying the statistical test.
             •   List the key underlying assumptions of the statistical hypothesis test, such as distributional
                 form, dispersion, independence, or others as applicable.
             •   Note any sensitive assumptions where relatively small deviations could jeopardize the
                 validity of the test results.
                                   List of Boxes
                                                                              Page
Box 3-1: Directions for a One-Sample t-Test 	3-7
Box 3-2: An Example of a One-Sample t-Test  	3-8
Box 3-3: Directions for a One-Sample t-Test for a Stratified Random Sample	3-9
Box 3-4: An Example of a One-Sample t-Test for a Stratified Random Sample	3-10
Box 3-5: Directions for the Wilcoxon Signed Rank Test	3-12
Box 3-6: An Example of the Wilcoxon Signed Rank Test	3-13
Box 3-7: Directions for the Large Sample Approximation to the Wilcoxon
             Signed Rank Test	3-14
Box 3-8: Directions for the Chen Test 	3-16
Box 3-9: Example of the Chen Test  	3-17
Box 3-10:  Directions for the One-Sample Test for Proportions  	3-19
Box 3-11:  An Example of the One-Sample Test for Proportions	3-20
Box 3-12:  Directions for a Confidence Interval for a Mean  	3-21
Box 3-13:  An Example of a Confidence Interval for a Mean   	3-21
Box 3-14:  Directions for the Student's Two-Sample t-Test (Equal Variances)	3-24
Box 3-15:  An Example of a Student's Two-Sample t-Test (Equal Variances)	3-25
Box 3-16:  Directions for Satterthwaite's t-Test (Unequal Variances)	3-26
Box 3-17:  An Example of Satterthwaite's t-Test (Unequal Variances)	3-27
Box 3-18:  Directions for a Two-Sample Test for Proportions  	3-29
Box 3-19:  An Example of a Two-Sample Test for Proportions	3-30
Box 3-20:  Directions for the Wilcoxon Rank Sum Test	3-32
Box 3-21:  An Example of the Wilcoxon Rank Sum Test	3-33
Box 3-22:  Directions for the Large Sample Approximation to the
             Wilcoxon Rank Sum Test	3-34
Box 3-23:  Directions for a Modified  Quantile Test	3-36
Box 3-24:  An Example of a Modified  Quantile Test	3-37
Box 3-25:  Directions for Dunnett's Test for Simple Random and Systematic Samples .... 3-39
Box 3-26:  An Example of Dunnett's  Test for Simple Random and Systematic Samples ... 3-40
                                       CHAPTER 3
                      STEP 3:  SELECT THE STATISTICAL TEST

3.1    OVERVIEW AND ACTIVITIES

       This chapter provides information that the analyst can use in selecting an appropriate
statistical hypothesis test that will be used to draw conclusions from the data. A brief review of
hypothesis testing is  contained in Chapter  1. There are two important outputs from this step:  (1)
the test itself, and (2) the assumptions underlying the test that determine the validity of
conclusions drawn from the test results.

       This section describes the two primary activities in this step of DQA. The remaining
sections in this chapter contain statistical tests that may be useful for analyzing environmental
data.  In the one-sample tests discussed in  Section 3.2, data from a population are compared with
an absolute criterion  such as a regulatory threshold or action level. In the two-sample tests
discussed in Section  3.3, data from a population are compared with data from another population
(for example, an area expected to be contaminated might be compared with a background area).
For each statistical test, this chapter presents its purpose, assumptions, limitations, robustness,
and the sequence of steps required to apply the test.

       The directions for each hypothesis test given in this chapter are for simple random
sampling and randomized systematic sampling designs, except where noted otherwise.  If a more
complex design is used (such as a stratified design or a composite random sampling design) then
different formulas are needed, some of which are contained in this chapter.

3.1.1   Select Statistical Hypothesis Test

       If a particular test has been specified either in the DQO Process, the QA Project Plan, or
by the particular program or study, the analyst should use the results of the preliminary data
review to determine if this statistical test is legitimate for the data collected.  If the test is not
legitimate, the analyst should document why this particular statistical test should not be applied to
the data and then select a different test, possibly after consultation with the decision maker.  If a
particular test has not been  specified, the analyst should select a statistical test based on the data
user's objectives, preliminary data review, and likely viable assumptions.

3.1.2   Identify Assumptions Underlying the Statistical Test

       All statistical tests make assumptions about the  data.  Parametric tests assume the data
have some distributional form (e.g., the t-test assumes a normal distribution), whereas
nonparametric tests do not make this assumption (e.g., the Wilcoxon test only assumes the data
are symmetric but not necessarily normal). However, both parametric and nonparametric tests
may assume that the  data are statistically independent or that there are no trends in the data.
While examining the data, the analyst should always list the underlying assumptions of the
statistical hypothesis test, such as distribution, dispersion, or others as applicable.

       Another important feature of statistical tests is their sensitivities (nonrobustness) to
departures from the assumptions. A statistical procedure is called robust if its performance is not
seriously affected by moderate deviations from its underlying assumptions. The analyst should
note any sensitive assumptions where relatively small deviations could jeopardize the validity of
the test results.

3.2    TESTS OF HYPOTHESES ABOUT A SINGLE POPULATION

       A one-sample test involves the comparison of a population parameter (e.g., a mean,
percentile, or variance) to a threshold value. Both the threshold value and the population
parameter were specified during Step 1: Review DQOs and Sampling Design. In a one-sample
test, the threshold value is a fixed number that does not vary.  If the threshold value was estimated
(and therefore contains variability), a one-sample test is not appropriate.  An example of a one-
sample test would be to determine if 95% of all companies emitting sulfur dioxide into the air are
below a fixed discharge level. For this example, the population parameter is a percentage
(proportion) and the threshold value is 95% (.95). Another example is a common Superfund
problem that involves comparing the mean contaminant concentration to a risk-based standard.  In
this case, the risk-based standard (which is fixed) is the threshold value and the statistical
parameter is the true mean contaminant concentration level of the site. However, comparing the
mean concentration in an area to the mean concentration of a reference area (background) would
not be a one-sample test because the mean concentration in the reference area would need to be
estimated.

       The statistical tests discussed in this section may be used to determine if θ ≤ θ0 or θ ≥ θ0,
where θ represents either the population mean, median, a percentile, or a proportion and θ0
represents the threshold value.  Section 3.2.1 discusses tests concerning the population mean,
Section 3.2.2 discusses tests concerning a proportion or percentile, and Section 3.2.3 discusses
tests for a median.

3.2.1  Tests for a Mean

       A population mean is a measure of the center of the population distribution. It is one of
the most commonly used population parameters in statistical hypothesis testing because its
distribution is well known for large sample sizes.  The hypotheses considered in this section are:

       Case 1: H0: μ ≤ C  vs.  HA: μ > C; and

       Case 2: H0: μ ≥ C  vs.  HA: μ < C

where C represents a given threshold such as a regulatory level, and μ denotes the (true) mean
contaminant level for the population. For example, C may represent the  arsenic concentration
level of concern.  Then if the mean of the population exceeds C, the data user may wish to take
action.
       The information required for this test (defined in Step 1) includes the null and alternative
hypotheses (either Case 1 or Case 2); the gray region, i.e., a value μ1 > C for Case 1 or a value μ1
< C for Case 2 representing the bound of the gray region; the false rejection error rate α at C; the
false acceptance error rate β at μ1; and any additional limits on decision errors. It may be helpful
to label any additional false rejection error limits as α2 at C2, α3 at C3, etc., and to label any
additional false acceptance error limits as β2 at μ2, β3 at μ3, etc.  For example, consider the
following decision:  determine whether the mean contaminant level at a waste site is greater than
10 ppm.  The null hypothesis is H0: μ ≥ 10 ppm and the alternative hypothesis is HA: μ < 10
ppm.  A gray region has been set from 10 to 8 ppm, a false rejection error rate of 5% has been set
at 10 ppm, and a false acceptance error rate of 10% has been set at 8 ppm.  Thus, C = 10 ppm, μ1
= 8 ppm, α = 0.05, and β = 0.1.  If an additional false acceptance error rate was set, for example,
an error rate of 1% at 4 ppm, then β2 = 0.01 and μ2 = 4 ppm.

       3.2.1.1       The One-Sample t-Test

       PURPOSE

       Given a random sample of size n (or a composite sample of size n, each composite
consisting of k aliquots), the one-sample t-test can be used to test hypotheses involving the mean
(μ) of the population from which the sample was selected.

       ASSUMPTIONS  AND THEIR VERIFICATION

       The primary assumptions required for validity of the one-sample t-test are that of a
random sample (independence of the data values) and that the sample mean x is approximately
normally distributed. Because the sample mean and standard deviation are very sensitive to
outliers, the t-test should be preceded by a test for outliers (see Section 4.4).

       Approximate normality of the sample mean follows from approximate normality of the
data values.  In addition, the Central Limit Theorem states that the sample mean of a random
sample from a population with an unknown distribution will be approximately normally distributed
provided the sample size is large.  This means that although the population distribution from
which the data are drawn  can be distinctly different from the normal distribution, the distribution
of the sample mean can still be approximately normal when the sample size is relatively large.
Although preliminary tests for normality of the data can and should be done for small sample
sizes, the conclusion that the sample does not follow a normal distribution does not automatically
invalidate the t-test, which is robust to moderate violations of the assumption of normality for
large sample sizes.

       LIMITATIONS AND ROBUSTNESS

       The t-test is not robust to outliers because the sample mean and standard deviation are
influenced greatly by outliers. The Wilcoxon signed rank test (see  Section 3.2.1.2) is more
robust, but is slightly less powerful.  This means that the Wilcoxon signed rank test is slightly less
likely to reject the null hypothesis when it is false than the t-test.

       The t-test has difficulty dealing with less-than values, e.g., values below the detection
limit, compared with tests based on ranks or proportions.  Tests based on a proportion above a
given threshold (Section 3.2.2) are more valid in such a case, if the threshold is above the
detection limit. It is also possible to substitute values for below-detection-level data (e.g., one-half the
detection level) or to adjust the statistical quantities to account for nondetects (e.g., Cohen's
Method for normally or lognormally distributed data).  See Chapter 4 for more information on
dealing with data that are below the detection level.

       SEQUENCE OF  STEPS

       Directions for a one-sample t-test for simple, systematic, and composite random samples
are given in Box 3-1 and an example is given in Box 3-2.  Directions for a one-sample t-test for a
stratified random sample are given in Box 3-3 and an example is given in Box 3-4.
                              Box 3-1: Directions for a One-Sample t-Test
                              for Simple and Systematic Random Samples
                                      with or without Compositing

  Let X1, X2, . . . , Xn represent the n data points. These could be either n individual samples or n composite
  samples consisting of k aliquots each. These are the steps for a one-sample t-test for Case 1 (H0: μ ≤ C);
  modifications for Case 2 (H0: μ ≥ C) are given in braces {}.

  STEP 1:  Calculate the sample mean, x (Section 2.2.2), and the standard deviation, s (Section 2.2.3).

  STEP 2:  Use Table A-1 of Appendix A to find the critical value t1-α such that 100(1-α)% of the t distribution with
           n - 1 degrees of freedom is below t1-α. For example, if α = 0.05 and n = 16, then n - 1 = 15 and t1-α =
           1.753.

  STEP 3:  Calculate the sample value t = (x̄ - C) / (s/√n).

  STEP 4:  Compare t with t1-α.

           1) If t > t1-α {t < -t1-α}, the null hypothesis may be rejected.  Go to Step 6.

           2) If t ≤ t1-α {t ≥ -t1-α}, there is not enough evidence to reject the null hypothesis and the false
           acceptance error rate should be verified.  Go to Step 5.

  STEP 5:  As the null hypothesis (H0) was not rejected, calculate either the power of the test or the sample size
           necessary to achieve the false rejection and false acceptance error rates. To calculate the power,
           assume that the true values for the mean and standard deviation are those obtained in the sample
           and use a software package like the Decision Error Feasibility Trials (DEFT) software (EPA, 1994) to
           generate the power curve of the test.

           If only one false acceptance error rate (β) has been specified (at μ1), it is possible to calculate the
           sample size which achieves the DQOs, assuming the true mean and standard deviation are equal to
           the values estimated from the sample, instead of calculating the power of the test. To do this,
           calculate

                    m = s²(z1-α + z1-β)² / (μ1 - C)² + (0.5)(z1-α)²

           where zp is the pth percentile of the standard normal distribution (Table A-1 of Appendix A).
           Round m up to the next integer.  If m ≤ n, the false acceptance error rate has been satisfied.
           If m > n, the false acceptance error rate has not been satisfied.

  STEP 6:  The results of the test may be:

           1) the null hypothesis was rejected and it seems that the true mean is greater than C {less than C};

           2) the null hypothesis was not rejected and the false acceptance error rate was satisfied and  it seems
           that the true mean is less than C {greater than C}; or

           3) the null hypothesis was not rejected and the false acceptance error rate was not satisfied and it
           seems that the true mean is less than C {greater than C} but conclusions are uncertain since the
           sample size was too small.

           Report the results of the test, the sample size, sample mean, standard deviation, t and t1-α.

  Note: The calculations for the t-test are the same for both simple random and composite random sampling. The
  use of compositing will usually result in a smaller value of "s" than simple random sampling.
                             Box 3-2:  An Example of a One-Sample t-Test
                              for a Simple Random or Composite Sample

     Consider the following 9 random (or composite, each of k aliquots) data points: 82.39 ppm,
     103.46 ppm, 104.93 ppm, 105.52 ppm, 98.37 ppm, 113.23 ppm, 86.62 ppm, 91.72 ppm, and 108.21
     ppm. These data will be used to test the hypothesis: H0: μ ≤ 95 ppm vs. HA: μ > 95 ppm. The decision
     maker has specified a 5% false rejection decision error limit (α) at 95 ppm (C), and a 20% false
     acceptance decision error limit (β) at 105 ppm (μ1).

     STEP 1:  Using the directions in Box 2-2 and Box 2-3, it was found that

                             X̄ = 99.38 ppm  and  s = 10.41 ppm.

     STEP 2:  Using Table A-1 of Appendix A, the critical value of the t distribution with 8 degrees of freedom
              is t0.95 = 1.86.


     STEP 3:   t = (X̄ - C) / (s/√n) = (99.38 - 95) / (10.41/√9) = 1.26

     STEP 4:  Because 1.26 ≤ 1.86 (t does not exceed t0.95), there is not enough evidence to reject the null
              hypothesis and the false acceptance error rate should be verified.

     STEP 5:  Because there is only one false acceptance error rate, it is possible to use the sample size
              formula to determine if the error rate has been satisfied. Therefore,

                    m = s²(z1-α + z1-β)² / (μ1 - C)² + (0.5)(z1-α)²
                      = (10.41)²(1.645 + 0.842)² / (95 - 105)² + (0.5)(1.645)² = 8.06, i.e., 9.

              Notice that it is customary to round upwards when computing a sample size. Since m = n, the
              false acceptance error rate has been satisfied.

     STEP 6:   The results of the hypothesis test were that the null hypothesis was not rejected and the false
               acceptance error rate was satisfied. Therefore, it seems that the true mean is less than 95
               ppm.
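The calculations in Boxes 3-1 and 3-2 can be sketched in a few lines of Python (standard library only). The critical value t0.95 = 1.860 for 8 degrees of freedom and the z-percentiles are taken from a table such as Table A-1 rather than computed:

```python
# Sketch of the one-sample t-test of Box 3-2 (Case 1, H0: mean <= 95 ppm).
# Critical values are looked up in tables, as in the guidance.
from math import sqrt, ceil
from statistics import mean, stdev

data = [82.39, 103.46, 104.93, 105.52, 98.37, 113.23, 86.62, 91.72, 108.21]  # ppm
C, mu1 = 95.0, 105.0            # action level and gray-region bound
t_crit = 1.860                  # t_{1-alpha}, alpha = 0.05, n - 1 = 8 df
z95, z80 = 1.645, 0.842         # z_{0.95} and z_{0.80}

n = len(data)
xbar, s = mean(data), stdev(data)
t = (xbar - C) / (s / sqrt(n))                  # Step 3
reject = t > t_crit                             # Step 4 (Case 1)

# Step 5: minimum sample size that meets the false acceptance error rate
m = ceil(s**2 * (z95 + z80)**2 / (mu1 - C)**2 + 0.5 * z95**2)
print(round(xbar, 2), round(s, 2), round(t, 2), reject, m)
```

Here the test statistic (about 1.26) does not exceed 1.860, and m = 9 = n, matching the conclusion in Box 3-2.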
                              Box 3-3:  Directions for a One-Sample t-Test
                                     for a Stratified Random Sample

  Let h = 1, 2, 3, . . . , L represent the L strata and nh represent the sample size of stratum h. These steps are for a
  one-sample t-test for Case 1 (H0: μ ≤ C); modifications for Case 2 (H0: μ ≥ C) are given in braces { }.

  STEP 1:  Calculate the stratum weights (Wh) using the proportion of the volume in stratum h:

                    Wh = Vh / (V1 + V2 + . . . + VL)

           where Vh is the surface area of stratum h multiplied by the depth of sampling in stratum h.

  STEP 2:  For each stratum, calculate the sample stratum mean X̄h = Σ Xhi / nh (sum over i = 1, . . . , nh) and
           the sample stratum standard deviation sh = √[ Σ (Xhi - X̄h)² / (nh - 1) ].

  STEP 3:  Calculate the overall mean X̄ST = Σ Wh X̄h and the overall variance s²ST = Σ Wh² sh² / nh
           (sums over h = 1, . . . , L).

  STEP 4:  Calculate the degrees of freedom (dof):

                    dof = (s²ST)² / Σ [ Wh⁴ sh⁴ / (nh²(nh - 1)) ]   (sum over h = 1, . . . , L).

           Use Table A-1 of Appendix A to find the critical value t1-α so that 100(1-α)% of the t
           distribution with the above degrees of freedom (rounded to the next highest integer) is
           below t1-α.

  STEP 5:  Calculate the sample value:  t = (X̄ST - C) / √(s²ST)
  STEP 6:  Compare t to t1-α. If t > t1-α {t < -t1-α}, the null hypothesis may be rejected.  Go to Step 8. If t ≤ t1-α
           {t ≥ -t1-α}, there is not enough evidence to reject the null hypothesis and the false acceptance error rate
           should be verified. Go to Step 7.

  STEP 7:  If the null hypothesis was not rejected, calculate either the power of the test or the sample size
           necessary to achieve the false rejection and false acceptance error rates (see Step 5, Box 3-2).

  STEP 8:  The results of the test may be: 1) the null hypothesis was rejected so it seems that the true mean is
           greater than C {less than C}; 2) the null hypothesis was not rejected and the false acceptance error
           rate was satisfied and it seems that the true mean is less than C {greater than C}; or 3) the null
           hypothesis was not rejected and the false acceptance error rate was not satisfied and it seems that
           the true mean is less than C {greater than C} but conclusions are uncertain since the sample size was
           too small.

             Report the results of the test, as well as the sample size, sample mean, and sample standard
             deviation for each stratum, the estimated t, the dof, and t1-α.
                             Box 3-4:  An Example of a One-Sample t-Test
                                    for a Stratified Random Sample

     Consider a stratified sample consisting of two strata where stratum 1 comprises 10% of the total site
     surface area and stratum 2 comprises the other 90%, and 40 samples were collected from stratum 1, and
     60 samples were collected from stratum 2.  For stratum 1, the sample mean is 23 ppm and the sample
     standard deviation is 18.2 ppm. For stratum 2, the sample mean is 35 ppm, and the sample standard
     deviation is 20.5 ppm.  This information will be used to test the null hypothesis that the overall site mean
     is greater than or equal to 40 ppm, i.e., H0:  μ ≥ 40 ppm (Case 2). The decision maker has specified a
     1% false rejection decision limit (α) at 40 ppm and a 20% false acceptance decision error limit (β) at 35
     ppm (μ1).

     STEP 1:  W1 = 10/100 = 0.10, W2 = 90/100 = 0.90.

     STEP 2:  From above, X̄1 = 23 ppm, X̄2 = 35 ppm, s1 = 18.2, and s2 = 20.5.  This information was
              developed using the equations in Step 2 of Box 3-3.

     STEP 3:  The estimated overall mean concentration is:

                    X̄ST = Σ Wh X̄h = W1 X̄1 + W2 X̄2 = (.1)(23) + (.9)(35) = 33.8 ppm,

              and the estimated overall variance is:

                    s²ST = Σ Wh² sh² / nh = (.1)²(18.2)²/40 + (.9)²(20.5)²/60 = 5.76.

     STEP 4:  The approximate degrees of freedom (dof) is:

                    dof = (s²ST)² / Σ [ Wh⁴ sh⁴ / (nh²(nh - 1)) ]
                        = (5.76)² / [ (.1)⁴(18.2)⁴/((40)²(39)) + (.9)⁴(20.5)⁴/((60)²(59)) ] = 60.8, i.e., 61.

              Note how the degrees of freedom has been rounded up to a whole number. Using Table A-1 of
              Appendix A, the critical value t1-α of the t distribution with 61 dof is approximately 2.39.

     STEP 5:  Calculate the sample value  t = (X̄ST - C) / √(s²ST) = (33.8 - 40) / √5.76 = -2.58.

     STEP 6:  Because -2.58 < -2.39 (t < -t1-α), the null hypothesis may be rejected.

     STEP 7:  Because the null hypothesis was rejected,  it is concluded that the mean is probably less than
              40 ppm.  In this example there is no need to calculate the false acceptance rate as the null
              hypothesis was rejected and so the chance of making a false acceptance error is zero by
              definition.
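The stratified calculation in Boxes 3-3 and 3-4 can likewise be sketched in Python (standard library only); the critical value t1-α of about 2.39 for 61 degrees of freedom would still be looked up in Table A-1:

```python
# Sketch of the stratified one-sample t-test of Box 3-4 (Case 2).
from math import sqrt, ceil

W = [0.1, 0.9]         # stratum weights (Step 1)
xbar = [23.0, 35.0]    # stratum sample means, ppm (Step 2)
s = [18.2, 20.5]       # stratum sample standard deviations, ppm
n = [40, 60]           # stratum sample sizes
C = 40.0               # action level, ppm

x_st = sum(w * x for w, x in zip(W, xbar))                    # Step 3
var_st = sum(w**2 * sd**2 / m for w, sd, m in zip(W, s, n))
dof = ceil(var_st**2 / sum(w**4 * sd**4 / (m**2 * (m - 1))    # Step 4
                           for w, sd, m in zip(W, s, n)))
t = (x_st - C) / sqrt(var_st)                                 # Step 5
print(round(x_st, 1), round(var_st, 2), dof, round(t, 2))
```

With t = -2.58 below -t1-α of about -2.39, the null hypothesis is rejected, as in Box 3-4.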
       3.2.1.2       The Wilcoxon Signed Rank (One-Sample) Test for the Mean

       PURPOSE

       Given a random sample of size n (or a composite sample of size n, each composite consisting
of k aliquots), the Wilcoxon signed rank test can be used to test hypotheses regarding the mean or
median of the population from which the sample was selected.

       ASSUMPTIONS AND THEIR VERIFICATION

       The Wilcoxon signed rank test assumes that the data constitute a random sample from a
symmetric continuous population. (Symmetric means that the underlying population frequency
curve is symmetric about its mean/median.)  Symmetry is a less stringent assumption than
normality since all normal distributions are symmetric, but some symmetric distributions are not
normal. The mean and median are equal for a  symmetric distribution, so the null hypothesis can
be stated in terms of either parameter.  Tests for symmetry can be devised which are based on the
chi-squared distribution, or a test for normality may be used.  If the data are not symmetric, it may
be possible to transform the data so that this assumption is satisfied. See Chapter 4 for more
information on transformations and tests for symmetry.

       LIMITATIONS AND ROBUSTNESS

       Although symmetry is a weaker assumption than normality, it is nonetheless a strong
assumption.  If the data are not approximately  symmetric, this test should not be used. For large
sample sizes (n > 50), the t-test is more robust to violations of its assumptions than the Wilcoxon
signed rank test.  For small sample sizes, if the data are not approximately symmetric and are not
normally distributed, this guidance recommends consulting a  statistician before selecting a
statistical test or changing the population parameter to the median and applying a different
statistical test (Section 3.2.3).

       The Wilcoxon signed rank test may produce misleading results if many data values are the
same. When values are the same, their relative ranks are the same, and this has the effect of
diluting the statistical power of the Wilcoxon test. Box  3-5 demonstrates the method used to
break tied ranks.  If possible, results should be recorded with sufficient accuracy so that a large
number of equal  values do not occur. Estimated concentrations should be reported for data
below the detection limit,  even if these estimates are negative, as their relative magnitude to the
rest of the data is of importance.

       SEQUENCE OF STEPS

       Directions for the Wilcoxon signed rank test for a simple random sample and a systematic
simple random sample are given in Box 3-5 and an example is given in Box 3-6 for sample sizes
smaller than 20.  For sample sizes of 20 or greater, the large sample approximation to the
Wilcoxon Signed Rank Test should be used. Directions for this test are given in Box 3-7.

                          Box 3-5:  Directions for the Wilcoxon Signed Rank Test
                               for Simple and Systematic Random Samples

     Let X1, X2, . . . , Xn represent the n data points.  The following describes the steps for applying the
     Wilcoxon signed rank test for a sample size (n) less than 20 for Case 1 (H0: μ ≤ C); modifications for
     Case 2 (H0: μ ≥ C) are given in braces {}.  If the sample size is greater than or equal to 20, use Box 3-7.

     STEP 1:  If possible, assign values to any measurements below the detection limit.  If this is not
              possible, assign the value "Detection Limit divided by 2" to each such value.  Then subtract
              each observation Xi from C to obtain the deviations di = C - Xi.  If any of the deviations are
              zero, delete them and correspondingly reduce the sample size n.

     STEP 2:  Assign ranks from 1 to n based on ordering the absolute deviations |di| (i.e., magnitude of
              differences ignoring the sign) from smallest to largest.  The rank 1 is assigned to the smallest
              value, the rank 2 to the second smallest value, and so forth.  If there are ties, assign the
              average of the ranks which would otherwise have been assigned to the tied observations.

     STEP 3:  Assign the sign for each observation to create the signed rank.  The sign is positive if the
              deviation di is positive; the sign is negative if the deviation di is negative.

     STEP 4:                 Calculate the sum R of the  ranks with a positive sign.

     STEP 5:  Use Table A-6 of Appendix A to find the critical value wα.

              If R ≤ wα {R ≥ n(n+1)/2 - wα}, the null hypothesis may be rejected; proceed to Step 7.

              Otherwise, there is not enough evidence to reject the null hypothesis, and the false
              acceptance error rate will need to be verified; proceed to Step 6.

     STEP 6:  If the null hypothesis (H0) was not rejected, calculate either the power of the test or the
              sample size necessary to achieve the false rejection and false acceptance error rates using a
              software package like the DEFT software (EPA, 1994).  For large sample sizes, calculate

                    m = s²(z1-α + z1-β)² / (μ1 - C)² + (0.5)(z1-α)²

              where zp is the pth percentile of the standard normal distribution (Table A-1 of Appendix A). If
              1.16m ≤ n, the false acceptance error rate has been satisfied.

     STEP 7:                 The results of the test may be:

               1) the  null hypothesis was rejected and it seems that the true mean is greater than C {less
               than C};

               2) the  null hypothesis was not rejected and the false acceptance error rate was satisfied and
               it seems that the true mean  is less than C {greater than C}; or

                3) the null hypothesis was not rejected and the false acceptance error rate was not satisfied
                and it seems that the true mean is less than C {greater than C} but conclusions are uncertain
                since the sample size was too small.

                          Box 3-6:  An Example of the Wilcoxon Signed Rank Test
                                     for a Simple Random Sample

     Consider the following 10 data points: 974 ppb, 1044 ppb, 1093 ppb, 897 ppb, 879 ppb, 1161 ppb, 839
     ppb, 824 ppb, 796 ppb, and one observation below the detection limit of 750 ppb. These data will be
     used to test the hypothesis: H0: μ ≥ 1000 ppb vs. HA: μ < 1000 ppb (Case 2). The decision maker has
     specified a 10% false rejection decision error limit (α) at 1000 ppb (C), and a 20% false acceptance
     decision error limit (β) at 900 ppb (μ1).

     STEP 1:  Assign the value 375 ppb (750 divided by 2) to the data point below the detection limit.
              Subtract C (1000) from each of the n observations Xi to obtain the deviations di = 1000 - Xi.
              This is shown in row 2 of the table below.

              Xi       974   1044   1093   897   879   1161   839   824   796   375
              di        26    -44    -93   103   121   -161   161   176   204   625
              |di|      26     44     93   103   121    161   161   176   204   625
              rank       1      2      3     4     5    6.5   6.5     8     9    10
              s-rank     1     -2     -3     4     5   -6.5   6.5     8     9    10

     STEP 2:  Assign ranks from 1 to n based on ordering the absolute deviations |di| (magnitude ignoring
              any negative sign) from smallest to largest. The absolute deviations are listed in row 3 of the
              table above. Note that the data have been sorted (rearranged) for clarity so that the absolute
              deviations are ordered from smallest to largest. The rank 1 is assigned to the smallest value,
              the rank 2 to the second smallest value, and so forth. Observations 6 and 7 are tied; therefore,
              the average (6+7)/2 = 6.5 is assigned to the two observations. The ranks are shown in row 4.

     STEP 3:  Assign the sign for each observation to create the signed rank. The sign is positive if the
              deviation di is positive; the sign is negative if the deviation di is negative. The signed ranks
              are shown in row 5.

     STEP 4:  R = 1 + 4 + 5 + 6.5 + 8 + 9 + 10 = 43.5.

     STEP 5:  Table A-6 of Appendix A was used to find the critical value wα where α = 0.10. For this
              example, w0.10 = 15. Since 43.5 ≥ (10)(11)/2 - 15 = 40, the null hypothesis may be rejected.

     STEP 7:  The null hypothesis was rejected at the 10% significance level using the Wilcoxon signed rank
              test. Therefore, it seems that the true mean is below 1000 ppb.
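The ranking and the test statistic of Box 3-6 can be reproduced with a short Python sketch (standard library only); the critical value w0.10 = 15 is taken from Table A-6, not computed:

```python
# Sketch of the Wilcoxon signed rank computation of Box 3-6 (Case 2).
C = 1000.0
data = [974, 1044, 1093, 897, 879, 1161, 839, 824, 796, 375]  # ppb; 375 = DL/2

d = [C - x for x in data if C - x != 0]       # Step 1: deviations, zeros dropped
n = len(d)
order = sorted(range(n), key=lambda i: abs(d[i]))
ranks = [0.0] * n
i = 0
while i < n:                                   # Step 2: average ranks for ties
    j = i
    while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
        j += 1
    for k in range(i, j + 1):
        ranks[order[k]] = (i + 1 + j + 1) / 2
    i = j + 1

R = sum(r for r, dev in zip(ranks, d) if dev > 0)   # Steps 3-4
w_alpha = 15                                        # Table A-6, alpha = 0.10, n = 10
reject = R >= n * (n + 1) / 2 - w_alpha             # Step 5, Case 2
print(R, reject)
```

The two deviations of magnitude 161 receive the average rank 6.5, and R = 43.5 exceeds 40, so the null hypothesis is rejected as in the box.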
        Box 3-7: Directions for the Large Sample Approximation to the Wilcoxon Signed Rank Test
                              for Simple and Systematic Random Samples

     Let X1, X2, . . . , Xn represent the n data points where n is greater than or equal to 20.  The following
     describes the steps for applying the large sample approximation for the Wilcoxon signed rank test for
     Case 1 (H0: μ ≤ C); modifications for Case 2 (H0: μ ≥ C) are given in braces {}.

     STEP 1:  If possible, assign values to any measurements below the detection limit.  If this is not
              possible, assign the value "Detection Limit divided by 2" to each such value. Then subtract
              each observation Xi from C to obtain the deviations di = C - Xi. If any of the deviations are
              zero, delete them and correspondingly reduce the sample size n.

     STEP 2:  Assign ranks from 1 to n based on ordering the absolute deviations |di| (i.e., magnitude of
              differences ignoring the sign) from smallest to largest. The rank 1 is assigned to the smallest
              value, the rank 2 to the second smallest value, and so forth. If there are ties, assign the
              average of the ranks which would otherwise have been assigned to the tied observations.

     STEP 3:  Assign the sign for each observation to create the signed rank.  The sign is positive if the
              deviation di is positive; the sign is negative if the deviation di is negative.

     STEP 4:  Calculate the sum R of the ranks with a positive sign.

     STEP 5:  Calculate  w = n(n + 1)/4 + zp √[n(n + 1)(2n + 1)/24]  where p = α {p = 1-α} and zp

              is the pth percentile of the standard normal distribution (Table A-1 of Appendix A).

     STEP 6:  If R < w {R > w}, the null hypothesis may be rejected. Go to Step 8.

                     Otherwise, there is not enough evidence to reject the null hypothesis, and the false
                     acceptance error rate will need to be verified. Go to Step 7.

     STEP 7:  If the null hypothesis (H0) was not rejected, calculate either the power of the test or the sample
              size necessary to achieve the false rejection and false acceptance error rates using a software
              package like the DEFT software (EPA, 1994).  For large sample sizes, calculate

                    m = s²(z1-α + z1-β)² / (μ1 - C)² + (0.5)(z1-α)²

              where zp is the pth percentile of the standard normal distribution (Table A-1 of Appendix
              A). If 1.16m ≤ n, the false acceptance error rate has been satisfied.

     STEPS:  The results of the test may be:

              1) the  null hypothesis was rejected and it seems that the true mean is greater {less} than C;

              2) the  null hypothesis was not rejected and the false acceptance error rate was satisfied and it
              seems that the true mean is less than C {greater than C}; or

               3) the null hypothesis was not rejected and the false acceptance error rate was not satisfied
               and it seems that the true mean is less than C {greater than C} but conclusions are uncertain
               since the sample size was too small.

              Report the results of the test, the sample size, R, and w.
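As a sketch of the large-sample critical value w of Box 3-7, with illustrative (hypothetical) values n = 30 and α = 0.05, and with `statistics.NormalDist` supplying the normal percentile that Table A-1 would otherwise provide. The lower-tail percentile is used here for Case 1, where the rejection region is R < w:

```python
# Sketch of Step 5 of Box 3-7: large-sample critical value for the
# Wilcoxon signed rank statistic. n and alpha are illustrative choices.
from math import sqrt
from statistics import NormalDist

n, alpha = 30, 0.05
z_p = NormalDist().inv_cdf(alpha)     # lower-tail percentile for Case 1
                                      # (use 1 - alpha for Case 2)
w = n * (n + 1) / 4 + z_p * sqrt(n * (n + 1) * (2 * n + 1) / 24)
# Case 1: reject H0 if the positive-rank sum R falls below w.
print(round(w, 1))
```

For these illustrative values w comes out near 152.5, well below the null expectation n(n+1)/4 = 232.5, as a lower-tail cutoff should.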
        3.2.1.3 The Chen Test

        PURPOSE

        Environmental data such as concentration measurements are often confined to positive
values and appear to follow a distribution with most of the data values relatively small or near
zero but with a few relatively large data values.  Underlying such data is some distribution which
is not symmetric (like a normal) but which is skewed to the right (like a lognormal). Given a
random sample of size n from a right-skewed distribution, the Chen test can be used to compare
the mean (μ) of the distribution with a threshold level or regulatory value. The null hypothesis
has the form H0: μ ≤ C, where C is a given threshold level; the alternative hypothesis is HA: μ > C.
The method is not recommended for testing null hypotheses of the form H0: μ ≥ C against
HA: μ < C.
                               Box 3-8:  Directions for the Chen Test

 Let X1, X2, . . . , Xn represent the n data points.  Let C denote the threshold level of interest. The null hypothesis
 is H0: μ ≤ C and the alternative is HA: μ > C; the level of significance is α.

 STEP 1:   If at most 15% of the data points are below the detection limit (DL) and C is much larger than the
           DL, then replace values below the DL with DL/2.

 STEP 2:   Visually check the assumption of right-skewness by inspecting a histogram or frequency plot for the
           data. (The Chen test is appropriate only for data which are skewed to the right.)

 STEP 3:   Calculate the sample mean, X̄ (Section 2.2.2), and the standard deviation, s (Section 2.2.3).

 STEP 4:   Calculate the sample skewness  b = n Σ (Xi - X̄)³ / [(n - 1)(n - 2)s³]  (sum over i = 1, . . . , n),
           the quantity  a = b / (6√n),  the t-statistic  t = (X̄ - C) / (s/√n),  and then compute
           T = t + a(1 + 2t²) + 4a²(t + 2t³).

           NOTE: The skewness b should be greater than 1 to confirm the data are skewed to the right.

 STEP 5:   Use the last row of Table A-1 in Appendix A to find the critical value z1-α such that 100(1-α)% of the
           Normal distribution is below z1-α. For example, if α = 0.05 then z1-α = 1.645.

 STEP 6:   Compare T with z1-α.

           1)  If T > z1-α, the null hypothesis may be rejected and it seems that the true mean is greater than C;

           2)  If T ≤ z1-α, there is not enough evidence to reject the null hypothesis so it seems that the true mean
           is less than C.
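The Chen statistic of Box 3-8 can be sketched in Python (standard library only), applied to the concentration data of Box 3-9 as reproduced there. Note that Box 3-9 reports n = 23 data points while only 22 values are listed, so b and T come out close to, but not exactly equal to, the box's 1.14 and 1.965:

```python
# Sketch of the Chen test (Box 3-8): right-skewed data, H0: mean <= 10 ppm.
from math import sqrt
from statistics import mean, stdev

data = [2.0, 2.0, 5.0, 5.2, 5.9, 6.6, 7.4, 7.4, 9.7, 9.7, 10.2, 11.5,
        12.4, 12.7, 14.1, 15.2, 17.7, 18.9, 22.8, 28.6, 30.5, 35.5]  # ppm
C = 10.0
n = len(data)
xbar, s = mean(data), stdev(data)

b = n * sum((x - xbar) ** 3 for x in data) / ((n - 1) * (n - 2) * s ** 3)
a = b / (6 * sqrt(n))
t = (xbar - C) / (s / sqrt(n))
T = t + a * (1 + 2 * t ** 2) + 4 * a ** 2 * (t + 2 * t ** 3)   # Step 4

reject = T > 1.645      # Step 6: z_{0.95} for alpha = 0.05
print(round(b, 2), round(T, 2), reject)
```

The skewness b exceeds 1, confirming right-skewness, and T exceeds 1.645, so the null hypothesis is rejected, consistent with Box 3-9.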
3.2.2     Tests for a Proportion or Percentile

          This section considers hypotheses concerning population proportions and percentiles.
A population proportion is the ratio of the number of elements of a population that have some
specific characteristic to the total number of elements.  A population percentile represents the
percentage of elements of a population having values less than some threshold C. Thus, if C is the
95th percentile of a population, 95% of the elements of the population have values less than C and
5% of the population have values greater than C.

          This section of the guidance covers the following hypotheses: Case 1: H0: P ≤ P0 vs.
HA: P > P0 and Case 2: H0: P ≥ P0 vs. HA: P < P0, where P is a proportion of the population,
and P0 represents a given proportion (0 < P0 < 1).  Equivalent hypotheses written in terms of
percentiles are H0:  the 100Pth percentile is C or larger for Case 1, and H0: the 100Pth percentile is
C or smaller for Case 2. For example, consider the decision to determine whether the 95th
percentile of a container of waste is less than 1 mg/L cadmium.  The null  hypothesis in this case is
                                Box 3-9:  Example of the Chen Test

 Consider the following sample of contaminant concentration measurements (in ppm): 2.0, 2.0, 5.0, 5.2, 5.9,
 6.6, 7.4, 7.4, 9.7, 9.7, 10.2, 11.5, 12.4, 12.7, 14.1, 15.2, 17.7, 18.9, 22.8, 28.6, 30.5, 35.5.  We want to test the
 null hypothesis that the mean μ is less than 10 ppm versus the alternative that it exceeds 10 ppm. A
 significance level of 0.05 is to be used.

 STEP 1:   Since all of the data points exceed the detection limit, there is no need to substitute values for data
           below the detection limit.

 STEP 2:   A frequency plot of the 23 data points confirms the right-skewness.
           [Figure: frequency plot of the data; horizontal axis, Concentration (ppm), 0 to 40.]

 STEP 3:   Using Boxes 2-2 and 2-4 of Chapter 2, it is found that X̄ = 13.08 ppm and s = 8.99 ppm.

 STEP 4:   b = n Σ (Xi - X̄)³ / [(n - 1)(n - 2)s³] = 23 Σ (Xi - 13.08)³ / [(22)(21)(8.99)³] = 1.14,

           a = b / (6√n) = 1.14 / (6√23) = 0.0396,

           t = (X̄ - C) / (s/√n) = (13.08 - 10) / (8.99/√23) = 1.64,

           and compute  T = t + a(1 + 2t²) + 4a²(t + 2t³) = 1.965.

           (The value of 1.14 for skewness confirms that the data are skewed to the right.)

 STEP 5:   Using the last row of Table A-1 of Appendix A, the critical value z0.95 of the Normal distribution is
           1.645.

 STEP 6:   Since T > z0.95 (1.965 > 1.645), the null hypothesis is rejected and we conclude that the true mean is
           greater than 10 ppm.
H0: the 95th percentile of cadmium is less than 1 mg/L.  Now, instead of considering the
population to consist of differing levels of cadmium, consider the population to consist of a binary
variable that is '1' if the cadmium level is above 1 mg/L or is '0' if the level is below 1 mg/L. In
this case, the hypothesis may be changed to a test for a proportion so that the null hypothesis
becomes H0: P < .95 where P represents the proportion of 1's (cadmium levels above 1 mg/L) in
the container of waste.  Thus, any hypothesis about the proportion of the site below a threshold
can be converted to an equivalent hypothesis about percentiles. Therefore, only hypotheses about
the proportion of the site below a threshold will be discussed in this section. The information
required for this test includes the null and alternative hypotheses, the gray region, the false
rejection error rate α at P0, the false acceptance error rate β at P1, and any additional limits on
decision errors. It may be helpful to label any additional false rejection error limits as α2 at Pα2, α3
at Pα3, etc., and any additional false acceptance error limits as β2 at Pβ2, β3 at Pβ3, etc.
       3.2.2.1       The One-Sample Proportion Test

       PURPOSE

       Given a random sample of size n, the one-sample proportion test may be used to test
hypotheses regarding a population proportion or percentile of the distribution from which the
data were drawn.  Note that for P = 0.5, this test is also called the Sign test.

       ASSUMPTIONS AND THEIR VERIFICATION

       The only assumption required for the one-sample proportion test is the assumption of a
random sample. To verify this assumption, review the procedures and documentation used to
select the sampling points and ascertain that proper randomization has been used.

       LIMITATIONS AND ROBUSTNESS

       Since the only assumption is that of a random sample, the procedures are valid for any
underlying distributional shape. The procedures are also robust to outliers, as long as they do not
represent data errors.

       SEQUENCE OF STEPS

       Directions for the one-sample test for proportions for a simple random sample and a
systematic random sample are given in Box 3-10 and an example is given in Box 3-11.
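As a preview of Box 3-10, the continuity-corrected z statistic can be sketched as follows with a hypothetical data set (n = 85 with k = 77 values at or below C are illustrative assumptions, and the denominator uses the standard form P0(1 - P0)/n):

```python
# Sketch of the one-sample proportion test (Case 1, H0: P <= P0) with a
# hypothetical sample; 1.645 is z_{0.95} for alpha = 0.05.
from math import sqrt

P0 = 0.95              # hypothesized proportion at or below C
n, k = 85, 77          # hypothetical sample size and count at or below C
p = k / n

assert n * p >= 5 and n * (1 - p) >= 5   # Step 2 validity check

z = (p - 0.5 / n - P0) / sqrt(P0 * (1 - P0) / n)   # Step 3, Case 1
reject = z > 1.645
print(round(z, 2), reject)
```

Here p is about 0.906, the statistic is negative, and the Case 1 null hypothesis is not rejected; the false acceptance error rate would then need to be verified.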

3.2.3   Tests for a Median

       A population median (μ̃) is another measure of the center of the population distribution.
This population parameter is less sensitive to extreme values and nondetects than the sample
mean.  Therefore, this parameter is sometimes used instead of the mean when the data contain a
large number of nondetects or extreme values. The hypotheses considered in this section are:

              Case 1: H0: μ̃ ≤ C  vs. HA:  μ̃ > C; and

              Case 2: H0: μ̃ ≥ C  vs. HA:  μ̃ < C,

where C represents a given threshold such as a regulatory level.

       It is worth noting that the median is the 50th percentile, so the methods described in
Section 3.2.2 may be used to test hypotheses concerning the median by letting P₀ = 0.50.  In this
case, the one-sample test for proportions is also called the Sign Test for a median. The Wilcoxon
signed  rank test (Section 3.2.1.2) can also be applied to  a median in the same manner as it is
applied to a mean. In
                      Box 3-10: Directions for the One-Sample Test for Proportions
                              for Simple and Systematic Random Samples

   This box describes the steps for applying the one-sample test for proportions for Case 1 (H₀:  P ≤ P₀);
   modifications for Case 2 (H₀: P ≥ P₀) are given in braces { }.

   STEP 1:                 Given a random sample X₁, X₂, . . . , Xₙ of measurements from the population, let
                           p (small p) denote the proportion of X's that do not exceed C, i.e., p is the number
                           (k) of sample points that are less than or equal to C, divided by the sample size n.

   STEP 2:                 Compute np and n(1-p).  If both np and n(1-p) are greater than or equal to 5, use
                           Steps 3 and 4. Otherwise, consult a statistician as the analysis may be complex.


   STEP 3:                 Calculate  z = (p - .5/n - P₀) / √(P₀(1-P₀)/n)  for Case 1, or
                           z = (p + .5/n - P₀) / √(P₀(1-P₀)/n)  for Case 2.

   STEP 4:                 Use Table A-1 of Appendix A to find the critical value z_(1-α) such that 100(1-α)%
                           of the normal distribution is below z_(1-α).  For example, if α = 0.05 then
                           z_(1-α) = 1.645.

               If z > z_(1-α) {z < -z_(1-α)}, the null hypothesis may be rejected. Go to Step 6.

               If z ≤ z_(1-α) {z ≥ -z_(1-α)}, there is not enough evidence to reject the null hypothesis.  Therefore, the
               false acceptance error rate will need to be verified.  Go to Step 5.

   STEP 5:                 To calculate the power of the test, assume that the true values for the mean and
                           standard deviation are those obtained in the sample and use a statistical software
                           package  like the DEFT software (EPA, 1994) or the DataQUEST software  (EPA,
                           1996) to generate the power curve of the test.

               If only one false acceptance error rate (β) has been specified (at P₁), it is possible to calculate
               the sample size which achieves the DQOs.  To do this, calculate

                   m = [ z_(1-α)√(P₀(1-P₀)) + z_(1-β)√(P₁(1-P₁)) ]² / (P₁ - P₀)²

               If m ≤ n, the false acceptance error rate has been satisfied. Otherwise, the false acceptance
               error rate has not been satisfied.

   STEP 6:                 The results of the test may be:

               1) the null  hypothesis was rejected and it seems that the proportion is greater than {less than}
               P₀;

               2) the null  hypothesis was not rejected, the false acceptance error rate was satisfied, and it
               seems that the proportion is less than {greater than} P₀; or

               3) the null  hypothesis was not rejected, the false acceptance error rate was not satisfied, and it
               would seem the proportion is less than {greater than} P₀, but the conclusions are uncertain
               because the sample size was too small.
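The computation in Steps 1 through 4 of Box 3-10 can be sketched in Python using only the standard library. The helper name below is ours, not part of any EPA software, and the data reproduce the Box 3-11 example that follows (11 of 85 samples exceeding the standard, tested as Case 2):

```python
import math

def one_sample_proportion_z(k, n, p0, case):
    """z statistic for the one-sample proportion test (Box 3-10).

    case 1 tests H0: P <= P0 (continuity correction -0.5/n);
    case 2 tests H0: P >= P0 (continuity correction +0.5/n).
    """
    p = k / n
    # The normal approximation is only reasonable when np and n(1-p) >= 5.
    if n * p < 5 or n * (1 - p) < 5:
        raise ValueError("consult a statistician: normal approximation is poor")
    correction = -0.5 / n if case == 1 else 0.5 / n
    return (p + correction - p0) / math.sqrt(p0 * (1 - p0) / n)

# Data as in Box 3-11: 11 of 85 samples exceed the standard, H0: P >= 0.20.
z = one_sample_proportion_z(k=11, n=85, p0=0.20, case=2)
# z is about -1.49; since z > -1.645, H0 is not rejected at alpha = 0.05.
```

The small difference from Box 3-11's -1.492 comes from Box 3-11 rounding p to .1294 before substituting.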
                    Box 3-11: An Example of the One-Sample Test for Proportions
                                  for a Simple Random Sample

    Consider 85 samples of which 11 samples have concentrations greater than the clean-up standard. These
    data will be used to test the null hypothesis H₀: P ≥ .20 vs. H_A: P < .20 (Case 2). The decision maker
    has specified a 5% false rejection rate (α) for P₀ = .2, and a false acceptance rate (β) of 20% for P₁ =
    0.15.

    STEP 1:    From  the data, the observed  proportion (p) is p = 11/85 = .1294

    STEP 2:       np = (85)(.1294) = 1 1 and n(1-p) = (85)(1-.1294) = 74.  Since both np and n(1-p) are
                  greater than or equal to  5, Steps 3 and 4 will be used.

    STEP 3:       Because H₀: P ≥ .20, Case 2 formulas will be used.

                      z = (p + .5/n - P₀) / √(P₀(1-P₀)/n) = (.1294 + .5/85 - .2) / √(.2(1-.2)/85) = -1.492

    STEP 4:       Using Table A-1 of Appendix A, it was found that z_(1-.05) = z_(.95) = 1.645. Because z is
                  not less than -z_(.95) (i.e., -1.492 > -1.645), the null hypothesis is not rejected, so Step 5
                  will need to be completed.

    STEP 5:       To determine whether the test was powerful enough, the sample size necessary to
                  achieve the DQOs was calculated as follows:

                      m = [ 1.645√(.2(1-.2)) + 1.04√(.15(1-.15)) ]² / (.15 - .2)²  =  422.18

                So 423 samples are required, many more than were actually taken.

    STEP 6:        The null hypothesis was not rejected and the false acceptance error rate was not
                   satisfied. Therefore, it would seem the proportion is greater than 0.2, but this
                   conclusion is uncertain because the sample size is too small.
addition, this test is more powerful than the Sign Test for symmetric distributions.  Therefore, the
Wilcoxon signed rank test is the preferred test for the median.

3.2.4  Confidence Intervals

       In some instances, a test of hypotheses for the estimated parameter (e.g., a mean or the
difference of two means) is not required, but an idea of the uncertainty of the estimate with respect
to the parameter is needed. The most common type of interval estimate is a confidence interval. A
confidence interval may be regarded as combining an interval around an estimate with a
probabilistic statement about the unknown parameter. When interpreting a confidence interval
statement such as  "The 95% confidence interval for the mean is 19.1 to 26.3", the implication is
that the best estimate for the unknown population mean is 22.7 (halfway between 19.1 and 26.3),
and that we are  95% certain that the interval 19.1 to 26.3 captures the unknown population mean.
Box 3-12 gives directions on how to calculate a confidence interval for the mean; Box 3-13
gives an example of the method.
       The concept of a confidence interval can be shown by a simple example. Suppose a stable
situation producing data without any anomalies was sampled many times. Each time a sample
was taken, the mean and standard deviation were calculated from the sample and a confidence
interval constructed using the method of Box 3-12.
                      Box 3-12:  Directions for a Confidence Interval for a Mean
                           for Simple and Systematic Random Samples

 Let X₁, X₂, ..., Xₙ represent a sample of size n from a population of normally distributed values.

 Step 1:  Use the directions in Box 2-2 to calculate the sample mean, X. Use the directions in Box 2-3 to
        calculate the sample standard deviation, s.

 Step 2:  Use Table A-1  of Appendix A to find the critical value t_(1-α/2) such that 100(1-α/2)% of the t distribution
         with n - 1 degrees of freedom is below t_(1-α/2). For example, if α = 0.10 and n = 16, then n - 1 = 15 and
         t_(1-α/2) = 1.753.

 Step 3:  The (1-α)100% confidence interval is:   X̄ - t_(1-α/2) s/√n   to   X̄ + t_(1-α/2) s/√n
                     Box 3-13: An Example of a Confidence Interval for a Mean
                           for Simple and Systematic Random Samples

 The effluent from a discharge point in a plating manufacturing plant was sampled 7 times over the course of 4
 days for the presence of arsenic with the following results: 8.1, 7.9, 7.9, 8.2, 8.2, 8.0, 7.9. The directions in Box
 3-12 will be used to develop a 95% confidence interval for the mean.

 Step 1: Using Box 2-2, X̄ = 8.03.  Using Box 2-3, s = 0.138.

 Step 2: Using Table A-1 of Appendix A and 6 degrees of freedom, t_(1-α/2) = 2.447.

 Step 3: The (1-α)100% confidence interval is:

                 8.03 - (2.447 x 0.138)/√7   to   8.03 + (2.447 x 0.138)/√7,   or 7.902 to 8.158.
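The arithmetic of Boxes 3-12 and 3-13 can be cross-checked with a short Python sketch using only the standard library. Because the standard library has no t quantile function, the critical value is copied from Table A-1, as in Step 2:

```python
import math
import statistics

# Arsenic data from Box 3-13 (seven effluent measurements).
data = [8.1, 7.9, 7.9, 8.2, 8.2, 8.0, 7.9]
n = len(data)
xbar = statistics.mean(data)      # sample mean, about 8.03
s = statistics.stdev(data)        # sample standard deviation, about 0.138
t_crit = 2.447                    # t_(1-0.05/2) with n-1 = 6 df, from Table A-1
half_width = t_crit * s / math.sqrt(n)
lo, hi = xbar - half_width, xbar + half_width
# The interval is roughly 7.90 to 8.16; Box 3-13 reports 7.902 to 8.158
# because it rounds the mean to 8.03 before substituting.
```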
3.3    TESTS FOR COMPARING TWO POPULATIONS

       A two-sample test involves the comparison of two populations or a "before and after"
comparison.  In environmental applications, the two populations to be compared may be a
potentially contaminated area with a background area or concentration levels from an upgradient
and a downgradient well.  The comparison of the two populations may be based on a statistical
parameter that characterizes the relative location (e.g., a mean or median), or it may be based on a
distribution-free comparison of the two population distributions. Tests that do not assume an
underlying distribution (e.g., normal or lognormal) are called distribution-free or nonparametric
tests. These tests are often more useful for comparing two populations than those that assume a
specific distribution because they make less stringent assumptions. Section 3.3.1 covers tests for
differences in the means of two populations.  Section 3.3.2 covers tests for differences in the
proportion or percentiles of two populations.  Section 3.3.3 describes distribution-free
comparisons of two populations. Section 3.3.4 describes tests for comparing two medians.

       Often, a two-sample test involves the comparison of the difference of two population
parameters to a threshold value. For environmental applications, the threshold value is often zero,
representing the case where the data are used to determine which of the two population
parameters is greater than the other.  For example, concentration levels from a Superfund site may
be compared to a background site. Then, if the Superfund site levels exceed the background
levels, the site requires further investigation. A two-sample test may also be used to compare
readings from two instruments or two separate populations of people.

       If the exact same sampling locations are used for both populations, then the two samples
are not independent.  This case should be converted to a one-sample problem by applying the
methods described in Section 3.2 to the differences between the two populations at the same
location. For example, one could compare contaminant levels from several wells after treatment
to contaminant levels from the  same wells before treatment. The methods described in Section
3.2 would then be applied to the differences between the before and after treatment contaminant
levels for each well.
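The conversion described above can be sketched in Python with only the standard library. The before/after well data below are hypothetical, introduced purely for illustration:

```python
import math
import statistics

# Hypothetical before/after contaminant levels (ppm) at the same five wells.
before = [12.1, 10.3, 15.2, 11.8, 9.9]
after  = [8.4, 9.1, 12.0, 10.2, 7.5]

# The same wells are sampled twice, so the two samples are not independent:
# work with the per-well differences and apply the one-sample methods of
# Section 3.2 to those differences.
diffs = [b - a for b, a in zip(before, after)]
d_bar = statistics.mean(diffs)             # mean reduction, 2.42 ppm here
s_d = statistics.stdev(diffs)
t = d_bar / (s_d / math.sqrt(len(diffs)))  # one-sample t statistic, about 5.15
```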

3.3.1   Comparing Two Means

       Let μ₁ represent the mean of population 1 and μ₂ represent the mean of population 2. The
hypotheses considered in this section are:

       Case 1: H₀:   μ₁ - μ₂ ≤ δ₀ vs. H_A: μ₁ - μ₂ > δ₀; and

       Case 2: H₀:   μ₁ - μ₂ ≥ δ₀ vs. H_A: μ₁ - μ₂ < δ₀.

An example of a two-sample test for population means is comparing the mean contaminant level
at a remediated Superfund site to a background site; in this case, δ₀ would be zero.  Another
example is a Record of Decision for a Superfund site which specifies that the remediation
technique  must reduce the mean contaminant level by 50 ppm each year. Here, each year would
be considered a separate population and δ₀ would be 50 ppm.

       The information required for these tests includes the null and alternative hypotheses (either
Case 1 or  Case 2); the gray region (i.e., a value δ₁ > δ₀ for Case 1 or a value δ₁ < δ₀ for Case 2
representing the bound of the gray region);  the false rejection error rate  α at δ₀; the false
acceptance error rate β at δ₁; and any additional limits on decision errors.  It may be helpful to
label additional false rejection error limits as α₂ at δα₂, α₃ at δα₃, etc., and to label additional false
acceptance error limits as β₂ at δβ₂, β₃ at δβ₃, etc.
       3.3.1.1       Student's Two-Sample t-Test (Equal Variances)

       PURPOSE

       Student's two-sample t-test can be used to compare two population means based on the
independent random samples X₁, X₂, . . . , Xₘ from the first population, and Y₁, Y₂, . . . , Yₙ from
the second population.  This test assumes the variabilities (as expressed by the variance) of the
two populations are approximately equal. If the two variances are not equal (a test is described in
Section 4.5), use Satterthwaite's t-test (Section 3.3.1.2).

       ASSUMPTIONS AND THEIR VERIFICATION

       The principal assumption required for the two-sample t-test is that a random sample of
size m (X₁, X₂, . . . , Xₘ) is drawn from population 1, and an independent random sample of size n
(Y₁, Y₂, . . . , Yₙ) is drawn from population 2. Validity of the random sampling and independence
assumptions should be confirmed by reviewing the procedures used to select the sampling points.

       The second assumption required for the two-sample t-test is that the sample means X̄
(sample 1) and Ȳ (sample 2) are approximately normally distributed. If both m and n are large,
one may make this assumption without further verification. For small sample sizes, approximate
normality of the sample means can be checked by testing the  normality of each of the two
samples.

       LIMITATIONS AND ROBUSTNESS

       The two-sample t-test with equal variances is robust to violations of the assumptions of
normality and equality of variances.  However, if the investigator has tested and rejected
normality or equality of variances,  then nonparametric procedures may be applied.  The t-test is
not robust to outliers because sample means and standard deviations are sensitive to outliers.

       SEQUENCE OF STEPS

       Directions for the two-sample t-test for a simple random sample and a systematic simple
random sample are given in Box 3-14 and an example in Box 3-15.

       3.3.1.2       Satterthwaite's Two-Sample t-Test (Unequal Variances)

       Satterthwaite's t-test should be used to compare two population means when the variances
of the two populations are not equal. It requires the same assumptions as the two-sample t-test
(Section 3.3.1.1) except the assumption of equal variances.

       Directions for Satterthwaite's t-test for a simple random sample and a systematic simple
random sample are given in Box 3-16 and an example in Box 3-17.
               Box 3-14:  Directions for the Student's Two-Sample t-Test (Equal Variances)
                              for Simple and Systematic Random Samples

     This describes the steps for applying the two-sample t-test for differences between the population
     means when the two population variances are equal for Case 1 (H₀:  μ₁ - μ₂ ≤ δ₀). Modifications for Case 2
     (H₀: μ₁ - μ₂ ≥ δ₀) are given in braces {}.

     STEP 1:         Calculate the sample mean X̄ and the sample variance s_X² for sample 1 and compute
                     the sample mean Ȳ and the sample variance s_Y² for sample 2.

     STEP 2:         Use Section 4.5 to determine if the variances of the two populations are equal.  If the
                     variances of the two populations are not equal, use Satterthwaite's t-test (Section
                     3.3.1.2). Otherwise, compute the pooled standard deviation

                         s_E = √[ ((m-1)s_X² + (n-1)s_Y²) / ((m-1) + (n-1)) ]

     STEP 3:         Calculate  t = (X̄ - Ȳ - δ₀) / (s_E √(1/m + 1/n))

                Use Table A-1 of Appendix A to find the critical value t_(1-α) such that 100(1-α)% of the t-
                distribution with (m+n-2) degrees of freedom is below t_(1-α).

                If t > t_(1-α) {t < -t_(1-α)}, the null hypothesis may be rejected. Go to Step 5.

                If t ≤ t_(1-α) {t ≥ -t_(1-α)}, there is not enough evidence to reject the  null hypothesis.  Therefore, the
                false acceptance error rate will need to be verified. Go to  Step 4.

     STEP 4:         To calculate the power of the test, assume that the true values for the mean and
                     standard deviation are those obtained in the sample and use a statistical software
                     package like the DEFT software (EPA, 1994) or the DataQUEST software (EPA, 1996)
                     to generate the power curve of the two-sample t-test.  If only one false acceptance error
                     rate (β) has been specified (at δ₁), it is possible to calculate the sample size which
                     achieves the DQOs, assuming the true mean and standard deviation are equal to the
                     values estimated from the sample, instead of calculating the power of the test.
                     Calculate

                         m* = n* = [ 2 s_E² (z_(1-α) + z_(1-β))² ] / (δ₁ - δ₀)²  +  (0.25) z_(1-α)²

                If m* ≤ m and n* ≤ n, the false acceptance error rate has been satisfied. Otherwise, the false
                acceptance error rate has not been satisfied.

     STEP 5:        The results of the test could be:

                1) the null hypothesis was rejected, and it seems μ₁ - μ₂ > δ₀ {μ₁ - μ₂ < δ₀};

                2) the null hypothesis was not rejected, the false acceptance error rate was satisfied, and it
                seems μ₁ - μ₂ ≤ δ₀ {μ₁ - μ₂ ≥ δ₀}; or

                3) the null hypothesis was not rejected, the false acceptance error rate was not satisfied, and
                it seems μ₁ - μ₂ ≤ δ₀ {μ₁ - μ₂ ≥ δ₀}, but this conclusion is uncertain because the sample size
                was too small.

                Box 3-15:  An Example of a Student's Two-Sample t-Test (Equal Variances)
                              for Simple and Systematic Random Samples

     At a hazardous waste site, area 1 (cleaned using an in-situ methodology) was compared with a similar
     (but relatively uncontaminated) reference area, area 2.  If the in-situ methodology worked, then the two
     sites should be approximately equal in average contaminant levels.  If the methodology did not work,
     then area 1 should have a higher average than the reference area.  Seven random samples were taken
     from area 1, and eight were taken from area 2. Because the contaminant concentrations in the two areas
     are supposedly equal, the null hypothesis is H₀: μ₁ - μ₂ ≤ 0 (Case 1).  The false rejection error rate (α) was
     set at 5% and the false acceptance error rate (β) was set at 20% if the difference between the areas is 2.5
     ppb.

     STEP 1:                               Sample Mean            Sample Variance
                    Area 1             7.8 ppm                  2.1 ppm2
                    Area 2             6.6 ppm                  2.2 ppm2

     STEP 2:        Methods described in Section 4.5 were used to determine that the variances were
                    essentially equal.  Therefore,

                        s_E = √[ ((7-1)2.1 + (8-1)2.2) / ((7-1) + (8-1)) ] = 1.4676

     STEP 3:          t = (7.8 - 6.6 - 0) / (1.4676 √(1/7 + 1/8)) = 1.5798

                Table A-1 of Appendix A was used to find that the critical value t_(0.95) with (7 + 8 - 2) = 13
                degrees of freedom is 1.771.

                Because t ≤ t_(1-α) (i.e., 1.5798 < 1.771), there is not enough evidence to reject the null
                hypothesis. The false acceptance error rate will need to be verified.

     STEP 4:        Assuming the true values for the mean and standard deviation are those obtained in the
                    sample:

                        m* = n* = [ 2(1.4676)²(1.645 + 0.842)² ] / (2.5 - 0)²  +  (0.25)(1.645)²  =  4.938, i.e., 5.

                Because m* ≤ m (7) and n* ≤ n (8), the false acceptance error rate has been satisfied.

     STEP 5:        The null hypothesis was not rejected and the false acceptance error rate was satisfied.
                    Therefore, it seems there is no difference between the two areas and that the in-situ
                    methodology worked as expected.
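The arithmetic of Boxes 3-14 and 3-15 can be sketched in Python with only the standard library; the function name below is ours, introduced for illustration:

```python
import math

def pooled_t(xbar, sx2, m, ybar, sy2, n, delta0=0.0):
    """Student's two-sample t statistic with equal variances (Box 3-14).

    Returns the t statistic and the pooled standard deviation s_E.
    """
    s_e = math.sqrt(((m - 1) * sx2 + (n - 1) * sy2) / ((m - 1) + (n - 1)))
    t = (xbar - ybar - delta0) / (s_e * math.sqrt(1 / m + 1 / n))
    return t, s_e

# Summary statistics from Box 3-15: area 1 (m = 7) vs. area 2 (n = 8).
t, s_e = pooled_t(xbar=7.8, sx2=2.1, m=7, ybar=6.6, sy2=2.2, n=8)
# s_e is about 1.468 and t about 1.580; since t < 1.771 (t_(0.95) with 13 df),
# there is not enough evidence to reject H0.
```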
                    Box 3-16: Directions for Satterthwaite's t-Test (Unequal Variances)
                              for Simple and Systematic Random Samples

     This describes the steps for applying the two-sample t-test for differences between the population means
     for Case 1 (H₀: μ₁ - μ₂ ≤ δ₀). Modifications for Case 2 (H₀:  μ₁ - μ₂ ≥ δ₀) are given in braces {}.

     STEP 1:         Calculate the sample mean X̄ and the sample variance s_X² for sample 1 and compute
                     the sample mean Ȳ and the sample variance s_Y² for sample 2.

     STEP 2:         Using Section 4.5, test whether the variances of the two populations are equal. If the
                     variances of the two populations are not equal, compute:

                         s_NE = √( s_X²/m + s_Y²/n )

                If the variances of the two populations appear approximately equal, use Student's two-
                sample t-test (Section 3.3.1.1, Box 3-14).

     STEP 3:        Calculate  t = (X̄ - Ȳ - δ₀) / s_NE

                Use Table A-1  of Appendix A to find the critical value t_(1-α) such that 100(1-α)% of the t-
                distribution with f degrees of freedom is below t_(1-α), where

                    f = (s_X²/m + s_Y²/n)² / [ s_X⁴/(m²(m-1)) + s_Y⁴/(n²(n-1)) ]

                (Round f down to the nearest integer.)

                If t > t_(1-α) {t < -t_(1-α)}, the null hypothesis may be rejected.  Go to Step 5.

                If t ≤ t_(1-α) {t ≥ -t_(1-α)}, there is not enough evidence to reject the null hypothesis and therefore,
                the false acceptance error rate will need to be verified.  Go to Step 4.

     STEP 4:         If the null  hypothesis (H0) was not rejected, calculate either the power of the test or the
                     sample size necessary to achieve the false rejection and false acceptance error rates.
                     To calculate the power of the test, assume that the true values for the mean and
                     standard deviation are those obtained in the sample and use a statistical software
                     package to generate the power curve of the two-sample t-test. A simple method to
                     check on statistical power does  not exist.

     STEP 5:         The results of the test could be:

                1) the null hypothesis was rejected, and it seems μ₁ - μ₂ > δ₀ {μ₁ - μ₂ < δ₀};

                2) the null hypothesis was not rejected, the false acceptance error rate was satisfied, and it
                seems μ₁ - μ₂ ≤ δ₀ {μ₁ - μ₂ ≥ δ₀}; or

                3) the null hypothesis was not rejected, the false acceptance error rate was not satisfied,
                and it seems μ₁ - μ₂ ≤ δ₀ {μ₁ - μ₂ ≥ δ₀}, but this conclusion is uncertain because the sample
                size was too small.

                  Box 3-17: An Example of Satterthwaite's t-Test (Unequal Variances)
                            for Simple and Systematic Random Samples

    At a hazardous waste site, area 1 (cleaned using an in-situ methodology) was compared with a similar
    (but relatively uncontaminated) reference area, area 2.  If the in-situ methodology worked, then the two
    sites should be approximately equal in average contaminant levels. If the methodology did not work,
    then area 1 should have a higher average than the reference area. Seven random samples were taken
    from area 1, and eight were taken from area 2. Because the contaminant concentrations in the two areas
    are supposedly equal, the null hypothesis is H₀:  μ₁ - μ₂ ≤ 0 (Case 1). The false rejection error rate (α) was
    set at 5% and the false acceptance error rate (β) was set at 20% if the difference between the areas is 2.5
    ppb.

    STEP 1:                              Sample Mean           Sample Variance
                   Area 1             9.2 ppm                  1.3 ppm²
                   Area 2             6.1 ppm                  5.7 ppm²

    STEP 2:        Using Section 4.5, it was determined that the variances of the two populations were not
                   equal, and therefore using  Satterthwaite's method is appropriate:

                       s_NE = √(1.3/7 + 5.7/8) = 0.9477

    STEP 3:          t = (9.2 - 6.1 - 0) / 0.9477 = 3.271

               Table A-1 was used with f degrees of freedom, where

                   f = (1.3/7 + 5.7/8)² / [ 1.3²/(7²(7-1)) + 5.7²/(8²(8-1)) ] = 10.307 (i.e., 10 degrees of freedom)

               (recall that f is rounded down to the nearest integer), to find t_(1-α) = 1.812.

               Because t > t_(0.95) (3.271 > 1.812), the null hypothesis may be rejected.

    STEP 5:        Because the null hypothesis was rejected, it would appear there is a difference between
                   the two areas (area 1 being more contaminated than area 2, the reference area) and
                   that the in-situ methodology has not worked as intended.
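A corresponding sketch for Boxes 3-16 and 3-17, again using only the standard library (the function name is ours):

```python
import math

def satterthwaite(xbar, sx2, m, ybar, sy2, n, delta0=0.0):
    """Satterthwaite t statistic and approximate degrees of freedom (Box 3-16)."""
    s_ne = math.sqrt(sx2 / m + sy2 / n)
    t = (xbar - ybar - delta0) / s_ne
    f = (sx2 / m + sy2 / n) ** 2 / (
        sx2 ** 2 / (m ** 2 * (m - 1)) + sy2 ** 2 / (n ** 2 * (n - 1)))
    return t, math.floor(f)  # f is rounded down to the nearest integer

# Summary statistics from Box 3-17: area 1 (m = 7) vs. area 2 (n = 8).
t, f = satterthwaite(xbar=9.2, sx2=1.3, m=7, ybar=6.1, sy2=5.7, n=8)
# t is about 3.271 with f = 10; since t > 1.812 (t_(0.95) with 10 df),
# the null hypothesis may be rejected.
```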
3.3.2  Comparing Two Proportions or Percentiles

       This section considers hypotheses concerning two population proportions (or two
population percentiles); for example, one might use these tests to compare the proportion of
children with elevated blood lead in one urban area compared with the proportion of children with
elevated blood lead in another area. The population proportion is the ratio of the number of
elements in a subset of the total population to the total number of elements, where the subset has
some specific characteristic that the rest of the elements do not. A population percentile
represents the percentage of elements of a population having values less than some threshold
value C.
       Let P₁ represent the true proportion for population 1, and P₂ represent the true proportion
of population 2. The hypotheses considered in this section are:

       Case 1:  H₀:  P₁ - P₂ ≤ δ₀ vs.  H_A: P₁ - P₂ > δ₀; and

       Case 2:  H₀:  P₁ - P₂ ≥ δ₀ vs.  H_A: P₁ - P₂ < δ₀

where  δ₀ is some numerical value. An equivalent null hypothesis for Case 1, written in terms of
percentiles, is H₀:  the 100P₁th percentile minus the 100P₂th percentile is C or larger, the reverse
applying to Case 2.  Since any hypothesis about the proportion below a threshold can be
converted to an equivalent hypothesis about percentiles (see Section 3.2.2), this guidance will
only consider hypotheses concerning proportions.

       The information required for this test includes the null and alternative hypotheses (either
Case 1 or Case 2); the gray region (i.e., a value δ₁ > δ₀ for Case 1 or a value δ₁ < δ₀ for Case 2,
representing the bound of the gray region); the false rejection error rate α at δ₀; the false
acceptance error rate β at δ₁; and any additional limits on decision errors.

       3.3.2.1       Two-Sample Test for Proportions

       PURPOSE

       The two-sample test for proportions can be used to compare two population percentiles or
proportions and is based on an independent random sample of size m (X₁, X₂, . . . , Xₘ) from the
first population and an independent random sample of size n (Y₁, Y₂, . . . , Yₙ) from the second
population.
       ASSUMPTIONS AND THEIR VERIFICATION

       The principal assumption is that of random sampling from the two populations.

       LIMITATIONS AND ROBUSTNESS

       The two-sample test for proportions is valid (robust) for any underlying distributional
shape and is  robust to outliers, providing they are not  pure data errors.

       SEQUENCE OF STEPS

       Directions for a two-sample test for proportions for a simple random  sample and a
systematic simple random sample are given in Box 3-18; an example  is provided in Box 3-19.
                       Box 3-18:  Directions for a Two-Sample Test for Proportions
                               for Simple and Systematic Random Samples

     The following describes the steps for applying the two-sample test for proportions for Case 1 (H₀:  P₁ - P₂
     ≤ 0). Modifications for Case 2 (H₀:  P₁ - P₂ ≥ 0) are given in braces {}.

     STEP 1:    Given m random samples X₁, X₂, .  . . , Xₘ from the first population, and n samples from the
                second  population, Y₁, Y₂, . . .  , Yₙ, let k₁ be the number of points from sample 1 which
                exceed  C, and let k₂ be the number of points from sample 2 which exceed C. Calculate the
                sample  proportions p₁ = k₁/m and p₂ = k₂/n. Then calculate the pooled proportion

                    p = (k₁ + k₂) / (m + n).

     STEP 2:    Compute mp₁, m(1-p₁), np₂, and n(1-p₂).  If all of these values are greater than or equal to 5,
                continue.  Otherwise, seek assistance from a statistician as the analysis is complicated.
     STEP 3:    Calculate   z = (p₁ - p₂) / √( p(1-p)(1/m + 1/n) ).

                Use Table A-1 of Appendix A to find the critical value z_(1-α) such that 100(1-α)% of the normal
                distribution is below z_(1-α). For example, if α = 0.05 then z_(1-α) = 1.645.

                If z > z_(1-α) {z < -z_(1-α)}, the null hypothesis may be rejected. Go to Step 5.

                If z ≤ z_(1-α) {z ≥ -z_(1-α)}, there is not enough evidence to reject the null hypothesis.  Therefore, the
                false acceptance error rate will need to be verified.  Go to Step 4.

     STEP 4:    If the null hypothesis (H0) was not rejected, calculate either the power of the test or the
                sample size necessary to achieve the false rejection and false acceptance error rates.  If only
                one false acceptance error rate (β) has been specified at P1 - P2, it is possible to calculate
                the sample sizes that achieve the DQOs (assuming the proportions are equal to the values
                estimated from the sample) instead of calculating the power of the test.  To do this, calculate

                      m* = n* = 2(z1-α + z1-β)² P̄(1 - P̄) / (P1 - P2)²   where   P̄ = (P1 + P2)/2

                and zp is the pth percentile of the standard normal distribution (Table A-1 of Appendix A).  If
                both m and n exceed m*, the false acceptance error rate has been satisfied.  If both m and n
                are below m*, the false acceptance error rate has not been satisfied.

                If m* is between m and n, use a software package like the DEFT software (EPA, 1994) or the
                DataQUEST software (EPA, 1996) to calculate the power of the test, assuming that the true
                values for the proportions P1 and P2 are those obtained in the sample.  If the estimated
                power is below 1-β, the false acceptance error rate has not been satisfied.

     STEP 5:    The results of the test could be:

                1) the null hypothesis was rejected, and it seems the difference in proportions is greater
                than 0 {less than 0};

                2) the null hypothesis was not rejected, the false acceptance error rate was satisfied, and it
                seems the difference in proportions is less than or equal to 0 {greater than or equal to 0}; or

                3) the null hypothesis was not rejected, the false acceptance error rate was not satisfied,
                and it seems the difference in proportions is less than or equal to 0 {greater than or equal to
                0}, but this outcome is uncertain because the sample size was probably too small.
                       Box 3-19:  An Example of a Two-Sample Test for Proportions
                              for Simple and Systematic Random Samples

     At a hazardous waste site, investigators must determine whether an area suspected to be contaminated
     with dioxin needs to be remediated. The possibly contaminated area (area 1) will be compared to a
     reference area (area 2) to see if dioxin levels in area 1 are greater than dioxin levels in the reference area.
     An inexpensive surrogate probe was used to determine if each individual sample is either "contaminated,"
     i.e., over the health standard of 1 ppb, or "clean," i.e., less than the health standard of 1 ppb.  The null
     hypothesis will be that the proportion of contaminated samples in area 1 is less than or equal to the
     proportion in area 2, or H0:  P1 - P2 ≤ 0 (Case 1).  The decision maker is willing to accept a false rejection
     decision error rate of 10% (α) and a false acceptance decision error rate of 5% (β) when the difference in
     proportions between areas exceeds 0.10.  A team collected 92 readings from area 1 (of which 12 were
     contaminated) and 80 from area 2, the reference area (of which 10 were contaminated).

     STEP 1:    The sample proportion for area 1 is p1 = 12/92 = 0.130, the sample proportion for area 2 is
                p2 = 10/80 = 0.125, and the pooled proportion is p = (12 + 10) / (92 + 80) = 0.128.

     STEP 2:    imp., = 12, m(1-p.,)  = 80,  np2 = 10, n(1-p2) =70. Because these values are greater than or
                equal to 5, continue to step 3.
     STEPS:     z  = (0.130 -  0.125) / JO.128(1 -  0.128)(l/92 +  1/80) = 0.098

                Table A-1 of Appendix A was used to find the critical value zogo = 1.282.

                Because z > zogo (0.098 > 1.282),  there is not enough evidence to reject the null hypothesis
                and the false acceptance error rate will need to be verified. Go to Step 4.

     STEP 4:    Because the null hypothesis (H0) was not rejected, calculate the sample size necessary to
                achieve the false rejection and false acceptance error rates.  Because only one false
                acceptance error rate (β = 0.05) has been specified (at a difference of P1 - P2 = 0.1), it is
                possible to calculate the sample sizes that achieve the DQOs, assuming the proportions are
                equal to the values estimated from the sample:

                      m* = n* = 2(1.282 + 1.645)²(0.1275)(1 - 0.1275) / (0.1)² = 190.6 (i.e., 191 samples)

                where   P̄ = (0.130 + 0.125)/2 = 0.1275

                Because both m and n are less than m*, the false acceptance error rate has not been
                satisfied.

     STEP 5:    The null hypothesis was not rejected, and the false acceptance error rate was not satisfied.
                Therefore, it seems that there is no difference in proportions and that the contaminant
                concentrations of the investigated  area and the  reference area are probably the same.
                However, this outcome is uncertain because the sample sizes obtained were in all likelihood
                too small.
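The calculations in Boxes 3-18 and 3-19 can be sketched in Python.  This is an illustration only, not part of the guidance; the function names are our own, and the standard library's NormalDist replaces the Table A-1 lookup:

```python
from math import ceil, sqrt
from statistics import NormalDist

def two_sample_proportion_test(k1, m, k2, n, alpha=0.10):
    """Case 1 (H0: P1 - P2 <= 0) test statistic of Box 3-18, Steps 1-3."""
    p1, p2 = k1 / m, k2 / n
    p = (k1 + k2) / (m + n)                       # pooled proportion
    # Step 2 screen: all four quantities should be at least 5
    assert min(m * p1, m * (1 - p1), n * p2, n * (1 - p2)) >= 5
    z = (p1 - p2) / sqrt(p * (1 - p) * (1 / m + 1 / n))
    z_crit = NormalDist().inv_cdf(1 - alpha)      # z_(1-alpha) from Table A-1
    return z, z_crit, z > z_crit                  # reject H0 only if z > z_(1-alpha)

def sample_size(p1, p2, diff, alpha=0.10, beta=0.05):
    """m* = n* of Step 4, for a specified difference P1 - P2 = diff."""
    nd = NormalDist()
    p_bar = (p1 + p2) / 2
    return ceil(2 * (nd.inv_cdf(1 - alpha) + nd.inv_cdf(1 - beta)) ** 2
                * p_bar * (1 - p_bar) / diff ** 2)

# Box 3-19 data: 12 of 92 contaminated in area 1, 10 of 80 in area 2
z, z_crit, reject = two_sample_proportion_test(12, 92, 10, 80)
m_star = sample_size(12 / 92, 10 / 80, diff=0.10)
```

With the Box 3-19 data this gives z slightly above the box's 0.098 (the box used rounded proportions), z_crit of about 1.282, no rejection, and m* = 191 samples, matching Step 4 of the example.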
3.3.3   Nonparametric Comparisons of Two Populations

       In many cases, assumptions on distributional characteristics are difficult to verify or
difficult to satisfy for both populations. In this case, several distribution-free test procedures are
available that compare the shape and location of the two distributions instead of a statistical
parameter (such as a mean or median).  The statistical tests described below test the null
hypothesis "H0: the distributions of population 1 and population 2 are identical (or, the site is not
more contaminated than background)" versus the alternative hypothesis "HA: part of the
distribution of population 1 is located to the right of the distribution of population 2 (or the site is
more contaminated than background)." Because of the structure of the hypothesis tests, the
labeling of populations 1 and 2 is of importance. For most environmental applications, population
1 is the area of interest (i.e., the potentially contaminated area) and population 2 is the reference
area.

       There is no formal statistical parameter of interest in the hypotheses stated above.
However, the concept of false rejection and false acceptance error rates still applies.

       3.3.3.1        The Wilcoxon Rank Sum Test

       PURPOSE

       The Wilcoxon rank sum test can be used to compare two population distributions based
on m independent random samples X1, X2, . . . , Xm from the first population, and n independent
random samples Y1, Y2, . . . , Yn from the second population.  When applied with the Quantile test
(Section 3.3.3.2), the combined tests are most powerful for detecting true differences between
two population distributions.

       ASSUMPTIONS AND THEIR VERIFICATION

       The validity of the random sampling and independence assumptions should be verified by
review of the procedures used to select the sampling points. The two underlying distributions are
assumed to have the same shape and dispersion, so that one distribution differs by some fixed
amount (or is increased by a constant) when compared to the  other distribution.  For large
samples, to test whether both site distributions have approximately the same shape, one can create
and compare histograms for the samples.

       LIMITATIONS AND ROBUSTNESS

       The Wilcoxon rank sum test may produce misleading results if many data values are the
same. When values are the same, their relative ranks are  the same, and this has the effect of
diluting the statistical power of the Wilcoxon rank sum test. Estimated concentrations should be
reported for data below the detection limit, even if these estimates are negative, because their
relative magnitude to the rest of the  data is of importance. An important advantage of the
Wilcoxon rank sum test is its partial robustness to outliers, because the analysis is conducted  in

terms of rankings of the observations.  This limits the influence of outliers because a given data
point can be no more extreme than the first or last rank.

        SEQUENCE OF STEPS

        Directions and an example for the Wilcoxon rank sum test are given in Box 3-20 and Box
3-21. However, if a relatively large number of samples have been taken, it is more efficient in
terms of statistical power to use a large sample approximation to the Wilcoxon rank sum test
(Box 3-22) to obtain the critical values of W.
                         Box 3-20: Directions for the Wilcoxon Rank Sum Test
                             for Simple and Systematic Random Samples

  Let X1, X2, . . . , Xn represent the n data points from population 1 and Y1, Y2, . . . , Ym represent the m data points
  from population 2 where both n and m are less than or equal to 20.  For Case 1, the null hypothesis will be that
  population 1 is shifted to the left of population 2 with the alternative that population 1 is either the same as or
  shifted to the right of population 2; for Case 2, the null hypothesis will be that population 1 is shifted to the right of
  population 2 with the alternative that population 1 is the same as or shifted to the left of population 2; for Case 3,
  the null hypothesis will be that there is no difference between the two populations and the alternative hypothesis
  will be that population 1 is shifted either to the right or left of population 2.  If either m or n is larger than 20, use
  Box 3-22.

  STEP 1:                 List and rank the measurements from both populations from  smallest to largest,
                         keeping track of which population contributed each measurement.  The rank of 1 is
                         assigned to the smallest value, the rank of 2 to the second smallest value, and so
                         forth. If there are ties, assign the average of the ranks that would otherwise have
                         been assigned to the tied observations.

  STEP 2:                 Calculate R as the sum of the ranks of the data from population 1, then calculate

                          W = R - n(n+1)/2.

  STEP 3:                 Use Table A-7 of Appendix A to find the critical value wα (or wα/2 for Case 3).  For
                          Case 1, reject the null hypothesis if W > nm - wα.  For Case 2, reject the null
                          hypothesis if W < wα.  For Case 3, reject the null hypothesis if W > nm - wα/2 or
                          W < wα/2.  If the null hypothesis is rejected, go to Step 5.  Otherwise, go to Step 4.

  STEP 4:                 If the null hypothesis (H0) was not rejected, the power of the test or the sample size
                          necessary to achieve the false rejection and false acceptance error rates should be
                          calculated.  For small sample sizes, these calculations are too complex for this
                          document.

  STEP 5:                 The results of the test could be:

            1) the null hypothesis  was rejected and it seems that population 1 is shifted  to the right (Case 1), to
           the left (Case 2) or to the left or right (Case 3) of population 2.

           2) the null hypothesis  was not rejected and it seems that population 1  is shifted to the left (Case  1)
           or to the right (Case 2) of population 2,  or there is no difference between the two populations (Case
           3).
                         Box 3-21: An Example of the Wilcoxon Rank Sum Test
                              for Simple and Systematic Random Samples

     At a hazardous waste site, area 1 (cleaned using an in-situ methodology) was compared with a similar
     (but relatively uncontaminated) reference area, area 2.  If the in-situ methodology worked, then the two
     sites should be approximately equal in average contaminant levels.  If the methodology did not work,
     then area 1 should have a higher average than the reference area.  The null hypothesis will be that area 1
     is shifted to the right of area 2 and the alternative hypothesis will be that there is no difference between
     the two areas or that area 1 is shifted to the left of area 2 (Case 2).  The false rejection error rate was set
     at 10% (α) and the false acceptance error rate was set at 20% (β) if the difference between the areas is 2.5
     ppb.  Seven random samples were taken from area 1 and eight samples were taken from area 2:

                                      Area 1                  Area 2
                                    17,23,26,5             16,20,5,4
                                    13,13,12               8,10,7,3

     STEP 1:                The data listed and ranked  by size are (Area 1  denoted by *):

               Data (ppb):  3,  4,  5,   5*,   7,  8,  10,  12*,  13*,  13*,  16,  17*,  20,  23*,  26*
               Rank:        1,  2,  3.5, 3.5*, 5,  6,  7,   8*,   9.5*, 9.5*, 11,  12*,  13,  14*,  15*

     STEP 2:                R = 3.5 + 8 + 9.5 + 9.5 + 12 + 14 + 15 = 71.5.  W = 71.5 - 7(7 + 1)/2 = 43.5

     STEP 3:                Using Table A-7 of Appendix A, α = 0.10 and wα = 17.  Since W is not less than
                            wα (43.5 > 17), do not reject the null hypothesis.

     STEP 4:                The null hypothesis was not rejected and it would be appropriate to calculate
                            the probable power of the test. However, because the number of samples is
                            small, extensive computer simulations are required in order to estimate the
                            power of this test which is beyond the scope of this guidance.

     STEP 5:                The null hypothesis was not rejected. Therefore, it is likely that there is no
                            difference between the investigated area and the reference area, although the
                            statistical power is low due  to the small sample sizes involved.
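For illustration, the ranking and W calculation of Boxes 3-20 and 3-21 can be written out in Python.  This is a sketch only, not part of the guidance; the comparison against the Table A-7 critical value still has to be done by hand:

```python
def average_ranks(values):
    """Rank values from smallest to largest, assigning tied values the
    average of the ranks they would otherwise receive (Box 3-20, Step 1)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and values[order[j]] == values[order[i]]:
            j += 1                       # j - i tied values share one average rank
        avg = (i + j + 1) / 2            # average of positions i+1 .. j
        for k in range(i, j):
            ranks[order[k]] = avg
        i = j
    return ranks

def wilcoxon_W(pop1, pop2):
    """W = R - n(n+1)/2, where R is the rank sum of population 1 (Step 2)."""
    n = len(pop1)
    ranks = average_ranks(list(pop1) + list(pop2))
    R = sum(ranks[:n])
    return R - n * (n + 1) / 2

# Box 3-21 data: area 1, then area 2 measurements (ppb)
W = wilcoxon_W([17, 23, 26, 5, 13, 13, 12], [16, 20, 5, 4, 8, 10, 7, 3])
```

This reproduces the box's W = 43.5; since 43.5 is not below wα = 17, the null hypothesis is not rejected, as in Step 3.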
                        Box 3-22: Directions for the Large Sample Approximation
               to the Wilcoxon Rank Sum Test for Simple and Systematic Random Samples

     Let X1, X2, . . . , Xn represent the n data points from population 1 and Y1, Y2, . . . , Ym represent the m data
     points from population 2 where both n and m are greater than 20.  For Case 1, the null hypothesis will be
     that population 1  is shifted to the left of population 2 with the alternative that population 1 is the same as
     or shifted to the right of population 2; for Case 2, the null hypothesis will be that population 1 is shifted to
     the right of population 2 with the alternative that population  1 is the same as or shifted to the left of
     population 2; for Case 3,  the null hypothesis will be that there is no difference between the populations
     and the alternative hypothesis will be that population 1 is shifted either to the right or left of population 2.

     STEP 1:                 List and rank the measurements from both populations from smallest to
                             largest, keeping track of which population contributed each measurement.
                             The rank of 1 is assigned to the smallest value, the rank of 2 to the second
                             smallest value, and so forth.  If there are ties, assign the average of the ranks
                             that would otherwise have been assigned to the tied observations.

     STEP 2:                 Calculate W as the sum of the ranks of the data from population 1.

     STEP 3:                 Calculate  wp = mn/2 + zp √[mn(n + m + 1)/12]  where p = 1 - α for Case
                             1, p = α for Case 2, and zp is the pth percentile of the standard normal
                             distribution (Table A-1 of Appendix A).  For Case 3, calculate both wα/2 (p =
                             α/2) and w1-α/2 (p = 1 - α/2).

     STEP 4:                 For Case 1, reject the null hypothesis if W > w1-α.  For Case 2, reject the null
                             hypothesis if W < wα.  For Case 3, reject the null hypothesis if W > w1-α/2 or
                             W < wα/2.  If the null hypothesis is rejected, go to Step 6.  Otherwise, go to
                             Step 5.
                             Step 5.

     STEP 5:                 If the null hypothesis (H0) was not rejected, calculate either the power of the
                             test or the sample size necessary to achieve the false rejection and false
                             acceptance error rates.  If only one false acceptance error rate (β) has been
                             specified (at δ1), it is possible to calculate the sample size that achieves the
                             DQOs, assuming the true mean and standard deviation are equal to the values
                             estimated from the sample, instead of calculating the power of the test.  If m
                             and n are large, calculate:

                                   m* = n* = 2s²(z1-α + z1-β)²/δ1² + (0.25)z1-α²

                where zp is the pth percentile of the standard normal distribution (Table A-1 of Appendix A).  If
                1.16m* < m and 1.16n* < n, the false acceptance error rate has been satisfied.

     STEP 6:                 The results of the test could be:

                1) the null hypothesis was rejected, and it seems that population 1 is shifted to the right
                (Case 1), to the left (Case 2) or to the left or right (Case 3) of population 2.

                2) the null hypothesis was not rejected, the false acceptance error rate was satisfied, and it
                seems that population 1 is shifted to the left (Case 1) or to the right (Case 2) of population 2,
                or there is no difference between the two populations (Case 3).

                3) the null hypothesis was not rejected, the false acceptance error rate was not satisfied, and
                it seems that population 1 is shifted to the left (Case 1) or to the right (Case 2) of population
                2, or there is no difference between the two populations (Case 3), but this outcome is
                uncertain because the sample sizes were probably too small.

       3.3.3.2       The Quantile Test

       PURPOSE

       The Quantile test can be used to compare two populations based on the independent
random samples X1, X2, . . ., Xm from the first population and Y1, Y2, . . ., Yn from the second
population. When the Quantile test and the Wilcoxon rank sum test (Section 3.3.3.1) are applied
together, the combined tests are the most powerful at detecting true differences between two
populations. The Quantile test is useful in detecting instances where only parts of the data are
different rather than a complete shift in the data.  It essentially looks at a certain number of the
largest data values to determine if too many data values from one population are  present to be
accounted for by pure chance.

       ASSUMPTIONS AND THEIR VERIFICATION

       The Quantile test assumes that the data X1, X2, . . ., Xm are a random sample from
population 1, the data Y1, Y2, . . ., Yn are a random sample from population 2, and the two
random samples are independent of one another.  The validity of the random sampling and
independence assumptions is assured by using proper randomization procedures,  either random
number generators or tables of random numbers.  The primary verification required is to review
the procedures used to select the  sampling points. The two underlying distributions are assumed
to have the same underlying dispersion (variance).

       LIMITATIONS AND ROBUSTNESS

       The Quantile test is not robust to outliers.  In addition, the test assumes either a systematic
(e.g., a triangular grid) or simple  random sampling was employed. The Quantile  test may not be
used for stratified designs. In addition, exact false rejection error rates are not  available, only
approximate rates.

       SEQUENCE OF STEPS

       The Quantile test is difficult to implement by hand.  Therefore, directions are not included
in this guidance but the DataQUEST software (EPA,  1996) can be used to conduct this test.
However,  directions for a modified Quantile test that can be implemented  by hand are contained
in Box 3-23 and an example is given in Box 3-24.
                        Box 3-23:  Directions for a Modified Quantile Test for
                              Simple and Systematic Random Samples

   Let there be 'm' measurements from population 1 (the reference area or group) and 'n' measurements from
   population 2 (the test area or group). The Modified Quantile test can be used to detect differences in shape
   and location of the two distributions.  For this test, the significance level (α) can either be approximately 0.10
   or approximately 0.05.  The null hypothesis for this test is that the two populations are the same (i.e., the test
   group is the same as the reference group) and the  alternative is that population 2 has larger measurements
   than population 1 (i.e., the test group has larger values than the reference group).

   STEP 1:               Combine the two samples and order them from smallest to largest keeping track
                         of which sample a value came from.

   STEP 2:               Using Table A-13 of Appendix A, determine the critical number (C) for sample
                         size m from the reference area and sample size n from the test area, using the
                         significance level α.  If the Cth largest measurement of the combined samples is
                         the same as others, increase C to include all of these tied values.

   STEP 3:               If the largest C measurements from the combined  samples are all from
                         population 2 (the test group), then reject the  null hypothesis and conclude that
                         there are differences between the two populations. Otherwise, the null hypothesis
                         is not rejected  and it appears that there is no difference between the two
                         populations.
3.3.4  Comparing Two Medians

       Let μ̃1 represent the median of population 1 and μ̃2 represent the median of population 2.
The hypotheses considered in this section are:

       Case 1: H0: μ̃1 - μ̃2 ≤ δ0 vs. HA: μ̃1 - μ̃2 > δ0; and

       Case 2: H0: μ̃1 - μ̃2 ≥ δ0 vs. HA: μ̃1 - μ̃2 < δ0.

An example of a two-sample test for the difference between two population medians is comparing
the median contaminant level at a Superfund site to the median of a background site.  In this case,
δ0 would be zero.

       The median is also the 50th percentile, and, therefore, the methods described in Section
3.3.2 for percentiles and proportions may be used to test hypotheses concerning the difference
between two medians by letting P1 = P2 = 0.50.  The Wilcoxon rank sum test (Section 3.3.3.1) is
also recommended for comparing two medians.  This test is more powerful than those for
proportions for symmetric distributions.
                         Box 3-24: An Example of a Modified Quantile Test for
                              Simple and Systematic Random Samples

   At a hazardous waste site a new, cheaper, in-situ methodology was compared against an existing
   methodology by remediating separate areas of the site using each method.  If the new methodology works,
   the overall contamination of the area of the site remediated using the new methodology should be the same as
   the area of the site remediated using the standard methodology.  If the new methodology does not work, then
   there would be higher contamination values remaining on the area of the site remediated using the new
   methodology.  The site manager wishes to determine if the new methodology works and has chosen a 5%
  significance level. A modified Quantile  Test will be used to make this determination based on 7 samples from
  the area remediated using the standard methodology (population 1) and 12 samples from the area remediated
  using the new methodology (population 2). The sampled values are:

                 Standard Methodology                        New Methodology
                   17, 8, 20, 4, 6, 5, 4                     7, 18, 2, 4, 6, 11,  5, 9, 10, 2, 3, 3

   STEP 1:               Combining the two samples and ordering them from smallest to largest yields:

                        2* 2*  3* 3* 4  4  4* 5*  5 6  6*  7* 8  9* 10* 11*  17  18*  20

            where * denoted samples from the new methodology portion of the site (population 2).

   STEP 2:   Using Table A-13 of Appendix A with m = 7, n = 12, and α = 0.05, the critical value C = 5.  Since
             the 5th largest value is 10, there is no need to increase C.

             (Note, however, that if the data were 2* 2* 3* 3* 4 4 4* 5* 5 6 6* 7* 8 9* 9* 11* 17 18* 20,
             then the 5th largest value would have been 9, which is tied with one other value.  In this
             case, C would have been raised to 6 to include the tied value.)

   STEP 3:               From Step 1, the 5 largest values are 10*, 11*, 17, 18*, 20.  Only 3 of these 5
                         values come from population 2; therefore, the null hypothesis cannot be rejected
                         and the site manager concludes that it seems that the new in-situ methodology
                         works as well as the standard methodology.
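The counting rule of Boxes 3-23 and 3-24 is simple enough to script.  The Python sketch below is an illustration only; the critical number C must still come from Table A-13:

```python
def modified_quantile_test(reference, test, C):
    """Reject H0 if the C largest values of the combined sample all come from
    the test group (Box 3-23, Steps 1-3).  C is the Table A-13 critical number;
    values tied with the Cth largest are pulled in automatically (Step 2)."""
    combined = sorted([(v, "ref") for v in reference] +
                      [(v, "test") for v in test])
    cutoff = combined[-C][0]                           # value of the Cth largest
    top = [src for v, src in combined if v >= cutoff]  # extends C over ties
    return all(src == "test" for src in top)

standard = [17, 8, 20, 4, 6, 5, 4]                # Box 3-24, population 1
new = [7, 18, 2, 4, 6, 11, 5, 9, 10, 2, 3, 3]     # Box 3-24, population 2
rejected = modified_quantile_test(standard, new, C=5)
```

Here rejected is False: the 5 largest values (10*, 11*, 17, 18*, 20) include two reference-area measurements, matching Step 3 of the box.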
3.4     Tests for Comparing Several Populations

3.4.1   Tests for Comparing Several Means

        This section describes procedures to test the differences between several sample means
from different populations either against a control population or among themselves.  For example,
the tests described in this section could be used to identify whether or not there are differences
between several drinking water wells or could be used to identify if several downgradient wells
differ from an upgradient well.

        In this situation, it would be possible to apply the tests described in Section 3.3.1 multiple
times.  However, applying a test multiple times underestimates the true false rejection decision
error rate.  Therefore, the test described in this section controls the overall false rejection decision
error rate by making the multiple comparisons simultaneously.
       3.4.1.1 Dunnett's Test

       PURPOSE

       Dunnett's test is used to test the difference between sample means from different
populations against a control population.  A typical application would involve different cleaned
areas of a hazardous waste site being compared to a reference sample; this reference sample
having been obtained from an uncontaminated part of the hazardous waste site.

       ASSUMPTIONS AND THEIR VERIFICATION

       Multiple application of any statistical test is inappropriate because the continued use of the
same reference sample violates the assumption that the two samples were obtained independently
for each statistical test.  The tests are strongly correlated among themselves, with the degree of
correlation depending on the degree of similarity in the number of samples used for the control group
and investigated groups.  The test is really best suited for approximately equal sample sizes in
both the control group and the groups under investigation.

       LIMITATIONS AND ROBUSTNESS

       Dunnett's method is the same in operation as the  standard two-sample t-test of Section
3.3.1 except for the use of a larger pooled estimate of variance and the need for special t-type
tables (Table A-14 of Appendix A). These tables are for the case of equal number of samples in
the control and each of the investigated groups, but remain valid provided the number of samples
from the investigated group are approximately more than half but less than double the size of the
control group.  In this guidance, only the null hypothesis that the mean of the sample populations
is the same as the mean of the control population will be considered.

       SEQUENCE OF STEPS

       Directions for the use of Dunnett's method for a simple random sample  or a systematic
random sample are given in Box 3-25 and an example is contained in Box 3-26.
                               Box 3-25:  Directions for Dunnett's Test for
                                Simple Random and Systematic Samples


 Let k represent the total number of populations to be compared so that there are (k-1) sample populations and a
 single control population.  Let n1, n2, ..., nk-1 represent the sample sizes of each of the (k-1) sample populations
 and let m represent the sample size of the control population.  The null hypothesis is H0:  μi - μc ≤ 0 (i.e., no
 difference between the sample means and the control mean) and the alternative hypothesis is HA:  μi - μc > 0 for
 i = 1, 2, ..., k-1, where μi represents the mean of the ith sample population and μc represents the mean of the
 control population.  Let α represent the chosen significance level for the test.

 STEP 1:    For each sample population, make sure that approximately 0.5 ≤ m/ni ≤ 2.  If not, Dunnett's Test
            should not be used.

 STEP 2:    Calculate the sample mean, x̄i (Section 2.2.2), and the variance, si² (Section 2.2.3), for each of the k
            populations (i.e., i = 1, 2, ... k).

 STEP 3:    Calculate the pooled standard deviation:

                 sD = √{ [(m-1)sc² + (n1-1)s1² + ... + (nk-1-1)sk-1²] / [(m-1) + (n1-1) + ... + (nk-1-1)] }

 STEP 4:    For each of the k-1 sample populations, compute

                 ti = (x̄i - x̄c) / [sD √(1/m + 1/ni)]

 STEP 5:    Use Table A-14 of Appendix A to determine the critical value TD(1-α) where the degrees of freedom is
            (m-1) + (n1-1) + . . . + (nk-1-1).

 STEP 6:    Compare ti to TD(1-α) for each of the k-1 sample populations.  If ti > TD(1-α) for any of the sample
            populations, then reject the null hypothesis and conclude that there are differences between the
            means of the sample populations and the mean of the control population.  Otherwise, conclude that
            there is no difference between the sample and control population means.

                              Box 3-26: An Example of Dunnett's Test for
                                Simple Random and Systematic Samples

 At a hazardous work site, 6 designated areas previously identified as 'hotspots' have been cleaned. In order for
 this site to be a potential candidate for the local Brownfields program, it must be demonstrated that these areas
 are no longer contaminated.  Therefore, the means of these areas will be compared to the mean of a reference
 area also located on the site using Dunnett's test. The null hypothesis will be that there is no difference between
 the means of the 'hotspot' areas and the mean of the reference area. A summary of the data from the areas
 follows.

                      Reference
                      Area       1AK3       2BF6      3BG5       4GH2     5FF3        6GW4
  Number of Samples:  7          6          5         6          7        8           7
  Mean:               10.3       11.4       12.2      10.2       11.4     11.9        12.1
  Variance:           2.5        2.6        3.3       3.0        3.2      2.6         2.8
  Ratio m/n_i:        -          7/6 = 1.16 7/5 = 1.4 7/6 = 1.16 7/7 = 1  7/8 = 0.875 7/7 = 1
  t_i:                -          1.18       1.93      -0.11      1.22     1.84        2.00
 STEP 1:   Calculate the ratio m/n_i for each investigated area. These are shown in the 4th row of the table above.
            Since all of these ratios fall within the range of 0.5 to 2.0, Dunnett's test may be used.

 STEP 2:   The sample means, x̄_i, and the variances, s_i², were calculated using Sections 2.2.2 and 2.2.3 of
            Chapter 2. These are shown in the 2nd and 3rd rows of the table above.

 STEP 3:   The pooled standard deviation for all 7 areas is:

               SD = sqrt[ ((7-1)2.5 + (6-1)2.6 + ... + (7-1)2.8) / ((7-1) + (6-1) + ... + (7-1)) ] = 1.68

 STEP 4:   For each 'hotspot' area, t_i was computed.  For example,

               t_1 = (11.4 - 10.3) / (1.68·sqrt(1/6 + 1/7)) = 1.18

            These are shown in the 5th row of the table above.
 STEP 5:   The degrees of freedom is (7-1) + (6-1) + . . . + (7-1) = 39.  So using Table A-14 of Appendix A
            with 39 degrees of freedom, the critical values are T_D(0.95) = 2.37 and T_D(0.90) = 2.03.

 STEP 6:   Since none of the values in row 5 of the table are greater than either 2.37 or 2.03, it appears that
            none of the 'hotspot' areas have contamination levels that are significantly different from the
            reference area.  Therefore, this site may be a potential candidate to be a Brownfields site.

 NOTE:  If an ordinary 2-sample t-test (see Section 3.3.1.1) had been used to compare each 'hotspot' area with
 the reference area at the 5% level of significance, areas 2BF6, 5FF3, and 6GW4 would have erroneously been
 declared different from the reference area, which would probably alter the final conclusion about including the site
 as a Brownfields candidate.
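The arithmetic in Boxes 3-25 and 3-26 can be sketched in a few lines of code. This is an illustrative sketch only (Python is an arbitrary choice, and only the summary statistics from the table are used); the critical values T_D(1-α) must still be read from Table A-14.

```python
import math

# Summary statistics from Box 3-26; the reference area is the control population.
control = {"n": 7, "mean": 10.3, "var": 2.5}
hotspots = {
    "1AK3": {"n": 6, "mean": 11.4, "var": 2.6},
    "2BF6": {"n": 5, "mean": 12.2, "var": 3.3},
    "3BG5": {"n": 6, "mean": 10.2, "var": 3.0},
    "4GH2": {"n": 7, "mean": 11.4, "var": 3.2},
    "5FF3": {"n": 8, "mean": 11.9, "var": 2.6},
    "6GW4": {"n": 7, "mean": 12.1, "var": 2.8},
}

# STEP 1: Dunnett's test requires approximately 0.5 <= m/n_i <= 2 for each area.
assert all(0.5 <= control["n"] / d["n"] <= 2.0 for d in hotspots.values())

# STEP 3: pooled standard deviation over all k populations (control included).
populations = [control] + list(hotspots.values())
df = sum(d["n"] - 1 for d in populations)
sd = math.sqrt(sum((d["n"] - 1) * d["var"] for d in populations) / df)

# STEP 4: t_i for each 'hotspot' area versus the control.
t = {area: (d["mean"] - control["mean"])
           / (sd * math.sqrt(1.0 / d["n"] + 1.0 / control["n"]))
     for area, d in hotspots.items()}

print(df, round(sd, 2))                          # 39 1.68
print(round(t["1AK3"], 2), round(t["6GW4"], 2))  # 1.18 2.0
```

Each t_i would then be compared to T_D(1-α) with 39 degrees of freedom, exactly as in STEPs 5 and 6 of the box.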
                                         CHAPTER 4
         STEP 4: VERIFY THE ASSUMPTIONS OF THE STATISTICAL TEST
             THE DATA QUALITY ASSESSMENT PROCESS

                  Review DQOs and Sampling Design
                  Conduct Preliminary Data Review
                  Select the Statistical Test
                  Verify the Assumptions
                  Draw Conclusions From the Data

             VERIFY THE ASSUMPTIONS OF THE STATISTICAL TEST

             Purpose

             Examine the underlying assumptions of the statistical
             hypothesis test in light of the environmental data.

             Activities

             •  Determine Approach for Verifying Assumptions
             •  Perform Tests of Assumptions
             •  Determine Corrective Actions

             Tools

             •  Tests of distributional assumptions
             •  Tests for independence and trends
             •  Tests for dispersion assumptions
                       Step 4: Verify the Assumptions of the Statistical Test

          •   Determine approach for verifying assumptions.
              -  Identify any strong graphical evidence from the preliminary data review.
              -  Review (or develop) the statistical model for the data.
              -  Select the tests for verifying assumptions.

          •   Perform tests of assumptions.
              -  Adjust for bias if warranted.
              -  Perform the calculations required for the tests.

          •   If necessary, determine corrective actions.
              -  Determine whether data transformations will correct the problem.
              -  If data are missing, explore the feasibility of using theoretical justification or
                 collecting new data.
              -  Consider robust procedures or nonparametric hypothesis tests.
                                    List of Boxes
Box 4-1:  Directions for the Coefficient of Variation Test and an Example	4-9
Box 4-2:  Directions for Studentized Range Test and an Example  	4-10
Box 4-3:  Directions for Geary's Test	4-11
Box 4-4:  Example of Geary's Test  	4-11
Box 4-5:  Directions for the Test for a Correlation Coefficient and an Example	4-15
Box 4-6:  "Upper Triangular" Data for Basic Mann-Kendall Trend Test   	4-17
Box 4-7:  Directions for the Mann-Kendall Trend Test for Small Sample Sizes  	4-18
Box 4-8:  An Example of Mann-Kendall Trend Test for Small Sample Sizes  	4-18
Box 4-9:  Directions for the Mann-Kendall Procedure Using Normal Approximation	4-19
Box 4-10: An Example of Mann-Kendall Trend Test by Normal Approximation	4-20
Box 4-11: Data for Multiple Times and Multiple Stations	4-21
Box 4-12: Testing for Comparability of Stations and an Overall Monotonic Trend 	4-22
Box 4-13: Directions for the Wald-Wolfowitz Runs Test  	4-25
Box 4-14: An Example of the Wald-Wolfowitz Runs Test 	4-26
Box 4-15: Directions for the Extreme Value Test (Dixon's Test)	4-28
Box 4-16: An Example of the Extreme Value Test (Dixon's Test)	4-28
Box 4-17: Directions for the Discordance Test	4-29
Box 4-18: An Example of the Discordance Test	4-29
Box 4-19: Directions for Rosner's Test for Outliers	4-30
Box 4-20: An Example of Rosner's Test for Outliers	4-31
Box 4-21: Directions for Walsh's Test for Large Sample Sizes 	4-32
Box 4-22: Directions for Constructing Confidence Intervals and Confidence Limits 	4-34
Box 4-23: Directions for Calculating an F-Test to Compare Two Variances	4-34
Box 4-24: Directions for Bartlett's Test  	4-35
Box 4-25: An Example of Bartlett's Test	4-36
Box 4-26: Directions for Levene's Test	4-37
Box 4-27: An Example of Levene's Test	4-38
Box 4-28: Directions for Transforming Data and an Example	4-40
Box 4-29: Directions for Cohen's Method	4-44
Box 4-30: An Example of Cohen's Method	4-44
Box 4-31: Double Linear Interpolation	4-45
Box 4-32: Directions for Developing a Trimmed Mean	4-46
Box 4-33: An Example of the Trimmed Mean	4-46
Box 4-34: Directions for Developing a Winsorized Mean and Standard Deviation	4-47
Box 4-35: An Example of a Winsorized Mean and Standard Deviation	4-47
Box 4-36: Directions for Aitchison's Method to Adjust Means and Variances 	4-48
Box 4-37: An Example of Aitchison's Method 	4-48
Box 4-38: Directions for Selecting Between Cohen's Method or Aitchison's Method .... 4-49
Box 4-39: Example of Determining Between Cohen's Method and Aitchison's Method  .. 4-49
Box 4-40: Directions for the Rank von Neumann Test  	4-52
Box 4-41: An Example of the Rank von Neumann Test  	4-53
                                       CHAPTER 4
         STEP 4: VERIFY THE ASSUMPTIONS OF THE STATISTICAL TEST

4.1    OVERVIEW AND ACTIVITIES

       In this step, the analyst should assess the validity of the statistical test chosen in step 3 by
examining its underlying assumptions in light of the newly generated environmental data. The
principal thrust of this section is the determination of whether the data support the underlying
assumptions necessary for the selected test, or if modifications to the data are necessary prior to
further statistical analysis.

       This determination can be performed quantitatively using statistical analysis of data to
confirm or reject the assumptions that accompany any statistical test.  Almost always, however,
the quantitative techniques must be supported by qualitative judgments based on the underlying
science and engineering aspects of the study. Graphical representations of the data, such as those
described in Chapter 2, can provide important qualitative information about the reasonableness of
the assumptions.  Documentation of this step is important, especially when subjective judgments
play a pivotal role in accepting the results of the analysis.

       If the data support all of the key assumptions of the statistical test, then the DQA
continues to the next step, drawing conclusions from the data (Chapter 5). However, often one
or more of the assumptions will be called into question which may trigger a reevaluation of one of
the previous steps. This iteration in the DQA is an important check on the validity and
practicality of the results.

4.1.1  Determine Approach for Verifying Assumptions

       In most cases,  assumptions about distributional form,  independence, and dispersion can be
verified formally using the statistical tests described in the technical sections in the remainder of
this chapter, although in some situations, information from the preliminary data review may serve
as sufficiently strong evidence to support the assumptions. As part of this activity, the analyst
should identify methods to verify that the type and quantity of data required to perform the
desired test are available. The outputs of this activity should  include a list of the specific tests that
will be used to verify the assumptions.

       For each statistical test it will be necessary for the  investigator to select the "level of
significance." For the specific null hypothesis for the test  under consideration, the level of
significance is the chance that this null hypothesis is rejected even though it is true. For example,
if testing for normality of data, the null hypothesis is that the data do indeed exhibit normality.
When a test statistic is computed, choosing a level of significance of 5% is saying that if the null
hypothesis is true, then the chance that normally distributed data will produce a statistic more
extreme than the tabulated value is only 1 in 20 (5%).
       The choice of specific level of significance is up to the investigator and is a matter of
experience or personal choice.  It does not have to be the same as that chosen in Step 3 (Select
the Statistical Test).  If more than a couple of statistical tests are contemplated, it is advisable to
choose a numerically low value for the level of significance to prevent the accumulation of
potential errors. The level of significance for a statistical test is by definition the same as the false
rejection error rate.

       The methods and approach chosen for assumption verification depend on the nature of the
study and its documentation. For example, if computer simulation was used to estimate the
theoretical power of the statistical test, then this simulation model should be the basis for
evaluation of the effect of changes to assumptions using estimates calculated  from the data to
replace simulation values.

       If it is not already part of the design documentation, the analyst may need to formulate a
statistical model that describes the data. In a statistical model, the data are  conceptually
decomposed into elements that are assumed to be "fixed" (i.e., the component is either a constant
but unknown feature of the population or is controlled by experimentation) or "random" (i.e., the
component is an uncontrolled source of variation). Which components are  considered fixed and
which are random is determined by the assumptions made for the statistical test and by the
inherent structure of the sampling design. The random components that represent the sources of
uncontrolled variation could  include several types of measurement errors, as well as other sources
such as temporal and/or spatial components.

       In addition to identifying the components that make up an observation and specifying
which are fixed and which are random, the model should also define whether  the various
components behave in an additive or multiplicative fashion (or some combination). For example,
if temporal or spatial autocorrelations are believed to be present, then the model needs to identify
the autocorrelation structure (see Section 2.3.8).

4.1.2  Perform Tests of Assumptions

       For most statistical tests, investigators will need to assess the reasonableness of
assumptions in relation to the structure of the components making up an observation. For
example, a t-test assumes that the components, or errors,  are additive, uncorrelated, and normally
distributed with homogeneous variance. Basic assumptions that should be investigated include:

       (1)    Is it reasonable to assume that the errors (deviations from the model) are
             normally distributed? If adequate data are available, then standard tests for
             normality can be conducted (e.g., the Shapiro-Wilk test or the Kolmogorov-
              Smirnov test).

       (2)    Is it reasonable to assume that errors are uncorrelated? While it is natural to
              assume that analytical errors embedded in measurements made on different sample
             units are independent, other errors from other sources  may not be independent. If

              sample units are "too close together," either in time or space, independence may
              not hold. If the statistical test assumes independence and this assumption is not
              correct, the proposed false rejection and false acceptance error rates for the
              statistical test cannot be verified.

       (3)    Is it reasonable to assume that errors are additive and have a constant
              variability? If sufficient data are available, a plot of the relevant standard
              deviations versus mean concentrations may be used to discern if variability tends to
              increase with concentration level.  If so, transformations of the data may make the
              additivity assumption more tenable.

       One of the most important assumptions underlying the statistical procedures described
herein is that there is no inherent bias (systematic deviation from the true value) in the data. The
general  approach adopted here is that if a long term bias is known to exist, then adjustment for
this bias should be made. If bias is present, then the basic effect is to shift the power curves
associated with a given test to the right or left, depending on the direction of the bias. Thus
substantial distortion of the nominal false rejection and false acceptance decision error rates may
occur and so the level of significance could be very different than that assumed, and the power of
the test be far less than expected.  In general, bias cannot be discerned by examination of routine
data; rather, appropriate and adequate QA data are  needed, such as performance evaluation data.
If one chooses not to make adjustment for bias on the basis of such data, then one should, at a
minimum, construct the estimated worst-case power curves so as to understand the potential
effects of the bias.

4.1.3   Determine Corrective Actions

       Sometimes the assumptions underlying the primary statistical test will not be satisfied and
some type of corrective action will be required before proceeding.  In some cases, a
transformation of the data will correct a problem with distributional assumptions.  In other cases,
the data for verifying some key assumption may not be available, and existing information may not
support  a theoretical justification of the validity of the assumption.   In this situation, it may be
necessary to collect additional data to verify the assumptions. If the assumptions underlying a
hypothesis test are not satisfied, and data transformations or other modifications do not appear
feasible, then it may be necessary to consider an alternative statistical test.  These  include robust
test procedures and nonparametric procedures.  Robust test procedures involve modifying the
parametric test by using robust estimators.  For instance, as a substitute for a t-test, a trimmed
mean and its associated standard error (Section 4.7.2) might be used to form a t-type statistic.

4.2    TESTS FOR DISTRIBUTIONAL ASSUMPTIONS

       Many statistical tests and models are only appropriate for data that follow a particular
distribution. This section will aid in determining if a distributional  assumption of a statistical test
is satisfied, in particular, the assumption of normality.  Two of the most important distributions
for tests involving environmental data are the normal distribution and the lognormal distribution,

both of which are discussed in this section. To test if the data follow a distribution other than the
normal distribution or the lognormal distribution, apply the chi-square test discussed in Section
4.2.7 or consult a statistician.
       There are many methods available for verifying the assumption of normality ranging from
simple to complex.  This section discusses methods based on graphs, sample moments (kurtosis
and skewness), sample ranges, the Shapiro-Wilk test and closely related tests, and goodness-of-fit
tests.  Discussions for the simplest tests contain step-by-step directions and examples based on the
data in Table 4-1. These tests are summarized in Table 4-2.  This section ends with a comparison
of the tests to help the analyst select a test for normality.

                        Table 4-1. Data for Examples in Section 4.2
           15.63     11.00     11.75     10.45     13.18
           10.37     10.54     11.55     11.01     10.23

                        x̄ = 11.57       s = 1.677
                               Table 4-2. Tests for Normality

  Test                        Section   Sample Size   Recommended Use                      Data-QUEST

  Shapiro-Wilk W Test         4.2.2     ≤ 50          Highly recommended.                  Yes

  Filliben's Statistic        4.2.3     ≤ 100         Highly recommended.                  Yes

  Coefficient of Variation    4.2.4     Any           Only use to quickly discard an       Yes
  Test                                                assumption of normality.

  Skewness and Kurtosis       4.2.5     > 50          Useful for large sample sizes.       Yes
  Tests

  Geary's Test                4.2.6     > 50          Useful when tables for other         Yes
                                                      tests are not available.

  Studentized Range Test      4.2.6     ≤ 1000        Highly recommended (with some        Yes
                                                      conditions).

  Chi-Square Test             4.2.7     Large^a       Useful for grouped data and when     No
                                                      the comparison distribution is
                                                      known.

  Lilliefors Kolmogorov-      4.2.7     > 50          Useful when tables for other         No
  Smirnov Test                                        tests are not available.

^a The necessary sample size depends on the number of groups formed when implementing this test. Each group
should contain at least 5 observations.
       The assumption of normality is very important as it is the basis for the majority of
statistical tests. A normal, or Gaussian, distribution is one of the most common probability
distributions in the analysis of environmental data. A normal distribution is a reasonable model of
the behavior of certain random phenomena and can often be used to approximate other probability
distributions. In addition, the Central Limit Theorem and other limit theorems state that as the
sample size gets  large, some of the sample summary statistics (e.g., the sample mean) behave as if
they are a normally distributed variable.  As a result, a common assumption associated with
parametric tests or statistical models is that the errors associated with data or a model follow a
normal distribution.
        The graph of a normally distributed random variable, a normal curve, is bell-shaped (see
Figure 4-1) with the highest point located at the mean, which is equal to the median.  A normal
curve is symmetric about the mean, hence the part to the left of the mean is a mirror image of the
part to the right. In environmental data, random errors occurring during the measurement process
may be normally distributed.

        [Figure 4-1. Graph of a Normal and Lognormal Distribution]
       Environmental data commonly exhibit
frequency distributions that are non-negative
and skewed with heavy or long right tails.  Several standard parametric probability models have
these properties, including the Weibull, gamma, and lognormal distributions. The lognormal
distribution (Figure 4-1) is a commonly used distribution for modeling environmental contaminant
data. The advantage to this distribution is that a simple (logarithmic) transformation will
transform a lognormal distribution into a normal distribution. Therefore, the methods for testing
for normality described in this section can be used to test for lognormality if a logarithmic
transformation has been used.
4.2.1   Graphical Methods

       Graphical methods (Section 2.3) present detailed information about data sets that may not
be apparent from a test statistic. Histograms, stem-and-leaf plots, and normal probability plots
are some graphical methods that are useful for determining whether or not data follow a normal
curve.  Both the histogram and stem-and-leaf plot of a normal distribution are bell-shaped.  The
normal probability plot of a normal distribution follows a straight line. For non-normally
distributed data, there will be large deviations in the tails or middle of a normal probability plot.

       Using a plot to decide if the data are normally distributed involves making a subjective
decision. For extremely non-normal data, it is easy to make this determination; however, in many
cases the decision is not straightforward.  Therefore, formal test procedures are usually necessary
to test the assumption of normality.
4.2.2  Shapiro-Wilk Test for Normality (the W test)

       One of the most powerful tests for normality is the W test by Shapiro and Wilk. This test
is similar to computing a correlation between the quantiles of the standard normal distribution and
the ordered values of a data set. If the normal probability plot is approximately linear (i.e., the
data follow a normal curve), the test statistic will be relatively high.  If the normal probability plot
contains significant curves, the test statistic will be relatively low.

       The W test is recommended in several EPA guidance documents and in many statistical
texts. Tables of critical values for sample sizes up to 50 have been developed for determining the
significance of the test statistic. However, this test is difficult to compute by hand since it requires
two different  sets of tabled values and a large number of summations and multiplications.
Therefore, directions for implementing this test are not given in this document, but the test is
contained in the Data Quality Assessment Statistical Toolbox (QA/G-9D) (EPA, 1996).

4.2.3  Extensions of the  Shapiro-Wilk Test (Filliben's Statistic)

       Because the W test may only be used for sample sizes less than or equal to 50,  several
related tests have been proposed. D'Agostino's test for sample sizes between 50 and 1000 and
Royston's test for sample  sizes up to 2000 are two such tests that approximate some of the key
quantities or parameters of the W test.

       Another test related to the W test is the Filliben statistic, also called the probability plot
correlation coefficient.  This test measures the linearity of the points on the normal probability
plot. Similar to the W test, if the normal probability plot is approximately linear (i.e., the data
follow a normal curve), the correlation coefficient will be relatively high.  If the normal probability
plot contains  significant curves (i.e., the data do not follow a normal curve), the correlation
coefficient will be relatively low. Although easier to compute than the W test, the Filliben statistic
is still difficult to compute by hand.  Therefore, directions for implementing this test are not given
in this guidance; however, it is contained in the software, Data Quality Assessment Statistical
Toolbox (QA/G-9D) (EPA, 1996).
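Although directions are not reproduced here, the flavor of the Filliben statistic can be conveyed with a short sketch. This uses only the Python standard library and Filliben's common approximation to the uniform order-statistic medians (an assumption; it is not the exact algorithm of the QA/G-9D software):

```python
import math
from statistics import NormalDist

def filliben_r(data):
    """Correlation between the ordered data and normal order-statistic medians."""
    n = len(data)
    x = sorted(data)
    # Filliben's approximation to the uniform order-statistic medians.
    m = [(i - 0.3175) / (n + 0.365) for i in range(1, n + 1)]
    m[0], m[-1] = 1 - 0.5 ** (1 / n), 0.5 ** (1 / n)
    z = [NormalDist().inv_cdf(p) for p in m]  # corresponding normal quantiles
    mx, mz = sum(x) / n, sum(z) / n
    sxz = sum((a - mx) * (b - mz) for a, b in zip(x, z))
    sxx = sum((a - mx) ** 2 for a in x)
    szz = sum((b - mz) ** 2 for b in z)
    return sxz / math.sqrt(sxx * szz)

# Table 4-1 data: a correlation well below 1 (and below the tabled critical
# value) indicates curvature in the normal probability plot.
data = [15.63, 11.00, 11.75, 10.45, 13.18, 10.37, 10.54, 11.55, 11.01, 10.23]
r = filliben_r(data)
print(round(r, 2))
```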

4.2.4  Coefficient of Variation

       The coefficient of variation (CV) may be used to  quickly determine whether or not the
data follow a normal curve by comparing the sample CV to 1.  The use of the CV is only valid for
some environmental applications if the data represent a non-negative characteristic such as
contaminant concentrations. If the CV is greater than 1,  the data should not be modeled with a
normal curve. However, this method should not be used to conclude the opposite, i.e., do not
conclude that the data can  be modeled with a normal curve if the CV is less than 1.  This test is to
be used only in conjunction with other statistical tests or when graphical representations of the
data indicate extreme departures from normality.  Directions and an example of this method are
contained in Box 4-1.
                     Box 4-1: Directions for the Coefficient of Variation Test for
                               Environmental Data and an Example

    Directions

    STEP 1:  Calculate the coefficient of variation (CV):

                 CV = s / x̄ = [ (1/(n-1)) Σ(X_i - x̄)² ]^(1/2) / [ (1/n) ΣX_i ]

    STEP 2:  If CV > 1.0, conclude that the data are not normally distributed.  Otherwise, the test is
             inconclusive.

    The following example demonstrates using the coefficient of variation to determine whether the data in Table
    4-1 should be modeled using a normal curve.

    STEP 1:  Calculate the coefficient of variation (CV):  CV = s / x̄ = 1.677 / 11.571 = 0.145

    STEP 2:  Since 0.145 is not greater than 1.0, the test is inconclusive.
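The computation in Box 4-1 is simple to script; a minimal sketch using the Table 4-1 data (Python chosen purely for illustration):

```python
import math

data = [15.63, 11.00, 11.75, 10.45, 13.18, 10.37, 10.54, 11.55, 11.01, 10.23]
n = len(data)
mean = sum(data) / n
s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))  # sample std. deviation

cv = s / mean
print(round(cv, 3))  # 0.145
# CV > 1.0 would argue against a normal model; here the test is inconclusive.
```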
4.2.5  Coefficient of Skewness/Coefficient of Kurtosis Tests

       The degree of symmetry (or asymmetry) displayed by a data set is measured by the
coefficient of skewness (g3).  The coefficient of kurtosis, g4, measures the degree of flatness of a
probability distribution near its center. Several test methods have been proposed using these
coefficients to test for normality. One method tests for normality by adjusting the coefficients of
skewness and kurtosis to approximate a standard normal distribution for sample sizes greater than
50.

       Two other tests based on these coefficients include a combined test based on a chi-squared
(χ²) distribution and Fisher's cumulant test.  Fisher's cumulant test computes the exact sampling
distribution of g3 and g4; therefore, it is more powerful than previous methods which assume that
the distributions of the two coefficients are normal. Fisher's cumulant test requires a table of
critical values, and these tests require a sample size of greater than 50.  Tests based on skewness
and kurtosis are rarely used as they are less powerful than many alternatives.
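Although no directions are given in this section, the coefficients themselves are simple to compute. The sketch below uses the biased central-moment definitions g3 = m3/m2^(3/2) and g4 = m4/m2² (an assumed convention, since this guidance does not spell out the formulas), applied to the Table 4-1 data:

```python
data = [15.63, 11.00, 11.75, 10.45, 13.18, 10.37, 10.54, 11.55, 11.01, 10.23]
n = len(data)
mean = sum(data) / n

# Central moments of order 2, 3, and 4.
m2 = sum((x - mean) ** 2 for x in data) / n
m3 = sum((x - mean) ** 3 for x in data) / n
m4 = sum((x - mean) ** 4 for x in data) / n

g3 = m3 / m2 ** 1.5  # skewness: 0 for a symmetric distribution
g4 = m4 / m2 ** 2    # kurtosis: 3 for a normal distribution
print(round(g3, 2), round(g4, 2))
```

For these data g3 is well above 0 and g4 is above 3, consistent with the long right tail visible in Table 4-1 (the value 15.63).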

4.2.6  Range Tests

       Almost 100% of the area of a normal curve lies within ±5 standard deviations from the
mean and tests for normality have been developed based on this fact. Two such tests, which are
both simple to apply, are the studentized range test and Geary's test. Both of these tests use a
ratio of an estimate of the sample range to the sample standard  deviation. Very large and very
small values of the ratio then imply that the data are not well modeled by a normal curve.
       a.     The studentized range test (or w/s test). This test compares the range of the
sample to the sample standard deviation.  Tables of critical values for sample sizes up to 1000
(Table A-2 of Appendix A) are available for determining whether the absolute value of this ratio is
significantly  large.  Directions for implementing this method are given in Box 4-2 along with an
example.  The studentized range test does not perform well if the data are asymmetric and if the
tails of the data are heavier than the normal distribution.  In addition, this test may be sensitive to
extreme values.  Unfortunately,  lognormally distributed data, which are common in environmental
applications, have these characteristics. If the data appear to be lognormally distributed, then this
test should not be used.  In most cases, the studentized range test performs as well as the Shapiro-
Wilk test and is much easier to apply.
                          Box 4-2:  Directions for Studentized Range Test
                                        and an Example
  Directions
  STEP 1:    Calculate sample range (w) and sample standard deviation (s) using Section 2.2.3.

   STEP 2:    Compare  w/s = (X_(n) - X_(1))/s  to the critical values given in Table A-2 (labeled a and b).

            If w/s falls outside the two critical values then the data do not follow a normal curve.

  Example

  The following example demonstrates the use of the studentized range test to determine if the data from Table 4-
  1 can be modeled using a normal curve.

  STEP 1:    w = X(n)-X(1)= 15.63-10.23 = 5.40 and s = 1.677.
  STEP 2:    w/s = 5.4/1.677 = 3.22. The critical values given in Table A-2 are 2.51 and 3.875.  Since 3.22 falls
            between these values, the assumption of normality is not rejected.
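The steps in Box 4-2 reduce to two lines of arithmetic; a sketch (Python for illustration; the critical values still come from Table A-2):

```python
import math

data = [15.63, 11.00, 11.75, 10.45, 13.18, 10.37, 10.54, 11.55, 11.01, 10.23]
n = len(data)
mean = sum(data) / n
s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

w = max(data) - min(data)  # sample range
ratio = w / s
print(round(ratio, 2))  # 3.22
# Table A-2 critical values for n = 10 are 2.51 and 3.875; since the ratio
# falls between them, the assumption of normality is not rejected.
```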
       b.     Geary's Test. Geary's test uses the ratio of the mean deviation of the sample to
the sample standard deviation. This ratio is then adjusted to approximate a standard normal
distribution.  Directions for implementing this method are given in Box 4-3 and an example is
given in Box 4-4. This test does not perform as well as the Shapiro-Wilk test or the studentized
range test. However, since Geary's test statistic is based on the normal distribution, critical values
for all possible sample sizes are available.
                                 Box 4-3: Directions for Geary's Test

    STEP 1:    Calculate the sample mean X̄, the sample sum of squares (SSS), and the sum of absolute
               deviations (SAD):

                    X̄ = (1/n) Σ(i=1 to n) Xi,   SSS = Σ(i=1 to n) Xi² - [Σ(i=1 to n) Xi]²/n,   and
                    SAD = Σ(i=1 to n) |Xi - X̄|

    STEP 2:    Calculate Geary's test statistic  a = SAD / [n(SSS)]^(1/2).

    STEP 3:    Test "a" for significance by computing  Z = (a - 0.7979) / (0.2123/√n).  Here 0.7979 and
               0.2123 are constants used to achieve normality.

    STEP 4:    Use Table A-1 of Appendix A to find the critical value z(1-α) such that 100(1-α)% of the
               normal distribution is below z(1-α).  For example, if α = 0.05, then z(0.95) = 1.645.  Declare
               "a" to be sufficiently small or large (i.e., conclude the data are not normally distributed)
               if Z > z(1-α).
                                  Box 4-4: Example of Geary's Test

    The following example demonstrates the use of Geary's test to determine if the data from Table 4-1 can
    be modeled using a normal curve.

    STEP 1:    X̄ = (1/n) Σ Xi = 11.571,   SAD = Σ |Xi - X̄| = 11.694,   and
               SSS = Σ Xi² - [Σ Xi]²/n = 1364.178 - 1338.88 = 25.298

    STEP 2:    a = 11.694 / [10(25.298)]^(1/2) = 0.735

    STEP 3:    Z = (0.735 - 0.7979) / (0.2123/√10) = -0.934

    STEP 4:    Since Z = -0.934 is not greater than 1.645 (the critical value at the 5% significance level),
               there is not enough information to conclude that the data do not follow a normal distribution.
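  Geary's statistic is likewise easy to compute directly. A minimal Python sketch of the Box 4-3 calculations (the data set below is hypothetical; the resulting Z would be compared to z(1-α) from Table A-1):

  ```python
  import math

  def geary_test(data):
      """Return Geary's statistic a and its normal approximation Z."""
      n = len(data)
      mean = sum(data) / n
      sss = sum(x * x for x in data) - sum(data) ** 2 / n   # sum of squares about the mean
      sad = sum(abs(x - mean) for x in data)                # sum of absolute deviations
      a = sad / math.sqrt(n * sss)
      z = (a - 0.7979) / (0.2123 / math.sqrt(n))
      return a, z

  # Hypothetical data; at the 5% level, compare Z to 1.645.
  a, z = geary_test([4.0, 5.1, 5.6, 6.2, 7.0, 7.4, 8.1, 9.3])
  ```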
4.2.7  Goodness-of-Fit Tests

       Goodness-of-fit tests are used to test whether data follow a specific distribution, i.e., how
well a specified distribution fits the data.  In verifying assumptions of normality, one would
compare the data to a normal distribution with a specified mean and variance.

       a.     Chi-square Test.  One classic goodness-of-fit test is the chi-square test which
involves breaking the data into groups and comparing these groups to the expected groups from
the known distribution.  There are no fixed methods for selecting these groups and this test also
requires a large sample size since at least 5 observations per group are required to implement this
test.  In addition, the chi-square test does not have the power of the Shapiro-Wilk test or some of
the other tests mentioned above.

       b.     Tests Based on the Empirical Distribution Function. The cumulative
distribution function, denoted by F(x), and the empirical distribution function of the data for a
given sample of size n are defined in Section 2.3.7.4.  Since empirical distribution functions
estimate the true F(x) underlying a set of data, and as the cumulative distribution function for a
given type of distribution [e.g., a normal distribution (see Section 2.4) with given mean and
standard deviation] can be computed, a goodness of fit test can be performed using the empirical
distribution function. If the empirical distribution function is "not close to" the given cumulative
distribution function, then there is evidence that the data do not come from the distribution having
that cumulative distribution function.

       Various methods have been used to measure the discrepancy between the sample empirical
distribution function and the theoretical cumulative distribution function. These measures are
referred to as empirical distribution function statistics.  The best known empirical  distribution
function statistic is the  Kolmogorov-Smirnov  (K-S) statistic. The K-S approach is appropriate if
the sample size exceeds 50 and if F(x) represents a specific distribution with known parameters
(e.g., a normal distribution with mean 100  and variance 30).  A modification to the test, called the
Lilliefors K-S test, is appropriate (for n>50) for testing that the data are normally  distributed
when the F(x) is based  on an estimated mean and variance.

       Unlike the K-S  type statistics, most empirical distribution function statistics are based on
integrated or average values between the empirical distribution function and cumulative
distribution functions.  The two most powerful are the Cramer-von Mises and Anderson-Darling
statistics.  Extensive simulations show that the Anderson-Darling empirical distribution function
statistic is just as good  as any, including the Shapiro-Wilk statistic, when testing for normality.
However, the Shapiro-Wilk test is applicable only for the case of a normal-distribution cumulative
distribution function, while the Anderson-Darling method is more general.

       Most goodness-of-fit tests are difficult to perform manually and are usually included in
standard statistical software.  The application of goodness-of-fit tests to non-normal data is
beyond the scope of this guidance, and consultation with a statistician is recommended.
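       For the known-parameters case described above, the K-S distance between the empirical distribution function and a fully specified normal CDF needs nothing beyond the error function. A Python sketch (the data and parameters are hypothetical; the resulting D must still be compared to tabulated K-S critical values, or to Lilliefors tables if the mean and variance were estimated):

```python
import math

def normal_cdf(x, mean, sd):
    """CDF of a normal distribution with the given mean and standard deviation."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def ks_statistic(data, mean, sd):
    """Kolmogorov-Smirnov distance between the EDF and a specified normal CDF.

    D is the largest gap between F_n(x) and F(x), checked just before
    and just after each jump of the empirical distribution function.
    """
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = normal_cdf(x, mean, sd)
        d = max(d, abs(i / n - f), abs((i - 1) / n - f))
    return d

# Hypothetical data, tested against a normal with known mean 100 and sd 2.
d = ks_statistic([98.0, 101.0, 99.5, 102.3, 100.1], mean=100.0, sd=2.0)
```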
4.2.8  Recommendations

       Analysts can perform tests for normality with samples as small as 3.  However, the tests
lack statistical power for small sample sizes.  Therefore, for small sample sizes, it is recommended
that a nonparametric statistical test (i.e., one that does not assume a distributional form of the
data) be selected during Step 3 of the DQA in order to avoid incorrectly assuming the data are
normally distributed when there is simply not enough information to test this assumption.

       If the sample size is less than 50, then this guidance recommends using the Shapiro-Wilk
W test, wherever practicable.  The Shapiro-Wilk W test is one of the most powerful tests for
normality, and it is recommended in several EPA guidance documents as the preferred test when the sample
size is less than 50.  This test is difficult to implement by hand but can be applied easily using the
Data Quality Assessment Statistical Toolbox (QA/G-9D) (EPA, 1996). If the Shapiro-Wilk W
test is not feasible, then this guidance recommends using either Filliben's statistic or the
studentized range test. Filliben's statistic performs similarly to the Shapiro-Wilk test. The
studentized range is a simple test to perform; however, it is not applicable for non-symmetric data
with large tails. If the data are not highly  skewed and the tails are not significantly large
(compared to a normal distribution), the studentized range provides a simple and powerful test
that can be calculated by hand.

       If the sample size is greater than 50, this guidance recommends using either Filliben's
statistic or the studentized range test.  However, if critical values for these tests (for the specific
sample size) are not available, then this guidance recommends implementing either Geary's test or
the Lilliefors Kolmogorov-Smirnov test.  Geary's test is easy to apply and uses standard normal
tables similar to Table A-1 of Appendix A that are widely available in standard textbooks.  The
Lilliefors Kolmogorov-Smirnov test is more statistically powerful but is also more difficult to
apply and uses specialized tables that are not readily available.

4.3    TESTS FOR TRENDS

4.3.1  Introduction

       This section presents statistical tools for detecting and estimating trends in environmental
data.  The detection and estimation of temporal or spatial trends are important for many
environmental studies or monitoring programs. In cases where temporal or spatial patterns are
strong, simple procedures such as time plots or linear regression over time can reveal trends. In
more complex situations, sophisticated statistical models and procedures may be needed. For
example, the detection of trends may be complicated by the overlaying of long- and short-term
trends, cyclical effects (e.g., seasonal or weekly systematic variations), autocorrelations, or
impulses or jumps (e.g., due to interventions or procedural changes).

       The graphical representations of Chapter 2 are recommended  as the first step to identify
possible trends. A plot of the data versus  time is recommended for temporal data, as it may reveal
long-term trends and may also show other major types  of trends, such as cycles or impulses. A

posting plot is recommended for spatial data to reveal spatial trends such as areas of high
concentration or areas that were inaccessible.

       For most of the statistical tools presented below, the focus is on monotonic long-term
trends (i.e., a trend that is exclusively increasing or decreasing, but not both), as well as other
sources of systematic variation, such as seasonality.  The investigations of trend in this section are
limited to one-dimensional domains, e.g., trends in a pollutant concentration over time. The
current edition of this document does not address spatial trends (with 2- and 3-dimensional
domains) and trends over space and time (with 3- and 4-dimensional domains), which may involve
sophisticated geostatistical techniques such as kriging and require the assistance of a statistician.
Section 4.3.2 discusses estimating and testing for trends using regression techniques.  Section
4.3.3  discusses more robust trend estimation procedures, and Section 4.3.4 discusses hypothesis
tests for detecting trends under several types of situations.

4.3.2  Regression-Based Methods for Estimating and Testing for Trends

       4.3.2.1        Estimating  a Trend Using the Slope of the Regression Line

       The classic procedures for assessing linear trends involve regression.  Linear regression is
a commonly used procedure in which calculations are performed on a data set containing pairs of
observations (Xi, Yi), so as to obtain the slope and intercept of a line that "best fits" the data.  For
temporal trends, the Xi values represent time and the Yi values represent the observations, such as
contaminant concentrations.  An estimate of the magnitude of trend can be obtained by
performing a regression of the data versus time (or some function of the data versus some
function of time) and using the slope of the regression line as the measure of the strength of the
trend.

       Regression procedures are easy to apply; most scientific calculators will accept data
entered as pairs and will calculate the slope and intercept of the best fitting line, as well as the
correlation coefficient r (see Section 2.2.4). However, regression entails several limitations and
assumptions. First of all, simple linear regression (the most commonly used method) is designed
to detect linear relationships between two variables; other types of regression models are
generally needed to detect non-linear relationships such as cyclical or non-monotonic trends.
Regression is very sensitive to extreme values (outliers), and presents difficulties in handling data
below the detection limit, which are commonly encountered in environmental studies.  Regression
also relies on two key assumptions: normally distributed errors, and constant variance. It may be
difficult or burdensome to verify these assumptions in practice, so the accuracy of the slope
estimate may be suspect. Moreover, the analyst must ensure that time plots of the data show no
cyclical patterns, outlier tests show no extreme data values, and data validation reports indicate
that nearly all the measurements were above detection limits.  Because of these drawbacks,
regression is not recommended as a general tool for estimating and detecting trends, although it
may be useful as an informal, quick, and easy  screening tool for identifying strong linear trends.
        4.3.2.2         Testing for Trends Using Regression Methods

        The limitations and assumptions associated with estimating trends based on linear
regression methods apply also to other regression-based statistical tests for detecting trends.
Nonetheless, for situations in which regression methods can be applied appropriately, there is a
solid body of literature on hypothesis testing using the concepts of statistical linear models as a
basis for inferring the existence of temporal trends. The methodology is complex and beyond the
scope of this document.

        For simple linear regression, the statistical test of whether the slope is significantly
different from zero is equivalent to testing if the correlation coefficient is significantly different
from zero.  Directions for this test are given in Box 4-5 along with an example. This test assumes
a linear relation between Y and X with independent normally distributed errors and constant
variance across all X and Y values.  Censored values (e.g., below the detection limit) and outliers
may invalidate the tests.
                      Box 4-5: Directions for the Test for a Correlation Coefficient
                                         and an Example

     Directions

     STEP 1:    Calculate the correlation coefficient, r (Section 2.2.4).

     STEP 2:    Calculate the t-value  t = r / [(1 - r²)/(n - 2)]^(1/2).

     STEP 3:    Use Table A-1 of Appendix A to find the critical value t(1-α/2) such that 100(1-α/2)% of
                the t distribution with n - 2 degrees of freedom is below t(1-α/2).  For example, if α = 0.10
                and n = 17, then n - 2 = 15 and t(1-α/2) = 1.753.  Conclude that the correlation is
                significantly different from zero if |t| > t(1-α/2).

     Example: Consider the following data set (in ppb): for Sample 1, arsenic (X) is 4.0 and lead (Y) is 8.0;
     for Sample 2, arsenic is 3.0 and lead is 7.0; for Sample 3, arsenic is 2.0 and lead is 7.0; and for Sample
     4, arsenic is 1.0 and lead is 6.0.

     STEP 1:    In Section 2.2.4, the correlation coefficient r for this data was calculated to be 0.949.

     STEP 2:    t = 0.949 / [(1 - 0.949²)/(4 - 2)]^(1/2) = 4.26

     STEP 3:    Using Table A-1 of Appendix A, t(1-α/2) = 2.920 for a 10% level of significance and
                4 - 2 = 2 degrees of freedom.  Since 4.26 > 2.920, there appears to be a significant
                correlation between the two variables lead and arsenic.
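  The t statistic of Box 4-5 is a one-line computation; this Python sketch reproduces the box's own arithmetic (the table lookup for the critical value remains manual):

  ```python
  import math

  def correlation_t(r, n):
      """t statistic for testing H0: correlation = 0, with n - 2 degrees of freedom."""
      return r / math.sqrt((1.0 - r * r) / (n - 2))

  # Box 4-5 values: r = 0.949, n = 4; compare |t| to t(1-alpha/2) with 2 df.
  t = correlation_t(0.949, 4)
  ```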
4.3.3  General Trend Estimation Methods

       4.3.3.1        Sen's Slope Estimator

       Sen's Slope Estimate is a nonparametric alternative for estimating a slope.  This approach
involves computing slopes for all the pairs of ordinal time points and then using the median of
these slopes as an estimate of the overall slope.  As such, it is insensitive to outliers and can
handle a moderate number of values below the detection limit and missing values.  Assume that
there are n time points (or n periods of time), and let Xi denote the data value for the ith time
point.  If there are no missing data, there will be n(n-1)/2 possible pairs of time points (i, j) in
which i > j.  The slope for such a pair is called a pairwise slope, bij, and is computed as
bij = (Xi - Xj) / (i - j).  Sen's slope estimator is then the median of the n(n-1)/2 pairwise slopes.

       If there is no underlying trend, then a given Xi is as likely to be above another Xj as it is
below.  Hence, if there is no underlying trend, there would be an approximately equal number of
positive and negative pairwise slopes, and thus the median would be near zero.  Due to the number of
calculations required, Sen's estimator is rarely calculated by hand and directions are not given in
this document.
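       Although rarely calculated by hand, Sen's estimator is only a few lines of code. A Python sketch (the input series is hypothetical and assumes one observation per equally spaced time point, with no missing values):

```python
import statistics

def sens_slope(x):
    """Sen's slope: the median of all pairwise slopes (x[j] - x[i]) / (j - i), j > i."""
    slopes = [(x[j] - x[i]) / (j - i)
              for i in range(len(x)) for j in range(i + 1, len(x))]
    return statistics.median(slopes)

# Hypothetical series of n = 5 time points: n(n-1)/2 = 10 pairwise slopes.
slope = sens_slope([5.0, 6.0, 11.0, 8.0, 10.0])
```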

       4.3.3.2        Seasonal Kendall Slope Estimator

       If the data exhibit cyclic trends, then Sen's slope estimator can be modified to account for
the cycles. For example, if data are available for each month for a number of years, 12 separate
sets of slopes would be determined (one for each month of the year); similarly, if daily
observations exhibit weekly cycles, seven sets of slopes would be determined, one for each day of
the week.  In these estimates, the above pairwise slope is calculated for each time period and the
median of all of the slopes is an estimator of the slope for a long-term trend. This is known as the
seasonal Kendall slope estimator. Because of the number of calculations required,  this estimator
is rarely calculated by hand.
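       The seasonal version restricts the pairwise slopes to pairs of observations within the same season. A Python sketch under the same assumptions (equally spaced, complete data; `period` is the cycle length, e.g., 12 for monthly data with an annual cycle):

```python
import statistics

def seasonal_kendall_slope(series, period):
    """Median of pairwise slopes computed only between points in the same season."""
    slopes = []
    for season in range(period):
        idx = list(range(season, len(series), period))   # time indices in this season
        for a in range(len(idx)):
            for b in range(a + 1, len(idx)):
                i, j = idx[a], idx[b]
                slopes.append((series[j] - series[i]) / (j - i))
    return statistics.median(slopes)

# Hypothetical series: a trend of 0.3 per time step plus a period-4 cycle.
series = [(t % 4) * 5 + 0.3 * t for t in range(16)]
slope = seasonal_kendall_slope(series, period=4)
```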

4.3.4  Hypothesis Tests for Detecting Trends

       Most of the trend tests treated in this section involve the Mann-Kendall test or extensions
of it. The Mann-Kendall test does not assume any particular distributional form and
accommodates trace values or values below the detection limit by assigning them a common
value.  The test can also be modified to deal with multiple observations per time period and
generalized to deal with multiple sampling locations and seasonality.

       4.3.4.1        One Observation per Time Period for One Sampling Location

       The Mann-Kendall test involves computing a statistic S, which is the difference between
the number of pairwise slopes (described in 4.3.3.1) that are positive minus the number that are
negative.  If S is a large positive value, then there is evidence of an increasing trend in the data. If
S is a large negative value, then there is evidence of a decreasing trend in the data.  The null

hypothesis or baseline condition for this test is that there is no temporal trend in the data values,
i.e., "H0: no trend".  The alternative condition or hypothesis will usually be either "HA: upward
trend" or "HA: downward trend."
       The basic Mann-Kendall trend test involves listing the observations in temporal order, and
computing all differences that may be formed between measurements and earlier measurements, as
depicted in Box 4-6. The test statistic is the difference between the number of strictly positive
differences and the number of strictly negative differences.  If there is an underlying upward trend,
then these differences will tend to be positive and a sufficiently large value of the test statistic will
suggest the presence of an upward trend. Differences of zero are not included in the test statistic
(and should be avoided, if possible, by recording data to sufficient accuracy). The steps  for
conducting the Mann-Kendall test for small sample sizes (i.e., less than 10) are contained in Box
4-7 and an example is contained in Box 4-8.
        Box 4-6: "Upper Triangular" Data for Basic Mann-Kendall Trend Test
                  with a Single Measurement at Each Time Point

  Data Table

               t1     t2     t3     t4    ...   t(n-1)    tn      (time from earliest to latest)
               X1     X2     X3     X4    ...   X(n-1)    Xn      (actual values recorded)

    X1        X2-X1  X3-X1  X4-X1  ...  X(n-1)-X1    Xn-X1
    X2               X3-X2  X4-X2  ...  X(n-1)-X2    Xn-X2
    ...
    X(n-2)                              X(n-1)-X(n-2)  Xn-X(n-2)
    X(n-1)                                             Xn-X(n-1)

  After performing the subtractions this table converts to:

               t1     t2     t3     t4    ...   t(n-1)    tn        # of +        # of -
               X1     X2     X3     X4    ...   X(n-1)    Xn      Differences   Differences
                                                                     (>0)          (<0)
    X1        Y21    Y31    Y41   ...   Y(n-1)1     Yn1
    X2               Y32    Y42   ...   Y(n-1)2     Yn2
    ...
    X(n-2)                              Y(n-1)(n-2)   Yn(n-2)
    X(n-1)                                            Yn(n-1)
                                                                  Total # >0    Total # <0

  where  Yik = sign(Xi - Xk) = +  if Xi - Xk > 0
                             = 0  if Xi - Xk = 0
                             = -  if Xi - Xk < 0

  NOTE:  Differences equal to 0 do not contribute to either total and are discarded.
EPA QA/G-9
QAOO Version
4- 17
   Final
July 2000

-------
               Box 4-7:  Directions for the Mann-Kendall Trend Test for Small Sample Sizes

  If the sample size is less than 10 and there is only one datum per time period, the Mann-Kendall Trend Test for
  small sample sizes may be used.

  STEP 1:   List the data in the order collected over time:  X1, X2, ..., Xn, where Xi is the datum at time ti.
            Assign a value of DL/2 to values reported as below the detection limit (DL).  Construct a "Data
            Matrix" similar to the top half of Box 4-6.

  STEP 2:   Compute the sign of all possible differences as shown in the bottom portion of Box 4-6.

  STEP 3:   Compute the Mann-Kendall statistic S, which is the number of positive signs minus the number of
            negative signs in the triangular table: S = (number of + signs) - (number of - signs).

  STEP 4:   Use Table A-11 of Appendix A to determine the probability p using the sample size n and the
            absolute value of the statistic S. For example, if n=5 and S=8,  p=0.042.

  STEP 5:   For testing the null hypothesis of no trend against H1 (upward trend), reject H0 if S > 0 and if
            p < α.  For testing the null hypothesis of no trend against H2 (downward trend), reject H0 if
            S < 0 and if p < α.
                 Box 4-8:  An Example of the Mann-Kendall Trend Test for Small Sample Sizes

  Consider 5 measurements ordered by the time of their collection: 5, 6, 11, 8, and 10.  These data will be
  used to test the null hypothesis, H0: no trend, versus the alternative hypothesis H1 of an upward trend at
  an α = 0.05 significance level.

  STEP 1:   The data listed in order by time are:  5, 6, 11, 8, 10.

  STEP 2:   A triangular table (see Box 4-6) was used to construct the possible differences.  The signs of
            the differences across each row are summarized in the last two columns.

                    Time     1      2      3      4      5     No. of +   No. of -
                    Data     5      6     11      8     10       Signs      Signs

                       5            +      +      +      +         4          0
                       6                   +      +      +         3          0
                      11                          -      -         0          2
                       8                                 +         1          0
                                                                   8          2

  STEP 3:   Using the table above, S = 8 - 2 = 6.

  STEP 4:   From Table A-11 of Appendix A for n = 5 and S = 6, p = 0.117.

  STEP 5:   Since S > 0 but p = 0.117 > 0.05, the null hypothesis is not rejected.  Therefore, there is not
            enough evidence to conclude that there is an increasing trend in the data.
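  The counting in Boxes 4-6 through 4-8 is mechanical and easy to script. A Python sketch of the S statistic (the series below is the Box 4-8 example, for which S = 6):

  ```python
  def mann_kendall_s(x):
      """Mann-Kendall statistic S: (# of positive differences) - (# of negative).

      Every later observation is compared with every earlier one; ties
      (zero differences) contribute to neither count.
      """
      s = 0
      for i in range(len(x)):
          for j in range(i + 1, len(x)):
              if x[j] > x[i]:
                  s += 1
              elif x[j] < x[i]:
                  s -= 1
      return s

  s = mann_kendall_s([5, 6, 11, 8, 10])
  ```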
        For sample sizes greater than 10, a normal approximation to the Mann-Kendall test is
quite accurate.  Directions for this approximation are contained in Box 4-9 and an example is
given in Box 4-10. Tied observations (i.e., when two or more measurements are equal) degrade
the statistical power and should be avoided, if possible, by recording the data to sufficient
accuracy.

       4.3.4.2        Multiple Observations per Time Period for One Sampling Location

       Often, more than one sample is collected for each time period. There are two ways to
deal with multiple observations per time period. One method is to compute a summary statistic,
such as the median, for each time period and to apply one of the Mann-Kendall trend tests of
Section 4.3.4.1  to the summary statistic.  Therefore, instead of using the individual data points in
the triangular table, the summary statistic would be used. Then the steps given in Box 4-7 and
Box 4-9 could be applied to the summary statistics.

       An alternative approach is  to consider all the multiple observations within a given time
period as being essentially equal (i.e., tied) values within that period.  The S  statistic is computed
as before with n being the total of all observations. The variance of the S statistic (previously
calculated in step 2) is changed to:
     VAR(S) = (1/18)[n(n-1)(2n+5) - Σ(p=1 to g) wp(wp-1)(2wp+5) - Σ(q=1 to h) uq(uq-1)(2uq+5)]

              + [Σ(p=1 to g) wp(wp-1)(wp-2)] [Σ(q=1 to h) uq(uq-1)(uq-2)] / [9n(n-1)(n-2)]

              + [Σ(p=1 to g) wp(wp-1)] [Σ(q=1 to h) uq(uq-1)] / [2n(n-1)]
            Box 4-9: Directions for the Mann-Kendall Procedure Using Normal Approximation

  If the sample size is 10 or more, a normal approximation to the Mann-Kendall procedure may be used.

  STEP 1:   Complete steps 1, 2, and 3 of Box 4-7.

  STEP 2:   Calculate the variance of S:  V(S) = n(n-1)(2n+5)/18.

  STEP 3:   If ties occur, let g represent the number of tied groups and wp represent the number of data
            points in the pth group.  The variance of S is then:

                 V(S) = (1/18)[n(n-1)(2n+5) - Σ(p=1 to g) wp(wp-1)(2wp+5)]

  STEP 4:   Calculate Z = (S-1)/[V(S)]^(1/2) if S > 0, Z = 0 if S = 0, or Z = (S+1)/[V(S)]^(1/2) if S < 0.

  STEP 5:   Use Table A-1 of Appendix A to find the critical value z(1-α) such that 100(1-α)% of the normal
            distribution is below z(1-α).  For example, if α = 0.05, then z(0.95) = 1.645.

  STEP 6:   For testing the hypothesis H0 (no trend) against 1) H1 (an upward trend), reject H0 if Z > z(1-α),
            or 2) H2 (a downward trend), reject H0 if Z < 0 and the absolute value of Z > z(1-α).


             Box 4-10: An Example of the Mann-Kendall Trend Test by Normal Approximation

  A test for an upward trend with α = 0.05 will be based on the 11 weekly measurements shown below.

  STEP 1:   Using Box 4-6, a triangular table was constructed of the possible differences.  A zero has been
            used if the difference is zero, a "+" sign if the difference is positive, and a "-" sign if the
            difference is negative.

              Week    1    2    3    4    5    6    7    8    9   10   11   No. of    No. of
              Data   10   10   10    5   10   20   18   17   15   24   15  + Signs   - Signs

                10         0    0    -    0    +    +    +    +    +    +      6         1
                10              0    -    0    +    +    +    +    +    +      6         1
                10                   -    0    +    +    +    +    +    +      6         1
                 5                        +    +    +    +    +    +    +      7         0
                10                             +    +    +    +    +    +      6         0
                20                                  -    -    -    +    -      1         4
                18                                       -    -    +    -      1         3
                17                                            -    +    -      1         2
                15                                                 +    0      1         0
                24                                                      -      0         1
                                                                              35        13

  STEP 2:   S = (sum of + signs) - (sum of - signs) = 35 - 13 = 22.

  STEP 3:   There are several observations tied at 10 and at 15.  Thus, the formula for tied values will be
            used.  In this formula, g = 2, w1 = 4 for the tied values of 10, and w2 = 2 for the tied values
            of 15.

                 V(S) = (1/18)[11(11-1)(2(11)+5) - [4(4-1)(2(4)+5) + 2(2-1)(2(2)+5)]] = 155.33

  STEP 4:   Z = (S-1)/[V(S)]^(1/2) = (22-1)/(155.33)^(1/2) = 21/12.46 = 1.685.

  STEP 5:   From Table A-1 of Appendix A, z(0.95) = 1.645.

  STEP 6:   H1 is the alternative of interest.  Since 1.685 is greater than 1.645, H0 is rejected.
            Therefore, there is evidence of an upward trend in the data.
where g represents the number of tied groups, wp represents the number of data points in the pth
group, h is the number of time periods which contain multiple data, and uq is the sample size in the
qth time period.

       The preceding variance formula assumes that the data are not correlated. If correlation
within single time periods is suspected, it is preferable to use a summary statistic (e.g., the
median) for each time period and then apply either Box 4-7 or Box 4-9 to the summary statistics.
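       For the one-observation-per-period case, the normal approximation of Box 4-9 (with its simpler tie correction) can be sketched in Python as follows; the multiple-observation VAR(S) formula would replace the `var` line. The series is the Box 4-10 example:

```python
import math
from collections import Counter

def mann_kendall_z(x):
    """Continuity-corrected Z for the Mann-Kendall test, one observation per period.

    V(S) uses the Box 4-9 tie correction:
    V(S) = [n(n-1)(2n+5) - sum over tied groups of w(w-1)(2w+5)] / 18
    """
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            s += (x[j] > x[i]) - (x[j] < x[i])
    ties = sum(w * (w - 1) * (2 * w + 5) for w in Counter(x).values() if w > 1)
    var = (n * (n - 1) * (2 * n + 5) - ties) / 18.0
    if s == 0:
        return 0.0
    return (s - 1) / math.sqrt(var) if s > 0 else (s + 1) / math.sqrt(var)

# The Box 4-10 series: S = 22 and V(S) = 155.33, so Z = 21/12.46
z = mann_kendall_z([10, 10, 10, 5, 10, 20, 18, 17, 15, 24, 15])
```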

       4.3.4.3       Multiple Sampling Locations with Multiple Observations

       The preceding methods involve a single sampling location (station).  However,
environmental data often consist of sets of data collected at several sampling locations (see Box
4-11). For example, data are often systematically collected at several fixed sites on a lake or
river, or within a region or basin. The data collection plan (or experimental  design) must be
systematic in the sense that approximately the same sampling times should be used at all locations.

In this situation, it is desirable to express the results by an overall regional summary statement
across all sampling locations.  However, there must be consistency in behavioral characteristics
across sites over time in order for a single summary statement to be valid across all sampling
locations. A useful plot to assess the consistency requirement is a single time plot (Section
2.3.8.1) of the measurements from all stations where a different symbol is used to represent each
station.
                       Box 4-11:  Data for Multiple Times and Multiple Stations

  Let i = 1, 2, ..., n represent time, k = 1, 2, ..., K represent sampling locations, and Xik represent the
  measurement at time i for location k.  This data can be summarized in matrix form, as shown below.

                                    Stations
                          1         2        ...       K

            Time    1    X11       X12       ...      X1K
                    2    X21       X22       ...      X2K
                    .     .         .                  .
                    n    Xn1       Xn2       ...      XnK

                         S1        S2        ...      SK
                         V(S1)     V(S2)     ...      V(SK)
                         Z1        Z2        ...      ZK

  where   Sk = Mann-Kendall statistic for station k (see STEP 3, Box 4-7),
          V(Sk) = variance for the S statistic for station k (see STEP 2, Box 4-9), and
          Zk = Sk/[V(Sk)]^(1/2).
       If the stations exhibit approximately steady trends in the same direction (upward or
downward), with comparable slopes, then a single summary statement across stations is valid and
this implies two relevant sets of hypotheses should be investigated:

       Comparability of stations. H0:  Similar dynamics affect all K stations vs. HA:  At least
       two stations exhibit different dynamics.

       Testing for overall monotonic trend.  H0*: Contaminant levels do not change over time
       vs.  HA*: There is an increasing (or decreasing) trend consistently exhibited across all
       stations.

Therefore,  the analyst must first test for homogeneity of stations, and then, if homogeneity is
confirmed, test for an overall monotonic trend.

       Ideally, the stations in Box 4-11 should have equal numbers.  However, the numbers of
observations at the stations can differ slightly, because of isolated missing values, but the overall
time periods spanned must be similar.  This guidance recommends that for fewer than 3 time
periods, an equal

number of observations (a balanced design) is required. For 4 or more time periods, up to 1
missing value per sampling location may be tolerated.

        a.      One Observation per Time Period. When only one measurement is taken for
each time period for each station, a generalization of the Mann-Kendall statistic can be used to
test the above hypotheses. This procedure is described in Box 4-12.
             Box 4-12:  Testing for Comparability of Stations and an Overall Monotonic Trend

  Let i = 1, 2, ..., n represent time, k = 1, 2, ..., K represent sampling locations, and Xik represent the
  measurement at time i for location k.  Let α represent the significance level for testing homogeneity and α*
  represent the significance level for testing for an overall trend.

  STEP 1:   Calculate the Mann-Kendall statistic Sk and its variance V(Sk) for each of the K stations using the
            methods of Section 4.3.4.1, Box 4-9.

  STEP 2:   For each of the K stations, calculate  Zk = Sk/√V(Sk).

  STEP 3:   Calculate the average  Z̄ = (Z1 + Z2 + . . . + ZK)/K.

  STEP 4:   Calculate the homogeneity chi-square statistic  χ²h = (Z1² + Z2² + . . . + ZK²) - K Z̄².
  STEP 5:   Using a chi-squared table (Table A-8 of Appendix A), find the critical value for χ² with (K-1) degrees
            of freedom at an α significance level.  For example, for a significance level of 5% and 5 degrees of
            freedom, χ²(5) = 11.07, i.e., 11.07 is the cut point which puts 5% of the probability in the upper tail of
            a chi-square variable with 5 degrees of freedom.

  STEP 6:   If χ²h > χ²(K-1), the stations are not homogeneous (i.e., different dynamics at different stations) at the
            significance level α.  Therefore, individual α*-level Mann-Kendall tests should be conducted at each
            station using the methods presented in Section 4.3.4.1.

  STEP 7:   Using a chi-squared table (Table A-8 of Appendix A), find the critical value for χ² with 1 degree of
            freedom at an α* significance level.  If

                           K Z̄² ≥ χ²(1) ,

            then reject H0* and conclude that there is a significant (upward or downward) monotonic trend
            across all stations at significance level α*.  The signs of the Sk indicate whether increasing or
            decreasing trends are present.  If

                           K Z̄² < χ²(1) ,

            there is not significant evidence at the α* level of a monotonic trend across all stations.  That is, the
            stations appear approximately stable over time.
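The arithmetic of Steps 2 through 4 of Box 4-12 reduces to a few lines. The sketch below is illustrative only and is not part of the guidance; the station statistics Sk and V(Sk) are hypothetical stand-ins for values obtained via Box 4-9, and the critical values must still come from Table A-8.

```python
import math

def homogeneity_chi_square(s, v):
    """Steps 2-4 of Box 4-12: standardize each station's Mann-Kendall
    statistic and form the homogeneity chi-square statistic.
    s[k] = Mann-Kendall statistic for station k; v[k] = its variance."""
    z = [sk / math.sqrt(vk) for sk, vk in zip(s, v)]    # Step 2: Zk
    z_bar = sum(z) / len(z)                             # Step 3: average of the Zk
    chi_h = sum(zk**2 for zk in z) - len(z) * z_bar**2  # Step 4: chi-square_h
    return z_bar, chi_h

# Hypothetical statistics for K = 3 stations (illustration only)
z_bar, chi_h = homogeneity_chi_square([12.0, 15.0, 9.0], [92.0, 92.0, 92.0])
# Step 6 compares chi_h to the chi-square critical value with K-1 = 2 degrees
# of freedom (Table A-8); Step 7 compares K * z_bar**2 to chi-square(1).
```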
       b.     Multiple Observations per Time Period. If multiple measurements are taken at
some times and stations, then the previous approaches are still applicable.  However, the variance
of the statistic Sk must be calculated using the equation for calculating V(S) given in Section
4.3.4.2. Note that Sk is computed for each station, so n, wp, g, h, and uq are all station-specific.

       4.3.4.4       One Observation for One Station with Multiple Seasons

       Temporal data are often collected over extended periods of time.  Within the time
variable, data may exhibit periodic cycles, which are patterns in the data that repeat over time
(e.g., the data may rise and fall regularly over the months in a year or the hours in a day). For
example, temperature and humidity may change with the season or month, and may affect
environmental measurements.  (For more information on seasonal cycles, see Section 2.3.8).  In
the following discussion, the term season represents one time point in the periodic cycle, such as a
month within a year or an hour within a day.

       If seasonal cycles are anticipated, then two approaches for testing for trends are the
seasonal Kendall test and Sen's test for trends. The seasonal Kendall test may be used for large
sample sizes, and Sen's test for trends may be used for small sample sizes. If different seasons
manifest similar slopes (rates of change) but possibly different intercepts, then the Mann-Kendall
technique of Section 4.3.4.3 is applicable, replacing time by year and replacing station by season.

       The seasonal Kendall test, which is an extension of the Mann-Kendall test, involves
calculating the Mann-Kendall test statistic, S, and its variance separately for each "season" (e.g.,
month of the year,  day of the week). The sum of the S's and the sum of their variances are then
used to form an overall test statistic that is assumed to be approximately normally distributed for
larger sample sizes.
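A minimal sketch of the seasonal Kendall computation just described, assuming no tied values within a season so that V(S) = n(n-1)(2n+5)/18; the data are hypothetical, and published versions of the test often apply a continuity correction to S before dividing, which this sketch omits.

```python
import math

def mann_kendall_s(x):
    """Mann-Kendall S: sum of the signs of all later-minus-earlier differences."""
    n = len(x)
    return sum((x[j] > x[i]) - (x[j] < x[i])
               for i in range(n) for j in range(i + 1, n))

def seasonal_kendall_z(seasons):
    """Sum S and V(S) over seasons and standardize; the normal approximation
    is appropriate only for larger sample sizes.  Assumes no ties, so that
    V(S) = n(n-1)(2n+5)/18 within each season."""
    s_tot = sum(mann_kendall_s(x) for x in seasons)
    v_tot = sum(len(x) * (len(x) - 1) * (2 * len(x) + 5) / 18 for x in seasons)
    return s_tot / math.sqrt(v_tot)

# Two "seasons" of hypothetical yearly measurements at one site
z = seasonal_kendall_z([[1.0, 1.4, 2.1, 2.5], [0.8, 1.1, 1.9, 2.2]])
# z is compared to a standard normal critical value (e.g., 1.96 at the 5% level)
```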

       For data at a single site, collected at multiple  seasons within multiple years, the techniques
of Section 4.3.4.3 can be applied to test for homogeneity of time trends across seasons.   The
methodology follows Boxes 4-11 and 4-12 exactly except that "station" is replaced by "season"
and the inferences  refer to seasons.

4.3.5  A Discussion on Tests for Trends

       This section discusses some further considerations for choosing among the many tests for
trends. All of the nonparametric trend tests and estimates use ordinal time (ranks) rather than
cardinal time (actual time values, such as month, day or hour) and this restricts the interpretation
of measured trends. All of the Mann-Kendall trend tests presented are based on certain pairwise
differences in measurements at different time points.  The only information about these differences
that is used in the Mann-Kendall calculations is their signs (i.e., whether they are positive or
negative); these tests are therefore generalizations of the sign test.  Mann-Kendall calculations are
relatively easy and simply involve counting the number of cases in which a later measurement
exceeds an earlier one and the number of cases in which an earlier measurement exceeds a later
one.  Information about the magnitudes of these differences is not used by the Mann-Kendall
methods, and this can adversely affect the statistical power when only limited amounts of data are
available.
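The sign counting described above can be written out directly; the sketch below is illustrative and the data values are hypothetical.

```python
def mann_kendall_counts(x):
    """Count concordant (later > earlier) and discordant (later < earlier)
    pairs; the Mann-Kendall S statistic is their difference.  Only the signs
    of the pairwise differences are used, never their magnitudes."""
    up = down = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            if x[j] > x[i]:
                up += 1
            elif x[j] < x[i]:
                down += 1
    return up, down, up - down

up, down, s = mann_kendall_counts([5.0, 6.1, 5.8, 7.2])
# 5 of the 6 pairwise differences are positive and 1 is negative, so S = 4
```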
        There are, however, nonparametric methods based on ranks that take such magnitudes
into account and still retain the benefit of robustness to outliers.  These procedures can be
thought of as replacing the data by their ranks and then conducting parametric analyses.  These
include the Wilcoxon rank sum test and its many generalizations.  These methods are more
resistant to outliers than parametric methods; a point can be no more extreme than the smallest or
largest rank.

        Rank-based methods, which make fuller use of the information in the data than the Mann-
Kendall methods, are not as robust with respect to outliers as the sign and the Mann-Kendall
tests.  They are, however, more statistically powerful than the sign test and the Mann-Kendall
methods; the Wilcoxon test is a case in point. If the data are random samples from normal
distributions with equal variances, then the sign test requires approximately 1.225 times as many
observations as the Wilcoxon rank sum test to achieve a given power at a given significance level.
This kind of tradeoff between power and robustness exemplifies the  analyst's evaluation process
leading to the selection of the best statistical procedure for the current situation. Further
statistical tests will be developed in future editions of this guidance.

4.3.6   Testing for Trends in Sequences of Data

       There are cases where it is desirable to determine whether a long sequence (for example,
readings from a monitoring station) can be considered random variation or whether consecutive
results are correlated in some way.  An everyday example would be to determine if a basketball
player exhibited "hot streaks" during the season when shooting from the free-throw line.  One
test for making this determination is the Wald-Wolfowitz test.  This test can only be used if the
data are binary, i.e., there are only two potential values.  For example, the data could be
'Yes/No', '0/1', or 'black/white'.  Directions for the Wald-Wolfowitz test are given in Box 4-13
and an example in Box 4-14.

4.4    OUTLIERS

4.4.1   Background

       Outliers are measurements that are extremely large or small relative to the rest of the data
and, therefore, are suspected of misrepresenting the population from which they were collected.
Outliers may result from transcription errors, data-coding errors, or measurement system
problems such as instrument breakdown. However, outliers may also represent true extreme
values of a distribution (for instance, hot spots) and indicate more variability  in the population
than was expected. Not removing true outliers and removing false outliers both lead to a
distortion of estimates of population parameters.

       Statistical outlier tests give the analyst probabilistic evidence that an extreme value
(potential outlier) does not "fit" with the distribution of the remainder of the data and is therefore
a statistical outlier.  These tests should only be used to identify data points that require further
investigation.  The tests alone cannot determine whether a statistical outlier should be discarded
or corrected within a data set; this decision should be based on judgmental or scientific grounds.

                        Box 4-13:  Directions for the Wald-Wolfowitz Runs Test

 Consider a sequence of two values and let n denote the number of observations of one value and m denote the
 number of observations of the other value.  Note that it is customary for n < m (i.e., n denotes the value that
 occurs the fewest times).  This test is used to test the null hypothesis that the sequence is random
 against the alternative hypothesis that the data in the sequence are correlated or may come from different
 populations.

 STEP 1:   List the data in the order collected and identify which will be the 'n' values and which will be the 'm'
           values.

 STEP 2:   Bracket the sequences within the series.  A sequence is a group of consecutive values.  For
           example, consider the data AAABAABBBBBBBABB.  The following are sequences in the data:

                           {AAA} {B} {AA} {BBBBBB} {A} {BB}

           In the example above, the smallest sequence has one data value and the largest sequence has six.

 STEP 3:   Count the number of sequences for the 'n' values and call it T.  For the example sequence, the 'n'
           values are 'A' since there are 6 A's and 9 B's, and T = 3: {AAA}, {AA}, and {A}.

 STEP 4:   If T is less than the critical value from Table A-12 of Appendix A for the specified significance level α,
           then reject the null hypothesis that the sequence is random in favor of the alternative that the data
           are correlated amongst themselves or possibly came from different distributions.  Otherwise,
           conclude the sequence is random.  In the example above, 3 < 6 (where 6 is the critical value from
           Table A-12 using n=6, m=9, and α = 0.01), so the null hypothesis that the sequence is random is
           rejected.

        There are 5  steps involved in treating extreme values or outliers:

               1. Identify extreme values that may be potential outliers;
               2. Apply statistical test;
               3. Scientifically review statistical outliers and decide on their disposition;
               4. Conduct data analyses with and  without statistical outliers; and
               5. Document the entire process.

Potential outliers may be identified through the graphical representations of Chapter 2 (step 1
above).  Graphs such as the box and whisker plot,  ranked data plot, normal probability plot, and
time plot can all be used to identify observations that are much larger  or smaller than the rest of
the data.  If potential outliers are identified, the next step is to apply one of the statistical tests
described in the following sections.  Section 4.4.2 provides recommendations on selecting a
statistical test for outliers.
                       Box 4-14: An Example of the Wald-Wolfowitz Runs Test

 This is a set of monitoring data from the main discharge station at a chemical manufacturing plant. The permit
 states that the discharge should have a pH of 7.0 and should never be less than 5.0.  So the plant manager has
 decided to use a pH of 6.0 to indicate potential problems.  In a four-week period the following values were
 recorded:

         6.5   6.6    6.4   6.2   5.9   5.8  5.9  6.2   6.2    6.3   6.6  6.6   6.7  6.4
         6.2   6.3    6.2   5.8   5.9   5.8  6.1  5.9   6.0    6.2   6.3  6.2

 STEP 1:  Since the plant manager has decided that a pH of 6.0 will indicate trouble, the data have been
          replaced with a binary indicator.  If the value is greater than 6.0, the value will be replaced by a 1;
          otherwise the value will be replaced by a 0. So the data are now:

          11110001111111111000100111

          As there are 8 values of '0' and 18 values of '1', n = 8 and m = 18.

 STEP 2:  The bracketed sequence is:  {1 1 1 1} {0 0 0} {1  1 1 1  1 1  1 1 1 1} {0 0 0} {1} {0 0 } {1 1 1}

 STEP 3:  T = 3: {000}, {000}, and {00}

 STEP 4:  Since 3 < 9 (where 9 is the critical value from Table A-12 using α = 0.05), the null hypothesis that
          the sequence is random is rejected.
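The run-counting statistic T of Box 4-13 can be sketched in a few lines, using the indicator sequence from Box 4-14; the comparison with the Table A-12 critical value is still done by hand.

```python
from itertools import groupby

def wald_wolfowitz_t(seq):
    """Count the runs (bracketed sequences) of the less-frequent value in a
    binary sequence: the statistic T of Box 4-13."""
    vals = set(seq)
    assert len(vals) == 2, "the runs test applies to binary data only"
    n_val = min(vals, key=seq.count)        # the 'n' value occurs fewest times
    runs = [k for k, _ in groupby(seq)]     # one entry per consecutive run
    return sum(1 for k in runs if k == n_val)

# The pH indicator sequence from Box 4-14
seq = "11110001111111111000100111"
t = wald_wolfowitz_t(seq)   # runs of '0': {000}, {000}, {00}, so T = 3
```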
       If a data point is found to be an outlier, the analyst may either:  1) correct the data point;
2) discard the data point from analysis; or 3) use the data point in all analyses.  This decision
should be based on scientific reasoning in addition to the results of the statistical test. For
instance, data points containing transcription errors should be corrected, whereas data points
collected while an instrument was malfunctioning may be discarded. One should never discard an
outlier based solely on a statistical test.  Instead, the decision to discard an outlier should be based
on some scientific or quality assurance basis. Discarding an outlier from a data set should be done
with extreme caution, particularly for environmental data sets, which often contain legitimate
extreme values. If an outlier is discarded from the data set, all statistical analysis of the data
should be applied to both the full and truncated data set so that the effect of discarding
observations may be assessed.  If scientific reasoning does not explain the outlier, it should not be
discarded from the data set.

       If any data points are found to be statistical outliers through the use of a statistical test,
this information will need to be documented along with the analysis  of the data set, regardless of
whether any data points are discarded.  If no data points are discarded, document the
identification of any "statistical" outliers by documenting the statistical test performed and the
possible scientific reasons investigated.  If any data points  are discarded, document each data
point, the statistical test performed, the scientific reason for discarding each data point, and the
effect on the analysis of deleting the data points.  This information is critical for effective peer
review.
4.4.2  Selection of a Statistical Test

       There are several statistical tests for determining whether or not one or more observations
are statistical outliers.  Step by step directions for implementing some of these tests are described
in Sections 4.4.3 through 4.4.6.  Section 4.4.7 describes statistical tests for multivariate outliers.

       If the data are normally distributed, this guidance recommends Rosner's test when the
sample size is greater than 25 and the Extreme Value test when the sample size is less than 25.  If
only one outlier is suspected, then the Discordance test may be substituted for either of these
tests. If the data are not normally distributed, or if the data cannot be transformed so that the
transformed data are normally distributed, then the analyst should  either apply a nonparametric
test (such as Walsh's test) or consult a statistician. A summary of this information is contained in
Table 4-3.
          Table 4-3.  Recommendations for Selecting a Statistical Test for Outliers

          Sample Size    Test                   Section    Assumes      Multiple
                                                           Normality    Outliers

          n < 25         Extreme Value Test     4.4.3      Yes          No/Yes
          n < 50         Discordance Test       4.4.4      Yes          No
          n > 25         Rosner's Test          4.4.5      Yes          Yes
          n > 50         Walsh's Test           4.4.6      No           Yes
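The selection logic of this section can be encoded as a rough helper; this is only a sketch of the recommendations above, not part of the guidance, and borderline cases should be referred to a statistician.

```python
def recommend_outlier_test(n, data_normal, single_outlier_suspected=False):
    """Rough encoding of the Section 4.4.2 recommendations: Rosner's test
    for normal data with n > 25, the Extreme Value test for normal data with
    n < 25, the Discordance test when only one outlier is suspected, and
    Walsh's nonparametric test when normality cannot be established."""
    if not data_normal:
        return "Walsh's test (Section 4.4.6)"
    if single_outlier_suspected:
        return "Discordance test (Section 4.4.4)"
    return ("Rosner's test (Section 4.4.5)" if n > 25
            else "Extreme Value test (Section 4.4.3)")
```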
4.4.3  Extreme Value Test (Dixon's Test)

       Dixon's Extreme Value test can be used to test for statistical outliers when the sample size
is less than or equal to 25. This test considers both extreme values that are much smaller than the
rest of the data (case 1) and  extreme values that are much larger than the rest of the data (case 2).
This test assumes that the data without the suspected outlier are normally distributed; therefore, it
is necessary to perform a test for normality on the data without the suspected outlier before
applying this test.  If the data are not normally distributed, either transform the data, apply a
different test, or consult a statistician.  Directions for the Extreme Value test are contained in Box
4-15; an example of this test is contained in Box 4-16.

       This guidance recommends using this test when only one outlier is suspected in the data.
If more than one outlier is suspected, the Extreme Value test may lead to masking where two or
more outliers close in value  "hide" one another.  Therefore, if the analyst decides to use the
Extreme Value test for multiple outliers, apply the test to the least extreme value first.
EPA QA/G-9
QAOO Version
4-27
   Final
July 2000

-------
                             Box 4-15: Directions for the Extreme Value Test
                                             (Dixon's Test)

  STEP 1:    Let X(1), X(2), ..., X(n) represent the data ordered from smallest to largest.  Check that the data
             without the suspect outlier are normally distributed, using one of the methods of Section 4.2.  If
             normality fails, either transform the data or apply a different outlier test.

  STEP 2:    X(1) is a Potential Outlier (case 1):  Compute the test statistic C, where

                 C = [X(2) - X(1)] / [X(n) - X(1)]      for 3 ≤ n ≤ 7,
                 C = [X(2) - X(1)] / [X(n-1) - X(1)]    for 8 ≤ n ≤ 10,
                 C = [X(3) - X(1)] / [X(n-1) - X(1)]    for 11 ≤ n ≤ 13,
                 C = [X(3) - X(1)] / [X(n-2) - X(1)]    for 14 ≤ n ≤ 25.

  STEP 3:    If C exceeds the critical value from Table A-3 of Appendix A for the specified significance level α,
             X(1) is an outlier and should be further investigated.

  STEP 4:    X(n) is a Potential Outlier (case 2):  Compute the test statistic C, where

                 C = [X(n) - X(n-1)] / [X(n) - X(1)]    for 3 ≤ n ≤ 7,
                 C = [X(n) - X(n-1)] / [X(n) - X(2)]    for 8 ≤ n ≤ 10,
                 C = [X(n) - X(n-2)] / [X(n) - X(2)]    for 11 ≤ n ≤ 13,
                 C = [X(n) - X(n-2)] / [X(n) - X(3)]    for 14 ≤ n ≤ 25.

  STEP 5:    If C exceeds the critical value from Table A-3 of Appendix A for the specified significance level α,
             X(n) is an outlier and should be further investigated.
                             Box 4-16:  An Example of the Extreme Value Test
                                             (Dixon's Test)

  The data in order of magnitude from smallest to largest are: 82.39, 86.62, 91.72, 98.37, 103.46, 104.93,
  105.52, 108.21, 113.23, and 150.55 ppm.  Because the largest value (150.55) is much larger than the other
  values, it is suspected that this data point might be an outlier, which is Case 2 in Box 4-15.

  STEP 1:  A normal probability plot of the data shows that there is no reason to suspect that the data (without
           the extreme value) are not normally distributed.  The studentized range test (Section 4.2.6) also
           shows that there is no reason to suspect that the data are not normally distributed.  Therefore, the
           Extreme Value test may be used to determine if the largest data value is an outlier.

  STEP 4:  C = [X(n) - X(n-1)] / [X(n) - X(2)] = (150.55 - 113.23)/(150.55 - 86.62) = 37.32/63.93 = 0.584

  STEP 5:  Since C = 0.584 > 0.477 (from Table A-3 of Appendix A with n=10), there is evidence that X(n) is an
           outlier at a 5% significance level and should be further investigated.
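For the sample-size range of this example (8 ≤ n ≤ 10), the case 2 statistic of Box 4-15 is a one-line ratio; the sketch below reproduces the Box 4-16 computation, with the critical value still taken from Table A-3.

```python
def dixon_c_largest(data):
    """Dixon's statistic for the largest value (case 2 of Box 4-15),
    valid for sample sizes 8 <= n <= 10:
    C = [X(n) - X(n-1)] / [X(n) - X(2)]."""
    x = sorted(data)
    n = len(x)
    assert 8 <= n <= 10, "this ratio applies only to 8 <= n <= 10"
    return (x[-1] - x[-2]) / (x[-1] - x[1])

data = [82.39, 86.62, 91.72, 98.37, 103.46, 104.93, 105.52, 108.21,
        113.23, 150.55]
c = dixon_c_largest(data)   # (150.55 - 113.23)/(150.55 - 86.62) = 0.584
```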
4.4.4   Discordance Test

        The Discordance test can be used to test if one extreme value is an outlier.  This test
considers two cases:  1) where the extreme value (potential outlier) is the smallest value of the
data set, and 2) where the extreme value (potential outlier) is the largest value of the data set.
The Discordance test assumes that the data are normally distributed; therefore, it is necessary to
perform a test for normality before applying this test. If the data are not normally distributed
either transform the data, apply a different test, or consult a statistician. Note that the test
assumes that the data without the outlier are normally distributed; therefore, the test for normality
should  be performed without the suspected outlier. Directions and an example of the Discordance
test are contained in Box 4-17 and Box 4-18.
                            Box 4-17: Directions for the Discordance Test

  STEP 1:   Let X(1), X(2), ..., X(n) represent the data ordered from smallest to largest.  Check that the data
            without the suspect outlier are normally distributed, using one of the methods of Section 4.2.  If
            normality fails, either transform the data or apply a different outlier test.

  STEP 2:   Compute the sample mean, x̄ (Section 2.2.2), and the sample standard deviation, s (Section 2.2.3).
            If the minimum value X(1) is a suspected outlier, perform Steps 3 and 4.  If the maximum value X(n) is
            a suspected outlier, perform Steps 5 and 6.

  STEP 3:   If X(1) is a Potential Outlier (case 1):  Compute the test statistic  D = [x̄ - X(1)] / s

  STEP 4:   If D exceeds the critical value from Table A-4, X(1) is an outlier and should be further investigated.

  STEP 5:   If X(n) is a Potential Outlier (case 2):  Compute the test statistic  D = [X(n) - x̄] / s

  STEP 6:   If D exceeds the critical value from Table A-4, X(n) is an outlier and should be further investigated.
                            Box 4-18:  An Example of the Discordance Test

  The ordered data are 82.39, 86.62, 91.72, 98.37, 103.46, 104.93, 105.52, 108.21, 113.23, and 150.55 ppm.
  Because the largest value of this data set (150.55) is much larger than the rest, it may be an outlier.

  STEP 1:  A normal probability plot of the data shows that there is no reason to suspect that the data (without
          the extreme value) are not normally distributed. The studentized range test (Section 4.2.6) also
          shows that there is no reason to suspect that the data are not normally distributed.  Therefore, the
          Discordance test may be used to determine  if the largest data value is an outlier.

  STEP 2:  x̄ = 104.50 ppm and s = 18.922 ppm.

  STEP 5:  D = [X(n) - x̄] / s = (150.55 - 104.50) / 18.922 = 2.43

  STEP 6:  Since D = 2.43 > 2.176 (from Table A-4 of Appendix A with n = 10), there is evidence that X(n) is an
          outlier at a 5% significance level and should  be further investigated.
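The case 2 statistic of Box 4-17 is a one-line computation; the sketch below assumes the sample standard deviation of Section 2.2.3 uses the usual n-1 divisor, which matches the values in the example above.

```python
import statistics

def discordance_d_largest(data):
    """Discordance statistic for the largest value (case 2 of Box 4-17):
    D = [X(n) - mean] / s."""
    xbar = statistics.mean(data)
    s = statistics.stdev(data)      # sample standard deviation, n-1 divisor
    return (max(data) - xbar) / s

data = [82.39, 86.62, 91.72, 98.37, 103.46, 104.93, 105.52, 108.21,
        113.23, 150.55]
d = discordance_d_largest(data)     # (150.55 - 104.50)/18.922 = 2.43
```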


4.4.5   Rosner's Test

        A parametric test developed by Rosner can be used to detect up to 10 outliers for sample
sizes of 25 or more. This test assumes that the data are normally distributed; therefore, it is
necessary to perform a test for normality before applying this test. If the data are not normally
distributed either transform the data, apply a different test, or consult a statistician. Note that the
test assumes that the data without the outlier are normally distributed; therefore, the test for
normality may be performed without the suspected outlier.  Directions for Rosner's test are
contained in Box 4-19 and an example is contained in Box 4-20.

        Rosner's test is not as easy to apply as  the preceding tests. To apply Rosner's test, first
determine an upper limit r0 on the number of outliers  (r0 < 10), then order the r0 extreme values
from most extreme to  least extreme. Rosner's  test statistic is then based on the sample mean and
sample
                          Box 4-19:  Directions for Rosner's Test for Outliers

  STEP 1:    Let X1, X2, . . . , Xn represent the ordered data points.  By inspection, identify the maximum number
             of possible outliers, r0.  Check that the data are normally distributed, using one of the methods of
             Section 4.2.

  STEP 2:    Compute the sample mean, x̄, and the sample standard deviation, s, for all the data.  Label these
             values x̄(0) and s(0), respectively.  Determine the observation farthest from x̄(0) and label this
             observation y(0).  Delete y(0) from the data and compute the sample mean, labeled x̄(1), and the
             sample standard deviation, labeled s(1).  Then determine the observation farthest from x̄(1) and
             label this observation y(1).  Delete y(1) and compute x̄(2) and s(2).  Continue this process until r0
             extreme values have been eliminated.

             In summary, after the above process the analyst should have

                 [x̄(0), s(0), y(0)];  [x̄(1), s(1), y(1)];  . . . ;  [x̄(r0-1), s(r0-1), y(r0-1)],  where

                 x̄(i) = [1/(n-i)] Σ Xj ,   s(i) = { [1/(n-i-1)] Σ [Xj - x̄(i)]² }½ ,

             the sums are over the n-i values remaining after i deletions, and y(i) is the value farthest from x̄(i).

             (Note, the above formulas for x̄(i) and s(i) assume that the data have been renumbered after each
             observation is deleted.)

  STEP 3:    To test if there are r outliers in the data, compute  Rr = |y(r-1) - x̄(r-1)| / s(r-1)  and compare Rr
             to λr in Table A-5 of Appendix A.  If Rr > λr, conclude that there are r outliers.

             First, test if there are r0 outliers (compare Rr0 to λr0).  If not, test if there are r0-1 outliers
             (compare Rr0-1 to λr0-1).  If not, test if there are r0-2 outliers, and continue until either it is
             determined that there are a certain number of outliers or that there are no outliers at all.
EPA QA/G-9                                                                                   Final
QAOO Version                                  4-30                                      July 2000

-------
                            Box 4-20: An Example of Rosner's Test for Outliers

  STEP 1:    Consider the following 32 data points (in ppm) listed in order from smallest to largest: 2.07, 40.55,
             84.15, 88.41, 98.84, 100.54, 115.37, 121.19, 122.08, 125.84, 129.47, 131.90, 149.06, 163.89,
             166.77, 171.91, 178.23, 181.64, 185.47, 187.64, 193.73, 199.74, 209.43, 213.29, 223.14, 225.12,
             232.72, 233.21, 239.97, 251.12, 275.36, and 395.67.

             A normal probability plot of the data shows that there is no reason to suspect that the data (without
             the suspect outliers) are not normally distributed.  In addition, this graph identified four potential
             outliers: 2.07, 40.55, 275.36, and 395.67.  Therefore, Rosner's test will be applied to see if there are
             4 or fewer (r0 = 4) outliers.

  STEP 2:    First the sample mean and sample standard deviation were computed for the entire data set (x̄(0) and
             s(0)).  Using subtraction, it was found that 395.67 was the farthest data point from x̄(0), so y(0) =
             395.67.  Then 395.67 was deleted from the data and the sample mean, x̄(1), and the sample standard
             deviation, s(1), were computed.  Using subtraction, it was found that 2.07 was the farthest value from
             x̄(1).  This value was then dropped from the data and the process was repeated again on 40.55 to
             yield x̄(2), s(2), and y(2) and x̄(3), s(3), and y(3).  These values are summarized below.

                           i      x̄(i)       s(i)
                           0    169.923    75.133
                           1    162.640    63.872
                           2    167.993    57.460
                           3    172.387    53.099

  STEP 3:    To apply Rosner's test, it is first necessary to test if there are 4 outliers by computing

                 R4 = |y(3) - x̄(3)| / s(3) = |275.36 - 172.387| / 53.099 = 1.939

             and comparing R4 to λ4 in Table A-5 of Appendix A with n = 32.  Since R4 = 1.939 < λ4 = 2.89, there
             are not 4 outliers in the data set.  Therefore, it will next be tested if there are 3 outliers by computing

                 R3 = |y(2) - x̄(2)| / s(2) = |40.55 - 167.993| / 57.460 = 2.218

             and comparing R3 to λ3 in Table A-5 with n = 32.  Since R3 = 2.218 < λ3 = 2.91, there are not 3
             outliers in the data set.  Therefore, it will next be tested if there are 2 outliers by computing

                 R2 = |y(1) - x̄(1)| / s(1) = |2.07 - 162.640| / 63.872 = 2.514

             and comparing R2 to λ2 in Table A-5 with n = 32.  Since R2 = 2.514 < λ2 = 2.92, there are not 2
             outliers in the data set.  Therefore, it will next be tested if there is 1 outlier by computing

                 R1 = |y(0) - x̄(0)| / s(0) = |395.67 - 169.923| / 75.133 = 3.005

             and comparing R1 to λ1 in Table A-5 with n = 32.  Since R1 = 3.005 > λ1 = 2.94, there is evidence at a
             5% significance level that there is 1 outlier in the data set.  Therefore, observation 395.67 is a
             statistical outlier and should be further investigated.
standard deviation computed without the r = r0 extreme values.  If this test statistic is greater than
the critical value given in Table A-5 of Appendix A, there are r0 outliers. Otherwise, the test is
performed again without the r = r0 - 1 extreme values.  This process is repeated until either
Rosner's test statistic is greater than the critical value or r = 0.
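The successive-deletion bookkeeping of Rosner's test can be sketched as follows, using the data of Box 4-20; the critical values λr must still be taken from Table A-5, and the sample standard deviation here uses the n-1 divisor consistent with that example.

```python
import math

def rosner_statistics(data, r0):
    """Successive-deletion statistics for Rosner's test (Box 4-19).
    Returns [(R_r, y(r-1)) for r = 1, ..., r0], where at each stage the
    observation farthest from the current mean is removed."""
    x = list(data)
    results = []
    for _ in range(r0):
        m = sum(x) / len(x)
        s = math.sqrt(sum((v - m) ** 2 for v in x) / (len(x) - 1))
        y = max(x, key=lambda v: abs(v - m))   # farthest observation
        results.append((abs(y - m) / s, y))
        x.remove(y)
    return results

data = [2.07, 40.55, 84.15, 88.41, 98.84, 100.54, 115.37, 121.19, 122.08,
        125.84, 129.47, 131.90, 149.06, 163.89, 166.77, 171.91, 178.23,
        181.64, 185.47, 187.64, 193.73, 199.74, 209.43, 213.29, 223.14,
        225.12, 232.72, 233.21, 239.97, 251.12, 275.36, 395.67]
stats = rosner_statistics(data, 4)
# stats[0] is (R1, 395.67): R1 = |395.67 - 169.923|/75.133 = 3.005, which
# exceeds the Table A-5 critical value 2.94, so 1 outlier is indicated.
```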

4.4.6  Walsh's Test

       A nonparametric test was developed by Walsh to detect multiple outliers in a data set.
This test requires a large sample size: n > 220 for a significance level of α = 0.05, and n > 60 for a
significance level of α = 0.10.  However, since the test is a nonparametric test, it may be used
whenever the data are not normally distributed.  Directions for the test by Walsh for large sample
sizes are given in Box 4-21.
                     Box 4-21:  Directions for Walsh's Test for Large Sample Sizes

    Let X(1), X(2), . . . , X(n) represent the data ordered from smallest to largest.  If n < 60, do not apply this
    test.  If 60 < n < 220, then α = 0.10.  If n > 220, then α = 0.05.

    STEP 1:   Identify the number of possible outliers, r.  Note that r can equal 1.

    STEP 2:   Compute c = ⌈√(2n)⌉, k = r + c, b² = 1/α, and

                  a = (1 + b√[(c - b²)/(c - 1)]) / (c - b² - 1)

              where ⌈ ⌉ indicates rounding the value up to the next largest integer (i.e., 3.24 becomes 4).

    STEP 3:   The r smallest points are outliers (with an α% level of significance) if

                  X(r) - (1 + a)X(r+1) + aX(k) < 0

    STEP 4:   The r largest points are outliers (with an α% level of significance) if

                  X(n+1-r) - (1 + a)X(n-r) + aX(n+1-k) > 0

    STEP 5:   If both of the inequalities are true, then both small and large outliers are indicated.
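The steps in Box 4-21 can be sketched in Python. The function below is illustrative; the treatment of the boundary sample sizes (n = 60, n = 220) follows the wording of the box and is noted as an assumption in the comments:

```python
import math

def walsh_test(data, r):
    """Walsh's nonparametric outlier test, large-sample version (Box 4-21).

    Returns (small_outliers, large_outliers): whether the r smallest and
    the r largest ordered values are flagged as outliers.
    """
    x = sorted(data)
    n = len(x)
    if n <= 60:
        raise ValueError("Walsh's test should not be applied when n <= 60")
    alpha = 0.10 if n <= 220 else 0.05      # significance level is fixed by n
    c = math.ceil(math.sqrt(2 * n))         # ceil(sqrt(2n))
    k = r + c
    b2 = 1.0 / alpha
    a = (1 + math.sqrt(b2 * (c - b2) / (c - 1))) / (c - b2 - 1)
    # STEP 3: 1-indexed X(r), X(r+1), X(k) map to x[r-1], x[r], x[k-1]
    small = x[r - 1] - (1 + a) * x[r] + a * x[k - 1] < 0
    # STEP 4: 1-indexed X(n+1-r), X(n-r), X(n+1-k) map to x[n-r], x[n-r-1], x[n-k]
    large = x[n - r] - (1 + a) * x[n - r - 1] + a * x[n - k] > 0
    return small, large

# hypothetical sample: 100 points with one gross high value
small, large = walsh_test([float(i) for i in range(99)] + [1000.0], r=1)
```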
4.4.7  Multivariate Outliers

       Multivariate analysis, such as factor analysis and principal components analysis, involves
the analysis of several variables simultaneously.  Outliers in multivariate analysis are values
that are extreme with respect to one or more variables.  As the number of variables
increases, identifying potential outliers using graphical representations becomes  more difficult.  In
addition, special procedures are required to test for multivariate outliers. Details of these
procedures are beyond the scope of this guidance.  However, procedures for testing for
multivariate outliers are contained in statistical textbooks on multivariate analysis.
4.5    TESTS FOR DISPERSIONS

       Many statistical tests make assumptions about the dispersion (as measured by variance) of
data; this section considers some of the most commonly used statistical tests for variance
assumptions.  Section 4.5.1  contains the methodology for constructing a confidence interval for a
single variance estimate from a sample.  Section 4.5.2 deals with the equality of two variances, a
key assumption for the validity of a two-sample t-test.  Section 4.5.3 describes Bartlett's test and
Section 4.5.4 describes Levene's test. These two tests verify the assumption that two or more
variances are equal, a requirement for a standard two-sample t-test, for example. The analyst
should be aware that many statistical tests only require the assumption of approximate equality
and that many of these tests remain valid unless gross inequality in variances is determined.

4.5.1   Confidence Intervals for a Single Variance

       This section discusses confidence intervals for a single variance or standard deviation for
analysts interested in the precision of variance estimates. This information may be necessary for
performing a  sensitivity analysis of the statistical test or analysis method.  The method described
in Box 4-22 can be used to find a two-sided 100(1-α)% confidence interval.  The upper end point
of a two-sided 100(1-α)% confidence interval is a 100(1-α/2)% upper confidence limit, and the
lower end point of a two-sided 100(1-α)% confidence interval is a 100(1-α/2)% lower confidence
limit.  For example, the upper end point of a 90% confidence interval is a 95% upper confidence
limit and the lower end point is a 95% lower confidence limit.  Since the standard deviation is the
square root of the variance,  a confidence interval for the variance can be converted to a
confidence  interval for the standard deviation by taking the square roots of the endpoints of the
interval. This confidence interval assumes that the data constitute a random sample from a
normally distributed population and can be highly sensitive to outliers and to departures from
normality.

4.5.2   The F-Test  for the Equality of Two Variances

       An F-test may be used to test whether the true underlying variances of two populations
are equal.  Usually the F-test is employed as a preliminary test, before conducting the two-sample
t-test for the equality of two means. The assumptions underlying the F-test are that the two
samples are independent random samples from two underlying normal populations. The F-test for
equality of variances is highly sensitive to departures from normality.  Directions for implementing
an F-test with an example are given in Box 4-23.

4.5.3   Bartlett's Test for the Equality of Two or  More Variances

       Bartlett's test is a means of testing whether two or more population variances  of normal
distributions are equal.  In the case of only two  variances, Bartlett's test is equivalent to the F-test.
Often in practice unequal variances and non-normality occur together, and Bartlett's test is itself
sensitive
                     Box 4-22: Directions for Constructing Confidence Intervals and
           Confidence Limits for the Sample Variance and Standard Deviation with an Example

  Directions: Let X1, X2, . . . , Xn represent the n data points.

  STEP 1:  Calculate the sample variance s² (Section 2.2.3).

  STEP 2:  For a 100(1-α)% two-sided confidence interval, use Table A-8 of Appendix A to find the cutoffs L and
           U such that L = χ²(α/2) and U = χ²(1-α/2) with (n-1) degrees of freedom (dof).

  STEP 3:  A 100(1-α)% confidence interval for the true underlying variance is:  (n-1)s²/L  to  (n-1)s²/U.

           A 100(1-α)% confidence interval for the true standard deviation is:  √[(n-1)s²/L]  to  √[(n-1)s²/U].

  Example: Ten samples were analyzed for lead: 46.4, 46.1, 45.8, 47, 46.1, 45.9, 45.8, 46.9, 45.2, 46 ppb.

  STEP 1:  Using Section 2.2.3, s² = 0.286.

  STEP 2:  Using Table A-8 of Appendix A and 9 dof, L = χ²(.05/2) = χ²(.025) = 19.02 and U = χ²(1-.05/2) = χ²(.975) = 2.70.

  STEP 3:  A 95% confidence interval for the variance is:  (10-1)0.286/19.02  to  (10-1)0.286/2.70, i.e., 0.135 to 0.954.

           A 95% confidence interval for the standard deviation is:  √0.135 = 0.368  to  √0.954 = 0.977.
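As an illustration, the computation in Box 4-22 can be scripted; the chi-square cutoffs are still read from Table A-8 and passed in (the function name is hypothetical):

```python
def variance_ci(data, chi2_l, chi2_u):
    """Two-sided confidence interval for the true variance (Box 4-22 sketch).

    chi2_l is the tabulated chi-square value with upper tail area alpha/2, and
    chi2_u the value with upper tail area 1 - alpha/2, both with n - 1 degrees
    of freedom (Table A-8 of Appendix A).
    """
    n = len(data)
    xbar = sum(data) / n
    s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)   # sample variance
    return (n - 1) * s2 / chi2_l, (n - 1) * s2 / chi2_u

# lead example: 95% interval, 9 dof, L = 19.02 and U = 2.70 from Table A-8
lead = [46.4, 46.1, 45.8, 47, 46.1, 45.9, 45.8, 46.9, 45.2, 46]
low, high = variance_ci(lead, 19.02, 2.70)   # about 0.135 to 0.954
```

Taking the square roots of the two endpoints gives the corresponding interval for the standard deviation.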
                       Box 4-23:  Directions for Calculating an F-Test to Compare
                                    Two Variances with an Example

     Directions: Let X1, X2, . . . , Xm represent the m data points from population 1 and Y1, Y2, . . . , Yn
     represent the n data points from population 2.  To perform an F-test, proceed as follows.

     STEP 1:    Calculate the sample variances sx2 (for the X's) and sY2 (for the Y's ) (Section 2.2.3).

     STEP 2:    Calculate the variance ratios Fx =  sx2/sY2 and FY = sY2/sx2.  Let F equal the larger of these two
                values. If F = Fx, then let k = m -1 and  q = n -1.  If F = Fy, then let k = n -1 and q = m -1.

     STEP 3:    Using Table A-9 of Appendix A of the F distribution, find the cutoff U = f(1-α/2)(k, q).  If F > U,
                conclude that the variances of the two populations are not the same.

     Example: Manganese concentrations were collected from 2 wells.  The data are Well X:  50, 73, 244,
     202 ppm; and Well Y:  272, 171, 32,  250, 53 ppm. An  F-test will be used to test if the variances are
     equal.

     STEP1:    For Well X, sx2 = 9076.  For Well Y, sY2  = 12125.
     STEP 2:    Fx = sx2/sY2 = 9076 / 12125 = 0.749.  FY = sY2/sx2 = 12125 / 9076 = 1.336.  Since FY > Fx, F =
                FY = 1.336, k = 5 - 1 = 4 and q = 4 - 1 = 3.

     STEP 3:    Using Table A-9 of Appendix A of the F distribution with α = 0.05, U = f(.975)(4, 3) = 15.1.
                Since 1.336 < 15.1, there is no evidence that the variability of the two wells is different.
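A sketch of the Box 4-23 procedure in Python; the tabulated cutoff from Table A-9 is supplied by the user, and the function name is illustrative:

```python
from statistics import variance

def f_test(x, y, cutoff):
    """F-test for the equality of two variances (Box 4-23 sketch).

    `cutoff` is the tabulated value f_(1-alpha/2)(k, q) from Table A-9.
    Returns (F, k, q, reject).
    """
    sx2, sy2 = variance(x), variance(y)
    if sx2 >= sy2:                       # F is the larger of the two ratios
        F, k, q = sx2 / sy2, len(x) - 1, len(y) - 1
    else:
        F, k, q = sy2 / sx2, len(y) - 1, len(x) - 1
    return F, k, q, F > cutoff

# manganese example: f_(.975)(4, 3) = 15.1 from Table A-9
F, k, q, reject = f_test([50, 73, 244, 202], [272, 171, 32, 250, 53], cutoff=15.1)
```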
to departures from normality.  With long-tailed distributions, the test too often rejects equality
(homogeneity) of the variances.

       Bartlett's test requires the calculation of the variance for each sample, then calculation of a
statistic associated with the logarithm of these variances. This statistic is compared to tables and
if it exceeds the tabulated value, the conclusion is that the variances differ as a complete set. It
does not mean that one is significantly different from the others, nor that one or more are larger
(smaller) than the rest.  It simply implies the variances are unequal as a group. Directions for
Bartlett's test are given in Box 4-24 and an example is given in Box 4-25.

4.5.4  Levene's Test for the Equality of Two or More Variances

        Levene's test provides an alternative to Bartlett's test for homogeneity of variance (testing
for differences among the dispersions of several groups). Levene's test is less sensitive to
departures from normality than Bartlett's test and has greater power than Bartlett's for non-normal
data.  In addition, Levene's test has power nearly as great as Bartlett's test for normally distributed
data.  However, Levene's test is more difficult to apply than Bartlett's test since it involves
applying an analysis of variance (ANOVA) to the absolute deviations from the group means.
Directions and an example of Levene's test are contained in Box 4-26 and Box 4-27, respectively.
                              Box 4-24:  Directions for Bartlett's Test

    Consider k groups with a sample size of n_i for each group.  Let N represent the total number of samples,
    i.e., let N = n1 + n2 + . . . + nk.  For example, consider two wells where 4 samples have been taken from
    well 1 and 3 samples have been taken from well 2.  In this case, k = 2, n1 = 4, n2 = 3, and N = 4 + 3 = 7.

    STEP 1:    For each of the k groups, calculate the sample variances, s_i² (Section 2.2.3).

    STEP 2:    Compute the pooled variance across groups:  s_p² = [1/(N-k)] Σ(i=1..k) (n_i - 1) s_i²

    STEP 3:    Compute the test statistic:  TS = (N - k) ln(s_p²) - Σ(i=1..k) (n_i - 1) ln(s_i²)

               where "ln" stands for natural logarithms.

    STEP 4:    Using a chi-squared table (Table A-8  of Appendix A), find the critical value for x2 with (k-1)
               degrees of freedom at a predetermined significance level. For example, for a significance
               level of 5% and 5 degrees of freedom, x2 = 11.1. If the calculated value (TS) is greater than
               the tabulated value, conclude that the variances are not equal at that significance level.
                              Box 4-25: An Example of Bartlett's Test

  Manganese concentrations were collected from 6 wells over a 4 month period.  The data are shown in the
  following table.  Before analyzing the data, it is important to determine if the variances of the six wells
  are equal.  Bartlett's test will be used to make this determination.

  STEP 1:  For each of the 6 wells, the sample sizes, means, and variances were calculated.  These are
           shown in the table below (N = 17).

                       Well 1      Well 2     Well 3      Well 4     Well 5    Well 6
    Data (ppm)         50, 73,     46, 77     272, 171,   34, 3940   48, 54    68, 991,
                       244, 202               32, 53                           54
    n_i                4           2          4           2          2         3
    x̄_i                142.25      61.50      132.00      1987.00    51.00     371.00
    s_i²               9076.37     480.49     12455       7628243    17.98     288348

  STEP 2:  s_p² = [1/(N-k)] Σ (n_i - 1) s_i² = [1/(17-6)] [(4-1)9076 + . . . + (3-1)288348] = 751837.27

  STEP 3:  TS = (17-6) ln(751837.27) - [(4-1)ln(9076) + . . . + (3-1)ln(288348)] = 43.16

  STEP 4:  The critical χ² value with 6 - 1 = 5 degrees of freedom at the 5% significance level is 11.1 (from
           Table A-8 of Appendix A).  Since 43.16 is larger than 11.1, it is concluded that the six variances
           (s1², . . . , s6²) are not homogeneous at the 5% significance level.
                                 Box 4-26: Directions for Levene's Test

  Consider k groups with a sample size of n_i for the ith group.  Let N represent the total number of samples, i.e.,
  let N = n1 + n2 + . . . + nk.  For example, consider two wells where 4 samples have been taken from well 1 and 3
  samples have been taken from well 2.  In this case, k = 2, n1 = 4, n2 = 3, and N = 4 + 3 = 7.

  STEP 1:   For each of the k groups, calculate the group mean, x̄_i (Section 2.2.2), i.e., calculate:

                x̄1 = (1/n1) Σ(j) X1j,   x̄2 = (1/n2) Σ(j) X2j,   . . . ,   x̄k = (1/nk) Σ(j) Xkj.

  STEP 2:   Compute the absolute residuals z_ij = |X_ij - x̄_i|, where X_ij represents the jth value of the ith
            group.  For each of the k groups, calculate the means, z̄_i, of these residuals, i.e., calculate:

                z̄1 = (1/n1) Σ(j) z1j,   z̄2 = (1/n2) Σ(j) z2j,   . . . ,   z̄k = (1/nk) Σ(j) zkj.

            Also calculate the overall mean residual as  z̄ = (1/N) Σ(i) Σ(j) z_ij = (1/N) Σ(i) n_i z̄_i.

  STEP 3:   Compute the following sums of squares for the absolute residuals:

                SS_TOTAL = Σ(i) Σ(j) z_ij² - N z̄²,   SS_GROUPS = Σ(i) n_i z̄_i² - N z̄²,   SS_ERROR = SS_TOTAL - SS_GROUPS.

  STEP 4:   Compute  f = [SS_GROUPS / (k-1)] / [SS_ERROR / (N-k)]


  STEP 5:   Using Table A-9 of Appendix A, find the critical value of the F-distribution with (k-1) numerator
            degrees of freedom,  (N-k) denominator degrees of freedom, and a desired  level of significance (a).
            For example, if a = 0.05, the numerator degrees of freedom is 5, and the denominator degrees of
            freedom is 18, then using Table A-9, F = 2.77. Iff is greater than F, reject  the assumptions of equal
            variances.
                              Box 4-27: An Example of Levene's Test

  Four months of data on arsenic concentration were collected from six wells at a Superfund site.  This data set is
  shown in the table below.  Before analyzing this data, it is important to determine if the variances of the six wells
  are equal.  Levene's test will be used to make this determination.

  STEP 1:  The group mean for each well (x̄_i) is shown in the last row of the table below.

                                Arsenic Concentration (ppm)
    Month      Well 1     Well 2     Well 3     Well 4     Well 5     Well 6
      1         22.90       2.00       2.0        7.84      24.90       0.34
      2          3.09       1.25     109.4        9.30       1.30       4.78
      3         35.70       7.80       4.5       25.90       0.75       2.85
      4          4.18      52.00       2.5        2.00      27.00       1.20
    Group
    Means     x̄1=16.47   x̄2=15.76   x̄3=29.6    x̄4=11.26   x̄5=13.49   x̄6=2.29

  STEP 2:  To compute the absolute residuals z_ij in each well, the value 16.47 will be subtracted from Well 1
           data, 15.76 from Well 2 data, 29.6 from Well 3 data, 11.26 from Well 4 data, 13.49 from Well 5
           data, and 2.29 from Well 6 data.  The resulting values are shown in the following table with the new
           well means (z̄_i) and the total mean z̄.

                            Residual Arsenic Concentration (ppm)
    Month      Well 1     Well 2     Well 3     Well 4     Well 5     Well 6
      1          6.43      13.76      27.6        3.42      11.41       1.95
      2         13.38      14.51      79.8        1.96      12.19       2.49
      3         19.23       7.96      25.1       14.64      12.74       0.56
      4         12.29      36.24      27.1        9.26      13.51       1.09
    Residual
    Means     z̄1=12.83   z̄2=18.12   z̄3=39.9    z̄4=7.32    z̄5=12.46   z̄6=1.52

           Total Residual Mean  z̄ = (1/6)(12.83 + 18.12 + 39.9 + 7.32 + 12.46 + 1.52) = 15.36

  STEP 3:  The sums of squares are:  SS_TOTAL = 6300.89, SS_WELLS = 3522.90, and SS_ERROR = 2777.99.

  STEP 4:  f = [SS_WELLS / (k-1)] / [SS_ERROR / (N-k)] = [3522.90 / (6-1)] / [2777.99 / (24-6)] = 4.56

  STEP 5:  Using Table A-9 of Appendix A, the F statistic for 5 and 18 degrees of freedom with α = 0.05 is 2.77.
           Since f = 4.56 exceeds F(.05) = 2.77, the assumption of equal variances should be rejected.
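The Levene computation of Boxes 4-26 and 4-27 can be sketched as below (illustrative names; recomputing from the raw data gives an f of about 4.57, matching the printed 4.56 up to rounding):

```python
def levene_f(groups):
    """Levene's f statistic: a one-way ANOVA on absolute residuals (Box 4-26)."""
    k = len(groups)
    N = sum(len(g) for g in groups)
    # STEP 2: absolute residuals from each group mean, and their means
    z = [[abs(x - sum(g) / len(g)) for x in g] for g in groups]
    zbar_i = [sum(gi) / len(gi) for gi in z]
    zbar = sum(sum(gi) for gi in z) / N
    # STEP 3: sums of squares for the absolute residuals
    ss_total = sum(v * v for gi in z for v in gi) - N * zbar ** 2
    ss_groups = sum(len(gi) * zi ** 2 for gi, zi in zip(z, zbar_i)) - N * zbar ** 2
    ss_error = ss_total - ss_groups
    # STEP 4
    return (ss_groups / (k - 1)) / (ss_error / (N - k))

# arsenic data from Box 4-27, one list per well
arsenic = [[22.90, 3.09, 35.70, 4.18], [2.00, 1.25, 7.80, 52.00],
           [2.0, 109.4, 4.5, 2.5], [7.84, 9.30, 25.90, 2.00],
           [24.90, 1.30, 0.75, 27.00], [0.34, 4.78, 2.85, 1.20]]
f = levene_f(arsenic)   # about 4.57; exceeds F = 2.77, so equal variances are rejected
```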
4.6    TRANSFORMATIONS

       Most statistical tests and procedures contain assumptions about the data to which they will
be applied. For example, some common assumptions are that the data are normally distributed;
variance components of a statistical model are additive; two independent data sets have equal
variance; and a data set has no trends over time or space. If the data do not satisfy such
assumptions, then the results of a statistical procedure or test may be biased or incorrect.
Fortunately, data that do not satisfy statistical assumptions may often be converted or transformed
mathematically into a form that allows standard statistical tests to perform adequately.

4.6.1   Types of Data Transformations

       Any mathematical function that is applied to every point in a data set is called a
transformation. Some commonly used transformations include:

       Logarithmic (Log X or Ln X):  This transformation may be used when the original
       measurement data follow a lognormal distribution or when the variance at each level of the
       data is proportional to the square of the mean of the data points  at that level.  For
       example, if the variance of data collected around 50 ppm is approximately 250, but the
       variance of data collected around 100 ppm is approximately 1000, then a logarithmic
       transformation may be useful. This situation is often characterized by having a constant
       coefficient of variation (ratio of standard deviation to mean) over all possible data values.

       The logarithmic base (for example, either natural or base 10) needs to be consistent
       throughout the analysis.  If some of the original values are zero,  it is customary to add a
       small quantity to make the data value non-zero as the logarithm  of zero does not exist.
       The size of the small quantity depends on the magnitude of the non-zero data and the
       consequences of potentially erroneous inference from the resulting transformed data.  As a
       working point, a value of one tenth the smallest non-zero value could be selected.  It does
       not matter whether a natural (In) or base 10 (log) transformation is used because the two
       transformations are related by the expression ln(X) = 2.303 log(X). Directions for
       applying a logarithmic transformation with an example are given in Box 4-28.

       Square Root (√X):  This transformation may be used when dealing with small whole
       numbers, such as bacteriological counts, or the occurrence of rare events, such as
       violations of a standard over the course of a year.  The underlying assumption is that the
       original data follow a Poisson-like distribution in which case the mean and variance of the
       data are equal.  It should be noted that the square root transformation overcorrects when
       very small values and zeros appear in the original data.  In these cases, √(X + 1) is often
       used as a transformation.
                       Box 4-28: Directions for Transforming Data and an Example

     Let X.,, X2, . . . , Xn represent the n data points.  To apply a transformation, simply apply the transforming
     function to each data point. When a transformation is implemented to make the data satisfy some
     statistical assumption, it will need to be verified that the transformed data satisfy this assumption.

     Example:  Transforming Lognormal Data

     A logarithmic transformation is particularly useful for pollution data.  Pollution data are often skewed,
     thus the  log-transformed data will tend to be symmetric.  Consider the data set shown below with 15 data
     points. The frequency plot of this data (below) shows that the data are possibly lognormally distributed.
     If any analysis performed with this data assumes normality, then the data may be logarithmically
     transformed to achieve normality.  The transformed data are shown in column 2.  A frequency plot of the
     transformed data (below) shows that the transformed data appear to be normally distributed.
             Observed (X)    Transformed (ln X)          Observed (X)    Transformed (ln X)
                 0.22             -1.51                      0.47             -0.76
                 3.48              1.25                      0.67             -0.40
                 6.67              1.90                      0.75             -0.29
                 2.53              0.93                      0.60             -0.51
                 1.11              0.10                      0.99             -0.01
                 0.33             -1.11                      0.90             -0.11
                 1.64              0.50                      0.26             -1.35
                 1.37              0.31

     [Frequency plots of the observed and transformed values appear here: the observed values are
     right-skewed, while the transformed values appear approximately normally distributed.]
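As a minimal sketch, the transformation in Box 4-28 is a one-liner; comparing the mean and median of the transformed values is one quick, informal check of the resulting symmetry:

```python
import math

# data from Box 4-28 (possibly lognormal)
observed = [0.22, 3.48, 6.67, 2.53, 1.11, 0.33, 1.64, 1.37,
            0.47, 0.67, 0.75, 0.60, 0.99, 0.90, 0.26]
transformed = [math.log(x) for x in observed]   # natural-log transform

# for roughly symmetric data, the mean and median nearly coincide
mean_t = sum(transformed) / len(transformed)
median_t = sorted(transformed)[len(transformed) // 2]
```

Any analysis that assumes normality would then be carried out on `transformed`, not on the original values.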
       Inverse Sine (Arcsine X):  This transformation may be used for binomial proportions
       based on count data to achieve stability in variance.  The resulting transformed data are
       expressed in angular units (degrees or radians).  Special tables must be used to transform
       the proportions into degrees.

       Box-Cox Transformations: This transformation is a complex power transformation that
       takes the original data and raises each data observation to the power lambda (A).  A
       logarithmic transformation is a special case of the Box-Cox transformation. The rationale
       is to find A such that the transformed data have the best possible additive model for the
       variance  structure, the errors are normally distributed, and the variance is as constant as
       possible over all possible concentration values.  The Maximum Likelihood  technique is
       used to find A such that the residual error from fitting the theorized model is minimized.
       In practice, the exact value of A is often rounded to a convenient value for  ease in
       interpretation (for example, A = -1.1 would be rounded to -1 as it would then have the
       interpretation of a reciprocal transform). One of the drawbacks of the Box-Cox
       transformation is the difficulty in physically interpreting the transformed data.
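A simple grid-search sketch of the Box-Cox fit, assuming a plain normal profile log-likelihood rather than the fuller model-based fit the text describes (all names are illustrative, and data must be positive):

```python
import math

def boxcox(x, lam):
    """Box-Cox power transform; lam = 0 reduces to the natural logarithm."""
    return [math.log(v) if lam == 0 else (v ** lam - 1) / lam for v in x]

def boxcox_loglik(x, lam):
    """Profile log-likelihood of lambda under a normal model (up to a constant)."""
    y = boxcox(x, lam)
    n = len(y)
    m = sum(y) / n
    var = sum((v - m) ** 2 for v in y) / n
    return -0.5 * n * math.log(var) + (lam - 1) * sum(math.log(v) for v in x)

def best_lambda(x, grid=None):
    """Pick lambda from a coarse grid; in practice the value is then rounded
    to something interpretable (e.g., -1, 0, 0.5, 1)."""
    grid = grid or [i / 10 for i in range(-20, 21)]
    return max(grid, key=lambda lam: boxcox_loglik(x, lam))

# Box 4-28 data; for lognormal-looking data, lambda tends to fall near 0
data = [0.22, 3.48, 6.67, 2.53, 1.11, 0.33, 1.64, 1.37,
        0.47, 0.67, 0.75, 0.60, 0.99, 0.90, 0.26]
lam = best_lambda(data)
```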

4.6.2  Reasons for Transforming Data

       By transforming the data, assumptions that are not satisfied in the original data can be
satisfied by the transformed data.  For instance, a right-skewed distribution can be transformed to
be approximately Gaussian (normal) by using a logarithmic or square-root transformation. Then
the normal-theory procedures can be applied to the transformed data.  If data are lognormally
distributed, then apply procedures to logarithms of the data. However, selecting the correct
transformation may be difficult.  If standard transformations do not apply,  it is suggested that the
data user consult a statistician.

       Another important use of transformations is in the interpretation of data collected under
conditions leading to an Analysis of Variance (ANOVA).  Some of the key assumptions needed
for analysis (for example, additivity of variance components) may only be satisfied if the data are
transformed suitably.  The selection of a suitable transformation depends on the structure of the
data collection design; however, the interpretation of the transformed data remains an issue.

       While transformations are useful for dealing with data that do not satisfy statistical
assumptions, they can also be used for various other purposes. For example, transformations are
useful for consolidating data that may be spread out or that have several extreme values.  In
addition, transformations can be used to derive a linear relationship between two variables, so that
linear regression analysis can be applied. They can also be used to efficiently estimate quantities
such as the mean and variance of a lognormal distribution. Transformations may also make the
analysis of data easier by changing the scale into one that is more familiar or easier to work with.

       Once the data have been transformed, all statistical analysis must be performed on the
transformed data. No attempt should be made to transform the data back to the original form
because this can lead to biased estimates. For example, estimating quantities such  as means,

variances, confidence limits, and regression coefficients in the transformed scale typically leads to
biased estimates when transformed back into original scale. However, it may be difficult to
understand or apply results of statistical analysis expressed in the transformed scale.  Therefore, if
the transformed data do not give noticeable benefits to the analysis, it is better to use the original
data.  There is no point in working with transformed data unless it adds value to the analysis.

4.7    VALUES BELOW DETECTION  LIMITS

       Data generated from chemical analysis may fall below the detection limit (DL) of the
analytical procedure. These measurement data are generally described as not detected, or
nondetects, (rather than as zero or not present) and the appropriate limit of detection is usually
reported.  In cases where measurement data are described as not detected, the concentration of
the chemical is unknown although it lies somewhere between zero and the detection limit.  Data
that include both detected and non-detected results are called censored data in the statistical
literature.

       There are a variety of ways to evaluate data that include values below the detection limit.
However, there are no general procedures that are applicable in all cases.  Some general
guidelines are presented in Table 4-4. Although these guidelines are usually adequate, they should
be implemented cautiously.
                 Table 4-4. Guidelines for Analyzing Data with Nondetects

    Percentage of Nondetects    Section    Statistical Analysis Method
    ------------------------    -------    ----------------------------------------
    < 15%                       4.7.1      Replace nondetects with DL/2, DL, or a
                                           very small number.
    15% - 50%                   4.7.2      Trimmed mean, Cohen's adjustment,
                                           Winsorized mean and standard deviation.
    > 50% - 90%                 4.7.3      Use tests for proportions (Section 3.2.2).
       All of the suggested procedures for analyzing data with nondetects depend on the amount
of data below the detection limit. For relatively small amounts below detection limit values,
replacing the nondetects with a small number and proceeding with the usual analysis may be
satisfactory. For moderate amounts of data below the detection limit, a more detailed adjustment
is appropriate. In situations where relatively large amounts of data below the detection limit exist,
one may need only to consider whether the chemical was detected above some level or not.
The interpretation of small, moderate, and large amounts of data below the DL is subjective.
Table 4-4 provides percentages to assist the user in evaluating their particular situation.
However, it should be recognized that these percentages are not hard and fast rules, but should be
based on judgement.

       In addition to the percentage of samples below the detection limit, sample size influences
which procedures should be used to evaluate the data. For example, the case where 1 sample out
of 4 is not detected should be treated differently from the case where 25 samples out of 100 are
not detected.  Therefore, this guidance suggests that the data analyst consult a statistician for the
most appropriate way to evaluate data containing values below the detection level.

4.7.1   Less than 15% Nondetects - Substitution Methods

       If a small proportion of the observations are not detected, these may be replaced with a
small  number, usually the detection limit divided by 2 (DL/2), and the usual analysis performed.
As a guideline, if 15% or fewer of the values are not detected, replace them with the method
detection limit divided by two and proceed with the appropriate analysis using these modified
values.  If simple substitution of values below the detection limit is proposed when more than
15% of the values are reported as not detected, consider using nonparametric methods or a test of
proportions to analyze the data.  If a more accurate method is to be considered, see Cohen's
Method (Section 4.7.2.1).
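A minimal sketch of the substitution rule, assuming nondetects are coded as None (the detection limit and data here are hypothetical):

```python
DL = 1.0                          # hypothetical method detection limit
# one nondetect (None) out of ten reported results
reported = [2.3, 1.7, None, 4.1, 3.3, 2.8, 1.9, 2.2, 3.0, 2.6]

nondetect_rate = reported.count(None) / len(reported)
if nondetect_rate <= 0.15:
    # simple substitution: replace each nondetect with DL/2
    filled = [v if v is not None else DL / 2 for v in reported]
else:
    ...  # use Cohen's adjustment (Section 4.7.2.1) or a test of proportions
```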

4.7.2   Between 15-50% Nondetects

       4.7.2.1        Cohen's Method

       Cohen's method provides adjusted estimates of the sample mean and standard deviation
that account for data below the detection level.  The adjusted estimates are based on the
statistical technique of maximum likelihood estimation of the mean and variance, which accounts
for the fact that the nondetects lie below the limit of detection but may not be zero.  The
adjusted mean and standard deviation can then be used in the parametric tests described in
Chapter 3 (e.g., the one sample t-test  of Section 3.2.1).  However, if more than 50% of the
observations are not detected,  Cohen's method should not be used. In addition, this method
requires that the data without the  nondetects be normally distributed and the detection limit is
always the  same.  Directions for Cohen's method are contained in Box 4-29; an example is given
in Box 4-30.
                               Box 4-29:  Directions for Cohen's Method

  Let X1, X2, . . . , Xn represent the n data points with the first m values representing the data points above the
  detection limit (DL).  Thus, there are (n - m) data points below the DL.

  STEP 1:   Compute the sample mean x̄d from the data above the detection limit:

                x̄d = (1/m) Σ(i=1..m) Xi

  STEP 2:   Compute the sample variance sd² from the data above the detection limit:

                sd² = [1/(m - 1)] Σ(i=1..m) (Xi - x̄d)²

  STEP 3:   Compute  h = (n - m)/n  and  γ = sd² / (x̄d - DL)²

  STEP 4:   Use h and γ in Table A-10 of Appendix A to determine λ̂.  For example, if h = 0.4 and γ = 0.30, then
            λ̂ = 0.6713.  If the exact values of h and γ do not appear in the table, use double linear interpolation
            (Box 4-31) to estimate λ̂.

  STEP 5:   Estimate the corrected sample mean, x̄, and sample variance, s², to account for the data below the
            detection limit, as follows:  x̄ = x̄d - λ̂(x̄d - DL)  and  s² = sd² + λ̂(x̄d - DL)².
                               Box 4-30: An Example of Cohen's Method

  Sulfate concentrations were measured for 24 data points. The detection limit was 1,450 mg/L and 3 of the 24
  values were below the detection level. The 24 values are 1850, 1760, < 1450 (ND), 1710, 1575, 1475, 1780,
  1790, 1780, < 1450 (ND), 1790, 1800, < 1450 (ND), 1800, 1840, 1820, 1860, 1780, 1760, 1800, 1900, 1770,
  1790, 1780 mg/L.  Cohen's Method will be used to adjust the sample mean for use in a t-test to determine if the
  mean is greater than 1600 mg/L.

  STEP 1:     The sample mean of the m = 21 values above the detection level is x̄d = 1771.9.

  STEP 2:     The sample variance of the 21 quantified values is s²d = 8593.69.

  STEP 3:     h = (24 - 21)/24 = 0.125  and  γ = 8593.69/(1771.9 - 1450)² = 0.083

  STEP 4:     Table A-10 of Appendix A was used for h = 0.125 and γ = 0.083 to find the value of λ.  Since the
              table does not contain these entries exactly, double linear interpolation was used to estimate
              λ = 0.149839 (see Box 4-31).

  STEP 5:     The adjusted sample mean and variance are then estimated as follows:

                  x̄ = 1771.9 - 0.149839(1771.9 - 1450) = 1723.67  and

                  s² = 8593.69 + 0.149839(1771.9 - 1450)² = 24119.95
                               Box 4-31:  Double Linear Interpolation

    The details of the double linear interpolation are provided to assist in the use of Table A-10 of Appendix
    A.  The desired value for λ corresponds to γ = 0.083 and h = 0.125 from Box 4-30, Step 3.  The values
    from Table A-10 for interpolation are:

                   γ          h = 0.10            h = 0.15
                   0.05       0.11431             0.17925
                   0.10       0.11804             0.18479

    There are 0.05 units between 0.10 and 0.15 on the h-scale and 0.025 units between 0.10 and 0.125.
    Therefore, the value of interest lies (0.025/0.05)100% = 50% of the distance along the interval between
    0.10 and 0.15.  To linearly interpolate between tabulated values on the h axis for γ = 0.05, the range
    between the values must be calculated, 0.17925 - 0.11431 = 0.06494; the value that is 50% of the
    distance along the range must be computed, 0.06494 × 0.50 = 0.03247; and then that value must be
    added to the lower point on the tabulated values, 0.11431 + 0.03247 = 0.14678.  Similarly for γ = 0.10,
    0.18479 - 0.11804 = 0.06675, 0.06675 × 0.50 = 0.033375, and 0.11804 + 0.033375 = 0.151415.

    On the γ-axis there are 0.033 units between 0.05 and 0.083 and there are 0.05 units between 0.05 and
    0.10.  The value of interest (0.083) lies (0.033/0.05 × 100) = 66% of the distance along the interval
    between 0.05 and 0.10, so 0.151415 - 0.14678 = 0.004635, 0.004635 × 0.66 = 0.003059.  Therefore,

                                λ = 0.14678 + 0.003059 = 0.149839.
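The computations of Boxes 4-29 through 4-31 can be sketched in Python (an illustration only, not part of the guidance; the function names are ours, and the four tabulated λ values are those quoted above from Table A-10):

```python
def interpolate2d(h, g, h_lo, h_hi, g_lo, g_hi, table):
    """Double linear interpolation in a 2x2 corner of Table A-10.

    table[(gamma, h)] holds the tabulated lambda values.
    """
    frac_h = (h - h_lo) / (h_hi - h_lo)
    frac_g = (g - g_lo) / (g_hi - g_lo)
    low = table[(g_lo, h_lo)] + frac_h * (table[(g_lo, h_hi)] - table[(g_lo, h_lo)])
    high = table[(g_hi, h_lo)] + frac_h * (table[(g_hi, h_hi)] - table[(g_hi, h_lo)])
    return low + frac_g * (high - low)

def cohen_adjust(mean_d, var_d, dl, lam):
    """Step 5 of Box 4-29: mean and variance adjusted for nondetects."""
    return mean_d - lam * (mean_d - dl), var_d + lam * (mean_d - dl) ** 2

# Values quoted in Boxes 4-30 and 4-31
table = {(0.05, 0.10): 0.11431, (0.05, 0.15): 0.17925,
         (0.10, 0.10): 0.11804, (0.10, 0.15): 0.18479}
lam = interpolate2d(0.125, 0.083, 0.10, 0.15, 0.05, 0.10, table)
mean, var = cohen_adjust(1771.9, 8593.69, 1450, lam)
# lam ~ 0.1498, mean ~ 1723.67, var ~ 24120 (cf. Box 4-30, to rounding)
```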
       4.7.2.2        Trimmed Mean

       Trimming discards the data in the tails of a data set in order to develop an unbiased
estimate of the population mean. For environmental data, nondetects usually occur in the left tail
of the data so trimming the data can be used to adjust the data set to account for nondetects when
estimating a mean.  Developing a 100p% trimmed mean involves trimming p% of the data in both
the lower and the upper tail.  Note that p must be between 0 and 0.5 since p represents the portion
deleted in each tail.  After np of the largest values and np of the smallest values are trimmed,
there are n(1-2p) data values remaining.  Therefore, the proportion trimmed depends on the total
sample size (n) since a reasonable number of samples must remain for analysis.  For approximately
symmetric distributions, a 25% trimmed mean (the midmean) is a good estimator of the population
mean.  However, environmental data are often skewed (non-symmetric), and in these cases a 15%
trimmed mean may be a good estimator of the population mean.  It is also possible to trim only
enough data to replace the nondetects.  For example,
if 3% of the data are below the detection limit, a 3% trimmed mean could be used to estimate the
population mean. Directions for developing a trimmed mean are contained in Box 4-32 and an
example is given in Box 4-33.  A trimmed variance is rarely calculated and is of limited use.

       4.7.2.3        Winsorized Mean and Standard Deviation

       Winsorizing replaces data in the tails of a data set with the next most extreme data value.
For environmental data, nondetects usually occur in the left tail of the data.  Therefore,
winsorizing can be used to adjust the data set to account for nondetects.  The mean and standard
deviation can then be computed on the new data set.  Directions for winsorizing  data (and
revising the sample size) are contained in Box 4-34 and an example is given in Box 4-35.
       4.7.2.4        Aitchison's Method

       Previous adjustments to the mean and variance assumed that the data values really were
present but could not be recorded or "seen" as they were below the detection limit.  In other
words, if the detection limit had been substantially lower, the data values would have been
recorded.  There are however, cases where the data values are below the detection limit because
they are  actually zero, the contaminant or chemical of concern being entirely absent. Such data
sets are actually a mixture - partly the assumed distribution (for example, a normal distribution)
and partly a number of zero values. Aitchison's Method is used in this situation to adjust the
mean and variance for the zero values.
                        Box 4-32: Directions for Developing a Trimmed Mean

    Let X1, X2, . . . , Xn represent the n data points.  To develop a 100p% trimmed mean (0 < p < 0.5):

    STEP 1:    Let t represent the integer part of the product np.  For example, if p = .25 and n = 17,
               np = (.25)(17) = 4.25, so t = 4.

    STEP 2:    Delete the t smallest values of the data set and the t largest values of the data set.

    STEP 3:    Compute the arithmetic mean of the remaining n - 2t values:

                    x̄ = (1/(n - 2t)) Σ(i=t+1 to n-t) X(i)

               This value is the estimate of the population mean.
                            Box 4-33: An Example of the Trimmed Mean

    Sulfate concentrations were measured for 24 data points.  The detection limit was 1,450 mg/L and 3 of
    the 24 values were below this limit. The 24 values listed in order from smallest to largest are: < 1450
    (ND), < 1450 (ND), <  1450 (ND), 1475, 1575, 1710, 1760,  1760, 1770, 1780, 1780, 1780,  1780, 1790,
    1790, 1790, 1800, 1800, 1800, 1820, 1840, 1850, 1860, 1900 mg/L. A 15% trimmed mean will be used
    to develop an estimate of the population mean that accounts for the 3 nondetects.

    STEP 1:  Since np =  (24)(.15) = 3.6, t = 3.

    STEP 2:  The 3 smallest values of the data set and the 3 largest values of the data set were deleted.
             The new data set is: 1475,  1575, 1710, 1760, 1760, 1770, 1780,  1780, 1780, 1780, 1790,
             1790, 1790, 1800, 1800, 1800, 1820, 1840 mg/L.

    STEP 3:  Compute the arithmetic mean of the remaining n - 2t values:

                    x̄ = (1/(24 - (2)(3)))(1475 + . . . + 1840) = 1755.56

             Therefore, the 15% trimmed mean is 1755.56 mg/L.
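The trimming procedure of Boxes 4-32 and 4-33 can be sketched in Python (an illustration only, not part of the guidance; the function name is ours):

```python
def trimmed_mean(values, p):
    """100p% trimmed mean: delete t = int(n*p) values from each tail (Box 4-32)."""
    data = sorted(values)
    n = len(data)
    t = int(n * p)               # integer part of np
    kept = data[t:n - t]         # the n - 2t central values
    return sum(kept) / len(kept)

# Box 4-33 data; the 3 nondetects enter only through their count, so any
# placeholder below the smallest detect (here the DL itself) will do.
data = ([1450] * 3 +
        [1475, 1575, 1710, 1760, 1760, 1770, 1780, 1780, 1780, 1780,
         1790, 1790, 1790, 1800, 1800, 1800, 1820, 1840, 1850, 1860, 1900])
print(round(trimmed_mean(data, 0.15), 2))   # 1755.56
```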
                          Box 4-34: Directions for Developing a Winsorized
                                   Mean and Standard Deviation

    Let X1, X2, . . . , Xn represent the n data points and m represent the number of data points above the
    detection limit (DL); hence there are (n-m) data points below the DL.

    STEP 1:  List the data in order from smallest to largest, including nondetects.  Label these points X(1),
             X(2), . . ., X(n) (so that X(1) is the smallest, X(2) is the second smallest, and X(n) is the
             largest).

    STEP 2:  Replace the (n-m) nondetects with X(n-m+1) and replace the (n-m) largest values with X(m).

    STEP 3:  Using the revised data set, compute the sample mean, x̄, and the sample standard deviation, s:

                  x̄ = (1/n) Σ(i=1 to n) Xi   and   s = sqrt[ (Σ(i=1 to n) Xi² - n x̄²) / (n - 1) ]

    STEP 4:  The Winsorized mean x̄w is equal to x̄.  The Winsorized standard deviation is

                  sw = s(n - 1) / (2m - n - 1)
                               Box 4-35: An Example of a Winsorized
                                   Mean and Standard Deviation

    Sulfate concentrations were measured for 24 data points.  The detection limit was 1,450 mg/L and 3 of
    the 24 values were below the detection level. The 24 values listed in order from smallest to largest are: <
    1450 (ND), < 1450 (ND), < 1450 (ND), 1475, 1575, 1710, 1760, 1760, 1770, 1780, 1780, 1780,  1780,
    1790, 1790, 1790, 1800, 1800, 1800, 1820, 1840, 1850, 1860, 1900 mg/L.

    STEP 1:  The data above are already listed from smallest to largest. There are n=24 samples,  21 above
             DL,  and n-m=3 nondetects.

    STEP 2:  The 3 nondetects were replaced with X(4) and the 3 largest values were replaced with X(21).
             The resulting data set is:  1475, 1475, 1475, 1475, 1575, 1710, 1760, 1760, 1770, 1780, 1780,
             1780, 1780, 1790, 1790, 1790, 1800, 1800, 1800, 1820, 1840, 1840, 1840, 1840 mg/L.

    STEP 3:  For the new data set, x̄ = 1731 mg/L and s = 128.52 mg/L.

    STEP 4:  The Winsorized mean x̄w = 1731 mg/L.  The Winsorized sample standard deviation is:

                  sw = 128.52(24 - 1) / (2(21) - 24 - 1) = 173.88 mg/L
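The winsorizing steps of Boxes 4-34 and 4-35 can be sketched in Python (an illustration only, not part of the guidance; the function names are ours, and the placeholder value 1400 for nondetects is arbitrary, since only their count and position matter):

```python
import statistics

def winsorize(values, n_below):
    """Box 4-34, Step 2: replace the n_below smallest values with the next
    value up, X(n-m+1), and the n_below largest values with X(m)."""
    data = sorted(values)
    n = len(data)
    low = data[n_below]               # X(n-m+1)
    high = data[n - n_below - 1]      # X(m)
    return [low] * n_below + data[n_below:n - n_below] + [high] * n_below

def winsorized_stats(values, n_below):
    """Winsorized mean and standard deviation (Box 4-34, Steps 3-4)."""
    data = winsorize(values, n_below)
    n = len(data)
    m = n - n_below                   # number of detects
    mean = statistics.mean(data)
    s = statistics.stdev(data)        # ordinary sample standard deviation
    s_w = s * (n - 1) / (2 * m - n - 1)
    return mean, s_w

# Box 4-35 data; the 3 nondetects are entered as a placeholder below the DL.
data = ([1400] * 3 +
        [1475, 1575, 1710, 1760, 1760, 1770, 1780, 1780, 1780, 1780,
         1790, 1790, 1790, 1800, 1800, 1800, 1820, 1840, 1850, 1860, 1900])
mean, s_w = winsorized_stats(data, 3)    # mean ~ 1731, s_w ~ 173.88
```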
       Aitchison's method for adjusting the mean and variance of the values above the detection
level works quite well provided the percentage of nondetects is between 15% and 50% of the total
number of values.  Care must be taken when using Aitchison's adjustment to the mean and
standard deviation, as the mean is reduced and the variance increased.  With such an effect it may
become very difficult to use the adjusted data for tests of hypotheses or for predictive purposes.
As a diagnostic tool, Aitchison's adjustment can lead to an evaluation of the data to determine if
two populations are being sampled simultaneously: one population represented by a normal
distribution, the other being simply blanks.  In some circumstances, for
example when investigating a hazardous site, it may be possible to relate the positions of the
samples through a Posting Plot (Section 2.3.9) and determine if the target population has not been
adequately stratified.  Directions for Aitchison's method are contained in Box 4-36; an example
(with a comparison to Cohen's method) is contained in Box 4-37.
              Box 4-36:  Directions for Aitchison's Method to Adjust Means and Variances

  Let X1, X2, . . ., Xm, . . . , Xn represent the data points where the first m values are above the detection
  limit (DL) and the remaining (n-m) data points are below the DL.

  STEP 1:   Using the data above the detection level, compute the sample mean and sample variance:

                 x̄d = (1/m) Σ(i=1 to m) Xi   and   s²d = Σ(i=1 to m) (Xi - x̄d)² / (m - 1)

  STEP 2:   Estimate the corrected sample mean and sample variance:

                 x̄ = (m/n) x̄d   and   s² = ((m-1)/(n-1)) s²d + (m(n-m)/(n(n-1))) x̄d²
                            Box 4-37: An Example of Aitchison's Method

  The following data consist of 10 Methylene Chloride samples: 1.9, 1.3, <1, 2.0, 1.9, <1, <1, 1.6, 1.6, and
  1.7.  There are 7 values above the detection limit and 3 below, so m = 7 and n - m = 3.  Aitchison's
  method will be used to estimate the mean and sample variance of these data.

  STEP 1:   x̄d = (1/7)(1.9 + 1.3 + 2.0 + 1.9 + 1.6 + 1.6 + 1.7) = 1.714

            and  s²d = Σ(Xi - 1.714)² / (7 - 1) = 0.05809

  STEP 2:   The corrected sample mean is then  x̄ = (7/10)(1.714) = 1.2

            and the sample variance  s² = ((7-1)/(10-1))(0.05809) + ((7(10-7))/(10(10-1)))(1.714)² = 0.7242
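Aitchison's adjustment of Box 4-36 can be sketched in Python, reproducing the Box 4-37 example (an illustration only, not part of the guidance; the function name is ours):

```python
import statistics

def aitchison(detects, n):
    """Box 4-36: adjust the mean and variance of the m detects for the
    (n - m) observations that are regarded as true zeros."""
    m = len(detects)
    mean_d = statistics.mean(detects)
    var_d = statistics.variance(detects)      # sample variance, divisor m - 1
    mean = (m / n) * mean_d
    var = ((m - 1) / (n - 1)) * var_d \
        + (m * (n - m)) / (n * (n - 1)) * mean_d ** 2
    return mean, var

mean, var = aitchison([1.9, 1.3, 2.0, 1.9, 1.6, 1.6, 1.7], 10)
# mean ~ 1.2 and var ~ 0.7244 (Box 4-37 reports 0.7242 after rounding)
```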
       4.7.2.5        Selecting Between Aitchison's Method and Cohen's Method

       To determine if a data set is better adjusted by Cohen's method or Aitchison's method, a
simple graphical procedure using Normal Probability Paper (Section 2.3) can be used.  Directions
for this procedure are given in Box 4-38 and an example is contained in Box 4-39.
                             Box 4-38: Directions for Selecting Between
                               Cohen's Method or Aitchison's Method

 Let X1, X2, . . ., Xm, . . . , Xn represent the data points where the first m values are above the detection limit
 (DL) and the remaining (n-m) data points are below the DL.

 STEP 1:    Use Box 2-19 to construct a Normal Probability Plot of all the data, but plot only the values above
            the detection level.  This is called the Censored Plot.

 STEP 2:    Use Box 2-19 to construct a Normal Probability Plot of only those values above the detection level.
            This is called the Detects Only Plot.

 STEP 3:    If the Censored Plot is more linear than the Detects Only Plot, use Cohen's Method to estimate the
            sample mean and variance.  If the Detects Only Plot is more linear than the Censored Plot, then use
            Aitchison's Method to estimate the sample mean and variance.
                             Box 4-39:  Example of Determining Between
                               Cohen's Method and Aitchison's Method

 In this example, 10 readings of Chlorobenzene were obtained from a monitoring well and submitted for
 consideration for a permit: <1, 1.9, 1.4, 1.5, <1, 1.2, <1, 1.3, 1.9, 2.1 ppm.  The data can be regarded as
 independent readings.

 STEP 1:   Using the directions in Box 2-19, the Censored Plot was constructed (plot not reproduced here).

 STEP 2:   Using the directions in Box 2-19, the Detects Only Plot was constructed (plot not reproduced here).

 STEP 3:   Since the Censored Plot is more linear than the Detects Only Plot, Cohen's Method should be used
           to estimate the sample mean and variance.
4.7.3  Greater than 50% Nondetects - Test of Proportions

       If more than 50% of the data are below the detection limit but at least 10% of the
observations are quantified, tests of proportions may be used to test hypotheses using the data.
Thus, if the parameter of interest is a mean, consider switching the parameter of interest to some
percentile greater than the percent of data below the detection limit. For example, if 67% of the
data are below the DL, consider switching the parameter of interest to the 75th percentile.  Then
the method described in Section 3.2.2 can be applied to test the hypothesis concerning the 75th percentile.
It is important to note that the tests of proportions may not be applicable for composite samples.
In this case, the data analyst should consult a statistician before proceeding with analysis.

       If very few quantified values are found, a method based on the Poisson distribution may be
used as an alternative approach.  However, with a large proportion of nondetects in the data, the
data analyst should consult with a statistician before proceeding with analysis.

4.7.4  Recommendations

       If the number of sample observations is small (n<20), maximum likelihood methods can
produce biased results since it is difficult to assure that the underlying distribution is appropriate,
and the solutions to the likelihood equation for the parameters of interest are statistically consistent
only if the number of samples is large.  Additionally, most methods will yield estimated parameters
with large estimation variance, which reduces the power to detect important differences from
standards or between populations. While these methods can be applied to small data sets, the user
should be cautioned that they will only be effective in detecting large departures from the null
hypothesis.

       If the degree of censoring (the percentage of data below the detection limit) is relatively
low, reasonably good estimates of means, variances and upper percentiles can be obtained.
However, if the rate of censoring is very high (greater than 50%) then little can be done
statistically except to focus on some upper quantile of the contaminant distribution, or on some
proportion of measurements above a certain critical level that is at or above the censoring limit.

       When the numerical standard is at or below one of the censoring levels and a one-sample
test is used, the most useful statistical method is to test whether the proportion of the population
above (below) the standard is too large, or to test whether an upper quantile of the population
distribution is above the numerical standard.  Table 4-5 gives some recommendations on which
statistical parameter to use when censoring is present in data sets for different sizes of the
coefficient of variation (Section 2.2.3).

       When comparing two data sets with different censoring levels (i.e., different detection
limits), it is recommended that all data be censored at the highest censoring value present and a
nonparametric test such as the Wilcoxon Rank Sum Test (Section 3.3.3.1) be used to compare the
two data sets.  There  is a corresponding loss of statistical power but this can be minimized
through the use of large samples.

             Table 4-5. Guidelines for Recommended Parameters for Different
                          Coefficients of Variation and Censoring

             Assumed Coefficient of Variation (CV)     Recommended Parameter (censoring > 30%)

             Large:  CV > 1.5                          Upper Percentile
             Medium: 0.5 < CV < 1.5                    Upper Percentile
             Small:  CV < 0.5                          Median
4.8    INDEPENDENCE

       The assumption of independence of data is key to the validity of the false rejection and
false acceptance error rates associated with the selected statistical test.  When data are truly
independent of one another, the correlation between data points is by definition zero and the
selected statistical tests work with the desired chosen decision error rates (given that the
appropriate assumptions have been satisfied).  When correlation (usually positive) exists, the
effectiveness of statistical tests is diminished.  Environmental data are particularly susceptible to
correlation problems because such data are often collected under a spatial pattern (for example,
a grid) or sequentially over time (for example, daily readings from a monitoring station).

       The reason non-independence is an issue for statistical testing situations is that if
observations are positively correlated over time or space, then the effective sample size for a test
tends to be smaller than the actual sample size - i.e., each additional observation does not provide
as much "new" information because its value is partially determined by (or a function of) the value
of adjacent observations.   This smaller effective sample  size means that the degrees of freedom for
the test statistic is less, or equivalently, the test is not as powerful as originally thought. In
addition to affecting the false acceptance error rate, applying the usual tests to correlated data
tends to result in a test whose actual significance level (false rejection error rate) is larger than the
nominal error rate.

       When observations are correlated, estimates of the variance that are used in test statistic
formulas are often understated.  For example, consider the mean of a series of n temporally-
ordered observations.  If these observations are independent, then the variance of the mean is
σ²/n, where σ² is the variance of individual observations (see Section 2.2.3).  However, if the
observations are not independent and the correlation between successive observations is ρ (for
example, the correlation between the first and second observations is ρ, between the first and third
observations is ρ², between the first and fourth observations is ρ³, etc.), then the variance of the
mean increases to
                  VAR(x̄) = (σ²/n)(1 + q),   where   q = (2/n) Σ(k=1 to n-1) (n - k)ρ^k,

which will tend to be larger than σ²/n if the correlations (on average) are positive.  If one conducts
a t-test at the α significance level, using the usual formula for the estimated variance (Box 2-3), then
the actual significance level can be approximately double what was expected even for relatively
low values of ρ.
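This variance inflation can be sketched in Python, under the stated assumption that the correlation between observations k steps apart is ρ^k (an illustration only, not from the guidance; the function name is ours):

```python
def variance_of_mean(sigma2, n, rho):
    """VAR(mean) = (sigma2/n)(1 + q), with q = (2/n) sum_{k=1}^{n-1} (n-k) rho**k."""
    q = (2.0 / n) * sum((n - k) * rho ** k for k in range(1, n))
    return (sigma2 / n) * (1 + q)

# With rho = 0 the usual sigma2/n is recovered; positive rho inflates it.
print(variance_of_mean(1.0, 20, 0.0))   # 0.05
print(variance_of_mean(1.0, 20, 0.4))   # roughly twice as large
```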

       One of the most effective ways to determine statistical independence is through use of the
Rank von Neumann Test.  Directions for this test are given in Box 4-40 and an example is
contained in Box 4-41.  Compared to other tests of statistical independence, the rank von
Neumann test has been shown to be more powerful over a wide variety of cases.  It is also a
reasonable test when the data really follow a Normal distribution.  In that case, the efficiency of
the test is always close to 90 percent when compared to the von Neumann ratio computed on the
original data instead of the ranks.  This means that very little effectiveness is lost by always using
the ranks in place of the original concentrations; the rank von Neumann ratio should still
correctly detect non-independent data.
                        Box 4-40:  Directions for the Rank von Neumann Test

  Let X1, X2, . . . , Xn represent the data values collected in sequence over equally spaced periods of time.

  Step 1.  Order the data measurements from smallest to largest and assign a unique rank (ri) to each
           measurement (see Box 3-20).  Then list the observations and their corresponding ranks in the
           order that sampling occurred (i.e., by sampling event or time order).

  Step 2.  Using the list of ranks, ri, for the sampling periods i = 1, 2, ..., n, compute the rank von
           Neumann ratio:

                v = Σ(i=2 to n) (ri - ri-1)² / [n(n² - 1)/12]

  Step 3:  Use Table A-15 of Appendix A to determine the lower critical point of the rank von Neumann
           ratio using the sample size, n, and the desired significance level, α.  If the computed ratio, v,
           is smaller than this critical point, conclude that the data series is strongly autocorrelated.  If
           not, the data may be mildly correlated, but there is no statistically significant evidence to
           reject the hypothesis of independence.  Therefore, the data should be regarded as independent
           in subsequent statistical testing.

           Note: if the rank von Neumann ratio test indicates significant evidence of dependence in the
           data, a statistician should be consulted before further analysis is performed.
                         Box 4-41: An Example of the Rank von Neumann Test

  The following are hourly readings from a discharge monitor:

  Time:    12:00 13:00 14:00 15:00 16:00 17:00 18:00 19:00 20:00 21:00 22:00 23:00 24:00
  Reading:  6.5   6.6   6.7   6.4   6.3   6.4   6.2   6.2   6.3   6.6   6.8   6.9   7.0
  Rank:     7     8.5   10    5.5   3.5   5.5   1.5   1.5   3.5   8.5   11    12    13

  Step 1:  The ranks are displayed in the table above and the time periods were labeled 1 through 13.

  Step 2:  v = Σ(i=2 to n) (ri - ri-1)² / [n(n² - 1)/12]
             = [(8.5 - 7)² + (10 - 8.5)² + . . . + (13 - 12)²] / [13(13² - 1)/12] = 0.473

  Step 3:  Using Table A-15 of Appendix A with α = 0.05, the lower critical point is 1.17.  Since
           v = 0.473 < 1.17, the hypothesis that the data are independent must be rejected.
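The ratio computation of Boxes 4-40 and 4-41 can be sketched in Python (an illustration only, not part of the guidance; the midrank helper and function names are ours):

```python
def midranks(values):
    """Assign ranks to the values, averaging the ranks of tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with the one at sorted position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1              # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_von_neumann(values):
    """Box 4-40, Step 2: the rank von Neumann ratio v."""
    r = midranks(values)
    n = len(r)
    num = sum((r[i] - r[i - 1]) ** 2 for i in range(1, n))
    return num / (n * (n ** 2 - 1) / 12)

# Box 4-41 data: hourly readings from a discharge monitor
readings = [6.5, 6.6, 6.7, 6.4, 6.3, 6.4, 6.2, 6.2, 6.3, 6.6, 6.8, 6.9, 7.0]
print(round(rank_von_neumann(readings), 3))   # 0.473
```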
                                          CHAPTER 5
                   STEP 5:  DRAW CONCLUSIONS FROM THE DATA
THE DATA QUALITY ASSESSMENT PROCESS
             Review DQOs and Sampling Design
             Conduct Preliminary Data Review
                Select the Statistical Test
                 Verify the Assumptions
             Draw Conclusions From the Data
            DRAW CONCLUSIONS FROM THE DATA

          Purpose

          Conduct the hypothesis test and interpret the results
          in the context of the data user's objectives.


          Activities

          • Perform the Statistical Hypothesis Test
          • Draw Study Conclusions
          • Evaluate Performance of the Sampling Design


          Tools

          • Issues in hypothesis testing related to understanding
          and communicating the test results
                              Step 5:  Draw Conclusions from the Data

         •    Perform the calculations for the statistical hypothesis test.
              -   Perform the calculations and document them clearly.
              -   If anomalies or outliers are present in the data set, perform the calculations
                  with and without the questionable data.

         •    Evaluate the statistical test results and draw conclusions.
              -   If the null hypothesis is rejected, then draw the conclusions and document the
                  analysis.
              -   If the null hypothesis is not rejected, verify whether the tolerable limits on
                  false acceptance decision errors have been satisfied.  If so, draw conclusions
                  and document the analysis; if not, determine corrective actions, if any.
              -   Interpret the results of the test.

         •    Evaluate the performance of the sampling design if the design is to be used
              again.
              -   Evaluate the statistical power of the design over the full range of parameter
                  values; consult a statistician as necessary.
                                    List of Boxes

Box 5-1: Checking Adequacy of Sample Size for a One-
             Sample t-Test for Simple Random Sampling	5-5
Box 5-2: Example of Power Calculations for the One-Sample Test of a Single Proportion  . . 5 - 6
Box 5-3: Example of a Comparison of Two Variances
             which is Statistically but not Practically Significant	5-9
Box 5-4: Example of a Comparison of Two Biases  	5-10

                                   List of Figures
                                                                                Page
Figure 5-1.  Illustration of Unbiased versus Biased Power Curves  	5-11
                                       CHAPTER 5
                  STEP 5: DRAW CONCLUSIONS FROM THE DATA

5.1    OVERVIEW AND ACTIVITIES

       In this final step of the DQA, the analyst performs the statistical hypothesis test and draws
conclusions that address the data user's objectives.  This step represents the culmination of the
planning, implementation, and assessment phases of the data operations.  The data user's planning
objectives will have been reviewed (or developed retrospectively) and the sampling design
examined in Step 1. Reports on the implementation of the sampling scheme will have been
reviewed and a preliminary picture of the sampling results developed in Step 2. In light of the
information gained in Step 2, the statistical test will have been selected in Step 3.  To ensure that
the chosen statistical methods are valid, the key underlying assumptions of the statistical test will
have been verified in Step 4. Consequently, all of the activities conducted up to this point should
ensure that the calculations performed on the data set and the conclusions drawn here in Step 5
address the data user's needs in a scientifically defensible manner. This chapter describes the main
activities that should be conducted during this step. The actual procedures for implementing
some commonly used statistical tests are described in Step 3, Select the Statistical Test.

5.1.1   Perform the Statistical Hypothesis Test

       The goal of this activity is to conduct the statistical hypothesis test.  Step-by-step
directions for several commonly used statistical tests are described in Chapter 3. The calculations
for the test should be clearly documented and easily verifiable.  In addition, the documentation of
the results of the test should be understandable so that the results can be communicated
effectively to those who may hold a stake in the resulting decision. If computer software is used
to perform the calculations, ensure that the procedures are adequately documented, particularly if
algorithms have been developed and coded specifically for the project.

       The analyst should always exercise best professional judgment when performing the
calculations.  For instance, if outliers or anomalies are present in the data set, the calculations
should be performed both with and without the questionable data to see what effect they may
have on the results.

5.1.2   Draw Study Conclusions

       The goal of this activity is to translate the results of the statistical hypothesis test so that
the data user may draw a conclusion from the data. The results of the statistical hypothesis test
will be either:

       (a)     reject the null hypothesis, in which case the analyst is concerned about a possible
              false rejection decision error; or
EPA QA/G-9                                                                            Final
QAOO Version                                5 - 3                                    July 2000

-------
       (b)    fail to reject the null hypothesis, in which case the analyst is concerned about a
              possible false acceptance decision error.

       In case (a), the data have provided the evidence needed to reject the null hypothesis, so
the decision can be made with sufficient confidence and without further analysis. This is because
the statistical test, based on the classical hypothesis testing philosophy described in prior
chapters, inherently controls the false rejection decision error rate within the data user's tolerable
limits, provided that the underlying assumptions of the test have been verified correctly.

       In case (b), the data do not provide sufficient evidence to reject the null hypothesis, and
the data must be analyzed further to determine whether the data user's tolerable limits on false
acceptance decision errors have been satisfied. One of two possible conditions may prevail:

       (1)    The data do not support rejecting the null hypothesis and the false acceptance
              decision error limits were  satisfied. In this case, the conclusion is drawn in favor
              of the null  hypothesis, since the probability of committing a false acceptance
              decision error is believed to be sufficiently small in the context  of the current study
              (see Section 5.2).

       (2)    The data do not support rejecting the null hypothesis, and the false acceptance
              decision error limits were  not satisfied.  In this case, the statistical test was not
              powerful enough to satisfy the data user's performance criteria.  The data user may
              choose to tolerate a higher false acceptance decision error rate than previously
              specified and draw the conclusion in favor of the null hypothesis, or instead take
              some form of corrective action, such as obtaining additional data before drawing a
              conclusion and making  a decision.

When the test fails to reject the null hypothesis, the most thorough procedure for verifying
whether the false acceptance decision error limits have been satisfied is to compute the estimated
power of the statistical test, using the variability  observed in the data. Computing the power of
the statistical test across the full range of possible parameter values can be complicated and
usually requires specialized software.  Power calculations are also necessary for evaluating the
performance of a sampling design.  Thus, power calculations will be discussed further in Section
5.1.3.

       A simpler method can be used for checking the performance of the statistical test. Using
an estimate of variance obtained from the actual data, or an upper 95% confidence limit on the
variance, the sample size required to satisfy the data user's objectives can be calculated retrospectively. If
this theoretical sample size is less than or equal to the number of samples actually taken, then the
test is sufficiently powerful. If the required number of samples is greater than the number actually
collected, then additional samples would be  required to satisfy the data user's performance criteria
for the statistical test. An example of this method is contained in Box 5-1. The equations
                       Box 5-1: Checking Adequacy of Sample Size for a One-
                            Sample t-Test for Simple Random Sampling

  In Box 3-1, the one-sample t-test was used to test the hypothesis H0: μ ≤ 95 ppm vs. HA: μ > 95 ppm.  DQOs
  specified that the test should limit the false rejection error rate to 5% and the false acceptance error rate to 20%
  if the true mean were 105 ppm. A random sample of size n = 9 had sample mean x̄ = 99.38 ppm and standard
  deviation s = 10.41 ppm. The null hypothesis was not rejected.  Assuming that the true value of the standard
  deviation was equal to its sample estimate 10.41 ppm, it was found that a sample size of 9 would be required,
  which validated the sample size of 9 which had actually been used.

  The distribution of the sample standard deviation is skewed with a long right tail. It follows that the chances
  are greater than 50% that the sample standard deviation will underestimate the true standard deviation. In
  such a case it makes sense to build in some conservatism, for example, by using an upper 90% confidence
  limit for σ in Step 5 of Box 3-1.  Using Box 4-22 and n − 1 = 8 degrees of freedom, it is found that L = 3.49, so
  that an upper 90% confidence limit for the true standard deviation is

                            s√[(n − 1)/L] = 10.41√(8/3.49) = 15.76

  Using this value for s in Step 5 of Box 3-1 leads to the sample size estimate of 17. Hence, a sample size of at
  least 17 should be used to be 90% sure of achieving the DQOs. Since it is generally desirable to avoid the
  need for additional sampling, it is advisable to conservatively estimate sample size in the first place. In cases
  where DQOs depend on a variance estimate, this conservatism is achieved by intentionally overestimating the
  variance.
required to perform these calculations have been provided in the detailed step-by-step instructions
for each hypothesis test procedure in Chapter 3.
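The retrospective sample-size check of Box 5-1 can be sketched in a few lines of Python. This is a minimal illustration, not part of the guidance itself: it uses the sample size formula from Step 5 of Box 3-1 and takes the chi-square value L = 3.49 directly from Box 4-22 rather than computing it.

```python
import math
from statistics import NormalDist

def required_n(s, delta, alpha=0.05, beta=0.20):
    """Sample size for a one-sample t-test (Step 5 of Box 3-1):
    n = s^2 (z_{1-alpha} + z_{1-beta})^2 / delta^2 + 0.5 z_{1-alpha}^2,
    rounded up to the next integer."""
    z = NormalDist().inv_cdf
    za, zb = z(1 - alpha), z(1 - beta)
    return math.ceil(s * s * (za + zb) ** 2 / delta ** 2 + 0.5 * za * za)

# Upper 90% confidence limit for sigma; L = 3.49 is the tabled chi-square
# value from Box 4-22 for n - 1 = 8 degrees of freedom
s, n, L = 10.41, 9, 3.49
s_upper = s * math.sqrt((n - 1) / L)     # about 15.76 ppm

n_with_sample_sd = required_n(s, delta=10)        # gives 9, as in Box 5-1
n_conservative = required_n(s_upper, delta=10)    # gives 17, as in Box 5-1
```

Using the upper confidence limit in place of the sample standard deviation reproduces the conservative sample size of 17 derived in the box.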

5.1.3  Evaluate Performance of the Sampling Design

       If the sampling design is to be used again, either in a later phase of the current study or in
a similar study, the analyst will be interested in evaluating the overall performance of the design.
To evaluate the sampling design, the analyst performs a statistical power analysis that describes
the estimated power of the statistical test over the range of possible parameter values.  The power
of a statistical test is the probability of rejecting the null hypothesis when the null hypothesis is
false. The estimated power is computed for all parameter values under the alternative hypothesis
to create a power curve. A power analysis helps the analyst evaluate the adequacy of the
sampling design when the true parameter value lies in the vicinity of the action level (which may
not have been the outcome of the current study). In this manner, the analyst may determine how
well a statistical test performed and compare this performance with that of other tests.

       The calculations required to perform a power analysis can be relatively complicated,
depending on the complexity of the sampling  design and statistical test selected. Box 5-2
illustrates power calculations for a test of a single proportion, which is one of the simpler cases.
A further discussion of power curves (performance curves) is contained in the Guidance for Data
Quality Objectives (QA/G-4) (EPA 1994).
         Box 5-2: Example of Power Calculations for the One-Sample Test of a Single Proportion

  This box illustrates power calculations for the test of H0: P ≥ .20 vs. HA: P < .20, with a false rejection error rate
  of 5% when P = .20, presented in Boxes 3-10 and 3-11.  The power of the test will be calculated assuming P1 =
  .15 and before any data are available. Since nP1 and n(1 − P1) both exceed 4, the sample size is large enough
  for the normal approximation, and the test can be carried out as in steps 3 and 4 of Box 3-10.

  STEP 1:   Determine the general conditions for rejection of the null hypothesis. In this case, the null
            hypothesis is rejected if the sample proportion is sufficiently smaller than P0. (Clearly, a sample
            proportion above P0 cannot cast doubt on H0.) By steps 3 and 4 of Boxes 3-10 and 3-11, H0 is
            rejected if

                      (p + .5/n − P0) / √(P0Q0/n)  ≤  −z_(1−α)

            Here p is the sample proportion, Q0 = 1 − P0, n is the sample size, and z_(1−α) is the critical value
            such that 100(1−α)% of the standard normal distribution is below z_(1−α). This inequality is true if

                      p + .5/n  ≤  P0 − z_(1−α)√(P0Q0/n).

  STEP 2:  Determine the specific conditions for rejection of the null hypothesis if P1 (= 1 − Q1) is the true value of
           the proportion P. The same operations as are used in step 3 of Box 3-10 are performed on both
           sides of the above inequality.  However, P0 is replaced by P1 since it is assumed that P1 is the true
           proportion.  These operations make the normal approximation applicable. Hence, rejection occurs if

                      (p + .5/n − P1) / √(P1Q1/n)  ≤  (P0 − P1 − z_(1−α)√(P0Q0/n)) / √(P1Q1/n)
                                                   =  (.20 − .15 − 1.645√((.20)(.80)/n)) / √((.15)(.85)/n)  =  −0.55

  STEP 3:  Find the probability of rejection if P1 is the true proportion. By the same reasoning that led to the
           test in steps 3 and 4 of Boxes 3-10 and 3-11, the quantity on the left-hand side of the above
           inequality is a standard normal variable. Hence the power at P1 = .15 (i.e., the probability of
           rejection of H0 when .15 is the true proportion) is the probability that a standard normal variable is
           less than −0.55.  In this case, the probability is approximately 0.3 (using the last line from Table A-1
           of Appendix A), which is fairly small.
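The Box 5-2 calculation can be reproduced with a short function. This is a sketch: the sample size n = 85 below is an assumption chosen to reproduce the −0.55 value quoted in the box (the actual n comes from Box 3-10), and the continuity-corrected sample proportion is treated as standard normal, as the box does.

```python
from statistics import NormalDist

def power_of_proportion_test(n, p0, p1, alpha=0.05):
    """Approximate power of the one-sided test H0: P >= p0 vs HA: P < p0
    when the true proportion is p1 (normal approximation, Box 3-10)."""
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha)
    q0, q1 = 1 - p0, 1 - p1
    # Standardized rejection threshold from Step 2 of Box 5-2
    crit = (p0 - p1 - z * (p0 * q0 / n) ** 0.5) / (p1 * q1 / n) ** 0.5
    return nd.cdf(crit)  # probability a standard normal falls below crit

# n = 85 is assumed here for illustration; it yields crit of about -0.55
power = power_of_proportion_test(n=85, p0=0.20, p1=0.15)  # roughly 0.29
```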
5.2     INTERPRETING AND COMMUNICATING THE TEST RESULTS

        Sometimes difficulties may arise in interpreting or explaining the results of a statistical test.
One reason for such difficulties may stem from inconsistencies in terminology; another may be due
to a lack of understanding of some of the basic notions underlying hypothesis tests.  As an
example, in explaining the results to a data user, an analyst may use different terminology than
that appearing in this guidance. For instance, rather than saying that the null hypothesis was or
was not rejected, analysts may report the result of a test by saying that their computer output
shows a p-value of 0.12.  What does this mean?  Similar problems of interpretation may occur
when the data user attempts to understand the practical significance of the test results or to
explain the test results to others.  The following paragraphs touch on  some of the philosophical
issues related to hypothesis testing which may help in understanding and communicating the  test
results.
 5.2.1  Interpretation of p-Values

       The classical approach for hypothesis tests is to prespecify the significance level of the
test, i.e., the Type I decision error rate α.  This rate is used to define the decision rule associated
with the hypothesis test.  For instance, in testing whether the population mean μ exceeds a
threshold level (e.g., 100 ppm), the test statistic may depend on x̄, an estimate of μ.  Obtaining an
estimate x̄ that is greater than 100 ppm may occur simply by chance even if the true mean μ is less
than or equal to 100 ppm; however, if x̄ is "much larger" than 100 ppm, then there is only a small
chance that the null hypothesis H0 (μ ≤ 100 ppm) is true. Hence the decision rule might take the
form "reject H0 if x̄ exceeds 100 + C," where C is a positive quantity that depends on α (and on
the variability of x̄).  If this condition is met, then the result of the statistical test is reported as
"reject H0"; otherwise, the result is reported as "do not reject H0."

       An alternative way of reporting the result of a statistical test is to report its p-value, which
is defined as the probability, assuming the null hypothesis to be true, of observing a test result at
least as extreme as that found in the sample. Many statistical software packages report p-values,
rather than adopting the classical approach of using a prespecified false rejection error rate.  In the
above example, for instance, the p-value would be the probability of observing a sample mean as
large as x̄ (or larger) if in fact the true mean was equal to 100 ppm. Obviously, in making a
decision based on the p-value, one should reject H0 when p is small and not reject it if p is large.
Thus the relationship between p-values and the classical hypothesis testing approach is that one
rejects H0 if the p-value associated with the test result is less than α. If the data user had chosen
the false rejection error rate as 0.05 a priori and the analyst reported a p-value of 0.12, then the
data user would report the result as "do not reject the null hypothesis"; if the p-value had been
reported as 0.03, then that person would report the result as "reject the null hypothesis." An
advantage of reporting p-values is that they provide a measure of the strength of evidence for or
against the null hypothesis, which allows data users to establish their own false rejection error
rates.  The significance level can be interpreted as that p-value (α) that divides "do not reject H0"
from "reject H0."
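The relationship between a reported p-value and a prespecified α can be illustrated with a small, entirely hypothetical sketch (a one-sided z-test of the mean with σ known; all numeric values are invented for the example):

```python
from statistics import NormalDist

def one_sided_p_value(xbar, mu0, sigma, n):
    """p-value for H0: mu <= mu0 vs HA: mu > mu0 (z-test, sigma known)."""
    z = (xbar - mu0) / (sigma / n ** 0.5)
    return 1 - NormalDist().cdf(z)

# Hypothetical data: sample mean 104 ppm from n = 16, sigma = 15 ppm
p = one_sided_p_value(xbar=104.0, mu0=100.0, sigma=15.0, n=16)

alpha = 0.05  # prespecified false rejection error rate
decision = "reject H0" if p < alpha else "do not reject H0"
```

Here p is about 0.14, so against α = 0.05 the reported result is "do not reject H0"; the same p-value would lead a data user who tolerates α = 0.20 to reject.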

5.2.2   "Accepting" vs. "Failing to Reject" the Null Hypothesis

       As noted in the paragraphs above, the classical approach to hypothesis testing results in
one of two conclusions:  "reject H0" (called a significant result) or "do not reject H0" (a
nonsignificant result).  In the latter case one might be tempted to equate "do not reject H0" with
"accept H0."  This terminology is not recommended, however, because of the philosophy
underlying the classical testing procedure. This philosophy places the burden of proof on the
alternative hypothesis, that is, the null hypothesis is rejected only if the evidence furnished by the
data convinces us that the alternative hypothesis is the more likely state of nature. If a
nonsignificant result is obtained, it provides evidence that the null  hypothesis could sufficiently
account for the observed data, but it does not imply that the hypothesis is the only hypothesis that
could be supported by the data. In other words, a highly nonsignificant result (e.g., a p-value of
0.80) may indicate that the null hypothesis provides a reasonable model for explaining the data,
but it does not necessarily imply that the null hypothesis is true.  It may, for example, simply
indicate that the sample size was not large enough to establish convincingly that the alternative
hypothesis was more likely.  When the phrase "accept H0" is encountered, it must be considered
as "accepted with the preceding caveats."

5.2.3  Statistical Significance vs. Practical Significance

       There is an important distinction between these two concepts.  Statistical significance
simply refers to the result of the hypothesis test: Was the null hypothesis rejected? The likelihood
of achieving a statistically significant result depends on the true value of the population parameter
being tested (for example, μ), how much that value deviates from the value hypothesized under
the null hypothesis (for example, μ0), and on the sample size.  This dependence on (μ − μ0) is
depicted by the power curve associated with the test (Section 5.1.3).  A steep power curve can be
achieved by using a large sample size; this means that there will be a high likelihood of detecting
even a small difference. On the other hand, if small sample sizes are used, the power curve will be
less steep, meaning that only a very large difference between μ and μ0 will be detectable with high
probability.  Hence, suppose one obtains a statistically significant result but has no knowledge of
the power of the test. Then it is possible, in the case of the steep power curve, that one may be
declaring significance (claiming μ > μ0, for example) when the actual difference, from a practical
standpoint, may be inconsequential. Or, in the case of the slowly increasing power curve, one
may not find a significant result even though a "large" difference between μ and μ0 exists.  Neither
of these situations is desirable: in the former case, there has been an excess of resources
expended, whereas in the latter case, a false acceptance error is likely to have occurred.

       But how large a difference between the parameter and the null value is  of real importance?
This relates to the concept of practical significance. Ideally, this question is asked and answered
as part of the DQO process during the planning phase of the study.  Knowing the magnitude of
the difference that is regarded as being of practical significance is important during the design
stage because this allows one, to the extent that prior information permits, to determine a
sampling plan of type and size that will make the magnitude of that difference commensurate with
a difference that can be detected with high probability. From a purely statistical design
perspective, this can be considered to be the main purpose of the DQO process.  With such planning,
the likelihood of encountering either of the undesirable situations mentioned in the prior
paragraph can be reduced. Box 5-3 contains an example  of a statistically significant but fairly
inconsequential difference.

5.2.4  Impact of Bias on Test Results

       Bias is defined as the difference between the expected value of a statistic and a population
parameter. It is relevant when the statistic of interest (e.g., a sample average x̄) is to be used as
an estimate of the parameter (e.g., the population mean μ). For example, the population
parameter of interest may be the average concentration of dioxin within the given bounds of a
hazardous waste site, and the statistic might be the sample average as obtained from a random
sample of points within those bounds. The expected value of a statistic can be interpreted as
follows: suppose one repeatedly implemented the particular sampling design a very large number of
times and calculated the statistic of interest in each case. The average of the statistic's values
would then be regarded as its expected value. Let E denote the expected value of x̄ and denote
the relationship between the expected value and the parameter, μ, as E = μ + b, where b is the
bias.  For instance, if the bias occurred due to incomplete recovery of an analyte (and no
adjustment is made), then b = (R − 100)μ/100, where R denotes the percent recovery.  Bias may
also occur for other reasons, such as lack of coverage of the entire target population (e.g., if only
the drums within a storage site that are easily accessible are eligible for inclusion in the sample,
then inferences to the entire group of drums may be biased).  Moreover, in cases of incomplete
coverage, the magnitude and direction of the bias may be unknown.  An example involving
comparison of the biases of two measurement methods is contained in Box 5-4.
                         Box 5-3: Example of a Comparison of Two Variances
                          which is Statistically but not Practically Significant

  The quality control (QC) program associated with a measurement system provides important information on
  performance and also yields data which should be taken into account in some statistical analyses. The QC
  program should include QC check samples, i.e., samples of known composition and concentration which are
  run at regular frequencies. The term precision refers to the consistency of a measurement method in repeated
  applications under fixed conditions and is usually equated with a standard deviation. The appropriate standard
  deviation is one which results from applying the system to the same sample over a long period of time.

  This example concerns two methods for measuring ozone in ambient air, an approved method and a new
  candidate method. Both methods are used once per week for three months. Based on 13
  analyses with each method of the mid-range QC check sample at 100 ppb, the null hypothesis of the equality
  of the two variances will be tested with a false rejection error rate of 5% or less.  (If the variances are equal,
  then the standard deviations are equal.) Method 1 had a sample mean of 80 ppb and a standard deviation of 4
  ppb.  Method 2 had a mean of 90 ppb and a standard deviation of 8 ppb. The Shapiro-Wilk test did not reject
  the assumption of normality for either method. Applying the F-test of Box 4-23, the F ratio is 8²/4² = 4.  Using
  12 degrees of freedom for both the numerator and denominator, the F ratio must exceed 3.28 in order to reject
  the hypothesis of equal variances (Table A-9 of Appendix A). Since 4 > 3.28, the hypothesis of equal variances
  is rejected, and it is concluded that method 1 is significantly more precise than method 2.

  In an industrialized urban environment, the true ozone levels at a fixed location and time of day are known to
  vary over a period of months with a coefficient of variation of at least 100%. This means that the ratio of the
  standard deviation (SD) to the mean at a given location is at least 1.  For a mean of 100 ppb, the standard
  deviation over time for true ozone values at the location would be at least 100 ppb.  Relative to this degree of
  variability, a difference between measurement error standard deviations of 4 or 8 ppb is negligible. The overall
  variance, incorporating the true process variability and measurement error, is obtained by adding the individual
  variances.  For instance, if measurement error standard deviation is 8 ppb, then the total variance is (100
  ppb)(100 ppb) + (8 ppb)(8 ppb). Taking the square  root of the variance gives a  corresponding total standard
  deviation of 100.32 ppb. For a measurement error standard deviation of 4 ppb,  the total standard deviation
  would be 100.08 ppb. From a practical standpoint, the difference in precision between the two methods  is
  insignificant for the given application, despite the finding that there is a statistically significant difference
  between the variances of the two methods.
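The arithmetic of Box 5-3 can be checked in a few lines. This is a sketch only: the critical value 3.28 is taken from Table A-9 rather than computed, and the 100 ppb process standard deviation is the value assumed in the box.

```python
import math

s1, s2 = 4.0, 8.0                      # ppb; method 1 and method 2 SDs
F = s2 ** 2 / s1 ** 2                  # F ratio: larger variance over smaller
statistically_significant = F > 3.28   # Table A-9 critical value, (12, 12) df

# Practical significance: with the true process SD near 100 ppb, the total
# SD barely changes between the two measurement-error SDs
total_sd_method1 = math.sqrt(100 ** 2 + s1 ** 2)   # about 100.08 ppb
total_sd_method2 = math.sqrt(100 ** 2 + s2 ** 2)   # about 100.32 ppb
```

The variance ratio is statistically significant, yet the two total standard deviations differ by less than 0.25 ppb, which is the point of the box.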
                           Box 5-4: Example of a Comparison of Two Biases

     This example is a continuation of the ozone measurement comparison described in Box 5-3.  Let x̄ and
     sX denote the sample mean and standard deviation of measurement method 1 applied to the QC check
     sample, and let Ȳ and sY denote the sample mean and standard deviation of method 2.  Then x̄ = 80 ppb,
     sX = 4 ppb, Ȳ = 90 ppb and sY = 8 ppb. The estimated biases are x̄ − T = 80 − 100 = −20 ppb for method
     1, and Ȳ − T = 90 − 100 = −10 ppb for method 2, since 100 ppb is the true value T. That is, method 1 seems
     to underestimate by 20 ppb, and method 2 seems to underestimate by 10 ppb.  Let μ1 and μ2 be the
     underlying mean concentrations for measurement methods 1 and 2 applied to the QC check sample.
     These means correspond to the average results which would obtain by applying each method a large
     number of times to the QC check sample, over a long period of time.

     A two-sample t-test (Boxes 3-14 and 3-16) can be used to test for a significant difference between these
     two biases.  In this case, a two-tailed test of the null hypothesis H0: μ1 − μ2 = 0 against the alternative HA:
     μ1 − μ2 ≠ 0 is appropriate, because there is no a priori reason (in advance of data collection) to suspect
     that one measurement method is superior to the other.  (In general, hypotheses should not be tailored to
     data.) Note that the difference between the two biases is the same as the difference (μ1 − μ2) between the
     two underlying means of the measurement methods. The test will be done to limit the false rejection
     error rate to 5% if the two means are equal.

     STEP 1:   x̄ = 80 ppb, sX = 4 ppb, Ȳ = 90 ppb, sY = 8 ppb.

     STEP 2:   From Box 5-3, it is known that the methods have significantly different variances, so that
               Satterthwaite's t-test should be used.  Therefore,

                    √(sX²/m + sY²/n) = √(4²/13 + 8²/13) = 2.48

     STEP 3:   The approximate degrees of freedom are

                    f = (sX²/m + sY²/n)² / [sX⁴/(m²(m − 1)) + sY⁴/(n²(n − 1))]
                      = (4²/13 + 8²/13)² / [4⁴/(13²·12) + 8⁴/(13²·12)] = 17.65.

               Rounding down to the nearest integer gives f = 17. For a two-tailed test, the critical value is
               t_(1−α/2) = t.975 = 2.110, from Table A-1 of Appendix A.

     STEP 4:   t = (x̄ − Ȳ) / √(sX²/m + sY²/n) = (80 − 90)/2.48 = −4.032

     STEP 5:   For a two-tailed test, compare |t| with t.975 = 2.110.  Since 4.032 > 2.110, reject the null
               hypothesis and conclude that there is a significant difference between the two method biases,
               in favor of method 2.

     This box illustrates a situation involving two measurement methods where one method is more precise,
     but also more biased, than the other.  If no adjustment for bias is made, then for many purposes, the
     less biased, more variable method is preferable. However, proper bias adjustment can make both
     methods unbiased, so that the more precise method becomes the preferred method. Such adjustments
     can be based on QC check sample results, if the QC check samples are regarded as representative of
     environmental samples involving sufficiently similar analytes and matrices.
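The Satterthwaite computation in Box 5-4 can be sketched as a small function; the statistic and the approximate degrees of freedom follow the Box 3-16 formulas, and the critical value would still be read from a t table.

```python
import math

def satterthwaite_t(xbar, sx, m, ybar, sy, n):
    """Two-sample t statistic with Satterthwaite's approximate degrees of
    freedom, for samples with unequal variances (Box 3-16)."""
    vx, vy = sx ** 2 / m, sy ** 2 / n
    t = (xbar - ybar) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (m - 1) + vy ** 2 / (n - 1))
    return t, math.floor(df)   # round df down to the nearest integer

# Box 5-4 inputs: method 1 (80 ppb, SD 4, 13 obs), method 2 (90 ppb, SD 8)
t, f = satterthwaite_t(xbar=80, sx=4, m=13, ybar=90, sy=8, n=13)
# |t| is then compared with the tabled two-tailed critical value t(.975, f)
```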
[Figure 5-1. Illustration of Unbiased versus Biased Power Curves: the probability of rejecting the
null hypothesis is plotted against the true value of the parameter, showing an "Unbiased" curve
and a "Biased Upward" curve shifted relative to it.]

       In the context of hypothesis testing, the impact of bias can be quite severe in some
circumstances.  This can be illustrated by comparing the power curve of a test when bias is not
present with a power curve for the same test when bias is present. The basic influence of bias is
to shift the former "no bias" curve to the right or left, depending on the direction of the bias. If
the bias is constant, then the second curve will be an exact translation of the former curve; if not,
there will be a change in the shape of the second curve in addition to the translation. If the
existence of the bias is unknown, then the former power curve will be regarded as the curve that
determines the properties of the test, when in fact the second curve is the one that actually
represents the test's power.  For example, in Figure 5-1, when the true value of the parameter is
120, the "no bias" power is 0.72 but the true power (the biased power) is only 0.4, a substantial
difference.  Since bias is not impacted by changing the sample size, while the precision of
estimates and the power of tests increase with sample size, the relative importance of bias
becomes more pronounced when the sample size increases (i.e., when one makes the power curve
steeper). Similarly, if the same magnitude of bias exists for two different sites, then the impact on
testing errors will be more severe for the site having the smaller inherent variability in the
characteristic of interest (i.e., when bias represents a larger portion of total variability).

       To minimize the effects of bias: identify and document sources of potential bias; adopt
measurement procedures  (including specimen collection, handling, and analysis procedures) that
minimize the potential for bias; make a concerted effort to quantify bias whenever possible; and
make appropriate compensation for bias when possible.
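The way a constant bias translates a power curve can be sketched numerically. This is a hypothetical illustration (a one-sided z-test of the mean; all parameter values below are invented), not a calculation from this guidance:

```python
from statistics import NormalDist

_nd = NormalDist()

def power(mu, mu0=100.0, sigma=10.0, n=9, alpha=0.05, bias=0.0):
    """Power of the one-sided z-test of H0: mu <= mu0 at true mean mu.
    A constant measurement bias shifts the entire power curve."""
    z = _nd.inv_cdf(1 - alpha)
    effect = (mu + bias - mu0) / (sigma / n ** 0.5)  # bias shifts the effect
    return 1 - _nd.cdf(z - effect)

# The biased curve is a translated copy of the unbiased one
unbiased_curve = [power(mu) for mu in (100, 105, 110)]
biased_curve = [power(mu, bias=-3.0) for mu in (100, 105, 110)]
```

At any true mean, the downward-biased test here has lower power than the unbiased one, which is the kind of gap the Figure 5-1 discussion describes.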

5.2.5  Quantity vs. Quality of Data

       The above conclusions imply that, if compensation for bias cannot be made and if
statistically-based decisions are to be made, then there will be situations in  which serious
consideration should be given to using an imprecise (and perhaps relatively inexpensive) chemical
method having negligible bias as  compared to using a very precise method that has even a
moderate degree of bias.  The tradeoff favoring the imprecise method is especially relevant when
the inherent variability in the population is very large relative to the random measurement error.

       For example, suppose a mean concentration for a given spatial area (site) is of interest and
that the coefficient of variation (CV) characterizing the site's variability is 100%.  Let method A
denote an imprecise method, with measurement-error CV of 40%, and let method B denote a
highly  precise method, with measurement-error CV of 5%.  The overall variability, or total
variability, can essentially be regarded as the sum of the spatial variability and the measurement
variability.  These are obtained from the individual CVs in the form of variances. As the CV equals the
standard deviation divided by the mean, the site standard deviation is the CV times
the mean.  Thus, for the site, the variance is 1.00² × mean²; for method A, the variance is 0.40² ×
mean²; and for method B, the variance is 0.05² × mean². The overall variance when using
method A is then (1.00² × mean²) + (0.40² × mean²) = 1.16 × mean², and when using method B,
the variance is (1.00² × mean²) + (0.05² × mean²) = 1.0025 × mean².  It follows that the overall CV
when using each method is then (1.077 × mean) / mean = 107.7% for method A, and (1.001 ×
mean) / mean = 100.1% for method B.

       Now consider a sample of 25 specimens from the site. The precision of the sample mean
can then be characterized by the relative standard error (RSE) of the mean (which for the simple
random sample situation is simply the overall CV divided by the square root of the sample size).
For Method A, RSE = 21.54%; for method B, RSE = 20.02%. Now suppose that the imprecise
method (Method A) is unbiased, while the precise method (Method B) has a 10% bias (e.g., an
analyte percent recovery of 90%). An overall measure of error that reflects how well the sample
mean estimates the site mean is the relative root mean squared error (RRMSE):
                                RRMSE = √[(RB)² + (RSE)²]

where RB denotes the relative bias (RB = 0 for Method A since it is unbiased and RB = ±10% for
Method B since it is biased) and RSE is as defined above.  The overall error in the estimation of
the population mean (the RRMSE) would then be 21.54% for Method A and 22.38% for Method
B. If the relative bias for Method B was 15% rather than 10%, then the RRMSE for Method A
would be 21.54% and the RRMSE for Method B would be 25.02%, so the method difference is
even more pronounced. While the above illustration is portrayed in terms of estimation of a mean
based on a simple random sample, the basic concepts apply more generally.
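The comparison above can be reproduced directly; this sketch simply applies the CV, RSE, and RRMSE definitions given in the text to the stated inputs.

```python
import math

def rrmse(cv_site, cv_meas, rel_bias, n):
    """Relative root mean squared error of the sample mean under simple
    random sampling (Section 5.2.5 definitions)."""
    overall_cv = math.sqrt(cv_site ** 2 + cv_meas ** 2)  # total CV
    rse = overall_cv / math.sqrt(n)                      # relative std. error
    return math.sqrt(rel_bias ** 2 + rse ** 2)

# n = 25 specimens; site CV = 100%
method_a = rrmse(1.00, 0.40, 0.00, 25)   # imprecise but unbiased: 21.54%
method_b = rrmse(1.00, 0.05, 0.10, 25)   # precise but 10% biased: 22.38%
method_b_worse = rrmse(1.00, 0.05, 0.15, 25)   # 15% bias case: 25.02%
```

Despite its far smaller measurement-error CV, method B's bias gives it the larger overall error, matching the text's conclusion.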

       This example serves to illustrate that a method that may be considered preferable from a
chemical point of view [e.g., 85 or 90% recovery, 5% relative standard deviation (RSD)] may not
perform as well in a statistical application as a method with less bias and greater imprecision (e.g.,
zero bias, 40% RSD), especially when the inherent site variability is large relative to the
measurement-error RSD.
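The arithmetic in this illustration can be reproduced with a short Python sketch (standard library only; the function and variable names are illustrative, not part of the guidance):

```python
import math

# Worked example from the text: site CV = 100%; Method A has CV = 40% and no
# bias; Method B has CV = 5% and a 10% relative bias; n = 25 specimens.
SITE_CV = 1.00
N = 25

def rrmse(measurement_cv, relative_bias, site_cv=SITE_CV, n=N):
    """Relative root mean squared error of the sample mean (illustrative helper)."""
    overall_cv = math.sqrt(site_cv**2 + measurement_cv**2)  # variances add
    rse = overall_cv / math.sqrt(n)   # relative standard error of the mean (SRS)
    return math.sqrt(relative_bias**2 + rse**2), rse

rrmse_a, rse_a = rrmse(0.40, 0.00)  # Method A: imprecise but unbiased
rrmse_b, rse_b = rrmse(0.05, 0.10)  # Method B: precise but 10% biased

print(f"Method A: RSE = {rse_a:.2%}, RRMSE = {rrmse_a:.2%}")  # 21.54%, 21.54%
print(f"Method B: RSE = {rse_b:.2%}, RRMSE = {rrmse_b:.2%}")  # 20.02%, 22.38%
```

With a 15% relative bias for Method B, the same function gives RRMSE = 25.02%, matching the text.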

5.2.6   "Proof of Safety" vs. "Proof of Hazard"

       Because of the basic hypothesis testing philosophy, the null hypothesis is generally
specified in terms of the status quo (e.g., no change or action will take place if the null hypothesis is
not rejected).  Also, since the classical approach exercises direct control over the false rejection
error rate, this rate is generally associated with the error of most concern (for further discussion
of this point, see Section 1.2).  One difficulty, therefore,  may be obtaining a consensus on which
error should be of most concern. It is not unlikely that the Agency's viewpoint in this regard will
differ from the viewpoint of the regulated party.  In using this philosophy,  the Agency's ideal
approach is not only to set up the direction of the hypothesis in such a way that controlling the

EPA QA/G-9                                                                          Final
QAOO Version                              5-12                                 July 2000

-------
false rejection error protects the health and environment but also to set it up in a way that
encourages quality (high precision and accuracy) and minimizes expenditure of resources in
situations where decisions are relatively "easy" (e.g., all observations are far from the threshold
level of interest).

       In some cases, how one formulates the hypothesis testing problem  can lead to very
different sampling requirements.  For instance, following remediation activities at a hazardous
waste site, one may seek to answer "Is the site clean?"  Suppose one attempts to address this
question  by comparing a mean level from samples taken after the remediation with a threshold
level (chosen to reflect "safety"). If the threshold level is near background levels that might have
existed in the absence of the contamination, then it may be very difficult (i.e., require enormous
sample sizes) to  "prove" that the site is "safe." This is because the concentrations resulting from
even a highly efficient remediation under such circumstances would not be expected to deviate
greatly from such a threshold.  A better approach for dealing with this problem may be to
compare the remediated site with a reference ("uncontaminated") site, assuming that such a site
can be determined.

       To avoid excessive expense in collecting and analyzing samples for a contaminant,
compromises will sometimes be necessary.  For instance, suppose that a significance level of 0.05
is to be used; however, the affordable sample size may be expected to yield a test with power of
only 0.40 at some specified parameter value chosen to have practical significance (see Section
5.2.3). One possible way that compromise may be made in such a situation is to relax the
significance level, for instance, using α = 0.10, 0.15, or 0.20. By relaxing this false rejection rate,
a higher power (i.e., a lower false acceptance rate β) can be achieved. An argument can be made,
for example, that one should develop sampling plans and determine sample sizes in such a way
that both the false rejection and false acceptance errors are treated simultaneously and in a
balanced manner (for example, designing to achieve α = β = 0.15) instead of using the traditional
approach of fixing the false rejection error rate at 0.05 or 0.01 and letting β be determined by the
sample size. This approach of treating the false rejection and false acceptance errors
simultaneously is taken in the DQO Process, and it is recommended that several different scenarios
of α and β be investigated before a decision on specific values for α and β is made.
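The gain in power from relaxing the significance level can be illustrated with a one-sided, one-sample z-test sketch (standard library only; the effect size and sample size below are hypothetical, not taken from the guidance):

```python
import math
from statistics import NormalDist

z = NormalDist()  # standard normal distribution

def power_one_sided_z(alpha, effect, n):
    """Power of a one-sided, one-sample z-test at false rejection rate alpha.

    effect is the standardized effect size (true mean shift / sigma).
    Hypothetical illustration; not an EPA-prescribed computation.
    """
    z_crit = z.inv_cdf(1 - alpha)              # critical value at level alpha
    return z.cdf(effect * math.sqrt(n) - z_crit)

# Relaxing alpha lowers the false acceptance rate beta = 1 - power.
for alpha in (0.05, 0.10, 0.15, 0.20):
    p = power_one_sided_z(alpha, effect=0.25, n=25)
    print(f"alpha = {alpha:.2f}: power = {p:.2f}, beta = {1 - p:.2f}")
```

For this hypothetical design, moving from α = 0.05 to α = 0.20 raises the power from roughly 0.35 to roughly 0.66 at the same sample size, the kind of compromise discussed above.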
                                APPENDIX A

                            STATISTICAL TABLES
                           LIST OF TABLES

Table No.                                                         Page
A-1:  CRITICAL VALUES OF STUDENT'S t DISTRIBUTION  . . . . . . .  A - 3
A-2:  CRITICAL VALUES FOR THE STUDENTIZED RANGE TEST . . . . . .  A - 4
A-3:  CRITICAL VALUES FOR THE EXTREME VALUE TEST (DIXON'S TEST)   A - 5
A-4:  CRITICAL VALUES FOR DISCORDANCE TEST . . . . . . . . . . .  A - 6
A-5:  APPROXIMATE CRITICAL VALUES λr FOR ROSNER'S TEST . . . . .  A - 7
A-6:  QUANTILES OF THE WILCOXON SIGNED RANKS TEST  . . . . . . .  A - 9
A-7:  CRITICAL VALUES FOR THE RANK-SUM TEST  . . . . . . . . . .  A - 10
A-8:  PERCENTILES OF THE CHI-SQUARE DISTRIBUTION . . . . . . . .  A - 12
A-9:  PERCENTILES OF THE F DISTRIBUTION  . . . . . . . . . . . .  A - 13
A-10: VALUES OF THE PARAMETER λ FOR COHEN'S ESTIMATES
          ADJUSTING FOR NONDETECTED VALUES . . . . . . . . . . .  A - 18
A-11: PROBABILITIES FOR THE SMALL-SAMPLE
          MANN-KENDALL TEST FOR TREND  . . . . . . . . . . . . .  A - 19
A-12: QUANTILES FOR THE WALD-WOLFOWITZ TEST FOR RUNS . . . . . .  A - 20
A-13: MODIFIED QUANTILE TEST CRITICAL NUMBERS  . . . . . . . . .  A - 23
A-14: DUNNETT'S TEST (ONE TAILED)  . . . . . . . . . . . . . . .  A - 27
A-15: APPROXIMATE α-LEVEL CRITICAL POINTS FOR
          RANK VON NEUMANN RATIO TEST  . . . . . . . . . . . . .  A - 29
          TABLE A-1:  CRITICAL VALUES OF STUDENT'S t DISTRIBUTION

 Degrees of                              1 - α
  Freedom    .70    .75    .80    .85    .90    .95    .975    .99    .995
     1      0.727  1.000  1.376  1.963  3.078  6.314  12.706  31.821  63.657
     2      0.617  0.816  1.061  1.386  1.886  2.920   4.303   6.965   9.925
     3      0.584  0.765  0.978  1.250  1.638  2.353   3.182   4.541   5.841
     4      0.569  0.741  0.941  1.190  1.533  2.132   2.776   3.747   4.604
     5      0.559  0.727  0.920  1.156  1.476  2.015   2.571   3.365   4.032
     6      0.553  0.718  0.906  1.134  1.440  1.943   2.447   3.143   3.707
     7      0.549  0.711  0.896  1.119  1.415  1.895   2.365   2.998   3.499
     8      0.546  0.706  0.889  1.108  1.397  1.860   2.306   2.896   3.355
     9      0.543  0.703  0.883  1.100  1.383  1.833   2.262   2.821   3.250
    10      0.542  0.700  0.879  1.093  1.372  1.812   2.228   2.764   3.169
    11      0.540  0.697  0.876  1.088  1.363  1.796   2.201   2.718   3.106
    12      0.539  0.695  0.873  1.083  1.356  1.782   2.179   2.681   3.055
    13      0.538  0.694  0.870  1.079  1.350  1.771   2.160   2.650   3.012
    14      0.537  0.692  0.868  1.076  1.345  1.761   2.145   2.624   2.977
    15      0.536  0.691  0.866  1.074  1.341  1.753   2.131   2.602   2.947
    16      0.535  0.690  0.865  1.071  1.337  1.746   2.120   2.583   2.921
    17      0.534  0.689  0.863  1.069  1.333  1.740   2.110   2.567   2.898
    18      0.534  0.688  0.862  1.067  1.330  1.734   2.101   2.552   2.878
    19      0.533  0.688  0.861  1.066  1.328  1.729   2.093   2.539   2.861
    20      0.533  0.687  0.860  1.064  1.325  1.725   2.086   2.528   2.845
    21      0.532  0.686  0.859  1.063  1.323  1.721   2.080   2.518   2.831
    22      0.532  0.686  0.858  1.061  1.321  1.717   2.074   2.508   2.819
    23      0.532  0.685  0.858  1.060  1.319  1.714   2.069   2.500   2.807
    24      0.531  0.685  0.857  1.059  1.318  1.711   2.064   2.492   2.797
    25      0.531  0.684  0.856  1.058  1.316  1.708   2.060   2.485   2.787
    26      0.531  0.684  0.856  1.058  1.315  1.706   2.056   2.479   2.779
    27      0.531  0.684  0.855  1.057  1.314  1.703   2.052   2.473   2.771
    28      0.530  0.683  0.855  1.056  1.313  1.701   2.048   2.467   2.763
    29      0.530  0.683  0.854  1.055  1.311  1.699   2.045   2.462   2.756
    30      0.530  0.683  0.854  1.055  1.310  1.697   2.042   2.457   2.750
    40      0.529  0.681  0.851  1.050  1.303  1.684   2.021   2.423   2.704
    60      0.527  0.679  0.848  1.046  1.296  1.671   2.000   2.390   2.660
   120      0.526  0.677  0.845  1.041  1.289  1.658   1.980   2.358   2.617
    ∞       0.524  0.674  0.842  1.036  1.282  1.645   1.960   2.326   2.576

Note: The last row of the table (∞ degrees of freedom) gives the critical values
for a standard normal distribution (z), e.g., t.95,∞ = z.95 = 1.645.
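The infinite-degrees-of-freedom row of Table A-1 can be checked directly against the standard normal distribution with the Python standard library (a verification sketch, not part of the guidance; the finite-df rows would need a t quantile function such as scipy.stats.t.ppf):

```python
from statistics import NormalDist

z = NormalDist()  # standard normal, matching the last row of Table A-1
for p in (0.70, 0.75, 0.80, 0.85, 0.90, 0.95, 0.975, 0.99, 0.995):
    print(f"z_{p} = {z.inv_cdf(p):.3f}")   # e.g. z_0.95 = 1.645
```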
     TABLE A-2:  CRITICAL VALUES FOR THE STUDENTIZED RANGE TEST

                        Level of Significance α
              0.01             0.05             0.10
    n       a      b         a      b         a      b
     3    1.737  2.000     1.758  1.999     1.782  1.997
     4    1.87   2.445     1.98   2.429     2.04   2.409
     5    2.02   2.803     2.15   2.753     2.22   2.712
     6    2.15   3.095     2.28   3.012     2.37   2.949
     7    2.26   3.338     2.40   3.222     2.49   3.143
     8    2.35   3.543     2.50   3.399     2.59   3.308
     9    2.44   3.720     2.59   3.552     2.68   3.449
    10    2.51   3.875     2.67   3.685     2.76   3.57
    11    2.58   4.012     2.74   3.80      2.84   3.68
    12    2.64   4.134     2.80   3.91      2.90   3.78
    13    2.70   4.244     2.86   4.00      2.96   3.87
    14    2.75   4.34      2.92   4.09      3.02   3.95
    15    2.80   4.44      2.97   4.17      3.07   4.02
    16    2.84   4.52      3.01   4.24      3.12   4.09
    17    2.88   4.60      3.06   4.31      3.17   4.15
    18    2.92   4.67      3.10   4.37      3.21   4.21
    19    2.96   4.74      3.14   4.43      3.25   4.27
    20    2.99   4.80      3.18   4.49      3.29   4.32
    25    3.15   5.06      3.34   4.71      3.45   4.53
    30    3.27   5.26      3.47   4.89      3.59   4.70
    35    3.38   5.42      3.58   5.04      3.70   4.84
    40    3.47   5.56      3.67   5.16      3.79   4.96
    45    3.55   5.67      3.75   5.26      3.88   5.06
    50    3.62   5.77      3.83   5.35      3.95   5.14
    55    3.69   5.86      3.90   5.43      4.02   5.22
    60    3.75   5.94      3.96   5.51      4.08   5.29
    65    3.80   6.01      4.01   5.57      4.14   5.35
    70    3.85   6.07      4.06   5.63      4.19   5.41
    75    3.90   6.13      4.11   5.68      4.24   5.46
    80    3.94   6.18      4.16   5.73      4.28   5.51
    85    3.99   6.23      4.20   5.78      4.33   5.56
    90    4.02   6.27      4.24   5.82      4.36   5.60
    95    4.06   6.32      4.27   5.86      4.40   5.64
   100    4.10   6.36      4.31   5.90      4.44   5.68
   150    4.38   6.64      4.59   6.18      4.72   5.96
   200    4.59   6.84      4.78   6.39      4.90   6.15
   500    5.13   7.42      5.47   6.94      5.49   6.72
  1000    5.57   7.80      5.79   7.33      5.92   7.11
-------
       TABLE A-3: CRITICAL VALUES FOR THE EXTREME VALUE TEST
                              (DIXON'S TEST)

              Level of Significance α
    n      0.10     0.05     0.01
    3     0.886    0.941    0.988
    4     0.679    0.765    0.889
    5     0.557    0.642    0.780
    6     0.482    0.560    0.698
    7     0.434    0.507    0.637
    8     0.479    0.554    0.683
    9     0.441    0.512    0.635
   10     0.409    0.477    0.597
   11     0.517    0.576    0.679
   12     0.490    0.546    0.642
   13     0.467    0.521    0.615
   14     0.492    0.546    0.641
   15     0.472    0.525    0.616
   16     0.454    0.507    0.595
   17     0.438    0.490    0.577
   18     0.424    0.475    0.561
   19     0.412    0.462    0.547
   20     0.401    0.450    0.535
   21     0.391    0.440    0.524
   22     0.382    0.430    0.514
   23     0.374    0.421    0.505
   24     0.367    0.413    0.497
   25     0.360    0.406    0.489
-------
           TABLE A-4: CRITICAL VALUES FOR DISCORDANCE TEST

              Level of Significance α
    n      0.01     0.05
    3     1.155    1.153
    4     1.492    1.463
    5     1.749    1.672
    6     1.944    1.822
    7     2.097    1.938
    8     2.221    2.032
    9     2.323    2.110
   10     2.410    2.176
   11     2.485    2.234
   12     2.550    2.285
   13     2.607    2.331
   14     2.659    2.371
   15     2.705    2.409
   16     2.747    2.443
   17     2.785    2.475
   18     2.821    2.504
   19     2.854    2.532
   20     2.884    2.557
   21     2.912    2.580
   22     2.939    2.603
   23     2.963    2.624
   24     2.987    2.644
   25     3.009    2.663
   26     3.029    2.681
   27     3.049    2.698
   28     3.068    2.714
   29     3.085    2.730
   30     3.103    2.745
   31     3.119    2.759
   32     3.135    2.773
   33     3.150    2.786
   34     3.164    2.799
   35     3.178    2.811
   36     3.191    2.823
   37     3.204    2.835
   38     3.216    2.846
   39     3.228    2.857
   40     3.240    2.866
   41     3.251    2.877
   42     3.261    2.887
   43     3.271    2.896
   44     3.282    2.905
   45     3.292    2.914
   46     3.302    2.923
   47     3.310    2.931
   48     3.319    2.940
   49     3.329    2.948
   50     3.336    2.956
-------
     TABLE A-5: APPROXIMATE CRITICAL VALUES λr FOR ROSNER'S TEST

n
25





26





27





28





29





30





31






r
1
2
3
4
5
10
1
2
3
4
5
10
1
2
3
4
5
10
1
2
3
4
5
10
1
2
3
4
5
10
1
2
3
4
5
10
1
2
3
4
5
10
a
0.05
2.82
2.80
2.78
2.76
2.73
2.59
2.84
2.82
2.80
2.78
2.76
2.62
2.86
2.84
2.82
2.80
2.78
2.65
2.88
2.86
2.84
2.82
2.80
2.68
2.89
2.88
2.86
2.84
2.82
2.71
2.91
2.89
2.88
2.86
2.84
2.73
2.92
2.91
2.89
2.88
2.86
2.76
0.01
3.14
3.11
3.09
3.06
3.03
2.85
3.16
3.14
3.11
3.09
3.06
2.89
3.18
3.16
3.14
3.11
3.09
2.93
3.20
3.18
3.16
3.14
3.11
2.97
3.22
3.20
3.18
3.16
3.14
3.00
3.24
3.22
3.20
3.18
3.16
3.03
3.25
3.24
3.22
3.20
3.18
3.06

n
32





33





34





35





36





37





38






r
1
2
3
4
5
10
1
2
3
4
5
10
1
2
3
4
5
10
1
2
3
4
5
10
1
2
3
4
5
10
1
2
3
4
5
10
1
2
3
4
5
10
a
0.05
2.94
2.92
2.91
2.89
2.88
2.78
2.95
2.94
2.92
2.91
2.89
2.80
2.97
2.95
2.94
2.92
2.91
2.82
2.98
2.97
2.95
2.94
2.92
2.84
2.99
2.98
2.97
2.95
2.94
2.86
3.00
2.99
2.98
2.97
2.95
2.88
3.01
3.00
2.99
2.98
2.97
2.91
0.01
3.27
3.25
3.24
3.22
3.20
3.09
3.29
3.27
3.25
3.24
3.22
3.11
3.30
3.29
3.27
3.25
3.24
3.14
3.32
3.30
3.29
3.27
3.25
3.16
3.33
3.32
3.30
3.29
3.27
3.18
3.34
3.33
3.32
3.30
3.29
3.20
3.36
3.34
3.33
3.32
3.30
3.22

n
39





40





41





42





43





44





45






r
1
2
3
4
5
10
1
2
3
4
5
10
1
2
3
4
5
10
1
2
3
4
5
10
1
2
3
4
5
10
1
2
3
4
5
10
1
2
3
4
5
10
a
0.05
3.03
3.01
3.00
2.99
2.98
2.91
3.04
3.03
3.01
3.00
2.99
2.92
3.05
3.04
3.03
3.01
3.00
2.94
3.06
3.05
3.04
3.03
3.01
2.95
3.07
3.06
3.05
3.04
3.03
2.97
3.08
3.07
3.06
3.05
3.04
2.98
3.09
3.08
3.07
3.06
3.05
2.99
0.01
3.37
3.36
3.34
3.33
3.32
3.24
3.38
3.37
3.36
3.34
3.33
3.25
3.39
3.38
3.37
3.36
3.34
3.27
3.40
3.39
3.38
3.37
3.36
3.29
3.41
3.40
3.39
3.38
3.37
3.30
3.43
3.41
3.40
3.39
3.38
3.32
3.44
3.43
3.41
3.40
3.39
3.33
-------
     TABLE A-5: APPROXIMATE CRITICAL VALUES λr FOR ROSNER'S TEST

n
46





47





48





49





50





60






r
1
2
3
4
5
10
1
2
3
4
5
10
1
2
3
4
5
10
1
2
3
4
5
10
1
2
3
4
5
10
1
2
3
4
5
10
a
0.05
3.09
3.09
3.08
3.07
3.06
3.00
3.10
3.09
3.09
3.08
3.07
3.01
3.11
3.10
3.09
3.09
3.08
3.03
3.12
3.11
3.10
3.09
3.09
3.04
3.13
3.12
3.11
3.10
3.09
3.05
3.20
3.19
3.19
3.18
3.17
3.14
0.01
3.45
3.44
3.43
3.41
3.40
3.34
3.46
3.45
3.44
3.43
3.41
3.36
3.46
3.46
3.45
3.44
3.43
3.37
3.47
3.46
3.46
3.45
3.44
3.38
3.48
3.47
3.46
3.46
3.45
3.39
3.56
3.55
3.55
3.54
3.53
3.49

n
70





80





90





100





150





200






r
1
2
3
4
5
10
1
2
3
4
5
10
1
2
3
4
5
10
1
2
3
4
5
10
1
2
3
4
5
10
1
2
3
4
5
10
a
0.05
3.26
3.25
3.25
3.24
3.24
3.21
3.31
3.30
3.30
3.29
3.29
3.26
3.35
3.34
3.34
3.34
3.33
3.31
3.38
3.38
3.38
3.37
3.37
3.35
3.52
3.51
3.51
3.51
3.51
3.50
3.61
3.60
3.60
3.60
3.60
3.59
0.01
3.62
3.62
3.61
3.60
3.60
3.57
3.67
3.67
3.66
3.66
3.65
3.63
3.72
3.71
3.71
3.70
3.70
3.68
3.75
3.75
3.75
3.74
3.74
3.72
3.89
3.89
3.89
3.88
3.88
3.87
3.98
3.98
3.97
3.97
3.97
3.96

n
250


300
350
400
450
500

r
1
5
10
1
5
10
1
5
10
1
5
10
1
5
10
1
5
10
a
0.05
3.67
3.67
3.66
3.72
3.72
3.71
3.77
3.76
3.76
3.80
3.80
3.80
3.84
3.83
3.83
3.86
3.86
3.86
0.01
4.04
4.04
4.03
4.09
4.09
4.09
4.14
4.13
4.13
4.17
4.17
4.16
4.20
4.20
4.20
4.23
4.23
4.22
-------
      TABLE A-6: QUANTILES OF THE WILCOXON SIGNED RANKS TEST

    n     w.01    w.05    w.10    w.20
    4      0       0       1       3
    5      0       1       3       4
    6      0       3       4       6
    7      1       4       6       9
    8      2       6       9      12
    9      4       9      11      15
   10      6      11      15      19
   11      8      14      18      23
   12     10      18      22      28
   13     13      22      27      33
   14     16      26      32      39
   15     20      31      37      45
   16     24      36      43      51
   17     28      42      49      58
   18     33      48      56      66
   19     38      54      63      74
   20     44      61      70      82
-------
                TABLE A-7: CRITICAL VALUES FOR THE RANK-SUM TEST

                                           m
  n    α     2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
  2  0.05    0   0   0   1   1   1   2   2   2   2   3   3   4   4   4   4   5   5   5
     0.10    0   1   1   2   2   2   3   3   4   4   5   5   5   6   6   7   7   8   8
  3  0.05    0   1   1   2   3   3   4   5   5   6   6   7   8   8   9  10  10  11  12
     0.10    1   2   2   3   4   5   6   6   7   8   9  10  11  11  12  13  14  15  16
  4  0.05    0   1   2   3   4   5   6   7   8   9  10  11  12  13  15  16  17  18  19
     0.10    1   2   4   5   6   7   8  10  11  12  13  14  16  17  18  19  21  22  23
  5  0.05    1   2   3   5   6   7   9  10  12  13  14  16  17  19  20  21  23  24  26
     0.10    2   3   5   6   8   9  11  13  14  16  18  19  21  23  24  26  28  29  31
  6  0.05    1   3   4   6   8   9  11  13  15  17  18  20  22  24  26  27  29  31  33
     0.10    2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  35  37  39
  7  0.05    1   3   5   7   9  12  14  16  18  20  22  25  27  29  31  34  36  38  40
     0.10    2   5   7   9  12  14  17  19  22  24  27  29  32  34  37  39  42  44  47
  8  0.05    2   4   6   9  11  14  16  19  21  24  27  29  32  34  37  40  42  45  48
     0.10    3   6   8  11  14  17  20  23  25  28  31  34  37  40  43  46  49  52  55
  9  0.05    2   5   7  10  13  16  19  22  25  28  31  34  37  40  43  46  49  52  55
     0.10    3   6  10  13  16  19  23  26  29  32  36  39  42  46  49  53  56  59  63
 10  0.05    2   5   8  12  15  18  21  25  28  32  35  38  42  45  49  52  56  59  63
     0.10    4   7  11  14  18  22  25  29  33  37  40  44  48  52  55  59  63  67  71
 11  0.05    2   6   9  13  17  20  24  28  32  35  39  43  47  51  55  58  62  66  70
     0.10    4   8  12  16  20  24  28  32  37  41  45  49  53  58  62  66  70  74  79
 12  0.05    3   6  10  14  18  22  27  31  35  39  43  48  52  56  61  65  69  73  78
     0.10    5   9  13  18  22  27  31  36  40  45  50  54  59  64  68  73  78  82  87
 13  0.05    3   7  11  16  20  25  29  34  38  43  48  52  57  62  66  71  76  81  85
     0.10    5  10  14  19  24  29  34  39  44  49  54  59  64  69  75  80  85  90  95
 14  0.05    4   8  12  17  22  27  32  37  42  47  52  57  62  67  72  78  83  88  93
     0.10    5  11  16  21  26  32  37  42  48  53  59  64  70  75  81  86  92  98 103
 15  0.05    4   8  13  19  24  29  34  40  45  51  56  62  67  73  78  84  89  95 101
     0.10    6  11  17  23  28  34  40  46  52  58  64  69  75  81  87  93  99 105 111
 16  0.05    4   9  15  20  26  31  37  43  49  55  61  66  72  78  84  90  96 102 108
     0.10    6  12  18  24  30  37  43  49  55  62  68  75  81  87  94 100 107 113 120
 17  0.05    4  10  16  21  27  34  40  46  52  58  65  71  78  84  90  97 103 110 116
     0.10    7  13  19  26  32  39  46  53  59  66  73  80  86  93 100 107 114 121 128
-------
          TABLE A-7: CRITICAL VALUES FOR THE RANK-SUM TEST (CONTINUED)

                                           m
  n    α     2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
 18  0.05    5  10  17  23  29  36  42  49  56  62  69  76  83  89  96 103 110 117 124
     0.10    7  14  21  28  35  42  49  56  63  70  78  85  92  99 107 114 121 129 136
 19  0.05    5  11  18  24  31  38  45  52  59  66  73  81  88  95 102 110 117 124 131
     0.10    8  15  22  29  37  44  52  59  67  74  82  90  98 105 113 121 129 136 144
 20  0.05    5  12  19  26  33  40  48  55  63  70  78  85  93 101 108 116 124 131 139
     0.10    8  16  23  31  39  47  55  63  71  79  87  95 103 111 120 128 136 144 152
-------
        TABLE A-8:  PERCENTILES OF THE CHI-SQUARE DISTRIBUTION

                                         1 - α
  v     .005      .010      .025     .050    .100    .900   .950   .975   .990   .995
  1   0.0000393 0.000157  0.000982  0.00393  0.0158   2.71   3.84   5.02   6.63   7.88
  2   0.0100    0.0201    0.0506    0.103    0.211    4.61   5.99   7.38   9.21  10.60
  3   0.072     0.115     0.216     0.352    0.584    6.25   7.81   9.35  11.34  12.84
  4   0.207     0.297     0.484     0.711    1.064    7.78   9.49  11.14  13.28  14.86
  5   0.412     0.554     0.831     1.145    1.61     9.24  11.07  12.83  15.09  16.75
  6   0.676     0.872     1.24      1.64     2.20    10.64  12.59  14.45  16.81  18.55
  7   0.989     1.24      1.69      2.17     2.83    12.02  14.07  16.01  18.48  20.28
  8   1.34      1.65      2.18      2.73     3.49    13.36  15.51  17.53  20.09  21.96
  9   1.73      2.09      2.70      3.33     4.17    14.68  16.92  19.02  21.67  23.59
 10   2.16      2.56      3.25      3.94     4.87    15.99  18.31  20.48  23.21  25.19
 11   2.60      3.05      3.82      4.57     5.58    17.28  19.68  21.92  24.73  26.76
 12   3.07      3.57      4.40      5.23     6.30    18.55  21.03  23.34  26.22  28.30
 13   3.57      4.11      5.01      5.89     7.04    19.81  22.36  24.74  27.69  29.82
 14   4.07      4.66      5.63      6.57     7.79    21.06  23.68  26.12  29.14  31.32
 15   4.60      5.23      6.26      7.26     8.55    22.31  25.00  27.49  30.58  32.80
 16   5.14      5.81      6.91      7.96     9.31    23.54  26.30  28.85  32.00  34.27
 17   5.70      6.41      7.56      8.67    10.09    24.77  27.59  30.19  33.41  35.72
 18   6.26      7.01      8.23      9.39    10.86    25.99  28.87  31.53  34.81  37.16
 19   6.84      7.63      8.91     10.12    11.65    27.20  30.14  32.85  36.19  38.58
 20   7.43      8.26      9.59     10.85    12.44    28.41  31.41  34.17  37.57  40.00
 21   8.03      8.90     10.28     11.59    13.24    29.62  32.67  35.48  38.93  41.40
 22   8.64      9.54     10.98     12.34    14.04    30.81  33.92  36.78  40.29  42.80
 23   9.26     10.20     11.69     13.09    14.85    32.01  35.17  38.08  41.64  44.18
 24   9.89     10.86     12.40     13.85    15.66    33.20  36.42  39.36  42.98  45.56
 25  10.52     11.52     13.12     14.61    16.47    34.38  37.65  40.65  44.31  46.93
 26  11.16     12.20     13.84     15.38    17.29    35.56  38.89  41.92  45.64  48.29
 27  11.81     12.88     14.57     16.15    18.11    36.74  40.11  43.19  46.96  49.64
 28  12.46     13.56     15.31     16.93    18.94    37.92  41.34  44.46  48.28  50.99
 29  13.12     14.26     16.05     17.71    19.77    39.09  42.56  45.72  49.59  52.34
 30  13.79     14.95     16.79     18.49    20.60    40.26  43.77  46.98  50.89  53.67
 40  20.71     22.16     24.43     26.51    29.05    51.81  55.76  59.34  63.69  66.77
 50  27.99     29.71     32.36     34.76    37.69    63.17  67.50  71.42  76.15  79.49
 60  35.53     37.48     40.48     43.19    46.46    74.40  79.08  83.30  88.38  91.95
 70  43.28     45.44     48.76     51.74    55.33    85.53  90.53  95.02 100.4  104.2
 80  51.17     53.54     57.15     60.39    64.28    96.58 101.9  106.6  112.3  116.3
 90  59.20     61.75     65.65     69.13    73.29   107.6  113.1  118.1  124.1  128.3
100  67.33     70.06     74.22     77.93    82.36   118.5  124.3  129.6  135.8  140.2
-------
                            TABLE A-9: PERCENTILES OF THE F DISTRIBUTION
Degrees
Freedom
for
Denom-
inator
1 .50
.90
.95
.975
.99
2 .50
.90
.95
.975
.99
3 .50
.90
.95
.975
.99
4 .50
.90
.95
.975
.99
.999
Degrees of Freedom for Numerator


1

1.00
39.9
161
648
4052
0.667
8.53
18.5
38.5
98.5
0.585
5.54
10.1
17.4
34.1
0.549
4.54
7.71
12.2
21.2
74.1

2

1.50
49.5
200
800
5000
1.00
9.00
19.0
39.0
99.0
0.881
5.46
9.55
16.0
30.8
0.828
4.32
6.94
10.6
18.0
61.2

3

1.71
53.6
216
864
5403
1.13
9.16
19.2
39.2
99.2
1.00
5.39
9.28
15.4
29.5
0.941
4.19
6.59
9.98
16.7
56.2

4

1.82
55.8
225
900
5625
1.21
9.24
19.2
39.2
99.2
1.06
5.34
9.12
15.1
28.7
1.00
4.11
6.39
9.60
16.0
53.4

5

1.89
57.2
230
922
5764
1.25
9.29
19.3
39.3
99.3
1.10
5.31
9.01
14.9
28.2
1.04
4.05
6.26
9.36
15.5
51.7

6

1.94
58.2
234
937
5859
1.28
9.33
19.3
39.3
99.3
1.13
5.28
8.94
14.7
27.9
1.06
4.01
6.16
9.20
15.2
50.5

7

1.98
58.9
237
948
5928
1.30
9.35
19.4
39.4
99.4
1.15
5.27
8.89
14.6
27.7
1.08
3.98
6.09
9.07
15.0
49.7

8

2.00
59.4
239
957
5981
1.32
9.37
19.4
39.4
99.4
1.16
5.25
8.85
14.5
27.5
1.09
3.95
6.04
8.98
14.8
49.0

9

2.03
59.9
241
963
6022
1.33
9.38
19.4
39.4
99.4
1.17
5.24
8.81
14.5
27.3
1.10
3.94
6.00
8.90
14.7
48.5

10

2.04
60.2
242
969
6056
1.34
9.39
19.4
39.4
99.4
1.18
5.23
8.79
14.4
27.2
1.11
3.92
5.96
8.84
14.5
48.1

12

2.07
60.7
244
977
6106
1.36
9.41
19.4
39.4
99.4
1.20
5.22
8.74
14.3
27.1
1.13
3.90
5.91
8.75
14.4
47.4

15

2.09
61.2
246
985
6157
1.38
9.42
19.4
39.4
99.4
1.21
5.20
8.70
14.3
26.9
1.14
3.87
5.86
8.66
14.2
46.8

20

2.12
61.7
248
993
6209
1.39
9.44
19.4
39.4
99.4
1.23
5.18
8.66
14.2
26.7
1.15
3.84
5.80
8.56
14.0
46.1

24

2.13
62.0
249
997
6235
1.40
9.45
19.5
39.5
99.5
1.23
5.18
8.64
14.1
26.6
1.16
3.83
5.77
8.51
13.9
45.8

30

2.15
62.3
250
1001
6261
1.41
9.46
19.5
39.5
99.5
1.24
5.17
8.62
14.1
26.5
1.16
3.82
5.75
8.46
13.8
45.4

60

2.17
62.8
252
1010
6313
1.43
9.47
19.5
39.5
99.5
1.25
5.15
8.57
14.0
26.3
1.18
3.79
5.69
8.36
13.7
44.7

120

2.18
63.1
253
1014
6339
1.43
9.48
19.5
39.5
99.5
1.26
5.14
8.55
13.9
26.2
1.18
3.78
5.66
8.31
13.6
44.4



2.20
63.3
254
1018
6366
1.44
9.49
19.5
39.5
99.5
1.27
5.13
8.53
13.9
26.1
1.19
3.76
5.63
8.26
13.5
44.1
-------
                            TABLE A-9: PERCENTILES OF THE F DISTRIBUTION
Degrees
Freedom
for
Denom-
inator
5 .50
.90
.95
.975
.99
.999
6 .50
.90
.95
.975
.99
.999
7 .50
.90
.95
.975
.99
.999
8 .50
.90
.95
.975
.99
.999
Degrees of Freedom for Numerator


1

0.528
4.06
6.61
10.0
16.3
47.2
0.515
3.78
5.99
8.81
13.7
35.5
0.506
3.59
5.59
8.07
12.2
29.2
0.499
3.46
5.32
7.57
11.3
25.4

2

0.799
3.78
5.79
8.43
13.3
37.1
0.780
3.46
5.14
7.26
10.9
27.0
0.767
3.26
4.74
6.54
9.55
21.7
0.757
3.11
4.46
6.06
8.65
18.5

3

0.907
3.62
5.41
7.76
12.1
33.2
0.886
3.29
4.76
6.60
9.78
23.7
0.871
3.07
4.35
5.89
8.45
18.8
0.860
2.92
4.07
5.42
7.59
15.8

4

0.965
3.52
5.19
7.39
11.4
31.1
0.942
3.18
4.53
6.23
9.15
21.9
0.926
2.96
4.12
5.52
7.85
17.2
0.915
2.81
3.84
5.05
7.01
14.4

5

1.00
3.45
5.05
7.15
11.0
29.8
0.977
3.11
4.39
5.99
8.75
20.8
0.960
2.88
3.97
5.29
7.46
16.2
0.948
2.73
3.69
4.82
6.63
13.5

6

1.02
3.40
4.95
6.98
10.7
28.8
1.00
3.05
4.28
5.82
8.47
20.0
0.983
2.83
3.87
5.12
7.19
15.5
0.971
2.67
3.58
4.65
6.37
12.9

7

1.04
3.37
4.88
6.85
10.5
28.2
1.02
3.01
4.21
5.70
8.26
19.5
1.00
2.78
3.79
4.99
6.99
15.0
0.988
2.62
3.50
4.53
6.18
12.4

8

1.05
3.34
4.82
6.76
10.3
27.6
1.03
2.98
4.15
5.60
8.10
19.0
1.01
2.75
3.73
4.90
6.84
14.6
1.00
2.59
3.44
4.43
6.03
12.0

9

1.06
3.32
4.77
6.68
10.2
27.2
1.04
2.96
4.10
5.52
7.98
18.7
1.02
2.72
3.68
4.82
6.72
14.5
1.01
2.56
3.39
4.36
5.91
11.8

10

1.07
3.30
4.74
6.62
10.1
26.9
1.05
2.94
4.06
5.46
7.87
18.4
1.03
2.70
3.64
4.76
6.62
14.1
1.02
2.54
3.35
4.30
5.81
11.5

12

1.09
3.27
4.68
6.52
9.89
26.4
1.06
2.90
4.00
5.37
7.72
18.0
1.04
2.67
3.57
4.67
6.47
13.7
1.03
2.50
3.28
4.20
5.67
11.2

15

1.10
3.24
4.62
6.43
9.72
25.9
1.07
2.87
3.94
5.27
7.56
17.6
1.05
2.63
3.51
4.57
6.31
13.3
1.04
2.46
3.22
4.10
5.52
10.8

20

1.11
3.21
4.56
6.33
9.55
25.4
1.08
2.84
3.87
5.17
7.40
17.1
1.07
2.59
3.44
4.47
6.16
12.9
1.05
2.42
3.15
4.00
5.36
10.5

24

1.12
3.19
4.53
6.28
9.47
25.1
1.09
2.82
3.84
5.12
7.31
16.9
1.07
2.58
3.41
4.42
6.07
12.7
1.06
2.40
3.12
3.95
5.28
10.3

30

1.12
3.17
4.50
6.23
9.38
24.9
1.10
2.80
3.81
5.07
7.23
16.7
1.08
2.56
3.38
4.36
5.99
12.5
1.07
2.38
3.08
3.89
5.20
10.1

60

1.14
3.14
4.43
6.12
9.20
24.3
1.11
2.76
3.74
4.96
7.06
16.2
1.09
2.51
3.30
4.25
5.82
12.1
1.08
2.34
3.01
3.78
5.03
9.73

120

1.14
3.12
4.40
6.07
9.11
24.1
1.12
2.74
3.70
4.90
6.97
16.0
1.10
2.49
3.27
4.20
5.74
11.9
1.08
2.32
2.97
3.73
4.95
9.53



1.15
3.11
4.37
6.02
9.02
23.8
1.12
2.72
3.67
4.85
6.88
15.7
1.10
2.47
3.23
4.14
5.65
11.7
1.09
2.29
2.93
3.67
4.86
9.33
-------
                            TABLE A-9: PERCENTILES OF THE F DISTRIBUTION
Degrees
Freedom
for
Denom-
inator
9 .50
.90
.95
.975
.99
.999
10 .50
.90
.95
.975
.99
.999
12 .50
.90
.95
.975
.99
.999
15 .50
.90
.95
.975
.99
.999
Degrees of Freedom for Numerator


1

0.494
3.36
5.12
7.21
10.6
22.9
0.490
3.29
4.96
6.94
10.0
21.0
0.484
3.18
4.75
6.55
9.33
18.6
0.478
3.07
4.54
6.20
8.68
16.6

2

0.749
3.01
4.26
5.71
8.02
16.4
0.743
2.92
4.10
5.46
7.56
14.9
0.735
2.81
3.89
5.10
6.93
13.0
0.726
2.70
3.68
4.77
6.36
11.3

3

0.852
2.81
3.86
5.08
6.99
13.9
0.845
2.73
3.71
4.83
6.55
12.6
0.835
2.61
3.49
4.47
5.95
10.8
0.826
2.49
3.29
4.15
5.42
9.34

4

0.906
2.69
3.63
4.72
6.42
12.6
0.899
2.61
3.48
4.47
5.99
11.3
0.888
2.48
3.26
4.12
5.41
9.63
0.878
2.36
3.06
3.80
4.89
8.25

5

0.939
2.61
3.48
4.48
6.06
11.7
0.932
2.52
3.33
4.24
5.64
10.5
0.921
2.39
3.11
3.89
5.06
8.89
0.911
2.27
2.90
3.58
4.56
7.57

6

0.962
2.55
3.37
4.32
5.80
11.1
0.954
2.46
3.22
4.07
5.39
9.93
0.943
2.33
3.00
3.73
4.82
8.38
0.933
2.21
2.79
3.41
4.32
7.09

7

0.978
2.51
3.29
4.20
5.61
10.7
0.971
2.41
3.14
3.95
5.20
9.52
0.959
2.28
2.91
3.61
4.64
8.00
0.949
2.16
2.71
3.29
4.14
6.74

8

0.990
2.47
3.23
4.10
5.47
10.4
0.983
2.38
3.07
3.85
5.06
9.20
0.972
2.24
2.85
3.51
4.50
7.71
0.960
2.12
2.64
3.20
4.00
6.47

9

1.00
2.44
3.18
4.03
5.35
10.1
0.992
2.35
3.02
3.78
4.94
8.96
0.981
2.21
2.80
3.44
4.39
7.48
0.970
2.09
2.59
3.12
3.89
6.26

10

1.01
2.42
3.14
3.96
5.26
9.89
1.00
2.32
2.98
3.72
4.85
8.75
0.989
2.19
2.75
3.37
4.30
7.29
0.977
2.06
2.54
3.06
3.80
6.08

12

1.01
2.38
3.07
3.87
5.11
9.57
1.01
2.28
2.91
3.62
4.71
8.45
1.00
2.15
2.69
3.28
4.16
7.00
0.989
2.02
2.48
2.96
3.67
5.81

15

1.03
2.34
3.01
3.77
4.96
9.24
1.02
2.24
2.84
3.52
4.56
8.13
1.01
2.10
2.62
3.18
4.01
6.71
1.00
1.97
2.40
2.86
3.52
5.54

20

1.04
2.30
2.94
3.67
4.81
8.90
1.03
2.20
2.77
3.42
4.41
7.80
1.02
2.06
2.54
3.07
3.86
6.40
1.01
1.92
2.33
2.76
3.37
5.25

24

1.05
2.28
2.90
3.61
4.73
8.72
1.04
2.18
2.74
3.37
4.33
7.64
1.03
2.04
2.51
3.02
3.78
6.25
1.02
1.90
2.29
2.70
3.29
5.10

30

1.05
2.25
2.86
3.56
4.65
8.55
1.05
2.16
2.70
3.31
4.25
7.47
1.03
2.01
2.47
2.96
3.70
6.09
1.02
1.87
2.25
2.64
3.21
4.95

60

1.07
2.21
2.79
3.45
4.48
8.19
1.06
2.11
2.62
3.20
4.08
7.12
1.05
1.96
2.38
2.85
3.54
5.76
1.03
1.82
2.16
2.52
3.05
4.64

120

1.07
2.18
2.75
3.39
4.40
8.00
1.06
2.08
2.58
3.14
4.00
6.94
1.05
1.93
2.34
2.79
3.45
5.59
1.04
1.79
2.11
2.46
2.96
4.48



1.08
2.16
2.71
3.33
4.31
7.81
1.07
2.06
2.54
3.08
3.91
6.76
1.06
1.90
2.30
2.72
3.36
5.42
1.05
1.76
2.07
2.40
2.87
4.31
-------
                            TABLE A-9: PERCENTILES OF THE F DISTRIBUTION
Degrees
Freedom
for
Denom-
inator
20 .50
.90
.95
.975
.99
.999
24 .50
.90
.95
.975
.99
.999
30 .50
.90
.95
.975
.99
.999
60 .50
.90
.95
.975
.99
.999
Degrees of Freedom for Numerator


1

0.472
2.97
4.35
5.87
8.10
14.8
0.469
2.93
4.26
5.72
7.82
14.0
0.466
2.88
4.17
5.57
7.56
13.3
0.461
2.79
4.00
5.29
7.08
12.0

2

0.718
2.59
3.49
4.46
5.85
9.95
0.714
2.54
3.40
4.32
5.61
9.34
0.709
2.49
3.32
4.18
5.39
8.77
0.701
2.39
3.15
3.93
4.98
7.77

3

0.816
2.38
3.10
3.86
4.94
8.10
0.812
2.33
3.01
3.72
4.72
7.55
0.807
2.28
2.92
3.59
4.51
7.05
0.798
2.18
2.76
3.34
4.13
6.17

4

0.868
2.25
2.87
3.51
4.43
7.10
0.863
2.19
2.78
3.38
4.22
6.59
0.858
2.14
2.69
3.25
4.02
6.12
0.849
2.04
2.53
3.01
3.65
5.31

5

0.900
2.16
2.71
3.29
4.10
6.46
0.895
2.10
2.62
3.15
3.90
5.98
0.890
2.05
2.53
3.03
3.70
5.53
0.880
1.95
2.37
2.79
3.34
4.76

6

0.922
2.09
2.60
3.13
3.87
6.02
0.917
2.04
2.51
2.99
3.67
5.55
0.912
1.98
2.42
2.87
3.47
5.12
0.901
1.87
2.25
2.63
3.12
4.37

7

0.938
2.04
2.51
3.01
3.70
5.69
0.932
1.98
2.42
2.87
3.50
5.23
0.927
1.93
2.33
2.75
3.30
4.82
0.917
1.82
2.17
2.51
2.95
4.09

8

0.950
2.00
2.45
2.91
3.56
5.44
0.944
1.94
2.36
2.78
3.36
4.99
0.939
1.88
2.27
2.65
3.17
4.58
0.928
1.77
2.10
2.41
2.82
3.86

9

0.959
1.96
2.39
2.84
3.46
5.24
0.953
1.91
2.30
2.70
3.26
4.80
0.948
1.85
2.21
2.57
3.07
4.39
0.937
1.74
2.04
2.33
2.72
3.69

10

0.966
1.94
2.35
2.77
3.37
5.08
0.961
1.88
2.25
2.64
3.17
4.64
0.955
1.82
2.16
2.51
2.98
4.24
0.945
1.71
1.99
2.27
2.63
3.54

12

0.977
1.89
2.28
2.68
3.23
4.82
0.972
1.83
2.18
2.54
3.03
4.39
0.966
1.77
2.09
2.41
2.84
4.00
0.956
1.66
1.92
2.17
2.50
3.32

15

0.989
1.84
2.20
2.57
3.09
4.56
0.983
1.78
2.11
2.44
2.89
4.14
0.978
1.72
2.01
2.31
2.70
3.75
0.967
1.60
1.84
2.06
2.35
3.08

20

1.00
1.79
2.12
2.46
2.94
4.29
0.994
1.73
2.03
2.33
2.74
3.87
0.989
1.67
1.93
2.20
2.55
3.49
0.978
1.54
1.75
1.94
2.20
2.83

24

1.01
1.77
2.08
2.41
2.86
4.15
1.00
1.70
1.98
2.27
2.66
3.74
0.994
1.64
1.89
2.14
2.47
3.36
0.983
1.51
1.70
1.88
2.12
2.69

30

1.01
1.74
2.04
2.35
2.78
4.00
1.01
1.67
1.94
2.21
2.58
3.59
1.00
1.61
1.84
2.07
2.39
3.22
0.989
1.48
1.65
1.82
2.03
2.55

60

1.02
1.68
1.95
2.22
2.61
3.70
1.02
1.61
1.84
2.08
2.40
3.29
1.01
1.54
1.74
1.94
2.21
2.92
1.00
1.40
1.53
1.67
1.84
2.25

120

1.03
1.64
1.90
2.16
2.52
3.54
1.02
1.57
1.79
2.01
2.31
3.14
1.02
1.50
1.68
1.87
2.11
2.76
1.01
1.35
1.47
1.58
1.73
2.08



1.03
1.61
1.84
2.09
2.42
3.38
1.03
1.53
1.73
1.94
2.21
2.97
1.02
1.46
1.62
1.79
2.01
2.59
1.01
1.29
1.39
1.48
1.60
1.89
-------
                            TABLE A-9: PERCENTILES OF THE F DISTRIBUTION
Degrees
Freedom
for
Denom-
inator
120 .90
.95
.975
.99
.999
.90
.95
.975
.99
.999
Degrees of Freedom for Numerator
1
2.75
3.92
5.15
6.85
11.4
2.71
3.84
5.02
6.63
10.8
2
2.35
3.07
3.80
4.79
7.32
2.30
3.00
3.69
4.61
6.91
3
2.13
2.68
3.23
3.95
5.78
2.08
2.60
3.12
3.78
5.42
4
1.99
2.45
2.89
3.48
4.95
1.94
2.37
2.79
3.32
4.62
5
1.90
2.29
2.67
3.17
4.42
1.85
2.21
2.57
3.02
4.10
6
1.82
2.18
2.52
2.96
4.04
1.77
2.10
2.41
2.80
3.74
7
1.77
2.09
2.39
2.79
3.77
1.72
2.01
2.29
2.64
3.47
8
1.72
2.02
2.30
2.66
3.55
1.67
1.94
2.19
2.51
3.27
9
1.68
1.96
2.22
2.56
3.38
1.63
1.88
2.11
2.41
3.10
10
1.65
1.91
2.16
2.47
3.24
1.60
1.83
2.05
2.32
2.96
12
1.60
1.83
2.05
2.34
3.02
1.55
1.75
1.94
2.18
2.74
15
1.55
1.75
1.95
2.19
2.78
1.49
1.67
1.83
2.04
2.51
20
1.48
1.66
1.82
2.03
2.53
1.42
1.57
1.71
1.88
2.27
24
1.45
1.61
1.76
1.95
2.40
1.38
1.52
1.64
1.79
2.13
30
1.41
1.55
1.69
1.86
2.26
1.34
1.46
1.57
1.70
1.99
60
1.32
1.43
1.53
1.66
1.95
1.24
1.32
1.39
1.47
1.66
120
1.26
1.35
1.43
1.53
1.77
1.17
1.22
1.27
1.32
1.45

1.19
1.25
1.31
1.38
1.54
1.00
1.00
1.00
1.00
1.00
-------
   TABLE A-10: VALUES OF THE PARAMETER λ FOR COHEN'S ESTIMATES
                 ADJUSTING FOR NONDETECTED VALUES
Y
.00
.05
.10
.15
.20
.25
.30
.35
.40
.45
.50
.55
.60
.65
.70
.75
.80
.85
.90
.95
1.00
.01
.010100
.010551
.010950
.011310
.011642
.011952
.012243
.012520
.012784
.013036
.013279
.013513
.013739
.013958
.014171
.014378
.014579
.014773
.014967
.015154
.015338
.02
.020400
.021294
.022082
.022798
.023459
.024076
.024658
.025211
.025738
.026243
.026728
.027196
.027849
.028087
.028513
.029927
.029330
.029723
.030107
.030483
.030850
.03
.030902
.032225
.033398
.034466
.035453
.036377
.037249
.038077
.038866
.039624
.040352
.041054
.041733
.042391
.043030
.043652
.044258
.044848
.045425
.045989
.046540
.04
.041583
.043350
.044902
.046318
.047829
.048858
.050018
.051120
.052173
.053182
.054153
.055089
.055995
.056874
.057726
.058556
.059364
.060153
.060923
.061676
.062413
.05
.052507
.054670
.056596
.058356
.059990
.061522
.062969
.064345
.065660
.066921
.068135
.069306
.070439
.071538
.072505
.073643
.074655
.075642
.075606
.077549
.078471
.06
.063625
.066159
.068483
.070586
.072539
.074372
.076106
.077736
.079332
.080845
.082301
.083708
.085068
.086388
.087670
.088917
.090133
.091319
.092477
.093611
.094720
h
.07
.074953
.077909
.080563
.083009
.085280
.087413
.089433
.091355
.093193
.094958
.096657
.098298
.099887
.10143
.10292
.10438
.10580
.10719
.10854
.10987
.11116
.08
.08649
.08983
.09285
.09563
.09822
.10065
.10295
.10515
.10725
.10926
.11121
.11208
.11490
.11666
.11837
.12004
.12167
.12225
.12480
.12632
.12780
.09
.09824
.10197
.10534
.10845
.11135
.11408
.11667
.11914
.12150
.12377
.12595
.12806
.13011
.13209
.13402
.13590
.13775
.13952
.14126
.14297
.14465
.10
.11020
.11431
.11804
.12148
.12469
.12772
.13059
.13333
.13595
.13847
.14090
.14325
.14552
.14773
.14987
.15196
.15400
.15599
.15793
.15983
.16170
.15
.17342
.17925
.18479
.18985
.19460
.19910
.20338
.20747
.21129
.21517
.21882
.22225
.22578
.22910
.23234
.23550
.23858
.24158
.24452
.24740
.25022
.20
.24268
.25033
.25741
.26405
.27031
.27626
.28193
.28737
.29250
.29765
.30253
.30725
.31184
.31630
.32065
.32489
.32903
.33307
.33703
.34091
.34471
Y
.00
.05
.10
.15
.20
.25
.30
.35
.40
.45
.50
.55
.60
.65
.70
.75
.80
.85
.90
.95
1.00
.25
.31862
.32793
.33662
.34480
.35255
.35993
.36700
.37379
.38033
.38665
.39276
.39679
.40447
.41008
.41555
.42090
.42612
.43122
.43622
.44112
.44592
.30
.4021
.4130
.4233
.4330
.4422
.4510
.4595
.4676
.4735
.4831
.4904
.4976
.5045
.5114
.5180
.5245
.5308
.5370
.5430
.5490
.5548
.35
.4941
.5066
.5184
.5296
.5403
.5506
.5604
.5699
.5791
.5880
.5967
.6061
.6133
.6213
.6291
.6367
.6441
.6515
.6586
.6656
.6724
.40
.5961
.6101
.6234
.6361
.6483
.6600
.6713
.6821
.6927
.7029
.7129
.7225
.7320
.7412
.7502
.7590
.7676
.7781
.7844
.7925
.8005
.45
.7096
.7252
.7400
.7542
.7673
.7810
.7937
.8060
.8179
.8295
.8408
.8517
.8625
.8729
.8832
.8932
.9031
.9127
.9222
.9314
.9406
.50
.8388
.8540
.8703
.8860
.9012
.9158
.9300
.9437
.9570
.9700
.9826
.9950
1.007
1.019
1.030
1.042
1.053
1.064
1.074
1.085
1.095
h
.55
.9808
.9994
1.017
1.035
1.051
1.067
1.083
1.098
1.113
1.127
1.141
1.155
1.169
1.182
1.195
1.207
1.220
1.232
1.244
1.255
1.287
.60
1.145
1.166
1.185
1.204
1.222
1.240
1.257
1.274
1.290
1.306
1.321
1.337
1.351
1.368
1.380
1.394
1.408
1.422
1.435
1.448
1.461
.65
1.336
1.358
1.379
1.400
1.419
1.439
1.457
1.475
1.494
1.511
1.528
1.545
1.561
1.577
1.593
1.608
1.624
1.639
1.653
1.668
1.882
.70
1.561
1.585
1.608
1.630
1.651
1.672
1.693
1.713
1.732
1.751
1.770
1.788
1.806
1.824
1.841
1.851
1.875
1.892
1.908
1.924
1.940
.80
2.176
2.203
2.229
2.255
2.280
2.305
2.329
2.353
2.376
2.399
2.421
2.443
2.465
2.486
2.507
2.528
2.548
2.568
2.588
2.607
2.626
.90
3.283
3.314
3.345
3.376
3.405
3.435
3.464
3.492
3.520
3.547
3.575
3.601
3.628
3.654
3.679
3.705
3.730
3.754
3.779
3.803
3.827

-------
            TABLE A-11: PROBABILITIES FOR THE SMALL-SAMPLE
                     MANN-KENDALL TEST FOR TREND

  S      n = 4      n = 5      n = 8      n = 9
  0      0.625      0.592      0.548      0.540
  2      0.375      0.408      0.452      0.460
  4      0.167      0.242      0.360      0.381
  6      0.042      0.117      0.274      0.306
  8                 0.042      0.199      0.238
 10                 0.0083     0.138      0.179
 12                            0.089      0.130
 14                            0.054      0.090
 16                            0.031      0.060
 18                            0.016      0.038
 20                            0.0071     0.022
 22                            0.0028     0.012
 24                            0.00087    0.0063
 26                            0.00019    0.0029
 28                            0.000025   0.0012
 30                                       0.00043
 32                                       0.00012
 34                                       0.000025
 36                                       0.0000028

  S      n = 6      n = 7      n = 10
  1      0.500      0.500      0.500
  3      0.360      0.386      0.431
  5      0.235      0.281      0.364
  7      0.136      0.191      0.300
  9      0.068      0.119      0.242
 11      0.028      0.068      0.190
 13      0.0083     0.035      0.146
 15      0.0014     0.015      0.108
 17                 0.0054     0.078
 19                 0.0014     0.054
 21                 0.00020    0.036
 23                            0.023
 25                            0.014
 27                            0.0083
 29                            0.0046
 31                            0.0023
 33                            0.0011
 35                            0.00047
 37                            0.00018
 39                            0.000058
 41                            0.000015
 43                            0.0000028
 45                            0.00000028

-------
                  TABLE A-12.  QUANTILES FOR THE WALD-WOLFOWITZ TEST FOR RUNS

 n = 4
   m    W0.01   W0.05   W0.10
   4      -       -       3
   5      -       -       3
   6      -       3       4
   7      -       3       4
   8      -       3       4
   9      -       3       4
  10      -       4       5
  11      3       4       5
  12      3       4       5
  13      3       4       5
  14      3       5       6
  15      3       5       6
  16      4       5       6
  17      4       5       6
  18      4       5       6
  19      4       5       6
  20      4       5       6

 n = 5
   m    W0.01   W0.05   W0.10
   5      3       4       4
   6      3       4       4
   7      3       4       5
   8      3       4       5
   9      3       4       5
  10      4       4       5
  11      4       5       6
  12      4       5       6
  13      4       5       6
  14      5       6       6
  15      5       6       7
  16      5       6       7
  17      5       6       7
  18      5       6       7
  19      5       6       7
  20      5       6       7

 n = 6
   m    W0.01   W0.05   W0.10
   6      3       4       5
   7      4       5       6
   8      4       5       6
   9      4       5       6
  10      4       6       7
  11      5       6       7
  12      5       6       7
  13      5       7       8
  14      5       7       8
  15      5       7       8
  16      6       7       8
  17      6       7       8
  18      6       7       8
  19      6       7       8
  20      6       7       8

 n = 7
   m    W0.01   W0.05   W0.10
   8      4       5       6
   9      4       5       6
  10      4       6       7
  11      5       6       7
  12      5       6       7
  13      5       7       7
  14      5       7       8
  15      6       7       8
  16      6       7       8
  17      6       8       8
  18      7       8       9
  19      7       8       9
  20      7       8       9

-------
                  TABLE A-12. QUANTILES FOR THE WALD-WOLFOWITZ TEST FOR RUNS

 n = 8
   m    W0.01   W0.05   W0.10
   8      5       6       6
   9      5       6       7
  10      5       7       7
  11      6       7       8
  12      6       7       8
  13      6       7       8
  14      6       8       8
  15      6       8       9
  16      7       8       9
  17      7       8       9
  18      7       9      10
  19      7       9      10
  20      7       9      10

 n = 9
   m    W0.01   W0.05   W0.10
   9      5       6       7
  10      6       7       8
  11      6       7       8
  12      6       7       8
  13      7       8       9
  14      7       8       9
  15      7       9       9
  16      7       9      10
  17      8       9      10
  18      8       9      10
  19      8      10      11
  20      8      10      11

 n = 10
   m    W0.01   W0.05   W0.10
  10      7       8       9
  11      7       8       9
  12      7       8       9
  13      7       9      10
  14      8       9      10
  15      8       9      10
  16      8       9      11
  17      8      10      11
  18      9      10      11
  19      9      11      12
  20      9      11      12

 n = 11
   m    W0.01   W0.05   W0.10
  11      7       8       9
  12      7       8       9
  13      7       9      10
  14      8       9      10
  15      8       9      10
  16      8      10      11
  17      9      10      11
  18      9      10      11
  19      9      11      12
  20      9      11      12

 n = 12
   m    W0.01   W0.05   W0.10
  12      7       9      10
  13      8      10      10
  14      8      10      10
  15      8      10      11
  16      9      11      12
  17      9      11      12

 n = 13
   m    W0.01   W0.05   W0.10
  13      9      10      11
  14      9      11      12
  15     10      11      12
  16     10      11      12
  17     10      11      12
  18     10      11      13

 n = 14
   m    W0.01   W0.05   W0.10
  14      9      11      12
  15      9      11      12
  16     10      11      13
  17     10      12      13
  18     10      12      13
  19     11      13      14

 n = 15
   m    W0.01   W0.05   W0.10
  15      8      10      11
  16      9      11      12
  17     10      12      13
  18     10      12      14
  19     11      13      15
  20     12      14      16

-------
                  TABLE A-12. QUANTILES FOR THE WALD-WOLFOWITZ TEST FOR RUNS

 n = 12 (continued)
   m    W0.01   W0.05   W0.10
  18      9      11      12
  19     10      12      13
  20     10      12      13

 n = 13 (continued)
   m    W0.01   W0.05   W0.10
  19     10      12      13
  20     11      12      13

 n = 14 (continued)
   m    W0.01   W0.05   W0.10
  20     11      13      14

 n = 16
   m    W0.01   W0.05   W0.10
  16     10      12      13
  17     11      13      14
  18     11      13      14
  19     12      14      15
  20     13      14      16

 n = 17
   m    W0.01   W0.05   W0.10
  17     11      13      14
  18     11      13      15
  19     12      14      16
  20     12      14      16

 n = 18
   m    W0.01   W0.05   W0.10
  18     13      14      16
  19     13      14      16
  20     14      15      16

 n = 19
   m    W0.01   W0.05   W0.10
  19     14      16      17
  20     14      16      17

 n = 20
   m    W0.01   W0.05   W0.10
  20     14      16      17
When n or m is greater than 20 the W_p quantile is given by:

       W_p = 1 + 2mn/(m + n) + Z_p * sqrt[ 2mn(2mn - m - n) / ((m + n)^2 (m + n - 1)) ]

where Z_p is the appropriate quantile from the standard normal (see last row of Table A-1).
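This large-sample quantile can be computed directly; a minimal Python sketch follows (the function name and the example sample sizes are illustrative, not part of the guidance):

```python
# Sketch of the large-sample (n or m > 20) quantile for the Wald-Wolfowitz
# runs test, implementing the formula above with the standard-library
# NormalDist in place of the Table A-1 lookup for Z_p.
from statistics import NormalDist
import math

def runs_quantile(m: int, n: int, p: float) -> float:
    """Approximate p-th quantile of the number of runs for group sizes m and n."""
    z_p = NormalDist().inv_cdf(p)                  # standard normal quantile Z_p
    mean = 1 + 2 * m * n / (m + n)                 # expected number of runs
    var = 2 * m * n * (2 * m * n - m - n) / ((m + n) ** 2 * (m + n - 1))
    return mean + z_p * math.sqrt(var)

# Example: lower 0.05 quantile for m = n = 25, beyond the tabulated range
print(round(runs_quantile(25, 25, 0.05), 1))       # -> 20.2
```

For p below 0.5, Z_p is negative, so the computed quantile falls below the expected number of runs, matching the one-sided use of the tabulated values.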

-------
         TABLE A-13. MODIFIED QUANTILE TEST CRITICAL NUMBERS
             LEVEL OF SIGNIFICANCE (α) APPROXIMATELY 0.10

 m = number of measurements, population 2 (rows)
 n = number of measurements, population 1 (columns)

  m\n    5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
   5     3   3   2   2   2   2   2   2   2   2   2   2   2   2   2   2
   6     3   3   2   2   2   2   2   2   2   2   2   2   2   2   2   2
   7     4   3   3   3   3   2   2   2   2   2   2   2   2   2   2   2
   8     4   4   3   3   3   2   2   2   2   2   2   2   2   2   2   2
   9     5   4   4   3   3   3   2   2   2   2   2   2   2   2   2   2
  10     5   4   4   4   3   3   3   2   2   2   2   2   2   2   2   2
  11     5   4   4   4   4   3   3   3   2   2   2   2   2   2   2   2
  12     6   5   4   4   4   3   3   3   3   2   2   2   2   2   2   2
  13     6   5   5   4   4   4   3   3   3   3   2   2   2   2   2   2
  14     7   6   5   5   4   4   4   3   3   3   3   2   2   2   2   2
  15     7   6   6   5   5   4   4   4   3   3   3   3   2   2   2   2
  16     7   7   6   5   5   4   4   4   4   3   3   3   3   2   2   2
  17     8   7   6   5   5   4   4   4   4   4   3   3   3   3   2   2
  18     8   8   7   6   5   5   5   4   4   4   4   3   3   3   3   2
  19     8   8   7   6   5   5   5   5   4   4   4   4   3   3   3   3
  20     8   8   7   6   6   5   5   5   4   4   4   4   3   3   3   3

-------
 TABLE A-13.  MODIFIED QUANTILE TEST CRITICAL NUMBERS (CONTINUED)
             LEVEL OF SIGNIFICANCE (α) APPROXIMATELY 0.10

 m = number of measurements, population 2 (rows)
 n = number of measurements, population 1 (columns)

  m\n   25  30  35  40  45  50  55  60  65  70  75  80  85  90  95 100
  25     3   3   3   3   3   2   2   2   2   2   2   2   2   2   2   2
  30     4   3   3   3   3   3   3   2   2   2   2   2   2   2   2   2
  35     4   4   3   3   3   3   3   3   3   2   2   2   2   2   2   2
  40     5   4   4   4   3   3   3   3   3   3   2   2   2   2   2   2
  45     5   5   4   4   4   3   3   3   3   3   3   2   2   2   2   2
  50     5   5   4   4   4   4   3   3   3   3   3   3   3   3   2   2
  55     6   5   5   4   4   4   4   3   3   3   3   3   3   3   3   3
  60     6   6   5   5   4   4   4   4   3   3   3   3   3   3   3   3
  65     7   6   5   5   4   4   4   4   4   3   3   3   3   3   3   3
  70     7   6   6   5   5   4   4   4   4   4   3   3   3   3   3   3
  75     8   7   6   5   5   5   4   4   4   4   4   3   3   3   3   3
  80     8   7   6   6   5   5   4   4   4   4   4   4   3   3   3   3
  85     8   7   6   6   5   5   5   4   4   4   4   4   4   3   3   3
  90     9   8   7   6   6   5   5   5   4   4   4   4   4   4   3   3
  95     9   8   7   6   6   5   5   5   5   4   4   4   4   4   3   3
 100    10   8   7   7   6   6   5   5   5   4   4   4   4   4   4   4

-------
 TABLE A-13.  MODIFIED QUANTILE TEST CRITICAL NUMBERS (CONTINUED)
             LEVEL OF SIGNIFICANCE (α) APPROXIMATELY 0.05

 m = number of measurements, population 2 (rows)
 n = number of measurements, population 1 (columns)

  m\n    5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
   5     4   4   3   3   3   3   3   3   3   2   2   2   2   2   2   2
   6     4   4   4   3   3   3   3   3   3   3   2   2   2   2   2   2
   7     5   4   4   4   3   3   3   3   3   3   3   2   2   2   2   2
   8     5   5   4   4   4   3   3   3   3   3   3   3   2   2   2   2
   9     6   5   5   4   4   4   3   3   3   3   3   3   3   2   2   2
  10     6   5   5   5   4   4   4   3   3   3   3   3   3   3   2   2
  11     6   5   5   5   5   4   4   4   3   3   3   3   3   3   3   2
  12     7   6   5   5   5   4   4   4   4   3   3   3   3   3   3   3
  13     7   6   6   5   5   5   4   4   4   4   3   3   3   3   3   3
  14     8   7   6   6   5   5   5   4   4   4   4   3   3   3   3   3
  15     8   7   7   6   6   5   5   5   4   4   4   4   3   3   3   3
  16     8   8   7   6   6   5   5   5   5   4   4   4   4   3   3   3
  17     9   8   7   6   6   6   5   5   5   5   4   4   4   4   3   3
  18     9   9   8   7   6   6   6   5   5   5   5   4   4   4   4   3
  19    10   9   8   7   6   6   6   6   5   5   5   5   4   4   4   4
  20    10   9   8   7   6   6   6   6   5   5   5   5   4   4   4   4

-------
 TABLE A-13.  MODIFIED QUANTILE TEST CRITICAL NUMBERS (CONTINUED)
             LEVEL OF SIGNIFICANCE (α) APPROXIMATELY 0.05

 m = number of measurements, population 2 (rows)
 n = number of measurements, population 1 (columns)

  m\n   25  30  35  40  45  50  55  60  65  70  75  80  85  90  95 100
  25     4   4   3   3   3   3   3   3   2   2   2   2   2   2   2   2
  30     4   4   4   4   4   3   3   3   3   3   3   3   2   2   2   2
  35     5   5   4   4   4   4   3   3   3   3   3   3   3   2   2   2
  40     6   5   5   4   4   4   4   3   3   3   3   3   3   3   2   2
  45     7   6   5   5   4   4   4   4   3   3   3   3   3   3   3   2
  50     7   6   6   5   5   4   4   4   4   3   3   3   3   3   3   3
  55     8   7   6   5   5   5   4   4   4   4   3   3   3   3   3   3
  60     8   7   6   6   5   5   5   4   4   4   4   3   3   3   3   3
  65     8   8   7   6   6   5   5   5   4   4   4   4   3   3   3   3
  70     9   8   7   7   6   5   5   5   5   4   4   4   4   3   3   3
  75     9   9   8   7   6   6   5   5   5   5   5   4   4   4   3   3
  80    10   9   8   7   7   6   6   5   5   5   5   5   4   4   4   3
  85    10   9   8   8   7   6   6   6   5   5   5   5   5   4   4   4
  90    11  10   9   8   7   7   6   6   5   5   5   5   5   5   4   4
  95    11  10   9   8   8   7   6   6   6   5   5   5   5   5   5   4
 100    12  11  10   9   8   7   7   6   6   6   5   5   5   5   5   5

-------
                             TABLE A-14. DUNNETT'S TEST (ONE TAILED)

 Rows: total number of investigated groups (k - 1).  Columns: degrees of freedom.

 k-1    α      2     3     4     5     6     7     8     9    10    12    16
  2    .05   3.80  2.94  2.61  2.44  2.34  2.27  2.22  2.18  2.15  2.11  2.06
       .10   2.54  2.13  1.96  1.87  1.82  1.78  1.75  1.73  1.71  1.69  1.66
  3    .05   4.34  3.28  2.88  2.68  2.56  2.48  2.42  2.37  2.34  2.29  2.23
       .10   2.92  2.41  2.20  2.09  2.02  1.98  1.94  1.92  1.90  1.87  1.83
  4    .05   4.71  3.52  3.08  2.85  2.71  2.62  2.55  2.50  2.47  2.41  2.34
       .10   3.20  2.61  2.37  2.24  2.17  2.11  2.08  2.05  2.02  1.99  1.95
  5    .05   5.08  3.70  3.22  2.98  2.83  2.73  2.66  2.60  2.56  2.50  2.43
       .10   3.40  2.76  2.50  2.36  2.27  2.22  2.17  2.14  2.12  2.08  2.04
  6    .05   5.24  3.85  3.34  3.08  2.92  2.81  2.74  2.68  2.64  2.58  2.50
       .10   3.57  2.87  2.60  2.45  2.36  2.30  2.25  2.22  2.19  2.16  2.11
  7    .05   5.43  3.97  3.44  3.16  3.00  2.89  2.81  2.75  2.70  2.64  2.56
       .10   3.71  2.97  2.68  2.53  2.43  2.37  2.32  2.28  2.26  2.22  2.17
  8    .05   5.60  4.08  3.52  3.24  3.06  2.95  2.87  2.81  2.76  2.69  2.61
       .10   3.83  3.06  2.75  2.59  2.49  2.42  2.38  2.34  2.31  2.27  2.22
  9    .05   5.75  4.17  3.59  3.30  3.12  3.00  2.92  2.86  2.81  2.74  2.65
       .10   3.94  3.13  2.82  2.65  2.54  2.47  2.42  2.39  2.35  2.31  2.26
 10    .05   5.88  4.25  3.66  3.36  3.17  3.05  2.96  2.90  2.85  2.78  2.69
       .10   4.03  3.20  2.87  2.70  2.59  2.52  2.47  2.43  2.40  2.35  2.30
 12    .05   6.11  4.39  3.77  3.45  3.26  3.13  3.04  2.97  2.92  2.84  2.75
       .10   4.19  3.31  2.97  2.78  2.67  2.59  2.54  2.50  2.46  2.42  2.36
 14    .05   6.29  4.51  3.86  3.53  3.33  3.20  3.11  3.04  2.98  2.90  2.81
       .10   4.32  3.41  3.05  2.86  2.74  2.66  2.60  2.56  2.52  2.47  2.41
 16    .05   6.45  4.61  3.94  3.60  3.48  3.26  3.16  3.09  3.03  2.95  2.85
       .10   4.44  3.49  3.11  2.92  2.79  2.71  2.65  2.61  2.57  2.52  2.46

-------
                       TABLE A-14. DUNNETT'S TEST (ONE TAILED) (CONTINUED)

 Rows: total number of investigated groups (k - 1).  Columns: degrees of freedom.

 k-1    α     20    24    30    40    50    60    70    80    90   100   120     ∞
  2    .05   2.03  2.01  1.99  1.97  1.96  1.95  1.95  1.94  1.94  1.93  1.93  1.92
       .10   1.64  1.63  1.62  1.61  1.61  1.60  1.60  1.60  1.60  1.59  1.59  1.58
  3    .05   2.19  2.17  2.15  2.13  2.11  2.10  2.10  2.10  2.09  2.08  2.08  2.06
       .10   1.81  1.80  1.79  1.77  1.77  1.76  1.76  1.76  1.76  1.75  1.75  1.73
  4    .05   2.30  2.28  2.25  2.23  2.22  2.21  2.21  2.20  2.20  2.18  2.18  2.16
       .10   1.93  1.91  1.90  1.88  1.88  1.87  1.87  1.87  1.86  1.85  1.85  1.84
  5    .05   2.39  2.36  2.34  2.31  2.29  2.28  2.28  2.28  2.27  2.27  2.26  2.23
       .10   2.01  2.00  1.98  1.96  1.96  1.95  1.95  1.95  1.94  1.93  1.93  1.92
  6    .05   2.46  2.43  2.40  2.37  2.35  2.34  2.34  2.34  2.33  2.33  2.32  2.29
       .10   2.08  2.06  2.05  2.03  2.02  2.01  2.01  2.01  2.00  1.99  1.99  1.98
  7    .05   2.51  2.48  2.45  2.42  2.41  2.40  2.40  2.39  2.39  2.38  2.37  2.34
       .10   2.14  2.12  2.10  2.08  2.07  2.06  2.06  2.06  2.06  2.05  2.05  2.03
  8    .05   2.56  2.53  2.50  2.47  2.45  2.44  2.44  2.43  2.43  2.42  2.41  2.38
       .10   2.19  2.17  2.15  2.13  2.12  2.11  2.11  2.10  2.10  2.09  2.09  2.07
  9    .05   2.60  2.57  2.54  2.51  2.49  2.48  2.48  2.47  2.47  2.46  2.45  2.42
       .10   2.23  2.21  2.19  2.17  2.16  2.15  2.15  2.15  2.14  2.14  2.13  2.11
 10    .05   2.64  2.60  2.57  2.54  2.52  2.51  2.51  2.50  2.50  2.49  2.48  2.45
       .10   2.26  2.24  2.22  2.20  2.19  2.18  2.18  2.18  2.17  2.17  2.16  2.14
 12    .05   2.70  2.66  2.63  2.60  2.58  2.57  2.56  2.55  2.55  2.54  2.53  2.50
       .10   2.33  2.30  2.28  2.26  2.25  2.24  2.24  2.23  2.23  2.22  2.22  2.20
 14    .05   2.75  2.72  2.68  2.65  2.63  2.61  2.61  2.60  2.60  2.59  2.58  2.55
       .10   2.38  2.35  2.33  2.31  2.30  2.29  2.29  2.28  2.28  2.27  2.27  2.24
 16    .05   2.80  2.76  2.72  2.69  2.67  2.65  2.65  2.64  2.63  2.63  2.62  2.58
       .10   2.42  2.40  2.37  2.35  2.34  2.33  2.33  2.32  2.31  2.31  2.31  2.28

-------
        TABLE A-15. APPROXIMATE α-LEVEL CRITICAL POINTS FOR
                    RANK VON NEUMANN RATIO TEST
 n \ α
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
32
34
36
38
40
42
44
46
48
50
55
60
65
70
75
80
85
90
95
100
.050

0.70
0.80
0.86
0.93
0.98
1.04
1.08
1.11
1.14
1.17
1.19
1.21
1.24
1.26
1.27
1.29
1.31
1.32
1.33
1.35
1.36
1.37
1.38
1.39
1.40
1.41
1.43
1.45
1.46
1.48
1.49
1.50
1.51
1.52
1.53
1.54
1.56
1.58
1.60
1.61
1.62
1.64
1.65
1.66
1.66
1.67
.100

0.60
0.97
1.11
1.14
1.18
1.23
1.26
1.29
1.32
1.34
1.36
1.38
1.40
1.41
1.43
1.44
1.45
1.46
1.48
1.49
1.50
1.51
1.51
1.52
1.53
1.54
1.55
1.57
1.58
1.59
1.60
1.61
1.62
1.63
1.63
1.64
1.66
1.67
1.68
1.70
1.71
1.71
1.72
1.73
1.74
1.74

-------

-------

-------
                             APPENDIX B:  REFERENCES

       This appendix provides references for the topics and procedures described in this
document.  The references are broken into three groups: Primary, Basic Statistics Textbooks, and
Secondary.  This classification does not refer to the subject matter content but to the
relevance to the intended audience for this document, ease in understanding statistical concepts
and methodologies, and accessibility to the non-statistical community.  Primary references are
those thought to be of particular benefit as hands-on material, where the degree of sophistication
demanded by the writer seldom requires extensive training in statistics; most of these references
should be on an environmental statistician's bookshelf.  Users of this document are encouraged to
send recommendations on additional references to the address listed in the Foreword.

       Some sections within the chapters reference materials found in most introductory statistics
books. This document uses Walpole and Myers (1985), Freedman, Pisani, Purves, and Adhikari
(1991), Mendenhall (1987), and Dixon and Massey (1983). Table B-1 (at the end of this
appendix) lists specific chapters in these books where topics contained in this guidance may be
found. This list could be extended much further by use of other basic textbooks; this is
acknowledged by the simple statement that further information is available from introductory
textbooks.

       Some important books specific to the analysis of environmental data include: Gilbert
(1987), an excellent all-round handbook having strength in sampling, estimation, and hot-spot
detection; Gibbons (1994), a book specifically concentrating on the application of statistics to
groundwater problems with emphasis on method detection limits, censored data, and the detection
of outliers;  and Madansky (1988), a slightly more theoretical volume with important chapters on
the testing for Normality, transformations, and testing for independence.  In addition, Ott (1995)
describes modeling, probabilistic processes, and the Lognormal distribution of contaminants, and
Berthouex and Brown (1994) provide an engineering approach  to problems including estimation,
experimental design and the fitting of models.

B.1    CHAPTER 1

       Chapter 1 establishes the framework of qualitative and quantitative criteria against which
the data that has been collected will be assessed.  The most important feature of this chapter is the
concept of the hypothesis testing framework, which is described in any introductory textbook.  A
non-technical exposition of hypothesis testing is also to be found in U.S. EPA (1994a, 1994b)
which provides guidance on planning for environmental data collection. An application of the
DQO Process to geostatistical error management may be found in Myers (1997).

       A full discussion of sampling methods with the attendant theory is to be found in Gilbert
(1987) and a shorter discussion may be found in U.S. EPA (1989).  Cochran (1966) and Kish
(1965) also provide more advanced theoretical concepts but may require the  assistance of a
statistician for full comprehension. More sophisticated sampling designs such as composite

-------
sampling, adaptive sampling, and ranked set sampling, will be discussed in future Agency
guidance.

B.2    CHAPTER 2

       Standard statistical quantities and graphical representations are discussed in most
introductory statistics books. In addition, Berthouex & Brown (1994) and Madansky (1988) both
contain thorough discussions on the subject. There are also several textbooks devoted exclusively
to graphical representations, including Cleveland (1993), which may contain the most applicable
methods  for environmental data, Tufte (1983), and Chambers, Cleveland, Kleiner and Tukey
(1983).

       Two EPA sources for temporal data that keep theoretical discussions to a minimum are
U.S. EPA (1992a) and U.S. EPA (1992b).  For a more complete discussion on temporal data,
specifically time series analysis, see Box and Jenkins (1970), Wei (1990), or Ostrum (1978).
These more complete references provide both theory and practice; however, the assistance of a
statistician may be needed to adapt the methodologies for immediate use.  Theoretical discussions
of spatial data may be found in Journel and Huijbregts (1978), Cressie (1993),  and Ripley (1981).

B.3    CHAPTER 3

       The hypothesis tests covered in this edition of the guidance are well known and
straightforward; basic statistics texts cover these subjects.  Besides basic statistical textbooks, Berthouex
& Brown (1994), Hardin and Gilbert (1993), and U.S. EPA (1989, 1994c) may be useful to the
reader. In addition, there are some statistics books devoted specifically to hypothesis testing, for
example, see Lehmann (1991). These books may be too theoretical for most practitioners, and
their application to environmental situations may not be obvious.

       The statement in this document that the sign test requires approximately 1.225 times as
many observations as the Wilcoxon rank sum test to achieve a given power at a given significance
level is attributable to Lehmann (1975).
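
The 1.225 factor can be applied directly when planning a study; a minimal arithmetic sketch in Python (the Wilcoxon sample size of 40 is a hypothetical planning value, not a figure from this guidance):

```python
# Illustrative arithmetic only: applying the 1.225 sample-size factor quoted
# above (attributed to Lehmann, 1975). The Wilcoxon sample size is a
# hypothetical planning value, not from this document.
wilcoxon_n = 40                       # hypothetical n for the Wilcoxon rank sum test
sign_n = round(1.225 * wilcoxon_n)    # approximate n needed by the sign test
print(sign_n)                         # -> 49
```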

B.4    CHAPTER 4

       This chapter is essentially a compendium of statistical tests drawn mostly from the primary
references and basic statistics textbooks. Gilbert (1987) and Madansky (1988) have an excellent
collection of techniques  and U.S. EPA (1992a) contains techniques specific to  water problems.

       For Normality (Section 4.2), Madansky (1988) has an excellent discussion on tests as does
Shapiro (1986).  For trend testing (Section 4.3), Gilbert (1987)  has an excellent discussion on
statistical tests and U.S. EPA (1992b) provides adjustments for trends and seasonality in the
calculation of descriptive statistics.

-------
       There are several very good textbooks devoted to the treatment of outliers (Section 4.4).
Two authoritative texts are Barnett and Lewis (1978) and Hawkins (1980).  Additional
information is also to be found in Beckman and Cook (1983) and Tietjen and Moore (1972).
Several useful software programs are available on the statistical market including U.S. EPA's
GEO-EASE and Scout, both developed by the Environmental Monitoring Systems Laboratory,
Las Vegas, Nevada and described in U.S. EPA (1991) and U.S. EPA (1993b), respectively.

       Tests for dispersion (Section 4.5) are described in the basic textbooks and examples are to
be found in U.S. EPA (1992a).  Transformation of data (Section 4.6) is a sensitive topic and
thorough discussions may be found in Gilbert (1987), and Dixon and Massey (1983). Equally
sensitive is the analysis of data where some values are recorded as non-detected (Section 4.7);
Gibbons (1994) and U.S. EPA (1992a) have relevant discussions and examples.

B.5    CHAPTER 5

       Chapter 5 discusses some of the philosophical issues related to hypothesis testing which
may help in understanding and communicating the test results. Although there are no specific
references for this chapter, many topics (e.g.,  the use of p-values) are discussed in introductory
textbooks.  Future editions of this guidance will be expanded by incorporating practical
experiences from the environmental community into this chapter.

B.6    LIST OF REFERENCES

B.6.1  Primary References

Berthouex, P.M., and L.C. Brown, 1994.  Statistics for Environmental Engineers. Lewis, Boca
       Raton, FL.

Gilbert, R.O., 1987. Statistical Methods for Environmental Pollution Monitoring.  John Wiley,
       New York, NY.

Gibbons, R. D., 1994. Statistical Methods for Groundwater Monitoring. John Wiley, New
       York, NY.

Madansky, A., 1988. Prescriptions for Working Statisticians.  Springer-Verlag, New York, NY.

Ott, W.R., 1995. Environmental Statistics and Data Analysis.  Lewis, Boca Raton, FL.

U.S. Environmental Protection Agency, 1996. The Data Quality Evaluation Statistical Toolbox
       (DataQUEST) Software, EPA QA/G-9D.  Office of Research and Development.

U.S. Environmental Protection Agency, 1994a.  Guidance for the Data Quality Objectives
       Process (EPA QA/G4).  EPA/600/R-96/055.  Office of Research and Development.

-------
U.S. Environmental Protection Agency, 1994b. The Data Quality Objectives Decision Error
       Feasibility Trials (DEFT) Software (EPA QA/G-4D). EPA/600/R-96/056. Office of
       Research and Development.

U.S. Environmental Protection Agency, 1992a. Guidance Document on the Statistical Analysis
       of Ground-Water Monitoring Data at RCRA Facilities.  EPA/530/R-93/003.  Office of
       Solid Waste. (NTIS:  PB89-151026)

B.6.2  Basic Statistics Textbooks

Dixon, W.J., and F.J. Massey, Jr., 1983.  Introduction to Statistical Analysis (Fourth Edition).
       McGraw-Hill, New York, NY.

Freedman, D., R. Pisani, R. Purves, and A. Adhikari, 1991. Statistics. W.W. Norton & Co.,
       New York, NY.

Mendenhall, W., 1987. Introduction to Probability and Statistics (Seventh Edition). PWS-Kent,
       Boston, MA.

Walpole, R., and R. Myers, 1985.  Probability and Statistics for Engineers and Scientists (Third
       Ed.). MacMillan, New York, NY.

B.6.3  Secondary References

Aitchison, J., 1955. On the distribution of a positive random variable having a discrete probability
       mass at the origin, Journal of the American Statistical Association 50(272):901-908.

Barnett, V., and T. Lewis, 1978. Outliers in Statistical Data.  John Wiley, New York, NY.

Beckman, R.J., and R.D. Cook, 1983.  Outlier..........s, Technometrics 25:119-149.

Box, G.E.P., and G.M. Jenkins, 1970. Time Series Analysis, Forecasting, and Control.
       Holden-Day, San Francisco, CA.

Chambers, J.M, W.S. Cleveland, B. Kleiner, and P.A. Tukey, 1983. Graphical Methods for Data
       Analysis. Wadsworth & Brooks/Cole Publishing Co., Pacific Grove, CA.

Chen, L., 1995.  Testing the mean of skewed distributions, Journal of the American Statistical
       Association 90:767-772.

Cleveland, W.S., 1993.  Visualizing Data.  Hobart Press, Summit, NJ.

Cochran, W. G., 1966. Sampling Techniques (Third Edition). John Wiley, New York, NY.

-------
Cohen, A.C., Jr. 1959. Simplified estimators for the normal distribution when samples are singly
       censored or truncated, Technometrics 1:217-237.

Conover, W.J., 1980. Practical Nonparametric Statistics (Second Edition). John Wiley, New
       York, NY.

Cressie, N., 1993. Statistics for Spatial Data. John Wiley, New York, NY.

D'Agostino, R.B., 1971.  An omnibus test of normality for moderate and large size samples,
       Biometrika 58:341-348.

David, H.A., H.O. Hartley, and E.S. Pearson, 1954. The distribution of the ratio, in a single
       normal sample, of range to standard deviation, Biometrika 48:41-55.

Dixon, W.J., 1953. Processing data for outliers, Biometrika 9:74-79.

Filliben, J.J., 1975. The probability plot correlation coefficient test for normality, Technometrics
       17:111-117.

Geary, R.C., 1947. Testing for normality, Biometrika 34:209-242.

Geary, R.C., 1935. The ratio of the mean deviation to the standard deviation as a test of
       normality, Biometrika 27:310-332.

Grubbs, F.E., 1969.  Procedures for detecting outlying observations in samples, Technometrics
       11:1-21.

Hardin, J.W., and R.O. Gilbert, 1993. Comparing Statistical Tests for Detecting Soil
       Contamination Greater than Background, Report to U.S. Department of Energy, PNL-
       8989, UC-630, Pacific Northwest Laboratory,  Richland, WA.

Hawkins, D.M., 1980. Identification of Outliers.  Chapman and Hall, New York, NY.

Hochberg, Y., and A. Tamhane, 1987.  Multiple Comparison Procedures. John Wiley, New
       York, NY.

Journel, A.G., and C.J. Huijbregts, 1978. Mining Geostatistics.  Academic Press, London.

Kish, L., 1965.  Survey Sampling. John Wiley, New York, NY.

Kleiner, B., and J.A. Hartigan, 1981.  Representing points in many dimensions by trees and
       castles, Journal of the American Statistical Association 76:260.

-------
Lehmann, E.L., 1991.  Testing Statistical Hypotheses. Wadsworth & Brooks/Cole Publishing
       Co., Pacific Grove, CA.

Lehmann, E.L., 1975.  Nonparametrics: Statistical Methods Based on Ranks. Holden-Day, Inc.,
       San Francisco,  CA.

Lilliefors, H.W., 1969. Correction to the paper "On the Kolmogorov-Smirnov test for normality
       with mean and  variance unknown," Journal of the American Statistical Association
       64:1702.

Lilliefors, H.W., 1967. On the Kolmogorov-Smirnov test for normality with mean and variance
       unknown, Journal of the American Statistical Association 62:399-402.

Myers, J.C., 1997.  Geostatistical Error Management. John Wiley, New York, NY.

Ostrum, C.W., 1978.  Time Series Analysis (Second Edition). Sage University Papers Series, Vol
       9. Beverly Hills and London.

Ripley, B.D., 1981. Spatial Statistics. John Wiley and Sons,  Somerset, NJ.

Rosner, B., 1975.  On the detection of many outliers, Technometrics 17:221-227.

Royston, J.P., 1982. An extension of Shapiro and Wilk's W test for normality to large samples,
       Applied Statistics 31:161-165.

Sen, P.K., 1968a.  Estimates of the regression coefficient based on Kendall's tau, Journal of the
       American Statistical Association 63:1379-1389.

Sen, P.K., 1968b.  On a class of aligned rank order tests in two-way layouts, Annals of
       Mathematical Statistics 39:1115-1124.

Shapiro, S., 1986.  Volume 3: How to Test Normality and Other Distributional Assumptions.
       American Society for Quality Control, Milwaukee, WI.

Shapiro, S., and M.B. Wilk,  1965. An analysis of variance test for normality  (complete samples),
       Biometrika 52:591-611.

Siegel, J.H., R.W. Goldwyn, and H.P. Friedman, 1971.  Pattern and process of the evolution of
       human septic shock,  Surgery 70:232.

Stefansky, W., 1972. Rejecting outliers in factorial designs, Technometrics 14:469-478.

Stephens, M.A., 1974.  EDF statistics for goodness-of-fit and some comparisons, Journal of the
       American Statistical Association 69:730-737.


-------
Tietjen, G.L., and R.M. Moore, 1972.  Some Grubbs-type statistics for the detection of several
       outliers, Technometrics 14:583-597.

Tufte, E.R., 1983. The Visual Display of Quantitative Information. Graphics Press, Cheshire,
       CT.

Tufte, E.R., 1990. Envisioning Information.  Graphics Press, Cheshire, CT.

U.S. Environmental Protection Agency, 1994c. Methods for Evaluating the Attainment of
       Cleanup Standards: Volume 3: Reference-Based Standards. EPA/230/R-94-004. Office
       of Policy, Planning, and Evaluation. (NTIS: PB94-176831)

U.S. Environmental Protection Agency, 1993a.  The Data Quality Objectives Process for
       Superfund: Interim Final Guidance. EPA/540/R-93/071. Office of Emergency and
       Remedial Response.

U.S. Environmental Protection Agency, 1993b.  Scout: A Data Analysis Program.
       Environmental Monitoring Systems Laboratory, Office of Research and Development.
       (NTIS: PB93-505303)

U.S. Environmental Protection Agency, 1992b. Methods for Evaluating the Attainment of
       Cleanup Standards: Volume 2: Ground Water. EPA/230/R-92/014.  Office of Policy,
       Planning, and Evaluation.  (NTIS: PB94-138815)

U.S. Environmental Protection Agency, 1991.  GEO-EAS 1.2.1. User's Guide. EPA 600/8-
       91/008. Environmental Monitoring Systems Laboratory, Office of Research and
       Development.  (NTIS: PB93-504967)

U.S. Environmental Protection Agency, 1989. Methods for Evaluating the Attainment of
       Cleanup Standards: Volume 1: Soils and Solid Media.  EPA/230/02-89-042. Office of
       Policy, Planning, and Evaluation. (NTIS: PB89-234959)

Walsh, J.E., 1958. Large sample nonparametric rejection of outlying observations, Annals of the
       Institute of Statistical Mathematics 10:223-232.

Walsh, J.E., 1953. Correction to "Some nonparametric tests of whether the largest observations
       of a set are too  large or too small," Annals of Mathematical Statistics 24:134-135.

Walsh, J.E., 1950. Some nonparametric tests  of whether the largest observations of a set are too
       large or too small, Annals of Mathematical Statistics 21:583-592.

Wang, Peter C.C., 1978.  Graphical Representation of Multivariate Data. Academic Press, New
       York, NY.

-------
Wegman, Edward J., 1990.  Hyperdimensional data analysis using parallel coordinates, Journal of
       the American Statistical Association 85:664.

Wei, W.S., 1990. Time Series Analysis (Second Edition).  Addison Wesley, Menlo Park, CA.

-------