oEPA
United States
Environmental Protection
Agency
Performance of Statistical
Tests for Site Versus
Background Soil Comparisons
When Distributional
Assumptions Are Not Met
Technology Support Center Issue
RESEARCH AND DEVELOPMENT
-------
-------
EPA/600/R-07/020
March 2007
www.epa.gov
Performance of Statistical
Tests for Site Versus
Background Soil Comparisons
When Distributional
Assumptions Are Not Met
Technology Support Center Issue
by
Evan J. England
U.S. Environmental Protection Agency
Office of Research and Development
National Exposure Research Laboratory
Environmental Sciences Division
Characterization and Monitoring Branch
Las Vegas, NV89119
Notice: Although this work was reviewed by EPA and approved for publication, it may not necessarily reflect official
Agency policy. Mention of trade names and commercial products does not constitute endorsement or
recommendation for use.
U.S. Environmental Protection Agency
Office of Research and Development
Washington, DC 20460
-------
-------
Technology Support Center Issue
Performance of statistical tests for site versus background
soil comparisons when distributional assumptions are not
met
Evan J. Englund
Abstract
Statistical distributions of site and background soil samples often do not meet the assumptions of
statistical tests. This is true even of "non-parametric" tests. This paper evaluates several
statistical tests over a variety of cases involving realistic population distribution scenarios and
sampling schemes. Over the range of cases, performance was erratic for most tests. When
planning a project, the sampling scheme must be designed together with the statistical test, and
the choice of test may vary depending on which scenario best matches the conceptual model for
the site.
Introduction
This report began as an inquiry into the ability of the Wilcoxon Rank Sum (WRS) test to
distinguish between background and contaminated soils in the "Choccolocco Corridor" (CC)
area of Fort McClellan Army Base in Alabama. The WRS test was performed under the null
hypothesis that there is no difference between the concentrations in the site and concentrations in
the background reference area. The alternative hypothesis was that the site concentrations are
greater than those in the background reference area. A positive test result for any analyte
indicated the presence of site contamination that would require further evaluation for possible
cleanup. Concerns about the performance of the WRS test led to an investigation of the test
under conditions similar to those in the Choccolocco Corridor.
Poor WRS performance under some conditions in the initial tests from the CC raised the obvious
question: Are there better alternatives? This question prompted the expanded and more
generalized investigation presented here, which compares the WRS test directly with several
other tests.
Student's t is the most well-known and widely-used statistical test for comparing two samples
sets. However, Student's t is based on very specific assumptions about the populations from
which the samples are drawn. These include the assumption that the population variances are
equal, and the assumption that the shape of both populations is normal. Welch's test (also called
Satterthwaite's t or the unequal-variance t) is a modified Student's t test that attempts to correct
for unequal variances, though it still requires the assumption of normality. The Wilcoxon Rank
Sum test (WRS) is a non-parametric alternative to the Student's t test that assumes that the
population shapes are identical, though not necessarily normal. The WRS test and the Welch's t
1
-------
test are often suggested when their assumptions appear to be more accurate than those of the
Student's t test. (EPA 2002, EPA 2006).
The literature is confusing at best regarding the relative merits of the WRS test and the Student's
t test. Hodges and Lehman (1956) argue on theoretical grounds in favor of the WRS, noting that
it is never much less efficient than Student's t (where less efficient means needing more samples
to get the same performance) but can be infinitely more efficient.
A sampling of papers that compare the tests in the context of specific applications or specific
distributions produces mixed conclusions. Bridge and Sawilowsky (1999) conclude that WRS is
better for small samples of skewed distributions; they recommend WRS when distributions are
unknown. Blair and Higgins (1980) found that WRS generally held power advantages over the t-
test for various non-normal distributions. Potvin and Roff (1993) conclude that the WRS test is
more powerful than the t-test for skewed distributions.
Johnson (1995), however, criticizes Potvin and Roff, pointing out that The Central Limit
Theorem works in favor of Student's t, making it insensitive to modest departures from
normality. Johnson also notes that the WRS test requires a strict assumption that the two
distributions are identical in both shape and variance - something that is rarely tested. He
concludes that if investigators use random sampling properly, ".. .parametric methods will
ordinarily be adequate; if they are not, nonparametric methods will not protect them from sailing
off course." Modarres et al. (2005) likewise note the strict distributional requirements of the
WRS test (and rank-based nonparametric tests in general), claiming results of these tests are not
valid when the assumptions are violated. For log-normal distributions, Zhou et al. (1997) reject
both WRS and t, and propose an alternative test that calculates a z-statistic using maximum
likelihood estimators.
The disparity among these papers is likely due to differences in the scenarios being investigated.
No two of the papers compare statistical tests on identical distributions and distribution shifts;
the conclusion to be drawn from the literature is that t-tests are superior in some cases, WRS in
others. None of the papers deals with the kinds of distributions that we would expect to
encounter when testing to distinguish between contaminated and uncontaminated soils.
Hypotheses
Statistical tests are framed in the form of a null hypothesis which is assumed to be true and an
alternative hypothesis which is accepted as true only if the data strongly indicate that the null
hypothesis is actually false.
Hypothesis tests favor the null hypothesis by setting stringent limits on the probability that the
null hypothesis will be rejected simply by chance, when it is, in fact, true. This is called a false
positive, or Type I error, and the limit on its probability is called the significance level of the test,
or alpha. Most hypothesis tests are designed so that when the assumptions of the test are met, the
test will perform as specified. In practical terms, that means in a large number of repeated trials,
the fraction of false positive decisions should be very close to alpha.
Two alternative null hypotheses are considered in this investigation. Adopting the terminology
from EPA (2002a), we have:
-------
Test Form 1 - The null hypothesis is that the distribution of concentrations in the site
population is identical to that of the background. In this case, no further action will be
taken unless the statistical test indicates that the site measurements are sufficiently higher
than the background measurements that they are unlikely to have occurred by chance. In
formal terms, the p-value of the test, which is the estimated probability that the result
could have occurred by chance, must be less than alpha. This test form directly controls
the costs of taking unnecessary action, while the environmental risks from failure to
detect contamination are determined by the sample size and the population variability.
Test Form 2 - The null hypothesis is that the site mean concentration exceeds the
background mean concentration by a specified threshold amount or more. In this case,
action will be taken unless the difference between site and background measurements is
significantly less than the threshold. This test form controls environmental risks by
setting an upper limit on the amount of contamination that can be missed by sampling,
while the costs of unnecessary action are a function of sample size and variability.
Statistical Tests and Assumptions
The statistical tests evaluated here include Wilcoxon Rank Sum, Student's t, Welch's t, the
quantile test, the quantile test combined with the WRS test, and the difference in sample means
compared to a specified threshold difference. In addition, Welch's t was performed on log-
transformed data. The test descriptions in italics below are taken from EPA (2006).
The Two-Sample Student's t Test
Purpose: Test for a difference or estimate the difference between two population means when it
is suspected the population variances are not equal.
Data: A simple or systematic random sample xi, X2, . . . , Xmfrom the one population, and an
independent simple or systematic random sample yi, y2, . . . , ynfrom the second population.
Assumptions: The two populations are independent. If not, then it is possible that a paired
method could be used. Both are approximately normally distributed or the sample sizes are large
(m andn both at least 30). If this is not the case, then a nonpar-ametric procedure is an
alternative.
Limitations and Robustness: The two-sample t-testwith unequal variances is robust to moderate
violations of the assumption of normality. The t-test is also not robust to outliers because sample
means and standard deviations are sensitive to outliers.
(U.S.EPA 2006, Section 3.3.1.1.1)
The Two-Sample t-Test (Welch-Satterthwaite: Unequal Variances)
Purpose: Test for a difference or estimate the difference between two population means when it
is suspected the population variances are not equal.
Data: A simple or systematic random sample xl, x2, . . . , xmfrom the one population, and an
independent simple or systematic random sample yl, y2, . . . , ynfrom the second population.
-------
Assumptions: The two populations are independent. If not, then it is possible that a paired
method could be used. Both are approximately normally distributed or the sample sizes are large
(m andn both at least 30). If this is not the case, then a nonpar ametric procedure is an
alternative,
Limitations and Robustness: The two-sample t-testwith unequal variances is robust to moderate
violations of the assumption of normality. The t-test is also not robust to outliers because sample
means and standard deviations are sensitive to outliers.
(U.S.EPA 2006, Section 3.3.1.1.2)
The Wilcoxon Rank Sum Test
Purpose: Test for a difference between two population means. The Wilcoxon Rank Sum test,
applied with the Quantile test, provides a powerful combination for detecting true differences
between two population distributions.
Data: A random sample xi, X2,..., Xmfrom one population, and an independent random
sample yi,y, . . . , ynfrom the second population.
Assumptions: The validity of the random sampling and independence assumptions should be
verified by review of the procedures used to select the sampling points. The two underlying
distributions are assumed to have approximately the same shape (variance) and that the only
difference between them is a shift in location. A qualitative test of this assumption can be done by
comparing histograms.
Limitations and Robustness: The Wilcoxon signed rank test may produce misleading results if
there are many tied data values. When many ties are present, their relative ranks are the same,
and this has the effect of diluting the statistical power of the Wilcoxon test. If possible, results
should be recorded with sufficient accuracy so that a large number of tied values do not occur.
Estimated concentrations should be reported for data below the detection limit, even if these
estimates are negative, as their relative magnitude to the rest of the data is of importance. If this
is not possible, substitute the value DL/2for each value below the detection limit providing all
the data have the same detection limit. When different detection limits are present, all data could
be censored at the highest detection limit but this will substantially weaken the test. A statistician
should be consulted on the potential use ofGehan ranking.
(U.S.EPA 2006, Section 3.3.2.1.1)
The Quantile Test
Purpose: Test for a shift to the right in the right-tail of population 1 versus population 2 This
may be regarded as being equivalent to detecting if the values in the right-tail of population 1
distribution are generally larger than the values in the right-tail of the population 2 distribution.
Data: A simple or systematic random sample, xi, X2, . . ., Xn,from the site population and an
independent simple or systematic random sample, yi, y2, . . ., ym,from the background
population.
-------
Assumptions: The validity of the random sampling and independence assumptions is assured by
using proper randomization procedures, which can be verified by reviewing the procedures used
to select the sampling points.
Limitations and Robustness: Since the Quantile test focuses on the right-tail, large outliers will
bias results. Also, the Quantile test says nothing about the center of the two distributions.
Therefore, this test should be used in combination with a location test like the t-test (if the data
are normally distributed) or the Wilcoxon Rank Sum test.
(U.S.EPA 2006, Section 3.3.2.1.2)
The Sample Means Test
The difference in arithmetic means of the two samples is tested against a threshold value. If the
site mean minus the background mean is greater than the threshold, then the site is considered
contaminated.
The arithmetic means test is not a true statistical test in the sense that it does not attempt to
control either alpha or beta. It is perhaps better described as a decision rule. The test is error-
neutral. It is the same whether using test form 1 or 2. If the error distributions are symmetrical, or
the sample sizes are large, then false positive and false negative error rates will be equal. It has
some potential advantages: it is simple; it directly tests the parameter of concern; and because it
uses no distributional information, it can be performed equally well with composite data at lower
analytical costs.
Alternately, the sample means test could be considered a degenerate form of the t-test. It is
equivalent to setting the alpha value of the t-test at the sample mean threshold to 0.5. In this case
the value oft becomes zero, and the critical value of the test is simply the threshold.
Tests on Transformed Data
Log transforms are not generally recommended when analyzing environmental data (EPA,
1997). Two problems are common. Upper confidence limits calculated by Land's method, while
theoretically correct if the true distribution of the population is exactly log-normal, may be much
too high when the population is only approximately log-normal. Also, it can easily be
demonstrated that in a one-sample test, comparing a statistic calculated on log-transformed data
against a log-transformed threshold limit can produce biased, non-protective decisions. To test
whether the same is true for a two-sample test, Welch's t on log-transformed data is included in
this study for comparison. The Welch's t test is chosen here instead of the Student's t test
because its assumptions are less stringent.
For test form 1, the approach is straight-forward: take the logarithms of the two data sets and
perform the Welch's t test. However, test form 2 presents complications when the significant
difference being tested is greater than zero. With untransformed data, the significant difference is
tested by either subtracting the difference from the site data or by adding the difference to the
background data, and then performing the test. Either way the result is the same. But the results
are not the same if the data are transformed before the test. Subtracting the significant difference
from the site data yields negative values that cannot be transformed, so the only real option is to
add the significant difference to the background values, transform the data, and perform the test.
-------
An alternative approach is to transform the original data before adding the significant difference.
There are a wrong way and two right ways to do this. The wrong way is to log-transform the
sample data and then to add the log of the significant difference (equal to the width of the gray
region - in this case, 50) to the transformed background values. That is the equivalent of
multiplying the background values by 50. The right ways differ in the interpretation of the
performance objectives (see Figure 1 below). We might be testing for a 50% increase, in which
case the log of 1.5 would be added to the transformed background data. Or we might be testing
for an increase of 50 units over an unknown background mean; then the added value would be
the log of the ratio: background mean plus 50 over the background mean. The latter is used in
this paper.
Combined Tests
The U.S. EPA (EPA, 1992, 1996) recommends using the quantile test in conjunction with the
Wilcoxon Rank Sum test. The reasoning is that the WRS test is robust to the presence of outliers,
which also makes it insensitive to localized contamination that would show up in the site
distribution as an increase or shift in the upper tail. The quantile test looks only at the upper tail
and should detect such an increase if it has occurred. The decision logic is that if either test
indicates further action, then further action will be taken.
Evaluation of the performance of a combination of two tests is beyond the scope of this paper.
To perform a quantile test requires choosing two parameters - the significance level alpha and
the quantile to test (i.e., the upper 10% of the distribution). A full evaluation would require
testing various combinations of alpha and quantile for the quantile test, along with various levels
of alpha for the WRS. This paper shows performance for a single combination as an example.
Both alpha values are set at 0.05, and the quantile is the upper 10%.
It has been suggested (EPA 1992) that a "hot sample" test be included in combination with the
WRS and quantile tests. This involves choosing an upper threshold value such that the site is
declared contaminated if any site measurement exceeds the threshold. That reference, however,
was not explicit as to how to choose or calculate a hot sample threshold. This paper does not
evaluate hot sample tests, alone or in combination. That is left, along with a full evaluation of
the combined WRS - quantile test, as a subject for future research.
Decision Objectives
Comparing statistical tests can be an "apples and oranges" exercise, because the tests operate on
different characteristics of the samples. Both the Student's t and the WRS tests set out to test
exactly the same thing. Both assume that the two populations being sampled have identical
shapes and variances, and that the populations differ only by a location shift. "Location shift"
usually denotes a change in mean, but the shape and variance constraints imply something
stronger - not only the mean, but every quantile of the population, including the median, are all
shifted by exactly the same amount.
When the tests are performed, they tell us different things. The t-tests actually estimate the
population means and might conclude that the mean of population A is significantly higher than
the mean of population B. The WRS test does not conclude anything about means - only that the
-------
relative rankings of population A are greater than the relative rankings of population B. As
Modarres et al. (2005) noted, when the WRS assumptions are violated, a difference in rankings
might be detecting a change in shape or variance instead of the intended location shift.
Similarly, the t-test performed on log transformed data compares the means of the logs, which
tells us nothing about the arithmetic population means. If the log-transformed populations are
strictly normal, then the t-test on log-transformed data is a test of medians because the mean of a
normal distribution of logs is the median of the corresponding arithmetic values. The simulated
site populations that will be used in this investigation are bimodal log-normal distributions - the
background and contaminant components of the site distribution are each log-normal, but the
combination is only log-normal if the entire site is contaminated.
As pointed out earlier, this investigation began as an evaluation of the performance of the WRS
test and expanded to include comparisons of the WRS test with some alternative tests. For site-
background comparisons, the WRS test is used (in the author's experience) only as a non-
parametric substitute for a t test when the hypothesis of normality is rejected; therefore it is
reasonable to assume in such cases that the purpose of performing either test is to compare the
population means. In this investigation, we choose to define the objective of performing a test to
be the detection of a significant difference in population mean, regardless of what the test itself
may actually be measuring. There may be good reasons to compare other population parameters,
but such comparisons are beyond the scope of this paper.
We adopt a pragmatic approach toward hypothesis testing. Contaminated sites are not controlled
experiments. In the real world, if we evaluate sets of sample data and find that there are no
available statistical tests whose assumptions are perfectly met, we do not have the luxury of not
making a decision. We do not abandon tests simply because the underlying assumptions are not
met. After all, t-tests and the WRS test are widely used at least in part because they are robust to
moderate violations of the assumptions. However, we do abandon the notion of formal statistical
inference. P-values can not be interpreted as exact probabilities; therefore, alpha values are not
true significance levels. The tests in effect become mechanical decision algorithms or "decision
rules" - the data go in and a decision comes out. Our concern is not about which test is formally
"correct", but about which test performs best under the circumstances.
Performance Evaluation
Performance was evaluated by computer simulations designed to mimic the sampling and
decision process as closely as possible. For this investigation, the site and reference area are
assumed to be sampled randomly, and that the comparison of samples is intended to answer the
question: Is the mean concentration of the site significantly higher than the mean concentration
of background? The approach is to produce realistic simulated "populations" for the background
and the site, to re-sample and test the populations numerous times, and to record the decision
outcomes. Each decision is classified as "action" or "no action," and the decimal fraction of
"action" decisions over a number of trials is plotted as an estimate of the true performance
probability for the test. A performance curve is constructed by repeating the process with a series
of simulated site populations with increasing mean concentrations.
Hypothesis tests are run at specified levels of alpha, which control (or attempt to control) the
false rejection rate at a specified threshold value. Here, the threshold for test form 1 is set at zero
difference. An alpha value of 0.05 would represent a 0.05 probability of cleanup when the site
7
-------
mean concentration is actually identical to the background mean concentration. A DQO diagram
(Figure 1) is used during project planning to quantify the desired performance of a
sampling/decision-making effort. The alpha "control point" for test form 1 is indicated by the
lower corner on the line bounding the left side of the gray region. The alpha control point for test
form 2 is the upper corner on the right boundary. For this study, the threshold difference for test
form two is set at 50, or 50% higher than the nominal background mean of 100. An alpha of
0.05 would represent a 0.95 probability of cleanup at this threshold.
A statistical test that is performing correctly because all of the assumptions are met should
always pass through the alpha control point. Whether it passes through the other control point is
a function of the variability of the populations and the number of samples.
For this investigation, the threshold value for the sample mean test is set at the midpoint of the
gray region, or 25. This is just a first approximation to make the performance of this test
comparable to the other tests. This is basically a Central Limit theorem approximation - if the
sample size is large enough to make the distribution of errors about the mean look approximately
normal, then the mean will approximately equal the median. When the true value equals the
threshold, the decision probability will be 0.5 either way; thus the performance curve will pass
through the center of the gray region.
-------
DQO Performance Limits
2
Q_
; GRAY
REGION
50
100
True Difference
150
200
Figure 1. Example DQO diagram specifying performance requirements for a two-sample test, (a) The alpha
control point for test form 1. (b) The alpha control point for test form 2. The dotted red line is a performance
curve for an acceptable DQO design; "acceptable" because it stays within the boundaries of the gray region,
or equivalently, within the control points. Different combinations of sample size and method, analytical
method, and statistical test can produce numerous acceptable designs. In the typical DQO process, the lowest
cost of these would be "optimal."
Performance results for the simulations in this study are presented in the form of simplified DQO
diagrams. To reduce visual clutter when comparing several performance curves on the same
graph, the gray region and most of its boundaries are removed. Only the corner sections of the
bounding lines remain to indicate the design control points. The alpha control point for the test
form being illustrated is shown in black; the other in gray, as in Figure 2. All results will be
shown in this fashion; as pairs of plots with test form 1 on the left and test form 2 on the right.
-------
Example Performance Plot
Test Form 1
Example Performance Plot
Test Form 2
c
0 00
'So ~
M
0
"^ _
o O
CD
.a
o ~
d ~
/'
/
/
/
/
3*' alpha = 0.05
I I I
0 50 100
I I
150 20C
True Difference
c
0 00
'So ~
M
0
"-- ^"
o O
CD
.a
o -
CL
p _
/ \
jr- alpha = 0.05
/
/
I
i
i
y
0 I I I I
0 50 100 150
b True Difference
i
20C
Figure 2. Simplified DQO diagram used to present results in this paper. The gray region and most of the
gray region boundaries in Figure 1 have been removed to reduce visual clutter. The gray and black "ticks''
remain to show the target points. The alpha point for formal statistical tests is shown in black for the
appropriate test form.
The Simulation Algorithm
1. Generate a simulated background population: 100,000 random values log-normally
distributed with arithmetic mean =100 and a specified log standard deviation.
2. Generate a simulated contaminant population of 100,000 values containing both zero
values and contaminated values in a specified proportion. The non-zero contaminated
values are log-normally distributed with a specified log standard deviation and an
arithmetic mean such that the arithmetic mean of the entire contaminant population,
including the zeros, is equal to D, a user-specified arithmetic "true difference" value. D is
initially set to zerono contamination.
3. Add the two distributions to create the site population.
4. Draw a simple random background sample of size nb and a simple random site sample of
size ns. The sample sizes, nb and ns, are user-specified and may differ.
5. Perform a Wilcoxon Rank Sum test with the null hypothesis: site = background, versus
the alternative hypothesis: site > background. Record the p-value.
6. Perform a two-sample Student's t test as in step 5.
1. Repeat steps 4 through 6. The number of repetitions is user-specified, and is determined
by trial-and-error to produce acceptably smooth performance curves. Evaluate the
performance of the tests by computing the fraction of repetitions with positive action
decisions (p-valuealpha for test form 2).
8. Repeat steps 2 through 7, for a series of specified D values.
9. Plot the performance for each test as a function of the true difference in means.
Simulations and graphics were done with R software (R Development Core Team, 2005).
10
-------
The Background Population
Trace metal concentrations in natural soils are usually positively skewed and are often
approximately log-normal. Background sample data from the Fort McClellan Choccolocco
Corridor were used as a basis for choosing a reasonable background population distribution for
this study. Normal probability plots of the log concentrations for six metals in the Choccolocco
Corridor background samples are shown in Figure 3. A perfectly normal, or in this, case log-
normal, distribution would plot on a straight line. All but mercury (Hg) appear approximately
log-normal. Mercury is anomalous - possibly bimodal. For purposes of this investigation, we
will assume that "typical" background distributions are log-normal.
Log-normal Probability-As
Log-normal Probability - Cd
a
o
Q.
01
CO
p
o '
in
o '
Log-normal Probability - Hg
l l l l \
-2-101 2
Theoretical Quantiles
Log-normal Probability-Ag
p
c>
O
*>
Q.
(3
CO
l l l r
-2-101
Theoretical Quantiles
Figure 3. Probability plots of log concentrations of trace metals in Choccolocco Corridor background
samples. The solid lines are the theoretical log-normal distributions matching the sample parameters.
11
-------
Table 1 contains natural log standard deviations for the six CC background distributions in
Figure 2. Four corresponding values from a USGS soil survey (USGS, 1984) covering the
contiguous 48 States are also shown. Normally, we would expect to see lower variability in a
local sample than in the USGS sample; but soil variability can be very sensitive to the soil
sampling method, the total mass of a sample, and sub-sampling procedures. In any event, Table 1
provides what is needed for this investigation - a range of values for real-world sample
variability. For this study, we will assume a lognormal background distribution, and choose a
middle-of-the-pack natural log standard deviation of 0.80.
Table 1. Natural log standard deviations of
Choccolocco Corridor and USGS background samples
Arsenic
Cadmium
Chromium
Lead
Mercury
Silver
CC
0.87
1.17
0.83
0.71
0.67
1.45
USGS
0.80
0.86
0.52
0.92
The Site Populations
Four scenarios for the distribution of the site population will be evaluated:
1. 100% of the site is contaminated.
2. 50% of the site is contaminated.
3. 20% of the site is contaminated.
4. 10% of the site is contaminated.
In these scenarios, the site population is initially set equal to the log-normal background
population and a lognormal contaminant distribution is added to the background distribution in
the contaminated fraction. The contaminant distribution is assumed to be more variable than the
background distribution - a natural log standard deviation of 1.5 is assumed for the contaminant
distribution in the simulations.
Figures 4-7 show site and background population distributions for the four site scenarios. The
distributions are shown as density plots, which can be thought of as smoothed histograms. Each
case is plotted with arithmetic concentration on the x-axis, and then plotted again to the right
with log concentration on the x-axis. In each figure, the upper pair of plots (a and b) shows the
case when the site concentration exceeds the background concentration by 50; that is, the target
threshold or significant difference that was chosen for test form 2. In the lower pair of plots (c
and d) the difference between the site and the background is 200. The background population, of
course, remains constant through all of these scenarios and cases.
As the fraction of contaminated soil decreases, the visual differences between the site and
background populations become less obvious. Intuitively, we might expect these low-fraction
scenarios to be more difficult to distinguish by statistical testing as well. Scenario 4 (Figure 7),
with only 10% of the soil contaminated, is statistically equivalent to a hot-spot scenario.
12
-------
Background and site populations are largely identical, with all of the differences occurring in the
upper tail.
100 % Contaminated -50 ppm Diff
0 100 200 300 400 500
Concentration
100 % Contaminated -50 ppm Diff
Background
Site
5 10
log Concentration
15
100 % Contaminated -200 ppm Diff
100 200 300 400 500
Concentration
100 % Contaminated -200 ppm Diff
d
Background
--- Site
I
10
log Concentration
15
Figure 4. Distributions from Scenario 1.
13
-------
50 % Contaminated -50 ppm Diff
50 % Contaminated -50 ppm Diff
0 100 200 300 400 500
Concentration
Background
Site
5 10
log Concentration
15
50 % Contaminated -200 ppm Diff
50 % Contaminated -200 ppm Diff
100 200 300 400 500
Concentration
d
Background
Site
5 10
log Concentration
15
Figure 5. Distributions from Scenario 2.
20 % Contaminated -50 ppm Diff
20 % Contaminated -50 ppm Diff
0 100 200 300 400 500
Concentration
Background
Site
I
10
log Concentration
15
20 % Contaminated -200 ppm Diff
20 % Contaminated -200 ppm Diff
o i i i i i r
0 100 200 300 400 500
Q Concentration
q
o
d
Background
Site
5 10
log Concentration
15
Figure 6. Distributions from Scenario 3.
14
-------
10 % Contaminated -50 ppm Diff
10 % Contaminated -50 ppm Diff
0 100 200 300 400 500
Concentration
5 10
log Concentration
15
10 % Contaminated -200 ppm Diff
10 % Contaminated -200 ppm Diff
100 200 300 400 500
Concentration
d
15
log Concentration
Figure 7. Distributions from Scenario 4.
Null Hypotheses
The performance of a formal statistical test of two populations depends on whether we begin by
assuming that the two populations are the same, or that they are different. EPA (2002) refers to
the former as "test form 1" and the latter as "test form 2." In statistical terms, the null hypothesis
for test form 1 is that the site mean is less than or equal to the background mean. When the test
rejects the null hypothesis, the site mean concentration is deemed significantly elevated above
the background mean concentration, thus requiring a cleanup operation, or at least a further
evaluation phase. When test form 2 is used the null hypothesis is that the site is contaminated by
at least a specified amount above background. In this case, cleanup or further evaluation will
proceed unless the null hypothesis is rejected in favor of the alternative.
The Student's t, Welch's t, and WRS tests are commonly used for both test forms. When test
form 1 is used, the sample data are used directly; the tests assume the two data sets are equal, and
determine whether or not the site data are higher than background. When test form 2 is used, one
of the data sets is shifted by the significant difference - the width of the gray region. The
difference can either be added to the background measurements or subtracted from the site
measurements. Statistical software packages usually have an option to test for a significant
difference between populations, making it unnecessary for the user to modify the data. The tests
assume that the modified data sets are equal, and then determine whether or not the background
data are lower than the site.
The quantile test is performed only using test form 1. The quantile test only looks at the upper
tails of the distributions; so, it does not make sense to shift an entire data set.
15
-------
Sample Sizes
Two sample sizes were used in the simulation: n = 30 and n = 150. N = 30 is a frequently used
(and misused) rule-of-thumb sample size. From the Central Limit Theorem, we know that for
any population distribution, the distribution of the sample mean approaches the normal
distribution as the sample size increases. An example of this is the Student's t distribution, where
the distribution of the mean for sample sizes of 30 or more is very close to normal. It is
sometimes mistakenly assumed that 30 samples are adequate for any distribution.
The sample size of 150 comes from an ad hoc design approach intended to provide a "ballpark"
number. The theoretical sample size required to achieve the performance objectives in Figure 1
was calculated by assuming a normally distributed population. An arbitrary "safety factor" of
20% was added, and the result rounded to the nearest 10 samples. The theoretical part of the
design was done using the Visual Sample Plan, v. 4.6D freeware package (http://dqo. pnl. gov/)
developed by the Pacific Northwest National Laboratory, as shown in the screen shot below.
True Mean vs. Reference Area True Mean
Two-Sample t-Test Sample Placement Costs
For Help, highlight an item and press F1
(* iDifference of True Means >= Action Level (Assume Dirty!
<~ Difference of True Means <= Action Level (Assume Clean)
You have chosen as a baseline to assume the survey unit is "Dirty"
False Rejection Rate (Alpha):
False Acceptance Bate (Beta):
Width of Gray Region (Delta):
Specified [Difference of True Means: ]50
E stimated S tandard D eviation: [ll 8.375
Minimum Number of Samples in Survey Unit: 122
Minimum Number of Samples in Reference Area: 122
Use Historical
OK
Cancel
Apply
Help
The first four inputs are taken directly from Figure 1, and are obvious. The arithmetic estimated
standard deviation (BSD) was calculated from the log standard deviation of the background
population by first calculating the coefficient of variation (CV):
CV =esy 1 = e ' -1= 0.896, where sy is the log standard deviation of the
background population. Then,
16
-------
CV = 0.947
By definition,
CV = BSD/mean, so
BSD = CV * mean.
The appropriate mean in this case is the midpoint of the gray region, or 125, so
BSD = 0.947* 125 = 118.375.
The theoretical sample size (Nt) from VSP is 122, and our ad hoc adjustment gives
N = 1.2 *Nt= 146.4-150.
Measurement Errors
Measurement errors were not treated separately in the simulations. The simulated populations
were assumed to be the sets of all possible soil sample measurements rather than being the sets
of all possible true soil sample concentrations. All measurements were assumed to be above the
detection limit.
Simulation Results
Figures 8-23 illustrate performance curves for the various statistical tests. Each figure shows one
test case: one contamination scenario sampled by a particular combination of site and
background sample sizes.
Each figure contains three pairs of plots as described earlier; each pair comparing several tests
for the two test forms. WRS and Welch's t results are repeated in each pair for reference.
Upper pair (a and b) - WRS, Student's t, and Welch's t.
Middle pair (c and d) - WRS, Welch's t, Quantile test, and WRS + Quantile test.
Lower Pair (e and f) - WRS, Welch's t, Welch's t (log transformed data), and Sample
Mean test.
Overall test performance is determined by both false positive and false negative rates. When
evaluating test performance with performance plots, there are two critical features to look at.
First, as indicated above, the curve should pass through the specified alpha control point. Second,
the curve should rise steeply to the right of the control point in the case of test form 1 or fall
steeply to the left of the control point in the case of test form 2. In general the best overall
indicator of performance is the steepness of the curve. Steeper curves indicate a sharper
separation between populations.
Table 2 presents some of the performance results in quantitative terms. Each of the 16 cases
shown in figures 8-23 has two rows of performance data, with columns representing specific
tests. The first row, with text in italics, shows the observed performance when the true
17
-------
concentration difference is zero. This is the false positive rate for test form 1, or for any test, the
false action rate. The second row contains the value of the true difference in concentration at
which the test achieved the desired 0.95 action rate. If the test did not achieve a 0.95 action rate,
the observed rate at the maximum true difference (200) is shown instead.
The focus on differences in row 2 is because it provides an intuitive evaluation of relative risk.
The true difference between the mean site concentration and the background concentration is
proportional to the increase in risk at the site. The upper performance target - a 0.95 action
decision rate at a threshold value of 50 - sets a de facto upper limit on the increase in risk that
can be allowed to go undetected. If a test meets this target, it can be considered "protective"
against a risk increase of 50 (or 50%, where the true background mean is 100). If the observed
performance of a test achieves the 0.95 action rate at a true difference of 70, that test can be said
to be protective against a 70% increase in risk.
Performance rates are used in row 1 because decision errors of this type increase action costs
unnecessarily. These (expected) increases are directly proportional to the error rate.
Table 2 highlights the "better" performance results. In row 1, error rates less than or equal to
0.10 are underlined. In row 2, distances less than or equal to 100 are shown in bold. Instances
where both criteria are met are shaded. These ranking criteria are arbitrary.
18
-------
Equal Sample Sizes: 30 Background Samples, 30 Site Samples
Scenario 1 100 % contaminated
Scenario 1 100 % contaminated
rod
q
ci
WRS
Students!
-- Welch's!
33 Bac\
-------
Scenario 2 50 % contaminated
Scenario 2 50 % contaminated
rod
q
ci
WRS
Students!
-- Welch's!
33 Bac\
-------
Scenarios 20 % contaminated
Scenarios 20 % contaminated
q _
c
B°°. _
**~ CD
~ ^_
£ ° ~
CL ^ -
O _
d
WRS
Students!
m Welch's!
^»~^
^ """" 30 Backg round Samples
\f ~ "^ X Site Samples
-"~^ Background mean = 100
q _
c
Jd-
~ ^_
£ ° ~
0
(t -
o _
d ~
,"^>"~~
/ WRS
/ Sludentst
^ Welch's!
^
^/^ X Backg round Samples
|>^ 30 Site Samples
' Background mean = 100
0 50 100 150 200 0 50 100 150 200
True Difference
True Difference
Scenarios 20 % contaminated
Scenarios 20 % contaminated
q _
l»_
*t~)
.-ti
!Q *
ro d ~
J2
SCM
°- d ~
o
d -
J7 - - *
s
s
s
/
I +
t - . VVRS"
/ C-S-2 Welch's t
/ .J*-^^" QuantileTest
/ ^^^^ m WRS+Quantile
*^ 30 Bacl^ round Samples
,A ^ ^ X Site Samples
~[ ^ ^ " Backg round mean= 100
50 100 150
True Difference
200
d
50 100 150
True Difference
200
Scenarios 20 % contaminated
Scenarios 20 % contaminated
q _
I oo
,ci
.ti
!Q ^J-
rod ~
o
^d-
a
o
d ~
^ r^ ^^^ *
x-r--
X *
// *
/If WRS
»/ . __^ Welch's!
»/ .- -- Welch's log!
/ ' -^ Sample Mean
^ X Backg round Samples
*/ M Site Samples
Background mean = 100
50
100 150
True Difference
200
50
f
100 150
True Difference
200
Figure 10 Scenario 3. Performance of several statistical tests. 30 background samples, 30 site samples. Left:
Test Form 1. Right: Test Form 2.
21
-------
Scenario 4 10 % contaminated
Scenario 4 10 % contaminated
p _
c
S<°..
"og _
!o **
rod ~
£3-
o _
o ~
WRS
Students!
Welch's!
33 Bac^ round Samples
"""'^ Backg round mean = 100
I I I 1 I
0 50 100 150 200
q _
c
Jd-
i= ^
rod -
o
o _
d ~
..-- *-'
^ ^ *°
/ WRS
/ ' Sludentst
/ Welch's!
' X Backg round Samples
.s **^ eg Citp 5gmn| es
' Backg round mean = 100
I I I 1 I
0 50 100 150 200
True Difference
True Difference
Scenario 4 10 % contaminated
Scenario 4 10 % contaminated
q _
!» _
^ d ~
= ,-j.
ro d ~
J2
°CM
o
d ~
WRS
Welch's!
QuanlileTesl
WRS+Quanlile
m ,£_ , 33 Bac^ round Samples
J^^ * " " Bac^ round mean= 100
50 100 150
True Difference
200
q _
c
|s-
ro o ~
°- d "
r ------
s
/ WRS
/ Welch's!
/ QuanlileTesl
/ WRS+Quantile
^ 33 Bac^ round Samples
m^"^ M Site Sampl es
I I I 1 I
0 50 100 150 200
d
True Difference
Scenario 4 10 % contaminated
Scenario 4 10 % contaminated
q _
!°q_
^CD
t i ~
.*=;
ro d ~
o
^" d ~
o
d ~
*
WRS
Welch's!
- - Welch' slog!
' Sample Mean
/
^e_ . _ 33 Bac^ round Samples
f^, «. -"-* =-: M Site Sampl es
^*^ Bac^ round mean = 100
q _
l»_
"Q CD _
^
= ,
ro d ~
o
^ ri ~
o
d ~
- *" ^^^ ^* ^"^ "
* ^» ^E *
£***
^ *
if f~ WRS
^ Welch's!
'/ - - Welch's log!
/ ' Sample Mean
/ .
^ ' 33 Bac^ round Samples
f^^ M Site Sampl es
Bac^ round mean = 100
50 100 150
True Difference
200
f
50 100 150
True Difference
200
Figure 11. Scenario 4. Performance of several statistical tests. 30 background samples, 30 site samples. Left:
Test Form 1. Right: Test Form 2.
22
-------
Equal Sample Sizes: 150 Background Samples, 150 Site Samples
Scenario 1 100 % contaminated
Scenario 1 100 % contaminated
o _
c
£<»_
OCD_
^*
~ ^.
ro H ~
Is-
o
o
/Jp5
WRS
Sludentst
Welch's!
1 50 Bac^ round Samples
Bac^
0 50 100
_ True Difference
a
150 Site Samples
round mean= 100
150 200
Scenario 1 100 % contaminated
q _
c
B<» _
'o'P-
>s°
s "*
ro d
J2
o
^~ o "
o _
/'P" ^'^
/'
--
WRS
Welch's!
QuanlileTesl
WRS+Quanlile
1 50 Bac^ round Samples
Backg
0 50 100
_ True Difference
c
150 Site Samples
round mean= 100
150 200
Scenario 1 100 % contaminated
q _
c
S°q _
^CD
£,0 -
~ ^
ro d "
0
^- H "
o
ci
/.»-*"" "
'
"
WRS
Welch's!
Welch' slog!
Sample Mean
1 50 Backg round Samples
f Backg
150 Site Samples
round mean= 100
50 100 150
True Difference
200
|oq
! O
q
ci
WRS
Students!
-- Welch's!
150 Back] round Samples
150 Site Samples
Back] round mean= 100
b
50 100 150 200
True Difference
Scenario 1 100 % contaminated
ro o
q
ci
WRS
Welch's!
QuantileTest
WRS+Quantile
150 Bac^ round Samples
150 Site Samples
Back] round mean= 100
50
100 150
True Difference
200
Scenario 1 100 % contaminated
C
.Qoq
Z°
"BCD
rod
o
q
ci
WRS
-- Welch's!
- Welch'slog!
m Sample Mean
150 Back] round Samples
150 Site Samples
Back] round mean= 100
50
f
100 150
True Difference
200
Figure 12. Scenario 1. Performance of several statistical tests. 150 background samples, 150 site samples.
Left: Test Form 1. Right: Test Form 2.
23
-------
Scenario 2 50 % contaminated
Scenario 2 50 % contaminated
0-rS -
q
ci
WRS
Students!
-- Welch's!
150 Bac^ round Samples
150 Site Samples
Backg round mean = 100
rod
isn
q
ci
WRS
Students!
-- Welch's!
150 Bac^ round Samples
150 Site Samples
Back] round mean= 100
50 100 150
True Difference
200
50 100 150
True Difference
200
Scenario 2 50 % contaminated
Scenario 2 50 % contaminated
q
B°P
"SO.
ro o
_Q
is-
q
ci
WRS
Welch's!
QuantileTest
WRS+Quantile
150 Bac^ round Samples
150 Site Samples
Backg round mean= 100
!cq
! O
ro o
q
ci
WRS
Welch's!
QuanlileTesl
WRS+Quanlile
150 Back] round Samples
150 Site Samples
Back] round mean= 100
50 100 150
True Difference
200
d
50 100 150
True Difference
200
Scenario 2 50 % contaminated
Scenario 2 50 % contaminated
Joq
rod
o
q
ci
WRS
-- Welch's!
- Welch'slog!
m Sample Mean
150 Bac^ round Samples
150 Site Samples
Backg round mean= 100
| oq
Q Ci
rod
o
q
ci
WRS
-- Welch's!
- Welch'slog!
m Sample Mean
150 Bac^ round Samples
150 Site Samples
Back] round mean= 100
50 100 150
True Difference
200
f
50 100 150
True Difference
200
Figure 13. Scenario 2. Performance of several statistical tests. 150 background samples, 150 site samples.
Left: Test Form 1. Right: Test Form 2.
24
-------
Scenarios 20 % contaminated
Scenarios 20 % contaminated
rod
q
ci
>"'
WRS
Sludentst
-- Welch's!
150 Backg round Samples
150 Site Samples
Background mean= 100
50 100 150
True Difference
200
q _
c
S» _
,°
!Q "*
ro ci ~
^d ~
o
o -
n-
X «
/"
/* X
* ^
/ >-^
*/?
"/
//
v '
u
i""'*^'''
^- '
^**^
WRS
Welch's!
QuanlileTesl
WRS+Quanlile
1 50 Backg round Samples
150 Site Samples
Background mean= 100
q _
Q
c
oo
o ci
^
"g CD _
>,d
!Q *
ro ci ~
2 CM _
ci ~
o
o -
p-'' ^. *"""*"
/ ^
/ ^
/ X
/ y WRS
; f Welch's!
/ f QuanlileTesl
If m WRS+Quantile
f 1 50 Backg round Samples
1 X- 150 Site Samples
^x Background mean = 100
50 100 150
True Difference
200
d
50 100 150
True Difference
200
Scenarios 20 % contaminated
Scenarios 20 % contaminated
q
S°P
"Btp
ro0
o
q
ci
WRS
-- Welch's!
- - Welch'slog!
Sample Mean
150 Bac^ round Samples
150 Site Samples
Backg round mean= 100
50 100 150
True Difference
200
q _
J °°. _
-------
Scenario 4 10 % contaminated
Scenario 4 10 % contaminated
o
*
CD
o -
CM
O ~
q _
r
»**"
S WRS
f Student st
/ _^ Welch's t
*X^^ 1 50 Backg round Samples
\jr 150 Site Samples
^ Background mean = 100
0 50 100 150 200
o
c
o m
ro d -
Is-
q _
r^--"
/
t WRS
* Sludentst
> Welch's!
/ 1 50 Bac^ round Samples
1 t 150 Site Samples
^ Back] round mean = 100
I I I 1 I
0 50 100 150 200
True Difference
True Difference
Scenario 4 10 % contaminated
Scenario 4 10 % contaminated
q
^°P
£*
ro o
q
ci
WRS
Welch's!
QuanlileTesl
WRS+Quantile
1 50 Bac^ round Samples
150 Site Samples
Backg round mean= 100
50
100 150
True Difference
200
d
o
!» _
°S-
ro ci -
.a
°- d ~
o
o -
r--"
/ ^^~* WRS
; ^^ Welch's!
/ ^* QuanlileTesl
/ X WRS+Quanlile
/X 1 50 Back] round Samples
1 X 150 Site Samples
^ Back] round mean = 100
50 100 150
True Difference
200
Scenario 4 10 % contaminated
Scenario 4 10 % contaminated
o
.0 (jo
0-^
\jt
-1
.--"
WRS
-- Welch's!
__- - Welch' slog!
Sample Mean
1 50 Bac^ round Samples
150 Site Samples
Backg round mean= 100
O
° oo
oto.
^^
?= .
ro d ~
o
°- ri -
o
r~ * --~~~*
»
i
t WRS
f Welch's!
/ ^m Welch's log 1
J - * Sample Mean
t ^ 1 50 Bac^ round Samples
JV s ~ 150 Site Samples
f\^* Bac\
-------
Unequal Sample Sizes: 30 Background Samples, 150 Site Samples
Scenario 1 100 % contaminated
Scenario 1 100 % contaminated
Q- rS ~
q
ci
WRS
Students!
-- Welch's!
33 Bactej round Samples
150 Site Samples
Backg round mean = 100
50 100 150
True Difference
200
,
Q- r-: H
b
WRS
Students!
-- Welch's!
30 Back] round Samples
150 Site Samples
Backg round mean= 100
50 100 150
True Difference
200
Scenario 1 100 % contaminated
Scenario 1 100 % contaminated
q
ci
WRS
Welch's!
QuantileTest
WRS+Quantile
30 Bac^ round Samples
150 Site Samples
Backg round mean= 100
!oq
! O
o
>s
^
5
ro
_Q
o
q
ci
f",
^ WRS
Welch's!
QuantileTest
WRS+Quantile
X Bac^ round Samples
150 Site Samples
Backg round mean= 100
50 100 150
True Difference
200
50 100 150
True Difference
200
Scenario 1 100 % contaminated
Scenario 1 100 % contaminated
q
. Q oo
rod
q
ci
WRS
-- Welch's!
- - Welch'slog!
m Sample Mean
30 Bac^ round Samples
150 Site Samples
ground mean= 100
q
. Q oo
rod
q
ci
WRS
-- Welch's!
- - Welch'slog!
m Sample Mean
X Back] round Samples
150 Site Samples
Backg round mean= 100
50 100 150
True Difference
200
f
50 100 150
True Difference
200
Figure 16. Scenario 1. Performance of several statistical tests. 30 background samples, 150 site samples. Left:
Test Form 1. Right: Test Form 1.
-------
Scenario 2 50 % contaminated
Scenario 2 50 % contaminated
.y oq
o d
o
-------
Scenarios 20 % contaminated
Scenarios 20 % contaminated
o
Probability of Action
0.2 0.4 0.6 0.8 1
o _
d ~
^'""-' '
S *"
X
/ WRS
/ Student st
/ -- Welch'st
j'^s****^ 30 Backg round Samples
^» Background mean = 100
0 50 100 150 200
o
Probability of Action
0.2 0.4 0.6 0.8 1
o _
d ~
,, WRS
I, ' Siudenfsi
t, Welch'st
ff X Bactej round Samples
^^ Backg round mean = 100
I I I 1 I
0 50 100 150 200
True Difference
True Difference
Scenarios 20 % contaminated
Scenarios 20 % contaminated
q
I oo
£*
ro o
q
d
WRS
Welch'st
QuantileTest
WRS+Quantile
33 Bac^ round Samples
150 Site Samples
round mean= 100
50 100 150
True Difference
200
o
J oo
.-*=:
ro d ~
o _
I
i
/ _,. .
/ , f WRS
' f Welch'st
. « ^ QuantileTest
/ »^ - WRS+Quantile
f - ^X 33 Bac^ round Samples
LX x*^^ ~~~~~~~ 150 Site Samples
-ji^ . " Bac^ round mean = 100
I I I 1 I
0 50 100 150 200
d
True Difference
Scenarios 20 % contaminated
Scenarios 20 % contaminated
O
|oq_
"BCD _
>,o
= ,
ro d ~
o
°- d ~
o
d ~
.',:.
' ' ^.t" WRS
/ ^, * Welch'st
lf' ^_ -- Welch's log t
-^ / ^^p-^-- m bample Mean
f'*f 30 Bac^ round Samples
LX''^ 1 50 Site Sampl es
^ Bac^ round mean = 100
O
|oq_
"5 CD _
>,o
?= .
ro d ~
.a
o
°- d ~
o
d ~
/FV--
/ *x
// WRS
M Welch'st
t't - Welch's log t
j. m Sample Mean
§ L 33 Back] round Samples
L ^ 150 Site Samples
^. ' Bac^ round mean = 100
50
100 150
True Difference
200
f
50 100 150
True Difference
200
Figure 18. Scenario 3. Performance of several statistical tests. 30 background samples, 150 site samples. Left:
Test Form 1. Right: Test Form 2.
29
-------
Scenario 4 10 % contaminated
Scenario 4 10 % contaminated
o
c
B°q_
< °
O ^ _
_>»°
~ Td-
ro H ~
_a *-*
^o1-
o
o
_,.-"
^ ^
x' WRS
x1 Sludentst
^' Welch's!
X
X
x ' ^30 Backg round Samples
L " 150 Site Samples
O
c
So5_
< °
o ^ _
.£ro
~ T.J-
ro d ~
J2
o
o
o
r .»«.-.
.
»/
»/
/' WRS
/.' ' Sludentst
. / Welch's!
" 33 Backg round Samples
/ 150 Site Samples
Rankq round mean =100
50 100 150
True Difference
200
50 100 150
True Difference
200
Scenario 4 10 % contaminated
Scenario 4 10 % contaminated
I»J
ro o
§
q
d
WRS
Welch's!
QuanlileTesl
WRS+Quantile
33 Backg round Samples
150 Site Samples
ground mean= 100
50 100 150
True Difference
200
d
q _
c
£<».-
<
'B'P-
>-°
= ,-j.
ro d -
2CM_
O
d -
r^.~~"
X
/
/ WRS
; Welch's!
, QuanlileTesl
/ m WRS+Quanlile
' ^_***~^ " 30 Backg round Samples
f ^,» ^*" 1 50 Site Sampl es
+ * RarkqrnunH mpan = 1(V)
50 100 150
True Difference
200
Scenario 4 10 % contaminated
Scenario 4 10 % contaminated
o
!» _
"o
-------
Unequal Sample Sizes: 150 Background Samples, 30 Site Samples
Scenario 1 100 % contaminated
Scenario 1 100 % contaminated
q
ci
WRS
Students!
-- Welch's!
150 Bactej round Samples
30 Site Samples
Backg round mean = 100
50 100 150
True Difference
200
rod
q
ci
WRS
Students!
-- Welch's!
150 Back] round Samples
30 Site Samples
Back] round mean= 100
50 100 150
True Difference
200
Scenario 1 100 % contaminated
Scenario 1 100 % contaminated
Q-r-i -
q
ci
WRS
Welch's!
QuantileTesl
WRS+Quantile
150 Bac^ round Samples
30 Site Samples
Backg round mean= 100
50 100 150
True Difference
200
o
o
F
WRS
Welch's!
QuantileTest
WRS+Quantile
150 Bac^ round Samples
30 Site Samples
Back] round mean= 100
d
50 100 150
True Difference
200
Scenario 1 100 % contaminated
Scenario 1 100 % contaminated
O «? _
i *
!°'
I
CNI
ci
q
ci
WRS
-- Welch's!
- - Welch's log t
m Sample Mean
150 Bac^ round Samples
M Site Samples
Backg round mean= 100
£°
"o
-------
Scenario 2 50 % contaminated
Scenario 2 50 % contaminated
.y oq
o d
o
-------
Scenarios 20 % contaminated
Scenarios 20 % contaminated
rod
q
ci
WRS
Students!
Welch's!
150 Bac^ round Samples
30 Site Samples
Backg round mean= 100
50 100 150
True Difference
200
q _
c
Soo _
ro d ~
_Q
0
O
d ~
/' WRS
i* ' Sludentst
J Welch's!
/» ^
/' ^ ' 150 Back] round Samples
., 30 Sile Sampl es
f,^^ Back] round mean = 100
I I I 1 I
0 50 100 150 200
True Difference
Scenarios 20 % contaminated
Scenarios 20 % contaminated
q _
J oo
^ d ~
:= ^.
ro d ~
o
Q- t~\ ~
o
d ~
r
*.. WRS
Welch's!
* " " QuanlileTesl
* ^X^x ^V-T- WRS+Quanlile
f ^-^.T^ " 1 50 Bac^ round Samples
L//^. " " 30 Sile Sampl es
"^ ^ Bac^ round mean =100
q _
|oq_
^ d ~
:= ^
ro d ~
^- CNI
O _
r>- ~~
/ _ *^ ^^ "WRS
x f»,-' Welch's!
/ * QuanlileTesl
<^S m WRS+Quantile
' f^^ 1 50 Bac^ round Samples
1 ^^ M Site Sampl es
** Back] round mean = 100
50
100 150
True Difference
200
d
50 100 150
True Difference
200
Scenarios 20 % contaminated
Scenarios 20 % contaminated
q _
S°q _
^ Bac^ round mean = 100
q _
|oq_
-°
rod ~
o
^ ri ~
o
d ~
f
b
/» X
/''
//
/ _^^"
/-"^
r"-^^*"*" * * * ""***"
xx**
'» *
WRS
-- Welch's!
- - Welch' slog!
^ Sample Mean
__^.. ' 1 50 Bac^ round Samples
30 Sile Samples
Back] round mean= 100
50
100 150
True Difference
200
50
f
100 150
True Difference
200
Figure 22. Scenario 3. Performance of several statistical tests. 150 background samples, 30 site samples. Left:
Test Form 1. Right: Test Form 2.
33
-------
Scenario 4 10 % contaminated
Scenario 4 10 % contaminated
p _
c
o «-.
**~ CD
ro o ~
_Q
o
o _
o ~
*"*""'
, - * WRS
Students!
t* Welch's!
f* 1 50 Rack} round Samples
1 ^ JT- 30 Site Samples
^^" ** Background mean =100
0 50 100 150 200
q _
c
O rr*
5 "*
rod ~
o _
d ~
r ;.,,«
/ * *
' WRS
4* Sludentst
t. Welch's!
/'
/* 1 50 Backg round Samples
M Sile Sampl es
^-^-" " Backg round mean = 100
I I I 1 I
0 50 100 150 200
True Difference
True Difference
Scenario 4 10 % contaminated
Scenario 4 10 % contaminated
o
!» _
'B'P-
.t;
ro d ~
o
d -
WRS
Welch's!
^ ^ * QuanlileTesl
« *^ f _ - - WRS+Quanlile
^, "" 1 50 Bac^ round Samples
L^-^^;- "* "" " " ' M Site Sampl es
"i^ ^ Bac^ round mean =100
50 100 150
True Difference
200
I oo
q
d
WRS
Welch's!
QuanlileTesl
WRS+Quantile
150 Bac^ round Samples
M Site Samples
Back] round mean= 100
d
50 100 150
True Difference
200
Scenario 4 10 % contaminated
Scenario 4 10 % contaminated
q _
!» _
5 CD
2?°
~ ^
ro d ~
o
(t ^ -
o
d ~
* WRS
* Welch's!
S - - Welch's log!
f m Sample Mean
/ ^^^ _ -^ 1 fin RarkqrnunH Samples
t^^^'- 3° site Samples
^£-* " Bac^ round mean =100
q _
!» _
^CD
£rd
~ ^
ro d ~
£}
o
(t ^ -
o
d ~
r -------
,'~'r**r's"^'
/.3 WRS
y* Welch's!
// - - Welch's log!
, Sample Mean
/ / 1 50 Bac^ round Samples
/ MSile Samples
^* "*"' Bac^ round mean = 100
50
100 150
True Difference
200
f
50 100 150
True Difference
200
Figure 23. Scenario 4. Performance of several statistical tests. 150 background samples, 30 site samples. Left:
Test Form 1. Right: Test Form 2.
34
-------
Table 2. Performance of Two-Sample Comparison Tests
% Contaminated
100
50
20
10
100
50
20
10
100
50
20
10
100
50
20
10
Background N
30
30
30
30
150
150
150
150
30
30
30
30
150
150
150
150
-z.
32
w
30
30
30
30
150
150
150
150
150
150
150
150
30
30
30
30
i
0.046
140
0.057
0.76
0.054
0.3
0.05
0.14
0.049
40
0.047
60
0.052
0.81
0.058
0.4
0.049
90
0.046
0.94
0.047
0.43
0.048
0.18
0.049
80
0.049
0.91
0.048
0.43
0.047
0.21
5
w
0.055
130
0.057
0.82
0.067
0.42
0.056
0.21
0.083
40
0.087
60
0.08
140
0.084
0.74
0.074
80
0.07
160
0.073
0.73
0.07
0.38
0.082
80
0.092
170
0.088
0.71
0.087
0.43
Student's T_1
0.049
0.93
0.057
0.78
0.052
0.41
0.05
0.18
0.049
50
0.049
60
0.052
150
0.057
0.85
0.035
0.9
0.037
0.65
0.035
0.14
0.035
0.01
0.06
100
0.065
140
0.058
0.9
0.062
0.75
o
1
0.047
0.93
0.057
0.77
0.057
0.39
0.049
0.17
0.049
50
0.049
60
0.052
150
0.057
0.85
0.702
100
0.093
130
0.703
0.93
0.095
0.78
0.022
170
0.027
0.81
0.02
0.39
0.079
0.16
Quantile
0.073
0.55
0.07
0.47
0.077
0.25
0.077
0.11
0.044
90
0.045
100
0.04
170
0.038
0.68
0.033
0.81
0.029
0.78
0.033
0.61
0.029
0.3
0.049
170
0.059
0.93
0.054
0.65
0.054
0.38
O)
_o
_o
0.048
130
0.057
0.82
0.054
0.41
0.052
0.22
0.057
40
0.048
60
0.055
190
0.054
0.71
0.057
80
0.047
150
0.047
0.7
0.057
0.39
0.049
80
0.052
0.92
0.052
0.52
0.048
0.28
(/)'
0.043
100
0.05
0.94
0.052
0.53
0.046
0.26
0
100
0
0.93
0
0.04
0
0
0.007
90
0
160
0
0.3
0
0.03
0.073
110
0.076
0.89
0.073
0.38
0.072
0.13
W
0.048
100
0.058
0.94
0.058
0.57
0.052
0.3
0.042
80
0.047
100
0.044
170
0.038
0.69
0.036
80
0.027
140
0.027
0.68
0.034
0.33
0.067
100
0.06
180
0.066
0.68
0.058
0.38
Student's T_2
0.295
70
0.308
80
0.372
130
0.298
0.91
0.006
60
0.007
60
0.007
80
0.006
100
0.755
50
0.75
50
0.757
60
0.758
60
0.757
70
0.755
90
0.759
190
0.753
0.86
CN
O
1
0.296
70
0.308
80
0.373
130
0.3
0.91
0.006
60
0.007
60
0.007
80
0.006
100
0.08
50
0.077
50
0.083
70
0.079
80
0.202
80
0.799
90
0.206
160
0.796
0.89
Quantile
0.07
0.56
0.072
0.48
0.07
0.26
0.077
0.11
0.042
90
0.047
100
0.044
170
0.038
0.69
0.035
0.8
0.027
0.77
0.027
0.61
0.034
0.32
0.054
170
0.054
0.92
0.06
0.65
0.052
0.36
The first three columns identify a test case a combination of contaminated fraction and sample sizes
Each case has two rows of results:
1 . Fraction of unnecessary "action" decisions. (The target is 0.05)
2. True difference in population means above which the fraction of "action" decisions is >0.95
(the target difference is 50) ... or, if the result is a number less than 1 :
2. The maximum "action " fraction at a true difference of 200
Underscore: Row 1 results less than twice the target fraction
Bold: Row 2 results less than twice the target difference
Shaded: Both underscore and bold
"1 " or "2" in column heading indicates test form
O)
_o
_o
0.387
50
0.379
90
0.392
0.93
0.368
0.82
0.003
50
0.007
90
0.003
0.92
0.004
0.57
0.789
50
0.792
80
0.793
0.93
0.205
0.77
0.202
60
0.799
90
0.796
0.91
0.792
0.77
CD
CD
0.74
90
0.745
130
0.752
0.93
0.747
0.82
0.077
50
0.07
60
0.072
80
0.077
110
0.084
70
0.073
80
0.085
90
0.078
120
0.09
80
0.7
110
0.097
0.94
0.7
0.83
35
-------
Design Considerations
The purpose of this paper was to compare the performance of statistical tests, not sampling
designs. But perhaps the most important message to be gleaned from the results presented here
is that decision performance is determined by a combination of the sampling design and the
statistical test. If one wants to optimize a decision making process, then both factors must be
considered together.
To design sampling plans for actual sites, it is recommended that case-specific simulations be
performed, similar to those used in this report. Programs for running such simulations, or
"scripts" as they are called in the R language, are included in Appendix 1. These scripts will
permit a user to reproduce, within the limits of simulation variability, the results presented in this
paper, and to experiment with different sample sizes, alpha levels, etc.
A cautionary note to anyone wishing to use these scripts as a design tool: although the scenarios
used in the simulations are fairly realistic and present difficult sampling problems, they do not
cover the full range of difficulties that the real world has to offer. In particular, these scripts do
not provide for superimposing a regional anthropogenic background component over both site
and reference areas. Nor do they allow for any real differences between the site and reference
area background distributions due simply to natural variability; or for measurement errors or
non-detects. Choosing appropriate background and site scenarios is a critical part of developing
the "conceptual model" in the early stages of project planning.
The decision quality objectives used in this investigation (Figure 1) were totally arbitrary. They
are used only to provide reference points to assist comparisons of the statistical tests and should
not be considered an example to emulate at actual sites. Nevertheless, the objectives used here
provide a useful starting point for discussion when setting objectives for a real site. The
objectives from Figure 1 protect against missing an increase in mean concentration of 50% over
the background mean level. Is a 50% increase too high, so that we need to aim to detect a 25% or
even a 10% increase? Or is it too low, allowing us to relax the detection threshold to 100%,
200%, or more?
Test Recommendations
The tests recommended below generally performed well over the range of scenarios and
sampling schemes evaluated. The recommendations should be considered tentative, and
challenged when developing case-specific designs, especially if the design objectives or the site
scenarios differ significantly from those assumed in this paper.
Perhaps the best approach is to start with the recommended tests, and then experiment to try to
find a better alternative. Alternatives are not limited to the tests that have been evaluated in this
paper. Any of the tests evaluated here can be modified by choosing different values for test
parameters, such as alpha, significant difference, or action threshold. Many different
combination tests are possible other than the WRS - quantile combination tested here.
Use Student's t or Welch's t with Test Form 2
36
-------
Although the Wilcoxon Rank Sum test with Test Form 1 performs somewhat better than the t
tests with Test Form 2 in the specific case of 100% contamination, WRS performance roughly
equals the t tests at 50% contamination and rapidly becomes much less protective as the
contaminated fraction drops. The t tests are consistently protective over the range of scenarios.
If the site conceptual model suggests a real possibility that the site may be less than 50%
contaminated, then t tests are the safer choice. When site and background sample sizes are equal,
it makes no difference whether Student's t or Welch's t is used. Although the two tests may
differ for any particular sampling event, over many simulations their overall performances
become indistinguishable. There is a difference, however, when the sample sizes are unequal.
The Welch's t test is superior to the Student's t test when the site sample size is larger than the
background sample size and inferior when the site sample is smaller. (The opposite relationship
holds for Test Form 1, but Test Form 1 is not recommended).
Consider composite sampling with a sample means test.
The sample means test did not perform as well as the t tests, but it was not far behind. The only
statistic used is the sample mean, which can be obtained as easily by analyzing a composite
sample as by averaging analyses of individual samples. This has considerable potential for
reducing analytical costs when the required sample size is large and there are many target
analytes requiring individual and costly analyses.
Observations and Discussion
Quantile test results are shown in Table 2 for both test forms, though the test was run the same
way in both cases. Comparing the two columns provides an indication of the precision of the
simulation process.
Within the limits of the precision of the simulation method, the WRS test and the Welch's log t
test achieved the specified alpha requirement for Test Form 1 over all scenarios and sample
sizes. This is as expected because when there is no difference between the site and background
distributions, the assumptions for the WRS test are met; and when there is no difference between
the distributions and they are log-normal, the assumptions of the Welch's log t test are met.
Unfortunately this is of limited practical value for two reasons. Real reference area populations
are inevitably different from true site background populations, so actual decision performance
can differ from theoretical performance. More importantly, neither test performs well for Test
Form 1 on scenarios 3 and 4 (20% and 10% contamination, respectively).
The WRS test fails completely for Test Form 2 on scenarios 3 and 4 when sample sizes are large.
This is a rather unusual case where test performance actually gets worse with more data. This
happens because Test form 2 subtracts 50 from each site sample. For example, in Figure 7c, the
downshift would move approximately 30% of the site population below the background
population, while only 10% of the site population was shifted upward by contamination. With
enough data, the ranks will always show this net downward shift, causing rejection of the null.
With few data, the test sometimes gets it wrong, which ironically gives the appearance of better
performance with respect to our mean-oriented objectives.
The quantile test is combined with the WRS to address the WRS insensitivity to shifts in the
upper tail. The quantile test is also a rank test, but looks only at the upper tail. However, neither
the quantile test by itself, nor the quantile test in combination with the WRS, performed
37
-------
consistently well enough to be recommended. This conclusion only applies to the particular
combination of parameters used in this paper. For Test Form 1, the WRS test parameters were:
alpha = 0.05 and significant difference = 0; the quantile test parameters were: alpha = 0.05 and
quantile = 0.9 (meaning that the test looks only at the upper 10% tail.) For Test Form 2, the
significant difference was 50 while the other three parameters did not change. As suggested
earlier, all four of the parameters could be varied, and it would be a substantial project to search
for the optimal combination. With the right set of parameters, a combination of the WRS test and
the quantile test might prove to be the best overall performer, but he parameter sets used in this
paper are not it.
The Welch's log t test performed similarly to the WRS test for Test Form 1. For those specific
scenarios where the WRS test with Test Form 1 was equal to or better than the untransformed t
tests with Test Form 2, the Welch's log t test equaled or slightly outperformed the WRS test.
References
Blair, C.R., and Higgins, J. J, 1980. A comparison of the Power ofWilcoxon 's Rank-Sum Statistic
to That of Student's t Statistic under Various Nonnormal Distribution. JEduc Statistics, 5:4, 309-
335.
Bridge, PD., and Sawilowsky, S.S.,1999, Increasing Physicians' Awareness of the Impact of
Statistics on Research Outcomes: Comparative Power of the t-test and Wilcoxon Rank-Sum Test
in Small Samples Applied Research. J Clin Epidemiol 52:3, 229-235.
Hodges, J.L. Jr. and Lehman, E.L., 1956. The Efficiency of some Nonparametric Competitors of
the "t"-Test. Annals Math Statistics 27:2,324-335.
Modarres, R., Gastwirth, J.L. and Ewens, W., 2005.^4 cautionary note on the use of non-
parametric tests in the Analysis of Environmental Data. Environmetrics 16, 319-326.
Potvin, C. and Roff, D.A., 1993. Distribution-Free and Robust Statistical Methods: Viable
Alternatives to Parametric Statistics? Ecology 74:6, 1617-1628.
R Development Core Team , 2005. R: A language and environment for statistical computing. R
Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-
project.orq.
Shacklette, H.T. and Boerngen, J.G., 1984. Element Concentrations in Soils and Other Surficial
Materials of the Conterminous United States, U.S.G.S. Prof. Paper 1270.
Singh, A.K., Singh, A. and Engelhardt, M., 1997. The LognormalDistribution In Environmental
Applications, EPA/600/R-97/006.
U.S.EPA, 2002. Guidance for Comparing Background and Chemical Concentrations in Soil for
CERCLA Sites, EPA 540-R-01-003, OSWER 9285.7-41.
U.S.EPA, 1992. Statistical Methods for Evaluating the Attainment of Cleanup Standards,
Volume 3: Reference-Based Standards for Soils and Solid Media, EPA 230-R-94-004.
38
-------
U.S.EPA, 2006. Data Quality Assessment: Statistical Methods for Practitioners, EPA QA/G-9S,
EPA/240/B-06/003.
Zhou, X-H., Gao, S. and Hui, S.L.,1997. Methods for Comparing the Means of Two Independent
Log-Normal Samples. Biometrics 53, 1129-1135.
Notice: The information in this document has been prepared by the United States Environmental
Protection Agency. It has been subjected to the Agency's peer and administrative review and has been
approved for publication as an EPA document. Mention of trade names or commercial products
does not constitute endorsement or recommendation for use.
APPENDIX 1 - Evaluating statistical test performance with R
R programs for simulating statistical test performance are included in the next three appendices.
The scripts presented here have been somewhat modified from the original that was used to
produce the results in this report: loops were removed that produced six graphs on a single
figure, and that ran all four scenarios in a single long run. Parameters that can be modified by the
user were moved to a separate file so that they can be more easily located and changed without
inadvertently altering the operational code. Graphical output has been reduced to a single plot;
the user can select any or all of the performance curves to be plotted. Text previously written
inside the performance plot has been moved outside. Performance results for all tests are
automatically written to an output file, along with a copy of the parameter file used during the
run. The user has been given the option of changing the arithmetic mean and the log standard
deviation of the background population, and changing the log standard deviation of the
contaminant population. Numerous comments have been added to the scripts.
The instructions below are not intended as an R tutorial. The intent is that an interested reader
who is not an R user (and who has no desire to become one) will be able to run simulations and
to be able to experiment with different input parameters. Although it would be possible to
perform a rudimentary trial-and-error design optimization using these scripts, the scripts
themselves have not been subjected to independent validation, and should be used with
appropriate caution.
Instructions
Download the R Windows installation program from: http://www.R-project.org. (The current
version at this writing is R-2.4.1-Win32.exe.)
Run the install program.
Create a work folder and copy the scripts below into the folder (use the file names in italics, e.g.,
parameters.r. Make copies in another folder as backup files.
Run R. R will open with a main (RGui) window and a console window:
39
-------
Ffc E* to* P-KtJWSS W«iew£ Htfe
R version 2.4.0 (20Q&-1G-03)
Copyright (C) 20-06 Tfe* R Foundation fa* Statistical Computing
ISBH 3-900051-07-0
R is feee Sft£t«ae* and cones with ABSOLUTELY 110 BARHJBfTY.
Ya-u are velcone ta redistribute it under eertalft e-oftditiftftfl.
Type ' licensed ' oe ' licence () ' foe distribution details.
K&tural laaguag-e
but
in aft English,
R is a collaborative project vith natty contributors.
Type ' eemtrilxitors 0 ' for wore inCortnation aatt
'eltationO ' ott faa-a to cite R or R p-atkages in- publications.
Type 'denoo1 f*K s«ne desnos, 'help()' for on-line help, or
1 help. start [) ' for an HTKL bEewser interface to help.
Type ' on the toolbar, then on the dropdown menu:
Load Workspace..,
Save Workspace.,.
Load History,..
Save History,,,
Change dir
:e and comes with ABSOLUTELY HO WARRANTY.
,o redistribute it under certain conditions.
or 'licence!)' for distribution details.
Edit Misc Packages Windows Help
Source R code,,.
New script
Open script..,
Display file(s)...
(2006-10-03)
D6 The R Foundation for Statistical Computing
-0
ge support but running in an English locale
Print.,,
Save to File,,.
Exit
'1TTS^r"c13"lITantrraT:ive project with many contributors.
Type 'contributors()' for more information and
'citation!)' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'helpj)' for on-line help, or
'help.start()' for an HTHL browser interface to help.
Type 'q(J ' to quit R.
[Previously saved workspace restored]
I
R 2,4,0 - A Language and Environment
40
-------
Browse and select your working directory (folder):
lange working directory to;
Natural language support but
R is a collaborative project
Type 'demo()' for some demos, 'help(j ' for on-line help, or
1 help.start ()' for an HTHL browser interface to help.
Type 'q()' to quit R,
[Previously saved workspace restored]
R 2,4.0 - A Language and Environment
On the R console window, type the command source("stat test perf.r"):
41
-------
File Edit Hisc Packages Windows Help
o
R version 2.4.0 (2006-10-03)
Copyright (C) 2006 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions
Type 'license!)' °r 'licenced' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citationf)' on how to cite R or R packages in publications
Type 'demon1 f°r some demos, 'nelpO' for on-line help, or
1 help.start() ' for an HTHL browser interface to help.
Type ' q() ' to quit R.
[Previously saved workspace restored]
> source("stat test perf.t")!
? 2,4,0 - A Language and Environment
R is a "command line interpreter" similar to Basic, rather than a compiled language, like Fortran
or C++. When you hit , R will execute the source command. R then reads and executes
the R commands in the stat testperf.r script, which in turn reads and executes the R commands
in theparameters.r and sort.data.frame.r script files. The parameters.r file contains the input
parameters that control the simulation and statistical tests. These parameters are kept in a
separate script file to make it easy for a user to make changes.
When R has finished, it will display a performance plot in a new graphics window, and print
output on the console:
42
-------
File History Resize Windows
urce("stat test perf.r:
[1] "Perf data 100 200 10
Diff WRS Helen Student
0 0.00 0.079 0.089
10 0.00 0.347
20 0.00 0.584
30 0.00 0.772
40 0.00 0.901
50 0.00 0.960
60 0.00 0.970
70 0.00 0.970
quant lie ₯RS.qu
lie Welch.log sample.mean
0.089
0.307
0.396
0.465
0.604
0.653
0.653
0.703
0.733
0.733
0.752
0.762
0.792
0.802
0.802
0.802
0.812
.832
0.851
0.861
0.871
0.376
0.644
0.842
0.921
0.970
0.970
0.990
80 0.00
90 0.01
100 0.02
110 0.02
120 0.02
130 0.02
140 0.02
150 0.02
160 0.02
170 0.02
180 0.02
190 0.02
200 0.02
WRS
Student's!
Welch's!
QuantileTest
WRS+Quantile
Welch's log t
Sample Mean
Warning messages:
1: appending column names
100 Background Samples
200 Site Samples
Background mean = 100
to file in: writ
50 100 150
True Difference
The graphics window can be saved in any of six common graphics formats.
The first line on the console after the source command is a file name into which R has written
the output data, as well as a copy of the input parameters. The output file is a comma separated
values file (.csv) that can be opened directly by a spreadsheet program. The file name is
generated by the script, and includes the background sample size, site sample size, contamination
percentage, and the date and time to make the name unique.
The data table on the console contains the data shown in the performance plot. Plotting the Diff
column on the x axis versus any other column on the y axis will reproduce one of the
performance curves.
The warning messages can be ignored. R is explaining - not very clearly - that it reformatted the
data table in order to create the csv format output file.
Note that the sample sizes in the example above are not the same as any of those in the body of
this report. They illustrate a step in the author's own rudimentary trial-and-error optimization
attempt. Compare these results with Figure 15. This example uses the same total number of
samples, but improves the performance of the t tests by taking fewer background and more site
samples. Although the performance targets are not changed, the tests were "tweaked" by
lowering the significant difference value ("mu" in the parameter file) from 50 to 35. This has the
43
-------
effect of shifting the performance curve a little to the left. Searching for a better design is left as
an exercise for the reader.
The screen shot below shows the output file after being opened in a spreadsheet.
BaHior osoft-Excel - Perf Jata j 00 200 10.-s Jan 30 % ia&;>asv
'* ''"*'^"'1 :-'' ""
ilj pile Edit y
J ^ :A
Insert Format Tools Data Window Help
T\ L>; .1 gu-:jCiui. ;.,
: 10 .IB U
" _ fi1 X
A1
Diff
7__
J_
_EI_
10
11
.11
_13
li
15.
17
J8_
J9
20"
2f
22
Zf
24
25.
26.
27
28"
29
W
1L
3_2_
33
A
Diff
B
WRS
0
10
20
30
40
50
60
70
80
90
100
110
120
130
140
150
160
170
180
190
200
0
0
0
0
0
0
0
0
0
0.01
0.02
0.02
0.02
0.02
0.02
0.02
0.02
0.02
0.02
0.02
0.02
Welch
0.079
0.347
0.584
0.772
0.901
0.96
0.97
0.97
1
1
1
Student
0.089
0.376
0.644
0.842
0.921
0.97
0.97
0.99
1
1
1
1
1
1
1
1
1
1
1
1
1
_E_
quantile
0.01
0.059
0.129
0.168
0.198
0.238
0.287
0.347
0.366
0.406
0.436
0.455
0.495
0.525
0.535
0.584
0.584
0.594
0.604
0.604
0.604
H
,
WRS.quarWelch.log sample.mean
0.01
0.059
0.129
0.168
0.198
0.238
0.287
0.347
0.366
0.406
0.436
0.455
0.495
0.525
0.535
0.584
0.584
0.594
0.604
0.604
0.604
Input.Parameters
bsamsize=100 ## Background sample size
ssamsize=200 ##Site sample size
frac= .10 ## Contaminated fraction between 0 and 1
nreps=101 ## Number of sampling replications
##Test parameters
TestForm=2 ##Test Form
alpha=0.05 ## Alpha value for tests that use it
mu=35 ## Significant difference value for tests that use it
b1=.90 ## Quantile for quantile test
al=20 ## Threshold for Sample Means test
##Do you want to plot the following tests? TRUE or FALSE
> M \Perf data 100 200 10Jan 30 ./ J <
089
307
396
465
604
653
653
703
733
733
752
762
792
802
802
802
812
832
851
861
871
0.05
0.248
0.406
0.574
0.762
0.861
0.911
0.95
0.97
0.99
0.99
1
1
1
1
1
1
1
1
1
1
Ready
To change parameters, open the parameter:s.r file using Notepad. Edit the parameter values as
desired, being careful to change only the numerical values or the TRUE-FALSE logical values.
44
-------
Save the revised file under the same file name, return to the R console, and execute the
source("stat test perf.r") command again.
If the scripts fail to run properly, replace them with copies of your backup scripts and try again.
Note. Each time the scripts are executed, they open another graphics window and add another
output file to the working directory. Periodic housekeeping is necessary.
APPENDIX 2 - Script file: parameters.r
########### Begin Script ##############################
## parameters for statistical test performance simulation
bsamsize=100 ## Background sample size
ssamsize=200 ## Site sample size
frac= .10 ## Contaminated fraction between 0 and 1
nreps=101 ## Number of sampling replications
## Test parameters
TestForm=2 ## Test Form
alpha=0.05 ## Alpha value for tests that use it
mu=35 ## Significant difference value for tests that use it
bl=.90 ## Quantile for quantile test
al=20 ## Threshold for Sample Means test
## Do you want to plot the following tests? TRUE or FALSE
plotWRS=TRUE
plotStudentst=TRUE
plotWelchst=TRUE
plotQuantile=TRUE
plotWRSplusQuantile=TRUE
plotWelchslogt=TRUE
plotSampleMean=TRUE
# WRS
# Students t
# Welch's t
# Quantile
# WRS plus Quantile
# Welch's log t
# Sample mean
## DQO targets (just for putting the ticks on, not used in tests)
lft=0 # left bound of the gray region
rgt=50 # right bound of the gray region
bot=.05 # the alpha you want for test form 1
top=.95 # l-(the alpha you want) for test form 2
## Defining the population parameters
bsd=0.8 # background log standard deviation (natural logs)
bmean=100 # background arithmetic mean
csd=l.5 # contaminant log standard deviation
## Defining the True Difference points for the performance curve
## (step=10 and numsteps=21 calculates 21 points: 0, 10, 20, . . . , 200)
step=10 ##
numsteps=21 ##
########### End Script ##########################
APPENDIX 3 - Script file: stat test perf.r
########### Begin Script ##############################
## Evaluate performance of several statistical tests
##
## get the parameters and a sort utility script
source("parameters.r")
source("sort.data.frame.r")
## initialize variables
45
-------
stpvalue=numeric()
wtpvalue=numeric()
wpvalue=numeric()
probcleanw=numeric()
probcleanst=numeric()
probcleanwt=numeric()
qpvalue=numeric()
probcleanq=numeric()
cpvalue=numeric()
probcleanc=numeric()
lpvalue=numeric()
mpvalue=numeric()
probcleanl=numeric()
probcleanm=numeric()
diff=numeric()
bsam=numeric(nreps*bsamsize)
sbsam=numeric(nreps*ssamsize)
ssam=numeric(nreps*ssamsize)
csam=numeric(nreps*ssamsize)
## open the graphics window and set the layout
windows(width=5,height=3)
mx=c(1,1,1,2,2,1,1,1,2,2)
layout(matrix(mx, 2, 5, byrow = TRUE))
par(mgp=c(2, 1, 0) ) ## reset at end to 3,1,0
## preliminary calculation for the quantile test
qnsite=qn=ssamsize
qmbkg=qm=bsamsize
qc=qmbkg+qnsite-floor((qmbkg+qnsite-1)*bl)-1
## set test parameters for thr R test functions
alt="l"
if(TestForm==l)alt="g"
mew=0
if(alt=="l")mew=mu
## create sets of background samples; set to specified mean
bsam=exp(rnorm(nreps*bsamsize,0,bsd))
bsam=bsam/mean(bsam)
bsam=bsam*bmean
dim(bsam)=c(nreps,bsamsize)
## create sets of site background samples before contamination
sbsam=exp(rnorm(nreps*ssamsize,0,bsd))
sbsam=sbsam/mean(sbsam)
sbsam=sbsam*bmean
dim(sbsam)=c(nreps,ssamsize)
## create sets of contanimant values
csam=exp(rnorm(nreps*ssamsize,0,csd))
csam[1:((1-frac)*length(csam))]=0 ## zero out the uncontaminated fraction
csam=csam/mean(csam) ## make site contaminant mean = 1.0
csam=sample(csam) ## shuffle the data
dim(csam)=c(nreps,ssamsize)
## calculate fraction of samplings that result in a cleanup decision
for (m in (1:numsteps)) # test a range of differences between site and
background
{
## add site-related contaminant to background
diff[m]=step*(m-1)
ssam=sbsam+csam*diff[m]
for(i in l:nreps)
46
-------
## generate p-values for the tests
## WRS
wout=wilcox . test ( ssam[i, ] , bsam[i, ] , alternative=alt,mu=mew)
wpvalue [ i ] =wout$p . value
## Welch's t
wtout=t . test ( ssam[i, ] , bsam[i, ] , alternative=alt,mu=mew)
wtpvalue [ i ] =wtout$p . value
## Student's t
stout=t .test (ssam[i, ] ,bsam[i, ] , alternative=alt,mu=mew, var . equal=TRUE)
stp value [i] =stout$p. value
## quantile test
qsam=c (bsam[i, ] , ssam[i, ] )
code=c ( rep ( 0 , qm) , rep ( 1 , qn ) )
first=length (code) -qc+1
last=length (code)
qdata=data . frame ( code, qsam)
qdata=sort . data . frame (~qsam, qdata)
qs=sum (qdata$code [ first : last] )
qmew= (qn*qc) / (qm+qn)
qsigma=sqrt (qn* (qc/ (qm+qn) ) * (1- (qc/ (qm+qn) ) ) * (qm/ ( qm+qn- 1 ) ) )
qp value [i] =l-pnorm ( (qs-0 . 5-qmew) /qsigma)
## combined quantile and WRS
if (alt=="g")
cpvalue [i] =1
if (wpvalue [i] alpha)
if (qpvalue [i] al ) {mpvalue [i] =0 }
} ## next i
## rejection probability (equals action probability for test form 1)
probcleanw [m] =length (wpvalue [wpvalue< (alpha) ] ) /length (wpvalue) #WRS
probcleanst [m] =length (stpvalue [stpvalue< (alpha) ] ) /length (stpvalue) #Students
probcleanwt [m] =length (wtpvalue [wtpvalue< (alpha) ] ) /length (wtpvalue) #Welchs t
probcleanc[m]=length(cpvalue[cpvalue<(alpha
quantile
probcleanq[m]=length(qpvalue[qpvalue<(alpha
probcleanl[m]=length(Ipvalue[lpvalue<(alpha
probcleanm[m]=length(mpvalue[mpvalue<(alpha
}# next m diff step
])/length(cpvalue) #combo: WRS &
/length(qpvalue)
/length(Ipvalue)
/length(mpvalue)
#quantile test
#log Welch's t
#mean
47
-------
md=max(diff)
## 1-rejection probability needed for test form 2
if(alt=="l")
{
probcleanst=l-probcleanst
probcleanwt=l-probcleanwt
probcleanw=l-probcleanw
probcleanc=l-probcleanc
probcleanl=l-probcleanl
}
## plot empty performance diagram with title
plot(diff,probcleanw,xlim=range(0,diff[m] ) , type="n", lwd=l,
ylim=c(0,1.1),cex=.4,xlab="",ylab="")
title(main=paste((frac*100),"% contaminated"," Test
Form",TestForm),cex.main=l,
xlab="True Difference",
ylab="Probability of Action")
## plot gray region tick marks
if(lft!=rgt)
{
cp=c("black","gray")
if(alt=="l"){cp=c("gray","black")}
lines(c(1ft-.0625*md,Ift,Ift),c(bot,bot,bot+.1) , col=cp[1] , lwd=2)
lines(c(rgt,rgt,rgt+.0625*md),c(top-.1,top,top),col=cp[2],lwd=2)
}
## performance curves
if(plotWRS==TRUE)lines(diff,probcleanw,col="black",lwd=l)
if(plotWelchst==TRUE)lines(diff, probcleanwt, col="blue" , lty=2)
if(plotStudentst==TRUE)lines(diff,probcleanst,col="green",lty=3,lwd=2)
if(plotQuantile==TRUE)lines(diff,probcleanq,lwd=2,col="orange", lty=4)
if(plotWRSplusQuantile==TRUE)lines(diff,probcleanc,col="brown",lty=3,lwd=3)
if(plotWelchslogt==TRUE)lines(diff,probcleanl,lwd=2 , col="purple" , lty=4)
if(plotSampleMean==TRUE)lines(diff,probcleanm,col="red",lty=3,lwd=3)
## plot legend
plot(0:1,0:1,type="n",xaxt="n",yaxt="n",xlab="",ylab="",axes=FALSE)
legend(0,1.05,
legend=c("WRS","Student's t","Welch's t","Quantile Test",
"WRS+Quantile","Welch's log t","Sample Mean"),
lwd=c(l,2,l,2,3,2,3) ,
col=c("black","green","blue","orange","brown","purple","red") ,
bty="n",cex=l.l,lty=c(l,3,2,4,3,4,3) )
text(.5,.42,paste(bsamsize,"Background Samples"))
text(.5,.34,paste(ssamsize,"Site Samples"))
text(.5, .26, paste("Background mean =",bmean))
## reset graphics parameters and layout
par(mgp=c(3,1,0))
layout(c(1,1))
## prepare data for output
filename=paste("Perf data",bsamsize,ssamsize,frac*100,"
",format(Sys.time(),"%b %d %H%M"),."csv")
pw=round(probcleanw,3)
pwt=round(probcleanwt,3)
pst=round(probcleanst,3)
pq=round(probcleanq,3)
pc=round(probcleanc,3)
48
-------
pl=round(probcleanl, 3)
pm=round(probcleanm,3)
out=data.frame(diff,pw,pwt,pst,pq,pc,pi,pm)
names(out)=c("Diff","WRS","Welch","Student","quantile","WRS.quantile","Welch.
log", "sample.mean")
print(filename)
print(out)
a=read.delim("parameters.r",sep=" ",header=TRUE)
a=data.frame(a[,1])
names(a)="Input.Parameters"
## Write output file
write.csv(out,filename,append=TRUE,row.names=FALSE)
write.csv(a,filename,append=TRUE,row.names=FALSE)
########### End Script ##########################
APPENDIX 4 - Script file: sort.data.frame.r.
This script was written by Kevin Wright and kindly made available on the internet
(http://tolstov.newcastle.edu.aU/R/help/04/09/4300.html). Mr. Wright's script has been
evaluated by the author, who is solely responsible for its validity in this application.
This script looks different than the scripts in the two previous appendices because R recognizes
two alternative symbols as the assignment operator: "=" and "<-." The command a = b + c has
the identical meaning as a <- b + c, namely, to replace the value of the variable (a) to the left of
the assignment operator symbol with the result obtained by evaluating the expression (b+c) to
the right of the symbol. The choice of symbol is just a matter of personal preference.
########### Begin Script ##############################
sort.data.frame <- function(form,dat){
# Author: Kevin Wright
# Some ideas from Andy Liaw
# http://tolstoy.newcastle.edu.au/R/help/04/07/1076.html
# Use + for ascending, - for decending.
# Sorting is left to right in the formula
# Useage is either of the following:
# sort.data.frame(~Block-Variety, Oats)
# sort.data.frame(Oats,--Variety+Block)
# If dat is the formula, then switch form and dat
if(inherits(dat,"formula")){
f=dat
dat=form
form=f
}
if(form[[1]] != "~")
stop("Formula must be one-sided.")
# Make the formula into character and remove spaces
forme <- as.character(form[2])
forme <- gsub(" ","",forme)
49
-------
# If the first character is not + or -, add +
if(!is.element(substring(forme,1,1),c("+","-")))
forme <- paste("+",forme,sep="")
# Extract the variables from the formula
vars <- unlist(strsplit(forme, "[\\+\\-]"))
vars <- vars[vars!=""] # Remove spurious "" terms
# Build a list of arguments to pass to "order" function
calllist <- list()
pos=l # Position of + or -
for(i in 1 : length (vars ) ) {
varsign <- substring ( forme, pos , pos )
pos <- pos+1+nchar ( vars [i] )
if (is . factor (dat [, vars [i] ] ) ) {
if ( varsign=="-" )
calllist [ [i] ] <- -rank (dat [, vars [i] ])
else
calllist [ [i] ] <- rank (dat [, vars [i] ])
}
else {
if ( varsign=="-" )
calllist [ [i] ] <- -dat [, vars [i] ]
else
calllist [ [i] ] <- dat [, vars [i] ]
dat [do. call ( "order" , calllist ) , ]
d = data.frame(b=factor(c("Hi","Med","Hi","Low"),levels=c("Low","Med","Hi"),
ordered=TRUE),
x=c("A","D","A","C"),y=c(8,3,9,9) , z=c(1, 1, 1, 2) )
sort.data.frame(~-z-b,d)
sort.data.frame(~x+y+z,d)
sort.data.frame(~-x+y+z,d)
sort.data.frame(d,~x-y+z)
########### End Script ##########################
50
-------
-------
&EPA
United States
Environmental Protection
Agency
Office of Research
and Development (8101R)
Washington, DC 20460
Official Business
Penalty for Private Use
$300
EPA/600/R-07/020
March 2007
www.epa.gov
Please make all necessary changes on the below label,
detach or copy, and return to the address in the upper
left-hand corner.
If you do not wish to receive these reports CHECK HERE D;
detach, or copy this cover, and return to the address in the
upper left-hand corner.
PRESORTED STANDARD
POSTAGE & FEES PAID
EPA
PERMIT No. G-35
Recycled/Recyclable
Printed with vegetable-based ink on
paper that contains a minimum of
50% post-consumer fiber content
processed chlorine free
-------
|