United States
 Environmental Protection
 Agency
Office of
Research and
Development
Office of Solid
Waste and
Emergency
Response
EPA/600/R-97/006
December 1997
 Technology  Support Center  Issue

THE LOGNORMAL DISTRIBUTION  IN
ENVIRONMENTAL APPLICATIONS
Ashok K. Singh1, Anita Singh2, and Max Engelhardt3
  The  Technology  Support  Projects,
Technology Support Center (TSC) for
Monitoring and Site Characterization was
established  in  1987 as  a result of an
agreement between the Office of Research
and Development (ORD), the Office of
Solid  Waste and Emergency Response
(OSWER) and  all ten Regional  Offices.
The objectives of the Technology Support
Project and the TSC were to make available
and provide ORD's state-of-the-science
contaminant characterization
technologies  and expertise  to Regional
staff,   facilitate  the  evaluation   and
application   of  site  characterization
technologies at Superfund and RCRA sites,
and to improve  communications between
Regions and ORD Laboratories. The TSC
identified a need to provide federal, state,
and  private  environmental  scientists
working on hazardous waste sites with a
technical  issue  paper that identifies  data
assessment  applications  that   can  be
implemented to better define and identify
the  distribution of hazardous waste site
contaminants. The examples given in this
Issue  paper  and the recommendations
provided were the result of numerous data
assessment approaches performed by the
TSC at hazardous waste sites.  Mr. John
Nocerino  provided  guidance   and
suggestions  that  greatly  enhanced the
quality of this Issue Paper.
             This paper was prepared by A. K. Singh,
           A. Singh, and M. Engelhardt. Support for
           this project was  provided by the EPA
           National Exposure Research Laboratory's
           Environmental Sciences Division with the
           assistance  of  the  Superfund Technical
           Support Projects  Technology  Support
           Center  for  Monitoring  and   Site
           Characterization, OSWER's Technology
           Innovation Office, the  U.S. DOE Idaho
           National Engineering and Environmental
           Laboratory, and the Associated Western
           Universities Faculty Fellowship Program.
           For further  information,  contact  Ken
           Brown,  Technology   Support  Center
           Director, at (702) 798-2270, A. K. Singh at
           (702) 895-0364,  A. Singh  at (702)  897-
           3234, or M. Engelhardt at (208) 526-2100.

           PURPOSE AND SCOPE

             The purpose of this issue paper is to
           provide  guidance  to  environmental
           scientists regarding the interpretation and
           statistical  assessment  of data collected
           from sites contaminated with inorganic and
           organic  contaminants.    Contaminant
           concentration data from sites quite often
           appear  to  follow a skewed probability
           distribution. The lognormal distribution is
           frequently used to model positively skewed
           contaminant concentration distributions.
           The H-statistic based Upper Confidence
           Limit (UCL) for the mean of a lognormal
1Department of Mathematics, University of Nevada, Las Vegas, NV 89154
2Lockheed Martin Environmental Systems & Technologies, 980 Kelly Johnson Dr., Las Vegas, NV 89119
3Lockheed Martin Idaho Technologies, P.O. Box 1625, Idaho Falls, ID 83415-3730
 Technology Support Center for
 Monitoring and Site Characterization,
 National Exposure Research Laboratory
 Environmental Sciences Division
 Las Vegas, NV 89193-3478
            Technology Innovation Office
            Office of Solid Waste and Emergency Response,
            U.S. EPA, Washington, D.C.

            Walter W. Kovalick, Jr., Ph.D., Director


-------
population  is  recommended  by  U.S.  EPA
guidance  documents  (see, for example,  EPA
(1992)) and is widely used to make remediation
decisions  at Superfund sites.  However, recent
work in environmental statistics has cast some
doubts on the performance of the formula based
on  the H-statistic  for  computing  an  upper
confidence limit  of the  mean  of a lognormal
population. This issue paper is mainly concerned
with  the   problem  of  computing  an  upper
confidence  limit  when  the  contaminant
concentration distribution appears to be highly
skewed.

  Several   approaches  to  computing  upper
confidence limits for the  mean of a lognormal
population are considered.   The  approaches
discussed  include those based on the H-statistic,
the jackknife method, the bootstrap method, and
a method  based on the  Chebychev inequality.
Simulated examples show that for values of the
coefficient of variation larger than 1, the upper
confidence  limits  for the mean contaminant
concentration based on the H-statistic are much
higher than the upper confidence limits obtained
by the other estimation methods. This may result
in an unnecessary cleanup. In other words, the
use of the jackknife method, the bootstrap method,
or the Chebychev  inequality method provides
better input to the risk assessors and may result in
a significant reduction in remediation costs. This
is especially true when the number of samples is
thirty or less. When the value of the coefficient of
variation exceeds 1, upper confidence limits based
on any of the other estimation procedures appear
to be more stable and reliable than those based on
the H-statistic.   Values  of the coefficient of
variation computed from observed contaminant
concentrations are  typically  used by environ-
mental scientists to assess the normality of the
population distribution.  In this issue paper, the
issue  of using the coefficient of variation in
environmental data analysis is addressed and the
problem of estimating the coefficient of variation,
when sampling from lognormal populations, is
also discussed.

  This issue paper is divided  into the following
major  sections:   (1)  Introduction,  (2)  the
Lognormal   Distribution,   (3)   Methods  of
Computing a UCL of the Mean, (4) Examples,
and (5) Summary and Recommendations.

1. INTRODUCTION

  Most  of  the  procedures  available  in the
literature   of  environmental   statistics  for
computing a UCL of the mean of a population
assume that contaminant concentration data are
approximately normally distributed. However, the
distributions of contaminant concentration data
from  Superfund sites  typically  are positively
skewed and are usually modeled by the lognormal
distribution.  This apparent skewness, however,
may  be  due to  biased  sampling,  multiple
populations, or outliers, and not necessarily due to
lognormally distributed data.

  Biased sampling is often used in sampling for
site characterization (Power, 1992).   Another
common   situation   often   present   with
environmental data is  a mixed  distribution of
several subpopulations (see  Figure 1). Also, the
presence  of one  or more outliers,  spurious
observations, or anomalies can result in a data set
which appears to come from a  highly skewed
distribution.   When  dealing  with  a skewed
distribution, statisticians sometimes  recommend
using the population median (instead of the
population  mean)  as  a  measure   of central
tendency.  However, remediation decisions at a
Figure 1   A site with several sources of
          contamination. Map legend: elevated levels of
          contaminant concentration; moderately high levels
          of contamination; extremely high levels of
          contamination; background.

-------
polluted site typically are made on the basis of the
population mean, and therefore a UCL of the mean
of the concentration distribution is needed.  For
positively skewed distributions, the median is
smaller than the mean; therefore, a UCL for the
median provides an inappropriate basis for a
decision about the mean.

  U.S. EPA guidance documents recommend the
use of H-statistics to compute the  UCL of the
mean of a lognormal distribution (EPA, 1992). A
detailed discussion of H-statistics  is given in
Gilbert (1987).  For data sets  with nondetects,
estimation methods developed for censored data
from a lognormal distribution are discussed by
Lecher  (1991).    The use  of the lognormal
distribution has been controversial because it can
lead to incorrect decisions.  For example, recent
work of Gilbert (1993) indicates that statistical
tests of hypotheses based on H-statistics can yield
unusually high false positives, which would result
in an unnecessary cleanup.  The situation may be
reversed  when dealing with estimation of the
mean background level. If the H-statistic based
method is used to compute a UCL of the mean for
the observed background concentrations, then the
mean of the background  level may be  over-
estimated, which may result in not remediating a
contaminated area of the site. Stewart (1994) also
showed that the incorrect usage of a lognormal
distribution may lead to erroneous results.

  Most of the "classical" statistical methods based
on  the normal  distribution  were developed
between  1920  and 1950 and  have been well
investigated in the statistical literature.  On the
other hand, lognormal-based methods have not
received the same level of scrutiny.  Furthermore,
the classical methods became popular due to their
computational convenience.  The  1980s have
produced a new breed of statistical methods based
on the power and availability of computers (see,
for example, Efron and Gong,  1983). Both the
jackknife and bootstrap methods require a great
deal of computer power, and, therefore, have not
been   widely  adopted  by   environmental
statisticians. However, with the recent advances in
computer   equipment   and   software,
computationally intensive statistical procedures
have become more practical and accessible.
  The  authors  of this  article  have critically
reviewed several estimation procedures which can
be used to compute UCL values via Monte Carlo
simulation. These include the simple arithmetic
mean, the Minimum Variance Unbiased Estimate
(MVUE), and nonparametric procedures such as
the jackknife  and  the  bootstrap  procedures.
Computer simulation experiments (not included in
this paper) have  been  performed  for  various
values of the population standard deviation,  or
equivalently the Coefficient of Variation (CV),
and sample sizes ranging from 10 to  101. It has
been demonstrated that for samples of size 30 or
less,  the H-statistic  based  UCL  results  in
unacceptably high estimates of the threshold
levels such as the background level contamina-
tion.  This is especially true for data sets from
populations with CV values exceeding 1.  For
samples of larger sizes, the use of H-statistics can
be replaced  by UCLs based on nonparametric
methods such as the jackknife or the bootstrap.
Other well-known results such as the central limit
theorem (CLT) and the Chebychev theorem may also be used
to obtain UCLs. To illustrate problems associated
with methods based on lognormal theory, results
for some simulated examples  and some from
Superfund work done by the authors have been
included in this paper.

2. THE LOGNORMAL DISTRIBUTION

  The  authors briefly describe the lognormal
distribution.  By definition, contaminant concen-
tration  is  lognormally  distributed   if  the
log-transformed  concentrations  are  normally
distributed. This can be mathematically described
as follows:

  If Y = ln(X) is normally distributed with mean,
$\mu$, and variance, $\sigma^2$, then X is said to be
lognormally distributed with parameters $\mu$ and $\sigma^2$.
It should be noted that $\mu$ and $\sigma^2$ are not the mean
and variance of the lognormal random variable, X,
but they are the mean and variance of the log-
transformed random variable, Y.  However, it is
common practice to use the same parameters to
specify either, and it is convenient to refer to the

-------
normal distribution with the abbreviated notation Y
~ N($\mu$, $\sigma^2$) and the lognormal distribution with the
abbreviation X ~ LN($\mu$, $\sigma^2$). Figure 2, which shows
plots of a normal and a lognormal density function
with $\mu = 0$ and $\sigma^2 = 0.5$, illustrates the difference
between normal and lognormal distributions.
Figure 2  Graphs of normal N($\mu = 0$, $\sigma^2 = 0.5$) and
         lognormal LN($\mu = 0$, $\sigma^2 = 0.5$) density
         functions.

Figure 3, which shows plots of several lognormal
distributions, each with $\mu = 0$, illustrates how
varying the parameter $\sigma^2$ can change the amount of
skewness.
Figure 3  Graphs of A: LN($\mu = 0$, $\sigma^2 = 0.25$), B:
         LN($\mu = 0$, $\sigma^2 = 1.0$) and C: LN($\mu = 0$, $\sigma^2$
         = 25.0) density functions.
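
The distinction drawn above, between the parameters $\mu$ and $\sigma^2$ of Y = ln(X) and the moments of X itself, can be checked numerically. The following is a minimal sketch, not part of the original paper. The function names are hypothetical, and the closed-form expressions for the median and the coefficient of variation are the standard lognormal results, stated here as an assumption because they do not appear in the surviving text.

```python
import math
import random
import statistics


def lognormal_moments(mu, sigma2):
    """Mean, median, and CV of X when ln(X) ~ N(mu, sigma2)."""
    mean = math.exp(mu + 0.5 * sigma2)       # mean of X, not of ln(X)
    median = math.exp(mu)                    # median of X
    cv = math.sqrt(math.exp(sigma2) - 1.0)   # sd of X divided by mean of X
    return mean, median, cv


def simulate_lognormal(mu, sigma2, n, seed=0):
    """Draw n values X = exp(Y) with Y ~ N(mu, sigma2)."""
    rng = random.Random(seed)
    sd = math.sqrt(sigma2)
    return [math.exp(rng.gauss(mu, sd)) for _ in range(n)]


# Compare the closed-form values with simulated samples for several sigma2.
for sigma2 in (0.25, 0.5, 1.0):
    mean, median, cv = lognormal_moments(0.0, sigma2)
    x = simulate_lognormal(0.0, sigma2, n=100_000)
    print(
        f"sigma2={sigma2}: mean={mean:.3f} median={median:.3f} cv={cv:.3f} "
        f"sample mean={statistics.fmean(x):.3f} "
        f"sample median={statistics.median(x):.3f}"
    )
```

With $\mu$ held at 0, the median stays at 1 while the mean and the coefficient of variation grow with $\sigma^2$, which is one way to see the increasing skewness described for Figure 3.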


The parameters of interest of a lognormal
distribution, LN($\mu$, $\sigma^2$), are given as follows:
Mean = $\mu_1 = \exp(\mu + 0.5\sigma^2)$
-------
Figure 4   Graphs showing the relative positions of
          the true mean, the 95% UCL, and
          the 95th percentile.
this assumption being violated are discussed as
follows.  A data set can be put into  a statistical
procedure (e.g., the Shapiro-Wilk test of
normality)  or a computer program whether or not
the required assumptions are met. It  is the user's
responsibility to ensure the underlying assumptions
required to conduct the statistical procedure are
met. The decisions and conclusions derived from
incorrectly used statistics  can be expensive.  For
example, incorrect use of a statistic may lead to
wrong conclusions such as:  1) remediation of a
clean part of the site, or 2) no  remediation at a
contaminated  part of the site.  The  first wrong
conclusion will result in an unnecessary cleanup
whereas the second incorrect conclusion may cause
a threat to human health and the environment. It is
likely that  the  availability of new and improved
statistical software has also increased the misuse of
statistical techniques.  This  is illustrated in the
following discussion of the  application  to some
simulated and real data sets. It should be reiterated
that  it is the  analyst's  (user's)  responsibility to
verify that  none of the required assumptions are
violated before using a statistical test and deriving
inferences  based upon the resulting analysis. In
many cases, this may warrant expert advice from a
statistician.

  Often, the central portion of a data  set  will
behave as if it came from a  normal distribution.
However, in practice, a normally distributed data
set with a few extreme (high) observations can be
incorrectly modeled by the lognormal distribution
with the lognormal assumption hiding the outliers.
Also, the mixture  of  two  or  more  normally
distributed data sets with significantly different
mean concentrations such as one coming from the
clean background part and the other taken from a
contaminated part of the site can also be modeled
(incorrectly) by a lognormal distribution.   The
following example illustrates this point.

Example 2.1.  Simulated data set from two populations
A simulated data set of size  fifteen (15) has been
obtained from a mixture of two normal populations. Ten
observations  (representing   background)  were
generated from a normal distribution with mean, 100,
and  standard  deviation, 50, and  five  observations
(representing contamination) were  generated from a
normal distribution with mean, 1000, and standard
deviation, 100. The mean of this mixture distribution is
400.   The generated data are as follows: 180.5071,
2.3345,  48.6651,  187.0732,   120.2125,  87.9587,
136.7528,  24.4667,  82.2324,  128.3839,  850.9105,
1041.7277, 901.9182, 1027.1841, and 1229.9384.
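
The summary statistics of this data set, on both the original and the log scale, can be recomputed directly from the listed values. The following is a minimal sketch using only the Python standard library; the variable and function names are hypothetical.

```python
import math
import statistics

# Data of Example 2.1: ten background values followed by five
# contaminated values, as listed in the text.
x_mix = [
    180.5071, 2.3345, 48.6651, 187.0732, 120.2125,
    87.9587, 136.7528, 24.4667, 82.2324, 128.3839,
    850.9105, 1041.7277, 901.9182, 1027.1841, 1229.9384,
]

# Mean of the mixture distribution: the two normal means weighted
# by the proportions 10/15 and 5/15.
mixture_mean = (10 * 100 + 5 * 1000) / 15


def summarize(values):
    """Return (n, sample mean, sample sd with divisor n - 1)."""
    return len(values), statistics.fmean(values), statistics.stdev(values)


n, mean_x, sd_x = summarize(x_mix)
_, mean_y, sd_y = summarize([math.log(v) for v in x_mix])

print(f"mixture mean       = {mixture_mean:.1f}")
print(f"n = {n}, mean(x)   = {mean_x:.3f}, sd(x)     = {sd_x:.2f}")
print(f"mean(ln x) = {mean_y:.4f}, sd(ln x) = {sd_y:.4f}")
```

The sample mean and standard deviation obtained this way should agree with the values reported alongside the K-S normality plots for this example (403.351 and 453.94 on the original scale).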

Discussion of Example 2.1

   The data set in Example 2.1 failed several
goodness-of-fit tests of normality, such as the
Shapiro-Wilk W-test (W = 0.7572) and the
Kolmogorov-Smirnov test (K-S = 0.35)
(see Figures 5 and 6).  However, when
these tests were carried out on the log-transformed
Figure 5  Histogram of the 15 observations from the
         mixture population of Example 2.1.

-------
Figure 6  K-S test of normality for the data of
          Example 2.1 (normal probability plot of X_mix).
          Average: 403.351; Std. Dev.: 453.94.
          Kolmogorov-Smirnov normality test: D+: 0.350,
          D-: 0.189, D: 0.350; approximate p value < 0.01.

data, the test statistics are insignificant at the $\alpha$ =
0.05 level of significance with W = 0.8957, and K-S
= 0.168, suggesting that a lognormal distribution
(see Figures 7 and 8) provides a reasonable fit to
the data.   Based upon this test,  one  might
incorrectly   conclude  that   the   observed
concentrations come from  a  single  background
lognormal population. This incorrect conclusion is
made quite frequently. This data set is used later to
illustrate how modeling the mixture data set by a
lognormal distribution will result  in incorrect
estimates of mean contamination levels at various
parts of the site.
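
The K-S statistics quoted for the original and log-transformed data can be recomputed as follows. This is a minimal sketch under one assumption: the reference normal distribution is fitted with the sample mean and sample standard deviation. The function name `ks_normal` is hypothetical, and no p-values are computed.

```python
import math
import statistics

x_mix = [
    180.5071, 2.3345, 48.6651, 187.0732, 120.2125,
    87.9587, 136.7528, 24.4667, 82.2324, 128.3839,
    850.9105, 1041.7277, 901.9182, 1027.1841, 1229.9384,
]


def ks_normal(values):
    """Return (D+, D-, D) against a normal fitted by sample mean and sd."""
    xs = sorted(values)
    n = len(xs)
    fitted = statistics.NormalDist(statistics.fmean(xs), statistics.stdev(xs))
    # D+: largest amount by which the empirical CDF exceeds the fitted CDF.
    d_plus = max((i + 1) / n - fitted.cdf(v) for i, v in enumerate(xs))
    # D-: largest amount by which the fitted CDF exceeds the empirical CDF.
    d_minus = max(fitted.cdf(v) - i / n for i, v in enumerate(xs))
    return d_plus, d_minus, max(d_plus, d_minus)


raw = ks_normal(x_mix)
logged = ks_normal([math.log(v) for v in x_mix])
print("original scale: D+=%.3f D-=%.3f D=%.3f" % raw)
print("log scale:      D+=%.3f D-=%.3f D=%.3f" % logged)
```

The much smaller D on the log scale is what makes the lognormal model look acceptable here, even though the data were generated from a mixture of two normal populations.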
Figure 7  Histogram of the log-transformed 15
          observations from the mixture population
          of Example 2.1.

Figure 8  K-S test of lognormality for the data of
          Example 2.1 (normal probability plot of ln(X)).
          Average: 5.09021; Std. Dev.: 1.70569; N of data: 15.
          Kolmogorov-Smirnov normality test: D+: 0.134,
          D-: 0.168, D: 0.168; approximate p value > 0.15.

3. METHODS OF COMPUTING A UCL OF THE MEAN

                                 The main objective of this study is to assess the
                              performances   of  the  various  methods  of
                              estimating the UCL for the mean, $\mu_1$, of positively
                              skewed  populations.   The assumption  of a
                              lognormal distribution to model such populations
                              has become quite popular among environmental
                              scientists (Ott, 1990). As noted in Section 2, for
                              positively skewed data sets, there are potential
                              problems in using standard methods based on the
                              lognormal theory. Therefore, we will compare the
                              lognormal-based methods often used with cleanup
                              standards with some other available methods. The
                              alternate  methods  considered  here  have the
                              advantage that they do not require assumptions
                              about the   specific  form  of  the  population
                              distribution. In other words, they do not assume
                              normality or lognormality of the data set under
                              consideration.  In Section 4, the UCL of the mean
                              has been computed for several examples using the
                              following methods:

                                 •   The H-statistic
                                 •   The Jackknife procedure
                                 •   The Bootstrap procedure
                                 •   The Central Limit Theorem
                                 •   The Chebychev Theorem

-------
A brief description of the computation  of the
various estimates and the  associated confidence
limits  obtained  using  the   above-mentioned
procedures follows:

Parametric Lognormal Procedures

   Let $x_1, x_2, \ldots, x_n$ be a random sample from a log-
normal distribution with mean, $\mu_1$, and variance,
$\sigma_1^2$, and denote by $\mu$ and $\sigma$ the population mean and
population standard deviation (sd), and $\bar{y}$ and $s_y$
the sample mean and sample sd, respectively, of
the log-transformed data $y_i = \ln(x_i)$; $i = 1, 2, \ldots, n$.
Specifically,
$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$                                (6)
and
$s_y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2$.                    (7)
   In a more general setting, consider a population
with an unknown parameter, $\theta$.  The minimum
variance unbiased estimate (MVUE) of $\theta$ is the
one that is not only an unbiased estimate of $\theta$ (i.e.,
the expected value of the estimate is equal to the
true value of the parameter), but it also has a
smaller variance than any other unbiased estimate
of $\theta$. When the parameter of interest is the mean,
$\mu_1$, of a lognormally distributed population, Bradu
and Mundlak (1970) derive its MVUE, which is
given by
$\hat{\mu}_1 = \exp(\bar{y})\, g_n(s_y^2/2)$,                      (8)
where $g_n(u)$ is a function whose form is rather
complicated, but an infinite series solution is given
by Aitchison and Brown (1976).  Tabulations of
this function are provided by Gilbert (1987, Table
A9). Note that Gilbert uses $\psi_n$ in place of $g_n$. This
function is also used in computing the MVUE of
the variance, $\sigma_1^2$, of a lognormal population, as
given by Finney (1941),
$\hat{\sigma}_1^2 = \exp(2\bar{y})\,[\,g_n(2s_y^2) - g_n((n-2)s_y^2/(n-1))\,]$.     (9)
Bradu and Mundlak (1970) give the MVUE of the
variance of the estimate $\hat{\mu}_1$,
$\hat{\sigma}^2(\hat{\mu}_1) = \exp(2\bar{y})\,[\,(g_n(s_y^2/2))^2 - g_n((n-2)s_y^2/(n-1))\,]$.     (10)
   Another estimate which is also sometimes used
is known as the Maximum Likelihood Estimate
(MLE). When the data set is a random sample
from a lognormal distribution, the MLE of the
parameter, $\mu$, is simply the sample mean of the
log-transformed data, $\hat{\mu} = \bar{y}$, and the MLE of $\sigma^2$ is
a multiple of the sample variance of the log-
transformed data, namely, $\hat{\sigma}^2 = [(n-1)/n]\,s_y^2$. The
MLE of any function of the parameters $\mu$ and $\sigma^2$ is
obtained by simply substituting these MLEs in
place of the parameters. For example, the MLE of
the mean of a lognormal population is
$\exp(\hat{\mu} + 0.5\hat{\sigma}^2)$.
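
The substitution rule for the MLE can be illustrated with the data of Example 2.1. The following is a minimal sketch; the function name `lognormal_mle_mean` is hypothetical, and the example is offered only to show the calculation, not to reproduce any result reported in this paper.

```python
import math
import statistics

x_mix = [
    180.5071, 2.3345, 48.6651, 187.0732, 120.2125,
    87.9587, 136.7528, 24.4667, 82.2324, 128.3839,
    850.9105, 1041.7277, 901.9182, 1027.1841, 1229.9384,
]


def lognormal_mle_mean(values):
    """MLE of the mean of a lognormal population from raw observations."""
    y = [math.log(v) for v in values]
    n = len(y)
    mu_hat = statistics.fmean(y)                       # MLE of mu
    sigma2_hat = (n - 1) / n * statistics.variance(y)  # MLE of sigma^2
    return math.exp(mu_hat + 0.5 * sigma2_hat)


mle_mean = lognormal_mle_mean(x_mix)
sample_mean = statistics.fmean(x_mix)
print(f"lognormal MLE of the mean = {mle_mean:.1f}")
print(f"arithmetic sample mean    = {sample_mean:.1f}")
```

For this mixture data set the lognormal MLE of the mean exceeds the arithmetic sample mean, which is consistent with the concern that fitting a lognormal model to mixture data can distort estimates of the mean.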
-------
Consequently, other methods for computing a UCL
of the mean, $\mu_1$, of a distribution of unspecified
form will be considered and the results compared
with UCLs obtained by the H-statistic approach.

   The methods considered in this paper can be
viewed as variations of a basic approach to
constructing confidence intervals known as the
pivotal quantity method.  In general, a pivotal
quantity is a function of both the parameter $\theta$ and
an estimate $\hat{\theta}$ such that the probability distribution of
the pivotal quantity does not depend on $\theta$. Perhaps
the best-known example of a pivotal quantity is the
well-known t statistic,
$t = \frac{\bar{x} - \mu_1}{s_x/\sqrt{n}}$,                          (12)
where $\bar{x}$ and $s_x$ are, respectively, the sample mean
and sample standard deviation.  If the data is a
random sample from a normal population with
mean, $\mu_1$, and standard deviation, $\sigma_1$, then the
distribution of this pivotal quantity is the familiar
Student's t distribution with $n-1$ degrees of
freedom. Because the Student's t distribution does
not depend on either unknown parameter, quantiles
are available.  Denote by $t_{\alpha,n-1}$ the upper $\alpha$th
quantile of the Student's t distribution with $n-1$
degrees of freedom.  Based on equation (12), it is
possible to derive a $(1-2\alpha)100\%$ confidence
interval of the form
$\bar{x} - t_{\alpha,n-1}\, s_x/\sqrt{n} \le \mu_1 \le \bar{x} + t_{\alpha,n-1}\, s_x/\sqrt{n}$.   (13)
The confidence interval is given in the familiar
form of a two-sided confidence interval for the
mean.  If the lower limit of this interval is
disregarded, the upper limit provides a $(1-\alpha)100\%$
UCL for the mean, $\mu_1$.
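
The upper limit of the Student's t interval can be computed as in the following minimal sketch. The function name `student_t_ucl` is hypothetical. The t quantile is passed in by the caller rather than computed, on the assumption that it is read from a table; the value 1.761 is the tabulated upper 5% quantile for 14 degrees of freedom.

```python
import math
import statistics


def student_t_ucl(values, t_quantile):
    """One-sided UCL for the mean: xbar + t * s / sqrt(n)."""
    n = len(values)
    xbar = statistics.fmean(values)
    s = statistics.stdev(values)  # sample sd with divisor n - 1
    return xbar + t_quantile * s / math.sqrt(n)


# Data of Example 2.1, used here only to illustrate the arithmetic.
x_mix = [
    180.5071, 2.3345, 48.6651, 187.0732, 120.2125,
    87.9587, 136.7528, 24.4667, 82.2324, 128.3839,
    850.9105, 1041.7277, 901.9182, 1027.1841, 1229.9384,
]

T_05_14 = 1.761  # tabulated upper 5% quantile, 14 degrees of freedom
ucl95 = student_t_ucl(x_mix, T_05_14)
print(f"95% UCL based on Student's t: {ucl95:.2f}")
```

Because this limit relies on approximate normality of the data, it is shown here as a point of reference for the other methods rather than as a recommended choice for skewed data.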

   For a population which is normally distributed,
equation (13) provides the best way of constructing
confidence  intervals for  the  population  mean.
However, as noted previously,  the distribution  of
contaminant  concentration data  is  typically
positively skewed and frequently involves outliers.
It is well known that the sample mean and sample
standard deviation get severely distorted in the
presence of outliers (Singh and Nocerino, 1995),
and consequently any function of these statistics,
such as the Student's t given by equation (12)
above, also gets severely influenced by the
presence of outliers.  Robust methods for
estimating the population mean and sd are
available in the software package, SCOUT, as
identified in Singh and Nocerino (1995).  In
practice, statistical procedures based on the
pivotal quantity (equation 12) are usually thought
to be "robust" relative to violation of the
normality assumption.  However, as noted by Staudte and
Sheather (1990), tests based on the Student's t are
nonrobust in the presence of outliers.
Consequently, other procedures which do not rely
on  a specific  parametric  assumption for the
population distribution are also considered in the
following discussion.

   The  approach  of  constructing confidence
intervals from pivotal quantities (or approximate
pivotal quantities) permits a unified treatment of
these alternate procedures.  In particular, each
procedure  involves  an approximate  pivotal
quantity with the difference between the unknown
population mean, $\mu_1$, and a point estimate of the
mean in the numerator, and an  estimate of the
standard error of  the point  estimate in the
denominator. Thus, each procedure involves two
parts: 1) finding some reasonably robust estimate
of the mean, (Singh and Nocerino 1995), and  2)
providing a convenient way to obtain quantiles of
the pivotal quantity. A general discussion of the
pivotal   quantity  approach   to  constructing
confidence  intervals   is given  by  Bain  and
Engelhardt (1992).

   As noted above, in order to apply the pivotal
quantity method, it is necessary to have  quantiles
of the  distribution of the pivotal quantity.  For
example, in order to compute equation (13), it is
necessary to  have quantiles of  the Student's t
distribution.  These quantiles  can be  found  in
tables or computed with the appropriate software.
However, for nonnormal populations the required
quantiles are not, in general, readily available.  In
some cases, even though the exact distribution of
a pivotal quantity is not known,  an approximate
distribution can be used. Thus, except for the H-

-------
statistic approach, which is exact if the population
is truly lognormal,  all  of the other methods
discussed below give only  approximate UCL
values  for  the  population  mean.   The true
confidence level of UCLs will vary from one
method to the next, and without some additional
study, it will not be clear whether the comparisons
are fair. In other words, it is possible to have a
smaller UCL at the expense of a true confidence
level which is below the nominal level, and below
the true confidence level of another competing
method.

   In  environmental applications, the objectives
typically are:  1) the identification of hot spots,
which  are typically  represented  by the  high
extreme concentrations, or 2) the separation of
clean part(s) of a site from the dirty contaminated
part(s) of the site.  However, from the examples
discussed in the following, it can be seen that the
practical use of the lognormal distribution in those
environmental applications is questionable as a
lognormal   distribution  often   accommodates
extreme outlying  observations   and  mixture
populations as part of one lognormal distribution.

Jackknife and Bootstrap Procedures

   General methods for deriving estimates, such as
the method of maximum likelihood, often result in
estimates which  are  biased.   Bootstrap and
jackknife procedures as discussed by Efron (1982)
and Miller (1974) are nonparametric statistical
techniques which can be used to reduce the bias of
point  estimates   and  construct  approximate
confidence intervals for parameters such as the
population mean. These two procedures require no
assumptions regarding the statistical distribution
(e.g.,  normal  or lognormal)  for the underlying
population, and  can be applied to a variety of
situations no matter how complicated. However, it
should be pointed out that the use of a parametric
statistical method (depending upon distributional
assumptions), when appropriate, is more efficient
than its nonparametric counterpart.  In practice,
parametric  assumptions  are  often  difficult  to
justify,  especially in environmental applications.
In these cases, nonparametric methods are valuable
tools  for obtaining  reliable estimates  of the
parameters of interest. Although bootstrap and
jackknife  procedures  are  conceptually  simple,
they are based on resampling techniques requiring
considerable computing power and time.
   Let $x_1, x_2, \ldots, x_n$ be a random sample of size $n$
from a population with an unknown parameter $\theta$
(e.g., $\theta = \mu_1$), and let $\hat{\theta}$ be an estimate of $\theta$ which
is a function of all $n$ observations. For example,
the parameter $\theta$ could be the mean, and a
reasonable choice for the estimate $\hat{\theta}$ might be the
sample mean, $\bar{x}$.  Another choice is the MVUE of
a lognormal mean. Of course, if the population is
not lognormal then this estimate may not perform
well; but, because it is frequently used with
skewed data sets, it is of interest to see how it
performs relative to the other methods.

Jackknife Estimation

   In the jackknife approach, $n$ estimates of $\theta$ are
computed by deleting one observation at a time.
Specifically, for each index, $i$, denote by $\hat{\theta}_{(i)}$ the
estimate of $\theta$ (computed similarly as $\hat{\theta}$ given
above) when the $i$th observation is omitted from
the original sample of size $n$, and denote the
arithmetic mean of these estimates by

$\tilde{\theta} = \frac{1}{n}\sum_{i=1}^{n} \hat{\theta}_{(i)}$.                            (14)

A quantity known as the $i$th "pseudo-value" is
defined by

$J_i = n\hat{\theta} - (n-1)\hat{\theta}_{(i)}$.                      (15)

The jackknife estimator of $\theta$ is given by

$J(\hat{\theta}) = \frac{1}{n}\sum_{i=1}^{n} J_i = n\hat{\theta} - (n-1)\tilde{\theta}$.             (16)
If the original estimate $\hat{\theta}$ is biased, then, under
certain conditions, part of the bias is removed by
the jackknife procedure, and an estimate of the
standard error of the jackknife estimate, $J(\hat{\theta})$, is
given by
$\hat{\sigma}_{J(\hat{\theta})} = \sqrt{\frac{1}{n(n-1)}\sum_{i=1}^{n}\left(J_i - J(\hat{\theta})\right)^2}$.          (17)
-------
   Another application of the pseudo-values,
suggested by J. Tukey (see Miller, 1974), is to use
the pseudo-values to obtain confidence intervals
for the parameter, $\theta$, based on the following pivotal
quantity:

$$t = \frac{J(\hat{\theta}) - \theta}{\hat{\sigma}_{J(\hat{\theta})}}, \tag{18}$$

where $\hat{\sigma}_{J(\hat{\theta})}$ is the standard error estimate of equation (17).
The statistic, $t$, given by equation (18) has an
approximate Student's t distribution with $n-1$
degrees of freedom, which can be used to derive
the following approximate two-sided
$(1-2\alpha)100\%$ confidence interval for $\theta$:

$$J(\hat{\theta}) - t_{\alpha,n-1}\,\hat{\sigma}_{J(\hat{\theta})} \le \theta \le J(\hat{\theta}) + t_{\alpha,n-1}\,\hat{\sigma}_{J(\hat{\theta})}. \tag{19}$$

The upper limit of equation (19) is an approximate
$(1-\alpha)100\%$ UCL for $\theta$. If the sample size, $n$, is
large, then the upper $\alpha$th t quantile can be replaced
with the corresponding upper $\alpha$th standard normal
quantile, $z_\alpha$. Observe also that when $\hat{\theta}$ is the
sample mean, then the jackknife estimate is the
sample mean, that is, $J(\bar{x}) = \bar{x}$; the estimate of the
standard error in equation (17) simplifies to $s_x/n^{1/2}$,
and the confidence interval in equation (19)
reduces to the familiar t-statistic based confidence
interval given by equation (13).
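
The jackknife computations described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the software used for this paper: the function names are hypothetical, the standard error is taken in the usual pseudo-value form (the sum of squared deviations of the pseudo-values divided by n(n-1)), and the Student's t quantile is supplied by the caller (1.761 is the tabulated upper 5% point for 14 degrees of freedom). The data are the n = 15 values listed in Example 4.2.

```python
import math
from statistics import mean, stdev


def jackknife(data, estimator):
    """Return (jackknife estimate, standard error) for `estimator`.

    `estimator` is any function of the whole sample, e.g. statistics.mean.
    """
    n = len(data)
    theta_hat = estimator(data)
    # Leave-one-out estimates: the i-th one omits the i-th observation.
    leave_one_out = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    # Pseudo-values J_i = n*theta_hat - (n - 1)*theta_hat_(i), equation (15).
    pseudo = [n * theta_hat - (n - 1) * t for t in leave_one_out]
    # Jackknife estimate: arithmetic mean of the pseudo-values, equation (16).
    j_est = mean(pseudo)
    # Standard error from the spread of the pseudo-values (assumed form).
    se = math.sqrt(sum((j - j_est) ** 2 for j in pseudo) / (n * (n - 1)))
    return j_est, se


def jackknife_ucl(data, estimator, t_quantile):
    """Upper limit of the jackknife interval: J + t * se."""
    j_est, se = jackknife(data, estimator)
    return j_est + t_quantile * se


# Data of Example 4.2 (n = 15).
x = [139.2056, 259.9746, 138.7997, 48.8109, 166.1733,
     54.1241, 120.3665, 60.9887, 551.2073, 66.3336,
     16.0695, 364.5569, 153.2404, 271.5436, 473.6461]
T_05_14 = 1.761  # tabulated upper 5% Student's t quantile, 14 d.f.

j_est, se = jackknife(x, mean)
print("jackknife estimate:", round(j_est, 2))
print("jackknife s.e.    :", round(se, 2), "vs s/sqrt(n):",
      round(stdev(x) / math.sqrt(len(x)), 2))
print("95% jackknife UCL :", round(jackknife_ucl(x, mean, T_05_14), 2))
```

With the sample mean as the estimator, the pseudo-values are the observations themselves, so the sketch returns the sample mean and s/sqrt(n), and the resulting UCL is close to the jackknife value tabulated for Example 4.2. Any other estimator (for example, the sample median) can be passed in the same way.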

Bootstrap Estimation

   In the bootstrap procedure, repeated samples of
size $n$ are drawn with replacement from the given
set of observations. The process is repeated a large
number of times, and each time an estimate of $\theta$ is
computed. The estimates thus obtained are used to
compute an estimate of the standard error of $\hat{\theta}$.
There  exists in the  literature  of  statistics  an
extensive array of different bootstrap methods for
constructing confidence intervals.  In this  article
two  of these methods are  considered:   1)  the
standard  bootstrap,   and  2) the  pivotal   (or
Studentized) bootstrap method as discussed by Hall
(1988).    A  general   description  of  bootstrap
methods, illustrated by application to the  sample
mean, follows:
Step 1.   Let $(x_{i1}, x_{i2}, \ldots, x_{in})$ represent the $i$th
          sample of size $n$ drawn with replacement from
          the original data set $(x_1, x_2, \ldots, x_n)$. Then
          compute the sample mean and denote it
          by $\bar{x}_i$.

Step 2.   Perform Step 1 independently $N$ times
          (e.g., 500-1000), each time calculating
          a new estimate. Denote those estimates
          by $\bar{x}_1, \bar{x}_2, \bar{x}_3, \ldots, \bar{x}_N$. The bootstrap
          estimate of the population mean is the
          arithmetic mean, $\bar{x}_B$, of the $N$ estimates
          $\bar{x}_i$. The bootstrap estimate of the
          standard error is given by

$$\hat{\sigma}_B = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} \left(\bar{x}_i - \bar{x}_B\right)^2}. \tag{20}$$

If some parameter, $\theta$ (say, a population median),
other than the mean is of concern, with an
associated estimate (e.g., the sample median), then
the same steps previously described could be
applied with the parameter and its estimate used in
place of $\mu_1$ and $\bar{x}$. Specifically, the estimate, $\hat{\theta}_i$,
would be computed, instead of $\bar{x}_i$, for each of the
$N$ bootstrap samples. The general bootstrap
estimate, denoted by $\hat{\theta}_B$, is the arithmetic mean of
the $N$ estimates. The difference, $\hat{\theta}_B - \hat{\theta}$, provides
an estimate of the bias of the estimate, $\hat{\theta}$, and the
bootstrap estimate of the standard error of $\hat{\theta}$ is
given by

$$\hat{\sigma}_B = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} \left(\hat{\theta}_i - \hat{\theta}_B\right)^2}. \tag{21}$$

The standard bootstrap confidence interval is
derived from the following pivotal quantity:

$$t = \frac{\hat{\theta} - \theta}{\hat{\sigma}_B}. \tag{22}$$

Finally, the $(1-2\alpha)100\%$ standard bootstrap
confidence interval for $\theta$, which assumes that
equation (22) is approximately normal, is

$$\hat{\theta} - z_\alpha\,\hat{\sigma}_B \le \theta \le \hat{\theta} + z_\alpha\,\hat{\sigma}_B. \tag{23}$$

    In this case, the bootstrap approach gives a
convenient way to estimate the standard error of
$\hat{\theta}$. Depending on the type of estimate $\hat{\theta}$, the
standard error may be quite difficult to derive, and
consequently difficult to estimate.  However, the
bootstrap approach always yields an estimate of the
standard error directly from the data, even when
the mathematical form of the standard error is not
known.
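
Steps 1 and 2 and the standard bootstrap interval can be sketched as follows. This is a hedged illustration rather than the procedure actually coded for this paper: the function name, the fixed random seed, and the default of 1000 resamples are choices made here, and the standard error is computed with an N - 1 divisor, which is an assumption of the sketch. Because the estimator is passed in as a function, the same code covers the sample mean and, say, the sample median.

```python
import math
import random
from statistics import NormalDist, mean, median


def standard_bootstrap(data, estimator, n_boot=1000, alpha=0.05, seed=0):
    """Return (theta_hat, bias, se, lower, upper) for the standard bootstrap."""
    rng = random.Random(seed)  # fixed seed only so that runs are repeatable
    n = len(data)
    theta_hat = estimator(data)
    # Steps 1 and 2: N resamples of size n, drawn with replacement.
    boot = [estimator(rng.choices(data, k=n)) for _ in range(n_boot)]
    theta_b = mean(boot)        # general bootstrap estimate
    bias = theta_b - theta_hat  # bootstrap estimate of the bias
    # Bootstrap standard error (N - 1 divisor assumed in this sketch).
    se = math.sqrt(sum((t - theta_b) ** 2 for t in boot) / (n_boot - 1))
    z = NormalDist().inv_cdf(1 - alpha)  # upper alpha-th normal quantile
    return theta_hat, bias, se, theta_hat - z * se, theta_hat + z * se


# Data of Example 4.2 (n = 15).
x = [139.2056, 259.9746, 138.7997, 48.8109, 166.1733,
     54.1241, 120.3665, 60.9887, 551.2073, 66.3336,
     16.0695, 364.5569, 153.2404, 271.5436, 473.6461]

for name, estimator in (("mean", mean), ("median", median)):
    theta_hat, bias, se, lower, upper = standard_bootstrap(x, estimator)
    print(f"{name}: estimate = {theta_hat:.2f}, bias = {bias:.2f}, "
          f"s.e. = {se:.2f}, 95% UCL = {upper:.2f}")
```

The upper limit returned is the one-sided 95% UCL when alpha = 0.05. Results vary slightly with the seed and with the number of resamples, so they should not be expected to match the tabulated values exactly.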

        Another variation of the bootstrap method,
called the "bootstrap t" by Efron (1982), is a
nonparametric procedure which uses the bootstrap
methodology to estimate quantiles of the pivotal
quantity in equation (12). As previously noted, for
nonnormal populations the required quantiles may
not be easily obtained, or may be impossible to
compute exactly. However, with a variation of the
bootstrap procedure, as proposed by Hall (1988),
the required quantiles can be estimated directly
from the data. Specifically, in Steps 1 and 2
described above, if $\bar{x}$ is the sample mean computed
from the original data, and $\bar{x}_i$ and $s_{x,i}$ are the sample
mean and sample standard deviation computed
from the $i$th resampling of the original data, the $N$
quantities, $t_i = (\bar{x}_i - \bar{x})/s_{x,i}$, are computed and sorted,
yielding ordered quantities $t_{(1)} \le t_{(2)} \le \cdots \le t_{(N)}$. The
estimate of the upper $\alpha$th quantile of the pivotal
quantity in equation (12) is $t_{\alpha,B} = t_{((1-\alpha)N)}$. For
example, if $N$ = 1000 bootstrap samples are
generated, then the 950th ordered value, $t_{(950)}$,
would be the bootstrap estimate of the upper 0.05th
quantile of the pivotal quantity in equation (12).
This estimated quantile can be used in place of the
upper $\alpha$th Student's t quantile in an interval of the
form given in equation (13). In the next section,
this method of construction will be  called the
"pivotal bootstrap".    This approach  has the
advantage that it does not rely on the assumption of
a special parametric form for the distribution of the
population, and it does not require an assumption
of approximate normality for the pivotal quantity
as does the standard bootstrap interval of equation
(23).
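
The pivotal bootstrap described in the paragraph above can be sketched as below. Several points are assumptions of this sketch and are not confirmed by the text: the function names are hypothetical, the Studentized quantities are scaled by sqrt(n) on the assumption that equation (12) is the usual t pivot (drop that factor if equation (12) is defined differently), and resamples with zero standard deviation are simply redrawn. The sketch follows the text in using the upper ordered value; some presentations of the bootstrap t build the upper limit from the lower tail instead, so this should not be expected to reproduce the tabulated pivotal bootstrap values.

```python
import math
import random
from statistics import mean, stdev


def upper_order_statistic(sorted_values, alpha):
    """Return t_((1 - alpha) N), e.g. the 950th of N = 1000 for alpha = 0.05."""
    n_boot = len(sorted_values)
    rank = min(n_boot, max(1, round((1 - alpha) * n_boot)))
    return sorted_values[rank - 1]  # 1-based rank to 0-based index


def pivotal_bootstrap_ucl(data, n_boot=1000, alpha=0.05, seed=0):
    """Bootstrap-t style UCL for the mean (see the assumptions above)."""
    rng = random.Random(seed)
    n = len(data)
    x_bar, s_x = mean(data), stdev(data)
    if s_x == 0:  # constant data: nothing to resample
        return x_bar
    t_values = []
    while len(t_values) < n_boot:
        resample = rng.choices(data, k=n)
        s_i = stdev(resample)
        if s_i == 0:  # degenerate resample, draw again
            continue
        # Studentized deviation of the resample mean (sqrt(n) scaling assumed).
        t_values.append((mean(resample) - x_bar) / (s_i / math.sqrt(n)))
    t_values.sort()
    t_alpha_b = upper_order_statistic(t_values, alpha)
    # The bootstrap quantile replaces the Student's t quantile.
    return x_bar + t_alpha_b * s_x / math.sqrt(n)


# Data of Example 4.2 (n = 15).
x = [139.2056, 259.9746, 138.7997, 48.8109, 166.1733,
     54.1241, 120.3665, 60.9887, 551.2073, 66.3336,
     16.0695, 364.5569, 153.2404, 271.5436, 473.6461]
print("pivotal bootstrap 95% UCL (sketch):", round(pivotal_bootstrap_ucl(x), 2))
```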

        In the examples to follow, the jackknife,
the standard bootstrap method, and the pivotal
bootstrap method are applied using the sample
mean, $\bar{x}$, and also the estimate given by equation
(8), which is the MVUE of the mean when the
population is lognormal.

The Central Limit Theorem

       Given a random sample, $x_1, x_2, \ldots, x_n$, of
size $n$ from a population with a finite variance,
$\sigma_1^2$, where $\theta = \mu_1$ is the unknown population
mean, the Central Limit Theorem (CLT) states
that the asymptotic distribution, as $n$ approaches
infinity, of the sample mean, $\bar{x}_n$, is normal
with mean, $\mu_1$, and variance, $\sigma_1^2/n$.
More precisely, the sequence of random variables

$$z_n = \frac{\bar{x}_n - \mu_1}{\sigma_1/\sqrt{n}} \tag{24}$$

has a standard normal limiting distribution. In
practice, this means that for large sample sizes $n$,
the sample mean, $\bar{x}$, has an approximate normal
distribution irrespective of the underlying
distribution function. Consequently, equation
(24) is an approximate pivotal quantity for large $n$.
This powerful result can be used to obtain
approximate $(1-2\alpha)100\%$ confidence intervals
for the mean for any distribution with a finite
variance, although, strictly speaking, it requires
one to know the population standard deviation, $\sigma_1$.
However, as noted by Hogg and Craig (1978), if
$\sigma_1$ is replaced by the sample standard deviation, $s_x$,
the normal approximation for large $n$ is still valid.
This leads to the following confidence interval:

$$\bar{x} - z_\alpha \frac{s_x}{\sqrt{n}} \le \mu_1 \le \bar{x} + z_\alpha \frac{s_x}{\sqrt{n}}. \tag{25}$$

       Note that the  confidence  interval in
equation (25) has the  same  general  form as
equation (13), but with  the t quantiles  replaced
with approximate standard normal quantiles.  As
noted previously, if the lower limit is disregarded,
the upper limit of the interval provides a one-sided
UCL for the population mean.

       An often cited rule of thumb for a sample
size with the CLT is $n > 30$. However, this may
not be adequate if the population is highly
skewed. A refinement of the CLT approach,
which makes an adjustment for skewness, is
discussed by Chen (1995). Specifically, the
"adjusted CLT" UCL is obtained if the standard
normal quantile, $z_\alpha$, in the upper limit of equation
(25) is replaced by

$$z_\alpha + \frac{\hat{k}_3}{6\sqrt{n}}\left(1 + 2z_\alpha^2\right), \tag{26}$$

where $\hat{k}_3$ is the sample coefficient of skewness,

$$\hat{k}_3 = \frac{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^3}{s_x^3}. \tag{27}$$

Notice that this adjustment results in a UCL which
is larger than that of equation (25) when the sample
skewness is positive.
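
The CLT and adjusted CLT upper limits can be sketched as follows for the n = 15 data of Example 4.2. The function names are hypothetical. The skewness coefficient is taken here as the average cubed deviation divided by the cube of the sample standard deviation, and the adjustment as z + k3(1 + 2z^2)/(6 sqrt(n)); both forms are assumptions of this sketch, chosen because they are consistent with the CLT and adjusted CLT values tabulated for that example.

```python
import math
from statistics import NormalDist, mean, stdev


def clt_ucl(data, alpha=0.05):
    """CLT upper limit: x_bar + z_alpha * s_x / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - alpha)
    return mean(data) + z * stdev(data) / math.sqrt(len(data))


def sample_skewness(data):
    """(1/n) * sum((x_i - x_bar)^3) / s_x^3, with s_x the sample sd (assumed form)."""
    n = len(data)
    x_bar, s_x = mean(data), stdev(data)
    return sum((v - x_bar) ** 3 for v in data) / n / s_x ** 3


def adjusted_clt_ucl(data, alpha=0.05):
    """CLT upper limit with the skewness-adjusted quantile (assumed form)."""
    n = len(data)
    z = NormalDist().inv_cdf(1 - alpha)
    z_adj = z + sample_skewness(data) * (1 + 2 * z ** 2) / (6 * math.sqrt(n))
    return mean(data) + z_adj * stdev(data) / math.sqrt(n)


# Data of Example 4.2 (n = 15).
x = [139.2056, 259.9746, 138.7997, 48.8109, 166.1733,
     54.1241, 120.3665, 60.9887, 551.2073, 66.3336,
     16.0695, 364.5569, 153.2404, 271.5436, 473.6461]
print("skewness        :", round(sample_skewness(x), 3))
print("CLT 95% UCL     :", round(clt_ucl(x), 2))
print("adjusted CLT UCL:", round(adjusted_clt_ucl(x), 2))
```

Because the sample skewness of these data is positive, the adjusted limit is the larger of the two, as the text notes. Small differences from the tabulated values can arise from rounding of the normal quantile.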

The Chebychev Theorem

       This theorem is given here to obtain a
reasonably conservative estimate of the UCL of the
mean. The two-sided Chebychev theorem states
that given a random variable, $X$, with finite mean
and standard deviation, $\mu_1$ and $\sigma_1$, one has
       
100), all of these methods give similar results. In
this  section, a few  simulated examples are
provided to compare the various methods of
computing values of the UCL.  A few examples
from Superfund sites have also been included.

Example 4.1. Simulated sample from a mixture of
two normal populations, N(100, 50) and N(1000,
100).

This example uses the sample of size n = 15 which was
discussed previously in Example 2.1. Recall that this
is a simulated sample from a mixture of two normal
populations. The mean of the mixed normal population
is $\mu_1$ = 400. The values of the mean, standard
deviation, and coefficient of variation computed for the
log-transformed data are:
$\bar{y}$ = 5.090, $s_y$ = 1.705, and $CV_y$ = 0.34.
The values of the mean, standard deviation, and CV
computed for the raw data are:
$\bar{x}$ = 403.35, $s_x$ = 453.94, and $CV_x$ = 1.125.
If it is assumed (incorrectly) that the population is
lognormal, point estimates based on MVUE theory of
the mean, $\mu_1$, standard deviation, $\sigma_1$, and standard
error of the mean are 572.98, 1334.56, and 290.14,
respectively. Estimates of the 80th, 90th, and 95th
percentiles of a lognormal distribution are 686.33,
1453.48, and 2685.56, respectively.
          Discussion of Example 4.1

        The 95% UCL values obtained from the
methods discussed above, without using
lognormal theory, are:

Jackknife               609.75
Standard Bootstrap      584.32
Pivotal Bootstrap       651.52
CLT                     596.16
Adjusted CLT            618.51
Chebychev               927.27

The values of the 95% UCL obtained from the
methods discussed above, calculated using the
lognormal theory, are:

Jackknife              1085.17
Standard Bootstrap      994.40
Chebychev              1869.90
H-UCL                  4150.96

       Notice that the 95% UCL computed from
the H-statistic (4150.96)  exceeds the estimated
95th percentile (2685.56) of an assumed lognormal
distribution.   The H-UCL is also an  order  of
magnitude larger than the true mean,  400, of the
mixture of two normal populations.
        It is also of interest to see how the methods
compare when applied to simulated lognormal data
with different sample sizes and various
combinations of parameter values.

Example 4.2. Simulated sample of size n = 15
from a lognormal distribution, LN(5, 1).

In this example, n = 15 data were generated from the
lognormal distribution LN(5, 1), with the following (true)
values of population parameters: $\mu_1$ = 244.69, $\sigma_1$ =
320.75, and CV = 1.31. The generated data are:
139.2056, 259.9746, 138.7997, 48.8109, 166.1733,
54.1241, 120.3665, 60.9887, 551.2073, 66.3336,
16.0695, 364.5569, 153.2404, 271.5436, 473.6461.
The values of the sample mean, standard deviation, and
CV of the log-transformed data are:
$\bar{y}$ = 4.887, $s_y$ = 0.966, $CV_y$ = 0.20.
The sample mean, standard deviation, and CV for the
raw data are:
$\bar{x}$ = 192.34, $s_x$ = 161.56, $CV_x$ = 0.84.
For a lognormal distribution, the estimates of $\mu_1$, $\sigma_1$, and
the standard error of the mean, based on MVUE theory,
are 202.58, 219.21, and 54.00, respectively. The MLEs
of $\mu_1$, $\sigma_1$, and CV are 211.33, 262.47, and 1.24,
respectively. Estimates of the 80th, 90th, and 95th
percentiles of the lognormal distribution are 299.79,
458.58, and 649.31, respectively.

Discussion of Example 4.2

The values of the 95% UCL obtained from the
methods discussed above, without using lognormal
theory, are:

Jackknife               265.79
Standard Bootstrap      258.21
Pivotal Bootstrap       292.17
CLT                     260.96
Adjusted CLT            271.57
Chebychev               378.80

The values of the 95% UCL obtained from the
methods discussed above, calculated from
lognormal theory, are:

Jackknife               289.30
Standard Bootstrap      281.22
Chebychev               448.41
H-UCL                   427.62

        The differences in UCLs for the various
methods are not as extreme as they were in the
previous example, but a similar pattern with the
Chebychev (as expected) and H-UCL limits being
the largest is still present.  However, unlike the
previous example, the 95% UCL is below the
estimated  95th  percentile  of  a  lognormal
distribution, as one would intuitively expect. It is
also interesting to note that the CV estimated as
the ratio of the sample standard deviation to the
sample mean from raw data is less than 1 (0.84),
while the CV computed from the MLEs is slightly
greater than 1 (1.24).  According to the CV test,
which says that if CV <1.0, then the population is
normally  distributed,  the former CV  of  0.84
might lead one to  incorrectly assume that the
population is normally distributed.

        In the next example, the variance of the
log-transformed variable is increased slightly with
a corresponding increase in CV and skewness.
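
The "true" population values quoted in the simulated examples follow from the standard moment formulas for a lognormal distribution LN(mu, sigma): the mean is exp(mu + sigma^2/2), the CV is sqrt(exp(sigma^2) - 1), and the standard deviation is their product. The short sketch below (the function name is hypothetical) evaluates these for the three values of sigma used in Examples 4.2 through 4.5, which shows how quickly the CV grows as sigma increases.

```python
import math


def lognormal_moments(mu, sigma):
    """Return (mean, standard deviation, CV) of LN(mu, sigma)."""
    mean_ln = math.exp(mu + sigma ** 2 / 2)
    cv = math.sqrt(math.exp(sigma ** 2) - 1)
    return mean_ln, mean_ln * cv, cv


for sigma in (1.0, 1.5, 1.7):
    m, sd, cv = lognormal_moments(5.0, sigma)
    print(f"LN(5, {sigma}): mean = {m:.2f}, sd = {sd:.2f}, CV = {cv:.2f}")
```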

Example 4.3. Simulated sample  of size n  =  15
from a lognormal distribution, LN(5, 1.5).

In this example, n = 15 observations were generated
from the lognormal distribution, LN(5, 1.5), with the
following true values of population parameters: $\mu_1$ =
457.14, $\sigma_1$ = 1331.83, CV = 2.91. The generated data
are:
440.8517, 1013.4986, 1857.7698, 500.9632, 397.9905,
110.7144, 196.2847, 128.2843, 1529.9753, 5.7978,
940.8903, 597.5925, 1519.5159, 181.6512, 52.8952.
The sample mean, standard deviation, and CV of the
log-transformed data are:
$\bar{y}$ = 5.761, $s_y$ = 1.536, and $CV_y$ = 0.27.
The sample mean, standard deviation, and CV for the
raw data are:
$\bar{x}$ = 631.65, $s_x$ = 603.13, and $CV_x$ = 0.96.
For a lognormal distribution, the estimates of $\mu_1$, $\sigma_1$, and
standard error of the mean, based on MVUE theory, are
894.76, 1784.95, and 405.79, respectively. The MLEs of
$\mu_1$, $\sigma_1$, and CV are 1033.63, 3202.28, and 3.10,
respectively. Estimates of the 80th, 90th, and 95th
percentiles of the lognormal distribution are 1163.05,
2286.63, and 3975.71, respectively.

Discussion of Example 4.3

The values of the 95% UCL obtained from the
methods discussed above, without using lognormal
theory, are:

Jackknife               905.88
Standard Bootstrap      882.82
Pivotal Bootstrap       977.18
CLT                     887.82
Adjusted CLT            919.81
Chebychev              1327.75

The values of the 95% UCL obtained from the
methods discussed above, calculated from
lognormal theory, are:

Jackknife              1534.94
Standard Bootstrap     1363.26
Chebychev              2708.63
H-UCL                  4570.27

       As in the case of Example 4.1, the 95% H-
UCL (4570.27) again exceeds the estimated 95th
percentile  of the  lognormal distribution.   The
situation with the CV is similar to that of Example
4.2.  That is, the  CV  computed from raw data
(0.96) is less than  1, which by application of the
CV-test could lead one to adopt (incorrectly) the
normal distribution.  Notice that the true  CV and
the estimate based on the MLEs are both  close to
three.   The next example  involves  the  same
population but with a larger sample size.

Example  4.4. Simulated sample of  size n = 31
from a lognormal distribution, LN(5, 1.5).

In this simulated example, n = 31 observations were
generated from a lognormal distribution, LN(5, 1.5). This
is the same distribution used in the previous example, and
thus the true mean, standard deviation, and CV are the
same. The generated data are:
49.0524, 806.8449, 122.2339, 697.7315, 2888.1238,
37.7998, 7.2799, 292.5909, 433.4413, 639.7468,
3876.8206, 1376.8859, 197.8634, 93.0379, 180.9311,
1817.9912, 284.3526, 344.6761, 44.8680, 297.3899,
11.9195, 100.5519, 264.7574, 41.3961, 43.4202,
1053.3770, 2067.0361, 132.2938, 75.9661, 53.2236,
83.5585.
The sample mean, standard deviation, and CV of the
log-transformed data are:
$\bar{y}$ = 5.326, $s_y$ = 1.577, and $CV_y$ = 0.30.
The sample mean, standard deviation, and CV for the raw
data are:
$\bar{x}$ = 594.10, $s_x$ = 919.05, and $CV_x$ = 1.55.
For a lognormal distribution, the estimates of $\mu_1$, $\sigma_1$, and
the standard error of the mean are 657.45, 1632.25,
and 238.86, respectively. The MLEs of $\mu_1$, $\sigma_1$, and CV
are 713.34, 2369.11, and 3.32, respectively. Estimates of the 80th,
90th, and 95th percentiles of a lognormal distribution
are 779.73, 1560.71, and 2753.62, respectively.

                Discussion of Example 4.4

The values of the 95% UCL obtained from the
methods discussed above, without using
lognormal theory, are:

Jackknife               874.22
Standard Bootstrap      854.51
Pivotal Bootstrap      1003.00
CLT                     865.64
Adjusted CLT            932.36
Chebychev              1331.95

The values of the 95% UCL obtained from the
methods discussed above, calculated from
lognormal theory, are:

Jackknife              1062.35
Standard Bootstrap     1088.94
Chebychev              1725.15
H-UCL                  1792.54

    As one might expect with a larger sample size
(n = 31), the point estimates tend to be closer to
the true parameter values they are intended to
estimate. Also, there is not as much variation
among the UCLs computed from the different
methods. Furthermore, the H-UCL is below the
estimated 95th percentile of the lognormal
distribution.

    In the next example, a sample of size n = 15 is
considered again, but with the variance of the log-
transformed variable slightly larger than that of
Examples 4.2-4.4.

Example 4.5. Simulated sample of size n = 15
from a lognormal distribution, LN(5, 1.7).

This last simulated data set of size n = 15 is obtained
from LN(5, 1.7), with the following true values of
population parameters: $\mu_1$ = 629.55, $\sigma_1$ = 2595.18, CV
= 4.12.
The generated data are:
16.5197, 235.4977, 1860.4443, 74.5825, 3.9684,
325.2712, 167.7949, 189.0130, 1307.6180, 878.8519,
35.4675, 96.2498, 229.2540, 182.0494, 1498.6146.
The sample mean, standard deviation, and CV of the
log-transformed data are:
$\bar{y}$ = 5.178, $s_y$ = 1.710, $CV_y$ = 0.33.
The sample mean, standard deviation, and CV for the raw
data are:
$\bar{x}$ = 473.41, $s_x$ = 606.79, $CV_x$ = 1.28.
For a lognormal distribution, the estimates of $\mu_1$, $\sigma_1$, and
the standard error of the mean, based on MVUE theory,
are 629.82, 1473.12, and 319.0, respectively. The MLEs
of $\mu_1$, $\sigma_1$, and CV are 765.52, 3213.52, and 4.20,
respectively. Estimates of the 80th, 90th, and 95th
percentiles for a lognormal distribution are 752.50,
1596.91, and 2955.58, respectively.

Discussion of Example 4.5

The values of the 95% UCL obtained from the
methods discussed above, without using lognormal
theory, are:

Jackknife               749.31
Standard Bootstrap      721.07
Pivotal Bootstrap       862.51
CLT                     731.14
Chebychev              1173.74

The values of the 95% UCL obtained from the four
methods discussed above, calculated from
lognormal theory, are:

Jackknife              1176.39
Standard Bootstrap     1141.95
Chebychev              2059.47
H-UCL                  4613.32

    Notice that in this example (as with Examples
4.1 and 4.3), the 95% H-UCL (4613.32) exceeds
the estimated  95th percentile (2955.58)  of the
lognormal distribution.

    The sample size and the mean of the log-
transformed variable in Examples 4.2, 4.3, and 4.5
are held constant at 15 and 5, respectively,
whereas the standard deviations (sd) of the log-
transformed variable are 1.0, 1.5, and 1.7,
respectively. From these examples alone, it can
be seen that as soon as the sd of the log-
transformed variable becomes greater than 1.0, the
H-statistic-based UCL becomes orders of magnitude
higher than the largest concentrations
observed, even when the data were obtained from
a lognormal population. Thus, even though the
H-UCL is theoretically sound and possesses optimal
properties for truly lognormal populations such as
being MVUE, the practical merit of the use of the
H-UCL in environmental applications is
questionable when the sd of the log-transformed
variable starts exceeding 1.0. This is especially
true for small sample sizes (e.g., n < 30). As seen
in the examples discussed here, the use of the
lognormal distribution and the H-UCL in some
circumstances tends to hide contamination rather
than find it, which is contrary to one of the main
objectives in many environmental applications.
Indeed, under the assumption of a lognormal
distribution, one can get away with very little or
no cleanup at a polluted site (Bowers, Neil, and
Murphy 1994).

Example 4.6.  Data from the Naval Construction
Battalion Center (NCBC) Superfund Site in Rhode
Island.

Inorganic analyses were performed on the groundwater
samples from seventeen (17) wells from the NCBC Site.
The main objective was to provide reliable estimates of
the mean background threshold levels for the various
inorganic contaminants at the site. The UCLs have
been computed using the procedures described above.
The results for two of the contaminants, aluminum and
manganese, are summarized below.
Aluminum: 290, 113, 264, 2660, 586, 71, 527, 163,
107, 71, 5920, 979, 2640, 164, 3560, 13200, 125.
The sample mean, standard deviation, and CV of the
log-transformed data are:
$\bar{y}$ = 6.226, $s_y$ = 1.659, $CV_y$ = 0.27.
The sample mean, standard deviation, and CV for the
raw data are:
$\bar{x}$ = 1849.41, $s_x$ = 3351.27, $CV_x$ = 1.81.

With the lognormal assumption, the estimates of $\mu_1$, $\sigma_1$,
and the standard error of the mean, based on MVUE
theory, for aluminum are 1704.84, 3959.87, and 807.64,
respectively. The MLEs of $\mu_1$, $\sigma_1$, and CV are 2002.71,
7676.37, and 3.83, respectively. Estimates of the 80th,
90th, and 95th percentiles for a lognormal distribution are
2054.44, 4263.44, and 7747.81, respectively.
Manganese: 15.8, 28.2, 90.6, 1490, 85.6, 281, 4300,
199, 838, 777, 824, 1010, 1350, 390, 150, 3250, 259.
The sample mean, standard deviation, and CV of the
log-transformed data are:
$\bar{y}$ = 5.91, $s_y$ = 1.568, $CV_y$ = 0.27.
The sample mean, standard deviation, and CV for the
raw data are:
$\bar{x}$ = 902.25, $s_x$ = 1189.49, $CV_x$ = 1.32.
With the lognormal assumption, the estimates of $\mu_1$, $\sigma_1$,
and the standard error of the mean, based on MVUE
theory, for manganese are 1100.92, 2340.72, and
490.16, respectively. The MLEs of $\mu_1$, $\sigma_1$, and CV are
1262.59, 4125.5, and 3.27, respectively. Estimates of
the 80th, 90th, and 95th percentiles for a lognormal
distribution are 1389.65, 2769.95, and 4870.45,
respectively.
The calculated Shapiro-Wilks statistics for the raw data
are 0.594 (aluminum) and 0.725 (manganese), and for
the log-transformed data, the corresponding values are
0.913 and 0.969. The tabulated critical value for 0.10
level of significance is 0.91. Thus, for both aluminum
and manganese, the data failed the normality test and
passed the lognormality test at significance level 0.10
(Note: Shapiro-Wilks is a lower tail test).
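
The lower-tail decision rule used above can be written out explicitly. The sketch below (the function name is hypothetical) only compares the statistics quoted in the text with the quoted critical value; it does not compute the Shapiro-Wilks statistic itself.

```python
def passes_shapiro_wilks(w_statistic, critical_value):
    """Lower-tail rule: the fit is rejected when W falls below the critical value."""
    return w_statistic >= critical_value


CRITICAL_VALUE = 0.91  # tabulated value quoted in the text (0.10 level)

# Statistics quoted in the text for the NCBC data.
w_statistics = {
    ("aluminum", "raw"): 0.594,
    ("manganese", "raw"): 0.725,
    ("aluminum", "log-transformed"): 0.913,
    ("manganese", "log-transformed"): 0.969,
}

for (contaminant, scale), w in w_statistics.items():
    verdict = "not rejected" if passes_shapiro_wilks(w, CRITICAL_VALUE) else "rejected"
    print(f"{contaminant:10s} {scale:16s} W = {w:.3f}: {verdict}")
```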
Discussion of Example 4.6

The values of the 95% UCL obtained from the
methods discussed above, without using lognormal
theory, are:

                     Aluminum    Manganese
Jackknife             3268.22      1405.83
Standard Bootstrap    3125.56      1354.15
Pivotal Bootstrap     5286.63      1968.03
CLT                   3186.47      1376.82
Adjusted CLT          3675.94      1503.84
Chebychev             5482.64      2191.81

    Observe that for both of the contaminants, the
95% UCLs calculated from the Jackknife, both
bootstrap methods, the CLT, the adjusted CLT,
and the Chebychev limit are well below their
respective estimates of the 95th percentile
(Aluminum: 7747.81 and Manganese: 4870.45) of
assumed (based on Shapiro-Wilks' test) lognormal
distributions.

The values of the 95% UCL obtained from the
methods discussed above, calculated from
lognormal theory, are:

                     Aluminum    Manganese
Jackknife             3283.34      1889.52
Standard Bootstrap    3663.20      1821.55
Chebychev             5314.99      3291.95
H-UCL                 9102.73      5176.16

    Observe that the 95% UCLs calculated using
lognormal theory from the Jackknife, the
bootstrap, and the Chebychev inequality are
similar to the respective values obtained without
using lognormal theory, and that these are well
below their respective estimated 95th percentiles
for a lognormal distribution. The 95% UCLs
calculated from the H-statistic, however, exceed
their respective estimated 95th percentiles for a
lognormal distribution.

Example 4.7.   Data from the  Elrama School
Superfund  site  in Washington County, PA.

The data were compiled from two waste piles for risk
evaluations of the contaminants found at the Elrama
School Superfund Site, Washington County, PA.
Twenty-six (26) contaminants (10 inorganics, 12 semi-
volatile compounds, and 4 volatile compounds) were
detected in both of the waste piles. Using the
nonparametric Kolmogorov-Smirnov two-sample test on
the two waste piles, it was concluded that there is no
statistically significant difference between distributions
of the contaminants from the two waste piles. Thus, the
data from these two waste piles were combined to
compute all of the relevant statistics such as the mean,
the standard deviation, and the UCLs. This resulted in
data sets consisting of 23 observations (15 from Waste
Pile 1 and 8 from Waste Pile 2). The results are
provided for two of the contaminants of concern:
aluminum and toluene.

Aluminum: 31900.0, 8030.0, 12200.0, 11300.0,
4770.0, 5730.0, 5410.0, 8420.0, 8200.0, 9010.0,
8600.0, 9490.0, 9530.0, 7460.0, 7700.0, 13700.0,
30100.0, 7030.0, 2730.0, 5820.0, 8780.0, 360.0,
7050.0.
The sample mean, standard deviation, and CV of the
log-transformed data are:
$\bar{y}$ = 8.927, $s_y$ = 0.845, $CV_y$ = 0.095.
The sample mean, standard deviation, and CV for the
raw data are:
$\bar{x}$ = 9709.57, $s_x$ = 7310.02, $CV_x$ = 0.75.
With the lognormal assumption, the estimates of $\mu_1$, $\sigma_1$,
and the standard error of the mean, based on MVUE
theory, for aluminum are 10552.68, 10031.60, and
2044.90, respectively. The MLEs of $\mu_1$, $\sigma_1$, and CV are
10768.22, 10993.32, and 1.02, respectively. Estimates
of the 80th, 90th, and 95th percentiles for a lognormal
distribution are 15323.48, 22224.45, and 30381.95,
respectively.
Toluene: 7300.0, 6.0, 6.0, 5.5, 29000.0, 46000.0,
12000.0, 2500.0, 1300.0, 3.0, 510.0, 230.0, 63.0, 6.0,
5.5, 6.0, 6.0, 5.5, 280000.0, 8.0, 28.0, 6.0, 7.0.
The sample mean, standard deviation, and CV of the
log-transformed data are:
$\bar{y}$ = 4.652, $s_y$ = 3.660, $CV_y$ = 0.79.
The sample mean, standard deviation, and CV for the
raw data are:
$\bar{x}$ = 16478.33, $s_x$ = 58510.78, $CV_x$ = 3.55.
With the lognormal assumption, the estimates of $\mu_1$, $\sigma_1$,
and the standard error of the mean, based on MVUE
theory, for toluene are 21328.39, 362471.55, and
18788.05, respectively. The MLEs of $\mu_1$, $\sigma_1$, and CV are
84702.17, 68530556.56, and 809.08, respectively.
Estimates of the 80th, 90th, and 95th percentiles for a
lognormal distribution are 2264.17, 11329.16, and
43876.88, respectively.
The Shapiro-Wilks statistics for the raw data are 0.707
(aluminum) and 0.313 (toluene), and for the log-
transformed data, the corresponding values are 0.781
and 0.818. The tabulated critical value for a 0.10 level of
significance with n = 23 is 0.928. Thus, neither a normal
nor a lognormal distribution gives a good fit.
Discussion of Example 4.7

The  values of the 95% UCL obtained from the
methods discussed above, without using lognormal
theory, are:

                      Aluminum       Toluene
Jackknife             12327.40      37431.95
Standard Bootstrap    12246.67      33494.25
Pivotal Bootstrap     15161.90     152221.00
CLT                   12216.95      36547.89
Adjusted CLT          12895.10      47316.80
Chebychev             16522.94      71013.85

The values of the 95% UCL obtained from the four
methods  discussed  above,   calculated  from
lognormal theory, are:

                      Aluminum       Toluene
Jackknife             13542.11      62263.37
Standard Bootstrap    13579.18     278888.51
Chebychev             19693.40     105757.50
H-UCL                 16503.51   18444955.15

    Observe  that the  95% UCL for toluene,
calculated  from  the  H-statistic, is  orders  of
magnitude higher than those calculated from the
other methods, and is also orders of magnitude
higher than  the maximum  observed  toluene
concentration at the site.  Also with the toluene
data,  the pivotal bootstrap method results in a
UCL  which is two to five times larger than the
others computed from the non-lognormal theory
methods.  It  is even larger than the Chebychev
limit. As noted earlier, this is possible when the
standard error of the  point  estimate  is  also
estimated from the data.  In most environmental
applications,   the  true   population   standard
deviation of the point estimate  is unknown, and
therefore,  it  needs to be estimated  from  the
available data. Note, however, it is two orders of
magnitude smaller than the H-UCL.

    Note, also, that the CV (0.75) computed from
the raw data for aluminum is less than 1. The use
of the CV-test for normality could lead  one to
assume normality, even though the Shapiro-Wilks
test strongly  rejects the normal distribution (p-
value = 0.00002).
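Several of the limits that do not rely on lognormal theory share one pattern: the sample mean plus a multiple of an estimated standard error. The Python sketch below illustrates that pattern for the CLT, Chebychev, and standard bootstrap limits, and checks numerically that the jackknife standard error of the mean coincides with s/sqrt(n). It is only a rough illustration: the data are hypothetical, the function names are invented here, and the Chebychev multiplier is assumed to be the k satisfying 1/k^2 = alpha, which may differ from the exact form used for the tabulated values.

```python
import math
import random
import statistics
from statistics import NormalDist


def standard_error(x):
    """Usual estimate of the standard error of the mean, s / sqrt(n)."""
    return statistics.stdev(x) / math.sqrt(len(x))


def jackknife_se(x):
    """Jackknife standard error of the mean from leave-one-out means."""
    n = len(x)
    total = sum(x)
    loo_means = [(total - xi) / (n - 1) for xi in x]
    center = statistics.fmean(loo_means)
    return math.sqrt((n - 1) / n * sum((m - center) ** 2 for m in loo_means))


def clt_ucl(x, alpha=0.05):
    """UCL from the normal approximation to the sample mean."""
    z = NormalDist().inv_cdf(1 - alpha)
    return statistics.fmean(x) + z * standard_error(x)


def chebychev_ucl(x, alpha=0.05):
    """Conservative UCL; assumes the multiplier k with 1 / k**2 = alpha."""
    k = math.sqrt(1 / alpha)
    return statistics.fmean(x) + k * standard_error(x)


def standard_bootstrap_ucl(x, alpha=0.05, n_boot=2000, seed=0):
    """UCL using the bootstrap estimate of the standard error of the mean."""
    rng = random.Random(seed)
    n = len(x)
    boot_means = [statistics.fmean(rng.choices(x, k=n)) for _ in range(n_boot)]
    z = NormalDist().inv_cdf(1 - alpha)
    return statistics.fmean(x) + z * statistics.stdev(boot_means)


# Hypothetical right-skewed concentrations (not data from the paper).
sample = [12.0, 15.5, 9.8, 40.2, 7.1, 22.3, 110.0, 18.4, 5.6, 30.9]

ucls = {
    "CLT": clt_ucl(sample),
    "Chebychev": chebychev_ucl(sample),
    "Standard Bootstrap": standard_bootstrap_ucl(sample),
}

for name, value in ucls.items():
    print(f"{name:20s} {value:10.2f}")
```

Because the Chebychev multiplier exceeds the normal quantile at the same level, the Chebychev limit is always the larger of the two for a given sample, in line with its description as conservative.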
5. SUMMARY AND RECOMMENDATIONS

    It is seen from the simulated examples that,
even when the underlying distribution is
lognormal, the Jackknife, bootstrap, and CLT
procedures perform better than the H-UCL, in the
sense that they yield lower UCLs that still exceed
the true mean. In each of the four simulation
experiments, the 95% UCLs computed from all of
the above methods exceed the true respective
population means, but the 95% H-UCL is
consistently larger than the 95% UCLs obtained
from the other methods, except in some cases
where it is comparable to the conservative
Chebychev result.  It is also seen from the
simulation examples that the estimate of the CV
based on the MLEs is closer to the true CV than
the usual (moment) estimate of the CV.
Furthermore, the usual estimate of the CV
appears to underestimate the true CV.  In some of
the examples, the usual estimate of the CV is less
than 1, while the true population CV is somewhat
greater than 1.  That is, the rule of thumb (CV-test),
which declares the distribution to be normal when
the moment estimate of the CV is less than 1, can
frequently lead to an incorrect assumption about
the underlying distribution of the data.
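The behavior of the two CV estimates can be explored with a small simulation. The Python sketch below is an illustration under assumed settings (log-scale standard deviation 1, sample size 20, 2000 replications, fixed seed), none of which are taken from the paper's simulation experiments. It compares the moment estimate s/mean with the MLE-based estimate sqrt(exp(s_y^2) - 1), where s_y^2 is the MLE of the variance of the log-transformed data.

```python
import math
import random
import statistics


def moment_cv(x):
    """Usual (moment) estimate of the CV: sample sd divided by sample mean."""
    return statistics.stdev(x) / statistics.fmean(x)


def mle_cv(x):
    """Lognormal MLE of the CV, based on the variance of the logged data."""
    logs = [math.log(v) for v in x]
    s2 = statistics.pvariance(logs)  # MLE of the variance (divides by n)
    return math.sqrt(math.exp(s2) - 1.0)


def simulate(mu=0.0, sigma=1.0, n=20, reps=2000, seed=1):
    """Average both CV estimates over repeated lognormal samples."""
    rng = random.Random(seed)
    true_cv = math.sqrt(math.exp(sigma ** 2) - 1.0)
    moment_vals, mle_vals = [], []
    for _ in range(reps):
        x = [rng.lognormvariate(mu, sigma) for _ in range(n)]
        moment_vals.append(moment_cv(x))
        mle_vals.append(mle_cv(x))
    # Share of samples that the CV rule of thumb would call "normal".
    frac_below_one = sum(v < 1.0 for v in moment_vals) / reps
    return (
        true_cv,
        statistics.fmean(moment_vals),
        statistics.fmean(mle_vals),
        frac_below_one,
    )


true_cv, avg_moment, avg_mle, frac_below_one = simulate()
print(f"true CV              : {true_cv:.3f}")
print(f"average moment CV    : {avg_moment:.3f}")
print(f"average MLE-based CV : {avg_mle:.3f}")
print(f"share with moment CV < 1: {frac_below_one:.2f}")
```

Changing `sigma` or `n` in the call to `simulate` gives a rough sense of how often the moment estimate falls below 1 for a population whose true CV exceeds 1.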

    Moreover, from the examples discussed in this
paper, it is observed that the H-UCL becomes orders
of magnitude higher than the other UCLs even when
the data were obtained from a lognormal population,
and can lead to incorrect conclusions. This is
especially true for samples of smaller sizes (e.g.,
<30). It appears that the lognormal distribution and
the H-UCL tend to hide contamination rather than
revealing it. Under the assumption of the lognormal
distribution, one can get away with very little or no
cleanup at a polluted site.  Thus, although the
H-UCL is theoretically sound and possesses optimal
properties, the practical merit of the H-UCL in
environmental applications is questionable, as it
becomes orders of magnitude higher than the largest
observed concentration when the sd of the
log-transformed data starts exceeding 1.0.  It is,
therefore, recommended that in environmental
applications the use of the H-UCL to obtain an
estimate of the upper confidence limit of the mean
be avoided.
    Based on the Monte Carlo simulation results,
and the authors' experience with Superfund site
work, the following steps for computing a UCL of
the mean of the contaminant(s) of concern are
recommended:

1)  Plot histograms of the observed contaminant
    concentrations and perform a statistical test of
    normal or lognormal distribution (e.g., the
    Shapiro-Wilks test). Do not use the rule of
    thumb that declares the data distribution to be
    normal if CV is less than 1.
2)  If a normal distribution provides an adequate
    fit to the data, then use the Student's t approach
    (equivalent to the jackknife) for calculating the
    UCL of the population mean.
3)  If a lognormal distribution provides an
    adequate fit to the data, then a) use the
    lognormal theory based formulas for
    computing the MVUE of the population mean
    and the standard deviation, b) either use these
    MVUEs with the jackknife or bootstrap
    methods to calculate a UCL of the mean, or
    use the Chebychev approach for calculating a
    UCL. Do not use the UCL based on the
    H-statistic, especially if the number of samples
    is less than 30.
4)  If the data distribution turns out to be neither
    normal nor lognormal, then use the
    nonparametric versions of the jackknife or
    bootstrap to calculate a UCL.  Even if the
    lognormal distribution seems to provide a
    reasonable fit to the data, if there is
    evidence of a mixture of two or more
    subpopulations, or if outliers are suspected,
    then using one of the nonparametric methods
    discussed above is recommended.

NOTICE

    The U.S. Environmental Protection Agency
(EPA), through its Office of Research and
Development (ORD), funded and prepared this
Issue Paper.  It has been peer reviewed by the
EPA and approved for publication. Mention of
trade names or commercial products does not
constitute endorsement or recommendation by
EPA for use.

                                         References
Aitchison, J., and Brown, J.  A.  C. (1976), The
   Lognormal   Distribution,   Cambridge:
   Cambridge University Press.
Bain,  L.  J.  and   Engelhardt,  M.   (1992),
   Introduction to Probability and Mathematical
   Statistics, Boston: Duxbury Press.
Bowers,  T., Neil, S.,  and Murphy, B.  (1994),
   "Applying Hazardous Waste Site  Cleanup
   Levels: A Statistical Approach to  Meeting
   Site   Cleanup   Goals  on  Average."
   Environmental Science and Technology.
Bradu, D., and Mundlak, Y. (1970), "Estimation
   in Lognormal Linear Models," Journal of the
   American Statistical Association, 65, 198-
   211.
Chen, L. (1995), "Testing the Mean of Skewed
   Distributions," Journal  of the  American
   Statistical Association, 90, 767-772.
Efron,  B. (1981),  "Nonparametric Estimates of
   Standard Error: The Jackknife, the Bootstrap,
   and Other Resampling Plans," Biometrika.
Efron,  B. (1982),  The Jackknife, the Bootstrap,
   and Other Resampling Plans, Philadelphia:
   SIAM.
Efron,  B., and Gong, G. (1983), "A Leisurely
   Look at the Bootstrap, the  Jackknife, and
   Cross-Validation," The American Statistician,
   37, 36-48.
EPA (1992), "Supplemental Guidance to RAGS:
   Calculating  the  Concentration   Term,"
   Publication 9285.7-081, May 1992.
Finney, D.  J. (1941), "On the Distribution of a
   Variate  Whose  Logarithm  is  Normally
   Distributed," Journal of the Royal Statistical
   Society, 7, 155-161.
Gilbert, R. O. (1987), Statistical Methods for
   Environmental Pollution Monitoring, New
   York: Van Nostrand Reinhold.
Gilbert, R. O. (1993), "Comparing Statistical Tests
   for Detecting Soil Contamination Greater Than
   Background," Pacific Northwest Laboratory,
   Technical Report No. DE 94-005498.
Hall, P. (1988), "Theoretical Comparison of
    Bootstrap Confidence Intervals," Annals of
    Statistics, 16, 927-953.
Hogg, R.V., and Craig, A.T. (1978), Introduction
    to  Mathematical   Statistics,  New  York:
    Macmillan Publishing Company.
Land, C. E. (1971), "Confidence Intervals for
    Linear Functions of the Normal Mean and
    Variance," Annals of Mathematical Statistics,
    42, 1187-1205.
Land, C. E. (1975), "Tables of Confidence Limits
    for Linear Functions of the Normal Mean and
    Variance,"   in    Selected  Tables  in
    Mathematical Statistics, Vol. III, American
    Mathematical Society, Providence, R.I., 385-
    419.
Lechner,  J.A.  (1991),  "Estimators  for Type-II
    Censored (Log) Normal  Samples,"  IEEE
    Transactions on Reliability, 40, 547-552.
Miller, R. (1974), "The Jackknife - A Review,"
    Biometrika, 61, 1-15.
Power, M. (1992), "Lognormality in the Observed
    Size Distribution of Oil and Gas Pools as a
    Consequence   of  Sampling  Bias,"
    Mathematical Geology, 24, 929-945.
Singh, A., and Nocerino, J. M. (1995), "Robust
    Procedures for the Identification of Multiple
    Outliers," Chemometrics in Environmental
    Chemistry, Statistical Methods, Vol. 2,
    Part G, 229-277, Springer Verlag, Germany.
Staudte, R. G., and Sheather, S. J. (1990), Robust
    Estimation and  Testing, New York:  John
    Wiley & Sons.
Stewart,  S.   (1994),  "Use  of  Lognormal
    Transformations in Environmental Statistics,"
    M.S.  Thesis,  Department of Mathematics,
    University of Nevada, Las Vegas.
Ott, W. (1990), "A Physical Explanation  of the
    Lognormality of Pollutant Concentrations,"
    Journal of the Air and Waste Management Association, 40,
    1378-1383.