United States Environmental Protection Agency
Office of Research and Development
Office of Solid Waste and Emergency Response

EPA/600/S-97/006
December 1997

EPA    Technology Support Center Issue
The Lognormal Distribution in Environmental Applications

Ashok K. Singh¹, Anita Singh², and Max Engelhardt³
The Technology Support Project's Technology Support Center (TSC) for Monitoring and Site Characterization was established in 1987 as a result of an agreement between the Office of Research and Development (ORD), the Office of Solid Waste and Emergency Response (OSWER), and all ten Regional Offices. The objectives of the Technology Support Project and the TSC were to make ORD's state-of-the-science contaminant characterization technologies and expertise available to Regional staff, to facilitate the evaluation and application of site characterization technologies at Superfund and RCRA sites, and to improve communications between the Regions and ORD Laboratories. The TSC identified a need to provide federal, state, and private environmental scientists working on hazardous waste sites with a technical issue paper that identifies data assessment applications that can be implemented to better define and identify the distribution of hazardous waste site contaminants. The examples given in this issue paper and the recommendations provided are the result of numerous data assessment approaches performed by the TSC at hazardous waste sites. Mr. John Nocerino provided guidance and suggestions that greatly enhanced the quality of this issue paper.
This paper was prepared by A. K. Singh, A. Singh, and M. Engelhardt. Support for this project was provided by the EPA National Exposure Research Laboratory's Environmental Sciences Division with the assistance of the Superfund Technical Support Project's Technology Support Center for Monitoring and Site Characterization, OSWER's Technology Innovation Office, the U.S. DOE Idaho National Engineering and Environmental Laboratory, and the Associated Western Universities Faculty Fellowship Program. For further information, contact Ken Brown, Technology Support Center Director, at (702) 798-2270, A. K. Singh at (702) 895-0364, A. Singh at (702) 897-3234, or M. Engelhardt at (208) 526-2100.

              Purpose and Scope

The purpose of this issue paper is to provide guidance to environmental scientists regarding the interpretation and statistical assessment of data collected from sites contaminated with inorganic and organic contaminants. Contaminant concentration data from such sites quite often appear to follow a skewed probability distribution. The lognormal distribution is frequently used to model positively skewed contaminant concentration distributions.
¹Department of Mathematics, University of Nevada, Las Vegas, NV 89154
²Lockheed Martin Environmental Systems & Technologies, 980 Kelly Johnson Dr., Las Vegas, NV 89119
³Lockheed Martin Idaho Technologies, P.O. Box 1625, Idaho Falls, ID 83415-3730

Technology Support Center for Monitoring and Site Characterization, National Exposure Research Laboratory, Environmental Sciences Division, Las Vegas, NV 89193-3478

Technology Innovation Office, Office of Solid Waste and Emergency Response, U.S. EPA, Washington, D.C.

Walter W. Kovalick, Jr., Ph.D., Director

The H-statistic-based Upper Confidence Limit (UCL) for the mean of a lognormal population is recommended by U.S. EPA guidance documents (see, for example, EPA (1992)) and is widely used to make remediation decisions at Superfund sites. However, recent work in environmental statistics has cast doubt on the performance of the H-statistic formula for computing an upper confidence limit of the mean of a lognormal population. This issue paper is mainly concerned with the problem of computing an upper confidence limit when the contaminant concentration distribution appears to be highly skewed.

Several approaches to computing upper confidence limits for the mean of a lognormal population are considered. The approaches discussed include those based on the H-statistic, the jackknife method, the bootstrap method, and a method based on the Chebychev inequality. Simulated examples show that, for values of the coefficient of variation larger than 1, the upper confidence limits for the mean contaminant concentration based on the H-statistic are much higher than the upper confidence limits obtained by the other estimation methods, which may result in an unnecessary cleanup. In other words, the use of the jackknife method, the bootstrap method, or the Chebychev inequality method provides better input to risk assessors and may result in a significant reduction in remediation costs. This is especially true when the number of samples is thirty or less. When the value of the coefficient of variation exceeds 1, upper confidence limits based on any of the other estimation procedures appear to be more stable and reliable than those based on the H-statistic. Values of the coefficient of variation computed from observed contaminant concentrations are typically used by environmental scientists to assess the normality of the population distribution. This issue paper therefore also addresses the use of the coefficient of variation in environmental data analysis and discusses the problem of estimating the coefficient of variation when sampling from lognormal populations.
This issue paper is divided into the following major sections: (1) Introduction, (2) The Lognormal Distribution, (3) Methods of Computing a UCL of the Mean, (4) Examples, and (5) Summary and Recommendations.

1. Introduction

Most of the procedures available in the environmental statistics literature for computing a UCL of a population mean assume that the contaminant concentration data are approximately normally distributed. However, the distributions of contaminant concentration data from Superfund sites typically are positively skewed and are usually modeled by the lognormal distribution. This apparent skewness, however, may be due to biased sampling, multiple populations, or outliers, and not necessarily due to lognormally distributed data.

Biased sampling is often used in site characterization (Power, 1992). Another situation common with environmental data is a mixed distribution of several subpopulations (see Figure 1). Also, the presence of one or more outliers, spurious observations, or anomalies can result in a data set which appears to come from a highly skewed
distribution.

Figure 1. A site with several sources of contamination (legend: background, and areas with moderately high, elevated, and extremely high levels of contaminant concentration).

When dealing with a skewed distribution, statisticians sometimes recommend using the population median (instead of the population mean) as a measure of central tendency. However, remediation decisions at a polluted site typically are made on the basis of the population mean, and therefore a UCL of the mean of the concentration distribution is needed. For positively skewed distributions, the median is smaller than the mean; a UCL for the median therefore provides an inappropriate basis for a decision about the mean.
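The size of this gap follows directly from equations (1) and (2) of Section 2; the ratio of the lognormal mean to the lognormal median depends only on σ², as this short derivation (added here for emphasis) shows:

\[
\frac{\text{mean}}{\text{median}} = \frac{\exp(\mu + \sigma^2/2)}{\exp(\mu)} = \exp(\sigma^2/2) > 1 \quad \text{whenever } \sigma^2 > 0.
\]

For σ² = 1.5, as in Examples 4.3 and 4.4, the mean is exp(0.75) ≈ 2.1 times the median.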

  U.S. EPA guidance documents recommend the
use of H-statistics to  compute the UCL of the
mean of a lognormal distribution (EPA, 1992). A
detailed  discussion of H-statistics is given in
Gilbert (1987).  For data sets with  nondetects,
estimation methods developed for censored data
from a lognormal distribution are discussed by
Lechner (1991). The use of the lognormal distribution has been controversial because it can lead to incorrect decisions. For example, recent work of Gilbert (1993) indicates that statistical tests of hypotheses based on H-statistics can yield unusually high false positive rates, which would result in unnecessary cleanups. The situation may be reversed when dealing with estimation of the mean background level. If the H-statistic-based method is used to compute a UCL of the mean for the observed background concentrations, then the mean background level may be overestimated, which may result in not remediating a contaminated area of the site. Stewart (1994) also showed that incorrect use of a lognormal distribution may lead to erroneous results.

  Most of the "classical" statistical methods based
on  the  normal  distribution  were   developed
between  1920 and 1950 and have  been well
investigated in the statistical literature.   On the
other  hand, lognormal-based methods have not
received the same level of scrutiny. Furthermore,
the classical methods became popular due to their
computational convenience.   The  1980s  have
produced a new breed of statistical methods based
on the power and availability of computers (see,
for example,  Efron and Gong, 1983). Both the
jackknife and bootstrap methods require a great
deal of computer power, and, therefore, have not
been widely  adopted by environmental  statis-
ticians. However, with the recent advances in
computer   equipment   and  software,
computationally  intensive statistical procedures
have become more practical and accessible.

The authors of this article have critically reviewed several estimation procedures which can be used to compute UCL values, via Monte Carlo simulation. These include the simple arithmetic mean, the Minimum Variance Unbiased Estimate (MVUE), and nonparametric procedures such as the jackknife and the bootstrap. Computer simulation experiments (not included in this paper) were performed for various values of the population standard deviation, or equivalently the Coefficient of Variation (CV), and sample sizes ranging from 10 to 101. It has been demonstrated that, for samples of size 30 or less, the H-statistic-based UCL results in unacceptably high estimates of threshold levels such as the background contamination level. This is especially true for data sets from populations with CV values exceeding 1. For samples of larger sizes, the use of H-statistics can be replaced by UCLs based on nonparametric methods such as the jackknife or the bootstrap. Other well-known results, such as the Central Limit Theorem (CLT) and the Chebychev theorem, may also be used to obtain UCLs. To illustrate problems associated with methods based on lognormal theory, results for some simulated examples and some from Superfund work done by the authors have been included in this paper.

2.  The Lognormal Distribution

The authors briefly describe the lognormal distribution. By definition, contaminant concentration is lognormally distributed if the log-transformed concentrations are normally distributed. This can be stated mathematically as follows: if Y = ln(X) is normally distributed with mean, μ, and variance, σ², then X is said to be lognormally distributed with parameters μ and σ². It should be noted that μ and σ² are not the mean and variance of the lognormal random variable, X; they are the mean and variance of the log-transformed random variable, Y. However, it is common practice to use the same parameters to specify either, and it is convenient to refer to the normal distribution with the abbreviated notation Y ~ N(μ, σ²) and to the lognormal distribution with the abbreviation X ~ LN(μ, σ²). Figure 2, which shows plots of a normal and a lognormal density function with μ = 0 and σ² = 0.5, illustrates the difference between normal and lognormal distributions.
Figure 2. Graphs of normal N(μ = 0, σ² = 0.5) and lognormal LN(μ = 0, σ² = 0.5) density functions.

Figure 3, which shows plots of several lognormal distributions, each with μ = 0, illustrates how varying the parameter σ² can change the amount of skewness.
Figure 3. Graphs of A: LN(μ = 0, σ² = 0.25), B: LN(μ = 0, σ² = 1.0), and C: LN(μ = 0, σ² = 25.0) density functions.
The parameters of interest of a lognormal distribution, LN(μ, σ²), are given as follows:

Mean = μ₁ = exp(μ + 0.5σ²)   (1)

Median = exp(μ)   (2)

Variance = σ₁² = exp(2μ + σ²)[exp(σ²) − 1]   (3)

Coefficient of Variation = CV = σ₁/μ₁ = √(exp(σ²) − 1)   (4)

Skewness = (CV)³ + 3(CV)   (5)

Throughout this paper, irrespective of the underlying distribution, μ₁ and σ₁² represent the mean and variance of the random variable X (in original units), whereas μ and σ² are the mean and variance of its logarithm, given by Y = ln(X). The pth quantile (or 100pth percentile), xₚ, of the distribution of a random variable, X, is defined by the probability statement P(X ≤ xₚ) = p.
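As a quick numerical check of equations (1) through (5), the short sketch below (Python with NumPy; an illustration added here, not part of the original analysis) evaluates the formulas for LN(5, 1). The resulting mean, standard deviation, and CV agree with the true values quoted for this distribution in Example 4.2 (244.69, 320.75, and 1.31).

```python
import numpy as np

mu, sigma2 = 5.0, 1.0                 # parameters of Y = ln(X) for LN(5, 1)
sigma = np.sqrt(sigma2)

mean_x   = np.exp(mu + 0.5 * sigma2)                       # eq. (1): 244.69
median_x = np.exp(mu)                                      # eq. (2): 148.41
var_x    = np.exp(2 * mu + sigma2) * (np.exp(sigma2) - 1)  # eq. (3)
cv_x     = np.sqrt(np.exp(sigma2) - 1)                     # eq. (4): 1.31
skew_x   = cv_x ** 3 + 3 * cv_x                            # eq. (5)
p95      = np.exp(mu + 1.645 * sigma)  # 95th percentile, exp(mu + z_.95 * sigma)

print(mean_x, median_x, np.sqrt(var_x), cv_x, skew_x, p95)
```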
Figure 4. Graphs showing the relative positions of the TRUE MEAN, the 95% UCL, and the 95th percentile.

One of the inherent assumptions required to compute the UCL of the mean is that the data set under consideration comes from a single statistical population (e.g., background only). Violation of this assumption can lead to invalid applications of a statistical technique, with the following consequences. A data set can be put into a statistical procedure (e.g., the Shapiro-Wilk test of normality) or a computer program whether or not the required assumptions are met. It is the user's responsibility to ensure that the underlying assumptions required to conduct the statistical procedure are met. The decisions and conclusions derived from incorrectly used statistics can be expensive. For example, incorrect use of a statistic may lead to wrong conclusions such as: 1) remediation of a clean part of the site, or 2) no remediation of a contaminated part of the site. The first wrong conclusion will result in an unnecessary cleanup, whereas the second may pose a threat to human health and the environment. It is likely that the availability of new and improved statistical software has also increased the misuse of statistical techniques. This is illustrated in the following discussion of applications to some simulated and real data sets. It should be reiterated that it is the analyst's (user's) responsibility to verify that none of the required assumptions are violated before using a statistical test and deriving inferences from the resulting analysis. In many cases, this may warrant expert advice from a statistician.
Often, the central portion of a data set will behave as if it came from a normal distribution. In practice, however, a normally distributed data set with a few extreme (high) observations can be incorrectly modeled by the lognormal distribution, with the lognormal assumption hiding the outliers. Similarly, a mixture of two or more normally distributed data sets with significantly different mean concentrations, such as one coming from the clean background part and another taken from a contaminated part of the site, can also be modeled (incorrectly) by a lognormal distribution. The following example illustrates this point.

Example 2.1. Simulated data set from two populations

A simulated data set of size fifteen (15) has been obtained from a mixture of two normal populations. Ten observations (representing background) were generated from a normal distribution with mean 100 and standard deviation 50, and five observations (representing contamination) were generated from a normal distribution with mean 1000 and standard deviation 100. The mean of this mixture distribution is 400. The generated data are: 180.5071, 2.3345, 48.6651, 187.0732, 120.2125, 87.9587, 136.7528, 24.4667, 82.2324, 128.3839, 850.9105, 1041.7277, 901.9182, 1027.1841, and 1229.9384.
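A data set of this type is easy to generate and test. The sketch below (Python with NumPy/SciPy; an illustration added here, not part of the original study) draws a fresh mixture sample and applies the Shapiro-Wilk test to the raw and log-transformed values, mirroring the checks discussed next. The seed is arbitrary, so the draws will not reproduce the exact values listed above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # arbitrary seed; will not reproduce the data above

# 10 background observations from N(100, 50) and 5 from N(1000, 100)
x = np.concatenate([rng.normal(100, 50, 10), rng.normal(1000, 100, 5)])
x = x[x > 0]          # N(100, 50) can yield negatives; drop them so ln(x) exists

w_raw, p_raw = stats.shapiro(x)          # normality of raw data: typically rejected
w_log, p_log = stats.shapiro(np.log(x))  # normality of ln(x): often NOT rejected
print(f"raw: W = {w_raw:.4f}, p = {p_raw:.4f}")
print(f"log: W = {w_log:.4f}, p = {p_log:.4f}")
```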
Discussion of Example 2.1

The data set in Example 2.1 failed the normality test based on several goodness-of-fit tests, including the Shapiro-Wilk W-test (W = 0.7572) and the Kolmogorov-Smirnov test (K-S = 0.35) (see Figures 5 and 6). However, when these tests were carried out on the log-transformed data, the test statistics were insignificant at the α = 0.05 level of significance, with W = 0.8957 and K-S = 0.168, suggesting that a lognormal distribution (see Figures 7 and 8) provides a reasonable fit to the data. Based upon this test, one might incorrectly conclude that the observed concentrations come from a single background lognormal population. This incorrect conclusion is made quite frequently. This data set is used later to illustrate how modeling the mixture data set by a lognormal distribution will result in incorrect estimates of
mean contamination levels at various parts of the
site.
Figure 5. Histogram of the 15 observations from the mixture population of Example 2.1.

Figure 6. K-S test of normality for the data of Example 2.1 (average: 403.351; std. dev.: 453.94; D+: 0.350, D−: 0.189, D: 0.350; approximate p-value < 0.01).

Figure 7. Histogram of the log-transformed 15 observations from the mixture population of Example 2.1.

Figure 8. K-S test of lognormality for the data of Example 2.1 (average: 5.09021; std. dev.: 1.70569; n = 15; D+: 0.134, D−: 0.168, D: 0.168; approximate p-value > 0.15).
3. Methods of Computing a UCL of the Mean

The main objective of this study is to assess the performance of various methods of estimating the UCL of the mean, μ₁, of positively skewed populations. The assumption of a lognormal distribution to model such populations has become quite popular among environmental scientists (Ott, 1990). As noted in Section 2, for positively skewed data sets there are potential problems in using standard methods based on lognormal theory. Therefore, we compare the lognormal-based methods often used with cleanup standards against some other available methods. The alternative methods considered here have the advantage that they do not require assumptions about the specific form of the population distribution; in other words, they do not assume normality or lognormality of the data set under consideration. In Section 4, the UCL of the mean is computed for several examples using the following methods:

  • The H-statistic
  • The Jackknife procedure
  • The Bootstrap procedure
  • The Central Limit Theorem
  • The Chebychev Theorem

A brief description  of the  computation  of the
various estimates and the associated confidence
limits  obtained  using   the  above-mentioned
procedures follows:

Parametric Lognormal Procedures

Let x₁, x₂, ..., xₙ be a random sample from a lognormal distribution with mean, μ₁, and variance, σ₁², and denote by μ and σ the population mean and population standard deviation (sd), and by ȳ and s_y the sample mean and sample sd, respectively, of the log-transformed data yᵢ = ln(xᵢ), i = 1, 2, ..., n. Specifically,

ȳ = (1/n) Σᵢ₌₁ⁿ yᵢ   (6)

s_y² = (1/(n − 1)) Σᵢ₌₁ⁿ (yᵢ − ȳ)²   (7)
In a more general setting, consider a population with an unknown parameter, θ. The minimum variance unbiased estimate (MVUE) of θ is the one that is not only an unbiased estimate of θ (i.e., the expected value of the estimate is equal to the true value of the parameter), but also has a smaller variance than any other unbiased estimate of θ. When the parameter of interest is the mean, μ₁, of a lognormally distributed population, Bradu and Mundlak (1970) derive its MVUE, which is given by

μ̂₁ = exp(ȳ) gₙ(s_y²/2)   (8)

where gₙ(·) is a function whose form is rather complicated, but an infinite series solution is given by Aitchison and Brown (1976). Tabulations of this function are provided by Gilbert (1987, Table A9). Note that Gilbert uses ψₙ in place of gₙ. This function is also used in computing the MVUE of the variance, σ₁², of a lognormal population, as given by Finney (1941),

σ̂₁² = exp(2ȳ)[gₙ(2s_y²) − gₙ((n − 2)s_y²/(n − 1))].   (9)

Bradu and Mundlak (1970) give the MVUE of the variance of the estimate μ̂₁:

σ̂²(μ̂₁) = exp(2ȳ){[gₙ(s_y²/2)]² − gₙ((n − 2)s_y²/(n − 1))}.   (10)
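For computation, gₙ(z) can be approximated by truncating the infinite series. The sketch below (Python; an illustration, not part of the original paper) uses a series expansion commonly attributed to Aitchison and Brown (1976); the exact series form coded here is an assumption and should be checked against Gilbert (1987, Table A9).

```python
import math
import numpy as np

def g_n(n, z, terms=60):
    """Truncated series approximation to g_n(z) (assumed form; verify against
    Gilbert 1987, Table A9):
    g_n(z) = 1 + (n-1)z/n
           + sum_{i>=2} (n-1)^(2i-1) z^i / (n^i i! (n+1)(n+3)...(n+2i-3))."""
    total = 1.0 + (n - 1) * z / n
    prod = 1.0                          # running product (n+1)(n+3)...(n+2i-3)
    for i in range(2, terms):
        prod *= n + 2 * i - 3
        total += (n - 1) ** (2 * i - 1) * z ** i / (n ** i * prod * math.factorial(i))
    return total

def mvue_lognormal_mean(x):
    """MVUE of the lognormal mean, equation (8): exp(ybar) * g_n(s_y^2 / 2)."""
    y = np.log(np.asarray(x))
    return np.exp(y.mean()) * g_n(len(y), y.var(ddof=1) / 2.0)
```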
Another estimate which is also sometimes used is known as the Maximum Likelihood Estimate (MLE). When the data set is a random sample from a lognormal distribution, the MLE of the parameter, μ, is simply the sample mean of the log-transformed data, μ̂ = ȳ, and the MLE of σ² is a multiple of the sample variance of the log-transformed data, namely, σ̂² = [(n − 1)/n]s_y². The MLE of any function of the parameters μ and σ² is obtained by simply substituting these MLEs in place of the parameters. For example, the MLE of the mean of a lognormal population is exp(μ̂ + 0.5σ̂²), and the MLE of the 95th percentile is exp(μ̂ + 1.65σ̂). One disadvantage of the MLEs of the lognormal mean and percentiles is that they are biased estimates. Another slight modification uses s_y in place of the MLE, σ̂. Although the result is not identical to the MLE, the numerical difference is small, and for convenience the term MLE will also cover this modified version.
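As a minimal sketch of these point estimates (Python with NumPy; illustrative only, using the formulas just given):

```python
import numpy as np

def lognormal_mle_estimates(x):
    """MLE-based point estimates for a lognormal sample, as described above."""
    y = np.log(np.asarray(x))
    mu_hat = y.mean()                   # MLE of mu
    sigma2_hat = y.var(ddof=0)          # MLE of sigma^2 = [(n-1)/n] * s_y^2
    mean_mle = np.exp(mu_hat + 0.5 * sigma2_hat)           # MLE of lognormal mean
    p95_mle = np.exp(mu_hat + 1.65 * np.sqrt(sigma2_hat))  # MLE of 95th percentile
    return mu_hat, sigma2_hat, mean_mle, p95_mle
```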

Finally, the one-sided (1 − α)100% UCL for the mean, μ₁, of the lognormal distribution derived by Land (1971, 1975) is given as follows:

UCL = exp(ȳ + 0.5s_y² + s_y H₁₋α/√(n − 1))   (11)

Tables of H-statistic values can be found in Land (1975) and also in Gilbert (1987, Table A10).
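Equation (11) is straightforward to compute once the tabulated H value has been looked up. In the sketch below (Python with NumPy; illustrative only), the argument h must be taken from Land (1975) or Gilbert (1987, Table A10) for the sample's n and s_y, since no closed form is given.

```python
import numpy as np

def h_ucl(x, h):
    """H-statistic based UCL, equation (11); h = H_{1-alpha} from the tables."""
    y = np.log(np.asarray(x))
    n, ybar, sy = len(y), y.mean(), y.std(ddof=1)
    return np.exp(ybar + 0.5 * sy**2 + sy * h / np.sqrt(n - 1))
```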

Use of the UCL for a population mean based on the H-statistic is widely recommended in environmental guidance documents. Theoretically, the UCL based on the H-statistic has optimal properties when the population is truly lognormal. However, in practice the results can be quite disappointing and misleading if the data set includes outliers or is a mixture of data from two or more distributions. Monte Carlo investigations

performed by the authors confirm that, for small sample sizes, the use of the H-statistic approach can result in unacceptably high values of the UCL when the CV is larger than 1.0. Consequently, other methods for computing a UCL of the mean, μ₁, of a distribution of unspecified form will be considered and the results compared with UCLs obtained by the H-statistic approach.

The methods considered in this paper can be viewed as variations of a basic approach to constructing confidence intervals known as the pivotal quantity method. In general, a pivotal quantity is a function of both the parameter θ and an estimate θ̂ such that the probability distribution of the pivotal quantity does not depend on θ. Perhaps the best-known example of a pivotal quantity is the well-known t-statistic,

t = (x̄ − μ₁)/(s_x/√n)   (12)

where x̄ and s_x are, respectively, the sample mean and sample standard deviation. If the data are a random sample from a normal population with mean, μ₁, and standard deviation, σ₁, then the distribution of this pivotal quantity is the familiar Student's t distribution with n − 1 degrees of freedom. Because the Student's t distribution does not depend on either unknown parameter, its quantiles are available. Denote by t_{α,n−1} the upper αth quantile of the Student's t distribution with n − 1 degrees of freedom. Based on equation (12), it is possible to derive a (1 − 2α)100% confidence interval of the form

x̄ − t_{α,n−1} s_x/√n ≤ μ₁ ≤ x̄ + t_{α,n−1} s_x/√n.   (13)
The confidence interval is given in the familiar form of a two-sided confidence interval for the mean. If the lower limit of this interval is disregarded, the upper limit provides a (1 − α)100% UCL for the mean, μ₁.
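For reference, the upper limit of equation (13) takes only a few lines to compute (Python with SciPy; an illustration, not from the paper):

```python
import numpy as np
from scipy import stats

def t_ucl(x, alpha=0.05):
    """One-sided (1 - alpha)100% UCL for the mean, from equation (13)."""
    x = np.asarray(x)
    n = len(x)
    t_q = stats.t.ppf(1 - alpha, df=n - 1)       # upper alpha-th t quantile
    return x.mean() + t_q * x.std(ddof=1) / np.sqrt(n)
```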

For a population which is normally distributed, equation (13) provides the best way of constructing confidence intervals for the population mean. However, as noted previously, the distribution of contaminant concentration data is typically positively skewed and frequently involves outliers. It is well known that the sample mean and sample standard deviation are severely distorted in the presence of outliers (Singh and Nocerino, 1995), and consequently any function of these statistics, such as the Student's t given by equation (12) above, is also severely influenced by the presence of outliers. Robust methods for estimating the population mean and sd are available in the software package SCOUT, as identified in Singh and Nocerino (1995). In practice, statistical procedures based on the pivotal quantity of equation (12) are usually thought to be "robust" relative to violation of the normality assumption. However, as noted by Staudte and Sheather (1990), tests based on the Student's t are nonrobust in the presence of outliers. Consequently, other procedures which do not rely on a specific parametric assumption for the population distribution are also considered in the following discussion.

The approach of constructing confidence intervals from pivotal quantities (or approximate pivotal quantities) permits a unified treatment of these alternative procedures. In particular, each procedure involves an approximate pivotal quantity with the difference between the unknown population mean, μ₁, and a point estimate of the mean in the numerator, and an estimate of the standard error of the point estimate in the denominator. Thus, each procedure involves two parts: 1) finding some reasonably robust estimate of the mean (Singh and Nocerino, 1995), and 2) providing a convenient way to obtain quantiles of the pivotal quantity. A general discussion of the pivotal quantity approach to constructing confidence intervals is given by Bain and Engelhardt (1992).

  As noted above, in order to apply the pivotal
quantity method, it is necessary to have quantiles
of the distribution of the pivotal quantity.  For
example, in order to compute equation (13), it is
necessary to have quantiles  of the Student's t
distribution.  These  quantiles can  be found  in
tables or computed with the appropriate software.
However, for nonnormal populations the required

quantiles are not, in general, readily available. In
some cases, even though the exact distribution of
a pivotal quantity is not known, an approximate
distribution can be used. Thus, except for the H-
statistic approach, which is exact if the population
is truly  lognormal,  all of the  other methods
discussed below  give only approximate  UCL
values for the  population  mean.   The  true
confidence level  of  UCLs  will  vary from one
method to the next, and without some additional
study, it will not be clear whether the comparisons
are fair.  In other words, it is possible to have a
smaller UCL  at the expense of a true  confidence
level which is below the nominal level, and below
the true confidence level of another competing
method.

In environmental applications, the objectives typically are: 1) the identification of hot spots, which are typically represented by the extreme high concentrations, or 2) the separation of the clean part(s) of a site from the contaminated part(s). However, from the examples discussed in the following, it can be seen that the practical use of the lognormal distribution in these environmental applications is questionable, as a lognormal distribution often accommodates extreme outlying observations and mixture populations within a single lognormal model.

Jackknife and Bootstrap Procedures

  General methods for deriving estimates, such as
the method of maximum likelihood, often result in
estimates  which  are biased.    Bootstrap  and
jackknife procedures as discussed by Efron (1982)
and  Miller (1974) are nonparametric statistical
techniques which can be used to reduce the bias of
point  estimates   and  construct  approximate
confidence intervals  for parameters such as the
population mean.  These two procedures require
no   assumptions  regarding  the  statistical
distribution (e.g., normal or lognormal) for the
underlying population, and can be applied to a
variety of situations, no matter how complicated. However, it should be pointed out that the use of a parametric statistical method (one depending upon distributional assumptions), when appropriate, is more efficient than its nonparametric counterpart.
In practice,  parametric assumptions are  often
difficult to justify, especially in environmental
applications.    In  these  cases,  nonparametric
methods are valuable tools for obtaining reliable
estimates of the parameters of interest. Although
bootstrap  and  jackknife  procedures  are
conceptually simple, they are based on resampling
techniques requiring  considerable  computing
power and time.

Let x₁, x₂, ..., xₙ be a random sample of size n from a population with an unknown parameter θ (e.g., θ = μ₁), and let θ̂ be an estimate of θ which is a function of all n observations. For example, the parameter θ could be the mean, and a reasonable choice for the estimate θ̂ might be the sample mean, x̄. Another choice is the MVUE of a lognormal mean. Of course, if the population is not lognormal then this estimate may not perform well; but, because it is frequently used with skewed data sets, it is of interest to see how it performs relative to the other methods.

Jackknife Estimation

In the jackknife approach, n estimates of θ are computed by deleting one observation at a time. Specifically, for each index, i, denote by θ̂₍ᵢ₎ the estimate of θ (computed similarly to θ̂ above) when the ith observation is omitted from the original sample of size n, and denote the arithmetic mean of these estimates by

θ̃ = (1/n) Σᵢ₌₁ⁿ θ̂₍ᵢ₎.   (14)

A quantity known as the ith "pseudo-value" is defined by

Jᵢ = nθ̂ − (n − 1)θ̂₍ᵢ₎.   (15)

The jackknife estimator of θ is given by

J(θ̂) = (1/n) Σᵢ₌₁ⁿ Jᵢ = nθ̂ − (n − 1)θ̃.   (16)
If the original estimate θ̂ is biased, then, under certain conditions, part of the bias is removed by the jackknife procedure, and an estimate of the standard error of the jackknife estimate, J(θ̂), is given by

σ̂_J = √[ (1/(n(n − 1))) Σᵢ₌₁ⁿ (Jᵢ − J(θ̂))² ].   (17)

Another application of the pseudo-values, suggested by J. Tukey (see Miller, 1974), is to use them to obtain confidence intervals for the parameter, θ, based on the following pivotal quantity:

t = (J(θ̂) − θ)/σ̂_J.   (18)

The statistic, t, given by equation (18) has an approximate Student's t distribution with n − 1 degrees of freedom, which can be used to derive the following approximate two-sided (1 − 2α)100% confidence interval for θ:

J(θ̂) − t_{α,n−1} σ̂_J ≤ θ ≤ J(θ̂) + t_{α,n−1} σ̂_J.   (19)
The upper limit of equation (19) is an approximate (1 − α)100% UCL for θ. If the sample size, n, is large, then the upper αth Student's t quantile can be replaced with the corresponding upper αth standard normal quantile, z_α. Observe also that when θ̂ is the sample mean, the jackknife estimate is the sample mean itself, that is, J(x̄) = x̄; the estimate of the standard error in equation (17) simplifies to s_x/n^(1/2); and the confidence interval in equation (19) reduces to the familiar t-statistic based confidence interval given by equation (13).
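Equations (14) through (19) translate directly into code. The sketch below (Python with NumPy/SciPy; an illustration under the formulas just stated) accepts any point estimator, e.g., the sample mean or the MVUE of equation (8).

```python
import numpy as np
from scipy import stats

def jackknife_ucl(x, estimator=np.mean, alpha=0.05):
    """Approximate (1 - alpha)100% UCL from Tukey pseudo-values, eqs. (14)-(19)."""
    x = np.asarray(x)
    n = len(x)
    theta_hat = estimator(x)
    # leave-one-out estimates theta_hat_(i)
    loo = np.array([estimator(np.delete(x, i)) for i in range(n)])
    pseudo = n * theta_hat - (n - 1) * loo        # pseudo-values, eq. (15)
    j_est = pseudo.mean()                         # jackknife estimate, eq. (16)
    se = np.sqrt(np.sum((pseudo - j_est) ** 2) / (n * (n - 1)))   # eq. (17)
    return j_est + stats.t.ppf(1 - alpha, df=n - 1) * se  # upper limit, eq. (19)
```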

Bootstrap Estimation

In the bootstrap procedure, repeated samples of size n are drawn with replacement from the given set of observations. The process is repeated a large number of times, and each time an estimate of θ is computed. The estimates thus obtained are used to compute an estimate of the standard error of θ̂. There exists in the statistical literature an extensive array of different bootstrap methods for constructing confidence intervals. In this article two of these methods are considered: 1) the standard bootstrap, and 2) the pivotal (or Studentized) bootstrap method as discussed by Hall (1988). A general description of bootstrap methods, illustrated by application to the sample mean, follows:

Step 1. Let (x_{i1}, x_{i2}, ..., x_{in}) represent the ith sample of size n drawn with replacement from the original data set (x₁, x₂, ..., xₙ). Then compute its sample mean and denote it by x̄ᵢ.

Step 2. Perform Step 1 independently N times (e.g., N = 500-1000), each time calculating a new estimate. Denote these estimates by x̄₁, x̄₂, x̄₃, ..., x̄_N. The bootstrap estimate of the population mean is the arithmetic mean, x̄_B, of the N estimates x̄ᵢ. The bootstrap estimate of the standard error is given by

σ̂_B = √[ (1/(N − 1)) Σᵢ₌₁ᴺ (x̄ᵢ − x̄_B)² ].   (20)
If some parameter, θ (say, a population median), other than the mean is of concern, with an associated estimate (e.g., the sample median), then the same steps described previously could be applied with the parameter and its estimate used in place of μ₁ and x̄. Specifically, the estimate, θ̂ᵢ, would be computed, instead of x̄ᵢ, for each of the N bootstrap samples. The general bootstrap estimate, denoted by θ̄_B, is the arithmetic mean of the N estimates. The difference, θ̄_B − θ̂, provides an estimate of the bias of the estimate, θ̂, and the bootstrap estimate of the standard error of θ̂ is given by

σ̂_B = √[ (1/(N − 1)) Σᵢ₌₁ᴺ (θ̂ᵢ − θ̄_B)² ].   (21)

The standard bootstrap confidence interval is derived from the following pivotal quantity:

z = (θ̂ − θ)/σ̂_B.   (22)
Finally, the (1 − 2α)100% standard bootstrap confidence interval for θ, which assumes that the quantity in equation (22) is approximately standard normal, is

θ̂ − z_α σ̂_B ≤ θ ≤ θ̂ + z_α σ̂_B.   (23)

In this case, the bootstrap approach gives a convenient way to estimate the standard error of θ̂. Depending on the type of estimate θ̂, the standard error may be quite difficult to derive, and consequently difficult to estimate. However, the bootstrap approach always yields an estimate of the standard error directly from the data, even when the mathematical form of the standard error is not known.
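A minimal sketch of the standard bootstrap UCL of equation (23), applied to the sample mean (Python with NumPy/SciPy; the number of resamples N and the seed are arbitrary choices):

```python
import numpy as np
from scipy import stats

def standard_bootstrap_ucl(x, N=1000, alpha=0.05, seed=0):
    """Standard bootstrap UCL for the mean, equations (20)-(23)."""
    x = np.asarray(x)
    rng = np.random.default_rng(seed)
    n = len(x)
    # Steps 1 and 2: N resamples of size n, drawn with replacement
    means = np.array([rng.choice(x, size=n, replace=True).mean()
                      for _ in range(N)])
    se_b = means.std(ddof=1)                      # bootstrap standard error, eq. (20)
    return x.mean() + stats.norm.ppf(1 - alpha) * se_b   # upper limit of eq. (23)
```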

Another variation of the bootstrap method, called the "bootstrap t" by Efron (1982), is a nonparametric procedure which uses the bootstrap methodology to estimate quantiles of the pivotal quantity in equation (12). As previously noted, for nonnormal populations the required quantiles may not be easily obtained, or may be impossible to compute exactly. However, with a variation of the bootstrap procedure, as proposed by Hall (1988), the required quantiles can be estimated directly from the data. Specifically, in Steps 1 and 2 described above, if x̄ is the sample mean computed from the original data, and x̄ᵢ and s_{x,i} are the sample mean and sample standard deviation computed from the ith resampling of the original data, the N quantities tᵢ = (x̄ᵢ − x̄)/(s_{x,i}/√n) are computed and sorted, yielding the ordered quantities t₍₁₎ ≤ t₍₂₎ ≤ ... ≤ t₍N₎. The estimate of the upper αth quantile of the pivotal quantity in equation (12) is t_{α,B} = t₍₍₁₋α₎N₎. For example, if N = 1000 bootstrap samples are generated, then the 950th ordered value, t₍₉₅₀₎, would be the bootstrap estimate of the upper 0.05 quantile of the pivotal quantity in equation (12). This estimated quantile can be used in place of the upper αth Student's t quantile in an interval of the form given in equation (13). In the next section, this method of construction is called the "pivotal bootstrap". This approach has the advantage that it does not rely on the assumption of a special parametric form for the distribution of the population, and it does not require an assumption of approximate normality for the pivotal quantity, as does the standard bootstrap interval of equation (23).
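The pivotal bootstrap just described can be sketched as follows (Python with NumPy; illustrative only, with the quantile substitution done exactly as in the text):

```python
import numpy as np

def pivotal_bootstrap_ucl(x, N=1000, alpha=0.05, seed=0):
    """Pivotal (Studentized) bootstrap UCL: estimate the upper alpha-th
    quantile of the t-type quantity of eq. (12) by resampling, then use it
    in place of the Student's t quantile in eq. (13)."""
    x = np.asarray(x)
    rng = np.random.default_rng(seed)
    n, xbar = len(x), x.mean()
    t_vals = np.empty(N)
    for i in range(N):
        xb = rng.choice(x, size=n, replace=True)
        t_vals[i] = (xb.mean() - xbar) / (xb.std(ddof=1) / np.sqrt(n))
    t_alpha = np.sort(t_vals)[int((1 - alpha) * N) - 1]  # e.g., 950th of N = 1000
    return xbar + t_alpha * x.std(ddof=1) / np.sqrt(n)
```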
In the examples to follow, the jackknife, the standard bootstrap, and the pivotal bootstrap methods are applied using the sample mean, x̄, and also using the estimate given by equation (8), which is the MVUE of the mean when the population is lognormal.

     The Central Limit Theorem

Given a random sample, x₁, x₂, ..., xₙ, of size n from a population with a finite variance, σ₁², where θ = μ₁ is the unknown population mean, the Central Limit Theorem (CLT) states that the asymptotic distribution, as n approaches infinity, of the sample mean, x̄ₙ, is normal with mean, μ₁, and variance, σ₁²/n. More precisely, the sequence of random variables

zₙ = (x̄ₙ − μ₁)/(σ₁/√n)   (24)

has a standard normal limiting distribution. In practice, this means that for large sample sizes, n, the sample mean, x̄, has an approximate normal distribution irrespective of the underlying distribution function. Consequently, equation (24) is an approximate pivotal quantity for large n. This powerful result can be used to obtain approximate (1 − 2α)100% confidence intervals for the mean of any distribution with a finite variance, although, strictly speaking, it requires one to know the population standard deviation, σ₁. However, as noted by Hogg and Craig (1978), if σ₁ is replaced by the sample standard deviation, s_x, the normal approximation for large n is still valid. This leads to the following confidence interval:

x̄ − z_α s_x/√n ≤ μ₁ ≤ x̄ + z_α s_x/√n.   (25)
      Note that the confidence  interval in equation
     (25) has the same general form as equation (13),
     but with the t quantiles replaced with approximate
     standard normal quantiles. As noted previously,
     if the lower limit is disregarded, the upper limit of
     the interval provides a one-sided UCL for the
     population mean.

An often-cited rule of thumb for an adequate sample size with the CLT is n > 30. However, this may not be adequate if the population is highly skewed. A refinement of the CLT approach, which makes an adjustment for skewness, is discussed by Chen (1995). Specifically, the "adjusted CLT" UCL is obtained if the standard normal quantile, z_α, in the upper limit of equation (25) is replaced by

z_α + κ̂₃(1 + 2z_α²)/(6√n),   (26)

where κ̂₃ is the sample coefficient of skewness,

κ̂₃ = (1/(n s_x³)) Σᵢ₌₁ⁿ (xᵢ − x̄)³.   (27)

Notice that this adjustment results in a UCL which is larger than that of equation (25) when the sample skewness is positive.
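Both the CLT-based UCL of equation (25) and the skewness-adjusted version of equations (26) and (27) are shown in the sketch below (Python with NumPy/SciPy; an illustration added here, not part of the original paper):

```python
import numpy as np
from scipy import stats

def clt_ucls(x, alpha=0.05):
    """CLT UCL (eq. 25) and the adjusted-CLT UCL of Chen (1995) (eqs. 26-27)."""
    x = np.asarray(x)
    n, xbar, sx = len(x), x.mean(), x.std(ddof=1)
    z = stats.norm.ppf(1 - alpha)
    k3 = np.sum((x - xbar) ** 3) / (n * sx ** 3)         # sample skewness, eq. (27)
    ucl_clt = xbar + z * sx / np.sqrt(n)                 # upper limit of eq. (25)
    z_adj = z + k3 * (1 + 2 * z ** 2) / (6 * np.sqrt(n))  # adjusted quantile, eq. (26)
    ucl_adj = xbar + z_adj * sx / np.sqrt(n)
    return ucl_clt, ucl_adj
```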

The Chebychev Theorem

This theorem is given here to obtain a reasonably conservative estimate of the UCL of the mean. The two-sided Chebychev theorem states that, given a random variable, X, with finite mean and standard deviation, μ₁ and σ₁, one has

P(|X − μ₁| ≤ kσ₁) ≥ 1 − 1/k².   (28)

This result can be applied with the sample mean, x̄, to obtain a conservative UCL for the population mean. Specifically, if the right side of equation (28) is equated to 0.95, then k = 4.47, and UCL = x̄ + 4.47σ₁/n^(1/2) is a conservative 95% upper confidence limit for the population mean. Of course, this would require the user to know the value of σ₁. The obvious modification would be to replace σ₁ with the sample standard deviation, s_x, but, since this is estimated from the data, the result is no longer guaranteed to be conservative. In general, if μ₁ is an unknown mean, μ̂₁ is an estimate, and σ̂(μ̂₁) is an estimate of the standard error of μ̂₁, then the quantity UCL = μ̂₁ + 4.47σ̂(μ̂₁) will give 95% UCLs for μ₁ which should tend to be conservative, but this is not assured. This could be used, for example, with the mean of a lognormal population, using equation (8) as the estimate of the population mean and the square root of equation (10) as the estimate of the standard error. This has been done in the following examples.
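A sketch of the Chebychev-based limit (Python with NumPy; here applied with the sample mean and s_x, which, as noted above, is no longer guaranteed to be conservative):

```python
import numpy as np

def chebychev_ucl(x, alpha=0.05):
    """Conservative UCL from the Chebychev inequality, eq. (28);
    alpha = 0.05 gives k = sqrt(1/0.05) = 4.47."""
    x = np.asarray(x)
    k = np.sqrt(1.0 / alpha)
    return x.mean() + k * x.std(ddof=1) / np.sqrt(len(x))
```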

4.  Examples

Monte Carlo simulation experiments were performed to compare various methods of computing the UCL of the lognormal mean. Based on these experiments, the jackknife and bootstrap methods, and even the conservative method based on the Chebychev inequality, appear to be superior to the H-statistic-based UCL for small sample sizes. When the number of samples is large (n > 100), all of these methods give similar results. In this section, a few simulated examples are provided to compare the various methods of computing values of the UCL. A few examples from Superfund sites have also been included.

Example 4.1. Simulated sample from a mixture of two normal populations, N(100, 50) and N(1000, 100)

This example uses the sample of size n = 15 which was discussed previously in Example 2.1. Recall that this is a simulated sample from a mixture of two normal populations. The mean of the mixed normal population is μ₁ = 400. The values of the mean, standard deviation, and coefficient of variation computed for the log-transformed data are:

ȳ = 5.090, s_y = 1.705, and CV_y = 0.34.

The values of the mean, standard deviation, and CV computed for the raw data are:

x̄ = 403.35, s_x = 453.94, and CV_x = 1.125.

If it is assumed (incorrectly) that the population is lognormal, point estimates based on MVUE theory of the mean, μ₁, standard deviation, σ₁, and standard error of the mean are 572.98, 1334.56, and 290.14, respectively. Estimates of the 80th, 90th, and 95th percentiles of a lognormal distribution are 686.33, 1453.48, and 2685.56, respectively.
Discussion of Example 4.1

The 95% UCL values obtained from the methods discussed above, without using lognormal theory, are:

Jackknife             609.75
Standard Bootstrap    584.32
Pivotal Bootstrap     651.52
CLT                   596.16
Adjusted CLT          618.51
Chebychev             927.27

The values of the 95% UCL obtained from the methods discussed above, calculated using lognormal theory, are:

Jackknife            1085.17
Standard Bootstrap    994.40
Chebychev            1869.90
H-UCL                4150.96

Notice that the 95% UCL computed from the H-statistic (4150.96) exceeds the estimated 95th percentile (2685.56) of an assumed lognormal distribution. The H-UCL is also an order of magnitude larger than the true mean, 400, of the mixture of two normal populations.

  It is  also of interest to see how the methods
compare when applied to simulated lognormal
data with different sample sizes and various
combinations of parameter values.

Example 4.2. Simulated sample of size n = 15 from a lognormal distribution, LN(5, 1)

In this example, n = 15 data were generated from the lognormal distribution LN(5, 1), with the following (true) values of the population parameters: μ₁ = 244.69, σ₁ = 320.75, and CV = 1.31. The generated data are:

139.2056, 259.9746, 138.7997, 48.8109, 166.1733, 54.1241, 120.3665, 60.9887, 551.2073, 66.3336, 16.0695, 364.5569, 153.2404, 271.5436, 473.6461.

The values of the sample mean, standard deviation, and CV of the log-transformed data are:

ȳ = 4.887, s_y = 0.966, and CV_y = 0.20.

The sample mean, standard deviation, and CV for the raw data are:

x̄ = 192.34, s_x = 161.56, and CV_x = 0.84.

For a lognormal distribution, the estimates of μ₁, σ₁, and the standard error of the mean, based on MVUE theory, are 202.58, 219.21, and 54.00, respectively. The MLEs of μ₁, σ₁, and CV are 211.33, 262.47, and 1.24, respectively. Estimates of the 80th, 90th, and 95th percentiles of the lognormal distribution are 299.79, 458.58, and 649.31, respectively.
Discussion of Example 4.2

The values of the 95% UCL obtained from the methods discussed above, without using lognormal theory, are:

Jackknife             265.79
Standard Bootstrap    258.21
Pivotal Bootstrap     292.17
CLT                   260.96
Adjusted CLT          271.57
Chebychev             378.80

The values of the 95% UCL obtained from the methods discussed above, calculated from lognormal theory, are:

Jackknife             289.30
Standard Bootstrap    281.22
Chebychev             448.41
H-UCL                 427.62
The differences in UCLs for the various methods are not as extreme as they were in the previous example, but a similar pattern, with the Chebychev (as expected) and H-UCL limits being the largest, is still present. However, unlike the previous example, the 95% H-UCL is below the estimated 95th percentile of a lognormal distribution, as one would intuitively expect. It is also interesting to note that the CV estimated as the ratio of the sample standard deviation to the sample mean of the raw data is less than 1 (0.84), while the CV computed from the MLEs is slightly greater than 1 (1.24). According to the CV test,
which says that if CV < 1.0 then the population is normally distributed, the former CV of 0.84 might lead one to incorrectly assume that the population is normally distributed.

  In the next example, the variance of the log-
transformed variable is increased slightly with a
corresponding increase in CV and skewness.

Example 4.3. Simulated sample of size n = 15 from a lognormal distribution, LN(5, 1.5)

In this example, n = 15 observations were generated from the lognormal distribution, LN(5, 1.5), with the following true values of the population parameters: μ₁ = 457.14, σ₁ = 1331.83, and CV = 2.91. The generated data are:

440.8517, 1013.4986, 1857.7698, 500.9632, 397.9905, 110.7144, 196.2847, 128.2843, 1529.9753, 5.7978, 940.8903, 597.5925, 1519.5159, 181.6512, 52.8952.

The sample mean, standard deviation, and CV of the log-transformed data are:

ȳ = 5.761, s_y = 1.536, and CV_y = 0.27.

The sample mean, standard deviation, and CV for the raw data are:

x̄ = 631.65, s_x = 603.13, and CV_x = 0.96.

For a lognormal distribution, the estimates of μ₁, σ₁, and the standard error of the mean, based on MVUE theory, are 894.76, 1784.95, and 405.79, respectively. The MLEs of μ₁, σ₁, and CV are 1033.63, 3202.28, and 3.10, respectively. Estimates of the 80th, 90th, and 95th percentiles of the lognormal distribution are 1163.05, 2286.63, and 3975.71, respectively.
Discussion of Example 4.3

The values of the 95% UCL obtained from the methods discussed above, without using lognormal theory, are:

Jackknife             905.88
Standard Bootstrap    882.82
Pivotal Bootstrap     977.18
CLT                   887.82
Adjusted CLT          919.81
Chebychev            1327.75

The values of the 95% UCL obtained from the methods discussed above, calculated from lognormal theory, are:

Jackknife            1534.94
Standard Bootstrap   1363.26
Chebychev            2708.63
H-UCL                4570.27
  As in the case of Example 4.1, the 95% H-UCL
(4570.27)  again exceeds  the  estimated  95th
percentile  of the lognormal  distribution.  The
situation with the CV is similar to that of Example
4.2.  That is, the CV computed from raw data
(0.96) is less than 1, which by application of the
CV-test could lead one to adopt (incorrectly) the
normal distribution.  Notice that the true CV and
the estimate based on the MLEs are both close to
three.   The next example  involves the same
population but with a larger sample size.

Example 4.4. Simulated sample of size n = 31 from a lognormal distribution, LN(5, 1.5)

In this simulated example, n = 31 observations were generated from a lognormal distribution, LN(5, 1.5). This is the same distribution used in the previous example, and thus the true mean, standard deviation, and CV are the same. The generated data are:

49.0524, 806.8449, 122.2339, 697.7315, 2888.1238, 37.7998, 7.2799, 292.5909, 433.4413, 639.7468, 3876.8206, 1376.8859, 197.8634, 93.0379, 180.9311, 1817.9912, 284.3526, 344.6761, 44.8680, 297.3899, 11.9195, 100.5519, 264.7574, 41.3961, 43.4202, 1053.3770, 2067.0361, 132.2938, 75.9661, 53.2236, 83.5585.

The sample mean, standard deviation, and CV of the log-transformed data are:

ȳ = 5.326, s_y = 1.577, and CV_y = 0.30.

The sample mean, standard deviation, and CV for the raw data are:

x̄ = 594.10, s_x = 919.05, and CV_x = 1.55.
For a lognormal distribution, the estimates of μ₁, σ₁, and the standard error of the mean are 657.45, 1632.25, and 238.86, respectively. The MLEs of μ₁, σ₁, and CV are 713.34, 2369.11, and 3.32, respectively. Estimates of the 80th, 90th, and 95th percentiles of a lognormal distribution are 779.73, 1560.71, and 2753.62, respectively.
Discussion of Example 4.4

The values of the 95% UCL obtained from the methods discussed above, without using lognormal theory, are:

Jackknife             874.22
Standard Bootstrap    854.51
Pivotal Bootstrap    1003.00
CLT                   865.64
Adjusted CLT          932.36
Chebychev            1331.95

The values of the 95% UCL obtained from the methods discussed above, calculated from lognormal theory, are:

Jackknife            1062.35
Standard Bootstrap   1088.94
Chebychev            1725.15
H-UCL                1792.54
  As one might expect with a larger sample size (n
= 31), the point estimates tend to be closer to the
true  parameter values they  are intended to
estimate.  Also, there is not as  much variation
among the UCLs computed  from the different
methods. Furthermore, the H-UCL is below the
estimated  95th  percentile  of  the  lognormal
distribution.

  In the next example, a sample of size n = 15 is
considered again, but with the variance of the log-
transformed variable slightly larger than that of
Examples 4.2-4.4.

Example 4.5. Simulated sample of size n = 15 from a lognormal distribution, LN(5, 1.7)

This last simulated data set of size n = 15 is obtained from LN(5, 1.7), with the following true values of the population parameters: μ₁ = 629.55, σ₁ = 2595.18, and CV = 4.12.

The generated data are:

16.5197, 235.4977, 1860.4443, 74.5825, 3.9684, 325.2712, 167.7949, 189.0130, 1307.6180, 878.8519, 35.4675, 96.2498, 229.2540, 182.0494, 1498.6146.

The sample mean, standard deviation, and CV of the log-transformed data are:

ȳ = 5.178, s_y = 1.710, and CV_y = 0.33.

The sample mean, standard deviation, and CV for the raw data are:

x̄ = 473.41, s_x = 606.79, and CV_x = 1.28.

For a lognormal distribution, the estimates of μ₁, σ₁, and the standard error of the mean, based on MVUE theory, are 629.82, 1473.12, and 319.0, respectively. The MLEs of μ₁, σ₁, and CV are 765.52, 3213.52, and 4.20, respectively. Estimates of the 80th, 90th, and 95th percentiles for a lognormal distribution are 752.50, 1596.91, and 2955.58, respectively.
Discussion of Example 4.5

The values of the 95% UCL obtained from the methods discussed above,
without using lognormal theory, are:

    Jackknife              749.31
    Standard Bootstrap     721.07
    Pivotal Bootstrap      862.51
    CLT                    731.14
    Chebychev             1173.74

The values of the 95% UCL obtained from the four methods discussed
above, calculated from lognormal theory, are:

    Jackknife             1176.39
    Standard Bootstrap    1141.95
    Chebychev             2059.47
    H-UCL                 4613.32

  Notice that in this example (as with Examples
4.1 and 4.3), the 95% H-UCL (4613.32) exceeds
the estimated 95th percentile (2955.58)  of the
lognormal distribution.

  The sample size and the mean of the log-transformed variable in
Examples 4.2, 4.3, and 4.5 are held constant at 15 and 5,
respectively, whereas the standard deviations of the log-transformed
variable are 1.0, 1.5, and 1.7, respectively.  From these examples
alone, it can be
seen that once the sd of the log-transformed variable exceeds 1.0,
the H-statistic-based UCL can become substantially higher (in later
examples, orders of magnitude higher) than the largest observed
concentration, even when the data were obtained from a lognormal
population.  Thus, even though the H-UCL is
theoretically  sound   and  possesses  optimal
properties for truly lognormal populations such as
being MVUE, the practical merit of the  use of H-
UCL in environmental applications is questionable
when the sd of the log-transformed variable starts
exceeding 1.0. This is especially true  for small
sample sizes (e.g., n < 30).  As seen in the
examples discussed here, the use of the lognormal
distribution   and  the   H-UCL  in  some
circumstances tends to hide contamination rather
than find it, which is contrary to one of the main
objectives in many environmental applications.
Indeed, under the assumption of a lognormal distribution, one can get
away with very little or no cleanup at a polluted site (Bowers, Neil,
and Murphy 1994).

Example 4.6.  Data from the Naval Construction
              Battalion   Center   (NCBC)
              Superfund Site in Rhode Island.

Inorganic  analyses  were performed on  the
groundwater samples from seventeen (17) wells
from the NCBC Site.  The main objective  was to
provide reliable estimates of the mean background
threshold   levels   for  the  various   inorganic
contaminants at the site.  The UCLs have been
computed using the procedures described above.
The results for two of the contaminants, aluminum
and manganese, are summarized below.

Aluminum: 290,  113, 264, 2660,  586, 71,  527,
163, 107, 71, 5920, 979, 2640,  164, 3560,  13200,
125.

The sample mean, standard deviation, and CV of the log-transformed
data are:

ȳ = 6.226, sy = 1.659, and CVy = 0.27.

The sample mean, standard deviation, and CV of the raw data are:

x̄ = 1849.41, sx = 3351.27, and CVx = 1.81.

With the lognormal assumption, the estimates of μ₁, σ₁, and the
standard error of the mean, based on MVUE theory, for aluminum are
1704.84, 3959.87, and 807.64, respectively.  The MLEs of μ₁, σ₁, and
CV are 2002.71, 7676.37, and 3.83, respectively.  Estimates of the
80th, 90th, and 95th percentiles for a lognormal distribution are
2054.44, 4263.44, and 7747.81, respectively.

Manganese:  15.8, 28.2, 90.6, 1490,  85.6,  281,
4300, 199, 838, 777,  824, 1010,  1350, 390,  150,
3250, 259.

The sample mean, standard deviation, and CV of the log-transformed
data are:

ȳ = 5.91, sy = 1.568, and CVy = 0.27.

The sample mean, standard deviation, and CV of the raw data are:

x̄ = 902.25, sx = 1189.49, and CVx = 1.32.

With the lognormal assumption, the estimates of μ₁, σ₁, and the
standard error of the mean, based on MVUE theory, for manganese are
1100.92, 2340.72, and 490.16, respectively.  The MLEs of μ₁, σ₁, and
CV are 1262.59, 4125.5, and 3.27, respectively.  Estimates of the
80th, 90th, and 95th percentiles for a lognormal distribution are
1389.65, 2769.95, and 4870.45, respectively.

The calculated Shapiro-Wilks statistics for the raw data are 0.594
(aluminum) and 0.725 (manganese), and for the log-transformed data,
the corresponding values are 0.913 and 0.969.  The tabulated critical
value for a 0.10 level of significance is 0.91.  Thus, for both
aluminum and manganese, the data failed the normality test and passed
the lognormality test at significance level 0.10.  (Note: the
Shapiro-Wilks test is a lower-tail test.)
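
  This check is easy to reproduce; for example, scipy.stats.shapiro
returns the W statistic together with a p-value, with small values of
W (equivalently, small p-values) indicating lack of fit.  A minimal
sketch for the aluminum data:

    import numpy as np
    from scipy.stats import shapiro

    aluminum = np.array([290, 113, 264, 2660, 586, 71, 527, 163, 107,
                         71, 5920, 979, 2640, 164, 3560, 13200, 125.0])
    w_raw, p_raw = shapiro(aluminum)          # W near 0.59: normality rejected
    w_log, p_log = shapiro(np.log(aluminum))  # W near 0.91: lognormal fit passes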

Discussion of Example 4.6

The values of the 95% UCL obtained from the methods discussed above,
without using lognormal theory, are:

                          Aluminum    Manganese
    Jackknife              3268.22      1405.83
    Standard Bootstrap     3125.56      1354.15
    Pivotal Bootstrap      5286.63      1968.03
    CLT                    3186.47      1376.82
    Adjusted CLT           3675.94      1503.84
    Chebychev              5482.64      2191.81

  Observe that for both of the contaminants, the 95% UCLs calculated
from the jackknife, both bootstrap methods, the CLT, the adjusted
CLT, and the Chebychev limit are well below the respective estimates
of the 95th percentile (aluminum: 7747.81; manganese: 4870.45) of the
lognormal distributions assumed on the basis of the Shapiro-Wilks
test.

The values of the 95% UCL obtained from the methods discussed above,
calculated from lognormal theory, are:

                          Aluminum    Manganese
    Jackknife              3283.34      1889.52
    Standard Bootstrap     3663.20      1821.55
    Chebychev              5314.99      3291.95
    H-UCL                  9102.73      5176.16

  Observe that the 95% UCLs calculated using
lognormal  theory  from  the  jackknife,  the
bootstrap, and  the Chebychev inequality are
similar to the respective values obtained without
using lognormal theory, and that these are well
below their respective estimated 95th percentiles
for a lognormal distribution.  The 95% UCLs
calculated from the H-statistic, however, exceed
their respective estimated 95th percentiles for a
lognormal distribution.

Example 4.7.  Data from the Elrama School
              Superfund site in Washington
              County, PA.

The data were compiled from two waste piles for
risk evaluations of the contaminants found at the
Elrama  School  Superfund  Site,  Washington
County,  PA.  Twenty-six (26) contaminants (10
inorganics,  12 semi-volatile compounds, and 4
volatile compounds) were detected in both of the
waste   piles.      Using   the  nonparametric
Kolmogorov-Smirnov two-sample test on the two
waste piles, it was concluded that there is no statistically
significant difference between the distributions of the contaminants
from the two
waste piles.  Thus, the data from these two waste
piles were combined to compute all of the relevant
statistics such as the  mean,  the  standard
deviation, and the  UCLs.   This resulted in data
sets consisting of 23 observations (15 from Waste
Pile 1 and 8 from Waste Pile 2).  The results are
provided for two of the contaminants of concern:
aluminum and toluene.
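
  The pooling decision above rests on a two-sample
Kolmogorov-Smirnov test applied to each contaminant;
scipy.stats.ks_2samp implements this test.  A sketch (the waste-pile
arrays are placeholders, since the full site data are not reproduced
here):

    from scipy.stats import ks_2samp

    def pooling_is_defensible(pile1, pile2, alpha=0.05):
        # pile1, pile2: concentrations of one contaminant from the
        # two waste piles (hypothetical arrays for illustration)
        stat, p = ks_2samp(pile1, pile2)
        return p > alpha  # True: no significant difference detected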

Aluminum:  31900.0, 8030.0,  12200.0,  11300.0,
4770.0,  5730.0, 5410.0, 8420.0, 8200.0, 9010.0,
8600.0,  9490.0, 9530.0,  7460.0, 7700.0,  13700.0,
30100.0, 7030.0, 2730.0, 5820.0,  8780.0, 360.0,
7050.0.

The sample mean, standard deviation, and CV of the log-transformed
data are:

ȳ = 8.927, sy = 0.845, and CVy = 0.095.

The sample mean, standard deviation, and CV of the raw data are:

x̄ = 9709.57, sx = 7310.02, and CVx = 0.75.

With the lognormal assumption, the estimates of μ₁, σ₁, and the
standard error of the mean, based on MVUE theory, for aluminum are
10552.68, 10031.60, and 2044.90, respectively.  The MLEs of μ₁, σ₁,
and CV are 10768.22, 10993.32, and 1.02, respectively.  Estimates of
the 80th, 90th, and 95th percentiles for a lognormal distribution are
15323.48, 22224.45, and 30381.95, respectively.

Toluene: 7300.0, 6.0, 6.0, 5.5, 29000.0,  46000.0,
12000.0, 2500.0, 1300.0, 3.0, 510.0, 230.0, 63.0,
6.0, 5.5, 6.0, 6.0, 5.5, 280000.0, 8.0, 28.0, 6.0, 7.0.

The sample mean, standard deviation, and CV of the log-transformed
data are:

ȳ = 4.652, sy = 3.660, and CVy = 0.79.

The sample mean, standard deviation, and CV of the raw data are:

x̄ = 16478.33, sx = 58510.78, and CVx = 3.55.

With the lognormal assumption, the estimates of μ₁, σ₁, and the
standard error of the mean, based on MVUE theory, for toluene are
21328.39, 362471.55, and 18788.05, respectively.  The MLEs of μ₁,
σ₁, and CV are 84702.17, 68530556.56, and 809.08, respectively.
Estimates of the 80th, 90th, and 95th percentiles for a lognormal
distribution are 2264.17, 11329.16, and 43876.88, respectively.

The Shapiro-Wilks statistics for the raw data are
0.707 (aluminum) and 0.313 (toluene), and for the
log-transformed data, the corresponding  values
are 0.781 and 0.818.  The tabulated critical value for a 0.10 level
of significance with n = 23 is 0.928.
Thus, neither a normal nor a lognormal distribution
gives a good fit.
Discussion of Example 4.7

The values of the 95% UCL obtained from the methods discussed above,
without using lognormal theory, are:

                          Aluminum      Toluene
    Jackknife             12327.40     37431.95
    Standard Bootstrap    12246.67     33494.25
    Pivotal Bootstrap     15161.90    152221.00
    CLT                   12216.95     36547.89
    Adjusted CLT          12895.10     47316.80
    Chebychev             16522.94     71013.85

The values of the 95% UCL obtained from the four methods discussed
above, calculated from lognormal theory, are:

                          Aluminum      Toluene
    Jackknife             13542.11     62263.37
    Standard Bootstrap    13579.18    278888.51
    Chebychev             19693.40    105757.50
    H-UCL                 16503.51  18444955.15

  Observe  that  the  95%  UCL  for  toluene,
calculated  from  the  H-statistic, is orders of
magnitude higher than those calculated from the
other methods, and is also orders of magnitude
higher  than the  maximum observed toluene
concentration at the site. Also, with the toluene
data, the pivotal bootstrap method  results  in a
UCL which is two to five times larger than the
others computed from the non-lognormal theory
methods.  It is even larger than the Chebychev
limit. As noted earlier, this is possible when the
standard error  of the  point estimate  is  also
estimated from the data.  In most environmental
applications,  the  true   population   standard
deviation of the point estimate is unknown, and
therefore,  it  needs to  be estimated  from the
available data.  Note, however, that even this bootstrap-t UCL is
still two orders of magnitude smaller than the H-UCL.
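
  The pivotal (bootstrap-t) method studentizes each resampled mean by
that resample's own standard error, and the lower tail of the
resulting pivot can be very long for right-skewed data, which is why
this UCL can exceed even the Chebychev limit.  A minimal sketch:

    import numpy as np

    def pivotal_bootstrap_ucl(x, alpha=0.05, B=2000, seed=0):
        rng = np.random.default_rng(seed)
        n, xbar = len(x), np.mean(x)
        se = np.std(x, ddof=1) / np.sqrt(n)
        tstar = np.empty(B)
        for b in range(B):
            xb = rng.choice(x, size=n, replace=True)
            seb = np.std(xb, ddof=1) / np.sqrt(n)
            tstar[b] = (np.mean(xb) - xbar) / seb  # studentized pivot
        # the lower alpha-quantile of t* is typically strongly
        # negative for skewed data, which pushes the UCL upward
        return xbar - np.quantile(tstar, alpha) * se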

  Note, also, that the CV (0.75) computed from
the raw data for aluminum is less than 1. The use
of the  CV-test for normality could lead one to
assume normality, even though the Shapiro-Wilks
test strongly rejects the  normal distribution (p-
value = 0.00002).

5. Summary and Recommendations

  It is seen from the simulated examples that, even when the
underlying distribution is lognormal, the jackknife, bootstrap, and
CLT procedures yield lower, and more accurate, UCLs than the
H-statistic does.  In each of the four simulation experiments, the
95% UCLs computed from all of the above methods exceed the true
respective population means, but the 95% H-UCL is consistently larger
than the 95% UCLs obtained from the other methods, except in some
cases where it is comparable to the conservative Chebychev result.
It is also seen from the
simulation examples that the estimate of the CV
based on the MLEs is closer to the true CV than
the usual (moment) estimate of CV. Furthermore,
the  usual  estimate  of  the  CV  appears to
underestimate the true CV.  In some of the examples, the usual
estimate of the CV is less than 1, while the true population CV is
well above 1.  That is, the rule of thumb (CV-
test)  which declares the distribution to be normal
when the moment estimate of the CV is less than
1, can frequently lead to an incorrect assumption
about the underlying distribution of the data.

  Moreover, from the examples discussed in this paper, it is observed
that the H-UCL can be orders of magnitude higher than the other
UCLs, even when the data were obtained from a lognormal population,
and can lead to incorrect conclusions.  This is
especially true for samples of smaller sizes (e.g., n < 30).  It
appears that the lognormal distribution and the H-UCL tend to hide
contamination rather than reveal it.  Under the assumption of the
lognormal distribution, one can get away with very little or no
cleanup at a polluted site.  Thus, although the H-UCL is
theoretically sound and possesses optimal properties, the practical
merit of the H-UCL in environmental applications is questionable, as
it can become an order of magnitude higher than the largest observed
concentration when the sd of the log-transformed data starts
exceeding 1.0.  It is therefore recommended that, in environmental
applications, the use of the H-UCL to obtain an estimate of the upper
confidence limit of the mean be avoided.

  Based on the Monte Carlo simulation results and the authors'
experience with Superfund site work, the following steps are
recommended for computing a UCL of the mean of the contaminant(s) of
concern:

1)  Plot histograms of the observed contaminant concentrations and
    perform a statistical test of the normal or lognormal
    distribution (e.g., the Shapiro-Wilks test).  Do not use the rule
    of thumb that declares the data distribution to be normal if the
    CV is less than 1.

2)  If a normal distribution provides an adequate
    fit to the data,  then use the  Student's  t
    approach (equivalent to the jackknife) for
    calculating the UCL of the population mean.

3)  If a lognormal distribution provides an adequate fit to the data,
    then a) use the lognormal-theory-based formulas to compute the
    MVUEs of the population mean and standard deviation, and b) use
    these MVUEs with the jackknife, bootstrap, or Chebychev approach
    to calculate a UCL of the mean.  Do not use the UCL based on the
    H-statistic, especially if the number of samples is less than 30.

4)  If the data distribution turns out to be neither normal nor
    lognormal, then use the nonparametric versions of the jackknife
    or bootstrap to calculate a UCL.  Even if the lognormal
    distribution seems to provide a reasonable fit to the data, if
    there is evidence of a mixture of two or more subpopulations, or
    if outliers are suspected, then using one of the nonparametric
    methods discussed above is recommended.  A sketch combining these
    four steps follows this list.
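
  The following is a condensed sketch of steps 1-4 (histograms
omitted; in the lognormal branch, a full treatment would use the MVUE
of the mean and its standard error rather than the simple
Chebychev-type limit shown):

    import numpy as np
    from scipy.stats import shapiro, t

    def recommended_ucl(x, alpha=0.05, fit_level=0.10):
        n, xbar = len(x), np.mean(x)
        se = np.std(x, ddof=1) / np.sqrt(n)
        if shapiro(x)[1] > fit_level:          # step 2: normal fit
            return xbar + t.ppf(1 - alpha, n - 1) * se
        if shapiro(np.log(x))[1] > fit_level:  # step 3: lognormal fit
            return xbar + np.sqrt(1 / alpha) * se  # Chebychev-type limit
        # step 4: neither fits; nonparametric (percentile) bootstrap
        rng = np.random.default_rng(0)
        means = [np.mean(rng.choice(x, size=n, replace=True))
                 for _ in range(2000)]
        return float(np.quantile(means, 1 - alpha))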

Notice

  The  U.S. Environmental Protection  Agency
(EPA),  through  its  Office of Research and
Development (ORD), funded and  prepared this
Issue Paper.  It has been peer reviewed by the
EPA and approved for publication.  Mention of
trade names  or commercial products does not
constitute  endorsement or recommendation  by
EPA for use.

                                         References
Aitchison, J., and Brown, J. A. C.  (1976), The
   Lognormal   Distribution,   Cambridge:
   Cambridge University Press.
Bain,  L.  J.  and  Engelhardt,  M.   (1992),
   Introduction to Probability and Mathematical
   Statistics, Boston: Duxbury Press.
Bowers,  T., Neil,  S.,  and Murphy, B.  (1994),
   "Applying Hazardous Waste Site  Cleanup
   Levels: A Statistical Approach to  Meeting
   Site   Cleanup  Goals   on  Average."
   Environmental Science and Technology.
Bradu, D., and Mundlak, Y. (1970), "Estimation
   in Lognormal Linear Models," Journal of the
   American Statistical Association, 65, 198-211.
Chen, L. (1995), "Testing the Mean of Skewed
   Distributions," Journal  of the American
   Statistical Association, 90, 767-772.
Efron,  B. (1981),  "Nonparametric Estimates of
   Standard Error: The Jackknife, the Bootstrap,
   and Other Resampling Plans," Biometrika.
Efron,  B. (1982),  The Jackknife, the Bootstrap,
   and Other Resampling Plans, Philadelphia:
   SIAM.
Efron,  B., and Gong,  G. (1983), "A  Leisurely
   Look at the  Bootstrap, the Jackknife, and
   Cross-Validation," The American Statistician,
   37, 36-48.
EPA (1992), "Supplemental Guidance to RAGS:
   Calculating   the   Concentration   Term,"
   Publication 9285.7-081, May 1992.
Finney, D. J. (1941), "On the Distribution of a
   Variate Whose Logarithm is Normally
   Distributed," Journal of the Royal Statistical
   Society, 7, 155-161.
Gilbert,  R.O.  (1987),  Statistical Methods for
   Environmental Pollution Monitoring, New
   York: Van Nostrand Reinhold.
Gilbert, R.O. (1993), "Comparing Statistical Tests
   for Detecting Soil Contamination Greater than
   Background,"  Pacific Northwest Laboratory,
   Technical Report No. DE 94-005498.
Hall, P. (1988), "Theoretical Comparison of
    Bootstrap Confidence Intervals," Annals of
    Statistics, 16, 927-953.
Hogg, R.V., and Craig, A.T. (1978), Introduction
    to  Mathematical   Statistics,  New  York:
    Macmillan Publishing Company.
Land, C. E. (1971), "Confidence Intervals  for
    Linear Functions of the Normal Mean and
    Variance," Annals of Mathematical Statistics,
    42, 1187-1205.
Land, C. E. (1975), "Tables of Confidence Limits
    for Linear Functions of the Normal Mean and
    Variance," in Selected Tables in Mathematical
    Statistics, Vol. III, American Mathematical
    Society, Providence, R.I., 385-419.
Lechner,  J.A.  (1991),  "Estimators for  Type-II
    Censored (Log) Normal  Samples," IEEE
    Transactions on Reliability, 40, 547-552.
Miller, R. (1974), "The Jackknife - A Review,"
    Biometrika, 61, 1-15.
Ott, W. (1990), "A Physical Explanation of the
    Lognormality of Pollutant Concentrations,"
    Journal of the Air and Waste Management
    Association, 40, 1378-1383.
Power, M. (1992), "Lognormality in the Observed
    Size Distribution of Oil and Gas Pools as a
    Consequence of Sampling Bias," Mathematical
    Geology, 24, 929-945.
Singh, A., and Nocerino, J. M. (1995), "Robust
    Procedures for the Identification of Multiple
    Outliers," Chemometrics in Environmental
    Chemistry, Statistical Methods, Vol. 2, Part G,
    229-277, Germany: Springer-Verlag.
Staudte, R. G., and Sheather, S. J. (1990), Robust
    Estimation and Testing, New York: John
    Wiley & Sons.
Stewart, S. (1994), "Use of Lognormal
    Transformations in Environmental Statistics,"
    M.S. Thesis, Department of Mathematics,
    University of Nevada, Las Vegas.