United States
               Environmental Protection
               Agency
Office of
Research and
Development
Office of Solid
Waste and
Emergency
Response
EPA/600/R-02/084
October 2002
<>EPA   Technology  Support Center  Issue
               Estimation  of the Exposure Point Concentration Term
               Using  a Gamma Distribution
               Anita Singh1, Ashok K. Singh2, and Ross J. laci3

                  The   Technology  Support  Projects,
               Technology  Support  Center  (TSC)  for
               Monitoring  and  Site Characterization was
               established in 1987 as a result of an agreement
               between  the  Office  of  Research  and
               Development (ORD), the  Office  of Solid
               Waste and Emergency Response (OSWER)
               and all ten Regional Offices. The objectives of
               the Technology Support Project and the TSC
               were to make available and provide ORD's
               state-of-the-science   contaminant
               characterization technologies and expertise to
               Regional  staff, facilitate the evaluation and
               application   of  site   characterization
               technologies at Superfund and RCRA sites,
               and to improve  communications between
               Regions and ORD Laboratories.   The TSC
               identified  a need to provide federal, state, and
               private environmental scientists working on
               hazardous waste  sites with a technical issue
               paper  that  identifies   data  assessment
               applications that can be implemented to better
               define  and  identify  the distribution   of
               hazardous  waste  site  contaminants.   The
               examples  given in this Issue  paper and  the
               recommendations provided were the result of
               numerous   data  assessment  approaches
               performed by the TSC at hazardous waste site s.

                  This paper was prepared by Anita Singh,
               Ashok K.  Singh, and Ross J. laci. Support for
               this project was provided by the EPA National
               Exposure  Research   Laboratory's
                 Environmental  Sciences Division with the
                 assistance of the Superfund Technical Support
                 Projects Technology  Support Center for
                 Monitoring  and   Site  Characterization,
                 OSWER's Technology Innovation Office, the
                 U.S. DOE Idaho National Engineering and
                 Environmental Laboratory, and the Associated
                 Western  Universities  Faculty  Fellowship
                 Program. For further information,  contact
                 Christopher Sibert, Technology Support Center
                 Director, at (702) 798-2270, Anita Singh at
                 (702) 897-3234, or A. K. Singh at (702) 895-
                 0364.

                     In Superfund and RCRA projects of the
                 U.S. EPA,  cleanup,  exposure,  and  risk
                 assessment decisions are often made based
                 upon the  mean   concentrations  of  the
                 contaminants of potential concern. A 95%
                 upper  confidence  limit  (UCL)  of  the
                 population mean is used to estimate the
                 exposure point  concentration  (EPC)  term
                 (EPA, 1992), to  determine the attainment of
                 cleanup  standards (EPA, 1989), to estimate
                 background level contaminant concentrations,
                 or to compare the soil concentrations with the
                 site specific soil screening levels (EPA, 1996).
                 It is, therefore, important to  compute  an
                 accurate and stable 95% UCL of the population
                 mean from the available data.
                     The formula for computing a UCL depends
                 upon the  data  distribution.  Typically,
                 environmental data are positively skewed, and
                Lockheed Martin Environmental Services, 1050 East Flamingo, Ste. E120, Las Vegas, NV 89119
               ' Department of Mathematics, University of Nevada, Las Vegas, NV 89154
               ; Department of Statistics, University of Georgia, Athens, GA 30602-1952
                Technology Support Center for
                Monitoring and Site Characterization,
                National Exposure Research Laboratory
                Environmental Sciences Division
                Las Vegas, NV 89193-3478
             Technology Innovation Office
             Office of Solid Waste and Emergency Response,
             U.S. EPA, Washington, D.C.

             Walter W. Kovalick, Jr., Ph.D., Director
                                                         Printed on Recycled Paper
                                                                                289CMB02.RPT

-------
a lognormal distribution (EPA, 1992) is often used
to model such data distributions. The H-statistic
(Land, 1971) based upper confidence limit of the
mean (denoted henceforth as H-UCL) is used in
these applications.  However, recent research in
this area (Hardin and Gilbert, 1992; Singh, et al.,
1997,  1999;  and  Schultz  and Griffin,  1999)
suggest that this may not be an appropriate choice.
It is observed that for large values of standard
deviation (e.g., exceeding 1.5 - 2.0) of the log-
transformed data, the use of H-statistic leads to
unreasonably large, unstable, and impractical UCL
values. This is especially true for sample sets of
smaller sizes (e.g., n < 20-25). The H-UCL is also
very sensitive to a few low  or high values. For
example, the addition of a sample below detection
limit can cause the H-UCL to increase by a large
amount. Realizing that the use of H-statistic can
result in an unreasonably large UCL, it has been
recommended (EPA, 1992) to use the maximum
observed value as an estimate of the UCL (EPC
term) in cases where the H-UCL exceeds  the
maximum observed value. Also, when the sample
size   is 5  or  less, the maximum observed
concentration is often used as an estimate of the
EPC term. However, it is observed that for highly
skewed data sets, use of the maximum observed
concentration may not provide the specified 95%
coverage to the  population  mean  (as shown in
Section 5).  This  is especially true for samples of
small size (e.g., 5-10). For larger sample sets/data
sets  (e.g.,  n« 20),  the   use of the  maximum
observed value results in an overestimate of the
95% UCL of population mean.  For such highly
skewed data sets, use of a  gamma distribution
based UCL of the mean provides a viable option.

    A positively  skewed data set can quite often
be modeled by  lognormal  as  well as gamma
distributions.  Due to the relative  computational
ease, however, the lognormal distribution is used
to model positively skewed  data sets.  However,
use of a lognormal model for an  environmental
data  set unjustifiably  elevates  the  minimum
variance unbiased estimate  of the mean and  its
UCL to levels that may not be  applicable  in
practice. In this  paper,  we propose the use of a
gamma distribution to model positively skewed
data sets. The objective of the present work is to

above, the test is based on a one-sided UCL of the
mean. A one-sided UCL is a statistic such that the
study procedures which can be used to compute a
stable and accurate UCL of the mean based upon
a gamma distribution.   Several parametric and
non-parametric   (e.g.,  standard  bootstrap,
bootstrap-t,  Hall's   bootstrap,  Chebyshev
inequality) methods of computing a UCL of the
unknown  population mean,  • ,• have also been
considered. Monte Carlo simulation experiments
have been performed to compare the performances
of these methods. The comparison of the various
methods  has been  evaluated  in  terms of the
coverage  (confidence coefficient) probabilities
achieved by the various UCLs. Based upon this
study, in Section 6, recommendations have been
made about the computation of a UCL of the mean
for  skewed data distributions originating from
various environmental applications.


1. Introduction

    Suppose the Regional Project Manager (RPM)
of a Superfund  site believes  that the mean
concentration of the  contaminant of potential
concern (COPC) exceeds a  specified cleanup
standard, Cs, but the potentially responsible party
(PRP) claims that the mean concentration is below
Cs.  In statistical terminology this can be stated in
terms of testing of hypotheses.  The hypotheses of
interest  are the null hypothesis  that the mean
concentration exceeds the cleanup standard, H0: • •
• Cs, versus the alternative hypothesis, Ha:  • < Cs.
This formulation of the problem is protective of
the environment because it assumes that the area
in question is contaminated, and the burden of
testing is to show otherwise. In order to perform
a test of these hypotheses, a  random sample is
collected from the site and concentrations of the
COPC in  these  samples are determined.   A
suitable statistical test is then used to make  a
decision.

    A  convenient  way to perform a  test  of
hypotheses  about   an  unknown  population
parameter is first to compute a confidence interval
for  the  parameter,  and  then  reject H0  if the
hypothesized value,  in this  case the cleanup
standard,  Cs, lies outside of the  confidence
interval. For the one-sided hypotheses mentioned

true population mean, • -is less than the UCL with
aprescribed probability or level of confidence, say

-------
(1 • ••). For example, if the UCL is a 95% one-
sided upper confidence limit, then • < UCL with
95%confidence (orwith 0.95 probability), and the
set of all real numbers less than UCL forms a 95%
upper  one-sided  confidence  interval.    The
corresponding statistical test will reject H0 (i.e.,
declare the site clean) if UCL < Cs, and the
significance level of this test, or false positive
error rate, is •  r This follows because if the  site is
contaminated  (i.e., •  • Cs), then the probability of
declaring it clean is the probability that UCL < Cs,
which is at most • r

    Testing of these  hypotheses and computation
of  a UCL  of the  mean  depends  upon  the
population  distribution   of  the  COPC
concentrations.  Several procedures are available
to compute a  UCL of the mean of a normal or a
lognormal  distribution  in  the  literature  of
environmental statistics (i.e., Singh,  Singh, and
Engelhardt, 1997,   1999; Schultz and Griffin,
1999). In this paper, we focus our effort on the
inference procedures for an unknown population
mean based upon a gamma distribution.   The
objective here is to study procedures that can be
used to compute an accurate and stable UCL of the
mean.  Several parametric (Johnson, 1978;  Chen,
1995; and, Grice  and Bain,  1980)  and  non-
parametric (e.g., standard bootstrap,  bootstrap-t
(Efron, 1982,  Hall, 1988), Hall's bootstrap  (Hall,
1992),  Chebyshev  inequality)  methods  of
computing a  UCL of population mean, • ,• of a
skewed distribution  have also been considered.
The comparison of the various methods has been
performed in  terms  of the coverage (confidence
coefficient) probabilities provided by the various
95% UCLs. Monte Carlo simulation experiments
have been performed to compare the performances
of  these methods.    Based  upon this  study,
recommendations  have  been made  about the
computation of a UCL of the mean for skewed
data  distributions  originating  from  various
environmental applications.

    Section 2  has a brief description of the gamma
distribution and a discussion of goodness-of-fit
tests for the gamma distribution.  Section 2 also
describes estimation of gamma parameters and the
computation of the UCL of mean based upon a
gamma distribution.  Section 3 describes the
various  other methods which can be used  to
compute a UCL of population mean.  Section 4
has some examples illustrating the procedures
used.   Section 5  discusses the  Monte  Carlo
experiments used to illustrate these methods and
results.     Section   6  consists   of  our
recommendations for dealing with heavily skewed
data sets.


2. The Gamma Distribution

    A continuous random variable, X (e.g., COPC
concentration), is said to follow a two-parameter
gamma distribution, G(k,«) with parameters k>0
and • K), if its probability density function is
given by the following equation:
                                         (1)
and zero otherwise. The parameter k is the shape
parameter,  and • • is the scale parameter (the
location parameter is set to zero).  Plots of the
gamma distribution, G(k,») for varying choices of
the shape parameter, k, and the scale parameter, •;
are shown in Figures 1-4. These figures have been
generated using the statistical software package,
MINITAB. The mean, variance, and skewness of
a gamma distribution, G(k,«) are given as follows:

                      Mean = ••=£•.•     (2)

                  Variance = •2i = k-2:     (3)

                     Skewness = 2/-k.     (4)

From equation (4), it  is noted  that skewness
increases as the shape  parameter k  decreases.
Figures 1 and 2 have the graphs of highly skewed
distributions. As £ increases, skewness decreases,
and  consequently  a gamma distribution starts
approaching a normal distribution for larger values
of k (e.g., k • 40), as can be seen in Figures 3 and
4.  Thus for larger values of k, the UCL based
upon a gamma distribution and a UCL based upon
a normal  distribution are in close  agreement.
From Figures 1 -4, it can also be seen that the scale
parameter,  •;  simply  affects the scale  of the
distribution and has no effect on the shape of the
gamma distribution. In practice, a highly skewed
data set can be fitted  by both lognormal and
gamma distributions. However,  the  difference
between the   UCLs  obtained using  the two

-------
distributions can be enormous. This is especially
true when the shape parameter is small (e.g., k <
          1).  This is illustrated in examples 2-5 given in
          Section 4.
                     0  -
                         0,0
0,5
1.0
1,5
                 Figure 1.  Graphs of the gamma distributions G(0.1, 1), G(0.2, 1), and
                           G(0.5, 1).
                     0,18
                     0,16
                     0,14 —
                     0.12 —
                     0,10
                     0.08 -
                     0,06 —
                     0,04 -
                     0.02 —
                     o.oo -
                            I
                            0
                                       I
                                       8
                Figure 2.  Graphs of the gamma distributions G(0.1, 50), G(0.2, 50), and
                          G(0.5, 50).

-------
                     0,4
                     0.3 -I
                     0.2
                     0.1 —
                     0.0 -1
                                               10
                                 15
               20
25
                                                     X
                  Figure 3.  Graphs of the gamma distributions G(2, 1), G(4, 1), and
                             G(10, 1).
                     0,008

                     0,007

                     0.006  -

                     0,005

                     0,004

                     0.003

                           -

                     0.001

                           -1

                                                      500
                                                       X
                                                      1000
                 Figure 4.  Graphs of the gamma distributions G(2, 50), G(4, 50), and
                           G(10, 50).
2.1  Goodness-of-Fit
     Distribution
Tests   for  Gamma
    Since the goodness-of-fit tests for gamma
distributions  are  not readily available, a  brief
description of those tests is given here.  Several
tests based upon empirical distribution functions
(EDF) exist in the statistical literature, and can be
used to test for a gamma distribution. These tests
include  Kolmogorov-Smirnov, D-test statistic,
Anderson-Darling, A2-test statistic,  and Cramer-
von  Mises test statistics,  W2  and U2 (e.g., see
D'Agostino and Stephens (1986), page 101). The
exact critical values of these statistics are not
available; this is especially true when the shape

-------
parameter, k, of the gamma distribution is less
than 1. Some asymptotic upper-tail critical values
of the test statistics, W2, A2, and U2 are given in
D'Agostino and Stephens (1986) for values of the
shape parameter, k* 4 (pages 152-155). Schneider
(1978) also studied the goodness-of-fit tests for
gamma distribution. He derived the critical values
of  Kolmogorov-Smirnov,  D-test  statistic for
selected values of the shape parameter, k, and the
sample  size for the gamma distribution  with
unknown parameters. All of these tests are right-
tailed. This means that if a computed test-statistic
exceeds its respective • 400% critical value, the
null hypothesis of gamma distribution will be
rejected at • 'level of significance.

    Most of the commercially available software
packages such as SAS and S-PLUS do not provide
the goodness-of-fit tests for a gamma distribution.
The software ExpertFit  (developed  by Law &
Associates, Inc., 2001) performs a goodness-of-fit
test for gamma distribution using the Anderson-
Darling test statistic, A2 and Kolmogorov-Smirnov
test statistic. Due to the unavailability of the exact
critical values  of the general test statistics, the
software ExpertFit (Law and Kelton (2000)) uses
approximate critical values of the test statistic
under the assumption that all parameters  (e.g.,
shape and scale) of the distribution  are known,
that is the distribution is completely specified as
given in Stephens (1970).  Those critical values
are  the generic critical values for all completely
specified  distributions.   ExpertFit  uses these
generic critical  values  to test for  a gamma
distribution. These critical values are also given
on page 105 of D'Agostino and Stephens (1986).
The authors of this article  also developed  a
program, GamGood (2002), to test for a gamma
distribution. This program computes the various
goodness-of-fittest statistics using the formulae as
given on page 101 of D'Agostino and Stephens
(1986). In this paper, we also use the smoothed
percentage points of the Kolmogorov-Smirnov (K-
S), D-test statistic as computed by Schneider and
Clickner (1976), Schneider (1978) to test  for a
gamma distribution.   An illustration  of the
goodness-of-fit test for a gamma distribution has
been discussed in Example 1.
Example 1
The following data set of size 20 is given by Grice
and Bain (1980): 152, 152, 115, 109, 137, 88, 94,
77, 160, 165, 125,40, 128, 123, 136, 101,62, 153,
83, and  69.   None of the parameters of the
underlying distribution are known. The various
goodness-of-fit test statistics are given by A2 =
0.41496, W2 = 0.06142, U2 =  0.05111, and D =
0.13867.  The estimated shape parameter, k, for
this data  set is 7.513 (see  Example 1,  to  be
continued). For a shape parameter of 7.513, the
asymptotic 5% critical values (Table  4-21, page
155, D'Agostino and Stephens, 1986) of these
statistics are:  A2 = 0.755, W2 = 0.127, and U2 =
0.117, and the critical value of the K-S statistic, is
D = 0.196 (Table 7 of Schneider, 1978). Since all
of the test-statistics are less than their respective
critical  values, it  is concluded  that there  is
insufficient evidence to conclude at the 0.05 level
of significance that the data do  not follow a
gamma distribution.

2.2  Estimation of Parameters of the Gamma
     Distribution
   Next, we consider the estimation  of the
parameters of a  gamma  distribution.   The
population mean  and variance  of  a gamma
distribution,  G(k,«), are  functions of  both
parameters, k and  •.• In order to estimate the
mean,  one has to obtain estimates of k and  • .•
Computation of the maximum likelihood estimate
(MLE) of k is quite complex and requires the
computation of Digamma and Trigamma functions
(Choi and Wette, 1969).  Several  authors (Choi
and Wette, 1969, Bowman and Shenton, 1988,
Johnson, Kotz, and Balakrishnan, 1994) have
studied the  estimation  of  shape  and  scale
parameters of a  gamma  distribution.   The
maximum  likelihood estimation  procedure  to
estimate shape and scale parameters of a gamma
distribution is described below.

   Let xlyr2,...,xn be a random sample (of COPC
concentrations)  of  size  n  from  a gamma
distribution, G(k,»), with  unknown  shape and
scale parameters k and • • respectively.  The log
likelihood function is given as follows:
                                                       log !(*„ x2,.,,A;A,0) = -ok log(0) -
                                                            (k) + (k - l)£log jr, -   J>,.
                                                                                             (5)

-------
To find the MLEs of k and • • which are k and 0 ,
respectively, we differentiate the log likelihood
function as given in (5) with respect to k and • ,•
and set the derivatives to zero. This results in the
following two equations:
                      i
                      -
                            n
                               ), and
                              E x
 (6)


 (7)
Solving equation (7) for 9 and substituting the
result in  equation  (6), we get the  following
equation:
            ,    1                f\
      - log(*)= I £  tog^)  - logx. (8)
There does not exist a closed form solution of
equation (8).   This equation needs to  be solved
numerically   for   k ,   which   requires   the
computation  of    digamma  and    trigamma
functions.  This  is quite easy  to  do  using  a
personal computer.  An estimate of k can be
computed  iteratively  by using  the   Newton-
                                         Raphson  (Faires  and  Burden,  1993) method
                                         leading to the following iterative equation:
                                                  Ir = t
                                                  K.J  ft |-1

                                                     (9)
             ₯ (i) = |, (log T (*)), an
                                         The  iterative process  stops  when  k starts to
                                         converge.  In practice,  convergence is typically
                                         achieved in  fewer  than  10  iterations.    In
                                         equation (9):
                                                                               (K)).
                                         where • •(£) is the digamma function and • "(k) is
                                         the trigamma function.  In order to obtain the
                                         MLEs of k and • • one needs to compute the
                                         digamma  and  trigamma  functions.    Good
                                         approximate values for these two functions (Choi
                                         and Wette,  1969) can  be  obtained using the
                                         following approximations.

                                         For k* 8, these functions are approximated by:
                       (k) « log(A) - {1 + [ 1 - (1/10 - l/(21 A2)) / A2] / (6A) } / (2k),     (10)

                                               and

                       4"(A) » {l+ (l+ [l- (1/5 - l/(7ff))/(?]/(3k)}/(2k)}/k.      (11)
For k < 8, one can use the following recurrence
relation to compute these functions:
                  !)-!/*;
and 4"(A) =₯'(* +!)+!
                                         (12)
                                         (13)
The iterative process requires an initial estimate of
k.  A good starting value for k in this iterative
process is given by k0 = \I(2M).  Thorn  (1968)
suggests the  following  approximation  as an
estimate of k:
    k~
    * ~
                   4A/
                        l-
(14)
                                         w
Bowman and Shenton (1988) suggested using k as
given by equation (14) to be a starting value of k
for an iterative procedure, calculating k, at the /*
                                         iteration from the following formula:
                     *-,=
                                                                                             (15)
Both equations (9)  and (15) have  been used to
compute the MLE of k.  It is  observed that the
estimate, k based upon Newton-Raphson method
as given by equation (9) is in close agreement with
that obtained using equation (15)  with Thorn's
approximation as an initial estimate.  Choi and
Wette (1969) further concluded that the MLE ofk,
h , is biased high.  A bias corrected (Johnson, et
al., 1994) estimate of k is given by the following
equation:
                           + 2/(3«),16   (16)

In (16), k is the MLE of k obtained using either
(9) or  (15).   Substitution  of equation (16)  in

-------
equation (7)  yields  an estimate  of the scale
parameter, • 'given as follows:
                                         (17)
Next  we  provide  an example  illustrating  the
computations of the MLEs of k and • r

Consider the data set of Example 1.  The sample
mean,  x , is  113.5.   The MLEs  of the two
parameters, k and • • are obtained iteratively using
the Newton-Raphson method  (equation 9), and
Bowman and  Shenton's  proposal as given by
equation (15). The two sets of estimates are in
agreement and are given by k = 8.799, and 9 =
12.893.    The  corresponding  bias-corrected
estimates of k and • • as given by equations (16)
and(17)are k " = 7.51267and  e" = 15.101. Note
that the  bias-corrected  MLE   of  the   shape
parameter,  k  = 7.51267, which is quite  high;
consequently, the skewness of this data set is mild
and  its  MLE  =  0.73  (from   equation  (4)).
Goodness-of-fit tests performed on this data set
suggest that the data cannot reject the hypothesis
that the data are normal or that they are lognormal.

2.3  Computation of UCL of the Mean of a
     Gamma,  G(k, •> Distribution
    In  the statistical literature, even though
procedures exist to compute a UCL of the mean of
a gamma distribution (Grice and  Bain,  1980,
Wong, 1993), those procedures have not become
popular due to their  computational complexity.
Those  approximate  and  adjusted  procedures
depend upon the Chi-square distribution and an
estimate  of the shape parameter, k.   As  seen
above, computation  of a MLE of k is  quite
involved, and this works as a deterrent to the use
of a gamma distribution-based UCL of the mean.
However, the  computation of a gamma UCL
currently should not be a problem  due to easy
availability of personal computers.

    Given a random sample, x1,x2,...,xn of size n
from a gamma, G(k,f)  distribution,  it can be
shown  that  2nX 16  follows  a  Chi-square
distribution, XM , with 2nk degrees of freedom
(df). It is noted that  (2nX) 19 =2(Xl+X2+ ...

For • •= 0.05 (confidence coefficient of 0.95), • •=
0.1, and • «= 0.01, these adjusted probability levels
+ Xn)  I • • Using  a simple transformation  of
variables,  it is seen that  each  of the  random
variables, 2X/fy:=].,2,...,n  follows a chi-square,
XM, distribution.  Also those chi-square random
variables are independently distributed. Since the
sum of the independently distributed chi-square
random variables  also  follows  a chi-square
distribution, it is  concluded that  (2nX) I 9
follows a chi-square, XM  distribution with 2nk
degrees-of-freedom. When the shape parameter,
k, is known, a uniformly most powerful test of size
• «of the null hypothesis, H0: • • Cs, against the
alternative hypothesis, H^ • 
-------
obtain an adjusted • -for values of n not covered in
the table.  The adjusted (I--) 100% UCL of
gamma mean, • •= k« «is given by:
              - UCL
where • «is given in Table 1 for • -=0.05, 0.1, and
0.01.  Note that as the sample size, n, becomes
large, the adjusted probability level, • • approaches
• r Except for the computation of the MLE of k,
equations (19)  and  (20) provide  simple  Chi-
square-distribution-based UCLs of the mean of a
gamma distribution. It should also be noted that
the UCLs as given by  (19) and (20) only depend
upon the estimate of the shape parameter, k, and
are independent of the scale parameter, • • and its
estimate.   Consequently,  as expected,   it   is
observed that coverage probabilities for the mean
associated with these UCLs do not depend  upon
the values  of the scale parameter,  • r  This  is
further discussed in Section 4.
Table 1 . Adjusted Critical Level, • -for Various Values of • -and n

n
5
10
20
40
••
• •= 0.05
probability level, • •
0.0086
0.0267
0.0380
0.0440
0.0500
. = o.1
probability level, • •
0.0432
0.0724
0.0866
0.0934
0.1000
• •= 0.01
probability level, • •
0.0000
0.0015
0.0046
0.0070
0.0100
    It is observed (Figures 5-7) that except for
highly skewed (£<0.15) data and samples of small
size (e.g.,  <10), the adjusted gamma UCL given
by (20) provides the  specified 95% coverage of
the population mean. It is also noted that for
highly skewed (k<0.15) data sets of small sizes,
except for the H-UCL, the  coverage probability
provided by  the adjusted gamma UCL is the
highest and is close to the specified level, 0.95.
However, for these highly skewed data sets, the H-
statistic results in unacceptably large values of the
UCL.  This is further illustrated in examples 2-4.
For values of k • 0.2, the specified coverage of
0.95 is always approximately achieved by the
adjusted gamma UCL given by equation (20), as
shown in Figures 7-14.
                   95
                3
                o
                   80
I
£
<»
o»
e
o
                   65"
                   50
          - - * - - Max-test
          —x— Adj-CLT
          —+— Modified-!
          —•*— Chebyshev
          —»— Bootstrap-!
          —o— Apx-Gamma
          —o— Adj-Gamma
                                20
                80
                           40          60
                                   Size
Figure 5.  Graphs of coverage probabilities by 95% UCLs of mean of
          G(k = 0.10, "=50).
100

-------
   95-
   75-
                                               • - * - - Max-test
                                               — — Adj-CU
                                               —+— Modified-!
                                               —>v— Chebyshev
                                               	*	 Boolstrap-T
                                               —o— Apx-Gamma
                                               —a— Adj-Gamma
     0
                20
80
                                             100
                           40-         60
                                   Size
Figure 6.  Graphs of coverage probabilities by 95% UCLs of mean
           ofG(k = 0.15, "=50).
   95
3
o
-
s
£  85
S  75
e
   65
                20
                               • • * - - Max-test
                               — *— Ad)-CLT
                               — +— Modfcd-T
                               — A— Chebyshev
                               — <»— Bootstrap-T
                                                    Adj-Gamma
                                  80
           100
                           40          60
                              Sample Sfea
Figure 7.  Graphs of coverage probabilities by 95% UCLs of mean
           ofG(k = 0.20, "=50).
s
   95
   85-
   751
   65
                               • - * • - Max-test
                               —-— Adj-CLT
                               — t— Modified-!
                               —tr—• Chebyshev
                               —"—- Bootstrap-?
                               —o— Apx-Gamma
                               —o— Adj-Gamma
     0
20
80
                                                             100
                           40          60
                              Sample Size
Figure 8.  Graphs of coverage probabilities by 95% UCLs of mean
           ofG(k = 0.25, "=50).
                               10

-------
   95
I  85
   75
                 20
                                                —•+— Modified-T
                                                • A'  • Chebyshev
                                                —*—• Bootstrap-T
                                                •—o— Apx-Gamrna
                                                —o— Adj-Garnma
40          60
  Sample
           80
                                                             	T j

                                                             100
Figure 9.  Graphs of coverage probabilities by 95% UCLs of mean
           ofG(k = 0.50, "=50).
**^
O
o
                                                  - * - • Max-test
                                                   x— Adj-CLT
                                                   + — Modified-T
                                                   a — Chebyshev
                                                   , — Bootstrap-!
                                                   ° — Apx-Gamma
                                                   ° — AtJJ-Gamma
                                   .

                                   -

                                   -
                                   I
                 20
                       80
                      100
                            40         80
                              Sample Size
Figure 10. Graphs of coverage probabilities by 95% UCLs of mean
           ofG(k = 1.0, "=50).
                 20
                                                 — s— Adj-CLT
                                                 _+_ Modifled-T
                                                 — tr — Chetayshev
                                                 — • — Bootstrap-T
                                                 — o^ Apx-Gamma
                                                 — J— Adj-Gamtns
40
60
                                                  80
100
                               Sample Siz*
Figure 11. Graphs of coverage probabilities by 95% UCLs of mean
           ofG(k = 2.0, "=50).
                               11

-------
0.
»  so
   85
                               - - * - - Max-test
                               — ••—Adj-CtT
                               —f— Modified-T
                               —A— Chebyshsv
                               —•— Bootstrap-T
                               —o— Apx-Gamma
                               —o— Adj-Qamma
                20
           40
           60
           80
           100
                              Sample Size
 Figure 12. Graphs of coverage probabilities by 95% UCLs of mean
           ofG(k = 4.0, "=50).
                 20
           40
           80
           80
           100
                              Sample Sto
 Figure 13. Graphs of coverage probabilities by 95% UCLs of mean
           ofG(k = 6.0, "=50).
   99'
o
3
o
O3
S
                                                • - «- - Max-test
                                                — «- Adj-CLT
                                                — i-— Modtfied-T
                                                —a— Chebyshev
                                                -—*— Bootstrap-T
                                                •—c— Apx-Gamma
                                                —o— Adj-Gamma
   87-
      0
20
40
60
80
100
                              Sample Stw
 Figure 14. Graphs of coverage probabilities by 95% UCLs of mean
           ofG(k = 10.0, "=50).
                               12

-------
Example 1 (Continued)
The data set of size 20 and the associated MLEs of
parameters k and • «are given in Example 1.  For
n=20and oc =0.05, the adjusted probability level,
• •= 0.038 (Table 1),  and the adjusted df, Ink * =
 u * = 300.507. The approximate 95% UCL of the
mean obtained using equation (19) is given by
UCL = 130.447, and the adjusted 95% UCL of
mean obtained using equation (20) is given by
UCL = 131.901. As noted above, this data set
passes both normality as well as  lognormality
tests. The associated Student's t-statistic based
and the H-statistic based UCLs are 127.288 and
134.73, respectively. Forthis mildly skewed data
set, one can use any of these four UCLs.


3. Other UCL Computation Methods

    Several authors  (Johnson,  1978, Kleijnen,
Kloppenburg, and Meeuwsen, 1986, Chen, 1995,
Sutton,   1993)  have   developed   inference
procedures  for  estimating  the   means  of
asymmetrical distributions. Also, several bootstrap
procedures (Efron, 1982, Hall, 1988 and 1992,
Manly, 1997) have  been recommended for the
computation of confidence intervals for means of
skewed distributions. These are summarized below
and are also included in the simulation experiments
described in Section 4. Some examples have been
included to illustrate these procedures.

3.1  UCL Based Upon Student's t-Statistic

    A (1 • •) 100%  one-sided upper  confidence
limit for the mean based upon Student's t-statistic
is given by the following equation:
                                        (21)
where t.. „. l  is the upper  • -th percentile of the
Student's  t distribution with n  • 4 degrees of
freedom,  and the sample variance is given by:
                   1
This UCL should be used when either the data
follow a normal distribution, or when the data
distribution is  only  mildly skewed and sample
size n is large.  For highly skewed data sets, the
UCL based upon this method fails to provide the
specified (!-•) 100% coverage for the population
mean, • •

3.2  UCL Based Upon Modified Student's t-
     Statistic for Asymmetric Distributions
    Johnson (1978) and Sutton (1993) proposed
the use  of a modified t-statistic for testing the
mean of a positively skewed distribution.  An
adjusted  (I--) 100% UCL  (Singh, Singh,  and
Engelhardt,  1999) of the mean, • • based upon
modified t-statistic is given as follows:
        = x
(22)
Where,   /}3  an  unbiased  moment  estimate
(Kleijnen, Kloppenburg, and Meeuwsen, 1986) of
the third central moment, //3 , is given as follows:
                                 2)).     (23)
The  simulation study conducted in Section 4
suggests that the UCL based upon the modified-t
statistic also  fails  to provide the  specified
coverage (95% here) for skewed data sets from
gamma distributions.

3.3   UCL  of the Mean Based  Upon the
      Adjusted  Central  Limit Theorem  for
      Skewed Distributions
   Given a random sample, jcls x2, ... , xn of size n
from a population with finite  variance, *\ and
mean, • -the Central Limit Theorem (CLT) states
that the asymptotic distribution (as n approaches
infinity) of the sample mean,  xn ,  is normally
distributed with mean  • -and variance *V«.  An
often cited rule of thumb for a minimum sample
size satisfying the CLT is n • *30. However, this
is not adequate if the population is highly skewed
(Singh,  Singh,  and  Engelhardt,  1999).   A
refinement of the CLT approach which makes an
adjustment for skewness is discussed by Chen
(1995).  Specifically, the "adjusted CLT" UCL is
given by:

UCL = x + [z. +!3(1+ 2z») / (6{n)]sx I in, (24)

where k3 , the coefficient of skewness, is given by
k3 = fi3 / s* . The simulation study conducted in
Section 5 suggests that even for larger samples,
the adjustment made in the CLT-UCL method is
not effective  enough  to provide the specified
                                              13

-------
(95%)  coverage  for  skewed  data  sets.   As
skewness decreases, the coverage provided by the
adjusted CLT-UCL approaches 95% for larger
sample sizes, as can be seen in Figures 12-14.

3.4   UCL of the Mean  of a  Log normal
      Distribution  Based   Upon   Land's
      Method
    In practice, a skewed data set can be modeled
by  both  lognormal  and  gamma distribution.
However,   due  to  computational  ease,   the
lognormal distribution is typically used to model
such skewed data sets. A (1 • •) 100% UCL for the
mean, • ,• of a lognormal  distribution based upon
Land's H-statistic  (1971) is given as follows:
  UCL = exp(y + 0.5s$ + s,//, a//Br!) .
                                 (25)
where y and S  are the sample mean and variance
of the  log-transformed data.  Tables of values
denoted by Hl....can be found in Gilbert (1987).
From the  simulation experiments discussed  in
Section 4,  it is observed that H-statistic based
UCL grossly  overestimates the  95% UCL and
consequently, coverage provided by a H-UCL is
always larger than the specified coverage of 95%.
In  Section 4,  examples to  illustrate  this
unreasonable behavior of the H-statistic based
UCL are included.   The practical merit of a H-
UCL is doubtful as it results in unacceptably high
UCL values. This is especially true for samples of
small size (e.g., <25) with values of sy exceeding
1.5-2.0. This is illustrated in examples 2-4.

3.5  UCL of the  Mean Based  Upon the
     Chebyshev Inequality
    Chebyshev inequality can be used to obtain a
reasonably conservative but stable estimate of the
UCL of the mean.  The two-sided  Chebyshev
Theorem states that given a random variable X
with finite mean and standard deviation, • and • ,•
we have:
                              - I//.
                                 (26)
Here, j is a positive real number. This result can
be applied with the sample mean, J, to obtain a
conservative  UCL  for  the  population  mean.
Specifically, a (!-•) 100% UCL of the mean, • • is
given by:
UCL = x
                         "1 ..... 1)0- / fn.
(27)
Of course, this would require the user to know the
value of • .• The obvious modification would be to
replace • -with the sample standard deviation, sx,
but this is estimated from data, and therefore, the
result is no longer guaranteed to be conservative.
In general,  if • -is  an  unknown  mean, fi is  an
estimate, and  a (//) is an estimate of the standard
error of fi, then the quantity UCL = fi +4.359

-------
1999). Some of those methods are described as
follows.

    Let Jt1; x2,..., xn be a random sample of size «
from a population with an unknown parameter • •
(e.g.,  ••= •) and let 9 be an estimate of • -which
is a function of all « observations.  For example,
the parameter  • • could be  the  mean,  and  a
reasonable choice for the estimate 9 might be the
sample mean x .  In the bootstrap procedures,
repeated  samples  of size  « are  drawn  with
replacement from the given set of observations.
The process is repeated a large number of times
(e.g.,  1000), and each time an estimate,(9 , of • •
(the mean, here) is computed. The estimates thus
obtained are used Jo compute an estimate of the
standard error of 6. There exists in the literature
of  statistics  an extensive   array  of different
bootstrap methods for constructing confidence
intervals. In this article three of those methods are
considered: 1) the standard bootstrap method, and
2) bootstrap -tmethod (Efron, 1982, Hall, 1988),
and 3) Hall's bootstrap method (Hall,  1992,
Manly, 1997).

3.6   UCL of the  Mean  Based Upon the
      Standard Bootstrap Method
Step 1. Let (jtn, xi2,..., xm) represent the ith sample
       of  size «  with replacement  from the
       original data set (rb x2, ..., xn).  Compute
       the sample mean *;  of the  i* sample.
Step 2. Repeat  Step 1  independently TV times
       (e.g., 1000-2000), each time calculating a
       new estimate.  Denote these estimates by
       Xi,X2,x~3,.. . ,x N.   The   bootstrap
       estimate  of the population mean is the
       arithmetic   mean,   x B,   of  the  N
       estimates xt .  The bootstrap estimate of
       the standard error is given by:
                                                                                 d-e
                                                                                           (29)
                     1
                                         (28)
The general bootstrap estimate, denoted by 6B, is
the arithmetic mean of the  N estimates.  The
difference,  9B - 9 , provides an estimate of the
bias of the estimate, 9 .

The  standard bootstrap confidence interval is
derived from the following pivotal quantity, t:
                                                   A  (!••> 100% standard bootstrap UCL for ••
                                                   which assumes that equation (29) is approximately
                                                   normal, is given as follows:
                                                                UCL = & + z,
                                         (30)
                                                   It is observed that the standard bootstrap method
                                                   does not adequately adjust for skewness, and the
                                                   UCL given by equation (30) fails to provide the
                                                   specified (!-•) 100% coverage of the population
                                                   mean of skewed data distributions.

                                                   3.7   UCL of the  Mean  Based Upon  the
                                                         Bootstrap -1  Method

                                                      Another variation of the bootstrap method,
                                                   called the "bootstrap - t" by Efron (1982) is a
                                                   nonparametric procedure which uses the bootstrap
                                                   methodology to  estimate quantiles  of the t-
                                                   statistic, given by (29), directly from data (Hall,
                                                   1988).  In practice, for non-normal populations,
                                                   the required  t-quantiles may  not  be   easily
                                                   obtained, or may be impossible to derive exactly.
                                                   In this  method,  as  before  in  Steps  1  and  2
                                                   described above, x is the sample mean computed
                                                   from the original  data, and  J,  and sXii are the
                                                   sample  mean and sample  standard deviation
                                                   computed from the rth resampling of the original
                                                   data. The TV quantities tt = -(n) (J, - x ) / sxi are
                                                   computed and sorted, yielding ordered quantities
                                                   ^(i) * *^(2) * *****~*V)- Th£ estimate of the lower •*
                                                   quantile of the pivotal quantity (29) is t..B = t(.V).
                                                   For example, if N = 1000 bootstrap samples are
                                                   generated, then the 50th ordered value, t(50), would
                                                   be the  bootstrap  estimate of the lower 0.05th
                                                   quantile of the t-statistic as given by (29). Then a
                                                   (!-•) 100% UCL of mean based upon bootstrap t-
                                                   method is given as follows:
                UCL = x -
                                                                                   / 
-------
the original data. Let x be the sample^mean, sx
be the sample standard deviation, and k3 be the
sample skewness computed from the original data.
                  Wt = (x, - jf) /a*, and
      The quantities W; and Qj given as follows  are
      computed for each of the N bootstrap samples,
      where:

W, + kyW? 13 +      / 27 + 4 /(6»).
The quantities Q,{Wt) given above are arranged in
ascending order. For a specified (!-•) confidence
coefficient, compute the (• N)* ordered value, q..
of quantities Qi(W^). Finally, compute W(q.) using
the inverse function, which is given as follows:
                                        (32)
Finally, the (!-•) 100% UCL of the population
mean based upon Hall ' s bootstrap method (Manly,
1997) is given as follows:
                                        (33)
    It is observed (Section 4) that the coverage
probabilities provided by bootstrap -1 and Hall's
bootstrap methods are in close agreement. For
larger samples these two methods approximately
provide the  specified 95%  coverage to the
population mean, k« r  For smaller sample sizes,
the coverage provided by these methods is only
slightly lower than the specified level of 0.95.  It
is also noted that, for highly skewed (Figures 5-8)
data sets (with k* 0.25) of small size (e.g., n<10),
coverage probability  provided  by  these two
methods is higher than the Chebyshev UCL.


4.  Examples

    Several examples illustrating the computation
of the various 95% UCLs  of the population mean
are included in this  section.  Software, ProUCL
(EPA 2002) has been used to compute some of the
UCLs values. Gamma UCLs are computed using
the program  Chi_test  (2002).   Examples are
generated from the  gamma distribution and the
lognormal distribution, and UCLs are computed
      using all of the methods discussed in this paper.
      It is observed that for small data sets, it is not easy
      to distinguish between a gamma model and a
      lognormal distribution. It is further noted that use
      of a gamma distribution results in practical and
      reliable UCLs of the population mean.  Simulation
      results  discussed in Section  5 suggest that the
      adjusted gamma UCL approximately provides the
      specified  95% coverage to the population mean
      for data sets with shape parameter, k, exceeding
      0.1.
      4.1   Simulated  Examples
            Distribution
                                                                                from  Gamma
      Example 2
      A data set of size 15 is generated from a gamma,
      G(0.2, 100), distribution with the true population
      mean = 20, and skewness =4.472.  The data are:
      0.7269,   0.00025,   0.0000002548,  0.9510,
      0.000457,  32.5884, 0.02950,  1.6843, 3.3981,
      170.4109,  59.8188,  0.00042,  0.8227,  0.00726,
      2.1037. The data set consists of very small values
      as well as some large values. These types of data
      sets often occur in environmental applications.
      The sample mean  is 18.17.  Using the Shapiro-
      Wilk's test,  it is  concluded that the data also
      follow a lognormal model. The standard deviation
      (sd) of log-transformed data is quite large, 5.618;
      therefore, the H-statistic based UCL  of mean
      becomes unpractically large. The bias-corrected
      MLEs of k  and "are  0.16527 and  109.939,
      respectively. The  adjusted  (using bias-corrected
      estimate of k) df, u*=4.958.  For • '=0.05, and
      n=15, the critical probability level, • «to be used is
      0.0324 (from Table 1). The UCLs obtained using
      the  various  methods  are  summarized in the
      following table.
                                              16

-------
             UCL Computation Method
            95% UCL of Mean
             Approximate gamma UCL (equation (19))                  79.968
             Adjusted gamma UCL (equation (20))                     98.139
             UCL based upon t-statistic (equation (21))                  38.778
             UCL based upon modified t-statistic (equation (22))          40.356
             UCL based upon adjusted CLT (equation (24))             47.537
             UCL based upon H-statistic (equation (25))                 5.4E+13
             UCL based upon Chebyshev (equation (27))               69.171
             UCL based upon standard bootstrap (equation (30))         36.889
             UCL based upon bootstrap -1 (equation (31))               102.392
             Hall's bootstrap UCL (equation (33))	114.252
   Note that the H-UCL becomes unacceptably
large. Since the  H-UCL exceeds the maximum
observed   value  of   170.41,   using  the
recommendation made in the EPA (1992) RAGS
document, one would use that maximum value as
an estimate of the EPC term.  Simulation results
summarized in the next section (Figures  6-7)
suggest that for n=15 and an estimate of k = 0.165,
the adjusted UCL based upon a gamma model
provides the specified 95% coverage  to the
population mean. Therefore, for this data set, the
use of  the  adjusted  gamma UCL of  98.139
(equation 20) is an appropriate choice  for an
estimate  of the  EPC term.   The maximum
observed value represents an overestimate of the
EPC term.
Example 3
A data set of size 15 is generated from a gamma
distribution with: &=0.5; and • «=100 with mean,
• «= k« «= 50, and skewness = 2.828. The data are:
343.31, 102.44,0.33,1.42, 13.17,439.59, 130.66,
158.0, 70.65, 25.05, 144.84, 63.65, 62.50, 11.58,
1.097. Using Shapiro-Wilk's test, it is concluded
that these data cannot rej ect the hypothesis that the
data  also follow a lognormal distribution with
sample mean =  104.553.   The bias-corrected
estimates of k and • «are 0.46166, and 226.473,
respectively. The adjusteddf, u* = 2w&*,forthe
Chi-square distribution =13.85. As before, for • •
=0.05, and n=15, the critical probability level, • «=
0.0324.  The 95% UCLs of mean obtained using
the various methods described above are given
below.
            UCL Computation Method
             95% UCL of Mean
            Approximate gamma UCL (equation (19))
            Adjusted gamma UCL (equation (20))
            UCL based upon t-statistic (equation (21))
            UCL based upon modified t-statistic (equation (22))
            UCL based upon adjusted CLT (equation (24))
            UCL based upon H-statistic (equation (25))
            UCL based upon Chebyshev (equation (27))
            UCL based upon standard bootstrap (equation (30))
            UCL based upon bootstrap -1 (equation (31))
            Hall's bootstrap UCL (equation (33))	
                 223.879
                 247.257
                 163.413
                 165.92
                 175.596
                5687.383
                 250.22
                 158.798
                 223.665
                 461.795
    Again note that the H-UCL is 5687.38, which
is much higher than the UCLs obtained using any
of the other methods. Simulation results suggest
that, for n=15 and an MLE of k to be 0.46166,
both approximate as well as the adjusted UCLs
based upon a gamma model provide the specified
95% coverage to the population mean (Figure 9).
Also, note that the Chebyshev UCL is very close
to the adjusted gamma UCL. Any of these three
methods may be used to compute the UCL of the
population mean.
                                             17

-------
Example 4
A random sample of size n=10 is generated from
a gamma (1,100) distribution with mean 100 and
skewness=2.  The  data are:  3.0018, 31.0899,
9.0257, 271.3804, 155.8221, 157.8577, 73.3756,
95.0452,  1.4292, 65.7240. Also, at 0.05 level of
significance,  these  data   cannot  reject  the
hypothesis that the  data follow  a lognormal
distribution. They  also pass the Shapiro-Wilk's
test for normality.  The sample mean is 86.375.
The bias corrected MLEs of k and • «are 0.55121
and 156.7006, respectively, and the associated df
= 11.0242. For n=10, and • «=0.05, the critical
probability level,  • • to be used  (to achieve  a
confidence coefficient of 0.95) is =0.0267. The
UCLs obtained using the various methods are
given as follows.
             UCL Computation Method
            95% UCL of Mean
             Approximate gamma UCL (equation (19))                 207.435
             Adjusted gamma UCL (equation (20))                     244.531
             UCL based upon t-statistic (equation (21))                 136.776
             UCL based upon modified t-statistic (equation (22))         138.368
             UCL based upon adjusted CLT (equation (24))             141.804
             UCL based upon H-statistic (equation (25))               3260.882
             UCL based upon Chebyshev (equation (27))               206.222
             UCL based upon standard bootstrap (equation (30))        130.526
             UCL based upon bootstrap -1 (equation (31))              164.356
             Hall's bootstrap UCL (equation  (33))	148.938
    Once again, note that the H-UCL is 3260.882,
which is much higher than the UCLs obtained
using any of the other methods. Simulation results
summarized in the next section suggest that for,
for n=10  and an estimate of  k  to  be  0.5512
(Figures 9-10), both the approximate and adjusted
UCLs based  upon the gamma model at least
provide  the  specified  95% coverage  to the
population mean.  95% Chebyshev UCL also
provides the  specified  coverage  to  population
mean.   For this  combination of  skewness and
sample size, any  of these three methods may be
used to compute a 95% UCL of population mean.

Example 5
A mildly skewed data set of size 10 was generated
from a gamma distribution G(4,100) with mean
400 and skewness =1.  The data  are 734.9055,
352.2732,  402.2431,  410.0733,  507.1526,
1010.3391,  199.9971,  296.4427,  1241.1702,
392.7091. The sample mean = 554.730.  Based
upon the  Shapiro-Wilk's  test, at 0.05 level of
significance, the data do not reject the hypotheses
of normality as well as of lognormality. The sd of
the  log-transformed data is  0.561.   The bias
corrected MLEs of k and • • are  2.55276 and
217.307, respectively. The associated df=51.055.
For n=10, and ••=0.05,  the  critical probability
level, • • (to achieve a confidence coefficient of
0.095), to be used is =0.0267. The UCLs obtained
using the various methods are given below.

    For this data set, the difference between the H-
UCL and other UCLs is small. Simulation results
suggest  that as the sample size increases, these
differences in the UCLs will decrease. From these
results (Figures  11-12),  it is noted that  for a
sample of size 10 and an estimate of k=2.55, both
the  approximate Gamma UCL and  adjusted
gamma  UCL at least provide the specified 95%
coverage to the population mean. Any of the two
methods can be used to compute a 95% UCL of
the mean. The Chebyshev inequality results in an
overestimate of the UCL.
                                              18

-------
             UCL Computation Method
            95% UCL of Mean
             Approximate gamma UCL (equation (19))                 794.531
             Adjusted gamma UCL (equation (20))                     847.442
             UCL based upon t-statistic (equation (21))                 749.638
             UCL based upon modified t-statistic (equation (22))         756.657
             UCL based upon adjusted CLT (equation (24))             775.617
             UCL based upon H-statistic (equation (25))                862.649
             UCL based upon Chebyshev (equation (27))              1018.95
             UCL based upon standard bootstrap (equation (30))        715.892
             UCL based upon bootstrap -1 (equation (31))              902.716
             Hall's bootstrap UCL (equation  (33))	889.773
4.2   Simulated Examples from  Log normal
      Distributions
Next  we consider a couple of small data  sets
generated  from lognormal distributions.   It is
observed that those data sets also follow gamma
models.

Example 6
A sample  of size n =  15 is generated from the
lognormal distribution with parameters • «= 5, • •=
2; the true  mean of this distribution is 1096.6, the
coefficient of variation (CV) is 7.32, and skewness
is 414.4. The generated data are: 47.42,2761.51,
2904.26, 6928.33, 14.73, 7.67, 73.36, 2843.79,
151.71, 103.52, 14.8, 37.32, 24.74,  658.04,
110.42. A goodness-of-fit test showed the data
distribution to be non-normal (P < 0.01) and also
that the data passes the test of lognormality (P >
0.15).  The software packages ExpertFit (2001)
and  GamGood (2002) were  used  to  test the
goodness-of-fit of the gamma distribution. The
observed  value of the Anderson-Darling  test
statistic is 1.094,  and the  approximate critical
value for test size 0.05 is 2.492, and hence an
approximate gamma distribution can also be used
to model the probability distribution of this data
set. The Chi-square goodness-of-fit test with four
equal intervals led to the same conclusion. The
bias-adjusted estimates of shape, k, and scale, •;
of the gamma distribution are 0.32 land 3466.301,
respectively. The 95% UCLs computed from the
various methods are given below.

Notice that the H-UCL is more than 5 times higher
than the maximum concentration in the sample,
and more than 10 times higher than all the other
UCLs.  All UCLs are larger than the true
population mean (1096.6) for this data set. From
Figures 8 and 9, it is observed that for an estimate
of k=0.321 and n=15, the adjusted gamma UCL =
3276.40 provides the specified 95% coverage to
population mean.

  Student's t                   2005.973
  Adjusted CLT                2251.311
  Modified t                    2053.46
  CLT                        1946.872
  Standard Bootstrap           1917.433
  Bootstrap t                   2541.425
  Hall's Bootstrap              2305.170
  Chebyshev (Mean, Std)        3324.25
  95% H-UCL                 37726.46
  Adjusted Gamma UCL	3276.40

    Continuing with this example,  suppose that
another sample is collected and it turns out to be
below the detection limit (DL) of the instrument.
Suppose further that DL =10, and following EPA
guidance documents, this value is replaced  by
DL/2 = 5. One would expect that this additional
non-detect observation would result in a reduction
of the UCL.   The UCLs calculated from this
sample of n = 16 observations are given  below:
  Student's t
  Adjusted CLT
  Modified t
  CLT
  Standard Bootstrap
  Bootstrap t
  Chebyshev
  H-UCL
  Gamma UCL
 1884
 2122.80
 1929.28
 1832.02
 1795.20
 2369.23
 3134.05
40313.2
 3013.70
                                              19

-------
The UCLs computed from all but the H-statistic
based formula decreased with the addition of one
below-detect observation; the  H-statistic based
formula, however, resulted in a much higher UCL.
This is unarguably an unacceptable  value.

Example 7
Finally, we consider a data set of size 20 from a
highly skewed lognormal model with parameters
••=5 and  ««=3.   For this high value of •; the
population  mean assuming  a  lognormal model
becomes quite high= 13359.73.  In practice, use of
such  a  model  will unjustifiably inflate  the
population mean; therefore, its use to estimate the
EPC  term  is  not  desirable.    Note that  the
population median is only 148.413. The generated
data are:   4453.2441,  337.7879, 2972.0916,
10.4690,   827.7806,  63.2507,   13969.2646,
11.1967,  5.2651, 65.7771,  921.7736,   7.6539,
756.6956,223.3185,140.8639,466.1513,3.1751,
418.6896, 1.1281,  22.4442.   The  observations
range from 1.1281 to  13969.2646  with sample
mean = 1283.9.  Note that the population mean is
orders of magnitude higher than the sample mean.
It  is  also observed  that  at  0.05  level  of
significance, this  data  set cannot reject  the
hypothesis of a gamma model. The bias corrected
MLEs of shape and scale for a  gamma model are
0.28 and 4564.46, respectively.  The Kolmogorov-
Smirnov (K-S) test statistic for gamma distribution
is D= 0.176 which is less  than the 5% critical
value  of about  0.21 (with  an estimated shape
parameter of 0.28,  Table 7,  Schneider, 1978)
leading to the  conclusion that the data cannot
reject the hypothesis of a gamma distribution of
the  data  set.   The estimated  population mean
assuming a gamma model is kQ = 0.28*4564.46
= 1278.049 which is close to the sample arithmetic
mean of 1283.9.

    The adjusted 95% Gamma UCL = 3278.41;
the  95% UCLs based  upon  Student's- t  and
modified-  t  are   2517.889   and  2616.198,
respectively. The95%Bootstrap-tUCL=5823.31,
95% Chebyshev UCL=4394.61, and 95% H-
UCL=87052.   As  mentioned  earlier, use  of a
lognormal  model  unjustifiably  accommodates
large  and  unpractical  values  of the mean
concentration and its UCLs.  From Figures 8 and
9, it is noted that the adjusted  gamma UCL will
provide  the specified  95%  coverage  to  the
population mean. Thus, for this data set, a gamma
UCL = 3278.41 provides a reasonable estimate of
the EPC term.
5. Comparison  of   the  Various   UCL
   Computation Methods

   Using Monte Carlo simulation experiments for
data sets generated from gamma distributions, the
performances of the various UCL computation
methods  have been compared in terms of the
coverage probabilities achieved by the respective
UCLs.   Similar  comparisons (Singh, Singh,
Engelhardt,  and Nocerino,  2001)  have  been
performed  for  the various UCL  computation
methods using data sets generated from lognormal
distributions.   The methods considered in the
present simulation experiments include: Student's
t-statistic, modified Student's t-statistic, adjusted
CLT,  Chebyshev method, H-UCL, approximate-
gamma UCL, adjusted gamma UCL, and the three
bootstrap methods: standard bootstrap, bootstrap-t
method  (Efron, 1982; Hall, 1988), and Hall's
(1992) bootstrap method. For each of the  three
bootstrap methods, 1000 resamples have  been
used.    The  EPA  (1992)  RAGS document
recommends the use of the maximum observed
concentration as an estimate  of the EPC  term
when  the H-UCL exceeds the maximum observed
value.  Therefore, the maximum observed value
(called Max-test in this paper)  has also  been
included in the simulation experiments.  Thus, 11
EPC computation methods have been considered
in these simulation experiments.

   The simulation experiments are carried out for
various values of the sample size, n = 5, 10, 15,
20, 30, 50, 70, and 100.  Random deviates of
sample size n were generated from a gamma,
G(k,') population. Various values of k and •  -have
been considered. The considered values of k are
0.1, 0.15, 0.2, 0.25, 0.5, 1.0,  2.0, 4.0, 6.0, and
10.0.  These values of k cover a wide  range of
values of  skewness,  2/Vk . The  simulation
experiments were conducted for three values, 1.0,
50.0,  and 100.0 of the scale parameter, •: As
noted  earlier, gamma distribution based UCLs as
given  by (19) and (20)  only  depend upon the
estimate  of the shape  parameter, k  and are
independent of the scale parameter, • • and its
estimate.   Consequently  as  expected,   it is
observed that coverage probabilities for the mean
                                             20

-------
associated with the gamma UCLs do not depend
upon the values of the scale parameter, • • and the
differences in the coverage probabilities for these
three values of* -are negligible.  Therefore, in this
article, the coverage probabilities are graphed for
• -=50.0 only. A typical simulation experiment can
be described in the  following four steps:

Step 1.  Generate   a   random   sample   of the
        specified size, n, from a gamma, G(£,50)
        distribution. The algorithm as outlined in
        Whittaker  (1974)  has been  used  to
        generate the gamma deviates.

Step 2.  For  each generated  sample, compute a
        95% UCL of the mean  using the various
        methods described in Sections 2.0, 3.0,
        and  in ProUCL software package (EPA
        2002).

Step 3.  Repeat steps 1 and 2, 15,000 times.

Step 4.   For each UCL computation method, count
        the number of times the population mean,
        k« • falls below a respective UCL. The
        percentages of these numbers provide the
        coverage probabilities  achieved by the
        various UCL  computation methods.

    Simulation results suggest that the  UCLs
based upon  Student's-t  and standard bootstrap
methods fail  to  provide  the specified  95%
coverage  of  the  population mean  of  the
distributions  considered here.   Also,  as  noted
earlier, the H-statistic overestimates the 95% UCL
as it provides almost 100%  coverage of the
population mean.   Use of the  H-statistic  yields
impractically large UCL values. This is especially
true for highly skewed data sets (k*  4). It is also
noted that the coverage probabilities provided by
bootstrap-1 method and Hall's  bootstrap method
are quite similar.    Therefore,  the  coverage
percentages for these four methods:  Student's -t,
standard bootstrap, Hall's bootstrap, and H-UCL
are not  included in the graphs  presented in this
section. The coverage percentages as obtained in
Step 4  for the remaining seven (7)  methods are
given in Figures 5-14.  Figures  5-14 have the
coverage percentages whenk=0.1,0.15,0.2,0.25,
0.5, 1.0, 2.0, 4.0, 6.0, and 10.0, respectively. The
following observations have been made from these
graphs.
1.   From Figures 5-14, it is observed that UCLs
    based upon the adjusted CLT and modified-t
    methods  fail to provide the specified  95%
    coverage of the population mean. For mildly
    skewed data sets (e.g.,  k* 6),  the  coverage
    provided  by   the  adjusted  CLT-UCL
    approaches 95% as the sample size becomes
    larger than 50.

2.   It is observed that for highly skewed data with
    an  estimate  of k<0.25,  even a  Max-test
    (maximum observation) fails to provide the
    specified  95% coverage of the population
    mean.  This is especially true when the sample
    size is  less than 10 (Figures 5-7).  For smaller
    samples (e.g.,  • *>), the Max-test  fails  to
    provide the specified 95% coverage of the
    mean for values of k as large as 6.  Thus, for
    samples of size 5 or less, the default option of
    using the maximum observation as an estimate
    of the EPC term may not be appropriate. It is
    more appropriate to use an adjusted gamma-
    distribution-based UCL of the mean. It is also
    observed that for samples of size 15 or larger,
    the Max-test always provides  at least  95%
    coverage to population mean. This means for
    samples of size • 4 5, the Max-test would re suit
    in an overestimate of the 95% UCL of the
    mean.

3.   From Figures 5-9,  it is also observed that for
    skewed  data sets with  k* 0.5,  the  95%
    Chebyshev UCL fails to provide the specified
    95% coverage of the population mean. This is
    especially true  when the sample size is less
    20. When k>0.5, the 95% Chebyshev UCL
    provides the specified 95% coverage to the
    population mean  even for small  samples
    (Figures 10-14). Furthermore,  it is noted that
    as  the sample size  increases,  the  95%
    Chebyshev  UCL  provides at  least  95%
    coverage to population mean  resulting in a
    conservative but stable estimate of the  95%
    UCL of the mean.

4.   It is observed that for highly skewed data sets
    (Figures 5-6) with k<0.2, and  samples  of
    small size  (<10), the  adjusted  gamma UCL
    provides the maximum coverage (and close to
    0.95)  of the  population  mean,  and  this
    coverage approaches the specified coverage of
    0.95 as k  approaches 0.2  (Figure 7).  For
    k* 0.2, the adjusted gamma UCL provides at
                                              21

-------
    least 95% coverage of the population mean
    (Figures 8-14) for samples of all sizes.

5.  From Figures 5-8, it  is observed that  for
    values of k<0.5, and samples of small  size
    (n<30), the approximate gamma UCL fails to
    provide  the  specified 95%  coverage  of
    population mean.  Also, from Figures 9-14, it
    is observed  that for values  of k« 0.5,  an
    approximate   gamma  UCL  provides   the
    specified 95% coverage of population mean
    for samples of all sizes.

6.  From Figures 5-14, it is observed that the 95%
    UCL  based  upon  the  bootstrap-t  method
    consistently provides about 90% coverage of
    population mean. This coverage approaches
    95% as the sample size increases.


6. Recommendations

    Skewed data sets can be modeled by more
than one distribution.  Due to the computational
ease of working  with a lognormal model, users
often choose the  lognormal distribution for such
data sets. It is observed that for small data sets, it
is not easy to distinguish between a gamma model
and a lognormal distribution. However, there are
some fundamental problems associated with  the
use of a lognormal distribution.  The use of a
lognormal model unjustifiably elevates the mean
and  the  associated UCL,  therefore, its use in
environmental  applications should  be avoided.
Since the H-UCL becomes unrealistically large,
the Max-test is sometimes used as an estimate of
the EPC term. It is shown that for highly skewed
(k<0.25) data sets of small size (n<10), the Max-
test does not provide the specified 95% coverage
of the means of gamma populations and for larger
samples, Max-test results in overestimates of the
EPC term. Furthermore, the EPC term represents
an average concentration in an area, therefore, it
should be estimated by a UCL of the mean. In this
paper, we have introduced the gamma distribution
which is well suited to model highly  skewed data
sets  originating  from various   environmental
applications.  It is further noted that use of the
gamma distribution results in practical and reliable
UCLs of the population mean. Simulation results
discussed in Section 5 suggest that  the adjusted
gamma UCL approximately provides  the specified
95% coverage of the population mean for data sets
with shape parameter, k, exceeding 0.1.  It is,
therefore, recommended that for a given data set,
the user should use a goodness-of-fit test to see if
the data follow a gamma distribution. If the data
do follow a gamma distribution, then the user
should compute a UCL of the mean based upon a
gamma model. It is shown that both  approximate
and adjusted gamma UCLs behave in a stable
manner.   For estimated values of the shape
parameter, k« 0.5, one can use the  approximate
gamma UCL as an estimate of the EPC term, and
for values of  k<0.5, one can use the adjusted
gamma UCL.  Graphs presented  in Figures 5-14
cover a wide range  of the skewness of gamma
distributions.   These  graphs  can  be  used to
determine which method should be used for a
given combination of skewness and  sample size.
For data sets which cannot be modeled by an
approximate gamma distribution, one can use a
UCL based upon the Chebyshev inequality or the
bootstrap t-procedure.   These  two procedures
generally result in conservative,  but reasonable,
estimates of the EPC term.
                                              22

-------
                                       References
1.   Bowman, K. O., and Shenton, L.R. (1988),
    Properties  of Estimators  for  the  Gamma
    Distribution, Volume 89. Marcel Dekker, Inc.
    New York.

2.   Chen, L. (1995), Testing the Mean of Skewed
    Distributions. Journal of American Statistical
    Association, 90, 767-772.

3.   Chijest Software (2002),  A Software that
    Computes MLEs of Gamma Parameters and
    an Upper Confidence Limit of the Mean of a
    Gamma  Distribution.   Lockheed  Martin
    Environmental Sciences, Las Vegas, Nevada.

4.   Choi, S. C., and Wette, R. (1969), Maximum
    Likelihood Estimation of the Parameters of
    the  Gamma Distribution  and Their  Bias.
    Technometrics, Vol II, pp 383-690.
5.   D'Agostino, R.B. and Stephens, M.A. (1986).
    Goodness-of-Fit-Techniques. Marcel Dekker,
    Inc., NY.
6.   Efron,  B.  (1982),  The  Jackknife,  the
    Bootstrap,  and Other  Resampling  Plans,
    Philadelphia:   SIAM   Monograph   #38.
    (Philadelphia:   Society  for  Industrial and
    Applied Mathematics).
7.   EPA (1989),  Methods for Evaluating the
    Attainment of Cleanup Standards, Vol 1, Soils
    and Solid Media, EPA 230/2-89/042.

8.   EPA (1992), Supplemental Guidance to Rags:
    Calculating the Concentration Term,  EPA
    9285.7-081, May 1992.
9.   EPA (1996),  Soil  Screening Guidance:
    Technical  Background    Document,
    EPA/540/R-95/128,  9355.4-17A, May 1996.

10. EPA (2002),  ProUCL  Version   2.1,  A
    Statistical   Software,   National  Exposure
    Research Lab, EPA, Las Vegas, Nevada, April
    2002.

11. ExpertFit Software (2001), Averill M. Law &
    Associates Inc, Tucson, Arizona.
12. Faires,  J. D.,  and  Burden,  R. L. (1993).
    Numerical Methods. PWS-Kent Publishing
    Company, Boston, USA.
13. GamGood Software (2002), A Goodness-of-
    fit  Testing Software  to  Test for a Gamma
    Model.  Lockheed  Martin  Environmental
    Sciences, Las Vegas, Nevada.

14. Gilbert, R.O. (1987),  Statistical Methods for
    Environmental  Pollution Monitoring, New
    York, Van Nostrand Reinhold.
15. Grice,J.V., and Bain, L.J.( 1980), Inferences
    Concerning  the  Mean   of  the  Gamma
    Distribution.   Journal  of the  American
    Statistical Association. Vol 75, Number 372,
    pp  929-933.

16. Hall, P.  (1988), Theoretical comparison of
    bootstrap confidence  intervals; Annals  of
    Statistics, 16, pp 927-953.

17. Hall, P. (1992), On the Removal of Skewness
    by   Transformation.  Journal  of  Royal
    Statistical Society, B 54, pp 221-228.

18. Hardin,  J.W.,  and Gilbert,  R.O. (1993),
    Comparing Statistical Tests for Detecting Soil
    Contamination  Greater  Than Background.
    Pacific  Northwest  Laboratory,  Battelle,
    Technical Report # DE 94-005498.
19. Johnson, N.J. (1978), Modified t-tests and
    Confidence  Intervals   for  Asymmetrical
    Populations. Journal of American Statistical
    Association, 73, pp 536-544.

20. Johnson, N.L., Kotz, S., and Balakrishnan, N.
    (1994), Continuous Univariate Distributions,
    Volume  1.  Second Edition. John Wiley, NY.
21. Kleijnen, J.P.C.,  Kloppenburg,  G.L.J., and
    Meeuwsen, F.L. (1986), Testing Mean of an
    Asymmetrical   Population:   Johnson's
    Modified  t-test  Revisited.  Commun.  in
    Statist.-Simula., 15(3), pp 715-731.

22. Land, C. E. (1971), Confidence Intervals for
    Linear Functions  of the Normal Mean and
    Variance, Annals  of Mathematical Statistics
    42:  pp 1187-1205.
23. Law,  A.M.,  and Kelton,  W.D.  (2000),
    Simulation Modeling  and Analysis.  Third
    Edition. McGraw Hill, USA.
                                             23

-------
24. Manly,  B.F.J.  (1997),    Randomization,
    Bootstrap,  and  Monte  Carlo  Methods in
    Biology. Second Edition.  Chapman  Hall,
    London.
25. Schneider, B.E. andClickner, R.P. (1976). On
    the distribution of the Kolmogorov-Smirnov
    Statistic for the Gamma Distribution with
    Unknown Parameters. Mimeo Series Number
    36, Department   of Statistics,  School of
    Business Administration, Temple University,
    Philadelphia, PA.
26. Schneider,  Bruce E. (1978),  Kolmogorov-
    Smirnov Test Statistics for  the  Gamma
    Distribution  with  Unknown  Parameters.
    Dissertation, Department of Statistics, Temple
    University, Philadelphia, PA.
27. Schulz,  T.  W.,  and  Griffin,  S.  (1999),
    Estimating  Risk Assessment Exposure Point
    Concentrations when Data are Not Normal or
    Lognormal. Risk Analysis, Vol. 19, No. 4, pp
    1999.

28. Singh, A.K., Singh, A., and Engelhardt, M.
    (1997),  The  lognormal    Distribution in
    Environmental Applications.   Technology
    Support Center  Issue  Paper,  182CMB97.
    EPA/600/R-97/006.
29. Singh, A.K., Singh, A., and Engelhardt, M.
    (1999),  Some Practical Aspects  of Sample
    Size and Power Computations for Estimating
   the Mean of Positively Skewed Distributions
   in Environmental Applications. Technology
   Support  Center Issue  Paper, EPA/600/S-
   99/006.
30. Singh, A., Singh, A. K., Engelhardt, M. E.,
   andNocerino, J. (2001), On the Computation
   of the Upper Confidence Limit of the Mean of
   Contaminant Data Distributions. Under EPA
   Review.

31. Stephens, M. A. (1970), Use of Kolmogorov-
   Smirnov, Cramer-von  Mises and Related
   Statistics Without Extensive Tables. Journal
   of Royal  Statistical Society, B 32, pp 115-122.
32. Sutton,  C.D. (1993),  Computer-Intensive
   Methods for Tests  About the Mean of an
   Asymmetrical  Distribution.   Journal  of
   American Statistical Association, 88, No. 423,
   pp 802-810.

33. Thorn, H.C.S. (1968), Direct and  Inverse
   Tables of the Gamma Distribution,  Silver
   Spring, MD; Environmental Data Service.
34. Whittaker, J. (1974), Generating Gamma and
   Beta Random Variables  with Non-integral
   Shape Parameters. Applied Statistics, 23, No.
   2, pp 210-214.

35. Wong, A. (1993), ANote on Inference forthe
   Mean Parameter of the Gamma Distribution.
   Statistics Probability   Letters,   Vol  17,
   pp 61-66.
                                             24

-------
                                         Notice
The U.S. Environmental Protection Agency (EPA)       approved for publication as an EPA document.
through its Office of Research and Development       Mention of trade names or commercial products
(ORD) funded and prepared this Issue Paper.  It       does   not  constitute  endorsement  or
has been subjected to the Agency's peer and       recommendation by EPA for use.
administrative review by the  EPA and has been
                                            25

-------