Journal Article Published May 2001
Clearance Number 01-062
Robust Estimation of Mean and
Variance Using Environmental
Data Sets with Below Detection
Limit Observations
by
Anita Singh
Lockheed-Martin Environmental Systems & Technologies Company
980 Kelly Johnson Drive
Las Vegas, NV 89119
John Nocerino
United States Environmental Protection Agency
National Exposure Research Laboratory
Post Office Box 93478
Las Vegas, NV 89193-3478
Notice
The U.S. Environmental Protection Agency (EPA), through its Office of Research and
Development (ORD), funded and collaborated in the research described here. It has been subjected to the
Agency's peer review and has been approved as an EPA publication. The U.S. Government has a non-
exclusive, royalty-free license in and to any copyright covering this article.
Summary
Scientists, especially environmental scientists, often encounter trace-level concentrations that are
typically reported as less than a certain limit of detection, L. Type I, left-censored data arise when certain
low values lying below L are ignored or unknown because they cannot be measured accurately. In many
environmental quality assurance and quality control (QA/QC) and groundwater monitoring applications
of the United States Environmental Protection Agency (U.S. EPA), values smaller than L are not required
to be reported. However, practitioners still need to obtain reliable estimates of the population mean, μ,
and the standard deviation (sd), σ. The problem becomes complex when a small number of high
concentrations are observed together with a substantial number of concentrations below the detection limit. The
high outlying values contaminate the underlying censored sample, leading to distorted estimates of μ and
σ. The U.S. EPA, through the National Exposure Research Laboratory-Las Vegas (NERL-LV), under
the Office of Research and Development (ORD), has research interests in developing statistically rigorous
robust estimation procedures for contaminated left-censored data sets. Robust estimation procedures
based upon a proposed (PROP) influence function are shown to result in reliable estimates of the population
mean and sd from contaminated left-censored samples. It is also observed that the robust
estimates thus obtained, with or without the outliers, are in close agreement with the corresponding
classical estimates computed after the removal of outliers. Several classical and robust methods for the estimation
of μ and σ using left-censored (truncated) data sets with potential outliers have been reviewed and
evaluated.
Key Words: Type I censoring, Type II censoring, left-censored (truncated) data, detection limit, robust
statistics, Monte Carlo simulation, mean square error (MSE), PROP influence function,
unbiased maximum likelihood estimation (UMLE), Cohen's maximum likelihood
estimation, Persson and Rootzen's restricted maximum likelihood estimation (RMLE),
expectation-maximization (EM) algorithm, regression methods.
Table of Contents

Notice
Summary
Section 1  Introduction
Section 2  Mathematical Formulation
    Robust Procedures
    Cohen's Method
    Expectation Maximization (EM) Algorithm
    Restricted Maximum Likelihood (RMLE) Method
    Regression Method
Section 3  More Examples
Section 4  Simulation Experiments and Results
    Fixed Detection Limit
    Computed Detection Limit
Section 5  Summary and Conclusions
References
Section 1
Introduction
The processing of the analytical results of environmental samples containing potentially hazardous
chemicals is often complicated by the fact that some of these pollutants are present at trace levels, which
cannot be measured reliably and therefore are reported as results lying numerically below a certain limit
of detection, L. This results in left-censored data sets. In many environmental monitoring applications,
values smaller than L are not even required to be reported. However, since the presence of some of these
toxic pollutants (e.g., dioxin) in the various environmental media can pose a threat to human health and
the environment even at trace level concentrations, these non-detects cannot be ignored or deleted (often
done in practice) from subsequent analyses. For site characterization purposes such as to establish mean
contamination levels at various parts of a polluted site, it is desirable to obtain reliable estimates of μ and
σ using the left-censored data sets. The problem gets complicated when some outliers are also present in
conjunction with the non-detects. Also, sometimes in environmental applications, non-detects (e.g., due
to matrix effects) exceed the observed values adding to the complexity of the estimation procedures.
Improperly obtained estimates of these parameters can result in inaccurate estimates of cleanup standards,
which in turn can lead to incorrect remediation decisions at a polluted site. In this article, emphasis is
given to obtain robust estimates of population mean and sd using left-censored data sets with potential
outliers in the right tail of a data set. In this study, it is assumed that all non-detects are smaller than the
observed values.
In general, censoring means that observations at one or both extremes (tails) are not available. In
Type I censoring, the point of censoring (e.g., the detection limit, L) is "fixed" a priori for all observations
and the number, k(>0), of the censored observations varies. In Type II censoring, the number of censored
observations, k, is fixed a priori, and the point(s) of censoring vary. For example, Type II right-censoring
(large values are not available) typically occurs in life testing and reliability applications. In a life testing
application, n items (e.g., electronic items) are subjected to a life testing experiment which terminates as
soon as (n-k) of the n data values have been observed (failed). The lifetimes of the remaining k surviving
items are unavailable, i.e., censored.
The estimation of the parameters of normal and lognormal populations from censored samples has
been studied by several researchers, including Cohen [1950, 1959], Persson and Rootzen [1977], Gleit
[1985], Schneider [1986], and Gilliom and Helsel [1986]. A myriad of estimation procedures for Type I left-
censored data exist in the literature, including simple substitution methods and several rigorous
procedures such as Cohen's maximum likelihood estimation (MLE) procedure, Persson and Rootzen's
restricted MLE (RMLE) method, and regression methods (Gilliom and Helsel [1986]; Newman, Dixon,
and Pinder [1989]). The commonly used substitution methods are: replacement of below detection limit
data by zero, or by half of the detection limit, L/2, or by the detection limit, L itself.
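As an illustration of these substitution methods (the sketch below is ours, not part of the original study), the mean and sd can be computed after replacing the non-detects by zero, L/2, and L. The data are the simulated left-censored sample of Example 2 below (13 detects, k = 2 non-detects, L = 1.0), and the printed values match Table 1.

```python
# Illustrative sketch: the three common substitution estimators for a
# left-censored sample.  Data: the simulated N(1.33, 0.2) sample of
# Example 2 (13 detects, 2 non-detects, detection limit L = 1.0).
from statistics import mean, stdev

L = 1.0    # detection limit
k = 2      # number of non-detects
detects = [1.2883, 1.1612, 1.1560, 1.3251, 1.1568, 1.5638, 1.2914,
           1.3253, 1.2884, 1.4688, 1.4581, 1.3641, 1.1342]

def substitute(detects, k, value):
    """Replace each of the k non-detects by `value`; return (mean, sd)."""
    full = [value] * k + list(detects)
    return mean(full), stdev(full)

for label, v in [("zero", 0.0), ("L/2", L / 2), ("L", L)]:
    m, s = substitute(detects, k, v)
    print(f"{label:>4}: mean = {m:.2f}, sd = {s:.2f}")
```

Note how strongly the substituted value drives both estimates: substitution by zero inflates the sd by a factor of three relative to substitution by L.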
Using Monte Carlo simulation experiments, several researchers, including Gleit [1985], Gilliom and
Helsel [1986], and Haas and Scheff [1990], concluded that the data substitution methods result in
biased estimates of the population mean. In practice, probably due to computational ease, these data
substitution methods are commonly used in many environmental applications. Depending upon the
sample size, n, and the censoring intensity, k, substitution of the censored values by L/2 is one of the
recommended methods in some U.S. EPA guidance documents, such as the Guidance for Data Quality
Assessment [1996]. None of the simulation studies conducted so far included the unbiased maximum
likelihood estimation (UMLE) method. Also, the results and conclusions of the above-mentioned studies
are not directly comparable due to reasons discussed in the following paragraphs.
Gleit [1985] performed simulation experiments for a "fixed" detection limit, L, for various censoring
intensities. Based upon Dempster, Laird, and Rubin's [1977] expectation-maximization (EM) algorithm,
Gleit used the conditional expected values of order statistics of the Gaussian distribution for censored
observations. Based on the low mean square error (MSE) criterion, he recommended the use of the EM
method which replaces all of the non-detects by the conditional expected value of the order statistics as
given by equation (11) below. Gleit's simulation experiments did not include Persson and Rootzen's
RMLE method or any of the regression methods.
Gilliom and Helsel [1986] performed simulation experiments for various distributions and several
levels of censoring intensities. Their simulation experiments used the "computed" detection limit based
on the distribution used. For example, for a normal distribution with mean 5 and sd 2, N(5, 2), the
detection limits corresponding to censoring intensities of 30% and 60% are
L = 5 + 2z_0.30 ≈ 5 - 2(0.524) = 3.95 and L = 5 + 2z_0.60 ≈ 5 + 2(0.253) = 5.51, respectively,
where z_α represents the value of the standard
normal deviate such that the area to the left of z_α is α. Thus, the limit, L, changes with the censoring
intensity. They concluded that the extrapolation regression approach on the log-transformed data results
in an estimate of the population mean with the smallest root mean square error (RMSE). Their simulation
study did not include the RMLE method and the EM algorithm.
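The "computed" detection limits quoted above can be checked directly with the standard normal quantile function; the following stdlib-only sketch is illustrative:

```python
# Quick check of the "computed" detection limits for an N(5, 2) population.
from statistics import NormalDist

z = NormalDist().inv_cdf      # z_alpha: area alpha to the left

mu, sd = 5.0, 2.0
L30 = mu + sd * z(0.30)       # 30% censoring: 5 - 2(0.524) ≈ 3.95
L60 = mu + sd * z(0.60)       # 60% censoring: 5 + 2(0.253) ≈ 5.51
print(round(L30, 2), round(L60, 2))   # → 3.95 5.51
```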
Haas and Scheff [1990] and Lechner [1991] compared the performance of classical estimation
methods for left-censored samples in terms of bias and MSE. They also used the "computed" detection
limit based on the distribution used in their simulation experiments. They concluded that the bias-
corrected RMLE procedure results in estimates as good as those of Cohen's MLE method, and that the RMLE
method possesses lower bias and MSE than the regression and substitution methods. They also suggested
that the RMLE is less sensitive to deviations from normality. Their study did not include the EM
method.
The objective of the present article is to develop robust procedures which yield reliable estimates of
population parameters from left-censored data sets in the presence of outliers, and also to compare the
performances of the various estimation procedures. The authors of this article performed Monte Carlo
simulation experiments for both the "fixed" and the "computed" detection limit cases to assess the
performances of the various classical and robust procedures in terms of bias and MSE. Several methods,
including the EM algorithm, MLE, UMLE, RMLE, and the regression method, have been considered.
The results of a couple of simulation runs are presented here to demonstrate the differences in the
performances of these methods for the two cases: 1) L stays fixed for all censoring levels, and 2) L is
computed based on the distribution used and, therefore, varies with the censoring intensity. In
environmental applications, the first case (1) of a fixed detection limit occurs quite frequently.
The occurrence of non-detects in combination with potential outliers is inevitable in data sets
originating from environmental applications. The data set resulting from such a combination of non-
detects in the left tail of the distribution, and high concentrations in the right tail of the distribution,
typically do not follow a well-known statistical distribution. The problem gets complex when multiple
detection limits (reporting limits) are present. In practice, such a data set may have been obtained from
two or more populations with significantly different mean concentrations such as the one coming from the
clean background part of the site and the other obtained from a contaminated part of the site.
Unfortunately, many times such a data set can be modeled incorrectly by a lognormal distribution (which
could pass the lognormality test). Also, a normally distributed data set with a few extreme (high)
observations can be incorrectly modeled by a lognormal distribution with the lognormal assumption
hiding the outliers [Singh, Singh, and Engelhardt (1997, 2000)]. An example is discussed next to
elaborate on this point.
Example 1. A simulated data set with 15 observations has been obtained from a mixture of two normal
populations. Ten observations (representing background) were generated from a normal
distribution with a mean of 100 and a sd of 50, and five observations (representing
contamination) were generated from a normal distribution with a mean of 1000 and a sd of
100. The mean of this mixture distribution is 400. The generated data are: 180.51, 2.33,
48.67, 187.07, 120.21, 87.96, 136.75, 24.47, 82.23, 128.38, 850.91, 1041.73, 901.92,
1027.18, and 1229.94. The data set failed the normality test based on several goodness-of-
fit tests, such as the Shapiro-Wilk W-test (W = 0.7572) and the
Kolmogorov-Smirnov (K-S = 0.35) test. However, when these tests were carried out on the
log-transformed data, the test statistics were insignificant at the α = 0.05 level of significance,
with W = 0.8957 and K-S = 0.168, suggesting that a lognormal distribution provides a
reasonable fit to the data. Based upon those tests, one might conclude that the observed
data come from a single background lognormal population, a situation which occurs
frequently in practice. It is, therefore, warranted to make sure that the data come from a
single population before one would try to use a lognormal distribution on a data set.
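The W and K-S values above come from standard goodness-of-fit software; as an illustrative, stdlib-only approximation, the following sketch computes a Lilliefors-type K-S distance (empirical cdf versus a normal cdf fitted with the sample mean and sd) for the Example 1 data before and after the log-transformation. The function name and conventions are ours, and the numbers need not match the published statistics exactly, but the qualitative conclusion (the log-transformed data appear closer to normal) is reproduced.

```python
# Illustrative Lilliefors-type K-S distance for the Example 1 data,
# before and after the log-transformation.
import math
from statistics import NormalDist, mean, stdev

data = [180.51, 2.33, 48.67, 187.07, 120.21, 87.96, 136.75, 24.47,
        82.23, 128.38, 850.91, 1041.73, 901.92, 1027.18, 1229.94]

def ks_stat(xs):
    """Max distance between the empirical cdf and a fitted normal cdf."""
    xs = sorted(xs)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = fitted.cdf(x)
        d = max(d, abs(i / n - f), abs(f - (i - 1) / n))
    return d

d_raw = ks_stat(data)
d_log = ks_stat([math.log(x) for x in data])
print(f"K-S raw: {d_raw:.3f}, K-S log: {d_log:.3f}")
```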
As with full uncensored samples, classical procedures applied to left-censored data sets with potential
high outliers result in distorted estimates of location and scale. A brief description of some of the above-
mentioned procedures is given in Section 2 and some real and simulated data sets are discussed in
Section 3. Results of a few simulation runs for the various classical methods are given in Section 4, and
the conclusions are summarized in Section 5. The simulation results based on robust procedures are in
agreement with the classical procedures without outliers. However, due to the length of the present article,
results of the complete Monte Carlo study (using data sets with outliers) to assess the performances of the
various classical and robust procedures are not included in this article, and will be submitted for
publication at a later date.
Section 2
Mathematical Formulation
In the following, it has been implicitly assumed that the data set under consideration has been
obtained from a "single" normal population, perhaps after a suitable Box-Cox [1964] type transformation
(including the log-transformation), with unknown mean, μ, and sd, σ. Cohen [1950, 1959] derived the
maximum likelihood (ML) equations for censored samples and prepared tables of the constants needed to
obtain the MLEs of μ and σ. The ML equations are solved iteratively using a suitable numerical method,
such as the Newton-Raphson method. Some computer programs (e.g., UNCENSOR by Newman et al.
[1989]) are available to compute the MLEs and the RMLEs from left-censored samples obtained from
normal and lognormal populations. Some of these classical and robust methods have been incorporated
into a computer program, CENSOR (see Scout: A Data Analysis Program), for the estimation of μ and σ
from left-censored data sets with potential outliers.
Let x_1, x_2, ..., x_n be a random sample from a normal, N(μ, σ), population, with the k non-
detects, x_1, x_2, ..., x_k, lying numerically below the detection limit, L. Let φ and Φ be the
probability density function (pdf) and cumulative distribution function (cdf) of the standard normal
distribution (snd). The logarithm of the likelihood function is given as follows:

$$\ln L(x, \mu, \sigma) = k \ln \Phi(Z) - n_0 \ln \sigma - \sum_{i=k+1}^{n} (x_i - \mu)^2 / (2\sigma^2) + \text{constant}, \qquad (1)$$

where n_0 = (n - k) and Z = (L - μ)/σ, with Φ(Z) representing the probability that an
observation is less than L. The mean, x̄_o, and variance, s_o², computed from the (n - k) observed data values
are:

$$\bar{x}_o = \sum_{i=k+1}^{n} x_i / (n-k), \quad \text{and} \quad s_o^2 = \sum_{i=k+1}^{n} (x_i - \bar{x}_o)^2 / (n-k). \qquad (2)$$

A brief description of some of the robust procedures to estimate population parameters from
contaminated left-censored samples is given as follows.
A brief description of some of the robust procedures to estimate population parameters from
contaminated left-censored samples is given as follows.
Robust Procedures
When dealing with data sets originating from environmental applications, one is faced with the dual
problem of the occurrence of below detection limit concentrations (non-detects) in the left tail and
possibly some extreme concentrations in the right tail of the distribution of the contaminant (e.g., lead)
under consideration. The presence of outliers leads to distorted estimates of the population mean, μ, and
the sd, σ. It is, therefore, important that these unusual observations in both tails of the distribution be
treated adequately. For full uncensored data sets, simple robust estimates such as the trimmed mean or
the Winsorized mean (Hoaglin, Mosteller, and Tukey [1983]) are sometimes used to estimate the population
mean in the presence of outliers. For example, a 100p% trimmed mean is obtained by using only the
middle n(1 - 2p) data values, with np values omitted from each of the two (left and right) tails of the
data set. Gilbert [1987] suggested the use of the Winsorized and trimmed means for the estimation of μ
and σ for left-censored data sets. Depending upon the censoring intensity, the use of the trimmed and
Winsorized means has also been recommended in some guidance documents, such as the Guidance for Data
Quality Assessment [1996]. Helsel [1990] discussed the use of non-parametric, distribution-free
procedures, and of a simple robust pair, (median, MAD/0.6745), to estimate μ and σ, where MAD
represents the median absolute deviation. Gilliom and Helsel [1986] suggested the use of least-
squares regression on the log-transformed data to obtain robust estimates of μ and σ from left-censored
data sets.
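A minimal sketch of the 100p% trimmed and Winsorized means for a full, uncensored sample follows; the function names and the example data are illustrative, not from the article.

```python
# Illustrative trimmed and Winsorized means for a full (uncensored) sample.
def trimmed_mean(xs, p):
    """Drop the np smallest and np largest values, then average the rest."""
    xs = sorted(xs)
    np_ = int(len(xs) * p)
    kept = xs[np_:len(xs) - np_]
    return sum(kept) / len(kept)

def winsorized_mean(xs, p):
    """Replace the np extreme values in each tail by the nearest kept value."""
    xs = sorted(xs)
    n = len(xs)
    np_ = int(n * p)
    w = [xs[np_]] * np_ + xs[np_:n - np_] + [xs[n - np_ - 1]] * np_
    return sum(w) / n

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]   # one gross outlier
print(trimmed_mean(data, 0.10), winsorized_mean(data, 0.10))  # → 5.5 5.5
```

Both estimates ignore or dampen the outlying value 100, whereas the ordinary mean (14.5) is pulled far to the right.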
In this article, we use robust M-estimation procedures based on the notion of the influence function
(Hampel [1974]), which assigns reduced weights to the outlying observations. For full uncensored data
sets, several robust procedures exist in the literature for the estimation of the population mean and
variance (Huber [1981], Rousseeuw and Leroy [1984], Staudte and Sheather [1990], Singh and Nocerino
[1995]). For left-censored data sets, in order to identify and subsequently assign reduced weights to the
outliers that may be present in the right tail of a data set, the robust sample mean, x̄_o*, and the robust
sd, s_o*, based on the (n - k) observed values, need to be obtained first. These values are then used in the
various estimation methods, such as the MLE, UMLE, RMLE, and EM methods, to obtain robust estimates
of the population mean and sd. Singh and Nocerino [1995] showed that, for full data sets, the PROP
influence function works very well for 1) the identification of multiple multivariate outliers, and 2) the
robust estimation of the population mean vector and the dispersion matrix. In this article, those
techniques are extended to obtain robust estimates of the population mean and variance using left-
censored data sets with outliers.
The PROP influence function (Singh [1993]) and the corresponding iteratively obtained sample mean
and sd based on the (n-k) detected observations are given as follows.
$$\psi(d_i) = \begin{cases} d_i, & d_i \le d_0 \\ d_0 \exp[-(d_i - d_0)], & d_i > d_0 \end{cases} \qquad (3)$$

$$\bar{x}_o^* = \sum_{k+1}^{n} w_1(d_i)\, x_i \Big/ \sum_{k+1}^{n} w_1(d_i); \qquad s_o^{*2} = \sum_{k+1}^{n} w_2(d_i)\,(x_i - \bar{x}_o^*)^2 / \nu. \qquad (4)$$

Here, d_i² = (x_i - x̄_o*)²/s_o*²; i = k+1, k+2, ..., n, and d_0² is the α·100% critical value, from
the scaled beta distribution, (n-k-1)² B(1/2, (n-k-2)/2)/(n-k), of the
distances, d_i².
The weights are given by w_1(d_i) = ψ(d_i)/d_i and w_2(d_i) = w_1²(d_i), with the
degrees of freedom ν = wsum2 - 1, where wsum1 = Σ w_1(d_i) and wsum2 = Σ w_2(d_i).
Since the number of outliers present in a data set is usually unknown, more than one value of α
should be tried on the same data set. The commonly used values of the level α are 0.01 and 0.05. In the
presence of multiple outliers, higher values of α (e.g., 0.1) may be needed (Singh and Nocerino [1995]) to
unmask outliers. This is especially true when the sample size is small (e.g., 20 or less). It needs to be
pointed out that the outlier identification procedures based on influence functions typically identify
extreme observations in both tails of the underlying distribution. When dealing with left-censored data
sets, one is concerned with the identification of outlying observations that might be present in the right
tail of the distribution; therefore, reduced weights are to be assigned only to those extreme observations found
in the right tail. Each of the observed detected values in the left tail is assigned a unit weight.
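The weighting step can be sketched as follows. This is an illustrative simplification of our own: the cutoff d0 is hand-picked rather than taken from the scaled beta critical values, and a single sweep from (median, MAD/0.6745) seeds is shown instead of the full iteration.

```python
# Illustrative one-sweep PROP weighting of the detected values.
# The cutoff d0 is a hand-picked value, not the scaled-beta critical value.
import math

def prop_psi(d, d0):
    """PROP influence function, equation (3): identity up to d0, then decaying."""
    return d if d <= d0 else d0 * math.exp(-(d - d0))

def prop_weights(xs, d0):
    """Weights w1(d_i) = psi(d_i)/d_i from robust seeds (median, MAD/0.6745)."""
    xs_sorted = sorted(xs)
    n = len(xs)
    med = xs_sorted[n // 2] if n % 2 else 0.5 * (xs_sorted[n // 2 - 1] + xs_sorted[n // 2])
    mad = sorted(abs(x - med) for x in xs)[n // 2]   # crude MAD
    s = mad / 0.6745
    ds = [abs(x - med) / s for x in xs]
    return [prop_psi(d, d0) / d if d > 0 else 1.0 for d in ds]

detects = [1.29, 1.16, 1.16, 1.33, 1.16, 1.56, 1.29, 1.33, 1.29,
           1.47, 1.46, 1.36, 1.13, 6.25]       # last value is an outlier
w = prop_weights(detects, d0=2.5)
print([round(v, 2) for v in w])
```

The inlying values receive unit weight, while the distance of the outlier places it far beyond d0, so its weight decays essentially to zero.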
Cohen's Method
Cohen's MLEs for the mean and the variance are obtained by solving the following equations:

$$\hat{\mu}_{MLE} = \bar{x}_o - \hat{\lambda}(\bar{x}_o - L), \quad \text{and} \quad \hat{\sigma}^2_{MLE} = s_o^2 + \hat{\lambda}(\bar{x}_o - L)^2, \qquad (5)$$

where λ̂ = λ(ĝ, ĥ), with ĝ = s_o²/(x̄_o - L)² and ĥ = k/n. The estimates of μ and σ given by equation (5) are biased.
For a Type II censored data set from a normal population, Saw [1961] tabulated the first-order bias
correction terms, which were simplified by Schneider [1986] and are given as follows:

$$\text{Bias}_\mu = -\exp[2.692 - 5.439(n-k)/(n+1)], \quad \text{and} \qquad (6)$$

$$\text{Bias}_\sigma = -[0.312 + 0.859(n-k)/(n+1)]^{-2}. \qquad (7)$$

In practice, the bias corrections given by equations (6) and (7) are also used for Type I censored data.
The bias-corrected MLE, denoted by UMLE, is given as follows:

$$\hat{\mu}_{UMLE} = \hat{\mu}_{MLE} - \hat{\sigma}_{MLE}\,\frac{\text{Bias}_\mu}{n+1}, \quad \text{and} \quad \hat{\sigma}_{UMLE} = \hat{\sigma}_{MLE} - \hat{\sigma}_{MLE}\,\frac{\text{Bias}_\sigma}{n+1}. \qquad (8)$$

The corresponding robust ML and UML estimates of μ and σ are obtained by using the robust
estimates, x̄_o* and s_o*, in place of x̄_o and s_o in equations (5) and (8), respectively.
Expectation Maximization (EM) Algorithm
Dempster, Laird, and Rubin [1977] developed the EM algorithm to maximize the likelihood function
based upon censored and missing data. The iterative EM algorithm starts from the observed values,
initially treating the sample as if no observations were censored. At the initial iteration, using the observed (n - k) data
values, one could start with some convenient estimates for μ and σ, such as the sample mean and sd, or a
simple one-step robust pair represented by the median and MAD/0.6745. The iterations are defined by
successively maximizing the expectation of the conditional likelihood function of the complete data,
given the type of censoring. Gleit [1985] used this procedure for left-censored samples and found it to
possess a lower MSE than the various other substitution and likelihood procedures. For the single
detection limit case, the estimates of μ and σ at the (j+1)th iteration are given as follows (Shumway, Azari,
and Johnson [1989]).
$$\mu_{j+1} = \Big[\sum_{i=k+1}^{n} x_i + \sum_{i=1}^{k} E_j(X_i \mid X_i \le L)\Big] \Big/ n, \qquad (9)$$

$$\sigma_{j+1}^2 = \Big[\sum_{i=k+1}^{n} (x_i - \mu_j)^2 + \sum_{i=1}^{k} E_j\big((X_i - \mu_j)^2 \mid X_i \le L\big)\Big] \Big/ (n-1), \quad \text{where} \qquad (10)$$

$$E_j[X_i \mid X_i \le L] = \mu_j - \sigma_j\,[\phi(Z)/\Phi(Z)], \quad \text{with } Z = (L - \mu_j)/\sigma_j, \qquad (11)$$

$$E_j[(X_i - \mu_j)^2 \mid X_i \le L] = \sigma_j^2\,\big(1 - Z[\phi(Z)/\Phi(Z)]\big). \qquad (12)$$
Thus the EM method is an iterative substitution method in which, at each iteration, all of the non-detects
are replaced by the same conditional expected value as given by equation (11). In the presence of
outliers, the conditional expected value given by equation (11) gets distorted (e.g., becomes negative)
and results in inadequate estimates from equations (9) and (10). Typically, contaminant
concentrations are non-negative, and substituting a negative value for non-detects would be inappropriate.
In these cases, the non-detects have to be replaced by zero or by half of the detection limit, L/2 (see
Example 4). In this article, whenever the conditional expected value became negative, it was replaced by
L/2. As shown in the examples to follow, the robust EM estimation procedure takes care of this problem
by assigning reduced weights to the outlying observations. The robust EM estimates at the (j+1)th
iteration are given as follows:

$$\mu_{j+1} = \Big[\sum_{i=k+1}^{n} w_i x_i + \sum_{i=1}^{k} E_j(X_i \mid X_i \le L)\Big] \Big/ (\text{wsum1} + k), \qquad (13)$$

with the corresponding weighted analogue of equation (10),

$$\sigma_{j+1}^2 = \Big[\sum_{i=k+1}^{n} w_i (x_i - \mu_j)^2 + \sum_{i=1}^{k} E_j\big((X_i - \mu_j)^2 \mid X_i \le L\big)\Big] \Big/ (\text{wsum1} + k - 1). \qquad (14)$$
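The classical iteration (9)-(12) is straightforward to sketch; the code below (illustrative, using the Example 2 sample and the observed-sample mean and sd as starting values) runs the updates to convergence:

```python
# Illustrative run of the classical EM updates (9)-(12) on the Example 2
# sample (k = 2 non-detects, L = 1.0).
from statistics import NormalDist, mean, stdev

snd = NormalDist()
L, k = 1.0, 2
detects = [1.2883, 1.1612, 1.1560, 1.3251, 1.1568, 1.5638, 1.2914,
           1.3253, 1.2884, 1.4688, 1.4581, 1.3641, 1.1342]
n = len(detects) + k

mu, sd = mean(detects), stdev(detects)      # starting values
for _ in range(200):                        # ample for convergence here
    z = (L - mu) / sd
    ratio = snd.pdf(z) / snd.cdf(z)
    e1 = mu - sd * ratio                    # (11): E[X | X <= L]
    e2 = sd ** 2 * (1 - z * ratio)          # (12): E[(X - mu)^2 | X <= L]
    mu_next = (sum(detects) + k * e1) / n                                # (9)
    var_next = (sum((x - mu) ** 2 for x in detects) + k * e2) / (n - 1)  # (10)
    mu, sd = mu_next, var_next ** 0.5

print(round(mu, 2), round(sd, 2), round(e1, 2))
```

The converged estimates and the conditional replacement value agree with the EM column of Table 1 below.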
Restricted Maximum Likelihood (RMLE) Method
Persson and Rootzen [1977] obtained the restricted likelihood estimates by simplifying the ML
equations. The likelihood function can be written as follows:

$$L(x, \mu, \sigma) = [\Phi(Z)]^k\,(2\pi\sigma^2)^{-(n-k)/2}\,\exp\Big[-\sum_{i=k+1}^{n} (y_i + Z\sigma)^2 / (2\sigma^2)\Big], \qquad (15)$$

where y_i = x_i - L; i = k+1, k+2, ..., n. The random variable, (n - k), representing the number of observed values
above L, can be expressed as a binomial random variable with the pdf given below:

$$P(\text{No. of observations lying above } L = r) = [n! / (r!\,(n-r)!)]\,(1 - \Phi(Z))^r\,\Phi(Z)^{n-r}. \qquad (16)$$

The MLE of Φ(Z), the probability that an observation lies below L, is k/n.
Thus, for 0 < k/n < 1, Ẑ = Φ⁻¹(k/n).
Substituting Ẑ = Φ⁻¹(k/n) for Z in equation (15) and then maximizing the resulting restricted likelihood
function yields the following closed form estimates of μ and σ:

$$\hat{\sigma}_{RML} = \Big[\hat{Z} c + \big(\hat{Z}^2 c^2 + 4 \sum_{i=k+1}^{n} y_i^2 / (n-k)\big)^{1/2}\Big] \Big/ 2, \quad \text{and} \quad \hat{\mu}_{RML} = L - \hat{Z}\,\hat{\sigma}_{RML}, \qquad (17)$$

where c = Σ_{i=k+1}^{n} y_i/(n - k). The estimates given by equation (17) are biased, which can be
corrected as follows. For left-censored samples, E[x̄_o] = μ + σα, and E[s_o²] = σ²[1 + (αZ - α²)],
where α = φ(Z)/(1 - Φ(Z)), and the bias-corrected RMLEs are given as follows:

$$\hat{\mu}_{BRML} = \bar{x}_o - \hat{\alpha}\,\hat{\sigma}_{BRML}, \quad \text{and} \quad \hat{\sigma}_{BRML} = \big[s_o^2 - (\hat{\alpha}\hat{Z} - \hat{\alpha}^2)\,\hat{\sigma}^2_{RML}\big]^{1/2}, \qquad (18)$$

where α̂ = φ(Ẑ)/(1 - Φ(Ẑ)).
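The closed-form RMLE and its bias correction can be sketched as follows; applied to the Example 2 sample (k = 2, L = 1.0), it reproduces the bias-corrected values reported in Table 1 (mean 1.26, sd 0.17). The helper names are ours.

```python
# Illustrative bias-corrected RMLE, equations (15)-(18).
from statistics import NormalDist

snd = NormalDist()

def rmle(detects, k, L):
    n0 = len(detects)
    n = n0 + k
    z = snd.inv_cdf(k / n)                     # Z-hat = Phi^{-1}(k/n)
    y = [x - L for x in detects]
    c = sum(y) / n0
    syy = sum(v * v for v in y)
    # Equation (17): the positive root of the restricted ML equation.
    sd_rml = (z * c + (z * z * c * c + 4 * syy / n0) ** 0.5) / 2
    # Bias correction, equation (18).
    xbar = sum(detects) / n0
    s2 = sum((x - xbar) ** 2 for x in detects) / n0   # divisor (n-k), eq. (2)
    alpha = snd.pdf(z) / (1 - snd.cdf(z))
    sd_brml = (s2 - (alpha * z - alpha ** 2) * sd_rml ** 2) ** 0.5
    mu_brml = xbar - alpha * sd_brml
    return mu_brml, sd_brml

detects = [1.2883, 1.1612, 1.1560, 1.3251, 1.1568, 1.5638, 1.2914,
           1.3253, 1.2884, 1.4688, 1.4581, 1.3641, 1.1342]
mu, sd = rmle(detects, k=2, L=1.0)
print(round(mu, 2), round(sd, 2))   # → 1.26 0.17
```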
Regression Method
The ordinary least squares (OLS) regression line is obtained by fitting a model to the observed data
(perhaps after a suitable transformation) and the hypothetical normal quantiles. In other words, it is
assumed that the k censored observations, x_1, x_2, ..., x_k, follow the zero-to-detection-limit
portion of a normal (transformed) distribution. A least squares regression line is obtained using the (n - k)
pairs, (q_(i), x_(i)); i = k+1, k+2, ..., n, where the x_(i) are the observed values arranged in ascending
order. The n quantiles, q_(i), are obtained using an appropriate normal probability statement,
P[Z ≤ q_(i)] = p_i, for a chosen plotting position, p_i.
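The regression idea can be sketched as follows, with the plotting position p_i = i/(n+1) as an illustrative assumption of ours (the article's exact probability statement is not reproduced here). The intercept and slope of the fit serve as estimates of μ and σ, and the quantiles below the detection limit give extrapolated values for the non-detects.

```python
# Illustrative regression-on-order-statistics estimates for the Example 2
# sample; plotting position p_i = i/(n+1) is an assumed choice.
from statistics import NormalDist

snd = NormalDist()

def regression_estimates(detects, k, L):
    n = len(detects) + k
    xs = sorted(detects)                       # x_(k+1) <= ... <= x_(n)
    qs = [snd.inv_cdf(i / (n + 1)) for i in range(k + 1, n + 1)]
    qbar = sum(qs) / len(qs)
    xbar = sum(xs) / len(xs)
    slope = (sum((q - qbar) * (x - xbar) for q, x in zip(qs, xs))
             / sum((q - qbar) ** 2 for q in qs))
    intercept = xbar - slope * qbar
    # Extrapolate the k non-detects from the quantiles below L.
    extrapolated = [intercept + slope * snd.inv_cdf(i / (n + 1))
                    for i in range(1, k + 1)]
    return intercept, slope, extrapolated

detects = [1.2883, 1.1612, 1.1560, 1.3251, 1.1568, 1.5638, 1.2914,
           1.3253, 1.2884, 1.4688, 1.4581, 1.3641, 1.1342]
mu_hat, sd_hat, nds = regression_estimates(detects, k=2, L=1.0)
print(round(mu_hat, 2), round(sd_hat, 2), [round(v, 3) for v in nds])
```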
It is well known that the OLS estimates of intercept and slope (Rousseeuw and Leroy [1984]), and,
hence, the mean and sd and the extrapolated non-detects, get distorted by the presence of even a single
outlier. In the presence of outliers, the use of the log-transformation alone will not result in robust
estimates of intercept and slope. Robust regression methods, as given by Rousseeuw and Leroy [1984]
and by Singh and Nocerino [1995], may be used to obtain robust estimates of slope and intercept. Some
examples are considered next to illustrate the procedures described here. The discussion and use of
robust regression for censored data sets are beyond the scope of the present article. All computations are
performed using the CENSOR program. In the following, all replacement values (when applicable) for
non-detects are listed in parentheses. Since the substitution methods do result in biased estimates of μ
and σ, their computations are omitted from most of the examples and simulation results discussed in the
following sections.
Section 3
More Examples
Example 2. A simulated data set of size 15 was obtained from a normal population with mean, μ = 1.33,
and sd, σ = 0.2, N(1.33, 0.2), with L = 1.0 and k = 2. The left-censored data are: <1.0, <1.0,
1.2883, 1.1612, 1.1560, 1.3251, 1.1568, 1.5638, 1.2914, 1.3253, 1.2884, 1.4688, 1.4581,
1.3641, 1.1342. The sample mean and sd obtained using the 13 observed data values were
1.306 and 0.134, respectively. The classical and robust procedures produced similar results,
which are given in Table 1. For this simulated data set, observe that all of the methods,
except the first two substitution methods, gave similar results.
Table 1. Classical/Robust results for N(1.33, 0.2) with n=15, k=2, and L=1.0.

Method   Zero   L/2    L      MLE    UMLE   RMLE   Regress*  EM**
Mean     1.13   1.20   1.27   1.25   1.26   1.26   1.27      1.25
sd       0.48   0.31   0.16   0.18   0.19   0.17   0.16      0.19

*(0.979, 1.061)   **(0.91)
Note: Substitution values for non-detects are given in parentheses, identified by one (1)
asterisk (*) for the Regress method and by two (2) asterisks (**) for the EM method.
Example 3. Next, the data set of Example 2 was contaminated with two outliers, 3.8561 and 6.2513,
from a normal, N(5,2) population. Outliers distorted the classical estimates of the mean and
the sd for all of the methods, and also distorted the intercept and slope of the OLS
regression. The sample mean and sd for the 15 observed data points were 1.806 and 1.4,
respectively. The results are given in Table 2. Notice that for the EM algorithm, the outliers
distorted the conditional replacement value of 0.91 to 0.025. The robust MLE, RMLE, and
the EM methods, on the other hand, resulted in fairly accurate estimates, and the robust
results of Table 2 with the outliers, and the classical results of Table 1, without the outliers,
are in close agreement.
Table 2. Results for N(1.33, 0.2), with outliers from N(5,2), n=17, k=2, and L=1.0.

Method   Classical Mean   Classical sd    Robust Mean   Robust sd
L        1.71             1.34            1.27          0.16
MLE      1.60             1.41            1.26          0.17
UMLE     1.61             1.49            1.26          0.18
RMLE     1.55             1.50            1.26          0.17
EM*      1.60             1.46 *(0.025)   1.25          0.18 *(0.91)

Note: Substitution values for non-detects are given in
parentheses, identified by one asterisk (*) for the
EM method.
Example 4. This left-censored data set is taken from the U.S. EPA RCRA guidance document [1992].
The detection limit is 1450. The data, with 3 non-detects and 21 observed values, are: <1450,
<1450, <1450, 1850, 1760, 1710, 1575, 1475, 1780, 1790, 1780, 1790, 1800, 1800, 1840,
1820, 1860, 1780, 1760, 1800, 1900, 1770, 1790, 1780. The sample mean and sd obtained
using the 21 observed data values are 1771.91 and 92.702, respectively. The classical and robust
estimates are summarized in Table 3. Note that substitution by L/2, one of the most frequently
used methods in environmental applications, resulted in a biased estimate of the mean with the
highest variability. All of the likelihood methods and the EM
method resulted in fairly similar estimates. The regression method resulted in estimated
non-detects larger than L. Figure 1 displays the classical fit obtained using the observed
data, (q_(i), x_(i)), i = 4, 5, ..., 24. This model is then used to estimate the non-detects by
extrapolation. The graph with the extrapolated non-detects is given in Figure 2. As can be seen
in this figure, the three estimated non-detects (circled) are larger than the detection limit,
L=1450, and even larger than the smallest observed value of 1475. The use of those
extrapolated non-detects (higher than the reporting value), of course, will reduce the spread
in the data, but will also result in a highly biased estimate of the mean: using all 24 values, this
gave sd = 103.21 and a high-biased estimate of the mean = 1751.36. Using the slope and the
intercept of the regression line instead, we have an estimate of sd = 92.15 and an estimate of the
mean = 1751.36. This example alone suggests that the OLS regression approaches are not
suitable for the estimation of the mean and the sd from censored samples.
Table 3. Classical/Robust results (without outliers), n=24, k=3, and L=1450.

Method   L/2       L         MLE      UMLE     RMLE     Regress*  EM**
Mean     1641.04   1731.67   1724.0   1724.94  1725.55  1751.36   1723.66
sd       364.09    138.92    153.65   159.39   144.37   103.21    157.80

*(1571.91, 1613.25, 1637.46)   **(1385.97)
Note: Substitution values for non-detects are given in parentheses, identified by one (1) asterisk (*) for
the Regress method and by two (2) asterisks (**) for the EM method.
[Figure 1. Classical Fit for Observed Data: least squares line with slope = 92.150, Y intercept = 1751.359, and correlation = 0.800, plotted against the theoretical quantiles of the normal distribution.]
[Figure 2. Non-Detects Obtained Using the Classical Fit: the same least squares line (slope = 92.150, Y intercept = 1751.359, correlation = 0.921) with the three extrapolated non-detects circled, plotted against the theoretical quantiles of the normal distribution.]
In order to further illustrate how the presence of outliers distorts these estimates, three arbitrarily
chosen outliers, 7000, 8000, and 11000 are added to the data set of this example. The relevant classical
and robust statistics for the contaminated data set are summarized below in Table 4. The classical
observed sample mean and sd for the data with outliers (24 values) are 2633.75 and 2410.35. Using
equation (21), the intercept and slope for the left-censored contaminated data set are 2216.51 and
2061.25. Use of this OLS fit resulted in distorted negative values for the extrapolated non-detects which
are given in Table 4.
Table 4. Results with 3 discordant values, n=27, k=3, and L=1450.

               Classical              Robust (α = 0.01, 0.05)
    Method     Mean       sd          Mean       sd
    MLE        2317.64    2437.60     1729.83    147.41
    UMLE       2329.79    2516.74     1730.56    152.19
    RMLE       2204.61    2609.06     1731.14    139.17
    Regress*   2216.51    2574.10     --         --
    EM**       2312.80    2491.23     1723.66    157.80

    *(-1899.83, -995.304, -469.177)   **classical (-254.79), robust (1385.97)

Note: Substitution values for non-detects are given in parentheses, identified by one
asterisk (*) for the Regress method and by two asterisks (**) for the EM
method.
For the classical EM procedure, the estimated non-detects became negative. Notice that the robust
results for the MLE, the RMLE, and the EM-based algorithm are in close agreement with or without the
outliers as can be seen by comparing Tables 3 and 4. Also note that the robust replacement value of
1385.97 for the EM algorithm is in agreement with the corresponding classical value without the outliers.
Next, the log-transformation of the data set with the outliers is considered. The corresponding
estimates are given below in Table 5. The classical mean and sd for the observed log-transformed data
are 7.675 and 0.537. In the following, all back-transformation results are obtained using equation (22).
The outliers distorted the estimates of the mean and the sd for all of the methods, including the regression
method.
The OLS fit on the log-transformed data is given in Figure 3. The intercept and slope are 7.577 and
0.481, respectively. Using this fit, the estimated non-detects are 6.62, 6.83, and 6.95, which, when
converted back to the original units, are 749.95, 925.19, and 1043.15, respectively. The resulting
estimates (using all 27 data points) of the mean and the sd in the original units are 2441.79 and 2333.93,
which are obviously influenced by the three outliers. Thus, as mentioned above, the log-transformation
alone cannot produce robust regression estimates. The only advantage of using the log-transformed data
is that the replacement values for the non-detects did not become negative for the regression and the EM
method.
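Equation (22) is not reproduced in this excerpt; the back-transformed values in Tables 5 and 6, however, are consistent with the standard lognormal moment formulas, so the conversion can be sketched on that assumption (the function name is our own):

```python
import math

def lognormal_back_transform(mean_log, sd_log):
    """Convert (mean, sd) estimated on the log scale back to the original
    scale, assuming lognormality:
        mean = exp(mu + sigma^2 / 2)
        sd   = mean * sqrt(exp(sigma^2) - 1)
    """
    var = sd_log ** 2
    mean = math.exp(mean_log + var / 2.0)
    sd = mean * math.sqrt(math.expm1(var))
    return mean, sd
```

The MLE row of Table 5 (7.59, 0.56) back-transforms to roughly (2314, 1405), matching the tabled (2314.35, 1393.69) to within the two-decimal rounding of the log-scale inputs.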
Table 5. Classical Estimates for Log-Transformed Data with Outliers, n=27, and k=3.

               Log Transform       Back Transform
    Method     Mean      sd        Mean       sd
    MLE        7.59      0.56      2314.35    1393.69
    UMLE       7.59      0.57      2344.59    1456.60
    RMLE       7.58      0.58      2315.67    1476.96
    Regress*   7.58      0.58      2311.29    1461.96
    EM**       7.59      0.57      2327.96    1438.29

    *(6.62, 6.83, 6.95)   **(6.92)

Note: Substitution values for non-detects are given in parentheses,
identified by one asterisk (*) for the Regress method and by two
asterisks (**) for the EM method.
[Figure 3 (plot omitted): normal Q-Q plot of the log-transformed data with outliers and the
least squares fit; Slope = 0.481, Y Intercept = 7.577, Correlation = 0.805.]
Figure 3. OLS Fit with Outliers: Log-Transformed Data.
The PROP estimates with α=0.05 on the log-transformed data are given in Table 6. The robust mean
and sd for the observed data are 7.48 and 0.0086, respectively. The results summarized in Table 6 are in
close agreement with the robust results (Table 4) obtained using the data in the original units and also
with the estimates (Table 3) obtained using the classical procedure without the outliers.
Table 6. Robust Estimates for Log-Transformed Data with Outliers, n=27, and k=3.

              Log Transform       Back Transform
    Method    Mean      sd        Mean       sd
    MLE       7.45      0.09      1731.10    155.89
    UMLE      7.45      0.09      1732.34    161.09
    RMLE      7.45      0.08      1731.72    146.78
    EM*       7.45      0.10      1725.57    166.57

    *(7.24)

Note: The substitution value for non-detects is given in
parentheses, identified by one asterisk (*) for the EM
method.
Section 4
Simulation Experiments and Results
The performances of the various procedures are measured in terms of bias and MSE. The bias of an
estimate, θ̂, of a parameter, θ, is defined as its departure, (θ̂ − θ), from the parameter. For a
simulation experiment with N iterations, these are given by bias = Σ (θ̂ᵢ − θ)/N and
MSE = Σ (θ̂ᵢ − θ)²/N, with both sums taken over i = 1, 2, ..., N, where θ̂ᵢ is the estimate of θ
obtained from the sample generated at the ith iteration. Some results of the simulation experiments for
left-censored samples without outliers, obtained from the normal population, N(5,2), with 5000 iterations
each, are discussed as follows. Several classical estimation methods for various values of L, sample sizes,
and censoring intensities are considered. For the "fixed" detection limit case, L was set at 1.0, 2.0, 4.0,
and 5.0 for each of the censoring levels. For the "computed" detection limit case, for a 100α% censoring
intensity, the detection limit, L, is given by the equation L = μ + σ·zα, where zα, the critical value of the
standard normal distribution, is given by P(Z < zα) = α. Selected graphs of bias and MSE for some values
of L (e.g., 2 and 4), sample sizes (e.g., 5 - 25), and censoring intensities (e.g., 10% - 60%) are presented
here. Since the various substitution methods do not perform well, most of the results and graphs
discussed are for the MLE, UMLE, RMLE, EM, and regression methods. From these graphs
(Figures 4-27), as expected, it is observed that the MSE and bias for all of the methods decrease with the
sample size. Also, it is noticed that the differences in the MSEs of the various likelihood procedures
decrease as the sample size increases (e.g., Figures 4-6 and 9-11).
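The bias/MSE computation and the "computed" detection limit can be sketched as follows. The L/2 substitution estimator of the mean is used purely for illustration (any of the estimators in the study could be plugged in), and the function names are our own:

```python
# Monte Carlo sketch of the bias/MSE experiment for N(mu, sigma) samples
# left-censored at L, using the simple L/2 substitution estimator of the mean.
import random
from statistics import NormalDist, fmean

def computed_L(mu, sigma, alpha):
    """'Computed' detection limit giving expected censoring intensity alpha:
    L = mu + sigma * z_alpha, where P(Z < z_alpha) = alpha."""
    return mu + sigma * NormalDist().inv_cdf(alpha)

def bias_mse(mu=5.0, sigma=2.0, L=2.0, n=15, iters=5000, seed=1):
    """Bias and MSE of the mean estimated by substituting L/2 for non-detects."""
    rng = random.Random(seed)
    errors = []
    for _ in range(iters):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        estimate = fmean(x if x >= L else L / 2.0 for x in sample)
        errors.append(estimate - mu)
    return fmean(errors), fmean(e * e for e in errors)
```

For α = 0.10, ..., 0.60, `computed_L(5, 2, alpha)` reproduces the detection limits 2.44, 3.32, 3.95, 4.49, 5.00, and 5.51 that appear on the computed-limit figures below.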
[Figures 4-6 (plots omitted): MSE against censoring intensity (10%-60%) at the fixed detection limit
L = 2.00 for the Cohen's MLE, RMLE, Unbias MLE, Regress, and Expected (EM) methods.]
Figure 4. MSE: 5000 Simulation Runs For N(5,2), L = 2.0 (n = 5).
Figure 5. MSE: 5000 Simulation Runs For N(5,2), L = 2.0 (n = 15).
Figure 6. MSE: 5000 Simulation Runs For N(5,2), L = 2.0 (n = 25).
[Figures 7-8 (plots omitted): bias against censoring intensity (10%-60%) at L = 2.00 for the
Cohen's MLE, RMLE, Unbias MLE, Regress, and Expected (EM) methods.]
Figure 7. Bias: 5000 Simulation Runs For N(5,2), L = 2.0 (n = 5).
Figure 8. Bias: 5000 Simulation Runs For N(5,2), L = 2.0 (n = 15).
[Figures 9-10 (plots omitted): MSE against censoring intensity (10%-60%) at L = 4.00 for the
Cohen's MLE, RMLE, Unbias MLE, Regress, and Expected (EM) methods.]
Figure 9. MSE: 5000 Simulation Runs For N(5,2), L = 4.0 (n = 5).
Figure 10. MSE: 5000 Simulation Runs For N(5,2), L = 4.0 (n = 15).
[Figures 11-12 (plots omitted): MSE and bias against censoring intensity (10%-60%) at L = 4.00 for the
Cohen's MLE, RMLE, Unbias MLE, Regress, and Expected (EM) methods.]
Figure 11. MSE: 5000 Simulation Runs For N(5,2), L = 4.0 (n = 25).
Figure 12. Bias: 5000 Simulation Runs For N(5,2), L = 4.0 (n = 5).
[Figures 13-14 (plots omitted): bias against censoring intensity (10%-60%) at L = 4.00 for the
Cohen's MLE, RMLE, Unbias MLE, Regress, and Expected (EM) methods.]
Figure 13. Bias: 5000 Simulation Runs For N(5,2), L = 4.0 (n = 15).
Figure 14. Bias: 5000 Simulation Runs For N(5,2), L = 4.0 (n = 25).
[Figures 15-16 (plots omitted): MSE against censoring intensity (10%-60%) at L = 4.00 for the
L/2, Cohen's MLE, RMLE, Unbias MLE, and Expected (EM) methods.]
Figure 15. MSE: 5000 Simulation Runs For N(5,2), L = 4.0 (n = 5).
Figure 16. MSE: 5000 Simulation Runs For N(5,2), L = 4.0 (n = 15).
[Figures 17-18 (plots omitted): MSE and bias against censoring intensity (10%-60%) at L = 4.00 for the
L/2, Cohen's MLE, RMLE, Unbias MLE, and Expected (EM) methods.]
Figure 17. MSE: 5000 Simulation Runs For N(5,2), L = 4.0 (n = 25).
Figure 18. Bias: 5000 Simulation Runs For N(5,2), L = 4.0 (n = 15).
[Figures 19-20 (plots omitted): MSE against censoring intensity (10%-60%) with the "computed"
detection limits 2.44, 3.32, 3.95, 4.49, 5.00, and 5.51. Figure 19 compares the Cohen's MLE, RMLE,
Unbias MLE, Regress, and Expected (EM) methods; Figure 20 compares the L/2, Cohen's MLE, RMLE,
Unbias MLE, and Expected (EM) methods.]
Figure 19. MSE: 5000 Simulation Runs For N(5,2), L Computed (n = 5).
Figure 20. MSE: 5000 Simulation Runs For N(5,2), L Computed (n = 15).
[Figures 21-22 (plots omitted): MSE against censoring intensity (10%-60%) with the "computed"
detection limits 2.44-5.51. Figure 21 compares the Cohen's MLE, RMLE, Unbias MLE, Regress, and
Expected (EM) methods; Figure 22 compares the L/2, Cohen's MLE, RMLE, Unbias MLE, and
Expected (EM) methods.]
Figure 21. MSE: 5000 Simulation Runs For N(5,2), L Computed (n = 25).
Figure 22. MSE: 5000 Simulation Runs For N(5,2), L Computed (n = 25).
[Figures 23-24 (plots omitted): bias against censoring intensity (10%-60%) with the "computed"
detection limits 2.44-5.51. Figure 23 compares the Cohen's MLE, RMLE, Unbias MLE, Regress, and
Expected (EM) methods; Figure 24 compares the Cohen's MLE, RMLE, Unbias MLE, and
Expected (EM) methods.]
Figure 23. Bias: 5000 Simulation Runs For N(5,2), L Computed (n = 5).
Figure 24. Bias: 5000 Simulation Runs For N(5,2), L Computed (n = 15).
[Figures 25-26 (plots omitted): bias against censoring intensity (10%-60%) with the "computed"
detection limits 2.44-5.51. Figure 25 compares the Cohen's MLE, RMLE, Unbias MLE, Regress, and
Expected (EM) methods; Figure 26 compares the Cohen's MLE, RMLE, Unbias MLE, and
Expected (EM) methods.]
Figure 25. Bias: 5000 Simulation Runs For N(5,2), L Computed (n = 25).
Figure 26. Bias: 5000 Simulation Runs For N(5,2), L Computed (n = 5).
[Figure 27 (plot omitted): bias against censoring intensity (10%-60%) at L = 2.00 for the
L/2, Cohen's MLE, RMLE, Unbias MLE, and Expected (EM) methods.]
Figure 27. Bias: 5000 Simulation Runs For N(5,2), L = 2.0 (n = 25).
Fixed Detection Limit
Figures 4 through 6 show the MSEs, and Figures 7 and 8 the bias, for the various procedures when
L=2.0. Figures 9 through 11 show the MSEs and Figures 12-14 display the bias for the various estimation
methods when L=4.0. It is observed that, for samples of smaller sizes (e.g., less than 10), the UMLE
method yields a higher bias than the MLE, RMLE, and the EM methods (Figures 7 and 12). However,
when L (e.g., 1, 2) is much smaller than the mean, μ (e.g., 5), the UMLE method has the smallest MSE
and the EM method has the largest MSE for samples of all sizes. When L is closer (e.g., 4 or 5) to the
mean, μ, for samples of smaller sizes (Figures 9 and 12), the MSE and bias for the UMLE become larger
than those of the EM, MLE, and the RMLE methods, with the EM method having the smallest MSE and
bias. This observation concurs with Gleit's (1985) findings for the EM method. But as the sample size
increases (e.g., becomes 15 or larger), as expected, the situation reverses, and the UMLE method results
in the smallest MSE, while the EM and the regression methods yield larger MSEs (Figures 10 and 11).
Note that, to some extent, this behavior of the MSE and bias of the EM method is similar to that of the
substitution by L/2 or L methods, except that, as the sample size increases, the MSE and bias for the latter
two substitution methods become much greater than those of the EM method, as can be seen in Figures
15-18 and 24. The EM method, after all, is just a substitution method in which all of the non-detects are
replaced by an optimally obtained conditional expected value. From Figures 13 and 14, it is noticed that
the bias for the EM and the regression methods becomes fairly large as the sample size increases.
Also, it is observed that, as the sample size increases, the bias of the MLE and RMLE methods starts
becoming smaller (in magnitude) than that of the UMLE method for all censoring intensities (Figures 13-
14). From all of these graphs, it is observed that the bias and MSEs obtained using the RMLE and
Cohen's MLE methods are stable, always stay very close to each other, and lie in the middle of the
respective bias and MSE of the other methods for all sample sizes and censoring levels. Actually, in most
cases, the RMLE method even results in a smaller bias and MSE than the MLE method, as can be seen
from Figures 8, 10, 11, 13, 14, and 16-18. Also, note that the differences in the MSE of the three MLE
methods (UMLE, Cohen's, and RMLE) decrease as the sample size increases.
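The "optimally obtained conditional expected value" by which the EM method replaces the non-detects is, under the normal model, the truncated-normal mean below L. A minimal sketch (the function name is our own):

```python
# EM-method replacement value for non-detects under a normal model.
from statistics import NormalDist

def expected_below(mu, sigma, L):
    """E[X | X < L] for X ~ N(mu, sigma^2):
    mu - sigma * phi(z) / Phi(z), with z = (L - mu) / sigma.
    In the EM iteration this is evaluated at the current parameter estimates."""
    z = (L - mu) / sigma
    nd = NormalDist()
    return mu - sigma * nd.pdf(z) / nd.cdf(z)
```

For the simulated population N(5,2) with L = 2.0, the replacement value is about 1.12; it always lies below L, compared with the cruder fixed substitute L/2 = 1.0.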
Computed Detection Limit
Figures 19-22 show the MSE, and Figures 23-27 the bias, for various sample sizes and censoring
levels. From these graphs, as expected, it is observed that the MSE for all of the methods
increases with the detection limit, L, or the censoring intensity, for all sample sizes. It is observed that for
small sample sizes, the EM method (the optimal replacement by the conditional expected value method)
results in a smaller MSE and bias (Figures 19 and 23), especially when the detection limit starts coming
closer to the mean value. The bias for the EM and L/2 methods becomes unacceptably high with
increased sample size, as can be seen in Figures 24 and 25. As observed earlier, note that, as the sample
size increases, the UMLE method results in the smallest MSE, and the substitution by L/2 method, the
EM method, and the regression method yield MSEs much larger than the three MLE methods. However,
the UMLE method results in a bias which is larger than those of the MLE and RMLE methods. This
increase in the bias of the UMLE becomes quite noticeable with increases in the sample size, the
detection limit, and the censoring intensity. This is especially true when the censoring level starts
exceeding 30%. Moreover, from all of these graphs, it is observed that both the bias and the MSEs
obtained using the RMLE and Cohen's MLE methods are stable and always stay close together for all of
the sample sizes and censoring levels. Also, as noticed earlier, the RMLE method does result in a smaller
bias and MSE than the MLE method (Figures 20-25) most of the time. These observations concur with
the conclusions derived by Haas and Scheff (1990).
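Cohen's MLE was originally computed with tabulated auxiliary functions. As a purely illustrative alternative, the same censored-sample likelihood can be maximized numerically with an EM iteration in which each censored value contributes its conditional first and second moments below L (note that this computes the censored-data MLE itself, and is distinct from the paper's "EM" substitution method, which substitutes only the conditional mean). The data and function name below are hypothetical:

```python
# Type I left-censored normal MLE via an EM iteration (illustrative sketch).
import math
from statistics import NormalDist, fmean, pstdev

def censored_normal_mle(detected, k, L, tol=1e-10, max_iter=1000):
    """MLE of (mu, sigma) from a left-censored normal sample.
    detected: the n-k observed values; k: number of non-detects below L."""
    nd = NormalDist()
    n = len(detected) + k
    s1 = sum(detected)
    s2 = sum(x * x for x in detected)
    mu, sigma = fmean(detected), max(pstdev(detected), 1e-6)
    for _ in range(max_iter):
        z = (L - mu) / sigma
        h = nd.pdf(z) / nd.cdf(z)                              # phi(z)/Phi(z)
        m1 = mu - sigma * h                                    # E[X | X < L]
        m2 = mu * mu + sigma * sigma - sigma * (L + mu) * h    # E[X^2 | X < L]
        mu_new = (s1 + k * m1) / n
        var_new = (s2 + k * m2) / n - mu_new * mu_new
        sigma_new = math.sqrt(max(var_new, 1e-12))
        done = abs(mu_new - mu) + abs(sigma_new - sigma) < tol
        mu, sigma = mu_new, sigma_new
        if done:
            break
    return mu, sigma
```

With k = 0 the iteration reduces to the ordinary sample moments; with censoring, the estimated mean is pulled below the detected-only average, as it should be.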
Section 5
Summary and Conclusions
In this article, two questions which arise when dealing with left-censored data sets have been
addressed. Those two questions are: 1) "Which method should be used for the estimation of the
population mean and sd from left-censored data sets?" and 2) "What is an appropriate robust estimation
procedure in the presence of potential outliers in the right tail of the distribution?"
The various substitution methods are simple, but do not perform well in most cases as they yield
estimates with a larger bias and MSE than those obtained using the MLE methods. Also, it is observed
that for larger sample sizes (e.g., >=15), the EM method results in a bias and MSE larger than those of the
MLE methods. The examples presented here lead to the conclusion that the OLS regression-based
approaches cannot be recommended for routine use. The estimated non-detects obtained by extrapolating
the fitted model are often infeasible, becoming negative or even exceeding the detection limit, L. The
examples and the simulation results presented in this article clearly establish that,
in most cases, the three MLE methods (Cohen's MLE, UMLE, and RMLE) perform better than the
various substitution and regression methods.
All of the classical estimation procedures, including the maximum likelihood and substitution
methods, result in distorted estimates in the presence of outliers. In the presence of outliers, the EM
method sometimes produces negative estimates of the non-detects, which in turn result in a biased
estimate of the population mean. The OLS regression models get distorted by the outlying observations;
therefore, regression estimates obtained using raw or log-transformed data are no longer reliable. Thus,
the OLS regression method based on the log-transformed data is not a "true robust" method. It is
observed that the robust estimation procedure based on the PROP influence function results in stable and
reliable estimates of the population parameters. Moreover, the resulting robust estimates, with or without
the outliers, and the classical estimates, without the outliers, stay in close agreement.
The performance of the various estimation methods described here depends upon several factors, such
as the sample size, the censoring intensity, and the value of the detection limit, L. The conclusions
derived from the simulation results and graphs presented in this article are summarized as follows.
• When the detection limit, L, is closer to the population mean, it is observed that for samples of smaller
sizes (e.g., 5-10), the EM method and the other substitution methods such as the L/2 method result in a
smaller bias and MSE than the three MLE (UMLE, Cohen's, and RMLE) procedures. However, as the
sample size increases (e.g., 15 or larger), the EM method, along with the other substitution methods,
results in a higher bias and a larger MSE than the three likelihood procedures.
• For values of L much smaller than the mean, the UMLE method results in the smallest MSE for
samples of all sizes.
• The differences in the MSE of the three MLE methods decrease as the sample size increases.
• The simulation results suggest that, although the UMLE method does result in the smallest MSE for
samples of size 15 or larger, the bias of the UMLE becomes larger (in magnitude) than the MLE and
the RMLE methods. This increase in the bias of the UMLE method becomes quite noticeable as the
detection limit increases and the censoring intensity starts exceeding 30%. Thus, for higher censoring
intensities, the MLE or the RMLE method may be used to obtain estimates of the population mean and
sd from a left-censored data set.
• The RMLE method is simple and results in estimates which are in close agreement with Cohen's ML
estimates. It is observed that the bias and MSEs obtained using the RMLE and Cohen's MLE
methods are stable and always stay close together for all sample sizes and censoring intensities.
Actually, in most cases, the RMLE method results in a smaller bias and MSE than those obtained using
Cohen's MLE method. This is especially true as the sample size increases.
Using the examples and results described here, the following recommendations can be made:
• For data sets with potential outliers, the robust estimation procedures based on influence functions,
such as the PROP influence function, should be used for the estimation of population parameters.
• For samples of small sizes (e.g., 10 observations or less), the EM method or the substitution by L/2 (or
L) method may be used, especially when L is closer to the mean.
• For samples of larger sizes (e.g., 15 observations or larger), the UMLE method may be used for
censoring levels of 30% or less.
• However, since the differences in the MSE of the three MLE methods (UMLE, Cohen's, and RMLE)
decrease as the sample size increases, and in order to make things easier for a typical user, it is
recommended that for larger sample sizes, or for samples with censoring levels exceeding 30%, the
much simplified RMLE method may be used for the estimation of the population parameters.
• The results of our study clearly establish that one should stay away from the substitution methods,
especially when the sample size is larger than 10 observations.
References
Box, G.E.P., and Cox, D.R., "An analysis of transformations," Journal of the Royal Statistical Society,
Ser. B, 26, pp. 211-252, 1964.
Barnett, V., "Convenient probability plotting positions for the normal distribution," Appl. Statist., 25, No.
1, pp. 47-50, 1976.
Cohen, A.C., Jr., "Estimating the mean and variance of normal populations from singly truncated and
doubly truncated samples," Ann. Math. Statist., Vol. 21, pp. 557-569, 1950.
Cohen, A.C., Jr., "Simplified estimators for the normal distribution when samples are singly censored or
truncated," Technometrics, Vol. 1, No. 3, pp. 217-237, 1959.
Cohen, A.C., Jr., "Truncated and censored samples," 119, Marcel Dekker Inc., New York, NY 1991.
Dempster, A.P., Laird, N.M., and Rubin, D.B., "Maximum likelihood from incomplete data via the EM
algorithm," Journal of the Royal Statistical Society, Ser. B, 39, pp. 1-38, 1977.
El-Shaarawi, A.H., "Inferences about the mean from censored water quality data," Water Resources
Research, 25, pp. 685-690, 1989.
Gilbert, R.O., "Statistical Methods for Environmental Pollution Monitoring," Van Nostrand Reinhold,
New York, 1987.
Gilliom, R.J., and Helsel, D.R., "Estimation of distributional parameters for censored trace level water
quality data. 1. Estimation Techniques," Water Resources Research, 22, pp. 135-146, 1986.
Gleit, A., "Estimation for small normal data sets with detection limits," Environmental Science and
Technology, 19, pp. 1206-1213, 1985.
Haas, C.H., and Scheff, P.A., "Estimation of averages in truncated samples," Environmental Science and
Technology, 24, pp. 912-919, 1990.
Hampel, F.R., "The Influence Curve and Its Role in Robust Estimation," Journal of American Statistical
Association, 69, pp. 383-393, 1974.
Hashimoto, L.K., and Trussell, R.R., "Evaluating water quality data near the detection limit." Presented at
the Advanced Technology Conference, Water Works Assoc., Las Vegas, NV, June 5-9, 1983.
Helsel, D.R., "Less than obvious: statistical treatment of data below the detection limit," Environmental
Science and Technology, Vol. 24, No. 12, pp. 1767-1774, 1990.
Hoaglin, D.C., Mosteller, F., and Tukey, J.W., "Understanding Robust and Exploratory Data Analysis,"
John Wiley, New York, 1983.
Huber, P.J., "Robust Statistics," John Wiley, New York, 1981.
Johnson, R.A., and Wichern, D.W., "Applied multivariate statistical analysis," Prentice Hall, 1988.
Lechner, J.A., "Estimators for Type-II Censored (Log) Normal Samples," IEEE Transactions on
Reliability, Vol. 40, No. 5, pp. 547-552, 1991.
Newman, M.C., Dixon, P.M., and Pinder, J.E., "Estimating mean and variance for environmental samples
with below detection limit observations," Water Resources Bulletin, Vol. 25, No. 4, pp. 905-916,
1990.
Persson, T., and Rootzen, H., "Simple and highly efficient estimators for a Type I censored normal
sample," Biometrika, 64, pp. 123-128, 1977.
Practical Methods for Data Analysis, Guidance for Data Quality Assessment, EPA QA/G-9, QA96
Version, 1996.
Saw, J.G., "The bias for the maximum likelihood estimates of location and scale parameters given a Type
II censored normal sample," Biometrika, 48, pp. 448-451, 1961b.
Schneider, H., "Truncated and censored samples from normal populations," Vol. 70, Marcel Dekker Inc.,
New York, 1986.
"Scout: A Data Analysis Program" (revised March 1999), Technology Support Bulletin, U.S. EPA, ORD,
NERL, ESD-LV, Las Vegas, NV 89193-3478. CENSOR was developed to be a module that will be
incorporated into the next version of Scout.
Shumway, R.H., Azari, A.S., and Johnson, P., "Estimating mean concentrations under transformation for
environmental data with detection limits," Technometrics, Vol. 31, No. 3, pp. 347-356, 1989.
Singh, A., "Omnibus robust procedures for assessment of multivariate normality and detection of
multivariate outliers," Multivariate Environmental Statistics, Patil, G.P. and Rao, C.R., Editors, pp.
445-488, Elsevier Science Publishers, 1993.
Singh, A. and Nocerino, J.M., "Robust procedures for the Identification of Multiple Outliers," Handbook
of Environmental Chemistry, Statistical Methods, Vol. 2, Part G, 229-277, Springer, Germany, 1995.
Singh, A.K., Singh, A., and Engelhardt, M. (1997). The Lognormal Distribution in Environmental
Applications. Technology Support Center Issue, 182CMB97. EPA/600/R-97/006.
Singh, A.K., Singh, A., and Engelhardt, M. (2000). Some Practical Aspects of Sample Size and Power
Computations for Estimating the Mean of Positively Skewed Distributions in Environmental
Applications. EPA/600/S-99/006.
Staudte, R.G. and Sheather, S.J., "Robust Estimation and Testing," John Wiley, 1990.
Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Addendum to Interim Final
Guidance, Office of Solid Waste, Waste Management Division. U.S. EPA, 1992.