EPA
U.S. Environmental Protection Agency

ProUCL Version 3.0 Statistical Software to Compute Upper Confidence Limits of the Unknown Population Mean

Introduction

The EPA Technical Support Center (TSC) in Las Vegas has developed ProUCL Version 3.0 software to support risk assessment and cleanup decisions at contaminated sites. Risk assessments, exposure evaluations, and cleanup decisions are often based upon the mean concentrations of the contaminants of potential concern (COPCs). The true population mean concentrations of the COPCs at a contaminated site are usually unknown, and are frequently estimated by the respective sample means computed from the data collected at the site under investigation. To address the uncertainty associated with these estimates, appropriate 95% upper confidence limits (UCLs) of the respective unknown means are used in many environmental applications. The computation of an appropriate 95% UCL of practical merit depends upon the data distribution and the skewness of the data set under study. ProUCL can be used to compute an appropriate UCL of the unknown population mean using a discernible probability distribution (e.g., normal, lognormal, gamma) and/or a suitable non-parametric, distribution-free method.

The Need for UCL Computational Software

A 95% UCL of the unknown population arithmetic mean, μ1, of a COPC is often used in environmental applications to:
• Estimate the exposure point concentration (EPC) term,
• Support risk assessment applications,
• Determine the attainment of cleanup standards,
• Estimate background level mean contaminant concentrations, or
• Compare the soil mean concentrations with site-specific soil screening levels.

In December 2002, the EPA revised the Guidance Document to Calculate the Upper Confidence Limits for Exposure Point Concentrations at Hazardous Waste Sites (OSWER 9285.6-10).
ProUCL Version 3.0 includes all of the parametric and non-parametric UCL computation methods described in this revised EPA UCL guidance document. ProUCL also computes UCLs of the unknown population mean based upon the positively skewed gamma distribution, which is often better suited to modeling environmental data sets than the lognormal distribution. For positively skewed data sets, the default use of a lognormal distribution often results in impractically large UCLs, especially when the data sets are small. This is illustrated in the example included in this fact sheet. To obtain accurate and stable UCLs of practical merit, other distributions such as the gamma distribution should be used to model positively skewed data sets. ProUCL Version 3.0 has procedures to perform the gamma goodness-of-fit test and to compute UCLs of the population mean based upon gamma distributed data sets.

ProUCL Version 3.0 Capabilities

Performs Goodness-of-Fit Tests to Assess Normality/Lognormality of a Data Set Using:
• Informal graphical quantile-quantile (Q-Q) plot (normal probability plot) and histogram.
• Shapiro-Wilk test: to be used when the sample size is less than or equal to 50.
• Lilliefors test: to be used when the sample size is large (e.g., greater than 50).

179CMB04.FS 4/20/2004
ProUCL Version 3.0 Statistical Software to Compute Upper Confidence Limits of the Unknown Population Mean

Performs Goodness-of-Fit Test for Gamma Distribution Using:
• Informal graphical quantile-quantile (Q-Q) plot (gamma probability plot) and histogram.
• Kolmogorov-Smirnov test: available for sample sizes in the range 4-2500 (critical values computed using Monte Carlo simulations) and values of the estimated shape parameter, k, in the interval [0.01, 100.0].
• Anderson-Darling test: available for sample sizes in the range 4-2500 (critical values computed using Monte Carlo simulations) and values of the estimated shape parameter, k, in the interval [0.01, 100.0].

Computes the Estimates of Relevant Population Parameters:
• Computes all relevant descriptive summary statistics for raw and log-transformed data.
• Computes the maximum likelihood (ML) and minimum variance unbiased (MVU) estimates of the various population parameters such as the mean, standard deviation, quantiles, coefficient of variation, and skewness, and also the MLEs of the shape parameter, k, and scale parameter, θ, of a gamma distribution.

Computes Five Parametric UCLs:
ProUCL computes a (1-α)100% UCL of the unknown population mean, μ1, using five (5) parametric methods for normal, lognormal, and gamma distributions. Except for a couple of methods, the UCL is available for all values of the confidence coefficient, (1-α), in the interval [0.5, 1.0], including 0.95. The five parametric UCL computation methods incorporated in ProUCL are:
1. Student's-t UCL: to be used for normally (or at least approximately normally) distributed data sets. Student's-t UCL is available for all confidence coefficients, (1-α), in the interval [0.5, 1.0].
2. Approximate Gamma UCL: to be used for gamma distributed data sets, typically when k hat (the ML estimate of the shape parameter, k) is greater than or equal to 0.5. Approximate gamma UCL is available for all confidence coefficients, (1-α), in the interval [0.5, 1.0].
3. Adjusted Gamma UCL: to be used for gamma distributed data sets when k hat is greater than 0.1 and less than 0.5. Adjusted gamma UCL is available only for three confidence coefficients: 0.90, 0.95, and 0.99.
4. H-UCL based upon Land's H-statistic: to be used for lognormally distributed data sets. In ProUCL, H-UCL is available only for two confidence coefficients: 0.90 and 0.95. ProUCL can compute H-UCL for samples of size up to 2001. Caution: For highly skewed data sets, H-UCL should be avoided, as the H-statistic often results in unrealistically large, impractical, and unusable H-UCL values. ProUCL provides warning messages and recommends the use of alternative UCLs for such highly skewed lognormally distributed data sets.
5. Chebyshev (MVUE) UCL: to be used for lognormally distributed data sets. This UCL computation method uses the MVU estimates of the standard deviation of the mean and of other parameters of a lognormal distribution. Chebyshev (MVUE) UCL is available for all confidence coefficients, (1-α), in the interval [0.5, 1.0].

Computes Ten Non-parametric UCLs Based Upon Bootstrap Procedures and Chebyshev Inequality:
These UCLs can be computed for all confidence coefficients, (1-α), in the interval [0.5, 1.0].
1. Central Limit Theorem (CLT) based UCL: to be used when the sample size is large.
2. Adjusted-CLT (adjusted for skewness) UCL: to be used for skewed data sets of large sizes.
3. Modified-t statistic (adjusted for skewness) based UCL: may be used for mildly skewed data.
4. Chebyshev (Mean, Sd) UCL: based upon the sample mean and standard deviation, Sd.
5. Jackknife UCL for mean (same as Student's-t UCL).
6. Standard Bootstrap UCL.
7. Bootstrap-t UCL.
8. Hall's Bootstrap UCL.
9. Percentile Bootstrap UCL.
10. Bias-corrected accelerated (BCA) Bootstrap UCL.

As mentioned before, for most of the UCL computation methods, ProUCL can compute the UCLs for all confidence coefficients, (1-α), in the interval [0.5, 1.0].
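
As a rough illustration of three of the simpler non-parametric methods in the list above, the following Python sketch computes a CLT-based UCL, a Chebyshev (Mean, Sd) UCL, and a percentile bootstrap UCL. It is not ProUCL's implementation. The function names, the sample values, the number of bootstrap resamples, and the seed are assumptions made for illustration, and the formulas are the textbook forms: mean + z·s/√n, mean + √(1/α - 1)·s/√n, and the (1-α) quantile of the resampled means.

```python
import math
import random
import statistics
from statistics import NormalDist


def clt_ucl(data, conf=0.95):
    """CLT-based UCL: mean + z_conf * s / sqrt(n)."""
    n = len(data)
    se = statistics.stdev(data) / math.sqrt(n)
    return statistics.mean(data) + NormalDist().inv_cdf(conf) * se


def chebyshev_ucl(data, conf=0.95):
    """Chebyshev (Mean, Sd) UCL: mean + sqrt(1/alpha - 1) * s / sqrt(n)."""
    n = len(data)
    alpha = 1.0 - conf
    se = statistics.stdev(data) / math.sqrt(n)
    return statistics.mean(data) + math.sqrt(1.0 / alpha - 1.0) * se


def percentile_bootstrap_ucl(data, conf=0.95, n_boot=2000, seed=0):
    """Percentile bootstrap UCL: the conf-quantile of the resampled means."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(data)
    means = sorted(
        statistics.mean(rng.choices(data, k=n)) for _ in range(n_boot)
    )
    # Position of the conf-quantile among the sorted bootstrap means.
    idx = min(n_boot - 1, math.ceil(conf * n_boot) - 1)
    return means[idx]


# Hypothetical sample, used only to exercise the functions.
sample = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
print(round(clt_ucl(sample), 4))        # 6.5015
print(round(chebyshev_ucl(sample), 4))  # 8.9791
print(round(percentile_bootstrap_ucl(sample), 4))
```

The Chebyshev limit is the most conservative of the three because it makes no distributional assumption beyond a finite variance. The bootstrap value depends on the seed and the number of resamples, so no expected output is shown for it.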
Because most environmental applications (e.g., estimation of the EPC) use a 95% UCL of the mean, ProUCL makes recommendations for the most appropriate 95% UCL(s) that may be used to estimate the unknown population mean concentration. The basis and theoretical justification for these recommendations can be found in the references listed in this fact sheet.

Example:

An example illustrating the importance of gamma distribution based UCL computation methods is discussed next. Consider a simulated positively skewed data set of size 20:

0.0086284103, 16.18078972, 7.334853523, 6.12856E-005, 1.756500498, 1.394359005, 23.41857632, 7.516539628, 0.8594623274, 39.06134332, 17.97357103, 114.885481, 9.251610362, 39.44123801, 71.64271025, 6.271065467, 0.9742964478, 0.1558884758, 0.4817911951, 0.0065875373.

This data set was generated from a gamma distribution with shape parameter k = 0.25 and scale parameter θ = 50. This can be seen from the two enclosed figures, a histogram and the associated gamma probability plot generated by ProUCL. The data set does not follow a normal or a lognormal distribution. However, using the default lognormal distribution (as is sometimes done in practice for positively skewed environmental data sets), one gets an unrealistically large 95% H-UCL (= 498390.58), as given in the enclosed output generated by ProUCL. This in turn leads to the use of the maximum value (= 114.885) as an estimate of the EPC term, a practice often used in environmental applications. Because the data set does follow a gamma distribution, an appropriate and more accurate estimate of the EPC term (representing the average exposure over a long period of time over an exposure area) is obtained by using the 95% UCL (= 45.067) based upon the gamma distribution, as recommended in the enclosed output table generated by ProUCL Version 3.0.
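
The gamma-based UCLs quoted in this example can be checked with a few lines of arithmetic. The Python sketch below assumes the usual chi-square form of the gamma UCL, UCL = 2·n·k*·mean / χ², with the bias-corrected shape estimate k* = (n - 3)·k hat / n + 2/(3n). These formulas are an assumption and are not quoted from this fact sheet, although they reproduce the reported values. The ML estimate k hat and the two chi-square values are copied from the ProUCL output included in this fact sheet rather than computed, and the function name `gamma_ucl` is hypothetical.

```python
import statistics

# Simulated G(0.25, 50) data set from the example (n = 20).
data = [
    0.0086284103, 16.18078972, 7.334853523, 6.12856e-05, 1.756500498,
    1.394359005, 23.41857632, 7.516539628, 0.8594623274, 39.06134332,
    17.97357103, 114.885481, 9.251610362, 39.44123801, 71.64271025,
    6.271065467, 0.9742964478, 0.1558884758, 0.4817911951, 0.0065875373,
]

# Values copied from the ProUCL output, not computed here.
K_HAT = 0.3008281          # ML estimate of the shape parameter k
CHI2_APPROX = 4.9386585    # "Approx. Chi Square Value (.05)"
CHI2_ADJUSTED = 4.6000062  # "Adjusted Chi Square Value" (level 0.038)


def gamma_ucl(data, k_hat, chi2_value):
    """Gamma UCL of the mean, assumed form: 2 * n * k_star * mean / chi2."""
    n = len(data)
    mean = statistics.mean(data)
    # Bias-corrected shape estimate (assumed small-sample correction).
    k_star = (n - 3) * k_hat / n + 2.0 / (3.0 * n)
    nu_star = 2.0 * n * k_star  # chi-square degrees of freedom
    return nu_star * mean / chi2_value


approx_ucl = gamma_ucl(data, K_HAT, CHI2_APPROX)
adjusted_ucl = gamma_ucl(data, K_HAT, CHI2_ADJUSTED)
print(f"mean                  = {statistics.mean(data):.3f}")  # 17.931
print(f"approximate gamma UCL = {approx_ucl:.3f}")             # 41.976
print(f"adjusted gamma UCL    = {adjusted_ucl:.3f}")           # 45.067
```

Under these assumptions the sketch agrees with the Approximate Gamma UCL (41.976248) and the Adjusted Gamma UCL (45.066538) in the ProUCL output. Both are well below the sample maximum of 114.885 and far below the 95% H-UCL of 498390.58.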
Summary

ProUCL computes parametric UCLs based upon normal, lognormal, and gamma distributions. ProUCL also computes UCLs using several non-parametric methods. Because the computation of an appropriate UCL of the unknown population mean depends upon the data distribution, goodness-of-fit tests need to be performed to assess the data distribution before using one of the UCL computation methods available in ProUCL. Based upon an appropriate data distribution and the associated skewness, ProUCL provides recommendations about one or more 95% UCL computation methods that may be used to estimate the unknown mean concentration of a COPC.

[Figure: ProUCL screenshot, histogram of G(.25,50).]

[Figure: ProUCL screenshot, gamma Q-Q plot for G(.25,50). Horizontal axis: Theoretical Quantiles of Gamma Distribution. N = 20, Mean = 17.931, k hat = 0.301. Slope = 1.029, Intercept = 0.414, Correlation, R = 0.996. K-S Test Statistic = 0.103, Critical Value (0.05) = 0.210. Data are Gamma Distributed.]
[ProUCL Version 3.0 output: UCL Statistics for G(.25,50)]

Data File: D:\PData\Gamma-Test-File2.txt
Variable: G(.25,50)

Raw Statistics
  Number of Valid Samples                  20
  Number of Unique Samples                 20
  Minimum                                  6.13E-005
  Maximum                                  114.88548
  Mean                                     17.930768
  Median                                   6.8029595
  Standard Deviation                       29.341896
  Variance                                 860.94688
  Coefficient of Variation                 1.6363993
  Skewness                                 2.3948728

Gamma Statistics
  k hat                                    0.3008281
  k star (bias corrected)                  0.2890372
  Theta hat                                59.604702
  Theta star                               62.036194
  nu hat                                   12.033123
  nu star                                  11.561488
  Approx. Chi Square Value (.05)           4.9386585
  Adjusted Level of Significance           0.038
  Adjusted Chi Square Value                4.6000062

Log-transformed Statistics
  Minimum of log data                      -9.699966
  Maximum of log data                      4.7439358
  Mean of log data                         0.5953246
  Standard Deviation of log data           3.6226212
  Variance of log data                     13.123384

Normal Distribution Test
  Shapiro-Wilk Test Statistic              0.6627653
  Shapiro-Wilk 5% Critical Value           0.905
  Data not normal at 5% significance level

95% UCL (Assuming Normal Distribution)
  Student's-t UCL                          29.275688

Gamma Distribution Test
  A-D Test Statistic                       0.1492209
  A-D 5% Critical Value                    0.8445354
  K-S Test Statistic                       0.1026697
  K-S 5% Critical Value                    0.2101034
  Data follow gamma distribution at 5% significance level

95% UCLs (Assuming Gamma Distribution)
  Approximate Gamma UCL                    41.976248
  Adjusted Gamma UCL                       45.066538

Lognormal Distribution Test
  Shapiro-Wilk Test Statistic              0.8641123
  Shapiro-Wilk 5% Critical Value           0.905
  Data not lognormal at 5% significance level

95% UCLs (Assuming Lognormal Distribution)
  95% H-UCL                                498390.58
  95% Chebyshev (MVUE) UCL                 1438.355
  97.5% Chebyshev (MVUE) UCL               1932.73
  99% Chebyshev (MVUE) UCL                 2903.834

95% Non-parametric UCLs
  CLT UCL                                  28.72273
  Adj-CLT UCL (Adjusted for skewness)      32.476962
  Mod-t UCL (Adjusted for skewness)        29.861273
  Jackknife UCL                            29.275688
  Standard Bootstrap UCL                   28.161388
  Bootstrap-t UCL                          39.768558
  Hall's Bootstrap UCL                     71.935053
  Percentile Bootstrap UCL                 29.194241
  BCA Bootstrap UCL                        34.604005
  95% Chebyshev (Mean, Sd) UCL             46.529711
  97.5% Chebyshev (Mean, Sd) UCL           58.904496
  99% Chebyshev (Mean, Sd) UCL             83.212366

RECOMMENDATION
  Data follow gamma distribution (0.05)
  Use Adjusted Gamma UCL

Computer Requirements to Operate ProUCL

Installation of ProUCL Version 3.0 requires a microprocessor speed of at least 200 MHz, 12 MB of hard drive space, 48 MB of memory (RAM), and the Windows 98 (or newer) operating system. ProUCL is compatible with Windows NT-4, Windows 2000, Windows XP, and Windows ME.

Installation

ProUCL can be downloaded from the TSC website at www.epa.gov/nerlesd1/tsc/tsc.htm. The website contains download and usage instructions.

Find More Information About ProUCL

The TSC website at www.epa.gov/nerlesd1/tsc/tsc.htm provides additional information. EPA technical issue papers used in the development of ProUCL are also available at the TSC website.

For additional information, contact:
Gareth Pearson, TSC Director
E-mail: pearson.gareth@epa.gov
Phone: 702-798-2270

The website containing information on the 2002 EPA guidance for calculating the 95% UCLs is: www.epa.gov/superfund/programs/risk/ragsa/ucl.pdf.

References

EPA (2002), Calculating Upper Confidence Limits for Exposure Point Concentrations at Hazardous Waste Sites, OSWER 9285.6-10, December 2002.

Schulz, T. W., and Griffin, S. (1999), Estimating Risk Assessment Exposure Point Concentrations When Data are Not Normal or Lognormal. Risk Analysis, Vol. 19, No. 4, 1999.

Singh, A. K., Singh, A., and Engelhardt, M. (1997), "The Lognormal Distribution in Environmental Applications," EPA/600/R-97/006, December 1997.

Singh, A., Singh, A. K., and Iaci, R. J. (2002), "Estimation of the Exposure Point Concentration Term Using a Gamma Distribution," EPA/600/R-02/084.

Singh, A., Singh, A. K., Engelhardt, M., and Nocerino, J. M. (2003), "On the Computation of the Upper Confidence Limit of the Mean of Contaminant Data Distributions." Under EPA Review.

Singh, A., and Singh, A. K. (2003), Estimation of the Exposure Point Concentration Term (95% UCL) Using Bias-Corrected Accelerated (BCA) Bootstrap Method and Several Other Methods for Normal, Lognormal, and Gamma Distributions. Draft EPA Internal Report.

Notice

The U.S. Environmental Protection Agency, through its Office of Research and Development, funded and managed the research described here under Interagency Agreement DW47939416 to the Government Services Administration (GSA). It has been subject to the Agency's peer and administrative review and has been approved for publication as an EPA document.