25th ANNUAL NATIONAL CONFERENCE ON
MANAGING ENVIRONMENTAL QUALITY
SYSTEMS
APRIL 24-27, 2006
Marriott Renaissance, Austin, Texas
Technical Papers
Statistical Issues for Health and the Environment
•	R. Sastry, Statistical Methods to Analyze Occupational Safety Data of DOE
Facilities - 10:30 AM
•	H. Kahn, Statistical Issues in the Analysis of the Carcinogenic Risk of Ethylene
Oxide - 11:00 AM
•	M.Nash, Partial Least Squares Regression for Small Sample - 11:30 AM

-------
TECHNICAL SESSION:
Statistical Issues for Health and the Environment
Statistical Methods to analyze the Occupational Safety and Health Data
of the
Department of Energy facilities
M. Rama Sastry, PhD
Office of Quality Assurance Programs
Office of Environment, Safety and Health
U.S. Department of Energy
Washington, D.C.20585
(Paper prepared for presentation at the EPA Quality Management Conference, Austin, Texas,
April 24-28, 2006. The views expressed in this paper are personal and do not represent the
Department of Energy's position.)
1. INTRODUCTION
The Department of Energy (DOE) operates many nuclear and non-nuclear facilities, and
National Research laboratories located throughout the United States. Approximately
130,000 employees of various contractors work at these facilities. The DOE is
responsible to protect health and safety of the employees and conduct work in an
environmentally safe manner. The DOE complies with the Environmental Protection
Agency (EPA) regulations for environmental management and the Occupational Safety
and Health Administration (OSHA) regulations for worker protection. The DOE
contractors record and report incidents and accidents related to occupational injuries and
illnesses in accordance with 29 CFR regulations. Such data are collected and maintained
by a centralized data base called "Computerized Accident/Incident Reporting System
(CAIRS)", which is the main source of information for the statistical methods shown by
this paper.
The following statistical methods were considered to analyze the occupational safety and
health data:
1.	Exploratory Data Analysis (Box Plots)
2.	Data Visualization, Data Images or Color Histograms
3.	Clustering Analysis (Hierarchical clustering with Complete linkage)
4.	Trend Analysis (Exponential Smoothing, Kalman Filtering)
5.	Advanced Methods (Discriminanat Analysis)
Statistical Issues for Health and the Environment
1

-------
The above methods are some of the possible techniques useful for the analysis and do not
represent a comprehensive or unique list. The Office of Environment, Safety and Health
uses a wide variety of statistical method to conduct analysis of environmental data. For
example, see the publications by Richard Gilbert and others at the Pacific Northwest
National Laboratory (PNNL). The CAIRS data are published on a quarterly basis for all
DOE sites and facilities by contractors, by Field offices/Operations Offices, and by the
DOE Program Offices. The CAIRS data are validated periodically for data quality and
accuracy by reviewing the OSHA 200/300 logs maintained by the contractors. Historical
data are available beginning 1980's, however more recent data were considered by the
analysis to avoid the many changes occurred in organizations and the DOE mission. For
example, after the end of Cold War, the mission of the agency shifted from production to
environmental remediation and waste management, and recently additional emphasis
being placed in conducting basic research in science and technology at the National
Laboratories.
The Department of Labor, Bureau of Labor Statistics (BLS) compiles occupational safety
data for private industry within the United States and that data was used by DOE to
compare safety performance of DOE contractors. Since the inception of OSHA in 1971,
the safety performance of private industries has improved and the same pattern occurred
in DOE. However, in general the recordable injury rates at DOE sites are usually lower
than private industry. The DOE has also adopted the OSHA's Voluntary protection
Program (VPP) to promote safety and health excellence through cooperative efforts
between labor and management. During 1994-2005, approximately 25 sites were
recognized by the DOE VPP as STAR sites, and several other sites are in the process of
obtaining such status. The impact of the VPP and the Value Added by the program are
described in the DOE reports cited in the References (Section 4.) of this paper.
2. STATISTICAL METHODS
Two measures of occupational safety performance considered for the analysis area are as
follows:
(a)	Total Recordable Case Rate (OSHA Recordable injury/illness Case rates), and
(b)	Days Away form work, Restricted, and Transfer Case Rate (DART Case rate)
formerly known as the Lost Work Day Case Rate., as defined by the Bureau of Labor
Statistics
For the sake of illustration, annual data for the years 1996-2005 related to TRC Rate and
DART Rates at major DOE Program Offices were retrieved from CAIRS. The Program
Offices selected for this analysis are:
•	Energy Efficiency and Renewable Energy (EE)
•	Environmental Management (EM)
•	Fossil Energy (FE)
•	Nuclear Energy (NE)
Statistical Issues for Health and the Environment
2

-------
•	National Nuclear Security Administration (NNSA)
•	Fossil Energy (FE)
•	Science (SC)
Figure 1. below shows the TRC rates during the past ten years (1996-2005) for the seven
DOE Program Offices. Three Program Offices, NNSA, EM, and SC employ almost 75%
of the contractor work force in DOE, and the occupational hazards of the operations at
the sites such as Pantex, Los Alamos, Hanford, Rocky Flats, and the Science Laboratories
may be higher than the FE or EE facilities. However, Figure 1. indicates that their injury
illness rates are not necessarily higher. For example, the TRC rates at EM facilities are
lower than the FE rates in most of the years during 1996-2005.
TRC Rates at the DOE Program Offices (1996-2005)

EE
EM
FE
NE
NNSA
RW
SC
1996
1.7
3.3
4
3.6
4.5
2.5
3.7
1997
1.2
2.9
3
4.3
4.5
2.1
3.5
1998
2.5
2.5
2.7
4.7
3.8
1.9
3.8
1999
1.3
2.3
2.3
4
3
1.8
3.1
2000
1.2
2
2.5
2.5
3
1.7
3
2001
2.2
1.9
2
2.1
2.8
1.7
2.9
2002
1.4
1.7
1.5
1.8
2.6
1
2.5
2003
1.6
1.3
1.4
1.1
2.4
0.9
1.8
2004
1.1
1.4
1.6
1.1
2.1
0.6
1.5
2005
0.5
1.4
1.1
1.3
2.1
0.2
1.4
Statistical Issues for Health and the Environment
3

-------
TRC Rates of DOE Program Offices
5
4.5
4
3.5
3
2.5
2
1.5
1
0.5
0
1
2
3 4
5 6
7 8 9 10
1996-2005
—EE
-¦—EM
FE
NE
-3*— NNSA
—RW
-i—SC
Further analysis of the data was conducted by using Box-Whisker plots (See John
Tukey). Figure 2. Indicates that the variability of the TRC data at EM facilities to be
lower than the variability of the data at FE facilities during the same period. The same
chart suggests that the variability at EE to be the smallest and NE to be the highest. Also
from Figure 2. we observe that the median of the TRC rates at SC to be the largest and
the median of the EE sites to be the smallest among the seven Program offices
considered by this analysis.
In addition to the mandatory safety programs such as Integrated Safety Management
(ISM), many EM sites have adopted the Voluntary Protection Program (VPP) to improve
safety performance. For example, Fernald, Hanford, West Valley, WIPP, etc, are VPP
STAR sites. The safety performance at any DOE site or facility should not be judged on
the basis of one indicator such as TRC or DART. Further analysis is necessary to
understand the differences in the operational risks and the safety performance.
The next statistical method used in this paper is related to VPP data, in particular to
conduct cluster analysis using TRC rates of VPPsites and Non-VPP sites in DOE. For
more details of this method, see Sastry and Schwender (2005), and for a theoretical
description of the methods see Trevor Hastie et al (2001), and W.N. Venables and B.D.
Ripley (2002). The computer software used was S-Plus and R originally developed by
Bell laboratories. The methodology in particular that was used is called "hierarchical
clustering with complete linkage ". The primary objective of this method is to generate
clusters of data and identify similar patterns. Figure 3. (Dendogram) indicates that most
of the VPP sites (labeled in green) clustered into one group and most of the Non-VPP
sites (labeled in red) into another group. Only one or two sites or facilities were clustered
into a wrong group or miss specified.
Statistical Issues for Health and the Environment
4

-------
3. ADVANCED METHODS
In addition to the Clustering methods, other sorting procedures such as the Principle
Component Analysis and Single value decomposition, or the classification /
Classification and Regression Trees (CART) may be applied to perform the necessary
analysis. Also Discriminant Analysis (Linear or Quadratic) is useful to classify the safety
performance of VPP sites and Non-VPP sites. In addition, the distance between the VPP
sites and Non-VPP sites can be estimated on the basis of Mahalonobis D-Square statistic.
Quadratic Discriminant Analysis:
¦	Suppose the distribution for class C is multivariate normal with
mean |ic and covariance, then the Bayes Rule minimizes a
quadratic function:
Qc=D2+ log|^c |-21og;rc
where nc is the prior probability of class C
¦	D = Mahalanobis Distance
¦	D2 = (x; - x)'s(xj - x), where x is the sample
mean, and 5 is the variance
4. CONCLUSIONS
Classical statistical methods supplemented by Data Mining, Visualization and Graphics
can enhance the analysis capability. The results of the analysis should be useful for
management decision making and for continuous improvement.
5. REFERENCES
1.	National Research Council, " Beyond Productivity , Information Technology,
Innovation and Creativity", National Academy Press, Washington, DC
2.	William S. Cleveland, Visualizing Data, A T&T Bell Laboratories, Murray Hill,
N.J. , 1993
3.	Trevor Hastie, Robert Tibshirani, and Jerome Friedman, The elements of
Statistical Learning: Data Mining, Inference, and Prediction , Springer
Publications, 2001
Statistical Issues for Health and the Environment
5

-------
4.	W.N. Venables and B.D. Ripley Modern Applied Statistics with S, 4th Edition,
Springer Publications, 2002
5.	Rama Sastry and Holger Schwender, Statistical Analysis of Occupational data of
VPP and Non-VPP sites, DOE-EH/0696, April, 2005
6.	Rama Sastry , Rex Bowser and David Smith, The Value Added of the DOE VPP ,
2004 Update, DOE/EH-0690, December 2004
7.	Rama Sastry and Carlos Coffman, Safety Performance of Security Forces at the
DOE facilities, DOE/EH-0705, December 2005
8.	John Tukey, Exploratory Data Analysis, Addison-Wesley Publishing Co, 1977
9.	Leo Breiman, et al, Classification and Regression Trees, Wadsworth Publishing
Co, 1984
10.	R.Gilbert, J.E. Wilson and B.A. Pulsipher and others , "Visual Sample Plan "
(various guides and research reports), DOE's Pacific Northwest National
Laboratory, Richland, WA
Statistical Issues for Health and the Environment
6

-------
Statistical Issues in the Analysis of the Carcinogenic Risk of Ethylene Oxide
Henry D. Kahn and Jennifer Jinot
National Center for Environmental Assessment
Office of Research and Development
Ethylene oxide (EtO) is a gas at room temperature that is manufactured from ethylene
and used primarily as an intermediate in the manufacture of ethylene glycol. It is also
used as a sterilizing agent for medical equipment and as a fumigating agent for spices.
Human exposure to EtO occurs in manufacturing plants and in hospitals and other
facilities where medical equipment is sterilized. EtO can also be inhaled by residents
living near production or sterilizing/fumigating facilities. In humans employed in EtO-
manufacturing facilities and in sterilizing facilities, the greatest evidence of a cancer risk
from exposure is for cancer of the lymphohematopoietic system. Increases in the risk of
lymphohematopoietic cancer have been seen in several studies, manifested as an increase
either in leukemia or in cancer of the lymphoid tissue. In one large epidemiologic study
of sterilizer workers that had a well-defined exposure assessment for individuals, positive
exposure-response trends for lymphohematopoietic cancer mortality in males and for
breast cancer mortality in females were reported (Steenland et al., 2004). The positive
exposure-response trend for female breast cancer was confirmed in an incidence study
based on the same worker cohort (Steenland et al., 2003). This presentation will focus
on the statistical analysis of human epidemiological data that may be used to estimate the
cancer inhalation risk due to exposure to ethylene oxide. Statistical modeling of the data
and the methodology for derivation of inhalation unit risk estimates for cancer mortality
and incidence will be discussed.
Statistical Issues for Health and the Environment
7

-------
Partial Least Squares (PLS) Regression for Small Sample with Collinear Predictors
in Landscape Ecology.
Maliha S. Nash * and Ricardo Lopez
US EPA, PO Box 93478, Las Vegas NV 89193-3478.
E-mail: nash. maliha(a),epa. gov
(Notice: Although this work was reviewed by EPA and approved for publication, it may not
necessarily reflect official Agency policy.)
1.	Introduction
Investigation of associations among constituents of surface water and landscapes
involves statistical analyses of fundamentally different data sets. Data on surface water
conditions are generally obtained through field sampling programs and field/analysis
programs are expensive and labor intensive; consequently, the total number of sample
sites is usually small. The data set may contain missing values due to the realities of
sampling or cost. Landscape data, however, is derived from remote sensing platforms,
thereby permitting wall-to-wall coverage. The landscape data sets may contain a very
large number of variables, although many of these are not wholly independent (i.e., they
may be collinear). Single- and multiple-regression analysis has frequently been used to
relate water nutrient concentrations to selected landscape variables are sensitive to
missing values and dependence of predictors (landscape variables). Reliable statistically
significant results generally cannot be obtained unless the total number of samples greatly
exceeds the number of variables. Partial least squares (PLS) analysis offers a number of
advantages over the more traditionally used regression analyses. It has been found to be
useful both for providing accurate predictions and for interpreting relationships between
data sets containing a high degree of collinearity (see references in Nash et al., 2005).
Additionally, the prediction error in PLS is smaller than in other multivariate methods
(for references and more see Nash et al., 2005).
2.	Data Description
The study area is the Upper White River study area is in the Ozarks of Missouri and
Arkansas, where 244 water-quality sampling locations were sampled and used as 'pour-
points. For each of the 244 sites, the watershed support area was delineated and a suite of
landscape variables was calculated. It is important to understand that some of the 244
subwatersheds are nested completely within other larger subwatersheds. Total of 46
landscape metrics were generated per each watershed. Measured total phosphorous (TP),
total ammonia (TAM), and E. coli were only existed in 18, 6, and 15 sites, respectively.
For the purpose of this paper, we used non nested (0 level) sub-watersheds representing
first order streams, hence eliminating nested watershed. Sample size, therefore, was 5, 6,
and 5 for TP, TAM and E. coli, respectively that used for building the PLS models.
Landscape metrics were for year 2000 and surface water constituents were averaged over
a period of 1997-2002. Prediction of the surface water constituents for the remaining
from the 244 sites were made using the PLS models above.
Statistical Issues for Health and the Environment
8

-------
3.	STATISTICAL METHODOLOGY
PLS is a multivariate analysis technique permits analysis and prediction for data sets with
missing values, with collinearity and with a relatively small number of observations (see
references in Nash et al., 2005). In the PLS analyses, both response and predictor data
sets (e.g. water and landscape variables) are first centered and scaled. A linear
combination is composed on the independent variables (T = L0 W; T is the score and W is
weight) forming a number of orthogonal latent variables [T] that are less in number
(dimensions) than that of the original landscape variables. The linear combination in [T]
is formed so that the covariance between [T] and the linear composition of the dependent
variables are maximized (T& U; U = B0 V; U is the score and V is weight). Prediction of
both water and landscape data will be via regression on the common latent variables (T).
Modeling and prediction in PLS, therefore, is not solely based on the conditional
distribution of the predictors (water variables) in the presence of independent variables
(landscape variables), instead it accounts for both landscape and water together through
[T] (see references in Nash et al., 2005).
PLS produces n-1 factors, with each factor containing a pair of scores (T,, U,).
Linear combinations on each data set are called factors. PLS extracts the second factor
using the residuals from the first and finds the linear combinations of both data sets such
that their covariance is maximized. This process is repeated by taking residuals from the
previous factor, producing n-1 factors, where n is the number of observations. Not all of
these factors are significant using the Cross Validation (CV) method; only the significant
factors are used in the final model. When applying CV, one data point is held out and the
fitted models are tested using the rest of data set and the predicted values are compared
with that of observed using PRESS (Predictive Residual Sum of Square) to assess the
predictive ability of the model. SAS gives the root means PRESS and its significant level
(the lower the value, the better the model is).
After defining the significant PLS factors; scores, weights and VIP (Variable
Influence on Projection) are used to examine the strength of the relationship,
irregularities and the contribution of the independent variable (landscape) in the model.
If VIP for an independent variable is small in value, it implies that variable has a
relatively small contribution to the model and may be deleted from the model. It was
indicated VIP values of less than 0.8 are considered to be small. The quality of the model
was determined by examining the residuals for both the response and the landscape
variables. An examination of any possible outliers using residuals was carried out to
finalize the fitted PLS model. SAS was used for statistical analyses.
4.	RESULTS
TP PLS model resulted in one significant factor explaining 83% of the variability in the
TP (see Table). Barren soil had the most significant effect based on the whole
watershed and in the riparian zone immediate to the stream. While the stream
density relates inversely with TP, percent barren enhances TP in surface water
especially in areas adjacent to the stream. The forest- and urban- related variables
contribute equally with opposite effect on the TP. Urban enhanced TP whereas
forest, especially within the proximity (RforO) of the sampling site, depressed the
level of TP.
Statistical Issues for Health and the Environment
9

-------
TAM PLS model resulted in one significant factor explaining 93% of the variability in
the TAM (see Table). Riparian and natural within all distances have negative
effect on TAM, whereas urban has a positive effect. Urban within the riparian
zone enhanced the level of TAM but beyond the proximity of stream (i.e. distance
of 30 m and more).
E. Coli PLS model resulted in two significant factors explaining 99.7% of the variability
in the E.coli population (see Table). Elevation has the highest effect on the E.
coli, the flatter the soil surface the higher abundance of the E. coli in surface
water. Urban- related variables enhanced the level of the E. coli especially in
riparian within area of the sampling site.
The prediction of the constituents in the 244 watersheds (from a small filed-based data
sample) was used to visualize the joint behavior of the predicted TP, TAM, and E. coli in
surface water of the Upper White River (Figure 1. Using PLS we determined four
distinct surface water conditions among subwatersheds in the Ozarks:
(1) subwatersheds with high concentrations of TAM, high concentrations of TP, and high
cell counts of E. coli; (2) subwatersheds with high concentrations of TAM, low
concentrations of TP, and high cell counts of E. coli; (3) subwatersheds with low
concentrations of TAM, low concentrations of TP, and high cell counts of E. coli; and (4)
subwatersheds with moderate concentrations of TAM, TP, and cell counts of E. coli.
5. Discussion and Conclusion
The results indicate PLS may prove to be a valuable statistical analysis tool for ecological
studies. The PLS methodology is less sensitive to the limitations than other statistical
methods. The joint behavior of TP and TAM as related with E. coli (Figure 1) was not
possible using the measurements from the study area sites (5, 6, and 5 sites for TP, TAM,
and E. coli, respectively) but it was overcome by prediction form the PLS model for the
244 sites. Hence, further analyses and comparisons within and between the above 4
groups may reveal the spatial characteristics setting for watersheds and their effect on
surface water quality.
The model results may help landscape ecologists produce indicators of surface
water condition, such that unique combinations of these indicators can be used to infer
the potential cause(s) and origin(s) of non-point pollution, which may lead to
eutrophication in aquatic ecosystems, the loss of aquatic ecosystem function, and the
injury of humans that consume from (or recreate in) the aquatic resources of the Ozarks.
Sensitivity analyses for the above model and the PLS results discussed in this
presentation are actively being used to prioritized subwatersheds in the Ozarks for
watershed management activities.
Reference
Note: The authors would like to thank Ms. Deborah Chaloud, EPA/LEB for valuable
input. The U.S. Environmental Protection Agency (EPA), through its Office of Research
and Development (ORD), funded and performed the research described here. Although
this work was reviewed by EPA and approved for publication, it may not necessarily
reflect official Agency policy. Mention of trade names or commercial products does not
constitute endorsement or recommendation for use.
Statistical Issues for Health and the Environment
10

-------
Table 1. Coefficients of the non-centered value of landscape metrics to predict the ln(TP), TAM,
and In (A", coli). Number of significant PLS factors and percent variation explained by
PLS for the responses are in the last two rows.
TP	TAM
E. coli
Landscape Metrics Coefficient

VIP
Coefficient
VIP
Coefficient
VIP
Intercept
-2.2892


0.07867

-2.20795

Fdensity
-0.00194
0
.12864




Fedge210
-0.00195
1
.03754




F mdcp
0.00079
0
.97022




F plgp





-0.00794
1.015
Pfor
-0.00118
1
.01968


0.00736
0.985
RforO
-0.00123
1
.12558


-0.00023
0.976
Rfor30
-0.00118
1
.10126


-0.00065
0.979
Rforl20
-0.00117
1
.10279


0.00172
0.977
RnatO
-0.00123
1,
.12558
-0.00027
0.91965
-0.000232
0.976
Rnat30
-0.00118
1,
.10126
-0.00025
0.92898
-0.00065
0.979
Rnatl20
-0.00117
1,
.10279
-0.00023
0.94123
0.00172
0.977
Purb
0.00102
1,
.06728
0.00031
1.06739
0.00557
1.008
RurbO
0.00161
1
.13694


0.00929
1.014
Rurb30
0.00153
1
.1.133
0.00042
1.07047
0.00868
1.014
Rurbl20
0.00141
1,
.16164
0.00036
1.06935
0.00743
1.006
RhumO
0.00123
1
.12558


0.00023
0.976
Rhum30
0.00118
1
.10126


0.00065
0.979
Rhuml20
0.00117
1
.10279


-0.00172
0.977
Pctia rd
0.00197
1
.04816


0.01373
1.021
Rddens
0.01074
1
.03470


0.09992
1.051
Pmbar
0.3312
1
.21778




RmbarO
0.08693
0
.50283




Rmbar30
0.15629
0
.50283


0.11702
0.757
Rmbarl20
0.38269
1
.24011




Strmdens
-0.04314
1
.13699


-0.91794
1.018
Elevmin





0.00024
1.273
Number of Factors	1	12
% Variation
83
93
99.7

-------
Figure 1 Three-dimensional plot of predicted TAM (x-axis), TP (y-axis), and E. coli cell counts
(z-axis) among 244 subwatersheds in the Upper White River region of the Ozarks.

-------