EPA/600/A-96/090
COMBINING ENVIRONMENTAL INFORMATION
Lawrence H. Cox1, Senior Mathematical Statistician
U.S. Environmental Protection Agency
1. Introduction
The problem of combining environmental information arises in
all aspects of environmental science. Data combination can be
performed to improve accuracy or precision of estimates, to
investigate relationships between data sets collected in different
places, times, ways, etc., or to validate findings. The scope of
environmental issues and the consequences of attendant remediation
decisions involve the combination and cross-comparison of
information from multiple sources and types. Consequently, proper
data combination can be a challenging problem statistically and
practically.
Data combination is an important problem in environmental
science, comprising the following features. Environmental data
collection is a difficult and expensive enterprise, necessitating
reuse of available data, often for purposes unintended at the time
the original data were collected. Data necessary to address even
a single environmental issue often need to be compiled from several
sources involving different variables, measurements, time and
spatial scales, accuracy, precision and completeness. Design
criteria for data collection may be incomplete or unavailable, and
selection bias is often present but difficult to quantify. Data
validation can be difficult and complex, involving piecemeal cross-
comparisons among several data sources. Environmental data sets
are often large and consequently difficult and expensive to
manipulate and analyze. As demonstrated in this chapter, these
problems pose important challenges to statistical science, and
statistical methods are central to their solution.
An objective of the NISS-USEPA cooperative agreement was to
explore statistical methods for combining environmental
information, leading to improved methods and widespread use of
proven methods. Aspects of this general topic were examined at a
1993 NISS-USEPA workshop using applications drawn from
environmental assessment, monitoring, and epidemiology. Selected
applications and findings of the workshop and subsequent
cooperative research are presented below. Complete details of the
workshop are reported in Cox and Piegorsch (1994).
1 The information in this article has been funded wholly or
in part by the United States Environmental Protection Agency. It
has been subjected to Agency review and approved for publication.
Mention of trade names or commercial products does not constitute
endorsement or recommendation for use.
2. Combining Environmental Assessment Information
2.1 A Benthic Index for the Chesapeake Bay
USEPA and the State of Maryland are conducting an assessment
of the effectiveness of environmental restoration in the Chesapeake
Bay. Of concern is how nutrient abundance affects the plant and
animal communities (benthos) in Bay sediment. An ecological
nutrient index (benthic index) developed for the assessment is a
weighted sum of species quantifiers:
$$z = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 + w_5 x_5$$
where $x_1$ is the salinity-adjusted number of species, $x_2$ is the
percent of total benthic bivalve abundance, $x_3$ is the number of
amphipods (crustaceans with multi-purpose feet), $x_4$ is the average
weight per polychaete (a type of segmented marine worm), and $x_5$ is
the number of capitellids (a special form of polychaete) observed
at each location. Numerous samples were collected over four years
from 31 sample sites in the Bay. Site selection was based on
ecological, not statistical, criteria, requiring a model-based
approach. Expert-determined index values were assigned to each
site at multiple time points, and used to estimate the index
weights $w_i$ via regression.
A goal was to represent the benthic condition of the Bay as a
map of the z-index. From this, other assessments could be made,
e.g., estimating the percentage of Bay acreage exhibiting degraded
biotic conditions or identifying acreage in need of restoration.
For this, the index must be predicted at unobserved locations.
This was accomplished by modeling the index in two parts: using
regression based on location and depth to account for large-scale
variation, and then kriging the regression residuals to account for
small-scale variation.
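
The two-part model described above can be sketched as follows. This is a minimal illustration, not the procedure used in the Chesapeake Bay study: it assumes NumPy is available, an exponential covariance model with known sill and range, and simple kriging of zero-mean residuals. The function and parameter names (`regression_kriging`, `exp_cov`, `sill`, `corr_range`) are hypothetical.

```python
import numpy as np

def exp_cov(h, sill=1.0, corr_range=1.0):
    """Exponential covariance model (an assumed choice, not taken from the study)."""
    return sill * np.exp(-h / corr_range)

def regression_kriging(coords, depth, z, new_coords, new_depth,
                       sill=1.0, corr_range=1.0):
    """Predict the index at new sites: regression trend plus kriged residuals."""
    # Large-scale variation: ordinary least squares on location and depth.
    X = np.column_stack([np.ones(len(z)), coords, depth])
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    resid = z - X @ beta

    # Small-scale variation: simple kriging of the regression residuals.
    d_obs = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    d_new = np.linalg.norm(new_coords[:, None, :] - coords[None, :, :], axis=2)
    C = exp_cov(d_obs, sill, corr_range)      # covariance among observed sites
    c0 = exp_cov(d_new, sill, corr_range)     # covariance between new and observed sites
    weights = np.linalg.solve(C, c0.T).T      # one row of kriging weights per new site

    X_new = np.column_stack([np.ones(len(new_depth)), new_coords, new_depth])
    return X_new @ beta + weights @ resid
```

With no nugget term, the predictor interpolates exactly, so predicting at the observed sites returns the observed index values. In practice the covariance parameters would be estimated from the residuals rather than supplied.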
Another approach to problems involving multiple spatial scales
is to combine spatial models using change of support methods (Myers
1993). Carroll et al. (1995) do so for the case of snow water
equivalent (SWE) data collected from both terrestrial (point
support) and airborne (areal support) monitoring by direct
computation of the combined terrestrial-airborne data covariance
matrix. This approach can be generalized to other situations,
viz., if the two types of support are nonoverlapping.
The problem of combining large and small scale spatial data is
one of many outstanding problems in environmental data combination,
and arises frequently. Two approaches to this problem are
illustrated above. A worthwhile study would be to compare the
combined regression-kriging approach used in the Chesapeake Bay
study with the change of support methods developed to estimate SWE
using appropriate data such as the SWE data.
2.2 Hazardous Waste Site Characterization
Soil or water samples are collected at locations within a
hazardous waste site to determine whether environmental clean-up is
necessary, has been successful, or needs to be continued. This
determination is made based on estimates of remaining contamination
obtained by combining site measurements.
To accomplish this, Fisher's Method for combining p-values
from c independent studies (viz., compare
$X_F^2 = -2\sum_{i=1}^{c}\ln(p_i)$ to a $\chi^2$ reference
distribution with 2c df; see Hedges and Olkin 1985)
can be used to combine environmental data for site
characterization. Assume that soil samples containing
concentrations of c toxic chemicals are taken at each of k
locations within a hazardous waste site. For simplicity, assume
that the k location-specific sampling distributions are i.i.d. with
known mean and variance vectors, and that the c chemical
concentrations, $C_{ij}$, at each location j are independent. Then,
independent F-statistics $F_i$ with corresponding p-values $p_i$,
$i=1,\ldots,c$, can be used to test hypotheses of combined effective
clean-up across the sites for each chemical. Fisher's Method then
can be used as an omnibus test of combined effective clean-up
across all chemicals and sites by combining the $p_i$ into $X_F^2$.
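
Fisher's Method as described above can be sketched in a few lines. This is an illustrative sketch using only the standard library; it relies on the closed form of the chi-square upper tail for even degrees of freedom, and the function names are hypothetical.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable with even df (closed form)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k          # (x/2)^k / k!
        total += term
    return math.exp(-half) * total

def fisher_combine(p_values):
    """Return Fisher's statistic -2 * sum(ln p_i) and its combined p-value (2c df)."""
    if not p_values or not all(0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p-values must be non-empty and lie in (0, 1]")
    stat = -2.0 * sum(math.log(p) for p in p_values)
    return stat, chi2_sf_even_df(stat, 2 * len(p_values))
```

For example, `fisher_combine([0.05, 0.05])` combines two individually borderline p-values into a single omnibus p-value. The calculation is valid only when the p-values come from independent tests.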
If samples are not i.i.d. or have unknown variances, Fisher's
Method may be suboptimal or correct reference distributions may be
unavailable. An extended method is available, due to Mathew et al.
(1993). We illustrate this method for the case k=2.
Assume that the location-specific sampling distributions are
not i.i.d. with known mean and variance. Compute the two-sample
between-chemical correlation coefficient R and its p-value $p_R$
relative to an omnibus null hypothesis, $H_0: C \le C_0$, of effective
clean-up across both sites (viz., the actual concentration of each
chemical at each site is below the contamination threshold for that
chemical). Under $H_0$, $F_1$, $F_2$ and R are independent, and therefore
the test statistic $X_R^2 = -2(\ln(p_1) + \ln(p_2) + \ln(p_R))$ can be compared
to a reference $\chi^2$ distribution with 6 df. Based on Monte Carlo
simulations, this method appears to perform more powerfully than
the corresponding Fisher Method with 4 df (see Mathew et al. 1993
for details).
Often, there is correlation among chemical concentrations at
a sampling location, and it is inappropriate to combine p-values
directly. Standard solutions are not available to this problem,
but some approaches, though untested, appear worthwhile. For
example, if data from the same or similar sites are available on
background levels of contamination for these chemicals, these data
can be combined with monitoring data to estimate site-specific
effect sizes (see Sec. 4) for each chemical. By means of an
estimated between-chemical covariance matrix, a
correlation-adjusted estimate of combined effect across sites and
chemicals can be constructed and tested against an omnibus no-effects
hypothesis, $H_0: C \le C_0$. Combining these techniques with those of
the preceding
paragraph enables hypothesis testing across multiple sites, e.g.,
for combined assessment across a geo-political region.
Meta-analysis is the rubric for statistical methods for
combining the results of independent experiments or trials.
Traditionally, the experiments shared a common endpoint, and the
results to be combined were p-values from independent tests of
hypothesis or effects estimates from independent controlled
experiments. The extension of existing meta-analysis methods and the
development of new methods for environmental data combination is an
important area of open environmetric research.
The predominant use of meta-analysis outside the environmental
arena has been directed towards combining information from
independent controlled experiments or trials. Environmental data
are typically neither independent nor collected under case controls.
Needed is research on meta-analytic methods for correlated data
such as ecological and human exposure data. An important problem
is how to combine different analyses performed on the same data
set. Meta-analysis also deals with representativeness and data
quality issues such as publication bias, viz., selection bias
favoring publication of studies exhibiting statistically
significant findings. Similar issues arise in environmental
science in terms of study or site selection bias (e.g., studying
the 10 worst lakes in a region), reluctance to share data, and
publishing only those findings that are significant on statistical
or other grounds.
3. Combining Environmental Monitoring Data
Environmental monitoring data are available from many sources,
ranging in scope from single ponds or hazardous waste sites to
national ambient air monitoring networks, over different
environmental media (air, water, soil), and for different, often
overlapping, time and spatial scales. Some environmental
monitoring programs, such as the U.S. Environmental Monitoring and
Assessment Program (EMAP) and the U.S. National Resources Inventory
(NRI), are based on probability sample (P-sample) designs.
However, most are not (NP-samples). Environmental monitoring data
are difficult and expensive to collect, and statistical methods are
needed to combine monitoring data to maximize environmental
understanding and effective environmental management.
3.1 Combining P-Samples
Three standard approaches are available for combining data or
estimates from two P-samples A and B. The first combines weighted
estimates from the separate P-samples. The most general method is
based on inverse variance weighting: compute separate estimates of
the parameter of interest and its variance, weight each estimate
inversely proportional to its estimated variance, and add the
weighted estimates. The method of inverse variance weighting
produces an unbiased minimum variance combined estimate
$$\hat{y} = \frac{\hat{y}_a/\sigma_a^2 + \hat{y}_b/\sigma_b^2}{1/\sigma_a^2 + 1/\sigma_b^2}$$
(Hartley 1974), and is also applicable to NP-samples.
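
Inverse variance weighting can be sketched as below. This is a minimal illustration assuming independent estimates with known (or well-estimated) variances; the function name is hypothetical.

```python
def inverse_variance_combine(estimates, variances):
    """Combine independent estimates, weighting each by 1 / variance.

    Returns the combined estimate and its variance, 1 / sum(1 / variance).
    """
    if not estimates or len(estimates) != len(variances):
        raise ValueError("need equally long, non-empty lists")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    combined = sum(w * e for w, e in zip(weights, estimates)) / total
    return combined, 1.0 / total
```

The combined variance is never larger than the smallest input variance, which is the sense in which combining the two samples improves precision.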
A more trenchant approach for P-samples is dual frame
estimation which produces an unbiased minimum variance combined
estimate of a population total Y (Hartley 1974). Let A* and B*
denote the population frames corresponding to the samples A and B,
and consider the nonoverlapping subsamples of $A \cup B$: $A \cap B$, $A \setminus B$
(viz., units in A not contained in B), and $B \setminus A$. A family of
unbiased combined estimates of Y is obtained from:
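$$\hat{Y}_q = \sum_{i \in A \setminus B} \frac{y_i}{p_i} + \sum_{i \in B \setminus A} \frac{y_i}{p_i} + q \sum_{i \in A \cap B} \frac{y_i^{A}}{p_i^{A}} + (1-q) \sum_{i \in A \cap B} \frac{y_i^{B}}{p_i^{B}}$$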
for sample observations and probabilities of selection y and p and
for G
-------
problems are emerging in related areas, however. Specialized
statistical sampling methods, e.g., adaptive sampling (Thompson
1992, Part VI) and ranked set sampling (Patil et al. 1994) use
information obtained during sample collection to improve the
efficiency of a single environmental sample. Adaptive sampling
uses information collected on the first k sample units to decide
where to sample the (k+1)st unit (viz., when using random grid
sampling to estimate population size for a plant or animal species,
whenever the species is detected in one sample grid, add the
surrounding grids to the sample). Ranked set sampling employs
(subjective) assessments based on a covariate to prestratify units
(e.g., if the variable of interest, say vegetation, is believed to
be related to forest density, stratify potential sample sites into
high, medium and low categories based on a visual assessment of
density; then draw the sample based on this stratification). Both
methods combine auxiliary data with a default sampling method (such
as simple random sampling) with the aim of improving sample
efficiency. A related promising area of research is to use
covariate adjustment based on an existing P-sample at the time of
selecting a second P-sample to improve the efficiency of the second
sample (Patil 1996).
3.2 Combining P- and NP-Samples
Overton et al. (1993) address the problem of combining a
P-sample with an NP-sample. The NP-sample is identified with a
subset of the population represented by the P-sample, such as by
clustering based on NP-sample attributes. The population is
partitioned so that the NP-sample is a representative sample of one
partition, reducing the problem to that of combining two P-samples.
This method is imperfect, however: representativeness is liable to
be difficult to verify, and, if false, is liable to introduce
systematic bias into combined sample estimates. An alternative is
to include the NP-sample in the combined sample as a separate
stratum of self-representing units, i.e., assign probability of
selection one to each NP-unit. Unfortunately, we observe that this
will not improve precision unless the original P-sample is small.
Needed are further methods that use structure and data from a P-
sample to enhance representativeness of models based on NP-samples.
3.3 Combining NP-Samples
A regression method for augmenting P-samples (Overton et al.
1993) can be used to augment NP-samples. y denotes a variable on
the first but not the second sample; x denotes variables common to
both samples. Regress y on x in the first sample, $y = \beta x + \epsilon$, and
apply this regression equation to x-variables on the second sample
to predict y for each second sample unit, $\hat{y}_0 = \hat{\beta} x_0$. Problems with
this method include failure of the regression to account for true
variation in y, and potentially undetected bias in the regression
due to first-sample selection bias.
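
The regression augmentation step can be sketched as follows. This is a simplified illustration with a single x-variable and an intercept term (an assumption; the text writes the model without one). The function names are hypothetical.

```python
def fit_simple_regression(x, y):
    """Least-squares intercept and slope for y regressed on a single x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def predict_second_sample(x_first, y_first, x_second):
    """Fit on the first sample, then predict y for each second-sample unit."""
    intercept, slope = fit_simple_regression(x_first, y_first)
    return [intercept + slope * x0 for x0 in x_second]
```

The predictions carry none of the residual variation in y, which is one of the problems noted above: the imputed values are smoother than real observations would be.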
An alternative method is based on statistical record linkage.
Select x-variables $x_i$ measured on both samples. Define a distance
measure $d(u_1, u_2)$ between first and second sample units $u_1$ and $u_2$
based on the $x_i$, e.g., normalize each $x_i$ to mean = 0 and variance
= 1 and define
$$d(u_1, u_2) = \left(\sum_i w_i (x_{i1} - x_{i2})^2\right)^{1/2},$$
where the $w_i$ reflect the reliability of each $x_i$ for linkage.
Associate each $u_1$ with a $u_2$ that is at minimum distance from it,
and append the remaining first sample variables of $u_1$ to $u_2$.
Record linkage can be viewed
as a sequence of mini-cluster analyses, viz., a set of second
sample records centered on each first sample record (however, these
sets are not necessarily disjoint). It is important to note that
these methods require, conditional on the x-variables, that pairs
of first and second sample variables be independent (Rodgers 1984;
Cox and Boruch 1988), and that this conditional independence
assumption must be verified in practice. Statistical record
linkage, familiar to socio-economic statistics, needs to be refined
and evaluated for environmental applications.
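
A minimal sketch of the nearest-record linkage step follows. It assumes the x-variables are standardized over the pooled records and that the reliability weights are supplied by the analyst. The function names and data layout are hypothetical.

```python
import math

def standardize(rows):
    """Rescale each column to mean 0 and variance 1 (columns must not be constant)."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[i] for r in rows) / n for i in range(k)]
    sds = [math.sqrt(sum((r[i] - means[i]) ** 2 for r in rows) / n) for i in range(k)]
    return [[(r[i] - means[i]) / sds[i] for i in range(k)] for r in rows]

def link_records(x_first, extras_first, x_second, weights):
    """For each first-sample unit, find the closest second-sample unit.

    Returns (second_sample_index, first_sample_extra_variables) pairs, i.e.
    the first-sample variables to be appended to the matched second-sample unit.
    """
    pooled = standardize(x_first + x_second)
    z_first, z_second = pooled[:len(x_first)], pooled[len(x_first):]
    links = []
    for z1, extra in zip(z_first, extras_first):
        dists = [math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, z1, z2)))
                 for z2 in z_second]
        nearest = min(range(len(dists)), key=dists.__getitem__)
        links.append((nearest, extra))
    return links
```

Several first-sample units may link to the same second-sample unit, which reflects the remark above that the matched sets are not necessarily disjoint.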
3.4 Combining NP-Samples Exhibiting More Than Purposive Structure
Plant, wildlife and other ecological sampling is often based
on a period of observation at preselected sampling locations. The
resulting data are likely to be nonrandom, due to censoring
mechanisms such as bias towards observing larger over smaller
specimens or biases based on direction, terrain, etc. Observed
distributions need to be weighted to account for observer-observed
bias. For example, if each observation v has probability 1 - w(v)
of not being observed, then the observed pdf is the true pdf
weighted by w(v), and observed data are weighted by 1/w prior to
estimation. Presence of such bias is often signalled by
overdispersion, viz., excessive variance due, e.g., to undetected
bias or nonconstant mean.
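
The 1/w weighting can be illustrated with a bias-corrected mean. This sketch assumes the observation probability function w is known; the function name and the size-biased example are hypothetical.

```python
def bias_corrected_mean(observations, w):
    """Estimate the true mean from biased observations, weighting each by 1 / w(v)."""
    inverse_weights = [1.0 / w(v) for v in observations]
    weighted_sum = sum(iw * v for iw, v in zip(inverse_weights, observations))
    return weighted_sum / sum(inverse_weights)
```

For instance, if larger specimens are proportionally more likely to be seen, say w(v) = v/4, a population with equal numbers of sizes 1, 2 and 4 would be observed in the ratio 1:2:4. The unweighted mean of such a sample overstates the true mean of 7/3, while the weighted mean recovers it.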
Patil (1991) models overdispersion using double exponential
distributions
$$f_{\mu,\theta}(y) = c(\mu,\theta)\,\theta^{1/2}\,(f_\mu(y))^{\theta}\,(f_y(y))^{1-\theta}$$
where $f_\mu(y) = \exp\{yA(\mu) - B(\mu) + D(y)\}$ is the linear exponential family
with mean $\mu$ and variance $1/A'(\mu)$, with $\mu A'(\mu) - B'(\mu) = 0$.
The double exponential family enjoys exponential family properties
for both mean and dispersion parameters, enabling application of
common regression methods. Overdispersion is common in
environmental data, and may be due to clumped sampling,
heterogeneity, or selection bias. Patil connects double
exponential families and weighted distribution functions via
$$f_{\mu,\theta}(y) = \frac{w_{\mu,\theta}(y)\, f_\mu(y)}{E[w_{\mu,\theta}(Y)]} \quad \text{with} \quad w_{\mu,\theta}(y) = \left(\frac{f_y(y)}{f_\mu(y)}\right)^{1-\theta},$$
enabling bias reduction through modeling of overdispersion. If the
observed data are P-sample data, then they can be treated as a
(weighted) two-stage P-sample and standard P-sample methods can be
applied. If not, by accounting for some bias, weighting has
improved the representativeness of observed data, and standard
methods can be applied to weighted data. Weighted distribution
functions add the ability to combine empirical pdf's with
probability distributions of observer bias prior to analysis, and
to eliminate or reduce bias in overdispersed data. Extensions of
this approach to multivariate environmental data are needed.
3.5 Combining Monitoring Data with Spatial Structure
Many environmental monitoring data are inherently spatial.
Spatial estimation methods such as kriging and conditional
simulation are not design based, and do not enjoy the predictive
properties of P-sample based methods. Conversely, P-sample methods
do not take into account spatial structure. The problem of
combining P-sample estimates and spatial estimates for spatial P-
sample data is important in environmental applications. Related
important problems are how to use spatial data to improve a P-
sample design, and vice-versa. The first instance would arise, for
example, if the spatial data are short-term and the P-sample is to
be in place long-term, and could take the form of assigning strata
or joint probabilities of selection in the P-sample based on
spatial properties.
We offer a possible, but untested, approach towards combining
P-sample and spatial methods. Let I denote the set of observed
sites, and z the spatial process of interest. Define a fine grid
over the region and a corresponding set of grid centers J. Use
observed data at the sites I to estimate a semi-variogram
$\gamma$. Use kriging based on $\gamma$ to predict
$\{\hat{z}_j = \sum_{i \in I} w_{ij} z_i : j \in J\}$.
Compute an optimal (i.e., minimum weight) clustering of the grids
relative to the weight matrix $W = (w_{ij})$ (one way to do so is to
apply principal components analysis to W and to cluster the grid
centers based on the component loadings). Each cluster is
considered to be a homogeneous set or stratum of sampling units.
Apply (stratified) probability sampling to the set of clusters to
obtain a representative spatial sample.
Spatial sampling is more appropriate for spatial data than
probability sampling, and it is possible that spatial designs will
supplant probability designs in spatial applications. The problem
then is to combine spatially designed data and estimates with those
from P-samples. One approach is to assign probabilities of
selection to the spatial units and use the methods of Sec. 3.1.
The open research question is how to integrate i.i.d. probability
sampling with models for spatially correlated data.
3.6 Further Research
There are several areas where statistical theory can make
needed contributions to ecological monitoring and assessment. The
first is combining data and estimates from environmental monitoring
programs. P-sample based monitoring programs provide a rigorous
framework to study and assess ecological resources. Their value
can be magnified through combination with data from other
monitoring programs, and vice-versa. An important research
question is how to use NP-data to improve the efficiency of P-
sample designs and estimation strategies; e.g., for spatial data,
the NP-sample data might be kriged and the estimated surface used
to stratify the P-sample. Such methods would extend to other
environmental arenas, e.g., multi-stage sampling designs for site
characterization and drawing combined inferences from preliminary
(P-sample) site measurements and final (spatial NP-sample) site
monitoring data.
4. Combining Environmental Epidemiology Information
Approaches that combine data, such as Fisher's Method and its
weighted variations, synthesize disparate information into a single
inference about the environmental phenomenon under study. Unless
the data combination problem is fairly simple, however, simple
p-value combinations may overlook relevant scientific differences
among the various information sources. An extension of p-value
combination is effect size estimation, which combines a set of
summary statistics to produce, e.g., a summary correlation
coefficient or $\chi^2$ statistic (Hedges and Olkin 1985, Ch. 5). The
typical situation is to combine effect size estimates $(y_e - y_c)/\sigma_{ec}$
across independent controlled experiments ($y_e$ and $y_c$ are estimates
of the parameter of interest from experimental and control groups,
and $\sigma_{ec}$ is a pooled estimate of variance). However,
environmental applications typically involve observational (a.k.a.
"found") data not collected under experimental controls, and
require new and more general techniques. Recent approaches to
combine environmental information from multiple sources include
hierarchical regression, and are illustrated in the following
subsections.
4.1. Passive Smoking
An important example of characterizing risk from multiple data
sources is illustrated by combination of epidemiologic data in the
EPA study on health effects of environmental tobacco (or "passive")
smoke (US EPA 1992). Thirty epidemiologic studies were considered
as part of the EPA analysis. In all cases, the measured effect was
the relative increase in risk for lung cancer mortality over that
risk for non-exposed controls, i.e., the relative risk of exposure
death to unexposed death: $RR = \Pr\{D \mid E_+\}/\Pr\{D \mid E_-\}$. When the disease
prevalence in the population is small, the relative risk may be
estimated via the corresponding ratio of odds of exposure for
cancer deaths ("cases") to odds of exposure for controls (Breslow
and Day 1980, Sec. 2.8). Statistical methods are then employed for
testing whether this odds ratio equals 1.
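
The odds ratio estimate and its interval can be sketched as follows. This is a standard Wald interval on the log scale, offered as an illustration rather than the EPA's exact computation; the function name and the example counts are hypothetical, and the default z = 1.645 corresponds to 90% limits.

```python
import math

def odds_ratio(exposed_cases, unexposed_cases,
               exposed_controls, unexposed_controls, z=1.645):
    """Odds ratio from a 2x2 case-control table, with limits and weight.

    Returns (odds ratio, (lower, upper), weight), where weight is
    1 / Var[ln(odds ratio)] as used for inverse variance pooling.
    """
    a, b = exposed_cases, unexposed_cases
    c, d = exposed_controls, unexposed_controls
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")
    estimate = (a * d) / (b * c)              # (a/b) / (c/d)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - z * se_log)
    upper = math.exp(math.log(estimate) + z * se_log)
    return estimate, (lower, upper), 1.0 / se_log ** 2
```

A study is judged to show a significant increase at the corresponding level when the lower limit exceeds 1, which is the test of whether the odds ratio equals 1.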
For the 30 environmental tobacco smoke (ETS) studies, the
"exposed" groups were female non-smokers whose spouses smoked. The
estimated relative risks for death due to lung cancer ranged from
0.68 to 2.55 (see Table 1). When analyzed separately, only one of
the eleven U.S.-based studies showed a significant increase
(p=0.03) in the odds of lung cancer mortality after ETS exposure.
At issue was whether proper combination of the individual odds
ratios would identify an overall increased risk of lung cancer
death due to ETS.
Table 1. Summary relative risk information for 30 individual
studies of lung cancer risk after ETS exposure, grouped
by geographic region.

Region  Study       Estimated RR_i  90% Confidence Limits  Weight w_i
G       kala        1.92            (1.13, 3.23)            1.98
G       tric        2.08            (1.31, 3.29)           12.76
HK      chan        0.74            (0.47, 1.17)           13.01
HK      koo         1.54            (0.98, 2.43)           13.12
HK      lamt        1.64            (1.21, 2.21)           29.83
HK      lamw        2.51            (1.49, 4.23)            9.94
J       akib        1.50            (1.00, 2.50)           12.89
J       hira (Coh)  1.37            (1.02, 1.86)           29.98
J       inou        2.55            (0.90, 7.20)            2.50
J       shim        1.07            (0.70, 1.67)           14.32
J       sobu        1.57            (1.13, 2.15)           26.16
USA     brow        1.50            (0.48, 4.72)            2.07
USA     buff        0.68            (0.32, 1.41)            4.92
USA     butl (Coh)  2.01            (0.61, 6.73)            1.88
USA     corr        1.89            (0.85, 4.14)            4.32
USA     font        1.28            (1.03, 1.60)           55.79
USA     garf        1.27            (0.91, 1.79)           23.65
USA     garf (Coh)  1.16            (0.89, 1.52)           37.78
USA     humb        2.00            (0.83, 4.97)            3.38
USA     jane        0.79            (0.52, 1.17)           16.46
USA     kaba        0.73            (0.27, 1.89)            2.86
USA     wu          1.32            (0.59, 2.93)            4.21
EU      hole (Coh)  1.97            (0.34, 11.67)           0.87
EU      lee         1.01            (0.47, 2.15)            4.68
EU      pers        1.17            (0.75, 1.87)           12.97
EU      sven        1.20            (0.63, 2.36)            6.21
C       gao         1.19            (0.87, 1.62)           28.00
C       geng        2.16            (1.21, 3.84)            8.12
C       liu         0.77            (0.35, 1.68)            4.40
C       wuwi        0.78            (0.63, 0.96)           60.99

Region codes: G=Greece, HK=Hong Kong, J=Japan, USA,
EU=Western Europe, C=China.
Study abbreviation code from EPA (1992, Table 5.9).
Coh=cohort study.
Estimated relative risk RR_i is adjusted for smoker
misclassification.
Weight w_i = 1/Var[ln(RR_i)].
As part of the data combination, the relative risk models were
adjusted for certain background exposures that may decrease the
observed risk. For example, women without spouses who smoke are
still exposed to background ETS, via work place or
public/outside-the-home exposures. Thus, the unexposed group may
still exhibit some background lung cancer risk due to ETS, above
and beyond an idealized, "baseline" group that received no ETS
exposure whatsoever. To perform this adjustment, the following
model was developed: the baseline risk to an idealized group with
no ETS exposure was taken as 1. Next, risk to the unexposed group
was modeled via $1 + \beta_i d_i$, where $\beta_i$ is the increased risk per unit
dose, and $d_i$ is the mean dose level in the unexposed group. Risk to
the "exposed" group was modeled as $1 + z_i \beta_i d_i$, where $z_i$ is the ratio
between the mean dose level in the exposed group and the mean dose
level in the unexposed group. This gives $z_i \ge RR_i \ge 1$, where
$RR_i = (1 + z_i \beta_i d_i)/(1 + \beta_i d_i)$ is the "observed" relative risk. Under
this model, the adjusted risk for the exposed group relative to the
baseline group is the quantity of interest. It is calculated as
the ratio of the "exposed" group risk to the baseline risk. As the
baseline risk equals 1 under this model, $RR_i^* = 1 + z_i \beta_i d_i$.
Over the 30 studies, these adjusted estimates are pooled, so that,
on a logarithmic scale,
$$\log(RR^*) = \frac{\sum_i w_i \log(RR_i^*)}{\sum_i w_i}$$
where the per-study weights are $w_i = 1/\mathrm{Var}[\log(RR_i^*)]$.
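
The background adjustment and pooling can be sketched as below. Solving the observed relative risk expression (1 + z*beta*d)/(1 + beta*d) for the product beta*d gives beta*d = (RR - 1)/(z - RR), so the adjusted relative risk is 1 + z*(RR - 1)/(z - RR). The dose ratios z are not listed in this chapter, so any value passed in is an assumption; the function names are hypothetical.

```python
import math

def adjust_for_background(rr_observed, z):
    """Adjusted relative risk 1 + z*beta*d implied by the background-exposure model."""
    if z <= rr_observed:
        raise ValueError("the model requires z to exceed the observed relative risk")
    beta_d = (rr_observed - 1.0) / (z - rr_observed)   # from RR = (1 + z*bd) / (1 + bd)
    return 1.0 + z * beta_d

def pool_log_rr(rr_values, weights):
    """Weighted mean of log relative risks, returned on the relative risk scale."""
    log_pooled = sum(w * math.log(rr) for rr, w in zip(rr_values, weights)) / sum(weights)
    return math.exp(log_pooled)
```

As a check on the algebra, a dose ratio of 3 with beta*d = 0.25 gives an observed relative risk of 1.75/1.25 = 1.4, and `adjust_for_background(1.4, 3.0)` returns the exposed-group risk 1.75. The adjusted value always exceeds the observed one when the observed relative risk is above 1.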
Using only the observed relative risks, combination of all the
U.S. studies in the EPA analysis produced a statistically
significant (p=0.02) estimate for increased lung cancer risk of
1.19. Adjustment for background exposures, however, gives stronger
evidence: the relative risk estimate increases to a value of $RR^*$
= 1.59, i.e., an estimated 59% increase in lung cancer mortality in
U.S. non-smokers when exposed to environmental tobacco smoke.
These results illustrate the need for ongoing research and
widespread use of trenchant meta-analysis methods in environmental
science. Environmental and public health policy decisions often
involve serious economic and social outcomes. These decisions must
be based on the most reliable and complete combination of available
information.
4.2 Nitrogen Dioxide Exposure
4.2.1 Meta-Analysis of Nitrogen Dioxide Data
Quantifying risk of respiratory damage after exposure to
airborne toxins is an ongoing concern in modern environmental
epidemiology. An example involves an EPA meta-analysis of
respiratory damage after indoor exposure to nitrogen dioxide, NO2.
Previous studies had given mixed results on the risk of NO2
exposure. The EPA study combined information on the relationship
of NO2 exposure to respiratory illness from these separate studies.
Using as an outcome variable the presence of adverse lower
respiratory symptoms in children aged 5 to 12 years, odds ratios
were employed to estimate the relative risk (RR) for increased
lower respiratory distress in exposed population(s). A set of nine
North American and western European studies reported estimated RR's
ranging from 0.75 to 1.49. Separately, only four of the nine odds
ratios suggested a significant increase in respiratory distress due
to NO2 exposure (see Table 2). Combined using inverse variance
weighting, these data led to a combined RR estimate of 1.18, with
95% confidence limits from 1.11 to 1.25. That is, the
meta-analysis suggested that increased NO2 exposure can lead to an
increased risk of respiratory illness of about 11-25% over
unexposed controls (Hasselblad et al. 1992).
Table 2. Summary relative risk information for nine individual
studies of childhood respiratory disease risk after NO2
exposure.

Study Code  Estimated RR  90% Confidence Limits  Weight
M77         1.31          (1.18, 1.45)           258.90
M79         1.24          (1.11, 1.39)           219.67
M80         1.53          (1.11, 2.11)            26.10
M82         1.11          (0.87, 1.42)            44.88
W84         1.08          (0.99, 1.15)           489.33
N91         1.47          (1.21, 1.79)            71.50
E83         1.10          (0.83, 1.45)            35.17
D90         0.94          (0.70, 1.26)            31.30
K79         1.10          (0.78, 1.52)            25.03

Study abbreviation codes adapted from DuMouchel (1994).
Data source: Hasselblad et al. (1992, Table II).
Weight w_i = 1/Var[ln(RR)].
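
The inverse variance pooling of the Table 2 entries can be reproduced with a short sketch. The relative risks and weights are transcribed from Table 2; the function name and the 95% multiplier z = 1.96 are choices made for this illustration.

```python
import math

# (study code, estimated RR, weight) transcribed from Table 2.
NO2_STUDIES = [
    ("M77", 1.31, 258.90), ("M79", 1.24, 219.67), ("M80", 1.53, 26.10),
    ("M82", 1.11, 44.88), ("W84", 1.08, 489.33), ("N91", 1.47, 71.50),
    ("E83", 1.10, 35.17), ("D90", 0.94, 31.30), ("K79", 1.10, 25.03),
]

def pooled_relative_risk(studies, z=1.96):
    """Inverse variance pooled RR with confidence limits, computed on the log scale."""
    total_weight = sum(w for _, _, w in studies)
    log_rr = sum(w * math.log(rr) for _, rr, w in studies) / total_weight
    se = 1.0 / math.sqrt(total_weight)        # standard error of the pooled log RR
    return math.exp(log_rr), (math.exp(log_rr - z * se), math.exp(log_rr + z * se))

rr, (lower, upper) = pooled_relative_risk(NO2_STUDIES)
```

Applied to the tabulated values, the pooled estimate is about 1.18 with limits close to the 1.11 to 1.25 quoted in the text; small differences are expected from rounding in the table.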
4.2.2 Hierarchical Bayes Meta-analysis of Nitrogen Dioxide Data
For the nine studies in the EPA NO2 analysis, DuMouchel
(1994) employed a hierarchical model for the separate log odds
ratios by placing prior normal distributions on the underlying mean
log odds ratios, and assuming these mean log odds ratios were
themselves normally distributed. The model also extended the
hierarchy by placing additional hierarchical distributions on the
hyper-parameters of the normal prior distributions. Although
complex, this hierarchical model was easily manipulated to provide
posterior point estimates and posterior standard errors for the log
odds ratios. These estimates could then be employed to find an
overall, combined estimate of relative risk via inverse variance
weighting. Applied to the nine NO2 studies, the resulting combined
(posterior) log(RR) is 0.1614 (odds ratio $e^{0.1614} = 1.175$),
buttressing the EPA estimate.
DuMouchel (1994) also described a hierarchical
regression model that adjusted the log(RR) estimate for important
covariates such as household smoking and gender. Applied to the
nine NO2 studies, five of the nine studies exhibited significant
(p<0.05) posterior increases in relative risk, yielding a
covariate-adjusted combined log(RR) of 0.1567 (odds ratio
$e^{0.1567} = 1.170$). Data combination that incorporates important covariates in
hierarchical Bayes analyses is a straightforward extension of the
simple hierarchical model. Combining the similar information
sources with prior distribution(s) via a regression relationship
can, in effect, smooth out instabilities or other outlying features
in the data. This leads to posterior estimates that can outperform
those from other, non-hierarchical approaches (Greenland 1994).
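
The smoothing effect of a hierarchical model can be illustrated with a deliberately simplified sketch. It is not DuMouchel's model: it assumes a two-level normal model with a known between-study variance `tau2` and no hyperpriors, so it shows only the shrinkage mechanism. The function name and `tau2` are hypothetical.

```python
def shrink_toward_pooled(log_rr, variances, tau2):
    """Shrink study estimates toward a pooled mean under a normal-normal model.

    Assumes y_i ~ N(theta_i, v_i) and theta_i ~ N(mu, tau2) with tau2 known.
    Returns the estimated mu and the list of shrunken study-level means.
    """
    if tau2 < 0 or any(v <= 0 for v in variances):
        raise ValueError("tau2 must be non-negative and variances positive")
    weights = [1.0 / (v + tau2) for v in variances]
    mu = sum(w * y for w, y in zip(weights, log_rr)) / sum(weights)
    shrunk = []
    for y, v in zip(log_rr, variances):
        b = v / (v + tau2)                    # shrinkage factor toward mu
        shrunk.append(b * mu + (1.0 - b) * y)
    return mu, shrunk
```

Imprecise studies (large variance relative to `tau2`) are pulled strongly toward the pooled mean, while precise studies stay close to their own estimates. A full hierarchical Bayes analysis would also place distributions on `mu` and `tau2`.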
5. Conclusions
Research into statistical methods for combining
environmental information covers a broad spectrum of statistical
topics and environmental applications. The importance of these
problems stems from factors including: difficulty and cost of
collecting environmental data; differences in the design of
environmental data collection, and lack of complete design
information; presence of error and bias that are difficult to
identify and quantify; need to integrate different types of data
such as point and areal data or meteorological and pollution data;
and, need to verify scientific conclusions and to ensure
appropriateness and cost-effectiveness of future environmental
studies, regulations, and management strategies, and to establish
mechanisms for their validation and improvement.
In the examples presented above, statistical data
combination made it possible to draw out and quantify conclusions
that were hidden to the investigator. Meta-analytic methods were
presented to combine information on different chemicals and across
different locations, enabling combined characterization of
hazardous waste sites. Spatial and regression methods were
presented to enable combining assessment information between
monitoring sites, and between different monitoring programs.
Methods were presented to combine statistical samples of different
types, to combine estimates from such samples, and also to combine
epidemiological studies and draw statistically significant
conclusions from sets of predominantly non-significant results, as
in the passive smoking studies. Hierarchical regression methods
were presented to maximize effects of data combination and to
enable study-to-study comparisons.
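How a set of predominantly non-significant studies can yield a significant combined result can be sketched with fixed-effect, inverse-variance pooling. The code below is an illustration under stated assumptions, not the analysis used in the passive smoking studies: the five log(RR) estimates and standard errors are hypothetical, the studies are assumed independent, and the function names are chosen here for illustration.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def pool_fixed_effect(estimates, std_errors):
    """Inverse-variance (fixed-effect) pooled estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * est for w, est in zip(weights, estimates)) / total
    pooled_se = math.sqrt(1.0 / total)
    return pooled, pooled_se


# Hypothetical (log(RR), standard error) pairs for five independent studies
studies = [(0.22, 0.15), (0.18, 0.14), (0.25, 0.17), (0.15, 0.12), (0.20, 0.16)]
estimates = [est for est, _ in studies]
std_errors = [se for _, se in studies]

# Each study tested on its own
single_p = [two_sided_p(est / se) for est, se in studies]

# All studies combined
pooled, pooled_se = pool_fixed_effect(estimates, std_errors)
pooled_p = two_sided_p(pooled / pooled_se)

for (est, se), p in zip(studies, single_p):
    print("study log(RR) %.2f (SE %.2f): p = %.3f" % (est, se, p))
print("pooled log(RR) %.3f (SE %.3f): p = %.4f" % (pooled, pooled_se, pooled_p))
```

Because the pooled standard error shrinks as studies are added, a consistent but modest effect that no single study can detect at the 0.05 level may be detected in combination. If the studies are correlated or heterogeneous, this simple pooling overstates the precision of the combined estimate.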
6. Future Directions
Several open research problems and directions for
research were described with the preceding examples. Problems
that promise to contribute significantly both to environmental
understanding and to statistical methodology include: the
development of methods for combining NP-sample data; development of
a theoretical framework for integrating spatial and P-sample
methods for environmental assessment; new methods and extensions of
existing methods for combining spatial data collected at different
aggregation scales; modeling approaches that eliminate or reduce
bias in environmental data; and extensions of meta-analytic methods
to the environmental arena, including methods for combining
correlated studies, such as those involving different contaminants
at the same sites or multiple studies on the same data set, and
hierarchical methods that enable combination and intercomparison of
different environmental studies.
As illustrated herein, environmental science has
benefitted from statistical data combination methods, but more work
is needed. The interested reader is referred to Cox and Piegorsch
(1996) and Piegorsch and Cox (1996) for critical examination of a
wider set of applications and recommendations for research.
References
Breslow, N. and N. Day (1980). Statistical Methods in Cancer
Research. I. The Analysis of Case-Control Studies, Vol.
32. IARC Scientific Publications, Lyon, France.
Carroll, S., G. Day, N. Cressie and T. Carroll (1995). "Spatial
modeling of snow water equivalent using airborne and
ground-based snow data". Environmetrics 6, 127-139.
Cox, L. and R. Boruch (1988). "Record linkage, privacy and
statistical policy". Journal of Official Statistics 4,
3-16.
Cox, L. and W. Piegorsch (1994). "Combining environmental
information: environmetric research in ecological
monitoring, epidemiology, toxicology, and environmental
data reporting". Technical Report number 12. National
Institute of Statistical Sciences, Research Triangle
Park, NC.
Cox, L.H. and W.W. Piegorsch (1996). "Combining environmental
information I: Environmental monitoring, measurement
and assessment". Environmetrics 7, 299-308.
DuMouchel, W. (1994). "Hierarchical Bayes linear models for
meta-analysis". Technical Report number 27. National
Institute of Statistical Sciences, Research Triangle
Park, NC.
Greenland, S. (1994). "Hierarchical regression for epidemiologic
analysis of multiple exposures". Environmental Health
Perspectives 102, Suppl. 8, 33-39.
Hartley, H. (1974). "Multiple frame methodologies and selected
applications". Sankhya, Series B 36, 99-118.
Hasselblad, V., D.M. Eddy and D.J. Kotchmar (1992). "Synthesis of
environmental evidence: Nitrogen dioxide epidemiology
studies". Journal of the Air & Waste Management
Association 42, 662-671.
Hedges, L. and I. Olkin (1985). Statistical Methods for Meta-
Analysis, Academic Press, Orlando.
Mathew, T., B. Sinha and L. Zhou (1993). "Some statistical
procedures for combining independent tests". Journal of
the American Statistical Association 88, 912-919.
Myers, D. (1993). "Change of support and transformations".
Geostatistics for the Next Century, R. Dimitrakopoulos
(ed.). Kluwer Academic Publishers, Amsterdam, 253-258.
Overton, J., T. Young and W. Overton (1993). "Using found data to
augment a probability sample: procedure and a case
study". Environmental Monitoring and Assessment 26,
65-83.
Patil, G.P. (1991). "Encountered data, statistical ecology,
environmental statistics, and weighted distribution
methods". Environmetrics 2(4), 377-423.
Patil, G.P. (1996). "Using covariate-directed sampling of EMAP
hexagons to assess the statewide species richness of
breeding birds in Pennsylvania". Technical Report
number 95-1102, Center for Statistical Ecology and
Environmental Statistics, Pennsylvania State University,
University Park, PA.
Patil, G.P., A.K. Sinha and C. Taillie (1994). "Ranked set
sampling", in Handbook of Statistics, Volume 12:
Environmental Statistics, G.P. Patil and C.R. Rao
(eds.), North Holland, New York, 103-166.
Piegorsch, W. and L. Cox (1996). "Combining environmental
information II: environmental epidemiology and
toxicology". Environmetrics 7, 309-324.
Rodgers, W. (1984). "An evaluation of statistical matching".
Journal of Business & Economic Statistics 2, 91-102.
Thompson, S.K. (1992). Sampling, John Wiley and Sons, Inc., New
York.
US EPA (1992). Respiratory Health Effects of Passive Smoking:
Lung Cancer and Other Disorders. Technical Report
number 600/6-90/006F. Environmental Protection Agency,
Washington, DC.
-------
TECHNICAL REPORT DATA
1. REPORT NO.
EPA/600/A-96/090
2.
3. RECIPIENT'S ACCESSION NO.
4. TITLE AND SUBTITLE
Combining Environmental Information
5. REPORT DATE
6. PERFORMING ORGANIZATION CODE
7. AUTHOR(S)
Lawrence H. Cox
USEPA
NERL (MD-75)
RTP, NC 27711
8. PERFORMING ORGANIZATION REPORT NO.
9. PERFORMING ORGANIZATION NAME AND ADDRESS
Office of the Director
National Exposure Research Laboratory
Research Triangle Park, NC 27711
10. PROGRAM ELEMENT NO.
11. CONTRACT/GRANT NO.
N/A
12. SPONSORING AGENCY NAME AND ADDRESS
NATIONAL EXPOSURE RESEARCH LABORATORY
OFFICE OF RESEARCH AND DEVELOPMENT
U.S. ENVIRONMENTAL PROTECTION AGENCY
RESEARCH TRIANGLE PARK, NC 27711
13. TYPE OF REPORT AND PERIOD COVERED
Chapter for a book
14. SPONSORING AGENCY CODE
USEPA
15. SUPPLEMENTARY NOTES
16. ABSTRACT
Combining information is a problem in all aspects of environmental science. This book chapter summarizes recent
developments and research into statistical methods for combining environmental information. It is one chapter in a book
summarizing recent environmental statistics research in a variety of areas.
17. KEY WORDS AND DOCUMENT ANALYSIS
a. DESCRIPTORS
b. IDENTIFIERS/OPEN ENDED TERMS
c. COSATI
18. DISTRIBUTION STATEMENT
RELEASE TO PUBLIC
19. SECURITY CLASS (This Report)
UNCLASSIFIED
20. SECURITY CLASS (This Page)
UNCLASSIFIED
21. NO. OF PAGES
22. PRICE