EPA/600/A-96/090

COMBINING ENVIRONMENTAL INFORMATION

Lawrence H. Cox¹, Senior Mathematical Statistician
U.S. Environmental Protection Agency

1. Introduction

The problem of combining environmental information arises in all aspects of environmental science. Data combination can be performed to improve accuracy or precision of estimates, to investigate relationships between data sets collected in different places, times, ways, etc., or to validate findings. The scope of environmental issues and the consequences of attendant remediation decisions involve the combination and cross-comparison of information from multiple sources and types. Consequently, proper data combination can be a challenging problem statistically and practically.

Data combination is an important problem in environmental science, comprising the following features. Environmental data collection is a difficult and expensive enterprise, necessitating reuse of available data, often for purposes unintended at the time the original data were collected. Data necessary to address even a single environmental issue often need to be compiled from several sources involving different variables, measurements, time and spatial scales, accuracy, precision and completeness. Design criteria for data collection may be incomplete or unavailable, and selection bias is often present but difficult to quantify. Data validation can be difficult and complex, involving piecemeal cross-comparisons among several data sources. Environmental data sets are often large and consequently difficult and expensive to manipulate and analyze. As demonstrated in this chapter, these problems pose important challenges to statistical science, and statistical methods are central to their solution.

An objective of the NISS-USEPA cooperative agreement was to explore statistical methods for combining environmental information, leading to improved methods and widespread use of proven methods.
Aspects of this general topic were examined at a 1993 NISS-USEPA workshop using applications drawn from environmental assessment, monitoring, and epidemiology. Selected applications and findings of the workshop and subsequent cooperative research are presented below. Complete details of the workshop are reported in Cox and Piegorsch (1994).

¹ The information in this article has been funded wholly or in part by the United States Environmental Protection Agency. It has been subjected to Agency review and approved for publication. Mention of trade names or commercial products does not constitute endorsement or recommendation for use.

2. Combining Environmental Assessment Information

2.1 A Benthic Index for the Chesapeake Bay

USEPA and the State of Maryland are conducting an assessment of the effectiveness of environmental restoration in the Chesapeake Bay. Of concern is how nutrient abundance affects the plant and animal communities (benthos) in Bay sediment. An ecological nutrient index (benthic index) developed for the assessment is a weighted sum of species quantifiers:

$z = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 + w_5 x_5$

where $x_1$ is the salinity-adjusted number of species, $x_2$ is the percent of total benthic bivalve abundance, $x_3$ is the number of amphipods (crustaceans with multi-purpose feet), $x_4$ is the average weight per polychaete (a type of segmented marine worm), and $x_5$ is the number of capitellids (a special form of polychaete) observed at each location. Numerous samples were collected over four years from 31 sample sites in the Bay. Site selection was based on ecological, not statistical, criteria, requiring a model-based approach. Expert-determined index values were assigned to each site at multiple time points, and used to estimate the index weights $w_i$ via regression. A goal was to represent the benthic condition of the Bay as a map of the z-index.
From this, other assessments could be made, e.g., estimating the percentage of Bay acreage exhibiting degraded biotic conditions or identifying acreage in need of restoration. For this, the index must be predicted at unobserved locations. This was accomplished by modeling the index in two parts: using regression based on location and depth to account for large-scale variation, and then kriging the regression residuals to account for small-scale variation.

Another approach to problems involving multiple spatial scales is to combine spatial models using change of support methods (Myers 1993). Carroll et al. (1995) do so for the case of snow water equivalent (SWE) data collected from both terrestrial (point support) and airborne (areal support) monitoring by direct computation of the combined terrestrial-airborne data covariance matrix. This approach can be generalized to other situations, namely, when the two types of support are nonoverlapping.

The problem of combining large and small scale spatial data arises frequently and is one of many outstanding problems in environmental data combination. Two approaches to this problem are illustrated above. A worthwhile study would be to compare the combined regression-kriging approach used in the Chesapeake Bay study with the change of support methods developed to estimate SWE, using appropriate data such as the SWE data.

2.2 Hazardous Waste Site Characterization

Soil or water samples are collected at locations within a hazardous waste site to determine whether environmental clean-up is necessary, has been successful, or needs to be continued. This determination is made based on estimates of remaining contamination obtained by combining site measurements. To accomplish this, Fisher's Method for combining p-values from c independent studies (viz., compare $X_F^2 = -2 \sum_{i=1}^{c} \ln(p_i)$ to a $\chi^2$ reference distribution with 2c df; see Hedges and Olkin 1985) can be used to combine environmental data for site characterization.
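Fisher's Method as stated above can be sketched in a few lines. This is a minimal illustration, not the procedure used in any particular site assessment: the function name `fisher_combined` and the example p-values are hypothetical, and the tail probability uses the closed form available for a chi-square distribution with an even number of degrees of freedom, so only the standard library is needed.

```python
import math

def fisher_combined(p_values):
    """Combine independent p-values with Fisher's Method.

    Returns (statistic, combined_p), where statistic = -2 * sum(ln p_i)
    is referred to a chi-square distribution with 2c degrees of freedom.
    """
    c = len(p_values)
    if c == 0 or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("need at least one p-value, each in (0, 1]")
    stat = -2.0 * sum(math.log(p) for p in p_values)
    # For even df = 2c the upper tail has a closed form:
    # P(X > x) = exp(-x/2) * sum_{j=0}^{c-1} (x/2)^j / j!
    half = stat / 2.0
    term, total = 1.0, 1.0
    for j in range(1, c):
        term *= half / j
        total += term
    return stat, math.exp(-half) * total

# Hypothetical p-values for c = 3 chemicals (illustrative only).
stat, p = fisher_combined([0.08, 0.11, 0.04])
print(f"X_F^2 = {stat:.3f} on 6 df, combined p = {p:.4f}")
```

With a single p-value the combined p-value reduces to that p-value, which is a convenient sanity check on the tail formula.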
Assume that soil samples containing concentrations of c toxic chemicals are taken at each of k locations within a hazardous waste site. For simplicity, assume that the k location-specific sampling distributions are i.i.d. with known mean and variance vectors, and that the c chemical concentrations, $C_{ij}$, at each location j are independent. Then, independent F-statistics $F_i$ with corresponding p-values $p_i$, $i = 1, \ldots, c$, can be used to test hypotheses of combined effective clean-up across the sites for each chemical. Fisher's Method then can be used as an omnibus test of combined effective clean-up across all chemicals and sites by combining the $p_i$ into $X_F^2$.

If samples are not i.i.d. or have unknown variances, Fisher's Method may be suboptimal or correct reference distributions may be unavailable. An extended method is available, due to Mathew et al. (1993). We illustrate this method for the case k=2. Assume that the location-specific sampling distributions are not i.i.d. with known mean and variance. Compute the two-sample between-chemical correlation coefficient R and its p-value $p_R$ relative to an omnibus null hypothesis, $H_0: C \le C_0$, of effective clean-up across both sites (viz., the actual concentration of each chemical at each site is below the contamination threshold for that chemical). Under $H_0$, $F_1$, $F_2$ and R are independent, and therefore the test statistic $X^2 = -2(\ln(p_1) + \ln(p_2) + \ln(p_R))$ can be compared to a reference $\chi^2$ distribution with 6 df. Based on Monte Carlo simulations, this method appears to be more powerful than the corresponding Fisher Method with 4 df (see Mathew et al. 1993 for details).

Often, there is correlation among chemical concentrations at a sampling location, and it is inappropriate to combine p-values directly. Standard solutions are not available to this problem, but some approaches, though untested, appear worthwhile.
For example, if data from the same or similar sites are available on background levels of contamination for these chemicals, these data can be combined with monitoring data to estimate site-specific effect sizes (see Sec. 4) for each chemical. By means of an estimated between-chemical covariance matrix, a correlation-adjusted estimate of combined effect across sites and chemicals can be constructed and tested against an omnibus no-effects hypothesis, $H_0: C \le C_0$. Combining these techniques with those of the preceding paragraph enables hypothesis testing across multiple sites, e.g., for combined assessment across a geo-political region.

Meta-analysis is the rubric for statistical methods for combining the results of independent experiments or trials. Traditionally, the experiments shared a common endpoint, and the results to be combined were p-values from independent tests of hypothesis or effects estimates from independent controlled experiments. The extension of existing meta-analysis methods and the development of new methods for environmental data combination is an important area of open environmetric research. The predominant use of meta-analysis outside the environmental arena has been directed towards combining information from independent controlled experiments or trials. Environmental data are typically neither independent nor collected under case controls. Needed is research on meta-analytic methods for correlated data such as ecological and human exposure data. An important problem is how to combine different analyses performed on the same data set. Meta-analysis also deals with representativeness and data quality issues such as publication bias, viz., selection bias favoring publication of studies exhibiting statistically significant findings.
Similar issues arise in environmental science in terms of study or site selection bias (e.g., studying the 10 worst lakes in a region), reluctance to share data, and publishing only those findings that are significant on statistical or other grounds.

3. Combining Environmental Monitoring Data

Environmental monitoring data are available from many sources, ranging in scope from single ponds or hazardous waste sites to national ambient air monitoring networks, over different environmental media (air, water, soil), and for different, often overlapping, time and spatial scales. Some environmental monitoring programs, such as the U.S. Environmental Monitoring and Assessment Program (EMAP) and the U.S. National Resources Inventory (NRI), are based on probability sample (P-sample) designs. However, most are not (NP-samples). Environmental monitoring data are difficult and expensive to collect, and statistical methods are needed to combine monitoring data to maximize environmental understanding and effective environmental management.

3.1 Combining P-Samples

Three standard approaches are available for combining data or estimates from two P-samples A and B. The first combines weighted estimates from the separate P-samples. The most general method is based on inverse variance weighting: compute separate estimates of the parameter of interest and its variance, weight each estimate inversely proportional to its estimated variance, and add the weighted estimates. The method of inverse variance weighting produces an unbiased minimum variance combined estimate

$\hat{\theta} = \dfrac{\hat{\theta}_A / \hat{\sigma}_A^2 + \hat{\theta}_B / \hat{\sigma}_B^2}{1 / \hat{\sigma}_A^2 + 1 / \hat{\sigma}_B^2}$

(Hartley 1974), and is also applicable to NP-samples. A more trenchant approach for P-samples is dual frame estimation, which produces an unbiased minimum variance combined estimate of a population total Y (Hartley 1974).
Let A* and B* denote the population frames corresponding to the samples A and B, and consider the nonoverlapping subsamples of $A \cup B$: $A \cap B$, $A \setminus B$ (viz., units in A not contained in B), and $B \setminus A$. A family of unbiased combined estimates of Y is obtained from:

$\hat{Y}_q = \sum_{i \in A \setminus B} (1/p_i^{(A)}) y_i^{(A)} + \sum_{i \in B \setminus A} (1/p_i^{(B)}) y_i^{(B)} + q \sum_{i \in A \cap B} (1/p_i^{(A)}) y_i^{(A)} + (1-q) \sum_{i \in A \cap B} (1/p_i^{(B)}) y_i^{(B)}$

for sample observations y and probabilities of selection p and for [...] problems are emerging in related areas, however.

Specialized statistical sampling methods, e.g., adaptive sampling (Thompson 1992, Part VI) and ranked set sampling (Patil et al. 1994), use information obtained during sample collection to improve the efficiency of a single environmental sample. Adaptive sampling uses information collected on the first k sample units to decide where to sample the (k+1)st unit (e.g., when using random grid sampling to estimate population size for a plant or animal species, whenever the species is detected in one sample grid, add the surrounding grids to the sample). Ranked set sampling employs (subjective) assessments based on a covariate to prestratify units (e.g., if the variable of interest, say vegetation, is believed to be related to forest density, stratify potential sample sites into high, medium and low categories based on a visual assessment of density; then draw the sample based on this stratification). Both methods combine auxiliary data with a default sampling method (such as simple random sampling) with the aim of improving sample efficiency. A related promising area of research is to use covariate adjustment based on an existing P-sample at the time of selecting a second P-sample to improve the efficiency of the second sample (Patil 1996).

3.2 Combining P- and NP-Samples

Overton et al. (1993) address the problem of combining a P-sample with an NP-sample. The NP-sample is identified with a subset of the population represented by the P-sample, such as by clustering based on NP-sample attributes.
The population is partitioned so that the NP-sample is a representative sample of one partition, reducing the problem to that of combining two P-samples. This method is imperfect, however: representativeness is liable to be difficult to verify, and, if false, is liable to introduce systematic bias into combined sample estimates. An alternative is to include the NP-sample in the combined sample as a separate stratum of self-representing units, i.e., assign probability of selection one to each NP-unit. Unfortunately, we observe that this will not improve precision unless the original P-sample is small. Needed are further methods that use structure and data from a P-sample to enhance representativeness of models based on NP-samples.

3.3 Combining NP-Samples

A regression method for augmenting P-samples (Overton et al. 1993) can be used to augment NP-samples. Let y denote a variable measured on the first but not the second sample, and let x denote variables common to both samples. Regress y on x in the first sample, $y = \beta x + \epsilon$, and apply this regression equation to the x-variables on the second sample to predict y for each second sample unit, $\hat{y}_0 = \hat{\beta} x_0$. Problems with this method include failure of the regression to account for true variation in y, and potentially undetected bias in the regression due to first-sample selection bias.

An alternative method is based on statistical record linkage. Select x-variables $x_i$ measured on both samples. Define a distance measure $d(u_1, u_2)$ between first and second sample units $u_1$ and $u_2$ based on the $x_i$, e.g., normalize each $x_i$ to mean = 0 and variance = 1 and define $d(u_1, u_2) = \left( \sum_i v_i (x_{i1} - x_{i2})^2 \right)^{1/2}$, where the $v_i$ reflect the reliability of each $x_i$ for linkage. Associate each $u_1$ with a $u_2$ that is at minimum distance from it, and append the remaining first sample variables of $u_1$ to $u_2$.
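The distance-based linkage step just described can be sketched as follows. This is an illustrative sketch under stated assumptions, not a production record linkage routine: the function name `link_records`, the toy data, and the choice to standardize each x-variable over the pooled first and second samples are all assumptions (the text does not say which sample supplies the mean and variance).

```python
import math

def _standardizer(values):
    """Return a function mapping a value to (value - mean) / sd."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    sd = math.sqrt(var) if var > 0 else 1.0  # guard against constant columns
    return lambda v: (v - mean) / sd

def link_records(first_x, second_x, reliab):
    """For each first-sample unit, return the index of the second-sample
    unit at minimum weighted distance on the shared x-variables."""
    n_vars = len(reliab)
    pooled = list(first_x) + list(second_x)  # assumption: pooled scaling
    scalers = [_standardizer([row[i] for row in pooled]) for i in range(n_vars)]

    def scale(row):
        return [scalers[i](row[i]) for i in range(n_vars)]

    def dist(a, b):
        # d(u1, u2) = sqrt(sum_i v_i * (x_i1 - x_i2)^2)
        return math.sqrt(sum(v * (x - y) ** 2 for v, x, y in zip(reliab, a, b)))

    second_s = [scale(r) for r in second_x]
    return [min(range(len(second_s)), key=lambda j: dist(scale(r), second_s[j]))
            for r in first_x]

# Toy data: two shared x-variables per unit (hypothetical values).
first_x = [(1.0, 10.0), (5.0, 50.0)]
second_x = [(4.8, 49.0), (1.2, 11.0), (3.0, 30.0)]
links = link_records(first_x, second_x, reliab=[1.0, 1.0])
print(links)  # index of the linked second-sample unit for each first-sample unit
```

The first-sample variables not present on the second sample would then be appended to the linked second-sample units. Note that several first-sample units may link to the same second-sample unit.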
Record linkage can be viewed as a sequence of mini-cluster analyses, viz., a set of second sample records centered on each first sample record (however, these sets are not necessarily disjoint). It is important to note that these methods require, conditional on the x-variables, that pairs of first and second sample variables be independent (Rodgers 1984; Cox and Boruch 1988), and that this conditional independence assumption must be verified in practice. Statistical record linkage, familiar to socio-economic statistics, needs to be refined and evaluated for environmental applications.

3.4 Combining NP-Samples Exhibiting More Than Purposive Structure

Plant, wildlife and other ecological sampling is often based on a period of observation at preselected sampling locations. The resulting data are likely to be nonrandom, due to censoring mechanisms such as bias towards observing larger over smaller specimens or biases based on direction, terrain, etc. Observed distributions need to be weighted to account for observer-observed bias. For example, if each observation v has probability 1 - w(v) of not being observed, then the observed pdf is the true pdf weighted by w(v), and observed data are weighted by 1/w prior to estimation. Presence of such bias is often signalled by overdispersion, viz., excessive variance due, e.g., to undetected bias or nonconstant mean. Patil (1991) models overdispersion using double exponential distributions

$f_{\mu,\theta}(y) = c(\mu,\theta)\, \theta^{1/2} (f_\mu(y))^{\theta} (f_y(y))^{1-\theta}$

where $f_\mu(y) = \exp\{y A(\mu) - B(\mu) + D(y)\}$ is the linear exponential family with mean $\mu$ and variance $1/A'(\mu)$, with $\mu A'(\mu) - B'(\mu) = 0$. The double exponential family enjoys exponential family properties for both mean and dispersion parameters, enabling application of common regression methods. Overdispersion is common in environmental data, and may be due to clumped sampling, heterogeneity, or selection bias.
Patil connects double exponential families and weighted distribution functions via

$f_{\mu,\theta}(y) = w_{\mu,\theta}(y) f_\mu(y) / E[w_{\mu,\theta}(Y)]$ with $w_{\mu,\theta}(y) = (f_y(y)/f_\mu(y))^{1-\theta}$,

enabling bias reduction through modeling of overdispersion. If the observed data are P-sample data, then they can be treated as a (weighted) two-stage P-sample and standard P-sample methods can be applied. If not, by accounting for some bias, weighting has improved the representativeness of observed data, and standard methods can be applied to weighted data. Weighted distribution functions add the ability to combine empirical pdf's with probability distributions of observer bias prior to analysis, and to eliminate or reduce bias in overdispersed data. Extensions of this approach to multivariate environmental data are needed.

3.5 Combining Monitoring Data with Spatial Structure

Many environmental monitoring data are inherently spatial. Spatial estimation methods such as kriging and conditional simulation are not design based, and do not enjoy the predictive properties of P-sample based methods. Conversely, P-sample methods do not take into account spatial structure. The problem of combining P-sample estimates and spatial estimates for spatial P-sample data is important in environmental applications. Related important problems are how to use spatial data to improve a P-sample design, and vice-versa. The first instance would arise, for example, if the spatial data are short-term and the P-sample is to be in place long-term, and could take the form of assigning strata or joint probabilities of selection in the P-sample based on spatial properties.

We offer a possible, but untested, approach towards combining P-sample and spatial methods. Let I denote the set of observed sites, and z the spatial process of interest. Define a fine grid over the region and a corresponding set of grid centers J. Use observed data at the sites I to estimate a semi-variogram $\gamma$.
Use kriging based on $\gamma$ to predict $\{\hat{z}_j = \sum_{i \in I} w_{ij} z_i : j \in J\}$. Compute an optimal (i.e., minimum weight) clustering of the grids relative to the weight matrix $W = (w_{ij})$ (one way to do so is to apply principal components analysis to W and to cluster the grid centers based on the component loadings). Each cluster is considered to be a homogeneous set or stratum of sampling units. Apply (stratified) probability sampling to the set of clusters to obtain a representative spatial sample.

Spatial sampling is more appropriate for spatial data than probability sampling, and it is possible that spatial designs will supplant probability designs in spatial applications. The problem then is to combine spatially designed data and estimates with those from P-samples. One approach is to assign probabilities of selection to the spatial units and use the methods of Sec. 3.1. The open research question is how to integrate i.i.d. probability sampling with models for spatially correlated data.

3.6 Further Research

There are several areas where statistical theory can make needed contributions to ecological monitoring and assessment. The first is combining data and estimates from environmental monitoring programs. P-sample based monitoring programs provide a rigorous framework to study and assess ecological resources. Their value can be magnified through combination with data from other monitoring programs, and vice-versa. An important research question is how to use NP-data to improve the efficiency of P-sample designs and estimation strategies; e.g., for spatial data, the NP-sample data might be kriged and the estimated surface used to stratify the P-sample. Such methods would extend to other environmental arenas, e.g., multi-stage sampling designs for site characterization and drawing combined inferences from preliminary (P-sample) site measurements and final (spatial NP-sample) site monitoring data.

4. Combining Environmental Epidemiology Information

Approaches that combine data, such as Fisher's Method and its weighted variations, synthesize disparate information into a single inference about the environmental phenomenon under study. Unless the data combination problem is fairly simple, however, simple p-value combinations may overlook relevant scientific differences among the various information sources. An extension of p-value combination is effect size estimation, which combines a set of summary statistics to produce, e.g., a summary correlation coefficient or $\chi^2$ statistic (Hedges and Olkin 1985, Ch. 5). The typical situation is to combine effect size estimates $(y_e - y_c)/\sigma_{ec}$ across independent controlled experiments ($y_e$ and $y_c$ are estimates of the parameter of interest from experimental and control groups, and $\sigma_{ec}$ is a pooled estimate of variance). However, environmental applications typically involve observational (a.k.a. "found") data not collected under experimental controls, and require new and more general techniques. Recent approaches to combine environmental information from multiple sources include hierarchical regression, and are illustrated in the following subsections.

4.1. Passive Smoking

An important example of characterizing risk from multiple data sources is illustrated by combination of epidemiologic data in the EPA study on health effects of environmental tobacco (or "passive") smoke (US EPA 1992). Thirty epidemiologic studies were considered as part of the EPA analysis. In all cases, the measured effect was the relative increase in risk for lung cancer mortality over that risk for non-exposed controls, i.e., the relative risk of exposure death to unexposed death: $RR = \Pr\{D \mid E_{+}\} / \Pr\{D \mid E_{-}\}$. When the disease prevalence in the population is small, the relative risk may be estimated via the corresponding ratio of odds of exposure for cancer deaths ("cases") to odds of exposure for controls (Breslow and Day 1980, Sec. 2.8).
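The rare-disease approximation mentioned above can be checked numerically. The sketch below is illustrative only: the function names and all counts are hypothetical and are not taken from the EPA study. It compares the relative risk with the odds ratio for a rare and for a common outcome.

```python
def relative_risk(d_exp, n_exp, d_unexp, n_unexp):
    """Pr{D | exposed} / Pr{D | unexposed} from cohort-style counts."""
    return (d_exp / n_exp) / (d_unexp / n_unexp)

def odds_ratio(d_exp, n_exp, d_unexp, n_unexp):
    """Cross-product ratio a*d / (b*c).

    The odds ratio of exposure (cases vs. controls) and the odds ratio
    of disease (exposed vs. unexposed) are the same quantity.
    """
    a, b = d_exp, n_exp - d_exp          # exposed: deaths, survivors
    c, d = d_unexp, n_unexp - d_unexp    # unexposed: deaths, survivors
    return (a * d) / (b * c)

# Rare outcome (hypothetical counts): odds ratio is close to RR.
print(relative_risk(30, 10000, 20, 10000), odds_ratio(30, 10000, 20, 10000))

# Common outcome (hypothetical counts): odds ratio overstates RR.
print(relative_risk(3000, 10000, 2000, 10000), odds_ratio(3000, 10000, 2000, 10000))
```

In the rare-outcome case the two measures nearly coincide, which is why case-control studies, which can estimate only the odds ratio, are used to estimate relative risk for diseases such as lung cancer.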
Statistical methods are then employed for testing whether this odds ratio equals 1.

For the 30 environmental tobacco smoke (ETS) studies, the "exposed" groups were female non-smokers whose spouses smoked. The estimated relative risks for death due to lung cancer ranged from 0.68 to 2.55 (see Table 1). When analyzed separately, only one of the eleven U.S.-based studies showed a significant increase (p=0.03) in the odds of lung cancer mortality after ETS exposure. At issue was whether proper combination of the individual odds ratios would identify an overall increased risk of lung cancer death due to ETS.

Table 1. Summary relative risk information for 30 individual studies of lung cancer risk after ETS exposure, grouped by geographic region.

Region  Study     Estimated RR_i  90% Confidence Limits  Weight w_i
G       kala      1.92            (1.13, 3.23)            1.98
G       trie      2.08            (1.31, 3.29)           12.76
HK      chan      0.74            (0.47, 1.17)           13.01
HK      koo       1.54            (0.98, 2.43)           13.12
HK      kamt      1.64            (1.21, 2.21)           29.83
HK      lamw      2.51            (1.49, 4.23)            9.94
J       akib      1.50            (1.00, 2.50)           12.89
J       hiraCoh   1.37            (1.02, 1.86)           29.98
J       inou      2.55            (0.90, 7.20)            2.50
J       shim      1.07            (0.70, 1.67)           14.32
J       sobu      1.57            (1.13, 2.15)           26.16
USA     brow      1.50            (0.48, 4.72)            2.07
USA     buff      0.68            (0.32, 1.41)            4.92
USA     butlCoh   2.01            (0.61, 6.73)            1.88
USA     corr      1.89            (0.85, 4.14)            4.32
USA     font      1.28            (1.03, 1.60)           55.79
USA     garf      1.27            (0.91, 1.79)           23.65
USA     garfCoh   1.16            (0.89, 1.52)           37.78
USA     humb      2.00            (0.83, 4.97)            3.38
USA     jane      0.79            (0.52, 1.17)           16.46
USA     kaba      0.73            (0.27, 1.89)            2.86
USA     wu        1.32            (0.59, 2.93)            4.21
EU      holeCoh   1.97            (0.34, 11.67)           0.87
EU      lee       1.01            (0.47, 2.15)            4.68
EU      pers      1.17            (0.75, 1.87)           12.97
EU      sven      1.20            (0.63, 2.36)            6.21
C       gao       1.19            (0.87, 1.62)           28.00
C       geng      2.16            (1.21, 3.84)            8.12
C       liu       0.77            (0.35, 1.68)            4.40
C       wuwi      0.78            (0.63, 0.96)           60.99

Region codes: G=Greece, HK=Hong Kong, J=Japan, USA, EU=Western Europe, C=China. Study abbreviation code from EPA (1992, Table 5.9). Coh=cohort study.
Estimated relative risk $RR_i$ is adjusted for smoker misclassification. Weight $w_i = 1/\mathrm{Var}[\ln(RR_i)]$.

As part of the data combination, the relative risk models were adjusted for certain background exposures that may decrease the observed risk. For example, women without spouses who smoke are still exposed to background ETS, via work place or public/outside-the-home exposures. Thus, the unexposed group may still exhibit some background lung cancer risk due to ETS, above and beyond an idealized, "baseline" group that received no ETS exposure whatsoever. To perform this adjustment, the following model was developed: the baseline risk to an idealized group with no ETS exposure was taken as 1. Next, risk to the unexposed group was modeled via $1 + \beta_i d_i$, where $\beta_i$ is the increased risk per unit dose, and $d_i$ is the mean dose level in the unexposed group. Risk to the "exposed" group was modeled as $1 + z_i \beta_i d_i$, where $z_i$ is the ratio between the mean dose level in the exposed group and the mean dose level in the unexposed group. This gives $z_i \ge RR_i \ge 1$, where $RR_i = (1 + z_i \beta_i d_i)/(1 + \beta_i d_i)$ is the "observed" relative risk.

Under this model, the adjusted risk for the exposed group relative to the baseline group is the quantity of interest. It is calculated as the ratio of the "exposed" group risk to the baseline risk. As the baseline risk equals 1 under this model, $RR_i^* = 1 + z_i \beta_i d_i$. Over the 30 studies, these adjusted estimates are pooled, so that, on a logarithmic scale,

$\log(RR^*) = \dfrac{\sum_i w_i \log(RR_i^*)}{\sum_i w_i}$

where the per-study weights are $w_i = 1/\mathrm{Var}[\log(RR_i^*)]$.

Using only the observed relative risks, combination of all the U.S. studies in the EPA analysis produced a statistically significant (p=0.02) estimate for increased lung cancer risk of 1.19. Adjustment for background exposures, however, gives stronger evidence: the relative risk estimate increases to a value of $RR^* = 1.59$, i.e., an estimated 59% increase in lung cancer mortality in U.S.
non-smokers when exposed to environmental tobacco smoke.

These results illustrate the need for ongoing research and widespread use of trenchant meta-analysis methods in environmental science. Environmental and public health policy decisions often involve serious economic and social outcomes. These decisions must be based on the most reliable and complete combination of available information.

4.2 Nitrogen Dioxide Exposure

4.2.1 Meta-Analysis of Nitrogen Dioxide Data

Quantifying risk of respiratory damage after exposure to airborne toxins is an ongoing concern in modern environmental epidemiology. An example involves an EPA meta-analysis of respiratory damage after indoor exposure to nitrogen dioxide, NO2. Previous studies had given mixed results on the risk of NO2 exposure. The EPA study combined information on the relationship of NO2 exposure to respiratory illness from these separate studies. Using as an outcome variable the presence of adverse lower respiratory symptoms in children aged 5 to 12 years, odds ratios were employed to estimate the relative risk (RR) for increased lower respiratory distress in exposed population(s).

A set of nine North American and western European studies reported estimated RR's ranging from 0.75 to 1.49. Separately, only four of the nine odds ratios suggested a significant increase in respiratory distress due to NO2 exposure (see Table 2). Combined using inverse variance weighting, these data led to a combined RR estimate of 1.18, with 95% confidence limits from 1.11 to 1.25. That is, the meta-analysis suggested that increased NO2 exposure can lead to an increased risk of respiratory illness of about 11-25% over unexposed controls (Hasselblad et al. 1992).

Table 2. Summary relative risk information for nine individual studies of childhood respiratory disease risk after NO2 exposure.

Study Code  Estimated RR  90% Confidence Limits  Weight
M77         1.31          (1.18, 1.45)           258.90
M79         1.24          (1.11, 1.39)           219.67
M80         1.53          (1.11, 2.11)            26.10
M82         1.11          (0.87, 1.42)            44.88
W84         1.08          (0.99, 1.15)           489.33
N91         1.47          (1.21, 1.79)            71.50
E83         1.10          (0.83, 1.45)            35.17
D90         0.94          (0.70, 1.26)            31.30
K79         1.10          (0.78, 1.52)            25.03

Study abbreviation codes adapted from DuMouchel (1994). Data source: Hasselblad et al. (1992, Table II). Weight $w_i = 1/\mathrm{Var}[\ln(RR)]$.

4.2.2 Hierarchical Bayes Meta-analysis of Nitrogen Dioxide Data

For the nine studies in the EPA NO2 analysis, DuMouchel (1994) employed a hierarchical model for the separate log odds ratios by placing prior normal distributions on the underlying mean log odds ratios, and assuming these mean log odds ratios were themselves normally distributed. The model also extended the hierarchy by placing additional hierarchical distributions on the hyper-parameters of the normal prior distributions. Although complex, this hierarchical model was easily manipulated to provide posterior point estimates and posterior standard errors for the log odds ratios. These estimates could then be employed to find an overall, combined estimate of relative risk via inverse variance weighting. Applied to the nine NO2 studies, the resulting combined (posterior) log(RR) is 0.1614 (odds ratio $e^{0.1614} = 1.175$), buttressing the EPA estimate.

DuMouchel (1994) also described a hierarchical regression model that adjusted the log(RR) estimate for important covariates such as household smoking and gender. Applied to the nine NO2 studies, five of the nine studies exhibited significant (p<0.05) posterior increases in relative risk, yielding a covariate-adjusted combined log(RR) of 0.1567 (odds ratio $e^{0.1567} = 1.170$). Data combination that incorporates important covariates in hierarchical Bayes analyses is a straightforward extension of the simple hierarchical model.
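The inverse variance pooling reported in Sec. 4.2.1 can be sketched directly from the Table 2 values. This is an illustrative fixed-effect calculation, not a reproduction of the published analysis: the function name `pool_inverse_variance` is hypothetical, and the interval uses a normal approximation with multiplier 1.96, which is an assumption about how the reported 95% limits were formed.

```python
import math

# (study code, estimated RR, weight = 1/Var[ln(RR)]) transcribed from Table 2.
TABLE2 = [
    ("M77", 1.31, 258.90), ("M79", 1.24, 219.67), ("M80", 1.53, 26.10),
    ("M82", 1.11, 44.88), ("W84", 1.08, 489.33), ("N91", 1.47, 71.50),
    ("E83", 1.10, 35.17), ("D90", 0.94, 31.30), ("K79", 1.10, 25.03),
]

def pool_inverse_variance(studies, z=1.96):
    """Pool relative risks on the log scale with inverse variance weights.

    Returns (pooled RR, lower limit, upper limit).
    """
    total_w = sum(w for _, _, w in studies)
    log_rr = sum(w * math.log(rr) for _, rr, w in studies) / total_w
    se = math.sqrt(1.0 / total_w)  # variance of the pooled log RR is 1 / sum(w)
    return (math.exp(log_rr),
            math.exp(log_rr - z * se),
            math.exp(log_rr + z * se))

rr, lower, upper = pool_inverse_variance(TABLE2)
print(f"pooled RR = {rr:.2f}, approximate 95% limits ({lower:.2f}, {upper:.2f})")
```

The pooled value should land close to the reported combined estimate of 1.18. Small differences in the limits are to be expected from rounding in the tabulated estimates and weights.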
Combining the similar information sources with prior distribution(s) via a regression relationship can, in effect, smooth out instabilities or other outlying features in the data. This leads to posterior estimates that can outperform those from other, non-hierarchical approaches (Greenland 1994).

5. Conclusions

Research into statistical methods for combining environmental information covers a broad spectrum of statistical topics and environmental applications. The importance of these problems stems from factors including: difficulty and cost of collecting environmental data; differences in the design of environmental data collection, and lack of complete design information; presence of error and bias that are difficult to identify and quantify; need to integrate different types of data such as point and areal data or meteorological and pollution data; and, need to verify scientific conclusions and to ensure appropriateness and cost-effectiveness of future environmental studies, regulations, and management strategies, and to establish mechanisms for their validation and improvement.

In the examples presented above, statistical data combination made it possible to draw out and quantify conclusions that were hidden to the investigator. Meta-analytic methods were presented to combine information on different chemicals and across different locations, enabling combined characterization of hazardous waste sites. Spatial and regression methods were presented to enable combining assessment information between monitoring sites, and between different monitoring programs. Methods were presented to combine statistical samples of different types, to combine estimates from such samples, and also to combine epidemiological studies and draw statistically significant conclusions from sets of predominantly non-significant results, as in the passive smoking studies.
Hierarchical regression methods were presented to maximize effects of data combination and to enable study-to-study comparisons.

6. Future Directions

Several open research problems and directions for research were described with the preceding examples. Problems which promise to contribute significantly both to environmental understanding and to statistical methodology include the development of methods for combining NP-sample data, development of a theoretical framework for integrating spatial and P-sample methods for environmental assessment, new methods and extensions of existing methods for combining spatial data collected at different aggregation scales, modeling approaches that eliminate or reduce bias in environmental data, and extensions of meta-analytic methods to the environmental arena, including methods for combining correlated studies, such as those involving different contaminants at the same sites or multiple studies on the same data set, and hierarchical methods that enable combination and intercomparison of different environmental studies. As illustrated herein, environmental science has benefited from statistical data combination methods, but more work is needed. The interested reader is referred to Cox and Piegorsch (1996) and Piegorsch and Cox (1996) for critical examination of a wider set of applications and recommendations for research.

References

Breslow, N. and N. Day (1980). Statistical Methods in Cancer Research. I. The Analysis of Case-Control Studies, Vol. 32. IARC Scientific Publications, Lyon, France.

Carroll, S., G. Day, N. Cressie and T. Carroll (1995). "Spatial modeling of snow water equivalent using airborne and ground-based snow data". Environmetrics 6, 127-139.

Cox, L. and R. Boruch (1988). "Record linkage, privacy and statistical policy". Journal of Official Statistics 4, 3-16.

Cox, L. and W. Piegorsch (1994). "Combining environmental information: environmetric research in ecological monitoring, epidemiology, toxicology, and environmental data reporting". Technical Report number 12. National Institute of Statistical Sciences, Research Triangle Park, NC.

Cox, L.H. and W.W. Piegorsch (1996). "Combining environmental information I: Environmental monitoring, measurement and assessment". Environmetrics 7, 299-308.

DuMouchel, W. (1994). "Hierarchical Bayes linear models for meta-analysis". Technical Report number 27. National Institute of Statistical Sciences, Research Triangle Park, NC.

Greenland, S. (1994). "Hierarchical regression for epidemiologic analysis of multiple exposures". Environmental Health Perspectives 102, Suppl. 8, 33-39.

Hartley, H. (1974). "Multiple frame methodologies and selected applications". Sankhya, Series B 36, 99-118.

Hasselblad, V., D.M. Eddy and D.J. Kotchmar (1992). "Synthesis of environmental evidence: Nitrogen dioxide epidemiology studies". Journal of the Air & Waste Management Association 42, 662-671.

Hedges, L. and I. Olkin (1985). Statistical Methods for Meta-Analysis, Academic Press, Orlando.

Mathew, T., B. Sinha and L. Zhou (1993). "Some statistical procedures for combining independent tests". Journal of the American Statistical Association 88, 912-919.

Myers, D. (1993). "Change of support and transformations". Geostatistics for the Next Century, R. Dimitrakopoulos (ed). Kluwer Academic Publishers, Amsterdam, 253-258.

Overton, J., T. Young and W. Overton (1993). "Using found data to augment a probability sample: procedure and a case study". Environmental Monitoring and Assessment 26, 65-83.

Patil, G.P. (1991). "Encountered data, statistical ecology, environmental statistics, and weighted distribution methods". Environmetrics 2(4), 377-423.

Patil, G.P. (1996). "Using covariate-directed sampling of EMAP hexagons to assess the statewide species richness of breeding birds in Pennsylvania".
Technical Report number 95-1102, Center for Statistical Ecology and Environmental Statistics, Pennsylvania State University, University Park, PA.

Patil, G.P., A.K. Sinha and C. Taillie (1994). "Ranked set sampling", in Handbook of Statistics, Volume 12: Environmental Statistics, G.P. Patil and C.R. Rao (eds.), North Holland, New York, 103-166.

Piegorsch, W. and L. Cox (1996). "Combining environmental information II: environmental epidemiology and toxicology". Environmetrics 7, 309-324.

Rodgers, W. (1984). "An evaluation of statistical matching". Journal of Business & Economic Statistics 2, 91-102.

Thompson, S.K. (1992). Sampling, John Wiley and Sons, Inc., New York.

US EPA (1992). Respiratory Health Effects of Passive Smoking: Lung Cancer and Other Disorders. Technical Report number 600/6-90/006F. Environmental Protection Agency, Washington, DC.

TECHNICAL REPORT DATA

1. REPORT NO.: EPA/600/A-96/090
4. TITLE AND SUBTITLE: Combining Environmental Information
5. REPORT DATE:
6. PERFORMING ORGANIZATION CODE:
7. AUTHOR(S): Lawrence H. Cox, USEPA NERL (MD-75), RTP, NC 27711
8. PERFORMING ORGANIZATION REPORT NO.:
9. PERFORMING ORGANIZATION NAME AND ADDRESS: Office of the Director, National Exposure Research Laboratory, Research Triangle Park, NC 27711
10. PROGRAM ELEMENT NO.:
11. CONTRACT/GRANT NO.: N/A
12. SPONSORING AGENCY NAME AND ADDRESS: National Exposure Research Laboratory, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC 27711
13. TYPE OF REPORT AND PERIOD COVERED: Chapter for a book
14. SPONSORING AGENCY CODE: USEPA
15. SUPPLEMENTARY NOTES:
16. ABSTRACT: Combining information is a problem in all aspects of environmental science. This book chapter summarizes recent developments and research into statistical methods for combining environmental information. It is one chapter in a book summarizing recent environmental statistics research in a variety of areas.
17. KEY WORDS AND DOCUMENT ANALYSIS: a. DESCRIPTORS; b. IDENTIFIERS/OPEN ENDED TERMS; c. COSATI
18. DISTRIBUTION STATEMENT: RELEASE TO PUBLIC
19. SECURITY CLASS (This Report): UNCLASSIFIED
20. SECURITY CLASS (This Page): UNCLASSIFIED
21. NO. OF PAGES:
22. PRICE: