EPA/600/A-96/018

STATISTICAL ISSUES IN ENVIRONMENTAL MONITORING AND ASSESSMENT OF ANTHROPOGENIC POLLUTION

Lawrence H. Cox
Senior Mathematical Statistician
Office of Research and Development

N. Phillip Ross
Director
Environmental Statistics and Information Division

U.S. Environmental Protection Agency¹

Introduction

Environmental data are often collected to assist in assessing the impact of anthropogenic pollution on the natural environment, to determine the effects of pollution on human and ecological health, to enforce compliance with environmental regulations and standards, and to assess the state of the environment. Since primary environmental data collection is costly, data sets are often used for multiple purposes. In addition, environmental information is collected by many diverse and independent organizations. This results in a patchwork of spatially and temporally different data sets that profess to measure the same phenomenon but defy the use of classical statistical approaches for their integration and analysis. This duality leads to a number of statistical issues relating to the monitoring, measurement, use, and analysis of environmental data. Statisticians are being asked to convert the proverbial "sow's ear" into a "silk purse".

In this keynote paper to the Third SPRUCE Conference, we address some statistical issues at the interface of environmental monitoring, measurement and regulatory decision making. Some of the areas explored are: environmental monitoring and measurement, environmental indicators, sampling approaches, use of environmental models, use of "encountered" or "found" environmental data, environmental decision making and public policy, and environmental reporting.

Statistical Issues in Environmental Monitoring and Measurement: The collection of primary environmental data

Monitoring

The monitoring and collection of environmental data is a costly undertaking.
Environmental data collection involves complex measurement processes that often require the development of new technology for the direct or indirect measurement of environmental variables. Beyond instrumentation, cost and consideration must be given to the placement of the measurement instruments and the timing of measurements so that the data collected are representative. Statistical issues related to defining the universe to be measured, designing sampling strategies that are cost effective and representative, and integrating additional monitoring information with primary data for assessing the state of the environment are among the problems associated with environmental monitoring. In addition, new methods and strategies need to be developed to deal with the collection of multi-media information at a single site. Initiatives like the U.S. Environmental Monitoring and Assessment Program are attempts to establish monitoring systems based on sound statistical design and capable of collecting multi-media information on a site-specific basis. Although the collection of a number of variables from a single monitoring site provides a cost effective approach to measuring the state of the environment across a number of dependent variables, it brings with it new statistical problems of data interpretation.

¹ The information in this document has been funded wholly or in part by the U.S. Environmental Protection Agency. It has been subjected to Agency review and approved for publication. Mention of trade names or commercial products does not constitute endorsement or recommendation for use.

Spatial and Temporal Variability

Spatial and temporal aspects of the environment must be considered when monitoring and measuring anthropogenic pollutants: environmental measures have inherent spatial and temporal characteristics and interrelationships.
Many of the assumptions underlying classical statistical time series and geographic data methods, such as multivariate normality, i.i.d. observations, replication, and simple random sampling of points in space and time, are often invalid or not recoverable (e.g., by logarithmic transformation, weighting, or adjustment for autocorrelation or trend) for environmental data sets. This lack of conformance to the assumptions underlying classical methods raises a number of interesting issues for statistical application in the spatial and temporal analysis of environmental data.

Ecological monitoring and assessment illustrates these problems. The sampling design for the Environmental Monitoring and Assessment Program is based on selecting monitoring locations probabilistically using a randomly situated hexagonal tessellation of the area of interest. The tessellation can be made as fine as needed (i.e., involving smaller and smaller hexagons), provided that the sample hexagons remain i.i.d. Within each sampled hexagon, sampling methods based on less restrictive assumptions (e.g., adaptive or other sequential methods) may be employed. To capture temporal trends, geographic sampling is accompanied by a temporal sampling design analogous to the "rotation group" designs familiar from socio-demographic statistics. Sample collection for a single period can take several weeks or months, but typically is confined to a single season. Thus, space is considered homogeneous at sufficiently large scales, time is considered homogeneous at sufficiently small scales, and space and time are considered separable.

Different problems are encountered in hazardous waste site characterization. Time is regarded as fixed, and space as inhomogeneous. Simple random sampling schemes (e.g., along a regular grid), though in widespread use, may fail to capture spatial trends, and "hotspot" sampling may overestimate the magnitude or extent of pollution.
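The randomly situated tessellation described above for the Environmental Monitoring and Assessment Program can be illustrated with a small sketch. The construction below is hypothetical, not EMAP's actual algorithm: it lays hexagon centers on a triangular lattice with a uniformly random offset, so that every location in the region has the same chance of falling near a sample point.

```python
import math
import random

def hexagonal_grid(xmin, xmax, ymin, ymax, spacing, rng=random):
    """Centers of a randomly offset triangular (hexagon-center) lattice,
    clipped to a rectangular study region.  Each returned point is the
    center of one hexagonal cell of the tessellation."""
    dy = spacing * math.sqrt(3) / 2        # vertical distance between rows
    ox = rng.uniform(0, spacing)           # random shift makes placement
    oy = rng.uniform(0, dy)                # of the whole grid probabilistic
    points, row = [], 0
    y = ymin - dy + oy
    while y <= ymax + dy:
        shift = spacing / 2 if row % 2 else 0.0   # stagger alternate rows
        x = xmin - spacing + ox + shift
        while x <= xmax + spacing:
            if xmin <= x <= xmax and ymin <= y <= ymax:
                points.append((x, y))
            x += spacing
        y += dy
        row += 1
    return points
```

Refining the tessellation amounts to rerunning the construction with a smaller spacing; nearest sample points always sit exactly one spacing apart, regardless of the random offset.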
Methods based on prior knowledge of spatial structure are preferred, but in their current form are often too complex for the average user. Sequential methods have been shown to be effective and are useful when prior information is unavailable.

One area where simple spatial analysis is being applied is through the use of Geographic Information Systems (GIS). Different types of environmental data can be represented on the same plane. Traditional representation is relatively unsophisticated in that different color schemes or shadings are superimposed on the plane under the visual or intuitive judgement of the analyst. Such methods can provide confirmatory information (e.g., for identifying potential environmental justice concerns) but are not trenchant. The next step (commonly not employed) is the use of statistical methods for defining and comparing relationships and cluster groupings. Such approaches would provide less subjective assessment of relational patterns and interactions of the environment with anthropogenic stress factors. This would enable site assessments based on risk, as well as provide a mechanism for stratification of areas for focused environmental monitoring. Combining the ability of GIS to automatically manipulate and visualize spatial data with algorithms and methods of spatial statistics that capture local spatial structure is an emerging activity, and should result in statistical software for spatial analysis comparable in power and ease of use to that already available for time series analysis, demographic studies, etc.

Environmental Indicators

In most countries, environmental monitoring systems are put in place to assist in the enforcement of standards aimed at protecting the environment and the health and well-being of its inhabitants. Direct measurement of environmental quality is generally not feasible, and indirect measures, or indicators, are used to provide some assessment of the state of the environment.
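One common way such indirect measures are condensed for a general audience is an index that maps each measured concentration onto a common unitless scale and reports the worst case. The sketch below is illustrative only: the breakpoint tables are invented for the example, not those of any official index.

```python
def subindex(conc, breakpoints):
    """Piecewise-linear interpolation of a concentration onto a common
    0-500 index scale.  breakpoints is a list of segments
    (conc_lo, conc_hi, idx_lo, idx_hi)."""
    for c_lo, c_hi, i_lo, i_hi in breakpoints:
        if c_lo <= conc <= c_hi:
            return i_lo + (conc - c_lo) * (i_hi - i_lo) / (c_hi - c_lo)
    raise ValueError("concentration outside breakpoint table")

# Hypothetical breakpoint tables, one per pollutant (NOT official values).
BREAKPOINTS = {
    "ozone": [(0.0, 0.12, 0, 100), (0.12, 0.40, 100, 500)],   # ppm
    "co":    [(0.0, 9.0, 0, 100), (9.0, 50.0, 100, 500)],     # ppm
}

def air_index(concentrations):
    """Index-style summary: the reported value is driven by the
    pollutant with the worst subindex."""
    return max(subindex(c, BREAKPOINTS[p]) for p, c in concentrations.items())
```

The "max" rule means the index communicates the single worst condition; statistically it discards information from the other pollutants, which is part of the aggregation challenge discussed below.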
Indicators are typically closely tied to environmental measurements, e.g., the concentration (parts per billion) of ozone in ambient air or of dissolved oxygen (mg/L) in estuaries. Such values, meaningful to scientists, may communicate little to policy makers or the public, and do not effectively characterize total environmental condition. For these purposes, environmental indices, i.e., (unitless) values based on several different environmental measurements, are preferred. Examples of environmental indices are the Pollutant Standards Index (PSI), which reports daily air quality based on separately measured concentrations of five pollutants in ambient air, and "benthic" indices, which combine biological and chemical measurements taken in estuarine sediment. The types and measurements used for indicators are of critical importance: resolution of these questions gives rise to problems in both environmental and statistical science.

Indices present a special challenge to the statistician. The tendency to aggregate a number of disparate measures into a single variable that will tell us how the environment is doing at any point in time, and over time, is appealing. However, the process of integrating multivariate measures into an index for the environment is not simple. Environmental structure and function are filled with uncertainties, and predicted linkages between observed measures and the actual state of environmental health are difficult, if not impossible, to demonstrate. Modern environmental indicator and index development does not account fully for these uncertainties.

Statistical Issues in Environmental Sampling

The most reliable answers to scientific and policy questions are those based on data of known quality and on data collection and estimation or inferential methods of known precision and accuracy. In the environmental arena, lack of conformance of data to i.i.d.
and other assumptions raises the need for new statistical design and estimation approaches.

Environmental remediation is an expensive undertaking with significant public health consequences. The development of reliable, statistically powerful spatial sampling designs for site characterization is therefore important. Issues include inference on and modelling of spatial structure, the number of samples, choosing or balancing among random or grid-based designs and sequential sampling methods, assessing and ensuring the statistical power of tests, and accounting for uncertainty due to physical properties (e.g., granularity) of samples. Each of these questions is of current research interest. At the other end of the spectrum, current practice is often statistically wrong (e.g., use of t-tests based on faulty assumptions of normality, or failure to be cognizant of such assumptions), and it is incumbent upon statisticians to address these deficiencies and offer useful (ideally, simple) alternatives to practitioners.

Ecological sampling offers opportunities for the development and application of sequential, adaptive sampling methods. Most wildlife and fish populations move in groups, often in response to local conditions, and plant life develops and flourishes or decays within local ecosystems. Ecological sampling must take these factors into account in defining units for sampling and measurement and in developing and using statistical sampling strategies for the environment. Adaptive sampling (e.g., oversampling areas where units of interest are found) offers a suite of statistically reliable methods for doing so.

An emerging arena in statistical survey methodology is the design of human (and, potentially, ecological) exposure surveys. Human exposure surveys may be population based or conducted within cohorts (e.g., occupation groups) or regions of interest (e.g., in proximity to hazardous waste sites).
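The adaptive sampling idea mentioned above, oversample the neighbourhood whenever a sampled unit meets a condition of interest, can be sketched on a grid of counts. The grid, threshold, and four-neighbour rule below are all illustrative, a minimal form of adaptive cluster sampling rather than any particular field protocol.

```python
import random

def adaptive_cluster_sample(counts, n_initial, threshold, rng=random):
    """counts maps grid cell (row, col) -> observed count of the units
    of interest.  Start from a simple random sample of cells; whenever
    a sampled cell meets the condition (count >= threshold), add its
    four neighbours, repeating until no newly added cell qualifies."""
    cells = list(counts)
    sampled = set(rng.sample(cells, n_initial))
    frontier = [c for c in sampled if counts[c] >= threshold]
    while frontier:
        r, c = frontier.pop()
        for nb in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if nb in counts and nb not in sampled:
                sampled.add(nb)
                if counts[nb] >= threshold:
                    frontier.append(nb)   # the cluster keeps growing
    return sampled
```

The design-unbiased estimators that go with such samples (e.g., Horvitz-Thompson type weights for the induced networks) are the part that makes adaptive sampling "statistically reliable"; the sketch covers only the selection step.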
In addition to questionnaire data and data from administrative sources, these surveys typically involve taking measurements from subjects' environments and/or from the subjects themselves. This forces both per-subject costs and respondent burden to be high, raising serious statistical reliability and response bias issues that statisticians need to solve through the development of new or refined survey methodology. As a paradigm for sampling design, adaptive sampling methods offer the potential to improve both the statistical efficiency and the cost effectiveness of these surveys.

Use of Environmental Models

Much environmental decision making and scientific study is based on computer models of environmental phenomena. Some of these models are statistical (e.g., regression models that adjust ambient ozone concentration data for local meteorological effects, or that adjust daily nonaccidental death counts for ambient concentrations of particulate matter), but most are not (e.g., regional models of ozone formation and transport based on atmospheric chemistry, such as the USEPA Regional Oxidant Model, and pharmacokinetic models to assess dose-response based on biochemistry). Both types of models pose interesting and important statistical questions and applications.

Regression-based statistical models are being used more frequently to identify relationships and account for uncertainty in environmental observations. A classical issue is the overinterpretation of regression results and statistical significance, usually by nonstatisticians: at what point are statistical methods being used to overinterpret data, confusing quantitative artifacts for true relationships, or missing the larger, process-driven picture in favor of isolated numerical findings lacking scientific plausibility?

An important issue in statistical modelling of environmental phenomena is the transportability of model findings and of models themselves.
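Both halves of the regression story above, a model that adjusts a pollutant series for a meteorological covariate, and the pooling of such site-specific coefficients into a more transportable summary, can be sketched in a few lines. All numbers and variable roles are hypothetical; the pooling shown is the simplest fixed-effect (inverse-variance) form, ignoring between-city heterogeneity.

```python
def ols(y, x):
    """Least-squares fit of y = a + b*x (e.g., ozone on temperature).
    Returns intercept, slope, and the standard error of the slope."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s2 = sum(r * r for r in resid) / (n - 2)   # residual variance
    return a, b, (s2 / sxx) ** 0.5             # se of the slope

def pool(slopes, std_errors):
    """Fixed-effect pooling of city-specific slopes: weight each city
    by the inverse of its sampling variance."""
    w = [1.0 / se ** 2 for se in std_errors]
    est = sum(wi * bi for wi, bi in zip(w, slopes)) / sum(w)
    return est, (1.0 / sum(w)) ** 0.5
```

A random-effects variant, which adds an estimated between-city variance component to each weight, is the usual next step when the city-specific slopes genuinely differ, which is exactly the meta-analytic question raised below.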
If, as is often the case, several different regression models are developed to assess the effects of particulate matter on mortality in each of several cities, then how can these results be combined statistically to draw more general (and, presumably, more powerful) conclusions? This calls for the development of meta-analytic methods outside the realm of controlled experiments. In addition, is it possible to develop "meta-models" that can be calibrated and transported from one situation (viz., one set of local conditions such as climate, presence of co-pollutants, etc.) to another? This would enable the comparison of environmental problems, such as mortality due to particulate matter, between different cities within a common, representative framework.

Non-statistical models of environmental processes typically have little or no ability to account for uncertainty. Often reliant on numerical solutions to complex systems of differential equations, these models are wholly deterministic and unable to account for the sensitivity of outputs to inputs. Sensitivity is often not examined at all, even empirically, nor are models validated to identify and account for systematic biases. Often this is due to the high cost of running the models. Statistical designs for cost-effective model validation experiments are needed. Also needed are statistical methods, incorporated in the models themselves, to estimate uncertainty and the effects of the propagation of uncertainty. Experimental designs of the kind used in process optimization would be useful for optimizing model performance.

Use of Encountered Data

One of the most challenging areas for environmental statistics is the development of methods that facilitate the reuse of existing data. How do we use information for purposes other than those for which it was originally collected? How can one data set be used to validate another? How and when should missing or faulty environmental data be replaced by imputed values?
How do we take spatially and temporally disparate data sets that purport to measure the same things and combine them into "synthetic" data sets for use in decision making and regulatory standard setting?

For example, the USEPA and the University of Maryland have been working on the evaluation of the attainment of restoration goals for dissolved oxygen (DO) in the Chesapeake Bay, using a statistical method to combine monitoring station and buoy data. Dissolved oxygen is an essential element in maintaining viable conditions for living resources. The Chesapeake Bay Executive Council, comprising representatives from EPA, the Chesapeake Bay Commission, the District of Columbia, and the states of Maryland, Pennsylvania and Virginia, established goals for restoration of the Bay. On the basis of extensive laboratory and field research, the standards for dissolved oxygen were set as:

  Target DO Concentration        Time and Location
  DO > 1.0 mg/L                  At all times, everywhere
  1.0 mg/L < DO < 3.0 mg/L       For no longer than 12 hours; interval between
                                 excursions at least 48 hours, everywhere
  Monthly mean DO > 5.0 mg/L     At all times, throughout upper layer waters
  DO > 5.0 mg/L                  At all times, throughout upper layer, in spawning
                                 reaches, spawning rivers, and nursery areas

The restoration standards are time dependent; however, most of the monitoring data collected for DO on the Bay are gathered monthly or weekly at fixed sites. During the summer months, continuous monitoring of DO (every 15 minutes) was conducted at selected sites. Continuous monitoring of DO is extremely expensive and cannot be done year round. The challenge to the University of Maryland and EPA statisticians was to develop an approach in which station data (intermittent monthly data) and limited buoy data (continuous) could be combined into a single synthetic data base containing the long term trend properties of the station data and short term behavior similar to the buoy data.
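The splice just described can be caricatured in a few lines. The actual work used spectral (frequency-domain) methods; the sketch below is a much cruder time-domain analogue with made-up numbers, in which sparse station means supply the low-frequency trend and the mean-removed shape of a short buoy record supplies the high-frequency, within-day variation.

```python
import math

def synthetic_series(monthly_means, buoy_day, hours_per_month=720):
    """Splice long-term and short-term signals from two data sources.
    monthly_means: one station DO value per month (long-term trend).
    buoy_day: 24 hourly values from a short buoy deployment; its
    mean-removed shape supplies the diurnal variation only."""
    mean_b = sum(buoy_day) / len(buoy_day)
    anomaly = [v - mean_b for v in buoy_day]       # zero-mean diurnal shape
    series = []
    for base in monthly_means:                     # trend from station data
        for h in range(hours_per_month):
            series.append(base + anomaly[h % 24])  # tile the diurnal shape
    return series
```

Because the tiled anomaly has mean zero, each synthetic month reproduces the station's monthly mean exactly while exhibiting buoy-like excursions, which is the property needed to check time-dependent standards such as the 12-hour excursion rule.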
The method of combining used spectral analysis techniques. Work is continuing on this synthesis process to develop a data set that can be used to assess progress towards achieving the restoration goals.

Environmental science and decision making need proper methods for combining environmental data, for several reasons. Direct data collection is time consuming and expensive. It can be prohibitively expensive, meaning that, absent the ability to combine existing or partial data, important phenomena may go unstudied. Often, situations that occur at one place and time cannot be reliably replicated elsewhere, requiring the combination of actual data from one place or time with "surrogate" data from another. This raises issues of the "transportation" of data, analogous to the transportability of models discussed previously. The benefits will come both in cost savings and in increased knowledge, reliability, and weight of evidence of conclusions. Related issues include combining administrative records data and monitoring data to assess socioeconomic impacts of pollution and environmental restoration, and using quality assurance data to validate monitoring data and adjust it for embedded systematic errors.

Most environmental data are not design-based, i.e., collected according to a probabilistic sampling design. For example, lake data might be collected from lakes within selected areas, large lakes, lakes regarded as being in the most degraded environmental condition, the most accessible lakes, or from lakes without specification as to how they were chosen. Such data are known as encountered or "found" data. Environmental scientists use encountered data effectively to study environmental processes. However, their use for environmental assessment is limited due to lack of quantifiable knowledge of selection bias, sampling variability, etc.
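One simple, if crude, device for reusing encountered data alongside a probability sample is statistical matching on shared frame variables. The toy example below uses hypothetical lake records and variable names: each unit in one sample is paired with its nearest neighbour in the other, and the extra variable is carried across to form synthetic "pseudo-units".

```python
def statistical_match(sample_a, sample_b, frame_vars):
    """Pair each record in sample A with the nearest record in sample B
    on the shared frame variables, then carry B's extra variables over
    to A, forming pseudo-units."""
    def dist2(u, v):
        return sum((u[k] - v[k]) ** 2 for k in frame_vars)
    matched = []
    for rec_a in sample_a:
        rec_b = min(sample_b, key=lambda rec: dist2(rec_a, rec))
        merged = dict(rec_a)                                  # keep A's values
        merged.update({k: v for k, v in rec_b.items() if k not in rec_a})
        matched.append(merged)
    return matched
```

In practice the frame variables would be standardized before computing distances, and the induced matching error would need to be reflected in any variance statement, which is precisely where the methods discussed next come in.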
Prior to the advent of design-based approaches, environmental data were often modelled statistically using regression, spatial, or time series methods. It was unclear how to combine such data (e.g., combining lake data between states within a geographic region), and combination was often not attempted. However, we are now in a situation where considerable resources have been expended by society to amass volumes of environmental data, some of which are now design based. Statistical methods are needed to combine encountered and design-based data, both within and across the two types. Various statistical approaches are available for these problems, each with its own strengths and weaknesses. Under appropriate circumstances, probability samples can be combined directly. If probability and encountered samples share frame variables, regression can be used to predict sample values for variables observed in one sample but not in the other. Synthetic "pseudo-units" can be formed by statistically matching sample units across two data sets. Perhaps most reliable in general, methods such as dual frame estimation or minimum variance weighting can be used to combine estimates (in lieu of direct combination of data) between two data sets. And weighted distribution functions can be used to adjust for bias and normalize data for combination.

Issues for Environmental Statistics in Decision Making and Public Policy

Public policy requires decision making which incorporates a number of variables, ranging from objective measures of the state of the environment to social and political aspects of the policy outcome. The statistical sciences provide the objective measures of the state of the environment: the quantitative bases for public policy decision making. In the public policy arena, decision making uses objective information, but is not necessarily driven by it. Consider an environmental manager charged with making a decision about the construction of a dam on a major river.
To determine the impact of such a decision on the environment would require the collection and analysis of large amounts of information, a process that, if not already underway, could require years. Yet the decision needs to be made within three months. The challenge to statisticians is to look for ways to use what is available, good, bad or indifferent, to the best advantage possible, and to provide decision makers with the best information within the needed time frame.

Environmental risk assessment

Public policy and environmental decision making require that some form of risk assessment be done to provide a quantitative basis for cost/benefit analysis and decision making. Indeed, the limited funding for environmental protection leads environmental managers to rely more and more on what are called "comparative risk assessments". Assessment of environmental risk is a multi-disciplinary undertaking involving information from ecological studies, chemistry, meteorology, statistics, biology, etc. Current methods for estimating risks and defining safe levels of exposure do not take full advantage of the data and information from the different disciplines. Indeed, at the local community level, comparative risk projects give little attention to the use of statistical methods as a means to organize and analyze information provided by the different disciplines. Statistical consideration of methods of sampling, predictive correlations using appropriate stochastic models, and use of multivariate models for assigning risk measurements need to be developed and incorporated into the comparative risk process.

Uncertainty in risk analysis must be addressed. The multiple stages in assessing risk give rise to a cascading of uncertainty. However, in most studies of environmental risk, the endpoint is presented as a point estimate without any associated uncertainty analysis.
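The cascading of uncertainty through the stages of a risk assessment can be made explicit with a simple Monte Carlo sketch. Every number below (the emission rate, transfer factor, potency, and their spreads) is invented for illustration; the point is only that the endpoint emerges as a distribution rather than a point estimate.

```python
import math
import random

def risk_mc(n=20000, seed=1):
    """Propagate uncertainty through a multiplicative three-stage chain:
    emission (kg/day) -> ambient concentration (ug/m3) -> lifetime risk.
    Each stage factor is lognormal, so stage uncertainties cascade
    multiplicatively; report the median and a 90% interval."""
    rng = random.Random(seed)
    risks = []
    for _ in range(n):
        emission = rng.lognormvariate(math.log(100.0), 0.3)   # source term
        transfer = rng.lognormvariate(math.log(1e-3), 0.5)    # dispersion
        potency  = rng.lognormvariate(math.log(1e-5), 0.8)    # dose-response
        risks.append(emission * transfer * potency)
    risks.sort()
    return risks[n // 2], (risks[int(0.05 * n)], risks[int(0.95 * n)])
```

Even with modest uncertainty at each stage, the 90% interval on the endpoint spans more than an order of magnitude, which is exactly the information a point estimate suppresses.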
Statistical approaches to uncertainty analysis incorporating this cascading effect need to be developed and applied.

Reporting on the State of the Environment

Environmental managers and policy makers would like to have a crystal ball that summarizes ecosystem status and predicts future states. Whatever the immediate practicality of such diverse expectations, we need much better approximations of environmental knowledge: environmental indicators. Statisticians must consider the community which will use these indicators. Environmental indicators (like economic indicators) are useful to a variety of individuals: political officials, their staffs, program planners and assessors, contractors, researchers, environmentalists, educators, market analysts, students and the general public. Most of this audience lacks formal training in statistics or the environmental sciences. This has implications for statisticians in the design and presentation of indicators.

Public Health

The link between human health and the environment has become an important issue. With the occurrences of Love Canal, Times Beach, secondary smoke and the deterioration of the stratospheric ozone layer, the public has become keenly aware that continued degradation of the environment will lead to serious health problems for this and future generations. Statistical and epidemiological methods and research are needed to obtain a better understanding of the complex relationships between human health, ecological health and pollution. These relationships are not based on standardized sets of observations or easily obtainable data. The present tools of biostatistics and epidemiology are inadequate to deal with many problems of environmental health. These areas pose unusual sampling problems, produce data that often are not normally distributed, or pose problems for which adequate models have not been developed. Standard multivariate analysis does not provide a sampling frame to account for fine-mesh distribution.
Public Access

With the advent of the "information highway," the public is being provided unprecedented access to environmental data collected by Federal, State and Local organizations. Unfortunately, the free economy philosophy of "caveat emptor" cannot hold. Much of the raw data becoming available has a number of serious problems relating to data quality and definition. If these data are to be made available to the public, then it is the responsibility of environmental statisticians to provide the public with the capability to turn the data into information, or to make appropriate judgements on the correct use of the data. The release of environmental data/information under the "buyer beware" principle is irresponsible and will lead to misinformation and costly mistakes in assessing the state and health of the environment. Access needs to be given to the public; however, the public must be educated on how to use and understand data which are uncertain and often biased. Environmental data providers must ensure that appropriate "meta-data" are available to allow this "educated" public to appropriately use and interpret the data/information being released.

This conundrum and the associated statistical issues are exemplified by the USEPA Toxic Release Inventory (TRI). Under section 313 of Title III of the Superfund Amendments and Reauthorization Act (SARA), beginning in 1987 companies that employ more than ten employees and that produce more than 25,000 pounds, or use more than 10,000 pounds, per year of substances on the TRI list are required to report annual releases and transfers of TRI chemicals to the USEPA. In turn, the USEPA is required to make this information available to the public on a site-identifiable basis. TRI data are now available to the public, but only in their "raw" form, with no meta information.
A number of information services have downloaded the TRI data bases and are providing summary statistics, time series, and interpretations of changes as if the data were of known quality. In fact, the quality of the data is unknown: TRI data are self-reported and there are no standards for reporting. Some of the data are observational, some are model generated, and some are "best guess". The public has no way of knowing which is which, or what comparisons are legitimate, if any. With all these problems, the release of TRI has been an environmental information success. The public is using the information to effect change. Companies are beginning to realize that the data they provide will be used, and that they need to be more careful in data measurement and generation. Statisticians can play an important role in developing appropriate methods for using such data, and in its display and visualization in a manner that allows the public to make more informed decisions.

Summary

We have discussed several areas where statistical methods are central to environmental science and decision making. Solutions to these problems, and proliferation of the use of these methods, will improve the quality and usefulness of environmental data and decision making. The interface of statistics and regulatory policy requires the development and use of new and innovative methods which can provide environmental managers with the quantitative component of their decision making process. Challenges lie in the application of sound statistical methods to combine existing environmental data and in the development of cost effective methods for the monitoring, analysis and display of primary data.

TECHNICAL REPORT DATA

Report No.: EPA/600/A-96/018
Title and Subtitle: Statistical Issues in Environmental Monitoring and Assessment of Anthropogenic Pollution
Authors: Lawrence H. Cox; N. Phillip Ross
Performing Organizations: U.S. Environmental Protection Agency, National Environmental Research Laboratory, Research Triangle Park, NC 27711; U.S. Environmental Protection Agency, ESID/OPPE, Washington, DC 20460
Sponsoring Agency: U.S. Environmental Protection Agency, National Environmental Research Laboratory, Research Triangle Park, NC
Supplementary Notes: Spruce III Conference keynote paper, to be published as a book chapter
Distribution Statement: Release to Public
Security Class: Unclassified