EPA
United States
Environmental Protection
Agency
Office of Water
4904
EPA-822-B-97-002
Biological Criteria:
Technical Guidance For
Survey Design and Statistical
Evaluation of Biosurvey
Data
-------
BIOLOGICAL CRITERIA
Technical Guidance for Survey Design and
Statistical Evaluation of Biosurvey Data
Prepared for EPA by Tetra Tech, Inc.
Principal authors: Kenneth H. Reckhow, Ph.D. and
William Warren-Hicks, Ph.D.
George Gibson, Jr., Ph.D.
Office of Science and Technology
Project Leader
Health and Ecological Criteria Division
Office of Water
U.S. Environmental Protection Agency
Washington, D.C. 20460
December 1997
-------
Acknowledgements
This document was developed by the United States Environmen-
tal Protection Agency, Office of Science and Technology, Health and
Ecological Criteria Division.
This text was written by Kenneth H. Reckhow, Ph.D., and William
Warren-Hicks, Ph.D. Jeroen Gerritsen, Ph.D., of Tetra Tech, Inc. pro-
vided editorial and technical support. George R. Gibson, Jr., Ph.D., of
USEPA was Project Leader and co-editor.
-------
Disclaimer
This manual provides technical guidance to States, Indian
Tribes, and other users of biological criteria to assist with survey de-
sign and statistical evaluation of biosurvey data. While this manual
constitutes EPA's scientific recommendations regarding survey de-
signs and statistical analyses, it does not substitute for the CWA or
EPA's regulations; nor is it a regulation itself. Thus, it cannot impose
legally binding requirements on the EPA, States, Indian Tribes, or
the regulated community, and might not apply to a particular situa-
tion or circumstance. EPA may change this guidance in the future.
-------
CONTENTS
Foreword vii
CHAPTER 1. The Biological Criteria Program and Guidance Documents 1
The Concept of Biological Integrity 1
Narrative and Numeric Biological Criteria 1
Biological Criteria and Water Resource Management 2
An Overview of this Document 2
CHAPTER 2. Classical Statistical Inference and Uncertainty 3
Formulating the Problem Statement 3
Basic Statistics and Statistical Concepts 3
Descriptive Statistics 3
Recommendations 4
Uncertainty 6
Statistical Inference 7
Interval Estimation 7
Hypothesis Testing 7
Common Assumptions 8
Parametric Methods: the t test 9
Nonparametric Tests: the W test 10
Example: an IBI case study 11
Conclusions 12
CHAPTER 3. Designing the Sample Survey 15
Critical Aspects of Survey Design 15
Variability 15
Representativeness and Sampling Techniques 15
Cause and Effect 16
Controls 16
Key Elements 17
Pilot Studies 17
Location of Sampling Points 18
Location of Control Sites 19
Estimation of Sample Size 19
Important Rules 20
CHAPTER 4. Detecting Mean Differences 21
Cases Involving Two Means 21
Random sampling model, external value for σ 21
Random sampling model, internal value for σ 22
Testing against a Numeric criterion 22
A Distribution-Free Test 23
Evaluating Two-Sample Means Testing 23
Multiple Sample Case 23
Parametric or Analysis of Variance Methods 23
Nonparametric or Distribution Free Procedures 25
Testing for Broad Alternatives 25
The Kolmogorov-Smirnov Two-Sample Test 26
Relationship of Survey Design to Analysis Techniques 27
CHAPTER 5. Discussion and Examples 29
Working with Small Sample Sizes 29
Assessments Involving Several Indicators 30
Regional Reference Data 31
Using Background Variability Measures 32
Final suggestions for Small Sample Sizes 32
Decision Analysis and Uncertainty 33
APPENDIX A. Basic Statistics and Statistical Concepts 35
Measures of Central Tendency 35
Mean 35
Median 35
Trimmed Mean 35
Mode 36
Geometric Mean 36
Measures of Dispersion 36
Standard Deviation 36
Absolute Deviation 36
Interquartile Range 36
Range 37
Resistance and Robustness 37
Graphic Analyses 37
Histograms 37
Stem and Leaf Displays 39
Box and Whisker Plots 40
Bivariate Scatter Plots 41
References 43
-------
LIST OF TABLES
TABLE PAGE
2.1. Measures of Central Tendency 4
2.2. Measures of Dispersion 5
2.3. Useful Graphical Techniques 5
2.4. Possible Outcomes from Hypothesis Testing 7
3.1. Number of samples needed to estimate the true mean (low extreme) 19
3.2. Number of samples needed to estimate the true mean (high extreme) 20
4.1. Descriptive Statistics: Upstream-Downstream Example 21
4.2. Assumptions, Advantages, and Disadvantages Associated with Various Two-Sample Means
Testing Procedures 24
4.3. Analysis of Variance Results for the Case Study Model 25
4.4. LSD Multiple Comparison Test 25
4.5. Duncan's Multiple Comparison Test 25
4.6. Tukey's Multiple Comparison Test 25
4.7. Survey Design and Analysis Techniques 27
5.1. Biological Indices and biocriteria 30
A.1. IBI Data 38
LIST OF FIGURES
FIGURE PAGE
2.1. Sampling Distributions under Different Hypotheses 13
3.1. Random Sample Design having both Temporal and Spatial Dimensions 17
4.1. Cumulative Distribution functions of upstream and downstream sites 26
5.1. IBI Distributions for reference and impacted sites 33
A.1. IBI Histogram 38
A.2. IBI Histogram with Ten-Unit Interval Size 39
A.3. IBI Histogram with Two-Unit Interval Size 39
A.4. Histogram for Log(IBI) 40
A.5. Histogram for Log(IBI): Alternative Scale 40
A.6. Stem and Leaf Display 40
A.7. Box and Whisker Plots 41
A.8. Stream IBI Box Plot 41
A.9. IBI Bivariate Plot 42
-------
FOREWORD
Biological Criteria: Technical Guidance for Survey
Design and Statistical Evaluation of Biosurvey
Data, by Kenneth H. Reckhow and William War-
ren-Hicks, was prepared for the U.S. Environmental
Protection Agency to help states develop their biolog-
ical criteria for surface waters and specifically to help
water resource managers assess the reliability of their
data. A good biological criteria program will be practi-
cal and cost effective, but above all it will be predi-
cated on valid and scientifically sound information.
The application of the concepts and methods
of statistics to the biological criteria process en-
ables us ". . . to describe variability, to plan research
so as to take variability into account, and to analyze
data so as to extract the maximum information and
also to quantify the reliability of that information"
(Samuels, 1989).
This initial guidance document is intended to
reintroduce statistics to the natural resources man-
ager who may not be current in the application of
this tool (and our ranks are legion, we just don't like
to admit it). The emphasis is on the practical appli-
cation of basic statistical concepts to the develop-
ment of biological criteria for surface water resource
protection, restoration, and management. Subse-
quent guides will be developed to expand on and
refine the ideas presented here.
Address comments on this document and sug-
gestions for future editions to George Gibson, U.S.
Environmental Protection Agency, Office of Water,
Office of Science and Technology (4304), 401 M
Street, S.W., Washington, D.C. 20460.
-------
CHAPTER 1 The Biological Criteria Program
and Guidance Documents
Efforts to measure and manage water quality in
the United States are an evolving process. Since
its simple beginning more than 200 years ago, water
monitoring has progressed from observations of the
physical impacts of sediments and flotsam to chemi-
cal analyses of the multiple constituents of surface
water to the relatively recent incorporation of biologi-
cal observations in systematic evaluations of the re-
source. Further, although biological measurements of
the aquatic system have been well-established proce-
dures since the Saprobic system was documented at
the turn of this century, such information has only re-
cently been incorporated into the nation's approach
to water resource evaluation, management, and pro-
tection.
The U.S. Environmental Protection Agency
(EPA) is charged in the Clean Water Act (Pub. L. 100-4,
§101) "to restore and maintain the chemical, physi-
cal, and biological integrity of the Nation's waters." To
incorporate biological integrity into its monitoring
program, the Agency established the Biological Crite-
ria Program in the Office of Water.
This program provides technical guidance to the
states for measuring biological integrity as an aspect
of water resource quality. Biological integrity comple-
ments the physical and chemical factors already used
to measure and protect the nation's surface water re-
sources. Eventually all surface water types will be in-
cluded in program technical guidance, including
streams, rivers, lakes and reservoirs, wetlands, estu-
aries, and near coastal marine waters.
States will use this information to establish bio-
logical criteria or benchmarks of resource quality
against which they may assess the status of their wa-
ters, the relative success of their management efforts,
and the extent of their attainment or noncompliance
with regulatory conditions or water use permits.
These criteria are intended to augment, not replace,
other physical and chemical methods, to help refine
and enhance our water protection efforts.
The Concept of Biological
Integrity
Biological integrity is the condition of the aquatic
community inhabiting unimpaired waterbodies of a
specified habitat as measured by community struc-
ture and function (U.S. Environ. Prot. Agency, 1990).
Essentially, the concept refers to the naturally
dynamic and diverse population of indigenous organ-
isms that would have evolved in a particular area if it
had not been affected by human activities. Such in-
tegrity or naturally occurring diversity becomes the
primary reference condition or source of biological
criteria used to measure and protect all waterbodies
in a particular region.
Only the careful and systematic measuring of
key attributes of the natural aquatic ecosystem and its
constituent biological communities can determine
the condition of biological integrity. These key attrib-
utes or biological endpoints indicate the quality of the
waters of concern. They are established by biosurveys,
that is, by analyses based on the sampling of fish, inverte-
brates, plants, and other flora and fauna. Such
biosurveys establish the endpoints or measures used
to summarize several community characteristics
such as taxa richness, numbers of individuals, sensi-
tive or insensitive species, observed pathologies, and
the presence or absence of essential habitat elements.
The careful selection and derivation of these
measures (hereafter, metrics), together with detailed
habitat characterization, is essential to translate the
concept of biological integrity into useful biological
criteria. That is, the quantitative distillation of the
survey data makes it possible to compare and contrast
several waterbodies in an objective, systematic, and
defensible manner.
Narrative and Numeric Biological
Criteria
Two forms of biological criteria are used in EPA's sys-
tem of water resources evaluation and management.
Narrative biological criteria are general statements
of attainable or attained conditions of biological in-
tegrity and water resource quality for a given use des-
ignation. They are qualitative statements of intent,
promises formally adopted by the states to protect
and restore the most natural forms of the system. Nar-
rative criteria frequently include statements such as
"the waters are to be free from pollutants of human or-
igin in so far as achievable," or "to be restored and
maintained in the most natural state." The statements
must then be operationally defined and implemented
by a designated state agency.
Numeric criteria are derived from and predicated
on the same objective status as narrative criteria,
which are then retained as preliminary statements of
intent. The difference between the two is that the
qualitative statement of integrity, the condition to be
protected or restored, is refined by the inclusion of
quantitative (numeric) endpoints as specific compo-
nents of the criteria. Compliance with numeric crite-
ria involves meeting stipulated thresholds or
quantitative measures of biological integrity.
The formal adoption of criteria of either type into
state law (with EPA concurrence) makes the criteria
"standards." They are then applicable and enforce-
able under the provisions of the Clean Water Act.
Biological Criteria and Water Resource
Management
Because these criteria will become the basis for re-
source management and possible regulatory actions,
the manner of their design is of utmost importance to
the states and EPA. The choice of metrics to represent
and measure biological integrity is the responsibility
of ecologists, biologists, and water resource manag-
ers. The Agency's role is to continue to develop tech-
nical guidance documents and manuals to assist in
this process.
The purpose of this document is to present meth-
ods that will help managers interpret and gage the
confidence with which the criteria can be used to
make resource management decisions. Using this
guidance, both the technician and the policymaker
can objectively convert data into management infor-
mation that will help protect water resources. How-
ever, the use and limits of the information must be
clearly understood to ensure coordination and mu-
tual cooperation between science and management.
An Overview of this Document
The focus of this document is on the basic statistical
concepts that apply within the biocriteria program.
From the program's inception, the problem statement,
survey design, and the statistical methods used in the
analysis must be correlated to provide functional re-
sults. Accordingly, chapter 2 begins with formula-
tions of the problem statement (the focused objec-
tive that helps narrow the scope of observations in the
ecosystem to those necessary to predict the status and
impairment of the biota) and culminates in a discus-
sion of hypothesis testing, the approach advocated in
this guidance document. Chapter 2 also refers begin-
ners to Appendix A for a succinct review of the basic
statistics and statistical concepts used within the
chapter and throughout this document.
Chapter 3 presents key issues associated with
the design of the sample survey. Surveys are without
doubt the critical element in an environmental as-
sessment. Designs that minimize error, uncertainty,
and variability in both biological and statistical mea-
sures have a great effect on decision makers. This
chapter explores the difference between classical and
experimental design and the issues involved with
random, systematic, and stratified samples. Sample
sizes and how to proceed in confusing circumstances
round out the discussion.
Chapter 4 deals with problems that arise from
hypothesis testing methods based on detecting the
mean differences arising from two or more independ-
ent samples. The use and abuse of means testing pro-
cedures is an important topic. It should generally be
keyed to the survey design, but other information
should also be taken into consideration because er-
rors of interpretation often involve assumptions
about data.
Chapter 5 is a further discussion, with examples,
of the basic concepts introduced in earlier chapters.
Though hypothesis testing is generally preferred, this
chapter discusses circumstances in which other pro-
cedures may be useful. It also introduces the role of
cost-benefit assumptions in decision analysis and the
limits of data collection and interpretation in the de-
termination of causality. The reader should recall at
all times the basic nature of this document. Advanced
practitioners may look to the references used in pre-
paring this document for additional options and dis-
cussion.
-------
CHAPTER 2 Classical Statistical Inference
and Uncertainty
Before the biological survey can be designed and
linked to statistical methods of interpretation, an
exact formulation of the problem is needed to narrow
the scope of the study and focus investigators on col-
lecting the data. The choice of biological and chemi-
cal variables should be made early in the process, and
the survey design built around that selection. Fancy
statistics and survey designs may be appropriate, but
biologically defined objectives should dominate and
use the statistics, not the reverse (Green, 1979).
Formulating the Problem
Statement
A clear statement of the objective or problem is the
necessary basis on which the biological survey is de-
signed. A general question such as "does the effluent
from the municipal treatment plant damage the envi-
ronment?" does little to help decision makers. Con-
sider, however, their response to a more specific
statement: "Is the mean abundance of
young-of-the-year green sunfish caught in seines
above the discharge point greater (with an error rate of
5 percent) than those similarly trapped downstream
of the discharge point?" The precise nature of this
question makes it a clear guide for the collection and
interpretation of data.
The problem statement should minimally in-
clude the biological variables that indicate environ-
mental damage, a reference to the comparisons used
to determine the impact, and a reference to the level of
precision (or uncertainty) that the investigator needs
to be confident that an impact has been determined.
In the preceding example, green sunfish are the bio-
logical indicator of impact, upstream and down-
stream seine data are the basis of comparison, and an
error rate of 5 percent provides an acceptable level of
uncertainty.
The problem statement, the survey design, and
the statistical methods used to interpret the data are
closely linked. Here, the survey design is an up-
stream/downstream set of samples with the upstream
data providing a reference for comparison. A t test or
rank sign test may be used to test for mean differences
between the sites.
From a statistical standpoint, the biological vari-
ables (measures) used to show damage should have
low natural variability and respond sharply to an im-
pact relative to any sampling variability. Natural vari-
ability contributes to the uncertainty associated with
their response to an impact. Lower natural variability
permits reliable inferences with smaller sample sizes.
Examining historical data is an excellent means
of selecting biological criteria that are sensitive to en-
vironmental impacts. Species that exhibit large natu-
ral spatial and temporal variations may be suitable
indicators of environmental change only in small
time scales or localized areas. If so, the use of such
variables will limit the investigator's ability to assess
environmental change in long-term monitoring pro-
grams. Historical data, combined with good scientific
judgment, can be used to select biological criteria that
exhibit minimal natural variability within the context
of the site under evaluation.
Basic Statistics and Statistical
Concepts
When a data set is quite small, the entire set can be re-
ported. However, for larger data sets, the most effec-
tive learning takes place when investigators
summarize the data in a few well-chosen statistics.
The choice to trade some of the information available
in the entire set for the convenience of a few descrip-
tive statistics is usually a good one, provided that the
descriptive statistics are carefully selected and cor-
rectly represent the original data.
Some descriptive statistics are so commonly
used that we forget that they are but one option among
many candidate statistics. For example, the mean and
the standard deviation (or variance) are statistics used
to estimate the center of a data set and the spread on
those data. The scientist who uses these statistics has
already decided that they are the best choices to de-
scribe the data. They work very well, for example, as
representatives of symmetrically distributed data that
follow an approximately normal distribution. Thus,
their use in such circumstances is entirely justified.
However, in other situations involving biological
data, alternative descriptive statistics may be pre-
ferred.
Descriptive Statistics
Before selecting a descriptive statistic, the scientist
must understand the purpose of the statistic. Descrip-
tive statistics are often used in biological studies be-
cause the convenience of a few summary numbers
outweighs the loss of information that results from
not using the entire data set. Nevertheless, as much
information as possible must be summarized in the
descriptive statistics because the alternative may in-
volve a misrepresentation of the original data.
The basic statistics and statistical techniques
used in this chapter are further defined, described,
and illustrated in the appendix to this document (Ap-
pendix A). Readers unfamiliar with descriptive statis-
tics and graphic techniques should read Appendix A
now and use it hereafter as a reference. Other readers
may proceed directly to the tables in this chapter,
which summarize the advantages and disadvantages
of the statistical estimators and techniques described
in the appendix.
The common measures of the center, or central
tendency, of a data set are the mean, median, mode,
geometric mean, and trimmed mean. None of these
options is the best choice in all situations (see Table
2.1), yet each conveys useful information. The points
raised in Table 2.1 are not comprehensive or absolute;
they do, however, reflect the authors' experience with
these estimators.
Environmental contaminant concentration data
are strictly positive, and sample data sets exhibit
asymmetry (i.e., a few relatively high observations).
Therefore, a transformation, in particular, the loga-
rithmic transformation, should be applied to concen-
tration and other data that exhibit these characteris-
tics before analysis. When a transformation is used,
data analysis and estimation occur within the trans-
formed metric; if appropriate, the results may be con-
verted back to the original metric for presentation.
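As an illustration of this transform-and-back-transform workflow (not part of the original guidance), the short Python sketch below log-transforms a hypothetical, right-skewed concentration sample, computes the mean and a 95 percent confidence interval in the log metric, and converts the results back to the original units as a geometric mean and interval.

    import numpy as np
    from scipy import stats

    # Hypothetical concentration data: strictly positive and right-skewed
    conc = np.array([1.2, 0.8, 2.5, 1.1, 0.9, 4.7, 1.6, 2.2, 0.7, 3.1])

    log_conc = np.log(conc)                    # analyze in the transformed metric
    n = len(log_conc)
    mean_log = log_conc.mean()
    se_log = log_conc.std(ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf(0.975, df=n - 1)      # two-sided 95 percent interval

    lo, hi = mean_log - t_crit * se_log, mean_log + t_crit * se_log

    # Back-transform for presentation: exp of the log mean is the geometric mean
    print("geometric mean:", np.exp(mean_log))
    print("95% interval (original units):", np.exp(lo), np.exp(hi))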
A measure of dispersion (spread or variability)
is another commonly reported descriptive statistic.
Common estimators for dispersion are standard devi-
ation, absolute deviation, interquartile range, and
range. These estimators are defined, described, and il-
lustrated with examples in the appendix; Table 2.2
summarizes when and how they may be used.
Table 2.3 summarizes four of the most useful
univariate and bivariate graphic techniques, includ-
ing histograms, stem and leaf displays, box and whis-
ker plots, and bivariate plots. These methods are also
illustrated in Appendix A.
Recommendations
There is no rigorous theoretical or empirical support
for using the normal distribution as a population
model for chemical and biological measures of water
quality or as a model for errors. Instead, the evidence
supports using the lognormal model. However, uncer-
tainty about the correctness of the lognormal model
suggests that prudent investigators will recommend
estimators that perform well even if an assumed
model is wrong.
Table 2.1 Measures of central tendency.

Mean
  Advantages: Most widely known and used choice; easy to explain.
  Disadvantages: Not resistant to outliers; not as efficient(1) as some
    alternatives under deviations from normality.
  Consider for use when: The sample mean is required; the distribution is
    known to be normal; the distribution is symmetric.
  Should not use when: Outliers may occur; the distribution is not symmetric.

Median
  Advantages: Easy to explain; easy to determine; resistant to outliers.
  Disadvantages: Not as efficient as the mean under normality.
  Consider for use when: The sample median is required; outliers may occur.
  Should not use when: More efficient options are appropriate.

Mode
  Advantages: Easy to explain; easy to determine.
  Disadvantages: Not as efficient as the mean under normality.
  Consider for use when: The most frequently observed value is required;
    the data are discrete or can be discretized.
  Should not use when: More efficient options are appropriate.

Geometric Mean
  Advantages: Appropriate for certain skewed (lognormal) distributions.
  Disadvantages: Not as easy to explain as the first three.
  Consider for use when: The distribution appears lognormal.
  Should not use when: More widely known estimators are appropriate.

Trimmed Mean
  Advantages: Resistant to outliers.
  Disadvantages: Not as easy to explain as the first three.
  Consider for use when: Outliers may occur and estimator efficiency is desired.
  Should not use when: More widely known estimators are appropriate.

(1) Higher efficiency means lower standard error.
Table 2.2 Measures of dispersion.

Standard Deviation
  Advantages: Most widely known; routinely calculated by statistics packages.
  Disadvantages: Strongly influenced by outliers; not as efficient(1) as some
    alternatives under even slight deviations from normality.
  Consider for use when: The sample standard deviation is required; the
    distribution is known to be normal.
  Should not use when: Outliers may occur; the sample histogram is even
    slightly more dispersed than a normal distribution.

Median Absolute Deviation
  Advantages: Resistant to outliers.
  Disadvantages: Not as efficient as the standard deviation under normality.
  Consider for use when: Outliers may occur.

Interquartile Range
  Advantages: Resistant to outliers; relatively easy to determine.
  Disadvantages: Not as efficient as the standard deviation under normality.
  Consider for use when: Outliers may occur.

Range
  Advantages: Easy to determine.
  Disadvantages: Not as efficient.
  Consider for use when: The range is required.
  Should not use when: Any of the above options is appropriate.

(1) Higher efficiency means lower standard error.
Table 2.3 Useful graphic techniques.

Histogram
  Features: Bar chart for data on a single (univariate) variable; shows the
    shape of the empirical distribution.
  Useful for: Visual identification of distribution shape, symmetry, center,
    dispersion, and outliers.

Stem and Leaf Display
  Features: Same as histogram; presents the numeric values in the display.
  Useful for: Same as histogram.

Box and Whisker Plot
  Features: Display of order statistics (extremes, quartiles, and median);
    may be used to graph the same characteristic (e.g., variable) for several
    samples (e.g., different sampling sites).
  Useful for: Visual identification of distribution shape, symmetry, center,
    dispersion, and outliers (single sample); comparison of several samples
    for symmetry, center, and dispersion.

Bivariate Plot
  Features: Scatter plot of data points (variable x versus variable y).
  Useful for: Visual assessment of the strength of a linear relationship
    between two variables; evidence of patterns, nonlinearity, and bivariate
    outliers.
Many books and articles have been written re-
cently concerning the theoretical and empirical evi-
dence in favor of nonparametric methods and robust
and resistant estimators. Books that consider alterna-
tive estimators of center and dispersion (e.g., Huber,
1981; Hampel et al. 1986; Key, 1983; Barnett and
Lewis, 1984; Miller, 1986; Staudte and Sheather,
1990) build a strong case for more robust estimators
than the mean and variance. Indeed, there is good evi-
dence (Tukey, 1960; Andrews et al. 1972) that the
mean and variance may be the worst choices among
the common estimators for error-contaminated data.
Several articles that involve comparisons of estima-
tors on real data (e.g., Stigler, 1977; Rocke et al. 1982;
Hill and Dixon, 1982) also favor robust estimators
over conventional alternatives.
As a consequence, the median and the trimmed
mean are recommended for the routine calculation of
a data set's central tendency. The interquartile range
and the median absolute deviation are recommended
for calculation of the dispersion. These suggestions
represent a compromise between robustness, ease of
explanation, and calculation simplicity. For the
trimmed mean, recommended amounts of trimming
range from 10 percent (Stigler, 1977) to over 20 per-
cent (e.g., Rocke et al. 1982). A critical argument in
support of the trimmed mean is that interval estima-
tion and hypothesis testing are still possible using the
t statistic (Tukey and McLaughlin, 1963; Dixon and
Tukey, 1968; Gilbert, 1987).
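For readers who want to compute these recommended statistics directly, the following sketch (Python with NumPy/SciPy, not part of the original guidance; median_abs_deviation requires a relatively recent SciPy) calculates the median, a 10 percent trimmed mean, the interquartile range, and the median absolute deviation for the ten-value IBI sample used later in this chapter.

    import numpy as np
    from scipy import stats

    ibi = np.array([29, 31, 26, 25, 34, 38, 33, 31, 28, 37])

    median = np.median(ibi)
    trimmed_mean = stats.trim_mean(ibi, proportiontocut=0.10)  # 10% cut from each tail
    q75, q25 = np.percentile(ibi, [75, 25])
    iqr = q75 - q25                                            # interquartile range
    mad = stats.median_abs_deviation(ibi)                      # median absolute deviation

    print(median, trimmed_mean, iqr, mad)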
Uncertainty
In statistics, uncertainty is a measure of confidence.
That is, uncertainty provides a measure of precision
it assigns the value of scientific information in eco-
logical studies. Scientific uncertainty is present in all
studies concerning biological criteria, but uncer-
tainty does not prevent management and decision
making. Rather, uncertainty provides a basis for se-
lecting among alternative actions and for deciding
whether additional information is needed (and if so,
what experimentation or observation should take
place).
In ecological studies, scientific uncertainty re-
sults from inadequate scientific knowledge, natural
variability, measurement error, and sampling error
(e.g., the standard error of an estimator). In the actual
analysis, uncertainty arises from erroneous specifica-
tion of a model or from errors in statistics, parameters,
initial conditions, inputs for the model, or expert
judgment.
In some situations, uncertainty in an unknown
quantity (e.g., a model parameter or a biological end-
point) may be estimated using a measure of variabil-
ity. Likewise, in some situations, model error may be
estimated using a measure of goodness-of-fit (predic-
tions versus observations) of the model. In many situ-
ations, a judicious estimate of uncertainty is the only
option; in these cases, careful estimation is an accept-
able alternative and methods exist to elicit these judg-
ments from experts (Morgan and Henrion, 1990).
In many studies, uncertainty is present in more
than one component (e.g., parameters and models), so
the investigator must estimate the combined effects of
the uncertainties on the endpoint. This exercise,
called error propagation, is usually undertaken with
Monte Carlo simulation or first-order error analysis.
The outcome of an uncertainty analysis is a prob-
ability distribution that reflects uncertainty on the
endpoint. However, uncertainty analysis may not al-
ways be the most useful expression of risk. Other ex-
pressions of uncertainty, such as prediction intervals,
confidence intervals, or odds ratios, are easier to un-
derstand and interpret. If important error terms are ig-
nored when a probability statement is made, the
investigator must report this omission. Otherwise,
the probability statement is not representative, and
the uncertainties are underestimated.
Since uncertainty provides a measure of preci-
sion or value, it can be used by decision makers to
guide management actions. For example, in some
cases the uncertainty in a biological impact may be
too large to justify management changes. As a conse-
quence, managers may defer action until additional
monitoring data can be gathered rather than require
pollutant discharge controls. If the uncertainty is
large and the estimated costs of additional pollutant
controls quite high, it may be wise either to defer ac-
tion or to look for smaller, relatively less expensive
abatement strategies for an interim period while the
monitoring program continues.
Though environmental planners at national,
state, and local levels have rarely considered uncer-
tainty in their planning efforts, their work has been
generally successful over the past 20 years. It is, how-
ever, certainly possible that more effective manage-
ment (that is, less costly, more beneficial
management) might have occurred if uncertainty
had been explicitly considered.
If overall uncertainty is ignored, the illusion pre-
vails that scientific information is more precise than it
actually is. As a consequence, we are surprised and
disappointed when biological outcomes are substan-
tially different from predictions. Moreover, if we don't
calculate uncertainty, we have no rational basis for
specifying the magnitude of our sampling program or
the resources (money, time, personnel) that should be
allocated to planning. Thus, decisions on planning
and analysis are more likely based on convention and
whim than on the logical objective of reducing scien-
tific uncertainty.
Statistical analysis is largely concerned with un-
certainty and variability. Therefore, uncertainty is an
important concept in this guidance manual. The anal-
yses presented here and in subsequent chapters are
based on particular measures of uncertainty, for ex-
ample, confidence intervals. These measures are "sta-
tistics"; they reflect data, and are not always
considered in the broader context of uncertainty
that is, as establishing the uncertainty in a quantity of
interest. We will, however, consider these statistics in
the broader sense, with concern for the theoretical is-
sues raised in this section. Particularly given the
small samples that often occur with biocriteria assess-
ments, investigators should ask the following ques-
tions:
• Do the data adequately represent uncertainty?
• Are all important sources of uncertainty represented in the data?
• Should expert scientific judgment be used to augment or correct measures of uncertainty?
• If components of uncertainty are ignored because they are not included in the data, are conclusions or decisions affected?
Statistical analysis is not a rote exercise devoid
of judgment.
Statistical Inference
Statistical inference is gained by two primary ap-
proaches: (1) interval estimation, and (2) hypothesis
testing. Interval estimation concerns the calculation
of a confidence interval or prediction interval that
bounds the range of likely values for a quantity of in-
terest. The end product is typically the estimated
quantity (e.g., a mean value) plus or minus the upper
and lower interval. The same information is used in
hypothesis testing; however, in hypothesis testing,
the end product is a decision concerning the truth of a
candidate hypothesis about the magnitude of the
quantity of interest.
In a particular problem, the choice between us-
ing interval estimation or hypothesis testing generally
depends on the question or issue at hand. For exam-
ple, if a summary of scientific evidence is requested,
confidence intervals are apt to be favored; however, if
a choice or decision is to be made, hypothesis tests are
likely to be preferred.
Interval Estimation
Statistical intervals, whether confidence or predic-
tion, may be based on an assumed probability model
describing the statistic of interest, or they may require
no assumption of a particular underlying probability
model.
Hahn and Meeker (1991) note that the proper
choice of statistical interval depends on the problem
or issue of concern. As a rule, if the interval is in-
tended to bound a population parameter (e.g., the true
mean), then the appropriate choice is the confidence
interval. If, however, the interval is to bound a future
member of the population (e.g., a forecasted value),
then the appropriate choice is the prediction interval.
Another statistical interval less frequently used in
ecology is the tolerance interval, which bounds a
specified proportion of observations.
In conventional (classical, or frequentist) statis-
tical inference, the statistical interval has a particular
interpretation that is often incorrectly stated in scien-
tific studies. For example, if a 95 percent statistical in-
terval for the mean is 7 ± 2, it is not correct to say that
"there is a 95 percent chance that the true mean lies be-
tween 5 and 9." Rather, it is correct to say that 95 per-
cent of the time this interval is calculated, the true
mean will lie within the computed interval. Although
it sounds awkward and not directly relevant to the is-
sue at hand, this interpretation is the correct meaning
of a classical statistical interval. In truth, once it is cal-
culated, the interval either does or does not contain
the true value. In classical statistics, the inference
from interval estimation refers to the procedure for in-
terval calculation, not to the particular interval that is
calculated.
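A minimal sketch of classical interval estimation, assuming the upstream IBI sample from the case study later in this chapter and using SciPy's Student's t quantiles, is shown below; the resulting interval carries the frequentist interpretation described above.

    import numpy as np
    from scipy import stats

    x = np.array([33, 34, 25, 37, 39, 45, 49, 47, 45, 44])  # upstream IBI scores

    n = len(x)
    mean = x.mean()
    se = x.std(ddof=1) / np.sqrt(n)                          # standard error of the mean
    t_crit = stats.t.ppf(0.975, df=n - 1)                    # two-sided 95 percent level

    # 95 percent confidence interval for the true mean
    print(mean - t_crit * se, mean + t_crit * se)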
Hypothesis Testing
Biosurveys are used for many purposes, one of which
is to assess impact or effect. Resource managers may
want to assess, for example, the influence of a pollut-
ant discharge or land use change on a particular area.
The effect of the impact can be determined based on
the study of trends over time or by comparing up-
stream and downstream conditions. In some in-
stances, the interest is in magnitude of effect, but
concern often focuses simply on the presence or ab-
sence of an effect of a specific magnitude. In such
cases, hypothesis testing is usually the statistical pro-
cedure of choice.
In conventional statistical analysis, hypothesis
testing for a trend or effect is often based on a point
null hypothesis. Typically, the point null hypothesis
is that no trend or effect exists. The position is pre-
sented as a "straw man" (Wonnacott and Wonnacott,
1977) that the scientist expects to reject on the basis of
evidence. To test this hypothesis, the investigator col-
lects data to provide a sample estimate of the effect
(e.g., change in biotic integrity at a single site over
time). The data are used to provide a sample estimate
of a test statistic, and a table for the test statistic is con-
sulted to estimate how unusual the observed value of
the test statistic is if the null hypothesis is true. If the
observed value of the test statistic is unusual, the null
hypothesis is rejected.
In a typical application of parametric hypothesis
testing, a hypothesis, H0, called the null hypothesis, is
proposed and then evaluated using a standard statis-
tical procedure like the t test. Competing with this
null hypothesis for acceptance is the alternative hy-
pothesis, Ha. Under this simple scheme, there are four
possible outcomes of the testing procedure: the hy-
pothesis is either true or false, and the test results can
be accepted or rejected for each hypothesis (see Table
2.4).

Table 2.4 Possible outcomes from hypothesis testing.

STATE OF THE WORLD   DECISION: ACCEPT H0               DECISION: REJECT H0
H0 is True           Correct decision.                 Type I error.
                     Probability = 1 - α; corresponds  Probability = α; also called
                     to the confidence level.          the significance level.
H0 is False          Type II error.                    Correct decision.
(Ha is True)         Probability = β.                  Probability = 1 - β; also
                                                       called power.
The point null hypothesis is a precise hypothesis
that may be symbolically expressed as

    H0: θ = 0

where θ is a parameter of interest. An example of a
point null hypothesis in words is, "no change occurs
in mean IBI after the new wastewater treatment plant
goes on line." Symbolically, it is expressed as

    H0: μ1 - μ2 = 0
    Ha: μ1 - μ2 ≠ 0

where μ1 is the "before" true mean and μ2 is the "after"
true mean. The test of the null hypothesis proceeds
with the calculation of the sample means, x̄1 and x̄2.
In most cases, the sample means will differ as a con-
sequence of natural variability or measurement error
or both, so a decision must be made concerning how
large this difference must be before it is considered
too large to result from variability or error. In classical
statistics, this decision is often based on standard
practice (e.g., a Type I error of 0.05 is acceptable), or
on informal consideration of the consequences of an
incorrect conclusion.
The result of a hypothesis test can be a conclu-
sion or a decision concerning the rejected hypothesis.
Alternatively, the result can be expressed as a
"p- value," which quantifies the strength of the data
evidence in favor of the null hypothesis. Thep-value
is defined as the probability that "the sample value
would be as large as the value actually observed, if H0
is true" (Wonnacott and Wonnacott, 1977). In effect,
thep-value provides a measure of how likely a partic-
ular value is, assuming that the null hypothesis is
true. Thus, the smaller thep-value, the less likely that
the sample supports H0. This is useful information; it
suggests that p-values should always be reported to
allow the reader to decide the strength of the evi-
dence.
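To make the definition concrete, the sketch below (Python/SciPy; illustrative only) converts an observed two-sample t statistic into a one-sided p-value; the numbers are taken from the IBI case study later in this chapter, where t = 1.51 with 18 degrees of freedom.

    from scipy import stats

    t_observed = 1.51     # observed t statistic (IBI case study, part a)
    df = 18               # n1 + n2 - 2 for two samples of 10

    # One-sided p-value: probability of a value at least this large when H0 is true
    p_value = stats.t.sf(t_observed, df)
    print(round(p_value, 3))   # roughly 0.07, so H0 is not rejected at the 0.05 level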
Common Assumptions
Virtually all statistical procedures and tests require
the validity of one or more assumptions. These as-
sumptions concern either the underlying population
being sampled or the distribution for a test statistic.
Since the failure of an assumption can have a substan-
tial effect on a statistical test, the common assump-
tions of normality, equality of variances, and
independence are discussed in this section. We must
ask, for example, to what extent can an assumption be
violated without serious consequences? Or how
should assumption violations be addressed?
Normality. A common assumption of many para-
metric statistical tests is that samples are drawn from
a normal distribution. Alternatively, it may be as-
sumed that the statistic of interest (e.g., a mean) is de-
scribed by a normal sampling distribution. In either
case, the key distinction between parametric and
nonparametric (or distribution-free) statistical tests is
that a probability model (often normal) is assumed.
Empirical evidence (e.g., Box et al. 1978) indi-
cates that the significance level (but not the power) is
robust, or not greatly affected, by mild violations of the
normality assumption for statistical tests concerned
with the mean. This finding suggests that a test result
indicating "statistical significance" is reliable, but a
"nonsignificant" result may be the result of a lack of
robustness to nonnormality. The normality of a sam-
ple can be checked using a normal probability plot,
chi square test, Kolmogorov-Smirnov test, or by test-
ing for skewness or kurtosis; however, many biologi-
cal surveys are not designed to produce enough
samples to make these tests definitive.
Normality of the sampling distribution for a test
statistic is important because it provides a probability
model for interval estimation and hypothesis tests. In
some cases, transformation of the data may help the
investigator achieve approximate normality (or sym-
metry) in a sample, if normality is required. Since
nonnegative concentration data cannot be truly nor-
mal, and since empirical evidence suggests that envi-
ronmental contaminant data may be described with a
lognormal distribution, the logarithmic transforma-
tion is a good first choice. Therefore, in the absence of
contrary evidence, we recommend that concentration
data be log-transformed prior to analysis.
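As a sketch of the kinds of checks mentioned above (not part of the original guidance), the Python fragment below applies a Kolmogorov-Smirnov test and a skewness calculation to a small hypothetical sample and to its log transform; because the normal parameters are estimated from the same sample, and because the sample is small, the results should be read as rough diagnostics rather than definitive tests.

    import numpy as np
    from scipy import stats

    x = np.array([1.2, 0.8, 2.5, 1.1, 0.9, 4.7, 1.6, 2.2, 0.7, 3.1])  # hypothetical data

    for label, sample in [("raw", x), ("log-transformed", np.log(x))]:
        # K-S test against a normal with the sample's own mean and standard deviation
        d_stat, p_val = stats.kstest(sample, "norm",
                                     args=(sample.mean(), sample.std(ddof=1)))
        print(label, "K-S p:", round(p_val, 2), "skewness:", round(stats.skew(sample), 2))

    # stats.probplot(np.log(x), dist="norm") returns the points for a normal
    # probability plot, the visual check described in the text.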
Equality of Variance. A second common as-
sumption is that when two or more distributions are
involved in a test, the variances will be constant
across distributions. Many tests are also robust to
mild violations of this assumption, particularly if the
sample sizes are nearly identical. To test this assump-
tion, a t test (usually a two-tailed one) can be per-
formed; see Snedecor and Cochran (1967) for an
example, and Miller (1986) for interpretive results.
Conover (1980) provides an alternative, namely,
nonparametric tests of equality of variances. Note
that if two means are being compared based on sam-
ples with vastly different variances, the differences of
interest may be more fundamental than the difference
between the means.
Independence. The assumption of greatest gen-
eral concern is independence. Most statistical tests
(parametric and nonparametric) require a random
sample, or a sample composed of independent obser-
vations. Dependency between or among observations
in a data set means that each observation contains
some information already conveyed in other observa-
tions. Thus, there is less new independent informa-
tion in a dependent data set than in an independent
data set of the same sample size. Because statistical
procedures are often not robust to violation of the in-
dependence assumption, adjustments are generally
recommended to address anticipated problems.
Dependence in a sample can result from spatial
or temporal patterns, that is, from persistence
through time and space. In most types of analyses, the
assumption of independence refers to independence
in the disturbances (errors). For example, in a time se-
ries with temporal trend and seasonal pattern, de-
pendence or autocorrelation in the raw data series
may exist because of a deterministic feature of the
data (e.g., the time trend or seasonal pattern).
This type of autocorrelation poses no difficulty;
it is addressed by modeling the deterministic features
of the data and subtracting the modeled component
from the original series. Of particular concern in test-
ing for trend is autocorrelation that remains after all
deterministic features are removed (i.e., errors that
are in the disturbances). When this situation arises,
an adjustment to the trend test is necessary. Reckhow
et al. (1993) provide guidance and software.
A similar situation can occur in the estimation of
a regression slope or a central tendency statistic such
as the mean or trimmed mean. In such cases, the inde-
pendence assumption refers to the errors, as esti-
mated by the residuals, around the regression line or
the mean. If persistence or dependence is found in the
residuals, then the independence assumption is vio-
lated and corrective action is needed. Options to ad-
dress this problem include using an effective sample
size (Reckhow and Chapra, 1983), or generalized
least squares for regression (see Kmenta [1986] or any
standard econometrics regression text).
If the investigator finds positive autocorrelation
in the disturbances (i.e., if each disturbance is posi-
tively correlated with nearby disturbances in the se-
ries), confidence interval estimates will be too narrow
and may lead to rejection of the null hypothesis.
Autocorrelation in the disturbances is the most com-
mon and potentially the most troublesome of the
causes of assumption violations.
The degree of autocorrelation is a function of the
frequency of sampling; that is, a data set based on an
irregular sampling frequency cannot be characterized
by a single, fixed value for autocorrelation. For biolog-
ical time series, stream data obtained more frequently
than monthly may be expected to be autocorrelated
(after trends and seasonal cycles are removed).
Stream survey data based on less frequent sampling
are less likely to exhibit sample autocorrelation esti-
mates of significance.
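The sketch below (Python/NumPy, synthetic monthly data) illustrates one way to examine this: fit and remove a linear trend, estimate the lag-1 autocorrelation of the residuals, and compute an effective sample size of the form n(1 - r)/(1 + r). That formula is a common first-order (AR(1)-style) approximation and is offered only as an illustration, not as the specific procedure of Reckhow and Chapra (1983).

    import numpy as np

    rng = np.random.default_rng(1)
    n = 24
    t = np.arange(n)
    series = 30 + 0.2 * t + rng.normal(0, 3, n)   # synthetic series with a linear trend

    # Remove the deterministic trend, then examine the residuals (disturbances)
    slope, intercept = np.polyfit(t, series, 1)
    resid = series - (slope * t + intercept)

    # Lag-1 autocorrelation of the residuals
    r1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]

    # First-order approximation to the effective (independent) sample size
    n_eff = n * (1 - r1) / (1 + r1)
    print(round(r1, 2), round(n_eff, 1))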
Parametric Methods: the t Test
Parametric approaches involve a model (e.g., regres-
sion slope) for any deterministic features and a proba-
bility model for the errors. In some cases, the
deterministic model will be a linear, curvilinear, or
step function, while the model for the errors is typi-
cally a normal probability distribution with inde-
pendent, identically distributed errors. In other cases,
the deterministic model may simply be a constant (as
it is when interest focuses on an "upstream/down-
stream" comparison between two sites), though the
probability model may in all cases be a normal proba-
bility distribution. The t test is a typical parametric
test.
Using the t test
A Student's t statistic

    t = (x̄ - μ) / (s / √n)                                        (2.1a)

has a Student's t distribution (n - 1 degrees of freedom);
here, x̄ is the mean of a random sample from a nor-
mal distribution with true mean μ and constant vari-
ance, s is the sample standard deviation, and n is the
sample size. In addition, for two samples,

    t = (x̄1 - x̄2) / [((s1 + s2)/2) √(1/n1 + 1/n2)]                (2.1b)

also has a Student's t distribution (n1 + n2 - 2 degrees of
freedom); here, x̄1 and x̄2 are the sample means; s1 and
s2 are the sample standard deviations; and n1 and n2
are the sample sizes. This distribution is widely tabu-
lated, and it is commonly used in hypothesis testing
and confidence interval estimation for a sample mean
(one-sample test; Equation 2.1a) or a comparison of
sample means (two-sample test; Equation 2.1b).
When Student's t distribution is used in a hy-
pothesis test (a t test), it is assumed that samples are
drawn from a normal distribution, the variances are
constant across distributions, and the observations
are independent. Of these assumptions, Box et al.
(1978) have shown that the t test has limited robust-
ness to violations of the first two (normality and
equality of variances); however, problems will oc-
cur if the observations are dependent. The scientist
should probably be concerned about the first two as-
sumptions only in situations in which the two data
sets have substantially different variances and sub-
stantially different sample sizes (see Snedecor and
Cochran [1967] for F test calculations to compare
variances).
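A minimal two-sample t test sketch in Python/SciPy follows, using the upstream and downstream IBI scores from the case study later in this chapter. Note that ttest_ind uses the pooled-variance form of the two-sample statistic rather than the averaged-standard-deviation shorthand in Equation 2.1b, so a hand calculation may differ slightly; the alternative argument requires a reasonably recent SciPy release.

    import numpy as np
    from scipy import stats

    upstream = np.array([33, 34, 25, 37, 39, 45, 49, 47, 45, 44])
    downstream = np.array([26, 30, 18, 32, 36, 36, 43, 42, 41, 41])

    # Pooled-variance two-sample t test; alternative="greater" asks whether the
    # upstream mean exceeds the downstream mean (one-sided test)
    t_stat, p_val = stats.ttest_ind(upstream, downstream,
                                    equal_var=True, alternative="greater")
    print(round(t_stat, 2), round(p_val, 3))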
An attractive variation of the t statistic for use in
situations where outliers are of concern was proposed
by Yuen and Dixon (1973; see also Miller, 1986; and
Staudte and Sheather, 1990). They created an out-
lier-resistant, or robust, version of the t statistic
(Equations 2.1a and 2.1b) using a trimmed mean and a
Winsorized standard deviation. For example, if a t sta-
tistic is used to compare the means of two popula-
tions, the robust (trimmed t) version is
    t_tr = (x̄t1 - x̄t2) / [((sw1 + sw2)/2) √(1/n1 + 1/n2)]        (2.2)

where x̄ti = trimmed mean for sample i
      swi = Winsorized standard deviation for sample i
      ni  = number of observations in sample i
A Winsorized statistic is similar to a trimmed sta-
tistic. For trimming, observations are ordered from
lowest to highest, and the k lowest and k highest are
removed from the sample for the calculation of the
k-trimmed statistic (e.g., trimmed mean). For
k-Winsorizing, observations are ordered from lowest
to highest, and the k lowest and k highest are not re-
moved, but are reassigned the values of the lowest ob-
servation and the highest observation remaining in
the trimmed sample. The following example illus-
trates this.
A sample of 10 IBI values is obtained for analysis:
29, 31, 26, 25, 34, 38, 33, 31, 28, 37
And ordered from lowest to highest:
25, 26, 28, 29, 31, 31, 33, 34, 37, 38.
The 10 percent-trimmed sample is
26, 28, 29, 31, 31, 33, 34, 37.
The 10 percent-Winsorized sample is
26, 26, 28, 29, 31, 31, 33, 34, 37, 37.
If we were to calculate the 10 percent-trimmed t
statistic in Equation 2.2 for this IBI sample, we would
use: (1) the trimmed sample (eight observations) to
calculate a mean, and (2) the Winsorized sample (10
observations) to calculate a standard deviation. For
the two-sample comparison of means, the trimmed t
statistic has (1 - 2k)(n1 + n2) - 2 degrees of freedom or, in
the above example, 7 degrees of freedom (df). The
trimmed t statistic is an attractive option that should
be considered whenever outliers are a concern.
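As a hedged sketch of the trimming and Winsorizing steps just described, the Python fragment below applies SciPy's trim_mean and mstats.winsorize to the ten-value IBI sample from the example above; the trimmed t statistic itself would then be assembled from these pieces as in Equation 2.2.

    import numpy as np
    from scipy import stats
    from scipy.stats import mstats

    ibi = np.array([29, 31, 26, 25, 34, 38, 33, 31, 28, 37])

    # 10 percent trimming: the single lowest and highest values are dropped
    trimmed_mean = stats.trim_mean(ibi, proportiontocut=0.10)

    # 10 percent Winsorizing: the extremes are replaced, not removed
    winsorized = np.asarray(mstats.winsorize(np.sort(ibi), limits=(0.10, 0.10)))
    winsorized_sd = winsorized.std(ddof=1)

    print(trimmed_mean, winsorized, round(winsorized_sd, 2))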
The parametric approach is appropriate and ad-
vantageous if the deterministic model is a reasonable
characterization of reality and if the model for errors
holds. In such cases, parametric tests should be more
powerful than nonparametric or distribution-free al-
ternatives. Thus, the assumption that deterministic
and probability models are correct is the basis on
which the superior performance of parametric meth-
ods rests. If the assumptions concerning these models
are incorrect, then the results of the parametric tests
may be invalid and distribution-free procedures may
be more appropriate.
Nonparametric Tests: the W test
Distribution-free methods, as the name suggests, do
not require an assumption concerning the particular
form of the underlying probability model for the data
generation process. An assumption of independence
is, however, usually made; therefore, autocorrelation
can be as serious a problem in nonparametric meth-
ods as it is for parametric and robust methods. Distri-
bution-free tests are often based on rank (or order); the
sample observations are arranged from lowest to
highest. The Wilcoxon-Mann-Whitney test or W test is
a typical distribution-free test.
Using the W test
The W test is a two-sample hypothesis test, designed
to test the hypothesis that two random samples are
drawn from identical continuous distributions with
the same center (alternative hypothesis: one distribu-
tion is offset from the other, but otherwise identical).
This test is often presented as an option to the
two-sample t test that should be considered if the as-
sumption of normality is believed to be seriously in
error. The W test has its own statistic, which is tabu-
lated in most elementary statistics textbooks (i.e.,
those with a chapter on nonparametric methods).
However, for moderate to large sample sizes (e.g., n >
15), the statistic is approximately normal under the
null hypothesis, so the standard normal table can be
used.
The scientist should consider the W test for any
situation in which the two-sample t test may be used.
Comparative studies of these two tests indicate that
while the t test is robust to violations of the normality
assumption, the W test is relatively powerful while
not requiring normality. Situations that appear se-
verely nonnormal might favor the W test; otherwise
the t test may be selected. Some statisticians (e.g.,
Blalock, 1972) recommend that both tests be con-
ducted as a double check on the hypothesis.
Unfortunately, violation of the independence as-
sumption appears to be as serious for the W test as for
the t test. If these tests are to be meaningful, the scien-
tist must confirm independence or make other adjust-
ments as noted in Reckhow et al. (1993).
In essence, the W test is used to determine if the
two distributions under study have the same central
tendency, or if one distribution is offset from the
other. To conduct the W test, the data points from the
samples are combined, while maintaining the sepa-
rate sample identity. This overall data set is ordered
from low value to high value, and ranks are assigned
according to this ordering.
To test the null hypothesis of no difference be-
tween the two distributions (f(x) and g(x))

    H0: f(x) = g(x)

the ranks, Ri, for the data points in one of the two
samples are summed:

    W = Σ Ri                                                      (2.3)
The ranks should be specified as follows
(Wonnacott and Wonnacott, 1977): Start ordering
(low to high, or high to low) from the end (high or low)
at which the observations from the smaller sample
tend to be greater in number, and sum the ranks to es-
timate W from this smaller sample. This estimate
keeps W small as it is reported in most tables. For ei-
ther one-sided or two-sided tests, if ties occur in the
ranks, then all tied observations should be assigned
the same average rank.
Statistical significance is a function of the degree
to which, under the null hypothesis, the ranks occu-
pied by either data set differ from the ranks expected
as a result of random variation. For small samples, the
W statistic calculated in Equation 2.3 can be com-
pared to tabulated values to determine its signifi-
cance (see Hollander and Wolfe, 1973). For moderate
to large samples (where total n from both samples >
15), W is approximately normal (if the null hypothesis
is true). Therefore, the W statistic may be evaluated
using a standard normal table with mean (E[W]) and
variance (Var[W]):
    E(W) = nA (nA + nB + 1) / 2                                   (2.4)

    Var(W) = nA nB (nA + nB + 1) / 12                             (2.5a)
If there are ties in the data, then the variance may
be calculated as
    Var(W) = (nA nB / 12) [nA + nB + 1
             - Σj tj(tj² - 1) / ((nA + nB)(nA + nB - 1))]         (2.5b)

where tj is the size (number of data points with
the same value) of tied group j. The effect of ties is neg-
ligible unless there are several large groups (tj > 3) in
the data set.
These statistics are used to create the standard
normal deviate:
    z = (W - E(W)) / (Var(W))^0.5                                 (2.6)

where nA, nB = the number of observations in
samples A and B (nA ≤ nB).
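The W test computation can be sketched in Python/SciPy as below, using the upstream and downstream samples from the case study that follows. Ranks here are assigned in the usual ascending order, so the sign of z may be reversed relative to a hand calculation that ranks from the high end, as the case study does.

    import numpy as np
    from scipy import stats

    upstream = np.array([33, 34, 25, 37, 39, 45, 49, 47, 45, 44])
    downstream = np.array([26, 30, 18, 32, 36, 36, 43, 42, 41, 41])

    combined = np.concatenate([upstream, downstream])
    ranks = stats.rankdata(combined)          # ascending ranks, ties get average ranks

    n_a, n_b = len(upstream), len(downstream)
    w = ranks[:n_a].sum()                     # Equation 2.3: rank sum for sample A

    e_w = n_a * (n_a + n_b + 1) / 2           # Equation 2.4
    var_w = n_a * n_b * (n_a + n_b + 1) / 12  # Equation 2.5a (ignores the tie correction)

    z = (w - e_w) / np.sqrt(var_w)            # Equation 2.6
    print(w, round(z, 2))

    # The same comparison using SciPy's built-in rank-sum test
    print(stats.ranksums(upstream, downstream))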
Example: an IBI case study
IBI data have been obtained from upstream and
downstream sites surrounding a wastewater dis-
charge. Assume independence.
Upstream:    33   34   25   37   39   45   49   47   45   44
Downstream:  26   30   18   32   36   36   43   42   41   41
(a) Test the null hypothesis that the true differ-
ence between the upstream and downstream IBI
means is zero, versus the alternative hypothesis that
the downstream IBI mean is lower than the upstream
IBI mean.
    H0: μU - μD = 0
    Ha: μU - μD > 0
First, some basic statistics for each sample:
              SAMPLE MEAN    SAMPLE STANDARD DEVIATION
Upstream      39.8           7.57
Downstream    34.5           8.09
For a comparison of two means based on equal
sample sizes, the t statistic is
t = (x̄1 - x̄2) / [√((s1² + s2²)/2) √(1/n1 + 1/n2)]
  = (39.8 - 34.5) / [√((7.57² + 8.09²)/2) √(1/10 + 1/10)]
  = 5.3 / (7.83 √0.2) = 5.3 / 3.5 = 1.51

At the 0.05 significance level, the one-tailed t statistic for 18 degrees of freedom is 1.73. Since 1.51 < 1.73, we cannot reject the null hypothesis (at the 0.05 level).
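As an informal cross-check of this arithmetic, the calculation can be scripted. The following Python sketch (an illustration, not part of the original example) recomputes the sample means, standard deviations, pooled standard deviation, and t statistic from the IBI values above.

import math

upstream = [33, 34, 25, 37, 39, 45, 49, 47, 45, 44]
downstream = [26, 30, 18, 32, 36, 36, 43, 42, 41, 41]

def mean(x):
    return sum(x) / len(x)

def sample_sd(x):
    m = mean(x)
    return math.sqrt(sum((v - m) ** 2 for v in x) / (len(x) - 1))

n = len(upstream)                                   # equal sample sizes, n = 10
diff = mean(upstream) - mean(downstream)            # 39.8 - 34.5 = 5.3
s_pooled = math.sqrt((sample_sd(upstream) ** 2 + sample_sd(downstream) ** 2) / 2)
t = diff / (s_pooled * math.sqrt(1 / n + 1 / n))    # about 1.51
print(round(diff, 1), round(s_pooled, 2), round(t, 2))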
(b) Test the null hypothesis (see part a) using the
10 percent trimmed t (10 percent trimmed from each
end).
tt = (x̄t1 - x̄t2) / [√((sw1² + sw2²)/2) √(1/n1 + 1/n2)]
   = (40.5 - 35.5) / [√((5.83² + 6.39²)/2) √(1/10 + 1/10)]
   = 5.0 / (6.12 √0.2) = 5.0 / 2.74 = 1.83

where x̄t1 and x̄t2 are the 10 percent trimmed means and sw1 and sw2 are the corresponding winsorized standard deviations. At the 0.05 significance level, the one-tailed t statistic for 14 degrees of freedom is 1.76. Since 1.83 > 1.76, we reject the null hypothesis (at the 0.05 level).
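The trimmed calculation can be checked the same way. This sketch assumes, as the worked example implies, that 10 percent trimming drops the single lowest and highest value from each sample of 10 and that spread is measured by the winsorized standard deviation.

import math

upstream = [33, 34, 25, 37, 39, 45, 49, 47, 45, 44]
downstream = [26, 30, 18, 32, 36, 36, 43, 42, 41, 41]

def trimmed_mean(x, k=1):
    s = sorted(x)[k:len(x) - k]        # drop the k lowest and k highest values
    return sum(s) / len(s)

def winsorized_sd(x, k=1):
    s = sorted(x)
    s = [s[k]] * k + s[k:len(s) - k] + [s[-k - 1]] * k    # pull extremes in to their neighbors
    m = sum(s) / len(s)
    return math.sqrt(sum((v - m) ** 2 for v in s) / (len(s) - 1))

n = len(upstream)
diff = trimmed_mean(upstream) - trimmed_mean(downstream)   # 40.5 - 35.5 = 5.0
s_w = math.sqrt((winsorized_sd(upstream) ** 2 + winsorized_sd(downstream) ** 2) / 2)
t_trim = diff / (s_w * math.sqrt(1 / n + 1 / n))           # about 1.83
print(round(diff, 1), round(s_w, 2), round(t_trim, 2))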
(c) Test the null hypothesis (see part a) using the W test.
ORDERED IBI VALUES AND RANKS

Upstream IBI   Rank      Downstream IBI   Rank
49             1
47             2
45             3.5
45             3.5
44             5
                         43               6
                         42               7
                         41               8.5
                         41               8.5
39             10
37             11
                         36               12.5
                         36               12.5
34             14
33             15
                         32               16
                         30               17
                         26               18
25             19
                         18               20
Here the separate samples have been combined for the purpose of rank ordering. The W test statistic can then be calculated from the ranks of the upstream sample:

W = Σ Ri = 1 + 2 + 3.5 + 3.5 + 5 + 10 + 11 + 14 + 15 + 19 = 84

E(W) = nA(nB + nA + 1) / 2 = 10(10 + 10 + 1) / 2 = 105

Var(W) = (nAnB / 12)[nB + nA + 1 - Σ tj(tj² - 1) / ((nB + nA)(nB + nA - 1))]
       = [(10)(10) / 12][10 + 10 + 1 - {(2)(3) + (2)(3) + (2)(3)} / ((10 + 10)(10 + 10 - 1))] = 174.61

z = (W - E(W)) / (Var(W))^0.5 = (84 - 105) / (174.61)^0.5 = -1.59

At the 0.05 significance level, the one-tailed z statistic is 1.65. Since 1.59 < 1.65, we cannot reject the null hypothesis (at the 0.05 level).
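The rank computation is easy to automate. The following Python sketch (an illustration, not part of the original guidance) assigns average ranks to ties, sums the upstream ranks, and applies Equations 2.4 through 2.6 with the tie correction; it reproduces W = 84, E(W) = 105, Var(W) of about 174.6, and z of about -1.59.

import math
from collections import Counter

upstream = [33, 34, 25, 37, 39, 45, 49, 47, 45, 44]
downstream = [26, 30, 18, 32, 36, 36, 43, 42, 41, 41]

# Rank the combined data from high to low, assigning average (mid) ranks to ties.
combined = sorted(upstream + downstream, reverse=True)
rank_of = {}
for value in set(combined):
    positions = [i + 1 for i, v in enumerate(combined) if v == value]
    rank_of[value] = sum(positions) / len(positions)

n_a, n_b = len(upstream), len(downstream)
W = sum(rank_of[v] for v in upstream)
EW = n_a * (n_a + n_b + 1) / 2
ties = Counter(combined)
tie_term = sum(t * (t ** 2 - 1) for t in ties.values() if t > 1)
VarW = (n_a * n_b / 12) * (n_a + n_b + 1 - tie_term / ((n_a + n_b) * (n_a + n_b - 1)))
z = (W - EW) / math.sqrt(VarW)
print(W, EW, round(VarW, 2), round(z, 2))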
Conclusions
In hypothesis testing, the conclusion to not reject H0 (in effect, to accept H0) should not be evaluated strictly on the basis of α, the probability of rejecting H0 when it is true (Type I error; see Table 2.4). Instead, we must be concerned with β, the probability of accepting H0 when it is false (Type II error). Unfortunately, β does not have a single value, but is dependent on the true (but unknown) value of the difference between population means and on the sample size, n. For a particular testing procedure and sample size, we can determine and plot a relationship between the true difference between means and β. This plot is called the operating characteristic curve.

To understand the issues concerning significance and power (α and 1 - β), consider the null hypothesis in the IBI case study:
H0: The population mean IBI at the upstream site is the same as the population mean IBI at the downstream site.
A glance at the IBI values and ranks in this exam-
ple indicates a difference between the two samples
(box plots and histograms would provide further sup-
porting evidence). At issue is whether this difference
in the sample is a chance occurrence or an indication
of a true difference between the sites. If we adopt the
conventional 0.05 level for hypothesis testing, then
the conclusions from the three tests are ambiguous.
Still, we can say the following about both the site
comparisons and the methods:
(i) The downstream site is slightly impacted.
Even though only one of the three test results yielded
significance (at the 0.05 level), all three were close,
suggesting a slight difference between the sites.
(ii) For each site, the lowest IBI value (25 for up-
stream, and 18 for downstream) is influential, partic-
ularly on the standard deviation. As a consequence,
for the conventional t test, the denominator in the t
statistic is inflated and rejection of the null hypothe-
sis is less likely. Note that the lowest IBI value for the
upstream site (IBI = 25) also affects the distribution-free W test. This IBI value holds a high rank (19)
for the upstream sample, and substantially affects the
test result. If that single IBI value had been 27 instead
of 25, we would have rejected the null hypothesis at
the 0.05 level.
(iii) The trimmed t is resistant to unusual obser-
vations or outliers, and thus provides the best single
indicator of difference between the sites as conveyed
by the bulk of the data from each site.
In addition, because of the wastewater dis-
charge, consider the general alternative hypothesis:
HA: The population mean IBI at the upstream site is
higher than the population mean IBI at the
downstream site.
If we adopt α = 0.05 (the probability of rejecting H0 when it is true; Type I error) as our significance level, then Figure 2.1a displays the sampling distribution for the mean under H0 with 18 degrees of freedom. The horizontal axis in Figure 2.1 is the "difference between the means"; thus, the sampling distribution is centered at zero in Figure 2.1a (consistent with zero difference between means under H0).
The 0.05-significant tail area (the "rejection region")
begins at 6.06, which means that the sample differ-
ence must be greater than or equal to 6.06 for us to re-
ject H0. Since the difference between the means in our
sample IBI was only 5.3, we are inclined to accept the
null hypothesis, based on the conventional t test.
(Note: to find the beginning of the tail area multi-
ply the t statistic times the standard error. In this ex-
ample, the t statistic is 1.73 [one-sided, 0.05 level, 18
degrees of freedom], and the standard error is 3.5.
Thus, the tail area begins at [1.73][3.5] = 6.06.)
Now suppose that the following alternative hypothesis, H1, is actually true for the sample IBI case:

H1: The population mean IBI at the upstream site is higher by 5.0 than the population mean IBI at the downstream site.
In addition, suppose that while H1 actually is true, we propose a hypothesis test for H0 based on the acceptance region in Figure 2.1a (i.e., accept H0 if the difference between the means is less than 6.06), which is exactly what occurred in our example. As we noted above, consideration of H0 alone (Figure 2.1a) leads us to accept the null hypothesis.

Figure 2.1a and b. Sampling distributions under different hypotheses: (a) the sampling distribution under H0, with the α = 0.05 (5 percent) tail area beginning at 6.06; (b) the sampling distribution under H1.
Yet, with H1 actually true (see Fig. 2.1b), if we propose a hypothesis test for H0 based on the acceptance region in Figure 2.1a, there is a 62 percent chance that we will accept H0 when it is actually false, according to Figure 2.1b (given the sample size in the example). This high likelihood of Type II error (see Table 2.4) underscores the danger of concluding the hypothesis test with acceptance of the null hypothesis. The power of this particular test is 1 - β, or a 38 percent chance of detecting an IBI change of 5. Note that the specific alternative hypothesis H1 is one example of an unlimited number of possibilities associated with the general alternative hypothesis HA. Associated with H1, β = 0.62 is one point on the power curve for this test and sample size. To properly determine the power of a test, we need to calculate β for a range of specific alternative hypotheses.
A second issue of concern in hypothesis testing
is the problem of multiple simultaneous hypothesis
testing, or "multiplicity" (Mosteller and Tukey, 1977).
The classical interpretation of the 0.05 significance
level (for α) associated with a hypothesis test is that
95 percent of the time this testing procedure is ap-
plied, the conclusion to accept the null hypothesis
will not be in error if the null hypothesis is true. That
is, on the average, one in 20 tests under these condi-
tions will result in Type I errors.
The problem of multiplicity arises when an in-
vestigator conducts several tests of a similar nature on
a set of data. If all but a few of the tests yield statisti-
cally insignificant results, the scientist should not ig-
nore this in favor of those that are significant. The
error of multiplicity results when one ignores the ma-
jority of the test results and cites only those that are
apparently statistically significant. As Mosteller and
Tukey (1977) note, the multiplicity error is techni-
cally the incorrect assignment of an α level. When multiple tests of a similar nature are run on a set of data, a collective α should be used, associated with the simultaneous test results. This tactic is typically referred to as the Bonferroni correction.
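A minimal numerical illustration of the collective α idea (the number of tests below is hypothetical): if m similar tests are run on the same data and the overall Type I error rate is to be held at 0.05, each individual test is judged against α/m.

alpha_overall = 0.05      # desired collective (family-wise) significance level
m = 6                     # number of similar tests run on the data (hypothetical)
alpha_per_test = alpha_overall / m
print(round(alpha_per_test, 4))   # 0.0083; each test uses this stricter level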
The following comments from Wonnacott and
Wonnacott (1972, pp. 201-202) summarize our atti-
tude toward hypothesis testing:
We conclude that although statistical theory
provides a rationale for rejecting H0, it pro-
vides no formal rationale for accepting H0. The
null hypothesis may sometimes be uninterest-
ing, and one that we neither believe nor wish to
establish; it is selected because of its simplic-
ity. In such cases, it is the alternative Ha that
we are trying to establish, and we prove Ha by
rejecting H0. We can see now why statistics is
sometimes called "the science of disproof." H0
cannot be proved, and H1 is proved by dis-
proving (rejecting) H0. It follows that if we
wish to prove some proposition, we will often
call it H1 and set up the contrary hypothesis H0
as the "straw man" we hope to destroy. And of
course if H0 is only such a straw man, then it
becomes absurd to accept it in the face of a
small sample result that really supports H1.
Since there are great dangers in accepting H0,
the decision instead should often be simply to
"not reject H0," i.e., reserve judgment. This
means that type II error in its worst form may
be avoided; but it also means you may be leav-
ing the scene of the evidence with nothing in
hand. It is for this reason that either the con-
struction of a confidence interval or the calcu-
lation of a prob-value is preferred, since either
provides a summary of the information pro-
vided by the sample, useful to sharpen up
your knowledge of what the underlying popu-
lation is really like.
If, on the other hand, a simple accept-or-reject
hypothesis test is desired, then we must look
to a far more sophisticated technique. Spe-
cifically, we must explicitly take account not
only of the sample data used in any standard
hypothesis test (along with the adequacy of
the sample size), but also:
1. Prior belief. How much confidence do we
have in the engineering department that has
assured us that the new process is better? Is
their vote divided? Have they ever been wrong
before?
2. Loss involved in making a wrong decision. If
we make a type I error (i.e., decide to reject the
old process in favor of the new, even though
the old is as good), what will be the costs of re-
tooling, etc.?
These comments amount to an advocacy of
Bayesian decision theory. While it may be difficult to
interpret a biosurvey in decision analysis terms, prior
information and loss functions should, at a mini-
mum, be considered in an informal manner. It is good
engineering and planning practice to make use of all
relevant information in inference and decision mak-
ing.
CHAPTER 3. Designing the Sample Survey
The design of the sample survey is a critical ele-
ment in the environmental assessment process,
and certain statistical methods are associated with
specific designs. This chapter examines various types
of survey design and shows how the selection of the
design relates to the interpretation and use of data
within the biocriteria program. For information on de-
signs not covered in this chapter, see Cochran, 1963;
Cochran and Cox, 1957; Green, 1979; Williams, 1978;
and Reckhow and Stow, 1990.
Efforts to design sample surveys frequently re-
sult in situations that force the investigator to evalu-
ate the trade-offs between an increase in uncertainty
and the costs of reducing this uncertainty (Reckhow
and Chapra, 1983). But major components of uncer-
tainty, including variability, error, and bias in biologi-
cal and statistical sources, can sometimes be
controlled by a well-specified survey design.
For example, variability can be caused by natural
fluctuations in biological indicators over space and
time; error can be associated with inaccurate data ac-
quisition or reduction; and bias can occur when the
sample is not representative of the population under
review or when the samples are not randomly col-
lected. These sources of uncertainty should be evalu-
ated before the sampling design is selected because
the best design will minimize the effects of variability,
error, and bias on decision making.
Critical Aspects of Survey
Design
Data collection within the biocriteria program re-
quires the investigator to address issues associated
with both classical and experimental survey designs.
In general, experimental survey design focuses on the
collection of data that leads to the testing of a specific
hypothesis. Classical survey design is motivated less
by hypothesis testing than by the "survey" concept.
That is, the investigator gathers a relatively small
amount of data, the sample, and extrapolates from it a
view of the totality of available information.
In this chapter, we will address issues that over-
lap these design types. In addition, we will focus on
designs appropriate to local, site-specific situations.
For larger geographic survey designs, see Hunsaker
and Carpenter (1990), or Linthurst et al. (1986).
Variability
A critical aspect of sampling design is to identify and
separate components of variability, including the im-
portant ones of time, space, and random errors. Yearly
and seasonal variations and spatial variations like
those caused by changes in geographic patterns
should be accounted for in the survey design. A de-
sign that stratifies the sampling based on knowledge
of spatial and temporal changes in the abundance and
character of biological indicators is preferred to sys-
tematic random sampling. That is, if biological indi-
cators are known to exhibit temporal and spatial
patterns, then sampling locations and times must be
adjusted to match the biological variability.
Representativeness and Sampling
Techniques
The object of a biological survey design is to reduce
the total information available to a small sample: ob-
servations are made and data collected on a relatively
small number of biological variables. Representative-
ness is, therefore, a key consideration in the design of
sample collection procedures. The data generated
during the survey should be representative of the
population or process under evaluation. Biased sam-
ples occur when the data are not representative of the
population. For example, a sample mean may be low
(biased) because the investigator failed to sample geo-
graphic areas of high abundance.
Several techniques can increase the odds of col-
lecting a representative sample; however, the tech-
nique most frequently used is random sampling.
Theoretically, in simple random sampling, every unit
in the population has the same chance of being in-
cluded in the sample. Random sampling is a physical
way to introduce independence among environmen-
tal measurements. In addition, random sampling has
the effect of minimizing various types of bias in the
interpretation of results.
If the geographic area sampled is large, with
known or suspected environmental patterns, a good
technique is to divide the area into relatively homoge-
nous sections and randomly sample within each one.
This technique is known as stratified sampling. Sam-
ples can be allocated to each section in proportion to
the size of the area or to the known abundance of or-
ganisms within each area. In still other cases, system-
atic sampling may be appropriate. Systematic
sampling improves precision in the sample estimates,
especially when known spatial patterns exist
(Cochran, 1963). Randomly allocated replicate sam-
ples collected on a grid allow for good spatial coverage
of patchy environments, yet minimize the potential
for sampling bias.
Cause and Effect
In classical statistical experiments, a population is
identified and randomly divided into two groups. The
treatment is administered to one group; the other
group serves as the control. The difference in the aver-
age response between the two groups indicates the ef-
fect of the treatment, and the random assignment of
individuals to the groups permits an inference of cau-
sality because the observed difference results from
the treatment and not from some preexisting differ-
ence between the groups.
In an ecological assessment, the treatment and
control groups are not selected at random from a
larger population, since the impacted site cannot be
selected at random. And no matter how carefully the
reference site is matched, the investigator cannot
compensate for the lack of random selection. In this
sense, a statistically valid test of the hypothesis that
an observed difference between an impacted site and
a control site results from a specific cause is impossi-
ble. The hypothesis that the two sites are different can
be tested, but the difference cannot be attributed to a
specific cause. In statistical terms, the stress on the
impacted site is completely confounded with preex-
isting differences between the impact and reference
site.
Although a firm case can be made that a site is
subject to adverse impacts, investigators must realize
that the site is an experimental unit that cannot be
replicated. They must take care to avoid
"pseudoreplication" (Hurlbert, 1984) the testing of
a hypothesis about adverse effects without appropri-
ate statistical design or analysis methods. The prob-
lem is a misunderstanding or misspecification of the
hypothesis being tested. It is avoided by understand-
ing that only the hypothesis of a difference between
sites can be statistically tested. Cause-and-effect is-
sues cannot be resolved using statistical methods. Of
course, establishing that a difference exists is an es-
sential step in the process of demonstrating an ad-
verse ecological effect. If there is no detectable
difference, there is no cause to establish.
Methods used to establish causality can make
use of statistical techniques, such as regression or cor-
relation. For example, regression can be used to show
that toxicity increases along with the concentration of
some chemical known to originate from a wastewater
outfall. The regression describes the relationship; it
does not imply the cause, though presence of a strong
relationship is evidence that a link exists.
One way to resolve these issues is to collect both
spatial and temporal data from a control site. If the
spatial control is missing and only before and after
impact samples are available at the impacted site, sta-
tistical tests cannot rule out the possibility that the
change would have occurred with or without the im-
pact. If the temporal control is missing, the statistical
tests cannot rule out the possibility that the differ-
ences between the control and impact site may have
occurred prior to the impact. In practice, control data
are rarely available in both spatial and temporal di-
mensions. Therefore, most environmental assess-
ments detect only that differences exist between the
control and impact sites. The causal link is more diffi-
cult to discern.
Controls
In environmental assessments, control or reference
data are used in hypothesis tests to evaluate whether
data from the control and impact site are statistically
different. Evidence of impact is based on changes in
the biological community that did not occur in the
control area. Sources of control information include
baseline data, reference site data, and numeric stan-
dards. The case for causality can be strengthened if
the controls are properly selected.
In an ideal study design, both temporal and spa-
tial control data should be collected (Green, 1979).
The control site should be geographically separated
from the impacted site but have similar physical and
ecological features (e.g., elevation, temperature, wind
patterns, and habitat type and disturbance). In
aquatic habitats, parameters such as stream order,
flow rate, and stream hydrography should be consid-
ered. Ideally, biological indicators of impact should
be collected at the control site before and after the im-
pact occurs.
Statistically, a valid control site should have con-
servative properties. That is, its statistics should be
the same as at the impacted site except for the effects
of the impact. Physical, chemical, and ecological vari-
ables should be measured and statistically evaluated
to confirm that the impact and control sites are prop-
erly matched. Investigators should test for mean dif-
ferences as well as differences in distribution. In
addition, the variance of the physical and ecological
similarities between the control and impact sites
should be the same over time. For example, if the
mean pH between the two sites is consistent but the
impact site experiences much wider swings in pH
than the control site, then the ability to confidently
detect an impact for a pH-dependent toxicant is com-
promised. Samples within the control and reference
site should be randomly allocated at some level. For
example, in a random sampling design (Fig. 3.1), the
samples would be randomly allocated in a temporal/spatial framework that would allow for a number of different statistical analyses, including analysis of variance (ANOVA).

Figure 3.1. Random before and after control impact (BACI) sample design having both temporal and spatial dimensions; samples are drawn from the control area and the impact area, both before and after the impact. Random samples indicated are from within areas identified as being of similar habitat. (Adapted from Green, 1979.)
In an optimal study design, the impact would be
in the future. Thus, baseline data providing a tempo-
ral control would be available to the investigator. In
practice, baseline data are rarely available, and the in-
vestigator cannot be certain whether differences be-
tween the impact and control sites preceded or
followed the impact. Therefore, cause and effect can-
not be determined. However, the fact that a difference
exists allows the investigator to hypothesize a causal
link.
In some cases, biological variables collected at
an impact site may be compared to a fixed numeric
value rather than to a set of identical measurements
collected at a reference site. Nevertheless, the issues
associated with demonstrating causality remain the
same. In addition, the investigator should note that
the numeric criterion has no variance. It is usually
presented as a single number with no associated un-
certainty. In such cases, a t statistic (see chapter 4)
would be appropriate. As an alternative to the nu-
meric criterion, investigators could use the data from
which the criterion was derived. Uncertainty esti-
mates from that data set could be used in statistical
comparisons.
Key Elements
Several specific survey designs are appropriate for
use in a biocriteria program, but designs for a particu-
lar environmental assessment should be developed
with the aid of a consulting statistician. Such plans
should include the following key elements, beginning
with the notion of a pilot study.
Pilot Studies
In a pilot study, the investigator makes a limited sur-
vey of the variables that determine impact at both the
impact and control site. Data from the survey can be
used to estimate sample sizes, evaluate sampling
methods, establish important variance components,
and critique or reevaluate the larger design. The sam-
ple size helps determine the particular levels of statis-
tical confidence that can be gleaned from the study. In
general, a pilot study can save time and effort by veri-
fying an investigator's preliminary assumptions and
initial evaluations of the impact site. Current studies
and historical data collected at the site of interest or
similar sites can be used to help establish a good mon-
itoring design.
Location of Sampling Points
A second key issue in the study design is the location
of the sampling points. Many specific designs and
variations are available, including (1) completely ran-
dom sample designs, (2) systematic sample designs,
and (3) stratified random sample designs.
Random Samples. In complete random sam-
pling, every potential sampling point has the same
probability of selection. The investigator randomly
assigns the sample points within the impact site and
independently within the control site. No attempt is
made to partition the impact and control sites either
spatially or temporally except to ensure similar physi-
cal habitats. The sampling units are numbered se-
quentially, and the selection is made using a random
number table or computer-generated random num-
bers.
The advantage of random sampling is that statis-
tical analysis of data from points located completely
at random is comparatively straightforward. In addi-
tion, the method provides built-in estimates of preci-
sion. On the other hand, random sampling can miss
important characteristics of the site, spatial coverage
tends to be nonuniform, and some points may be of
little interest.
Systematic Samples. Systematic sampling oc-
curs when the investigator locates samples in a
nonrandom but consistent manner. For example, sam-
ples can be located at the nodes of a grid, at regular in-
tervals along a transect, or at equally spaced intervals
along a streambank. The grid or interval can be gener-
ated randomly, after which the position of all samples
is fixed in space.
Systematic sampling has two advantages over
simple random sampling. First, it is easier to draw,
since only one random number is required. Second,
the sampling points are evenly distributed over the
entire area. For this reason, systematic sampling often
gives more accurate results than random sampling,
particularly for patchy environments or environ-
ments with distinct discontinuous populations.
Systematic sampling also has its disadvantages.
For example, if the magnitude of the biological vari-
able exhibits a fixed pattern or cycle over space or
time, then systematic sampling is unlikely to repre-
sent variance of the entire population. Suppose an or-
ganism has several hatches, roughly at equally spaced
time intervals during the sampling period; then sam-
ples taken at fixed-time intervals may provide a bi-
ased estimate of the average number of individuals
alive at one time. If possible, the population should be
checked for such periodicity. If periodicity is found or
suspected but not verifiable, systematic sampling
should not be used.
Another disadvantage of systematic sampling is
that it is more complicated to estimate the standard
error than if random sampling had been used. Despite
these problems, systematic sampling is often part of a
more complex sampling plan in which it is possible to
obtain unbiased estimates of the sampling errors.
Stratified Random Samples. Stratified samples
combine the advantages of random and systematic
sampling. Stratified random sampling consists of the
following three steps: (1) the population is divided
into a number of parts, called strata; (2) a random
sample is drawn independently in each stratum, and
(3) an estimate of the population mean is calculated.
Thus,

yst = (1/N) Σ Nh yh                                   (3.1)

where yst is the estimate of the population mean, Nh is the total number of sampling units in the hth stratum, yh is the sample mean in the hth stratum, and N = Σ Nh is the size of the population. Note that the Nh are not sample sizes but the total sizes of the strata, which must be known to calculate this value.
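As a brief illustration of Equation 3.1 (the strata sizes and sample means below are hypothetical), the estimate is simply the stratum-size-weighted average of the within-stratum sample means.

# Hypothetical strata: total sizes N_h must be known; y_h are the sample means.
N_h = [400, 250, 150]
y_h = [12.1, 8.4, 15.0]

N = sum(N_h)
y_st = sum(n * y for n, y in zip(N_h, y_h)) / N
print(round(y_st, 2))    # weighted estimate of the population mean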
Stratification is employed if it can be shown that differences between the strata means in the population do not contribute to the sampling error in the estimate of yst. In other words, the sampling error of yst arises solely from variations among sampling units
that are in the same stratum. If the strata can be
formed so that they are internally homogeneous, a
gain in precision over simple random sampling can
occur.
In stratified sampling, the sample size can vary
independently across strata. Therefore, money and
human resources can be allocated efficiently across
strata. As a general rule, strata with the greatest uncer-
tainty (i.e., with the largest expected variance, or
about which little is known) should receive the great-
est amount of sampling effort.
For environments that are known to be fairly ho-
mogeneous with respect to the biological variable un-
der consideration, stratified random sampling will
not add precision to the population estimates. In fact,
using stratification in these environments may intro-
duce a loss of precision and a possible bias in the pop-
ulation estimates. In these cases, the investigator may
save a great deal of time and effort by using simple
random sampling in the sampling plan.
Location of Control Sites
Under EPA's biocriteria program, states may establish
either site-specific reference sites or ecologically sim-
ilar regional reference sites for comparison with im-
pacted sites (U.S. Environ. Prot. Agency, 1990).
Typical site-specific reference sites may be estab-
lished along a gradient. For example, a reference site
can be established upstream of a wastewater outfall
(Fig. 3.1). Gradients work well for rivers and streams;
for larger waterbodies, reference sites can be estab-
lished on a one-to-one basis with a similar waterbody
in the region not experiencing the impact under eval-
uation.
An important consideration in site-specific ref-
erence conditions is to establish that the control site is
not impaired at all or that it is only minimally im-
paired. In particular, baseline data should be obtained
to demonstrate that the impact is linked to the differ-
ences detected between the reference site and the
impacted site.
Ideally, a reference site should exhibit no impair-
ment; however, natural variability in biological data
may make the determination of minimal or no impact
difficult, especially if the impact is relatively small.
An interesting method for site selection is to establish
several reference sites based on their physical simi-
larities with the impact site. For example, selecting
one reference site with higher flow than the impact
site and another with lower flow may increase the in-
vestigator's ability to determine the presence of a real
impact. Comparisons of data collected from the im-
pact and reference sites should provide consistent in-
terpretations of the impact, regardless of which
reference site is used in the comparison.
Minimizing temporal variation in biological
measurements can be critical to the evaluation of con-
trol and impacted sites. A general rule is that samples
should be obtained from the control and impacted
sites during the same time periods. It may be feasible
to target an index period (e.g., late spring or summer)
in which the biological variables are assumed to be
appropriate indicators of ecological health (e.g., the
period of maximum abundance or the period of mini-
mum variation in water chemistry). However, for
some organisms, periods of maximum abundance
may also be periods of high variability. In this case,
periods of low abundance but stable conditions can
be used to help the investigator detect impairment if it
exists.
Estimation of Sample Size
A final key component in developing a survey design
is to determine how many samples are required. In
most plans, the issue involves a trade-off between the
accuracy of the sample estimate and the magnitude of
available monetary and human resources. Conse-
quently, the first step is to determine how large an er-
ror can be tolerated in the sample estimate. This
decision requires careful thought; it depends on how
the collected data will be used and the consequences
of a sizable uncertainty associated with the sample es-
timates. Thus, in reality, selecting a sample size is
somewhat arbitrary and driven by practical consider-
ations of time and money. Investigators should, how-
ever, always approach the selection of sample size
using sound statistical principles.
The appropriate equations for calculating sam-
ple sizes are often design dependent. Here, we present
a design for simple random sampling. Suppose that d
is the allowable error in the sample mean, and the in-
vestigator is willing to take only a 5 percent chance
that the error will exceed d. In other words, the inves-
tigator wants to be reasonably certain that the error
will not exceed d. The equation for the sample size is

n = t² σ² / d²                                        (3.2)

where t is the t statistic for the level of confidence required. For a 95 percent confidence level that the error in the sample mean will not exceed d, t = 1.96. Obviously, an estimate of the population standard deviation, σ, is necessary to use this relationship. In many cases, an estimate of σ can be obtained from existing data. When few data are available about σ, it is a good idea
to generate a set of tables to develop a sense of the
range of samples required.
Suppose, for example, that an investigator
wishes to estimate mean pH readings above a
wastewater discharge. How many samples are needed
to estimate the true mean pH? At the extremes, the in-
vestigator guesses that the standard deviation might
range between 0.5 and 1.2 pH units. This estimate
leads to Tables 3.1 and 3.2:
Table 3.1. Number of samples needed to estimate the true mean (low extreme, σ = 0.5).

                    MARGIN OF ERROR
CONFIDENCE LEVEL    0.2 pH units    0.5 pH units    1 pH unit
95%                 24              4               1
90%                 17              3               1
Table 3.2. Number of samples needed to estimate the true mean (high extreme, σ = 1.2).

                    MARGIN OF ERROR
CONFIDENCE LEVEL    0.2 pH units    0.5 pH units    1 pH unit
95%                 138             22              6
90%                 98              16              4
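For readers who prefer to compute these values directly, the short Python sketch below applies Equation 3.2 and reproduces Tables 3.1 and 3.2 (it assumes t = 1.96 for 95 percent confidence and 1.65 for 90 percent, with results rounded to the nearest whole sample).

def sample_size(sigma, d, t):
    # Equation 3.2: n = t^2 * sigma^2 / d^2
    return (t ** 2) * (sigma ** 2) / (d ** 2)

for sigma in (0.5, 1.2):                       # low and high guesses for sigma
    for t in (1.96, 1.65):                     # 95 and 90 percent confidence
        row = [round(sample_size(sigma, d, t)) for d in (0.2, 0.5, 1.0)]
        print("sigma =", sigma, "t =", t, "n =", row)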
Note that the number of required samples increases
dramatically as the confidence and precision in the
estimates increase, and as the population standard
deviation increases. As a general rule, the precision of
the estimate is inversely proportional to the square
root of the sample size. Therefore, increasing the sam-
ple size from 10 to 40 will roughly double the preci-
sion.
For a fixed precision, changing the required con-
fidence in the estimate from 95 to 99 percent slightly
more than doubles the sample size. Equation 3.2 can
easily be adapted for binary response variables in
which the responses are expressed as proportions or
percentages (Cochran, 1963). In addition, for those
situations where the number of sampling units is fi-
nite, a finite population correction for the sample size
is available (Cochran, 1963).
Equations for calculating sample sizes for ran-
dom, nonrandom, and stratified sample surveys can
be found in the literature. They depend on the sample
design, the available variance estimates, and whether
the environmental assessment has both spatial and
temporal components.
Important Rules
Developing a sample design is frequently driven by
factors other than statistics and biology. For example,
the investigator may be asked to determine a differ-
ence between upstream and downstream stations of a
municipal treatment plant outfall, long after the sus-
pected impacts began. Even in these cases, creative
sampling strategies can help develop the link be-
tween the wastewater outfall and downstream im-
pacts. The following rules apply to most
environmental assessment scenarios.
Rule 1. Sample designs and their associated
analytical techniques can be difficult to
conceptualize and implement. Always
consult individuals with appropriate
training before starting a biocriteria study.
Rule 2. State precisely and clearly the
problem under evaluation before attempting
to develop a survey design.
Rule 3. Collect samples from a reference site
as a basis for inferring impact. In general,
the sampling scheme used at the impacted
site should be the same as that employed at
the reference site.
Rule 4. To the degree possible, use
environmental characteristics to minimize
the error in the sample estimate. For
example, for patchy environments examine
the possibility of systematic sampling; for
heterogeneous populations, examine the
possibility of using stratified random
sampling. In all cases, attempt to minimize
sample bias by randomly allocating samples
(either geographically or temporally across
the entire population, or within strata).
Rule 5. For seasonally dependent biocriteria,
collect data for several seasons before
attempting to determine an impact. For
biocriteria that are not seasonally
dependent, collect sufficient data to
represent the variability in the population.
Rule 6. Collect enough data so that the
accuracy and precision requirements
associated with using the information are
achieved.
CHAPTER 4. Detecting Mean Differences
Hypothesis testing methods that seek to detect
the mean differences arising from two or more
independent samples are among the most common
statistical procedures performed. However, these pro-
cedures are frequently used without regard to some
basic assumptions about the data under investigation
which, in some cases, leads to errors in interpreta-
tion.
This section describes and illustrates several
methods for detecting mean differences. It focuses on
(1) cases in which only two means are involved, and
(2) situations involving more than two means. It also
presents suggestions concerning the use and abuse of
means testing procedures.
Cases Involving Two Means
Several scenarios within the biocriteria program re-
quire investigators to compare the mean differences
between two independent populations. Suppose for
example, that we want to use biocriteria in a regula-
tory setting in the following situation:
A wastewater treatment plant discharges its ef-
fluent into a stream at a single point. Upstream of the
discharge facility, the stream is in good shape (unaf-
fected by any known sources of pollution). The re-
source agency has sufficient funds to monitor three
stations upstream of the discharge site and a compara-
ble number of stations downstream of the discharge
site during the late summer. The agency has chosen to
evaluate aquatic life use impairment using benthic
species richness.
At each of the six sites, 10 independent measures
of species richness were generated by randomly
placed ponar grabs over a relatively small spatial area
(a sample size of 10 was chosen based on variability
estimates generated at a different, but similar site).
Sites of comparable habitat quality were chosen for
sampling. The upstream sites will serve as a reference
condition against which to compare the downstream
condition.
In addition to the current survey (i.e., sampling
regime, data collection, and interpretation), the regu-
latory agency has identified an additional upstream
site for which it has 10 years of comparable long-term
(historical) data. The investigators have no reason to
believe that a time component exists in the long-term
data. Table 4.1 presents descriptive information asso-
ciated with the upstream and downstream sites and
with the long-term site.
The question for investigators is this: Do the data
reveal a downstream effect associated with the
wastewater discharge? Several methods are available
for assessing the mean differences between the up-
stream and downstream sites, and each method has
both positive and negative aspects.
Random Sampling Model, External Value for σ
Suppose investigators believe that the 30 measures of
benthic species richness collected at the upstream
and downstream sites can be treated as random sam-
ples from appropriate populations. In particular, they
Table 4.1. Descriptive statistics: upstream-downstream measures of benthic species richness.

                 SITE   N     MEAN   STD.   MINIMUM   MAXIMUM   10%-TRIMMED MEAN   MEDIAN ABS. DEV.
Upstream         1      10    10.0   2.3    7.5       14.8      9.7                1.5
                 2      10    12.6   2.5    10.3      18.0      12.2               1.3
                 3      10    11.2   2.4    7.2       15.1      11.2               1.0
Downstream       4      10    10.4   2.4    6.3       13.7      10.5               1.0
                 5      10    7.7    3.7    3.4       14.7      7.4                2.7
                 6      10    9.0    1.8    5.6       11.1      9.1                1.5
Historic         7      200   10.4   3.4    0.17      19.4      11.1               2.6
Pooled Data      1-3    30    11.3   2.5    7.2       18.0      11.1               1.0
                 4-6    30    9.0    2.9    3.4       14.7      9.0                1.6
believe that the two populations have the same form (i.e., normal distributions with the same variance, σ²) but different means, μa and μb. How can the investigators use statistical theory to make inferences about the effect of the wastewater treatment plant discharge?

If the data were random samples from the populations, with Na = 30 observations from the upstream population and Nb = 30 observations from the downstream population, the variances of the calculated averages, Ya and Yb, would be:

V(Ya) = σ²/Na        V(Yb) = σ²/Nb                    (4.1)
Likewise, in the random sampling model, Ya and
Yb would be distributed independently, so that:
V(Ya - Yb) = σ²/Na + σ²/Nb = σ²(1/Na + 1/Nb)          (4.2)
Even if the distributions of the original observa-
tions had been moderately nonnormal, the distribu-
tion of the difference Ya-Yb between sample averages
would be nearly normal because of the central limit
effect. Therefore, on the assumption of random sam-
pling,
z = [(Ya - Yb) - (μa - μb)] / [σ √(1/Na + 1/Nb)]      (4.3)
would be approximately a unit normal deviate.
Now, σ, the hypothetical population value for the standard deviation, is unknown. However, the historical data yield a standard deviation of 3.4. If this value is used for the common standard deviation of the sampled populations, the standard error of the difference, Ya - Yb = 2.3, is

σ √(1/Na + 1/Nb) = 3.4 √(1/30 + 1/30) = 0.89

Based on the robust estimators (trimmed mean difference of 2.1 and median absolute difference of 1.6), the standard error of the difference would be 0.41. If the assumptions are appropriate, the approximate significance level associated with the postulated difference (μa - μb) in the population means will then be obtained by referring

z0 = [(Ya - Yb) - (μa - μb)] / 0.89

to a table of significance levels of the normal distribution. In particular, for the null hypothesis (μa - μb) = 0, z0 = 2.3/0.89 = 2.6, and Pr(z > 2.6) < .005. Again, the upstream/downstream effect seems to be realistic (using the robust estimators, z = 5.1 and Pr[z > 5.1]
< .001). Note that we use the z distribution in this ex-
ample because the population variance is determined
from an external set of data that represents the popu-
lation of interest, an assumption equivalent to as-
suming that the variance of the population is known
(i.e., not estimated).
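The z calculation above can be reproduced from the pooled summary statistics in Table 4.1; the sketch below is illustrative, and its unrounded standard error (about 0.88) differs trivially from the 0.89 quoted in the text.

import math

mean_up, mean_down = 11.3, 9.0     # pooled upstream and downstream means (Table 4.1)
n_a = n_b = 30
sigma = 3.4                        # external (historical) standard deviation

se = sigma * math.sqrt(1 / n_a + 1 / n_b)
z = (mean_up - mean_down) / se
print(round(se, 2), round(z, 2))   # about 0.88 and 2.6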
Random Sampling Model, Internal Value for σ
Suppose now that the only evidence about σ is from the Na = 30 samples taken upstream and the Nb = 30 samples taken downstream. The sample variances are

sa² = Σ(Yai - Ya)² / (Na - 1) = 6.25

sb² = Σ(Ybi - Yb)² / (Nb - 1) = 8.41
On the assumption that the population variances of the upstream and downstream sites are, to an adequate approximation, equal, these estimates may be combined to provide a pooled estimate s² of this common σ². This is accomplished by adding the sums of squares in the numerators and dividing by the sum of the degrees of freedom,

s² = [Σ(Yai - Ya)² + Σ(Ybi - Yb)²] / (Na + Nb - 2) = 7.52
On the assumption of random sampling from normal populations with equal variances, in which the discrepancy [(Ya - Yb) - (μa - μb)] is compared with the estimated standard error of Ya - Yb, a t distribution with Na + Nb - 2 degrees of freedom is appropriate. The t statistic in this example is calculated as

t = (Ya - Yb) / [s √(1/Na + 1/Nb)] = 2.3 / 0.71 = 3.2
This statistic is referred to a t table with 58 degrees of freedom. In particular, for the null hypothesis that (μa - μb) = 0, Pr(t > 3.2) < .001. Again, an upstream/downstream effect seems plausible. Using the robust statistics, a pooled estimate of error can be calculated as the average of the median absolute deviations associated with each data set ([1.0 + 1.6] / 2 = 1.3). Therefore, the t statistic is 6.3 and Pr(t > 6.3) < .001. Note that we use the t distribution in this example because the population variance is estimated from the survey data and not assumed to be known.
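The same comparison with an internal variance estimate can be scripted from the pooled summary statistics in Table 4.1. Because the tabulated means and standard deviations are rounded, this sketch gives a pooled variance near 7.3 and a t statistic near 3.3 rather than the 7.52 and 3.2 quoted above; the conclusion is unchanged.

import math

mean_up, mean_down = 11.3, 9.0     # pooled means (Table 4.1)
sd_up, sd_down = 2.5, 2.9          # pooled standard deviations (Table 4.1)
n_a = n_b = 30

s2 = ((n_a - 1) * sd_up ** 2 + (n_b - 1) * sd_down ** 2) / (n_a + n_b - 2)
se = math.sqrt(s2) * math.sqrt(1 / n_a + 1 / n_b)
t = (mean_up - mean_down) / se
print(round(s2, 2), round(se, 2), round(t, 1))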
Testing against a Numeric Criterion
In the preceding sections, hypothesis tests were pre-
sented for the two-sample case. Similar tests are avail-
able for testing a sample mean against a fixed numeric
criterion (for which an associated uncertainty does
not exist). In this case, the t statistic can be written as
follows:
t = (Y - μ0) / (s / √n)                               (4.4)

Here, Y is the sample mean, s is the sample standard deviation, n is the sample size, and μ0 is the numeric criterion of interest. The probability of a greater value can be found in a t table using n - 1 degrees of freedom.
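A small Python sketch of Equation 4.4 is given below; the data values are invented purely for illustration, and the criterion of 9.0 is likewise hypothetical.

import math

def t_against_criterion(data, criterion):
    # Equation 4.4: t = (sample mean - criterion) / (s / sqrt(n)), with n - 1 df
    n = len(data)
    m = sum(data) / n
    s = math.sqrt(sum((x - m) ** 2 for x in data) / (n - 1))
    return (m - criterion) / (s / math.sqrt(n))

richness = [7.5, 9.0, 10.0, 10.5, 11.0, 9.5, 12.0, 14.8, 8.0, 7.7]   # illustrative only
print(round(t_against_criterion(richness, 9.0), 2))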
A Distribution-Free Test
In many instances, the assumption that the raw data
(or paired differences) are normally distributed does
not hold. Even the simplest monitoring design involv-
ing the comparison of two means requires either (1) a
long sequence of relevant previous records that may
not be available or (2) a random sampling assumption
that may not be tenable. One solution to this dilemma
is the use of distribution free statistics such as the W
rank sum test (Hollander and Wolfe, 1973). The W test
is designed to test the hypothesis that two random
samples are drawn from identical continuous distri-
butions with the same center. An alternative hypothe-
sis is that one distribution is offset from the other, but
otherwise identical. Comparative studies of the t and
W tests indicate that while the t test is somewhat ro-
bust to the normality assumption, the W test is rela-
tively powerful while not requiring normality. In
many cases, performing both the t and W tests can be
used as a double check on the hypothesis.
To conduct the W test (see Chapter 2), the investi-
gator combines the data points from the samples, but
maintains the separate sample identity. This overall
data set is ordered from low value to high value, and
ranks are assigned according to this ordering. To test
the null hypothesis of no difference between the two
distributions f(x) and g(x) (i.e., H0: f[x] = g[x]), the
ranks of the data points in one of the two samples are
summed:
W = Σ Rj                                              (4.5)
Statistical significance is a function of the degree
to which, under the null hypothesis, the ranks occu-
pied by either data set differ from the ranks expected
as a result of random variation. For small samples, the
W statistic calculated in Equation 4.5 can be com-
pared to tabulated values to determine its signifi-
cance. Alternatively, for moderate to large samples, W
is approximately normal with mean E(W) and vari-
ance V(W):
ance V(W):

E(W) = Na(Nb + Na + 1) / 2                            (4.6)

V(W) = NaNb(Nb + Na + 1) / 12                         (4.7)

z = (W - E(W)) / (V(W))^0.5                           (4.8)
In the upstream/downstream case that we have been discussing, W = 1,127, z = 3.12, and Pr(z > 3.12) < .001.
Table 4.2. Assumptions, advantages, and disadvantages associated with various two-sample means testing procedures.

External reference distribution
    Assumptions: Past data can provide a relevant reference set for the observed difference Ya - Yb.
    Advantages: No assumption of independence of errors. No need for random sampling hypothesis.
    Disadvantages: Need relevant, lengthy past records. Construction of reference distribution can be tedious.
    Should consider for use when: Quality, consistency, and length of data are deemed to represent a healthy ecosystem.
    Should not use when: Known impacts to the reference site have occurred, or physical and biological differences between the impact and reference site are identified.

Normal distribution with external estimate of σ
    Assumptions: Individual observations are as if obtained by random sampling from normal populations with common standard deviation.
    Advantages: Continuous reference distribution that is easy to calculate.
    Disadvantages: Need to know σ. Need assumption of independence of individual errors coming from random sampling hypothesis.
    Should consider for use when: Quality, consistency, and length of data are deemed to be a sample from a healthy ecosystem. Data transformation may be necessary to achieve normality.
    Should not use when: Quality of data is suspect or impacts at the external site are known or suspected.

Normal distribution with internal estimate of σ (σ estimated by s)
    Assumptions: Individual observations are as if obtained by random sampling from normal populations with unknown common standard deviation σ.
    Advantages: Computations are easy. No external data needed.
    Disadvantages: Need assumption of independence of individual errors coming from random sampling hypothesis.
    Should consider for use when: Most commonly used test. Appropriate if normality assumptions hold. If outliers or influential data are apparent, consider the use of robust estimators of the mean and variance.
    Should not use when: Normality assumptions do not hold. Generally, robust estimators of the mean and variance can reduce the influence of outliers.

Distribution free
    Assumptions: Individual observations are as if obtained by random sampling from populations of almost any kind.
    Advantages: Populations randomly sampled need not be normal.
    Disadvantages: Need assumption of independence or symmetry of individual errors arising from random sampling hypothesis.
    Should consider for use when: Can be used if normality assumptions are suspect. Can be used to verify results of parametric tests.
    Should not use when: No real disadvantage of these tests. In most cases, power of the test is equivalent or near the parametric counterpart.
Parametric or Analysis of Variance Methods

When more than two sites are compared, parametric analysis of variance (ANOVA) procedures require the investigator to make several modeling decisions. These decisions include the effects of interest (model specification: one-way designs, two-way designs, and so forth); whether the classification variables are random, fixed, or nested; whether any interactions (nonadditive effects) are present in the data; how to handle unbalanced designs (unequal sample sizes for the various treatments); and the nature of the error term.
As we can see from this list, ANOVA procedures
are not simple but require a great deal of thought. In
general, the ANOVA model should follow directly
from the sample design used to collect the biocriteria
data. The following model illustrates a simple
one-way, fixed block design like that described in the
upstream/downstream case presented here. The over-
all model for the ANOVA is
Yij = μ + αi + εij                                    (4.9)

where

Yij = the jth response for the ith site
μ = the population mean
αi = the effect of site i on Y
εij = the error associated with each observation in the data.
The model assumes that the errors are normally distributed with mean 0 and variance σ². Based on the model, any observation is composed of an overall mean (μ), a site effect (αi), and a random element (εij) from a normally distributed population. Hypothesis
testing for the ANOVA model is undertaken by calcu-
lating the variance associated with model compo-
nents (sums-of-square differences around the mean
effect). A test statistic is formed by comparing the
mean square differences associated with a model
component to the mean error term. This statistic is
distributed as an F distribution. Table 4.3 presents an
example of this variance breakdown for the simple
upstream/downstream model.
Table 4.3. Analysis of variance results for the case study model.

SOURCE   DF   SUM OF SQUARES   MEAN SQUARE   F VALUE   Pr > F
Site     5    146.57           29.31         4.51
Error    54   350.67           6.49
Total    59   497.24
As seen in the table, the site effect is an important indicator of the level of benthic species richness. Therefore, it seems a good idea to explore
the relationship among the site means as a method of
examining a possible gradient of upstream/down-
stream differences. Several methods are available for
testing the differences between site means. In this ex-
ample, the method of least significant difference
(LSD), Duncan's multiple range test, and Tukey's
studentized range test are presented. (A review of
these and other multiple comparison methods is in
the SAS/STAT Guide for Personal Computers.) Tables
4.4 through 4.6 present the results of these multiple
comparison tests.
Table 4.4. Least significant difference multiple comparison test.

GROUPING   MEAN   N    SITE
A          12.6   10   2
A B        11.2   10   3
A B        10.4   10   4
B          9.9    10   1
B C        8.9    10   6
C          7.6    10   5
Table 4.5. Duncan's multiple comparison test.

GROUPING   MEAN   N    SITE
A          12.6   10   2
A B        11.2   10   3
A B        10.4   10   4
B C        9.9    10   1
B C        8.9    10   6
C          7.6    10   5
Table 4.6. Tukey's multiple comparison test.

GROUPING   MEAN   N    SITE
A          12.6   10   2
A B        11.2   10   3
A B        10.4   10   4
B C        9.9    10   1
B C        8.9    10   6
C          7.6    10   5
In the above tables, sites within a specified grouping are not different at the α = 0.05 level of significance.
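The one-way ANOVA in Table 4.3 can be approximated directly from the per-site summary statistics in Table 4.1. The Python sketch below is illustrative; because the tabulated means and standard deviations are rounded, its sums of squares and F value differ slightly from Table 4.3.

means = [10.0, 12.6, 11.2, 10.4, 7.7, 9.0]   # site means from Table 4.1
sds = [2.3, 2.5, 2.4, 2.4, 3.7, 1.8]         # site standard deviations from Table 4.1
n = 10                                        # observations per site
k = len(means)

grand_mean = sum(means) / k
ss_site = n * sum((m - grand_mean) ** 2 for m in means)   # between-site sum of squares
ss_error = sum((n - 1) * s ** 2 for s in sds)             # within-site sum of squares
df_site, df_error = k - 1, k * (n - 1)
F = (ss_site / df_site) / (ss_error / df_error)
print(round(ss_site, 1), round(ss_error, 1), round(F, 2))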
Nonparametric or Distribution Free
Procedures
Distribution free methods for testing multiple sample
means are available in much the same format as for
parametric tests. The Kruskal-Wallis rank sum test
(one-way design) and the Friedman rank sum test
(two-way design) are frequently used when the nor-
mality assumptions do not hold (see Hollander and
Wolfe [1973] for a review of these methods). Multiple
comparison methods based on the individual rank
scores for each site are available.
Again, the investigator must develop the model
to match the experimental design. In the up-
stream/downstream comparisons of benthic species
richness, the Kruskal-Wallis test with a simple
one-way model results in a chi-square statistic of
16.38 (Pr > chi-square = 0.006). Again, the up-
stream/downstream sites appear to differ in the mea-
sured biocriteria. Results of the multiple comparison
tests using ranks were similar to those presented in
the ANOVA model.
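For investigators working in a scripting environment rather than SAS, a Kruskal-Wallis test takes only a few lines; the site data below are invented solely to illustrate the call, and the scipy library is assumed to be available.

from scipy import stats

site_1 = [10.2, 11.5, 9.8, 12.0, 10.9]    # hypothetical richness values
site_2 = [8.1, 7.5, 9.0, 8.8, 7.9]
site_3 = [11.8, 12.5, 13.0, 11.1, 12.2]

h_stat, p_value = stats.kruskal(site_1, site_2, site_3)
print(round(h_stat, 2), round(p_value, 4))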
A Test for Broad Alternatives
Frequently, investigators are faced with situations in
which tests for mean differences or variance differ-
ences are not sufficient. For example, investigators
may realize that smaller fish are more sensitive to a
pollutant than larger fish. In such cases, simple test-
ing for mean differences (in which the mean is calcu-
lated without regard to size class) between reference
and impacted sites may not suffice. Instead, the mea-
sure of toxic effect will be better reflected through
changes in the distribution of fish caught at the two
sites. Examining the differences in distribution func-
tions among sites may be a more sensitive way to de-
tect effects than relying on population estimates such
as the mean and variance.
Statistics designed to detect broad classes of al-
ternatives, as in the scenario presented here, are dis-
tribution free tests (i.e., they do not rely on normality
assumptions), although they do have parametric
counterparts. For a single sample, goodness-of-fit
tests to gauge the correspondence between an empiri-
cal distribution function of observations and a spe-
cific probability model or distribution (e.g., normal or
lognormal) may be useful. These tests can also be con-
ducted using the chi-square statistic (see Snedecor
and Cochran, 1967).
The Kolmogorov-Smirnov
Two-Sample Test
Within the biocriteria program investigators will
frequently be challenged to evaluate a broad range of
differences between two or more populations. The
Kolmogorov-Smirnov (KS) two-sample test is easy to
implement and can be used to evaluate the relation-
ship between two distribution functions. This test
provides graphic and statistical evaluations of two
sets of data.
The KS two-sample test involves the develop-
ment of two cumulative distribution functions (CDFs)
to test the hypothesis that each sample was taken
from the same population. The test is based on the dif-
ference between the empirical distribution functions.
The largest difference between the two functions,
Dmax, forms the basis for the test statistic. Dmax is the
maximum vertical distance at any horizontal point
between the two CDFs (Fig. 4.1).
To generate a CDF for an individual sample, the
data are ordered from lowest to highest, and the rank
order of each point determined. Dividing each rank
by the sample size results in a cumulative distribu-
tion function ranging from 0 to 1 (or 0 to 100 percent,
if multiplying by 100). The two samples need not
have the same number of observations. Tabled values
of the test statistic are available for various sample
sizes (Hollander and Wolfe, 1973). The test can be
conducted as either a one-sided or a two-sided test.

Figure 4.1. Cumulative distribution functions of upstream and downstream sites (x-axis: species richness; y-axis: cumulative percent; separate curves for upstream and downstream sites).

For the benthic species richness example shown here
in Figure 4.1, Dmax is 0.433
(43.3 percent) which occurred at a species richness
value of 10.6. The null hypothesis is rejected with a
Type I error rate of 0.0072.
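A minimal Python sketch of this procedure (illustrative only; the data below are hypothetical, not the benthic survey data) builds each empirical CDF by sorting the data and dividing ranks by the sample size, then uses scipy.stats.ks_2samp to obtain Dmax and its p-value.

    # Kolmogorov-Smirnov two-sample test on hypothetical data.
    import numpy as np
    from scipy import stats

    upstream = np.array([12, 14, 13, 15, 16, 11, 14, 13, 15, 12])
    downstream = np.array([8, 10, 9, 11, 12, 9, 10, 8, 11, 10])

    def empirical_cdf(x):
        """Sort the data; the CDF value at each point is rank / sample size."""
        x = np.sort(x)
        return x, np.arange(1, len(x) + 1) / len(x)

    x_up, cdf_up = empirical_cdf(upstream)      # points for plotting the upstream CDF
    x_dn, cdf_dn = empirical_cdf(downstream)    # points for plotting the downstream CDF

    d_max, p_value = stats.ks_2samp(upstream, downstream)
    print(f"Dmax = {d_max:.3f}, p = {p_value:.4f}")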
Relationship of Survey Design
to Analysis Technique
Table 4.7 outlines the relationship between means
testing techniques and selected survey designs as de-
scribed in earlier sections. As a general rule, the data
analysis techniques are driven by the survey design.
The principal decision points are the number of sites,
the available sample size, and the presence or absence
of reference sites. However, investigators should not
be constrained by the survey design. Data explora-
tion, using any technique that fits the data, is encour-
aged and can provide insightful results.
Table 4.7. Survey design and analysis techniques.

Survey design: Upstream/downstream: random sampling at single sites using current survey data.
Mean detection method: t-test using an internal value of the variance; Wilcoxon test; with large data sets, a KS two-sample test may be appropriate.

Survey design: Upstream/downstream: random sampling at multiple sites using current survey data.
Mean detection method: One-way ANOVA using an internal value of the variance; KS two-sample test on merged upstream and downstream data; Kruskal-Wallis rank sum test.

Survey design: Upstream/downstream: random sampling within spatial or temporal strata with one or more sites.
Mean detection method: Two-way (or more complicated) ANOVA tests; Friedman rank sum test (and other more complicated nonparametric tests).

Survey design: Impact site data with large off-site external data; for example, when determination of impact is not clearly definable or no good upstream reference condition is available.
Mean detection method: External reference distribution tests, including the two-sample KS test; t-test with external estimate of the variance.

Survey design: Systematic sampling, such as random sampling along a transect or nodes of a grid.
Mean detection method: ANOVA, t-tests with internal estimates of the variance, and possibly distribution tests (also note that such designs may be subjected to techniques that demonstrate geographical trends and patterns, such as kriging and GIS methods).

Survey design: Regionally impacted sites with one or more reference sites.
Mean detection method: Two-sample KS test.
CHAPTER 5 Discussion and Examples
In the previous four chapters, standard statistical
methods were presented, discussed, and illus-
trated with simple examples. Those methods and ex-
amples represent conventional analyses or situations
in which sample sizes are relatively large so that hy-
pothesis testing is essentially straightforward. The
analyses were motivated by available, commonly ap-
plied methods, and the examples were structured to
fit the methods. The purpose was to provide back-
ground statistical guidance, with examples.
In this chapter a different approach is taken.
Here, typical problems involving biosurvey data are
the starting points, and statistical methods for analy-
sis and hypothesis testing are proposed and applied
specifically to the problem. In some cases, hypothesis
testing is possible; in others, the small sample size
may limit statistical inference. In the latter situation,
the investigator may consider design changes so that
different statistical analyses can be undertaken with
biosurvey data in the future.
We begin with a general discussion of the impor-
tance of small sample size and briefly examine judg-
mental and statistical options for small sample size,
followed by examples of hypothesis testing with
small samples. The chapter concludes with "rules of
thumb."
Working with Small Sample
Sizes
The conventional methods for statistical hypothesis
testing and interval estimation presented in chapters
1 through 4 work best under conditions that do not al-
ways exist with biosurvey data. The common ap-
proaches based on an underlying normal probability
model are clearly not essential; distribution-free
methods are versatile and effective. Still, virtually all
confirmatory analyses (i.e., those concerned with hy-
pothesis testing and interval estimation) require esti-
mation of a "location" statistic that is the quantity of
interest (e.g., a mean, median, or quartile), and they
also require estimation of a variability statistic (e.g., a
standard error) that indicates the spread of values for
the location statistic.
An example of a desirable scenario for confirma-
tory statistical analysis was described in Chapter 2.
Data must be available from the sites of direct interest
in the assessment, and sample sizes must be large
enough for hypothesis testing. If the site-specific data
are inadequate (less than two, which would prevent
direct calculation of a sample variance) or too small
(e.g., less than five, which would make the calculated
sample variance quite uncertain), then alternatives to
statistical testing or intervals are possible, but these
alternatives are apt to include additional conditions
or assumptions beyond those required in conven-
tional analyses.
For example, a single sampling might yield a
point estimate for IBI downstream of a wastewater
discharge, but provide no measure of variability. If
historic data exist on IBI at other impacted sites, then
it is reasonable to assume that the variability in the
historic data can be used as the variability measure
for testing at the site of interest. If, on the other hand,
the historic data analysis includes an IBI regression
based on predictors, such as watershed area and
physical habitat quality, then the standard error for
this regression is the appropriate variability measure.
The key feature of these hypothetical examples is that
other, relevant information exists that the investigator
believes can be used to estimate statistics for the site
of interest.
In the absence of historic data for statistical esti-
mation (usually for the estimate of variability), hy-
pothesis testing and interval estimation may still be
possible if the scientist is prepared to make certain as-
sumptions. For example, suppose that an aquatic biol-
ogist is confident that he or she can estimate the
variability in IBI in impacted streams based on experi-
ence and knowledge of the literature. This estimate
could provide the necessary variability measure, but
it is obviously conditional on the judgment of the bi-
ologist.
None of the approaches presented in this docu-
ment are without assumptions; even the example in
Chapter 2 includes the assumption that the sample
data adequately reflect the true situation. Judg-
ment-based estimates of statistics require a different
assumption, namely, the assumption that the investi-
gator's judgment is good.
The most serious difficulty in the application of
interval estimation and hypothesis testing for
biosurvey data is the small sample size associated
with many biological surveys. The strength of infer-
ences from statistical analysis is tied to sample size. If
expert judgment is not available or not acceptable,
then sample size must be large; otherwise, statistical
testing is either not possible or not particularly useful.
But how large is "large enough"? There is no single,
correct answer to that question. As a rule, the stan-
dard error drops according to the square root of the
sample size; thus, the answer to the question depends
on the error level that is acceptable in the problem un-
der study.
In general, sample sizes greater than 10 are usu-
ally desirable, and sample sizes smaller than five may
prevent meaningful statistical testing. In addition,
since standard error may be expected to drop with the
square root of the sample size, there are diminishing
returns as sample size grows larger.
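The following sketch is illustrative only; the standard deviation of 4.5 is simply the hypothetical background value used later in this chapter. It shows how the standard error of the mean shrinks with the square root of the sample size.

    # Standard error of the mean versus sample size (square-root rule of thumb).
    import math

    sigma = 4.5  # assumed background standard deviation
    for n in (2, 5, 10, 20, 40):
        print(f"n = {n:2d}: standard error = {sigma / math.sqrt(n):.2f}")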
What can be done when sample size is too small
and expert judgment is either not available or not ac-
ceptable? Any amount of data or evidence can indi-
cate an effect (or the absence of an effect), and this
information can be described in text, presented in ta-
bles, or displayed in graphs. However, in the case of
very small samples, it is important to emphasize that
the analysis is descriptive and not confirmatory. Al-
ternatively, if the investigators have data on biological
and chemical indicators of impairment and criteria
for each of the indicators, then it may still be possible
to test effects across indicators.
Suppose there is no sample size estimate, only
an estimate of variability based on expert judgment.
How can statistical testing be completed? We actually
have some well-established approaches to elicit judg-
ment-based quantities and error estimates, along with
an effective number of degrees of freedom (Meyer and
Booker, 1991). Alternatively, the scientist may simply
summarize test results in a table with sample size (or
degrees of freedom) and test results (e.g., p-values)
given for a range from small to large samples. In some
cases, the conclusion may not depend on the effective
sample size; in others, sample size may be critical,
which places more importance on the goodness of the
judgmental assessment.
Assessments Involving
Several Indicators
Suppose that sampling has occurred at a stream site at
which environmental degradation is suspected, but
the sample size for any single indicator is too small for
hypothesis testing. For each indicator, the state has es-
tablished an impairment criterion; thus, the results of
sampling could be presented either as a measurement
(e.g., dissolved oxygen concentration) or as success or
failure in meeting the state's criterion. Each of the in-
dicators is expected to provide an independent mea-
sure or assessment of environmental degradation;
therefore, several indices cannot be separately in-
cluded in the analysis if they are based on the same
underlying measurements.
As an example, Table 5.1 presents three biologi-
cal indices, the IBI, ICI, and Iwb based on sampling at
a single site on three different dates. The state
biocriteria are also given. It is assumed that the
two-month period between samplings results in tem-
poral independence between the samples.
Table 5.1. Biological indices and biocriteria.

DATE           IBI       ICI       Iwb
June 15        43 (1)    38 (1)    9.4 (1)
August 15      39 (0)    38 (1)    8.7 (1)
October 15     42 (1)    36 (1)    8.3 (1)
Biocriteria    40        35        8.5
Since we have only a single estimate per date on
each index, and only three data points per date and
per index, statistical inference opportunities are lim-
ited. We can, however, treat the nine index estimates
in Table 5.1 as nine independent measures by which
to assess the underlying condition of biologic impair-
ment, based on biocriteria violations. The indices in
Table 5.1 are recorded as a 0-1 variable, in parenthe-
ses, indicating attainment (1) or violation (0) of each
biocriterion. Next, these nine 0-1 data points can be
subjected to statistical analysis to determine the over-
all biologic impairment reflected in the aggregate of
the three indices.
First, calculate the proportion of violations (p) in
the sample as an estimate for the probability of bio-
logic impairment at the site:

p = (number of biocriteria violations) / (number of index values) = 1/9 = 0.11
p is a point estimate that is uncertain due to natural
variability and measurement error. We can calculate a
confidence interval for p or test the hypothesis that p is
less than a specified critical value. Once it is calcu-
lated, a confidence interval or a percentile could serve
as a cutoff point indicative of biological impairment.
For example, one might define impairment as more
than 50 percent violations. As a variation on that idea,
Rankin and Yoder (1990) selected the 75th percentile
in a histogram of sample IBI deviations (from the
mean value) to be the limit of tolerable variation.
Confidence intervals for p can be determined us-
ing binomial tables or graphs like those presented in
Hahn and Meeker (1991), or using Table 1.4.1 in
Snedecor and Cochran (1967). For example, the
two-sided 90 percent confidence interval for this ex-
ample (based on Table A.23a in Hahn and Meeker) is
0.041
-------
CHAPTER 5. Discussion and Examples
Cochran, 1967), the two-sided 90 percent confidence
interval is
p - 1.645·sqrt[p(1 - p)/n] < P < p + 1.645·sqrt[p(1 - p)/n]

which, for this example, is

1/9 - 1.645·sqrt[(1/9)(8/9)/9] < P < 1/9 + 1.645·sqrt[(1/9)(8/9)/9]

or approximately 0 < P < 0.28 (the lower limit, which is negative, is truncated at zero).
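As a check on the arithmetic (illustrative only, not part of the original guidance), the normal-approximation interval can be computed as follows for one violation in nine index determinations.

    # Normal-approximation 90 percent confidence interval for a proportion.
    import math

    violations, n = 1, 9
    p_hat = violations / n
    half_width = 1.645 * math.sqrt(p_hat * (1 - p_hat) / n)
    lower = max(0.0, p_hat - half_width)   # a proportion cannot be negative
    upper = min(1.0, p_hat + half_width)
    print(f"p = {p_hat:.3f}, 90% CI: ({lower:.3f}, {upper:.3f})")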
-------
CHAPTER 5. Discussion and Examples
making this classification, the investigator would
have noticed that little overlap of the distributions oc-
curs in the extreme tails of the impacted and reference
site distributions.
Using Background Variability
Measures
In the previous section, the Ohio ICI biocriteria
were identified as point values between classes (e.g.,
ICI = 35 is the warmwater habitat criterion separating
"good/exceptional" from "fair"). When a single ICI de-
termination is available from a new site, the Ohio cri-
teria can be used to classify the site, ignoring
uncertainty. Beyond that, if it is assumed that the
Ohio ICI classification scheme is fixed and certain,
and if a reliable estimate of site ICI variability is avail-
able, then the classification based on a single ICI
value can be assessed using a hypothesis test.
In situations with only a single estimate of a
bioindicator, collateral information must be obtained
to provide the estimate of variability. There are sev-
eral potential measures of site bioindicator variability
that might be suitable; Rankin and Yoder's (1990) dis-
cussion presents several informative graphs to show,
for example, that the IBI coefficient of variation drops
as IBI increases (Rankin and Yoder, 1990, Fig. 2), and
IBI coefficient of variation increases slightly as drain-
age area increases (ibid., Fig. 7).
Knowledge and judgment can be quite helpful in
selecting the variability estimate. For example, if it is
believed that the site bioindicator variability is
roughly constant within a specified category, then a
calculated estimate of variability for the bioindicator
within the appropriate class can be used as the vari-
ability measure for the site of interest. Categories may
be selected on any criterion (e.g., ecoregion, IBI range)
that is scientifically plausible and leads to an accept-
ably large overall sample size for variability estima-
tion.
Rankin and Yoder's graphs suggest that, while
the IBI coefficient of variation changes with selected
categories (IBI range), the IBI standard deviation may
be roughly constant across IBI classes and across
ecoregions. A median standard deviation between 4
and 5 appears to be quite consistent in the graphs.
Based on this collateral information, it is assumed
that site-specific IBI in Ohio, under constant condi-
tions (i.e., no change in site factors that determine
IBI), has a standard deviation of 4.5.
Here is an example of how this estimate is used.
Assume that the single IBI measurement shown in
Figure 5.1 (IBI = 35) was taken in Ohio under the con-
ditions described. Since the sampling program in
Ohio is quite large, 4.5 is effectively the true standard
deviation for all sites; thus, with a single sample, it
may be concluded that the standard error for the
mean value (IBI = 35) is also 4.5. To determine
whether the sample is taken from the reference or im-
pacted distribution, assume that 18 IBI samples were
taken at the reference and impacted sites, and that the
following statistics are calculated:
Reference site sample mean = 42, sample standard
deviation = 5;
Impacted site sample mean = 27, sample standard
deviation = 8.
Then, a two-tailed t test using Equation 2.1b (see
Chapter 2) evaluating the null hypothesis that the
means are the same will result in the following:
t = 1.43, for the hypothesis that the reference site
mean is equal to the third site mean;
and
t = 1.245, for the hypothesis that the impacted site
mean is equal to the third site mean.
Based on this information, the investigator has
some evidence that the sample collected from the
third site is closer to the impacted site mean than to
the reference site mean. However, as conveyed by the
similar t statistic results, the confidence in this con-
clusion is relatively weak.
Final Suggestions for Small
Sample Sizes
The discussion and examples in this chapter,
while intended as useful, general guidance, are not
firmly rooted in statistical theory and hence not al-
ways to be followed. Rather, they reflect our experi-
ence and observations. Further, they concern the real
situations that biologists confront, situations that
do not conform to well-established statistical proce-
dures. However difficult and awkward for statistical
analysis, the problems must be addressed. With this
caveat, the following concluding comments summa-
rize the discussion and examples presented here:
1. If the sample size is 1, a measure of variability
may still be obtained using expert judgment or
other data. If no variability measure can be jus-
tified, then descriptive statistics may be the ex-
tent of the analysis (i.e., no interval estimation
or hypothesis testing).
2. If the sample size is more than 1 but still small
(perhaps 5 or fewer), then it is possible to use
the sample to estimate variability for interval
estimation or hypothesis testing. However, the
intervals may be very large and the tests not very
powerful, because small sample size means that the
strength of evidence is weak.

Figure 5.1. IBI distributions for reference and impacted sites, with the sample shown (x-axis: IBI, 0 to 60; y-axis: relative frequency).
3. Situations may exist with more than a single
estimate of variability. Perhaps one estimate
will be based on data and a second estimate on
expert judgment. In that case, the two esti-
mates of variance can be pooled, using an esti-
mator like that in Chapter 4's "Reference
Distribution Based on Random Sampling
Model, Internal Value for a." A difficulty in
pooling when a judgmental estimate of vari-
ance is involved is determination of the de-
grees of freedom for the judgmental variance
estimate. Perhaps the best approach is to make
a reasoned guess as to how much information
the judgment contains with respect to samples
(the "effective sample size"):
(a) if the judgment is highly uncertain, as-
sign it a small number of degrees of freedom
(perhaps 2-5),
(b) if there is more confidence in the judg-
ment, assign the judgment estimate 5+ de-
grees of freedom.
If the conclusions from this analysis are not
particularly sensitive to the exact choice of the
effective sample size for the judgmental esti-
mate, then inferences may be made with some
confidence. If, however, the conclusions are
sensitive to this choice, then the best approach
may be to obtain additional information before
drawing final conclusions.
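The following sketch is illustrative only; all values are hypothetical, and the Chapter 4 estimator itself is not reproduced here. It pools a data-based and a judgment-based variance estimate by weighting each by its (effective) degrees of freedom, the standard form of a pooled variance.

    # Degrees-of-freedom-weighted pooling of two variance estimates.
    s2_data, df_data = 25.0, 4          # sample variance from n = 5 observations
    s2_judgment, df_judgment = 16.0, 3  # judgmental variance with an assumed effective df

    s2_pooled = (df_data * s2_data + df_judgment * s2_judgment) / (df_data + df_judgment)
    df_pooled = df_data + df_judgment
    print(f"pooled variance = {s2_pooled:.1f} with {df_pooled} degrees of freedom")

If the conclusion changes noticeably as the judgmental degrees of freedom are varied over a plausible range, that sensitivity itself signals the need for more information, as noted above.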
Decision Analysis and
Uncertainty
In the preliminary approach presented here we have
advocated the use of classical statistical hypothesis
testing to summarize data concerning biological crite-
ria. We assume that a decision and succinct conclu-
sions based on the data are needed. However,
alternatives to hypothesis testing may be appropriate
in certain situations. For example, statistical and
graphic summaries (e.g., confidence intervals,
bivariate plots) may be used to summarize and pres-
ent information when the investigator believes that a
classical hypothesis test based on a single parameter
is too brief or that more evidence should be presented.
An alternative is to recast the hypothesis testing
problem using a decision analytic framework. Deci-
sion analysis (Raiffa, 1968; Reckhow, 1984) begins
with the scientific base summarized in the hypothesis
test and incorporates the consequences (e.g., costs
and benefits) of possible decisions. In an informal
analysis, a decision analytic approach may be under-
taken by the decision maker if a desired outcome of
management action is "to hedge away" from large ad-
verse consequences or losses. Informal consider-
ations and hedging may be most effectively
undertaken in an a priori assessment of costs and
Biological Criteria: Technical Guidance for Survey Design and Statistical Evaluation ofBiosurvey Data
33
-------
CHAPTER 5. Discussion and Examples
benefits, which then becomes a primary basis for
choosing between various levels of test significance.
Thus, if it seems likely that biological degradation can
be avoided, then the decision maker may request that
the biologist set the significance level for testing H0
(no impact) relatively high (e.g., at 0.10 or
0.20). Alternatively, if cleanup costs are high relative
to benefits, then the test significance level (for H0: no
impact) could be set relatively low (e.g., at 0.01 or
0.005).
Suppose that a measure of biological integrity is
tested for upstream-downstream differences sur-
rounding wastewater treatment plant discharges from
small treatment plants (less than 5 million gallons per
day) throughout the state. If the per person cost to up-
grade the treatment level for small communities is
generally quite high, and the benefits to be derived
from biological improvements are generally low (rela-
tive to the organisms affected and typical uses of the
streams), hedging away from high cost may be infor-
mally undertaken by setting the significance (or "ac-
tion") level of the test quite low (e.g., 0.01 or 0.005).
Additional study of biological degradation, costs, and
benefits would be triggered only if an up-
stream-downstream test result was significant at this
level.
Hedging away from large losses is an option pre-
cisely because of scientific uncertainty. If there were
no scientific uncertainty about biological degrada-
tion, then the analysis would always focus on costs
and benefits, and the management option with the
highest net benefits would be selected. On the other
hand, if scientific uncertainty is extreme, an appro-
priate strategy may be either to hedge farther from
large adverse consequences or to seek more informa-
tion, if possible, to reduce scientific uncertainty be-
fore new management action is adopted.
In more formal applications, decision analysis
may be used to combine uncertain scientific informa-
tion on biocriteria (expressed probabilistically) with
an overall measure of net benefits or use associated
with management actions. This approach is most ef-
fective in a Bayesian context; Reckhow (1984) pres-
ents a simple example applied to lake eutrophication
management. However, comprehensive Bayesian de-
cision analysis is apt to be prohibitively expensive (in
terms of human resources and cost) for all but the
most critical and consequential problems.
One outcome of data analysis may be that the de-
cision maker will desire more information before im-
plementing new management actions. In formal
decision analysis, a value of information calculation
should be made to help one determine the wisdom of
immediate action versus additional data collection
and analysis. In informal analysis, one should con-
sider how useful new information would be if action
has to be deferred pending its arrival.
The outcome of hypothesis testing is a statistical
summary of evidence on biological degradation. It
does not establish cause and effect, although a
well-designed test may associate degradation with a
candidate cause. The strength of causal conclusions
depends on a number of factors including a priori sci-
entific knowledge and field observation. Scientific
support for management actions is greatest when the
observation of degradation is accompanied by docu-
mentation of a causal relationship.
In most cases, environmental management deci-
sions reflect a certain limited understanding of causal
connections and a certain degree of observational evi-
dence that is more statistical in nature. This combina-
tion is a reasonable basis for decision; in fact, it would
be unreasonable to expect detailed causal knowledge
in support of every decision. However, as manage-
ment actions are undertaken and biological response
is observed after the fact, more observational evi-
dence may be gathered to support earlier decisions.
APPENDIX A Basic Statistics and Statistical
Concepts
Certain specific features of a data set are charac-
terized by descriptive statistics. Of these mea-
sures, the center, or central tendency of a set of data, is
probably the most important. Among the candidate
statistics for central tendency are the mean, median,
mode, and geometric mean. Once the center of a data
set is described, the next important feature is the data
distribution: the spread, dispersion, or scale. Among
the candidate estimators of dispersion are range, stan-
dard deviation, and interquartile range. These two
characteristics of a data set, central tendency and dis-
persion, are the most common descriptive statistics.
Other characteristics, such as skewness and kurtosis,
are occasionally important. The examples that follow
illustrate the choice of descriptive statistics.
Measures of Central Tendency
Probably the single most useful way to summarize a
data set is to indicate the center of the sample. "Cen-
ter" suggests the vague notion of the middle of a clus-
ter of data points or perhaps the region of greatest
concentration. Since samples of data exhibit a variety
of distributions when plotted as bar graphs (histo-
grams), it is not possible to define the center unambig-
uously. As a result several statistical estimators can
serve as candidates for determining central tendency
or location, and each candidate has advantages and
disadvantages for the task at hand.
Mean
The arithmetic mean, or simply the mean (the sum
of all data values divided by their number), is the
most frequently used central tendency estimator. It is
so commonly used that scientists often lose sight of
the true reason for calculating descriptive statistics.
In some cases, the mean is calculated as the central
tendency, though another central tendency statistic
would be better.
The arithmetic mean (x̄) is the sum of the obser-
vations (x_i) divided by the number of observations (n):

x̄ = (Σ x_i)/n     (A.1)
Each observation contributes its magnitude to
the sum of the observations and hence to the mean.
For symmetric distributions (like the normal
bell-shaped or Gaussian distribution), the mean cal-
culated from a sample of data (the sample mean) often
comes quite close to the center, or peak, of the
histogram for that sample. However, biological data
are often not symmetrically distributed. The ex-
tremely high or extremely low observations charac-
teristic of skewed (nonsymmetrical) data
distributions pull the mean in the direction of the
skew; a few extremely high observations can pull the
mean away from the bulk of the observations and to-
ward the few high data points. In those situations, a
more resistant estimator, such as the median or the
mode, may be preferred.
Median
The median is the value of the middle observation
when data are arranged in order of size from lowest
to highest value. The median is therefore known as an
"order statistic" since it is based on an ordering or
ranking of observations. When the total number of ob-
servations is an even number, leading to two middle
values, the median is then the average of the two mid-
dle values.
The "order" of the median observation is
Median Observation = (n + 1)/2     (A.2)
The effect on the median of all but the mid-
dle-ranking observations is simply to hold a place in
the ranking so that outlying observations do not pull
the median toward the extremes. The median is resis-
tant to the influence of any particular observations;
therefore, it is a good statistic to use when the histo-
gram is skewed or unusually shaped.
Trimmed Mean
The trimmed mean is the mean value from a
subsample of the original sample. The subsample is
formed by symmetrically trimming a small percent-
age of the data points from either end of the ordered
observations. For example, a 10-percent trimmed
mean is calculated from the subsample remaining af-
ter the highest and lowest 10 percent of the observa-
tions are removed from the set. At the extreme, the
median is the trimmed mean with all but the middle
observation removed.
The trimmed mean is an efficient indicator of
central tendency if censoring has occurred or if a few
outlying observations are found in the data. Here,
censoring refers to data points reported as "below de-
tection limits." If 15 percent of the data points are be-
low detection limits, then a 15-percent trimmed
mean estimator (involving 15 percent trimming from
each end) should result in less bias than the arithme-
tic mean, the estimator based on all uncensored ob-
servations.
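As an illustration only (hypothetical data, not from the guidance), the trimmed mean can be compared with the ordinary mean and the median; scipy.stats.trim_mean removes the stated proportion of observations from each end before averaging.

    # Mean, median, and 10-percent trimmed mean for a sample with one outlier.
    import numpy as np
    from scipy import stats

    data = np.array([12, 14, 15, 16, 18, 19, 21, 22, 24, 58])  # 58 is an outlying value
    print(f"mean              = {data.mean():.1f}")
    print(f"median            = {np.median(data):.1f}")
    print(f"10% trimmed mean  = {stats.trim_mean(data, 0.10):.1f}")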
Mode
The mode is the value in the sample that is most fre-
quently observed; it can be used for discrete or cate-
gorical data. If no value is repeated more than once, as
is possible for biological data on a continuous scale,
the mode will not be a useful estimator of central ten-
dency. Alternatively, if a histogram is used to repre-
sent the data, then the mode is defined as the range of
values associated with the tallest bar on the histo-
gram. The mode is a good estimator for central ten-
dency because the most frequently observed value is
usually near the center of the distribution. The histo-
gram will indicate visually whether the mode actu-
ally does correspond with the center of the sample.
Geometric Mean
The geometric mean is a reasonable measure of cen-
tral tendency for a set of data that exhibit a lognormal
distribution. It is the antilog of the mean of
logarithmically transformed data. The lognormal data
distribution is skewed in the original units of mea-
surement, but normal (Gaussian) when the original
measurements are log-transformed. Several investi-
gators suggest that the lognormal distribution is a
good probability model for concentration data on en-
vironmental contaminants. Data sets described by the
lognormal distribution have a few high values that are
somewhat extreme from the bulk of the observations.
The geometric mean may be calculated in two
ways:

Geometric Mean = antilog[(Σ log x_i)/n]     (A.3)

or:

Geometric Mean = (Π x_i)^(1/n)     (A.4)

where Π x_i = x_1 · x_2 · x_3 · ... · x_n.
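The two forms are equivalent, as the following sketch (hypothetical, right-skewed data; not part of the original guidance) confirms.

    # Geometric mean computed two equivalent ways (equations A.3 and A.4).
    import numpy as np

    x = np.array([1.2, 3.5, 0.8, 2.1, 15.0])
    gm_antilog = np.exp(np.mean(np.log(x)))      # antilog of the mean of the logs
    gm_product = np.prod(x) ** (1.0 / len(x))    # nth root of the product
    print(gm_antilog, gm_product)                # identical results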
Measures of Dispersion
Central tendency measures alone are not sufficient to
summarize a data set; measures of dispersion or
spread are also needed. Dispersion in a data set refers to
the variability in the observations around the center
of the distribution. Good measures of dispersion will
be obtained from symmetric distributions. Asymme-
try, or skew, will affect the estimate of dispersion so
that it overestimates spread in the shorter tail of the
data distribution (while underestimating it in the lon-
ger tail). A transformation (e.g., logarithm) should be
considered in cases of asymmetry in order to create a
symmetric distribution. Statistics are then calculated
on the basis of the transformed metric.
Standard Deviation
The most commonly used statistic for dispersion is
the standard deviation. In fact, the standard devia-
tion, like the mean, is used so often that it is some-
times thought to be the equivalent of dispersion. It is,
however, a measure of variability that represents the
average distance of the data from the mean; and, like
the mean, it is strongly affected by extreme values.
Thus, the standard deviation for a distribution of data
with a long tail to the right is inflated by the values at
the extreme right. Investigators may prefer to create a
symmetric distribution before calculating the stan-
dard deviation.
For a sample, the sample variance (s²) is

s² = Σ(x_i - x̄)² / (n - 1)     (A.5)

and the sample standard deviation (s) is the square
root of the variance (√s²).
Absolute Deviation
The standard deviation is based on squared error;
squaring the deviation between a data point and the
sample mean increases the influence of the largest
and smallest observations on the estimate of devia-
tion. The absolute deviation can be calculated to re-
duce the influence of outliers on the dispersion
statistic. To arrive at the absolute deviation, the mean
(or median) is first estimated, and then the absolute
value of the difference between the mean or median
and each data point is calculated. The mean or me-
dian of these absolute deviations is then calculated as
the mean or median absolute deviation.
Interquartile Range
Since the standard deviation is unduly influenced by
extreme observations in both symmetric and asym-
metric distributions of data, a resistant alternative to
the standard deviation (as the median is to the mean)
is needed for situations in which the data are skewed
but transformation is undesirable. Fortunately, a good
alternative exists, the interquartile range: the range
that includes the central 50 percent of all observa-
tions in the set. The interquartile range, like the me-
dian, is based on order statistics; thus, it is unaffected
by the magnitude of the extreme observations in ei-
ther tail. It is calculated as the difference between the
observation at the 75th percentile (upper quartile) and
the observation at the 25th percentile (lower quartile):
Lower quartile rank order = (1/2)(1 + median rank order)

Upper quartile rank order = (1/2)(1 + n + lower quartile rank)

Interquartile range (I) = upper quartile value - lower quartile value.
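The dispersion measures described in this section can be computed as in the following sketch (hypothetical data; illustrative only).

    # Standard deviation, median absolute deviation, and interquartile range.
    import numpy as np
    from scipy import stats

    data = np.array([12, 14, 15, 16, 18, 19, 21, 22, 24, 58])  # one outlying value
    print(f"standard deviation        = {np.std(data, ddof=1):.2f}")              # nonresistant
    print(f"median absolute deviation = {stats.median_abs_deviation(data):.2f}")  # resistant
    q75, q25 = np.percentile(data, [75, 25])
    print(f"interquartile range       = {q75 - q25:.2f}")                         # resistant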
Range
Range is an easily determined and therefore fre-
quently cited measure of dispersion. The range is sim-
ply the maximum value minus the minimum value.
Since it is clearly affected by the magnitude of the ob-
servations at either extreme, the range should not be
relied on as the sole indicator of variability. Neverthe-
less, it is often informative to list the range along with
another dispersion statistic.
Resistance and Robustness
In a number of scientific fields, particularly those that
depend on observational (as opposed to experimen-
tal) data, errors of measurement and natural variabil-
ity are apt to result in empirical distributions
(histograms) with occasional outliers and shapes that
are more spread-out than the normal density func-
tion. This result, which is fairly common in water
quality studies, makes robustness and resistance im-
portant considerations when choosing statistics to
summarize data. In some situations, of course, the
outliers, rather than central tendency and dispersion,
will be the focus of the study.
A resistant estimator is one that is insensitive to
data points that are quite different from the rest of the
data (i.e., outliers). A robust estimator is one that per-
forms well (efficiently), even if an assumption con-
cerning the underlying probability model is wrong.
For central tendency, the mean is neither resistant nor
robust. The median is resistant to outliers but not ro-
bust since it is not as efficient as other options (i.e., it
is subject to large standard error). The trimmed mean
and so-called M-estimators (Hampel et al. 1986) are
both resistant and robust.
The most commonly used measure of disper-
sion, the standard deviation (or variance), is nonresis-
tant (highly sensitive to outliers) and not robust
because squaring the deviation emphasizes deviant
data points. The absolute deviation and the
interquartile range are more resistant but not highly
robust.
Resistance and robustness provide a measure of
insurance against features of the sample data that may
yield a summary estimate that is not representative of
the data set as a whole. A robust and resistant estima-
tor is not the best choice if, for example, there are no
outliers and the sample is an exact normal density
function. However, if outliers do occur, and samples
are not normal (or lognormal), then robust and resis-
tant estimators of center and dispersion are wise and
safe choices that will help investigators avoid faulty
inferences.
Graphic Analyses
It is good practice in statistical analysis to begin with
various displays of the raw data. That is, before de-
scriptive statistics are calculated from a data set, and
before analyses such as hypothesis testing and linear
(regression) model building occur, it is wise to look at
empirical graphs. The graphs recommended for this
task help the investigator identify the need to trans-
form the data before conducting the statistical analy-
sis.
Most procedures in statistics (e.g., regression
analysis, hypothesis testing) derive summary values
(e.g., mean, trimmed mean) from a data set. If the in-
ferences drawn from statistical procedures are to be
valid for the entire data set, then the summary statis-
tics must represent the entire set. Graphic displays
guide the choice of any necessary manipulations of
the data set and help assure the selection of appropri-
ate summary statistics. The examples presented here
underscore the importance of displaying the data at
the beginning of a statistical study.
Graphs can also be useful during the course of a
statistical study. For example, bivariate scatter plots
help scientists select independent variables for a re-
gression equation, and scientists will often wisely
choose to present the results of a statistical analysis in
graphic form. Conclusions are often most effectively
conveyed through graphs.
Histograms
Perhaps the most fundamental level of study is an
analysis of data on a single characteristic. Assume, for
example, that an aquatic biologist has a data set for
species richness from a stream study and now desires
to summarize this information. The biologist could
calculate the trimmed mean and median absolute de-
viation of the sample; alternatively, she could calcu-
late other statistics representing central tendency and
dispersion. To determine which of these statistics are
most useful, the biologist should first look at a plot of
the data. The histogram is often used to display data
representing a single characteristic (such as IBI).
For example, suppose that the index of biotic in-
tegrity in Table A.1 has just been determined for a par-
ticular stream from headwaters to mouth, and the
Table A.1. IBI data for a particular stream.

12  12  14  15  16  22  24  23  25  24  26  24  24
27  33  34  35  56  58  36  35  38  23  41  28  42

Figure A.1. Histogram of IBI data for a particular stream (5-unit intervals; x-axis: IBI, 0 to 60; y-axis: frequency).
biologist wants to picture the
biotic integrity of this stream. As a
first cut, the histogram in Figure A.1 is plotted. To
construct the histogram, the biologist must first di-
vide the range into equal intervals. In Figure A.1, the
range is approximated by 10 to 60 (actually it is 12 to
58) and is divided into intervals of 5 units. For each in-
terval, 11 to 15, 16 to 20, and so on, the height of the
bar represents the number of data points that lie
within that interval. So there are four IBI data points
that lie within 31 to 35 and eight within 21 to 25.
Thus, the bar for the 21 to 25 interval is twice the
height of the 31 to 35 bar.
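A histogram like Figure A.1 can be produced with a few lines of code; the sketch below (illustrative only) uses the 26 IBI values from Table A.1, with bin edges at 10, 15, ..., 60, which differ slightly from the 11 to 15, 16 to 20 grouping described in the text.

    # Histogram of the Table A.1 IBI data with 5-unit intervals.
    import matplotlib.pyplot as plt

    ibi = [12, 12, 14, 15, 16, 22, 24, 23, 25, 24, 26, 24, 24,
           27, 33, 34, 35, 56, 58, 36, 35, 38, 23, 41, 28, 42]

    plt.hist(ibi, bins=range(10, 65, 5), edgecolor="black")
    plt.xlabel("IBI")
    plt.ylabel("Frequency")
    plt.show()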
What does the histogram tell us about this
stream? Basically, it provides us with a visual image of
the distribution of the sample. In specific terms, it
means that we can quickly see such things as the loca-
tion of the center of the sample, amount of dispersion,
extent of symmetry, and the existence of outliers in
the sample. Outliers need not be errors or aberrations;
they are simply "set apart" from the bulk of the obser-
vations. The reasons why they are set apart may be of
particular interest in some studies.
In Figure A.1, the center may be visually associ-
ated with the highest bar (mode) at 21 to 25, or it may
be identified as a middle value (median) around 30.
Dispersion could perhaps be characterized by stating
that the range is 12 to 58, and almost 60 percent (actu-
ally 15 of 26) of the data points lie between 20 and 35.
The histogram is not symmetric, however, and one
might want to check on the validity of the two outly-
ing observations on the extreme right.
The picture created by the histogram is of con-
siderable value in the selection of descriptive statis-
tics. Some care should be observed in the
construction of the histogram, however. With
changes in interval size, the histogram can assume
different shapes that may affect the inferences. For ex-
ample, the IBI data in Figure A. 2 are plotted using an
interval size of 10 units. On that scale, the two highest
data points no longer appear as outliers. In contrast,
the two-unit intervals in Figure A. 3 give the impres-
sion of possible outliers on both the right and left ex-
tremes of center. It is probably good practice to scale
the histogram so that the observations are neither too
aggregated (as in Figure A.2) nor too spread out to per-
mit reasonable inferences to be drawn.
Thus, the histogram provides an impression of
the extent of symmetry in the sample. Symmetry in a
data set is a desirable attribute for two reasons. First, it
often means that one can characterize the sample as
having a distribution with a shape similar to one of
the symmetric distributions (e.g., the normal distri-
bution), which is often assumed to be an underlying
model in statistical inference. Stating, for example,
that a sample approximates the normal distribution
conveys useful information. Beyond that, symmetry
implies that common descriptive statistics are clear:
central tendency refers to the center of symmetry, and
dispersion characterizes variability without skew.
Therefore, it may be useful to apply a transfor-
mation, if necessary, to create symmetry in an asym-
metric data set. Continuous concentration and
Figure A.2. Histogram of IBI data with 10-unit intervals.
Figure A.3. Histogram of IBI data with 2-unit intervals.
Figure A.4. Histogram for log(IBI).

Figure A.5. Histogram for log(IBI): Alternative scale.
line, the "leaves" are written. For each data point, the
leaf is the next digit lower in value than the stem's
digit. Since the stems in Figure A.6 are composed of
the tens digit, the leaves are made up of the units dig-
its. Each observation contributes one leaf to the row
containing its stem. For the IBI data points in Table
A.1, the first observation (12) results in a 2 (the units
digit) placed in the row for the first tens stem (cover-
Figure A.7. Structure of the box and whisker plot, annotated with the maximum value, 75% value, median value, 25% value, and minimum value; the notch indicates statistical significance at the 0.05 level (for the median). Example boxes are shown for Lake A and Lake B.
Figure A.8. Stream IBI box plots for 1979, 1989, and 1990 (y-axis: IBI).
perhaps one that provides both pictorial and statistical
comparison. One such model is the box and whisker
plot, which is available in many statistical software
packages for the microcomputer.
Figure A.7 shows the basic struc-
ture of the box plot. For clarification,
note that the "statistical significance of
the median" on Figure A.7 refers to the
degree of vertical overlap of the notch or
indentation in one box with the notch in
another box. If the notches do not over-
lap vertically, then the medians may be
considered significantly different at ap-
proximately the 0.05 level.
Box plots are based on order statis-
tics which, like the median, are calcu-
lated by ranking the observations from
lowest to highest. Box plots can be used
to convey information on the sample
median; dispersion, as conveyed by the
range and the interquartile range; skew,
as conveyed by the symmetry in the
shape above and below the median; rel-
ative size of the data set, as conveyed by
the width of the box; and statistical sig-
nificance of the median.
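A notched box plot of this kind can be drawn as in the following sketch (illustrative only; the yearly IBI samples are randomly generated placeholders, not the data plotted in Figure A.8).

    # Notched box and whisker plots for three hypothetical yearly IBI samples.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    ibi_by_year = [rng.normal(22, 4, 25),   # 1979
                   rng.normal(30, 4, 25),   # 1989
                   rng.normal(31, 4, 25)]   # 1990

    plt.boxplot(ibi_by_year, notch=True)
    plt.xticks([1, 2, 3], ["1979", "1989", "1990"])
    plt.ylabel("IBI")
    plt.show()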
Figure A.8 shows three sample box
plots for stream IBI data for 1979, 1989,
and 1990. The box and whisker plots in
Figure A.8 provide a substantial
amount of information on IBI during
the years of sampling. First, it is appar-
ent that IBI has increased since 1979, as
there is little vertical overlap of the
1979 box plot with the other two. This
conclusion is further supported by the
lack of vertical overlap in the 1979
notch with the other two notches. In
contrast, while the medians for 1989
and 1990 differ, they are not signifi-
cantly different (0.05 level) and the
samples (boxes) overlap considerably.
None of the years exhibit substantial
skew in the sample data. The 1989 data
are skewed the most, based on the rela-
tive symmetry of the box and whiskers
around the median.
Box plots are helpful as diagnostic
tools and as a method of demonstrating
conclusions about samples following
the completion of a statistical study.
Tukey (1977) and Reckhow (1979) de-
scribe several interesting applications.
Bivariate Scatter Plots
Many statistics (e.g., correlation coefficients)
and many statistical methods (e.g., regression analy-
sis) are fundamentally concerned with relationships
between pairs of variables. Without doubt, the best
way to examine a relationship between pairs of vari-
ables (a bivariate relationship) is through a scatter
plot.

Figure A.9. Bivariate scatter plot of the 1989 and 1990 IBI data (x-axis: 1989 IBI; y-axis: 1990 IBI).
In Figure A.9, a bivariate scatter plot is presented
for the 1989 and 1990 IBI data for a particular stream.
From the plot, we can examine the distribution of data
for each variable separately and for the two variables
together. For example, we can see from Figure A.9 that
two relatively high observations tend to stand apart
from the rest of the data, particularly in the horizontal
direction. As might be expected, there is an approxi-
mately linear correlation between the IBI estimates
for successive years.
Two characteristics of a bivariate sample are of-
ten of interest in statistical studies. First, the biologist
may be interested in the pattern or shape (e.g., linear-
ity or nonlinearity) of a relationship. Linear relation-
ships are often desirable for ease of analysis;
correlation analysis and ordinary least squares (OLS)
regression provide measures of the strength ofalinear
relationship. If the bivariate relationship is nonlinear,
it is possible that a transformation can be applied to
make it linear, or a nonlinear model may be used.
Without question, the scatter plot is the most impor-
tant diagnostic device for evaluating linearity, and it
is often quite helpful in selecting a transformation.

A second topic of interest for bivariate samples is
the presence or absence of outliers. Outliers have no
universally accepted objective definition; rather, the
term is used here to identify observations that stand
apart from a cluster of points. We are concerned about
outliers because they are apt to have excessive influ-
ence on nonresistant statistics like the mean, variance,
sample correlation coefficient, and OLS regression
coefficients. Bivariate plots are valuable for outlier
identification and may suggest approaches (e.g.,
transformation) for correction. In Figure A.9, the two
highest values probably would not be considered out-
liers, since they are compatible with the pattern exhib-
ited in the rest of the data and not substantially
separated from those data.
REFERENCES
Andrews, D.F., P.J. Bickel, F.R. Hampel, P.J. Huber, W.H. Rog-
ers, and J.W. Tukey. 1972. Robust Estimates of Location.
Princeton University Press, Princeton, NJ.
Barnett, V., and T. Lewis. 1984. Outliers in Statistical Data.
2nd edition. John Wiley and Sons, Chichester, UK.
Blalock, H.M. Jr. 1972. Social Statistics. McGraw-Hill, New
York.
Box, G.E.P., J.S. Hunter, and W.G. Hunter. 1978. Statistics for
Experimenters: An Introduction to Design, Data Analy-
sis, and Model Building. John Wiley and Sons, New
York.
Cochran, W.G. 1963. Sampling Techniques. John Wiley and
Sons, New York.
Cochran, W.G., and G. M. Cox. 1957. Experimental Designs.
John Wiley and Sons, New York.
Conover, W.J. 1980. Practical Nonparametric Statistics. 2nd
edition. John Wiley and Sons, New York.
Dixon, W.J., and J.W. Tukey. 1968. Approximate behavior of
the distribution of Winsorized t (trim-
ming/Winsorization 2). Technometrics 10:83-98.
Flury, B., and H. Riedwyl. 1988. Multivariate Statistics, a
Practical Approach. Chapman and Hall, London.
Gilbert, R.O. 1987. Statistical Methods for Environmental
Pollution Monitoring. Van Nostrand Reinhold, New
York.
Green, R.H. 1979. Sampling Design and Statistical Methods
for Environmental Biologists. John Wiley and Sons,
New York.
Hahn, G.J., and W.Q. Meeker. 1991. Statistical Intervals.
John Wiley and Sons, New York.
Hampel, F.R., E.M. Ronchetti, P.J. Rousseeuw, and W.A.
Stahel. 1986. Robust Statistics: The Approach Based on
Influence Functions. John Wiley and Sons, New York.
Hill, M.A., and W.J. Dixon. 1982. Robustness in real life: a
study of clinical laboratory data. Biometrics 38:377-96.
Hollander, M., and D.A. Wolfe. 1973. Nonparametric Statis-
tical Methods. John Wiley and Sons, New York.
Hunsaker, C.T., and D.E. Carpenter, eds. 1990. Environmen-
tal Monitoring and Assessment Program: Ecological In-
dicators. EPA/600/3-90/060. Off. Res. Dev., U.S. Environ.
Prot. Agency, Washington, DC.
Horn, P.S., P.W. Britton, and D.E. Lewis. 1988. On the predic-
tion of a single future observation from a possibly noisy
sample. The Statistician 37:165-72.
Huber, P.J. 1981. Robust Statistics. John Wiley and Sons,
New York.
Hurlbert, S.H. 1984. Pseudoreplication and the design of
ecological field experiments. Ecolog. Monogr.
54:187-211.
Iglewicz, B. 1983. Robust scale estimators and confidence
intervals for location. Pages 404-31 in D.C. Hoaglin, F.
Mosteller, and J.W. Tukey, eds., Understanding Robust
and Exploratory Data Analysis. John Wiley and Sons,
New York.
Kmenta, J. 1986. Elements of Econometrics. 2d ed.
Macmillan, New York.
Linthurst, R.A., et al. 1986. Population Descriptions and
Physico-Chemical Relationships. Vol 1 of Characteris-
tics of Lakes in the Eastern United States.
EPA/600/4-86/007a. U.S. Environ. Prot. Agency, Wash-
ington, DC.
Meyer, M.A., and J.M. Booker. 1991. Eliciting and Ana-
lyzing Expert Judgement: A Practical Guide. Academic
Press, London.
Miller, R.G. Jr. 1986. Beyond ANOVA: Basics of Applied Sta-
tistics. John Wiley and Sons, New York.
Morgan, M.G., and M. Henrion. 1990. Uncertainty. Cam-
bridge University Press, UK.
Mosteller, F., and J.W. Tukey. 1977. Data Analysis and Re-
gression: A Second Course in Statistics. Addi-
son-Wesley, Reading, MA.
Ohio Environmental Protection Agency. 1988. The Role of
Biological Data in Water Quality Assessment. Vol. 1 of
Biological Criteria for the Protection of Aquatic Life. Div.
Water Qual. Monitor. Assess., Columbus, OH.
Raiffa, H. 1968. Decision Analysis. Addison-Wesley, Read-
ing, MA.
Rankin, E.T., and C.O. Yoder. 1990. The nature of sampling
variability in the index of biotic integrity (IBI) in Ohio
streams. EPA-905-9-90/005. Pages 9-18 in Proc. 1990
Midw. Pollut. Meet., Chicago, IL.
Reckhow, K.H. 1979. Techniques for exploring and present-
ing data applied to lake phosphorus concentration. Can.
J. Fish. Aquat. Sci. 37(2):290-94.
. 1984. Decision theory applied to lake management.
Pages 196-200 in Proc. Fourth Ann. Conf. N. Am. Lake
Manage. Soc.
Reckhow, K.H., and S.C. Chapra. 1983. Data Analysis and
Empirical Modeling. Vol 1 of Engineering Approaches
for Lake Management. Butterworth Pubs., Boston, MA.
Reckhow, K.H., and C. Stow. 1990. Monitoring design and
data analysis for trend detection. Lake Reserv. Manage.
6(1):49-60.
Reckhow, K.H., K. Kepford, and W. Warren-Hicks. 1993. Sta-
tistical Methods for the Analysis of Lake Water Quality
Trends. EPA 841-R-93-003. U.S. Environ. Prot. Agency,
Washington, DC.
Rey, W.J.J. 1983. Introduction to Robust and Quasi-Robust
Statistical Methods. Springer-Verlag, Berlin.
Rocke, D.M. 1983. Robust statistical analysis of
interlaboratory studies. Biometrika 70:421-31.
Rocke, D.M., G.W. Downs, and A.J. Rocke. 1982. Are robust
estimators really necessary? Technometrics
24(2):95-101.
Snedecor, G.W., and W.G. Cochran. 1967. Statistical
Methods. 6th ed. Iowa State University Press, Ames.
Staudte, R.G., and S.J. Sheather. 1990. Robust Estimation
and Testing. John Wiley and Sons, New York.
43
-------
Stevens, D. 1989. Field sampling design. In W. War-
ren-Hicks and B. Parkhurst, eds., Ecological Assessment
of Hazardous Waste Sites: A Field and Laboratory Refer-
ence. EPA/600/3-89/013. Environ. Research Lab., U.S.
Environ. Prot. Agency, Corvallis, OR.
Stigler, S.M. 1977. Do robust estimators work with real
data? Ann. Stat. 5(6):1055-98.
Tukey, J.W. 1960. A survey of sampling from contaminated
distributions. Pages 448-85 in I. Olkin, ed., Contribu-
tions to Probability and Statistics, Stanford University
Press, Stanford, CA.
. 1977. Exploratory Data Analysis. Addison Wesley,
Reading, MA.
Tukey, J.W., and D.M. McLaughlin. 1963. Less vulnerable
confidence and significance procedures for location
based on a single sample: trimming/Winsorization.
Sankhya A 25:331-52.
U.S. Environmental Protection Agency. 1990. Biological
Criteria, National Program Guidance for Surface Waters.
EPA-440/5-90-004. Off. Water Reg. Stand., Washington,
DC.
U.S. Government Printing Office. 1988. The Clean Water
Act as amended by the Water Quality Act of 1987. Pub.
L. 100-4, Washington, DC.
Warren-Hicks, W.J., and J. Messer. 1990. Using Biological
Indices to Measure Ecological Condition in Regional Re-
sources. Draft Rep. Prep, for Atmos. Res. Exposure As-
sess. Lab., Research Triangle Park, NC.
Williams, B. 1978. A Sampler on Sampling. John Wiley and
Sons, New York.
Wonnacott, T.H., and R.J. Wonnacott. 1977. Introductory
Statistics. John Wiley and Sons, New York.
Yoder, C.O. 1991. Answering some concerns about biologi-
cal criteria based on experiences in Ohio. Pages 95-104
in G.H. Flock, ed., Water Quality Standards for the 21st
Century. Proc. Off. Water, U.S. Environ. Prot. Agency,
Washington, DC.
Yuen, K.K., and W.J. Dixon. 1973. The approximate behav-
iour and performance of the two-sample trimmed t.
Biometrika 60:369-74.