United States
Environmental Protection Agency
Office of Water

Biological Criteria:
Technical Guidance for
Survey Design and Statistical
Evaluation of Biosurvey Data

Technical Guidance for Survey Design and
  Statistical Evaluation of Biosurvey Data
          Prepared for EPA by Tetra Tech, Inc.
     Principal authors: Kenneth H. Reckhow, Ph.D. and
            William Warren-Hicks, Ph.D.

              George Gibson, Jr., Ph.D.
            Office of Science and Technology
                  Project Leader
         Health and Ecological Criteria Division
                  Office of Water
         U.S. Environmental Protection Agency
               Washington, D.C. 20460

                  December 1997

   This document was developed by the United States Environmen-
tal Protection Agency, Office of Science and Technology, Health and
                 Ecological Criteria Division.

   This text was written by Kenneth H. Reckhow, Ph.D., and William
Warren-Hicks, Ph.D. Jeroen Gerritsen, Ph.D., of Tetra Tech, Inc. pro-
vided editorial and technical support. George R. Gibson, Jr., Ph.D., of
USEPA was Project Leader and co-editor.

   This manual provides technical guidance to States, Indian
Tribes, and other users of biological criteria to assist with survey de-
sign and statistical evaluation of biosurvey data. While this manual
constitutes EPA's scientific recommendations regarding survey de-
signs and statistical analyses, it does not substitute for the CWA or
EPA's regulations; nor is it a regulation itself. Thus, it cannot impose
legally binding requirements on the EPA, States, Indian Tribes, or
the regulated community, and might not apply to a particular situa-
tion or circumstance. EPA may change this guidance in the future.

Foreword                                                                           vii

CHAPTER 1. The Biological Criteria Program and Guidance Documents                1
The Concept of Biological Integrity	1
     Narrative and Numeric Biological Criteria	1
     Biological Criteria and Water Resource Management	2
An Overview of this Document	2

CHAPTER 2. Classical Statistical Inference and Uncertainty                            3
Formulating the Problem Statement	3
Basic Statistics and Statistical Concepts	3
     Descriptive Statistics	 3
     Recommendations	4
Uncertainty	6
Statistical Inference	 7
     Interval Estimation	7
     Hypothesis Testing	 7
     Common Assumptions	8
     Parametric Methods — the t test	9
     Nonparametric Tests — the W test	10
     Example — an IBI case study	 11
     Conclusions	12

CHAPTER 3. Designing the Sample Survey                                            15
Critical Aspects of Survey Design	15
     Variability	15
     Representativeness and Sampling Techniques	15
     Cause and Effect	 16
     Controls	16
Key Elements	17
     Pilot Studies	17
     Location of Sampling Points	18
     Location of Control Sites	19
     Estimation of Sample Size	19
Important Rules	20

CHAPTER 4. Detecting Mean Differences                                              21
Cases Involving Two Means	21
     Random sampling model, external value for σ	21
     Random sampling model, internal value for σ	22
     Testing against a Numeric criterion	22
     A Distribution-Free Test	23
     Evaluating Two-Sample Means Testing	23

Multiple Sample Case	23
 Parametric or Analysis of Variance Methods	23
     Nonparametric or Distribution Free Procedures	25
     Testing for Broad Alternatives	25
The Kolmogorov-Smirnov Two-Sample Test	26
Relationship of Survey Design to Analysis Techniques	27

CHAPTER 5. Discussion and Examples                                                 29
Working with Small Sample Sizes	29
Assessments Involving Several Indicators	30
Regional Reference Data		31
Using Background Variability Measures   	32
Final Suggestions for Small Sample Sizes	32
Decision Analysis and Uncertainty	33

APPENDIX A. Basic Statistics and Statistical Concepts                                 35
Measures of Central Tendency	35
     Mean	35
     Median	35
     Trimmed Mean	35
     Mode	36
     Geometric Mean	36
Measures of Dispersion  	36
     Standard Deviation	36
     Absolute Deviation	36
     Interquartile Range	36
     Range	37
Resistance and Robustness	37
Graphic Analyses	37
     Histograms	37
     Stem and Leaf Displays	39
     Box and  Whisker Plots	40
     Bivariate Scatter Plots	41

References                                                                           43


TABLE                                                                              PAGE
2.1.  Measures of Central Tendency	4
2.2.  Measures of Dispersion	5
2.3.  Useful Graphical Techniques	5
2.4.  Possible Outcomes from Hypothesis Testing	7
3.1.  Number of samples needed to estimate the true mean (low extreme)	19
3.2.  Number of samples needed to estimate the true mean (high extreme)	20
4.1.  Descriptive Statistics: Upstream-Downstream Example	21
4.2.  Assumptions, Advantages, and Disadvantages Associated with Various Two-Sample Means
     Testing Procedures	24
4.3.  Analysis of Variance Results for the Case Study Model	25
4.4.  LSD Multiple Comparison Test	25
4.5.  Duncan's Multiple Comparison Test	25
4.6.  Tukey's Multiple Comparison Test	25
4.7.  Survey Design and Analysis Techniques   	27
5.1.  Biological Indices and biocriteria	30
A.1.  IBI Data	38
FIGURE                                                                             PAGE
2.1.  Sampling Distributions under Different Hypotheses	13
3.1.  Random Sample Design having both Temporal and Spatial Dimensions	17
4.1.  Cumulative Distribution functions of upstream and downstream sites	26
5.1.  IBI Distributions for reference and impacted sites	33
A.1.  IBI Histogram	38
A.2.  IBI Histogram with Ten-Unit Interval Size	39
A.3.  IBI Histogram with Two-Unit Interval Size	39
A.4.  Histogram for Log(IBI)	40
A.5.  Histogram for Log(IBI): Alternative Scale	40
A.6.  Stem and Leaf Display	40
A.7.  Box and Whisker Plots	41
A.8.  Stream IBI Box Plot	41
A.9.  IBI Bivariate Plot	42

    Biological Criteria: Technical Guidance for Survey
    Design and Statistical Evaluation of Biosurvey
Data, by Kenneth H.  Reckhow and William War-
ren-Hicks, was prepared for the U.S. Environmental
Protection Agency to help states develop their biolog-
ical criteria for surface waters and specifically to help
water resource managers assess the reliability of their
data. A good biological criteria program will be practi-
cal and cost effective, but above all it will be predi-
cated on valid and scientifically sound information.
      The application of the concepts and methods
  of statistics to the biological criteria process en-
ables us ". . . to describe variability, to plan research
so as to take variability into account, and to analyze
data so as to extract the maximum information and
 also to quantify the reliability of that information"
                 (Samuels, 1989).
     This initial guidance document is intended to
 reintroduce statistics to the natural resources man-
 ager who may not be current in the application of
this tool (and our ranks are legion, we just don't like
 to admit it). The emphasis is  on the practical appli-
 cation of basic statistical concepts to the develop-
ment of biological criteria for  surface water resource
  protection, restoration,  and  management. Subse-
 quent guides will be developed to expand on and
         refine the ideas presented here.
     Address comments on this document and sug-
 gestions for future editions to George Gibson, U.S.
 Environmental Protection Agency, Office of Water,
  Office of Science and Technology (4304), 401 M
        Street, S.W., Washington, D.C. 20460.

CHAPTER 1  The Biological Criteria Program
                         and Guidance Documents
    Efforts to measure and manage water quality in
    the United States are an evolving process. Since
its simple beginning more than 200 years ago, water
monitoring has progressed from observations of the
physical impacts of sediments and flotsam to chemi-
cal analyses of the multiple constituents of surface
water to the relatively recent incorporation of biologi-
cal observations in systematic evaluations of the re-
source. Further, although biological measurements of
the aquatic system have been well-established proce-
dures since the Saprobic system was documented at
the turn of this century, such information has only re-
cently been incorporated into the nation's approach
to water resource evaluation, management, and protection.
    The U.S.  Environmental  Protection  Agency
(EPA) is charged in the Clean Water Act (Pub. L. 100-4,
§101) "to restore and maintain the chemical, physi-
cal, and biological integrity of the Nation's waters." To
incorporate biological integrity into its monitoring
program, the Agency established the Biological Crite-
ria Program in the Office of Water.
    This program provides technical guidance to the
states for measuring biological integrity as an aspect
of water resource quality. Biological integrity comple-
ments the physical and chemical factors already used
to measure and protect the nation's surface water re-
sources. Eventually all surface water types will be in-
cluded in program technical guidance, including
streams, rivers, lakes and reservoirs,  wetlands, estu-
aries, and near coastal marine waters.
    States will use this information to establish bio-
logical  criteria or benchmarks  of resource quality
against which they may assess the status of their wa-
ters, the relative success of their management efforts,
and the extent of their attainment or  noncompliance
with regulatory conditions or  water  use  permits.
These criteria are intended to augment, not replace,
other physical and chemical methods, to help refine
and enhance our water protection efforts.

The Concept of Biological Integrity
Biological integrity is the condition of the aquatic
community inhabiting unimpaired waterbodies  of a
specified habitat as measured by community struc-
ture and function (U.S. Environ. Prot. Agency, 1990).
Essentially,  the  concept refers to the naturally
dynamic and diverse population of indigenous organ-
isms that would have evolved in a particular area if it
had not been affected by human activities. Such in-
tegrity or naturally occurring diversity becomes the
primary reference condition or source of biological
criteria used to measure and protect all waterbodies
in a particular region.
    Only the careful and systematic measuring  of
key attributes of the natural aquatic ecosystem and its
constituent biological communities can determine
the condition of biological integrity. These key attrib-
utes or biological endpoints indicate the quality of the
waters of concern. They are established by biosurveys
— by analyses based on the sampling of fish, inverte-
brates, plants, and other  flora  and fauna.  Such
biosurveys establish the endpoints or measures used
to summarize several community characteristics
such as taxa richness, numbers of individuals, sensi-
tive or insensitive species, observed pathologies, and
the presence or absence of essential habitat elements.
    The  careful  selection and derivation of these
measures (hereafter, metrics), together with detailed
habitat characterization, is essential to translate the
concept of biological integrity into useful biological
criteria. That is, the quantitative distillation of the
survey data makes it possible to compare and contrast
several waterbodies in an objective, systematic, and
defensible manner.

Narrative and Numeric Biological Criteria
Two forms of biological criteria are used in EPA's sys-
tem of water resources evaluation and management.

• Narrative biological criteria are general statements
of attainable or attained conditions of biological in-
tegrity and water resource quality for a given use des-
ignation. They are qualitative statements of intent —
promises formally adopted by the  states to protect
and restore the most natural forms of the system. Nar-
rative criteria frequently include statements such as
"the waters are to be free from pollutants of human or-
igin in so far as achievable," or "to be restored and
maintained in the most natural state." The statements
must then be operationally defined and implemented
by a designated state agency.
Biological Criteria: Technical Guidance for Survey Design and Statistical Evaluation of Biosurvey Data

• Numeric criteria are derived from and predicated
on the same objective status  as narrative  criteria,
which are then retained as preliminary statements of
intent. The difference between the two is that the
qualitative  statement of integrity, the condition to be
protected or restored, is refined by the inclusion of
quantitative (numeric) endpoints as specific compo-
nents of the criteria. Compliance with numeric crite-
ria involves meeting  stipulated thresholds  or
quantitative measures of biological integrity.
     The formal adoption of criteria of either type into
state law (with EPA concurrence) makes the criteria
"standards." They are then applicable and enforce-
able under  the provisions of the Clean Water Act.

Biological Criteria and Water Resource Management
Because  these criteria will become the basis for re-
source management and possible regulatory actions,
the manner of their design is of utmost importance to
the states and EPA. The choice of metrics to represent
and measure biological integrity is the responsibility
of ecologists, biologists,  and water  resource manag-
ers. The Agency's role is to continue to develop tech-
nical guidance documents and manuals to assist in
this process.
     The purpose of this document is to present meth-
ods that  will help managers interpret and gage the
confidence with which  the criteria can be  used to
make resource  management  decisions. Using this
guidance, both the technician and  the  policymaker
can objectively convert data into management infor-
mation that will help protect water resources. How-
ever, the use and limits of the information must be
clearly understood to ensure coordination and mu-
tual cooperation between science and management.

An Overview of this Document
The focus of this document is on the basic statistical
concepts that apply within the biocriteria program.
From the program's inception, the problem statement,
survey design, and the statistical methods used in the
analysis must be correlated to provide functional re-
sults. Accordingly, chapter 2 begins with formula-
tions of the problem statement — the focused objec-
tive that helps narrow the scope of observations in the
ecosystem to those necessary to predict the status and
impairment of the biota—and culminates in a discus-
sion of hypothesis testing, the approach advocated in
this guidance document. Chapter 2 also refers begin-
ners to Appendix A for a succinct review of the basic
statistics  and statistical concepts used within the
chapter and throughout this document.
     Chapter 3 presents key issues associated with
the design of the sample survey. Surveys are without
doubt the critical element in an environmental as-
sessment. Designs that minimize error,  uncertainty,
and variability in both biological and statistical mea-
sures have  a great effect on decision makers. This
chapter explores the difference between classical and
experimental design and the issues involved with
random, systematic, and stratified samples. Sample
sizes and how to proceed in confusing circumstances
round out the discussion.
     Chapter 4 deals with  problems that arise from
hypothesis  testing methods based on detecting the
mean differences arising from two or more independ-
ent samples. The use and abuse of means testing pro-
cedures is an important topic. It should generally be
keyed to  the survey design, but other  information
should also be taken into consideration because er-
rors  of interpretation often involve assumptions
about data.
     Chapter 5 is a further discussion, with examples,
of the basic concepts introduced in earlier chapters.
Though hypothesis testing is generally preferred, this
chapter discusses circumstances in which other pro-
cedures may be useful. It also introduces the role of
cost-benefit assumptions in decision analysis and the
limits of data collection and interpretation in the de-
termination of causality. The reader should recall at
all times the basic nature of this document. Advanced
practitioners may look to the references used in pre-
paring this document for additional options and dis-
cussion.


CHAPTER 2  Classical Statistical  Inference
                         and Uncertainty
    Before the biological survey can be designed and
    linked to statistical methods of interpretation, an
exact formulation of the problem is needed to narrow
the scope of the study and focus investigators on col-
lecting the data. The choice of biological and chemi-
cal variables should be made early in the process, and
the survey design built around that selection. Fancy
statistics and survey designs may be appropriate, but
biologically defined objectives should drive the
statistics, not the reverse (Green, 1979).

Formulating the Problem Statement
A clear statement of the objective or problem is the
necessary basis on which the biological survey is de-
signed. A general question such as "does the effluent
from the municipal treatment plant damage the envi-
ronment?" does little to help decision makers. Con-
sider,  however,  their response to a more specific
statement:  "Is   the  mean   abundance  of
young-of-the-year green sunfish caught in seines
above the discharge point greater (with an error rate of
5 percent) than those similarly trapped downstream
of  the discharge point?" The precise nature of this
question makes it a clear guide for the collection and
interpretation of data.
    The  problem statement should minimally in-
clude the biological variables that indicate environ-
mental damage, a reference to the comparisons used
to determine the impact, and a reference to the level of
precision (or uncertainty) that the investigator needs
to be confident that an impact has been determined.
In  the preceding example, green sunfish are the bio-
logical indicator of impact, upstream  and down-
stream seine data are the basis of comparison, and an
error rate of 5 percent provides an acceptable level of precision.
    The problem statement, the survey design, and
the statistical methods used to interpret the data are
closely linked. Here, the survey  design is an up-
stream/downstream set of samples with the upstream
data providing a reference for comparison. A t test or
rank sign test may be used to test for mean differences
between the sites.
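The linkage can be sketched numerically. In this illustration (the seine counts are hypothetical, not drawn from any study in this guidance), the pooled two-sample t statistic is computed from the upstream and downstream samples:

```python
import statistics

# Hypothetical young-of-the-year green sunfish counts per seine haul
upstream = [14, 18, 11, 16, 15, 19, 13, 17]
downstream = [9, 12, 8, 11, 10, 13, 7, 12]

def two_sample_t(x, y):
    """Pooled-variance two-sample t statistic (equal-variance form)."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * statistics.variance(x) +
                  (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    std_err = (pooled_var * (1 / nx + 1 / ny)) ** 0.5
    return (statistics.mean(x) - statistics.mean(y)) / std_err

# Compare the statistic with the critical t value for the chosen
# error rate (here, 5 percent) and nx + ny - 2 degrees of freedom
print(round(two_sample_t(upstream, downstream), 2))
```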
    From a statistical standpoint, the biological vari-
ables  (measures) used to show damage should have
low natural variability and respond sharply to an im-
pact relative to any sampling variability. Natural vari-
ability contributes to the uncertainty associated with
their response to an impact. Lower natural variability
permits reliable inferences with smaller sample sizes.
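This relationship can be illustrated with the standard normal-approximation sample size formula, n = (z * sigma / d)**2, where d is the desired margin of error; sample size estimation is treated more fully in Chapter 3, and the numbers below are purely illustrative:

```python
import math

def samples_needed(sigma, margin, z=1.96):
    """Samples required to estimate a mean to within +/- margin at
    roughly 95 percent confidence: n = (z * sigma / margin) ** 2."""
    return math.ceil((z * sigma / margin) ** 2)

# Halving the natural variability quarters the required sampling effort
print(samples_needed(sigma=8.0, margin=2.0))
print(samples_needed(sigma=4.0, margin=2.0))
```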
    Examining historical data is an excellent means
of selecting biological criteria that are sensitive to en-
vironmental impacts. Species that exhibit large natu-
ral spatial and temporal variations  may be suitable
indicators of environmental change only in small
time scales or localized areas. If so, the use of such
variables will limit the investigator's ability to assess
environmental change in long-term monitoring pro-
grams. Historical data, combined with good scientific
judgment, can be used to select biological criteria that
exhibit minimal natural variability within the context
of the site under evaluation.

Basic Statistics and Statistical Concepts
When a data set is quite small, the entire set can be re-
ported. However, for larger data sets, the most effec-
tive learning takes place when investigators
summarize the data in a few well-chosen  statistics.
The choice to trade some of the information available
in the entire set for the convenience of a few descrip-
tive statistics is usually a good one, provided that the
descriptive statistics are carefully selected and cor-
rectly represent the original data.
    Some descriptive  statistics are so commonly
used that we forget that they are but one option among
many candidate statistics. For example, the mean and
the standard deviation (or variance) are statistics used
to estimate the center of a data set and the spread on
those data. The scientist who uses these statistics has
already decided that they are the best choices to de-
scribe the data. They work very well, for example, as
representatives of symmetrically distributed data that
follow an approximately normal distribution. Thus,
their use in such circumstances is entirely justified.
However,  in other  situations involving biological
data, alternative descriptive statistics may be preferred.

Descriptive Statistics
Before selecting a descriptive statistic, the scientist
must understand the purpose of the statistic. Descrip-
tive statistics are often used in biological studies be-
cause the convenience of a few summary numbers
outweighs the loss of information that results from
not using the entire data set. Nevertheless, as much
information as possible must be summarized in the
descriptive statistics because the alternative may in-
volve a misrepresentation of the original data.
     The basic statistics and statistical techniques
used in this chapter are further defined, described,
and illustrated in the appendix to this document (Ap-
pendix A). Readers unfamiliar with descriptive statis-
tics and graphic techniques should read Appendix A
now and use it hereafter as a reference. Other readers
may proceed directly to the tables  in this chapter,
which summarize the advantages and disadvantages
of the statistical estimators and techniques described
in the appendix.
    The common measures of the center, or central
tendency, of a data set are the mean, median, mode,
geometric mean, and trimmed mean. None of these
options is the best choice in all  situations (see Table
2.1), yet each conveys useful information. The points
raised in Table 2.1 are not comprehensive or absolute;
they do, however, reflect the authors' experience with
these estimators.
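The behavior of these estimators is easy to demonstrate. In this sketch (a small hypothetical sample; the trimming helper is an illustrative implementation, not a standard library routine), the mean is pulled toward a single large value while the median and trimmed mean resist it:

```python
import statistics

# Hypothetical right-skewed sample with one large value
data = [2, 3, 3, 4, 5, 5, 5, 7, 9, 28]

def trimmed_mean(x, prop=0.10):
    """Mean after dropping the proportion prop from each tail."""
    x = sorted(x)
    k = int(len(x) * prop)
    return statistics.mean(x[k:len(x) - k])

print(statistics.mean(data))            # pulled upward by the value 28
print(statistics.median(data))          # unaffected by the extreme value
print(statistics.mode(data))
print(round(statistics.geometric_mean(data), 2))
print(trimmed_mean(data))               # drops smallest and largest values
```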
    Environmental contaminant concentration data
are strictly  positive, and sample data sets  exhibit
asymmetry (i.e., a few relatively high observations).
Therefore, a transformation, in particular, the loga-
rithmic transformation, should be applied to concen-
tration and other data that exhibit these characteris-
tics before analysis. When a transformation is used,
data analysis and estimation occur within the trans-
formed metric; if appropriate, the results may be con-
verted back to the original metric for presentation.
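That workflow can be sketched as follows, using hypothetical concentration values; note that back-transforming the mean of the logarithms returns the geometric mean, not the arithmetic mean, in the original metric:

```python
import math
import statistics

# Hypothetical contaminant concentrations (strictly positive, right-skewed)
conc = [1.2, 0.8, 2.5, 1.1, 0.9, 6.4, 1.5, 0.7]

# Analysis and estimation take place in the transformed (log) metric
logs = [math.log(c) for c in conc]
log_mean = statistics.mean(logs)
log_sd = statistics.stdev(logs)

# Back-transforming the mean of the logs yields the geometric mean
geometric_mean = math.exp(log_mean)
print(round(geometric_mean, 2), round(log_sd, 2))
```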
    A measure of dispersion — spread or variability
— is another commonly reported descriptive statistic.
Common estimators for dispersion are standard devi-
ation,  absolute deviation, interquartile  range, and
range. These estimators are defined, described, and il-
lustrated with examples in the appendix;  Table 2.2
summarizes when and how they may be used.
    Table 2.3  summarizes four of the most useful
univariate and bivariate graphic techniques, includ-
ing histograms, stem and leaf displays, box and whis-
ker plots, and bivariate plots. These methods are also
illustrated in Appendix A.

Recommendations
There is no rigorous theoretical or empirical support
for  using the normal distribution as a population
model for chemical and biological measures of water
quality or as a model for errors. Instead, the evidence
supports using the lognormal model. However, uncer-
tainty about the correctness of the lognormal model
suggests that prudent investigators will recommend
estimators  that perform  well  even if an  assumed
model is wrong.
Table 2.1 — Measures of central tendency.

Mean
• Advantages: most widely known and used choice; easy to explain.
• Disadvantages: not resistant to outliers.
• Use when: the sample mean is required; the distribution is known to be normal.
• Do not use when: outliers may occur; the distribution is not normal.

Median
• Advantages: easy to explain; easy to determine; resistant to outliers.
• Disadvantages: not as efficient¹ as some alternatives under deviations from normality.
• Use when: the distribution is skewed; the sample median is required; outliers may occur.

Mode
• Advantages: easy to explain; easy to determine.
• Disadvantages: not as efficient as the mean under normality.
• Use when: the most frequently observed value is required; the data are discrete or can be discretized.
• Do not use when: more efficient options are appropriate.

Geometric Mean
• Advantages: appropriate for certain skewed distributions.
• Disadvantages: not as efficient as the mean under normality; not as easy to explain as the first three.
• Use when: the distribution appears lognormal.

Trimmed Mean
• Advantages: resistant to outliers.
• Disadvantages: not as easy to explain as the first three.
• Use when: outliers may occur and estimator efficiency is desired.
• Do not use when: more widely known estimators are adequate.

¹ Higher efficiency means lower standard error.
Table 2.2 — Measures of dispersion.

Standard Deviation
• Advantages: most widely known; routinely calculated by statistics packages.
• Disadvantages: strongly influenced by outliers; not as efficient¹ as some alternatives under even slight deviations from normality.
• Use when: the sample standard deviation is required; the distribution is known to be normal.
• Do not use when: outliers may occur; the sample histogram is even slightly more dispersed than a normal distribution.

Median Absolute Deviation
• Advantages: resistant to outliers; relatively easy to explain.
• Disadvantages: not as efficient as the standard deviation under normality.
• Use when: outliers may occur.

Interquartile Range
• Advantages: resistant to outliers.
• Disadvantages: not as efficient as the standard deviation under normality.
• Use when: outliers may occur.

Range
• Advantages: easy to determine.
• Disadvantages: strongly influenced by outliers; not as efficient as alternatives.
• Use when: the range is required.
• Do not use when: outliers may occur; any of the above options is appropriate.

¹ Higher efficiency means lower standard error.
Table 2.3 — Useful graphic techniques.

Histogram
• Description: bar chart for data on a single (univariate) variable; shows the shape of the empirical distribution.
• Use: visual identification of distribution shape, symmetry, center, dispersion, and outliers.

Stem and Leaf Display
• Description: same as histogram, but presents the numeric values in the display.
• Use: same as histogram.

Box and Whisker Plot
• Description: display of order statistics (extremes, quartiles, and median); may be used to graph the same characteristic (e.g., variable) for several samples (e.g., different sampling sites).
• Use: visual identification of distribution shape, symmetry, center, dispersion, and outliers (single sample); comparison of several samples for symmetry, center, and dispersion.

Bivariate Plot
• Description: scatter plot of data points (variable x versus variable y).
• Use: visual assessment of the strength of a linear relationship between two variables; evidence of patterns, nonlinearity, and bivariate outliers.
    Many books and articles have been written re-
cently concerning the theoretical and empirical evi-
dence in favor of nonparametric methods and robust
and resistant estimators. Books that consider alterna-
tive estimators of center and dispersion (e.g., Huber,
1981; Hampel et al. 1986; Key, 1983; Barnett and
Lewis, 1984; Miller,  1986;  Staudte and Sheather,
1990) build a strong case for more robust estimators
than the mean and variance. Indeed, there is good evi-
dence (Tukey, 1960;  Andrews et al. 1972) that the
mean and variance may be the worst choices among
the common estimators for error-contaminated data.
Several articles that involve  comparisons of estima-
tors on real data (e.g., Stigler, 1977; Rocke et al. 1982;
Hill and Dixon,  1982) also favor robust estimators
over conventional alternatives.
    As a consequence, the median and the trimmed
mean are recommended for the routine calculation of
a data set's central tendency. The interquartile range
and the median absolute deviation are recommended
for calculation of the dispersion. These suggestions
represent a compromise between robustness, ease of
explanation, and calculation simplicity.  For  the
trimmed mean, recommended amounts of trimming
range from 10 percent (Stigler, 1977) to over 20 per-
cent (e.g., Rocke et al. 1982). A critical argument in
support of the trimmed mean is that interval estima-
tion and hypothesis testing are still possible using the
 t statistic (Tukey and McLaughlin, 1963; Dixon and
 Tukey, 1968; Gilbert, 1987).
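The recommended estimators are straightforward to compute. In this sketch (hypothetical IBI scores; the helper functions and the trimming proportion are illustrative choices, not prescriptions of this guidance), a single aberrant score leaves all of the robust estimates essentially unchanged:

```python
import statistics

# Hypothetical IBI scores containing one aberrant value
scores = [34, 38, 36, 41, 35, 39, 37, 12]

def trimmed_mean(x, prop=0.125):
    """Mean after dropping the proportion prop from each tail."""
    x = sorted(x)
    k = int(len(x) * prop)
    return statistics.mean(x[k:len(x) - k])

def interquartile_range(x):
    q1, _, q3 = statistics.quantiles(x, n=4)
    return q3 - q1

def median_abs_deviation(x):
    m = statistics.median(x)
    return statistics.median(abs(v - m) for v in x)

# The aberrant score (12) barely moves any of these robust estimates
print(statistics.median(scores), trimmed_mean(scores),
      interquartile_range(scores), median_abs_deviation(scores))
```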

Uncertainty
In statistics, uncertainty is a measure of confidence.
 That is, uncertainty provides a measure of precision
 — it assigns the value of scientific information in eco-
 logical studies. Scientific uncertainty is present in all
 studies concerning biological criteria, but uncer-
 tainty does not prevent management and decision
 making. Rather, uncertainty provides a basis for se-
 lecting among alternative actions and for deciding
 whether additional information is needed (and if so,
 what experimentation or observation should take place).
     In ecological studies, scientific uncertainty re-
 sults from inadequate scientific knowledge, natural
 variability, measurement error, and sampling error
 (e.g., the standard error of an estimator). In the actual
 analysis, uncertainty arises from erroneous specifica-
 tion of a model or from errors in statistics, parameters,
 initial conditions, inputs for the model, or expert judgment.
     In some situations,  uncertainty in an unknown
 quantity (e.g., a model parameter or a biological end-
 point) may be estimated using a measure of variabil-
 ity. Likewise, in some situations, model error may be
 estimated using a measure of goodness-of-fit (predic-
 tions versus observations) of the model. In many situ-
 ations, a judicious estimate of uncertainty is the only
 option; in these cases, careful estimation is an accept-
 able alternative and methods exist to elicit these judg-
 ments from experts (Morgan and Henrion, 1990).
     In many studies, uncertainty is present in more
 than one component (e.g., parameters and models), so
 the investigator must estimate the combined effects of
 the  uncertainties on the  endpoint.  This  exercise,
 called error propagation, is usually undertaken with
 Monte Carlo simulation  or first-order error analysis.
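Error propagation by Monte Carlo simulation can be sketched as follows, using a purely hypothetical two-parameter model; each iteration draws the uncertain parameters from assumed distributions, and the spread of the simulated endpoints expresses their combined uncertainty:

```python
import random
import statistics

random.seed(1)  # for a reproducible illustration

def simulate(n=10_000, habitat_score=50.0):
    """Propagate parameter uncertainty through a hypothetical linear
    model, endpoint = a * habitat_score + b, by Monte Carlo sampling."""
    endpoints = []
    for _ in range(n):
        a = random.gauss(0.6, 0.05)   # uncertain slope
        b = random.gauss(5.0, 1.0)    # uncertain intercept
        endpoints.append(a * habitat_score + b)
    return endpoints

draws = simulate()
# Summarize the resulting distribution with, e.g., a center and a spread
print(round(statistics.median(draws), 1), round(statistics.stdev(draws), 2))
```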
     The outcome of an uncertainty analysis is a prob-
 ability distribution that reflects uncertainty on the
 endpoint. However, uncertainty analysis may not al-
 ways be the most useful  expression of risk. Other ex-
 pressions of uncertainty, such as prediction intervals,
 confidence intervals, or odds ratios, are easier to un-
 derstand and interpret. If important error terms are ig-
 nored when a  probability statement is made,  the
 investigator must report this omission.  Otherwise,
 the probability  statement is not representative, and
 the uncertainties are underestimated.
     Since uncertainty provides a measure of preci-
 sion or value, it can be  used by decision makers to
 guide management actions. For example, in some
 cases  the uncertainty in a biological impact may be
 too large to justify management changes. As a conse-
quence, managers may defer action until additional
monitoring data can be gathered rather than require
pollutant discharge controls. If the  uncertainty is
large and the estimated costs of additional pollutant
controls quite high, it may be wise either to defer ac-
tion or to look for smaller, relatively less expensive
abatement strategies for an interim period while the
monitoring program continues.
    Though environmental planners at national,
state, and local levels have rarely considered uncer-
tainty in their planning efforts,  their work has been
generally successful over the past 20 years. It is, how-
ever, certainly possible that more effective manage-
ment — that is, less  costly, more beneficial
management — might have occurred if uncertainty
had been explicitly considered.
    If overall uncertainty is ignored, the illusion pre-
vails that scientific information is more precise than it
actually is. As a consequence, we are surprised and
disappointed when biological outcomes are substan-
tially different from predictions. Moreover, if we don't
calculate uncertainty, we have no rational basis for
specifying the magnitude of our sampling program or
the resources (money, time, personnel) that should be
allocated to planning. Thus, decisions on planning
and analysis are more likely based on convention and
whim than on the logical objective of reducing scien-
tific uncertainty.
    Statistical analysis is largely concerned with un-
certainty and variability. Therefore, uncertainty is an
important concept in this guidance manual. The anal-
yses presented  here and in subsequent chapters are
based on particular measures of uncertainty, for ex-
ample, confidence intervals. These measures are "sta-
tistics"; they  reflect  data,  and  are not always
considered in the broader context of uncertainty —
that is, as establishing the uncertainty in a quantity of
interest. We will, however, consider these statistics in
the broader sense, with concern for the theoretical is-
sues raised in this section. Particularly given the
small samples that often occur with biocriteria assess-
 ments, investigators should ask the following ques-
 tions:
     •   Do the data adequately represent the quantity
         of interest?
     •   Are all important sources of uncertainty
         represented in the data?
     •   Should expert scientific judgment be used to
         augment or correct measures of uncertainty?
     •   If components of uncertainty are ignored
         because they are not included in the data,
         are conclusions or decisions affected?
           Biological Criteria: Technical Guidance for Survey Design and Statistical Evaluation of Biosurvey Data

                         CHAPTER 2. Classical Statistical Inference and Uncertainty
     Statistical analysis is not a rote exercise devoid
of judgment.

Statistical Inference
Statistical inference is gained by two primary  ap-
proaches: (1) interval estimation, and (2) hypothesis
testing. Interval estimation concerns the calculation
of a confidence  interval or prediction interval that
bounds the range of likely values for a quantity of in-
terest. The  end  product is typically the  estimated
quantity (e.g., a mean value) plus or minus the upper
and lower interval. The same information is used in
hypothesis testing;  however, in hypothesis testing,
the end product is a decision concerning the truth of a
candidate hypothesis about the magnitude of the
quantity of interest.
     In a particular problem, the choice between us-
ing interval estimation or hypothesis testing generally
depends on the question or issue at hand. For exam-
ple, if a summary of scientific evidence is requested,
confidence intervals are apt to be favored; however, if
a choice or decision is to be made, hypothesis tests are
likely to be preferred.

Interval Estimation
Statistical intervals, whether confidence or predic-
tion, may be based on an assumed probability model
describing the statistic of interest, or they may require
no assumption of a particular underlying probability
model.
     Hahn and Meeker (1991) note that the proper
choice of statistical interval depends on the problem
or issue of concern. As a rule, if the interval  is in-
tended to bound a population parameter (e.g., the true
mean), then the appropriate choice is the confidence
interval. If, however, the interval is to bound a future
member of the population (e.g.,  a forecasted value),
then the appropriate choice is the prediction interval.
Another  statistical interval less frequently used in
ecology is the tolerance interval, which  bounds a
specified proportion of observations.
     In conventional (classical, or frequentist) statis-
tical inference, the statistical interval has a particular
interpretation that is often incorrectly stated in scien-
tific studies. For example, if a 95 percent statistical in-
terval for the mean is 7 ± 2, it is not correct to say that
"there is a 95 percent chance that the true mean lies be-
tween 5 and 9." Rather, it is correct to say that 95 per-
cent of the time this interval is calculated, the true
mean will lie within the computed interval. Although
it sounds awkward and not directly relevant to the is-
sue at hand, this interpretation is the correct meaning
of a classical statistical interval. In truth, once it is cal-
culated, the interval either does or does not contain
the true value. In classical statistics, the inference
from interval estimation refers to the procedure for in-
terval calculation, not to the particular interval that is
calculated.
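This frequentist interpretation can be illustrated by simulation. The sketch below (with an assumed true mean of 7) repeatedly draws samples, computes a 95 percent confidence interval from each, and counts how often the interval covers the true mean:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_mean, sigma, n, trials = 7.0, 3.0, 25, 10_000
t_crit = stats.t.ppf(0.975, df=n - 1)   # two-sided 95% t multiplier

covered = 0
for _ in range(trials):
    sample = rng.normal(true_mean, sigma, n)
    half = t_crit * sample.std(ddof=1) / np.sqrt(n)   # interval half-width
    if abs(sample.mean() - true_mean) <= half:        # interval covers mean?
        covered += 1

# About 95 percent of the computed intervals contain the true mean,
# which is exactly what the classical interpretation asserts.
print(covered / trials)
```

Any single interval from this procedure either does or does not contain 7; the 95 percent statement describes the procedure, not one interval.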

Hypothesis Testing
Biosurveys are used for many purposes, one of which
is to assess impact or effect. Resource managers may
want to assess, for example, the influence of a pollut-
ant discharge or land use change on a particular area.
The effect of the impact can be determined based on
the study of trends over time or by comparing up-
stream and downstream conditions. In  some  in-
stances, the interest is in magnitude of effect, but
concern often focuses simply on the presence or ab-
sence of an effect of a specific magnitude. In such
cases, hypothesis testing is usually the statistical pro-
cedure of choice.
    In conventional statistical analysis, hypothesis
testing for a trend or effect is  often based on a point
null hypothesis. Typically, the point null hypothesis
is that no trend or effect  exists. The position is pre-
sented as a "straw man" (Wonnacott and Wonnacott,
1977) that the scientist expects to reject on the basis of
 evidence.

 Table 2.4 — Possible outcomes from hypothesis testing.

                      Accept H0                  Reject H0
   H0 is true         Correct decision.          Type I error.
                      Probability = 1 - α;       Probability = α;
                      corresponds to the         also called the
                      confidence level.          significance level.
   H0 is false        Type II error.             Correct decision.
   (Ha is true)       Probability = β.           Probability = 1 - β;
                                                 also called power.

     To test this hypothesis, the investigator col-
lects data to provide a sample estimate of the effect
(e.g., change in biotic integrity at a single site over
time). The data are used to provide a sample estimate
of a test statistic, and a table for the test statistic is con-
sulted to estimate how unusual the observed value of
the test statistic is if the null hypothesis is true. If the
observed value of the test statistic is unusual, the null
hypothesis is rejected.
    In a typical application of parametric hypothesis
testing, a hypothesis, H0, called the null hypothesis, is
proposed and then evaluated using a standard statis-
tical procedure like the t  test. Competing with this
null hypothesis for acceptance is the alternative hy-
pothesis, H1. Under this simple scheme, there are four
possible outcomes of the testing procedure: the hy-
pothesis is either true or false, and the test can either
accept or reject it in each case (see Table 2.4).
    The point null hypothesis is a precise hypothesis
that may be symbolically expressed:

                  H0: θ = θ0

where θ is a parameter of interest. An example of a
point null hypothesis in words is, "no change occurs
in mean IBI after the new wastewater treatment plant
goes on line." Symbolically, it is expressed as

                  H0: μ1 - μ2 = 0
                  Ha: μ1 - μ2 ≠ 0

where μ1 is the "before" true mean and μ2 is the "after"
true mean. The test of the null hypothesis proceeds
with the calculation of the sample means, x̄1 and x̄2.
In most cases, the sample means will differ as a con-
sequence of natural variability or measurement error
or both, so a decision must be made concerning how
large this difference must be before it is considered
too large to result from variability or error. In classical
statistics, this  decision is often based  on standard
practice (e.g., a Type I error of 0.05 is acceptable), or
on informal consideration of the consequences of an
incorrect conclusion.
     The result of a hypothesis test can be a conclu-
sion or a decision concerning the rejected hypothesis.
Alternatively, the result can be expressed  as a
"p-value," which quantifies the strength of the data
evidence in favor of the null hypothesis. The p-value
is defined as the probability that "the sample value
would be as large as the value actually observed, if H0
is true" (Wonnacott and Wonnacott, 1977). In effect,
the p-value provides a measure of how likely a partic-
ular value is, assuming that the null hypothesis is
true. Thus, the smaller the p-value, the less likely that
the sample supports H0. This is useful information; it
suggests that p-values should always be reported to
allow the reader to decide the strength of the evidence.
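A p-value can be illustrated by simulation. The sketch below uses made-up numbers (a hypothetical null mean IBI of 30, sd of 5, n of 10, and an observed sample mean of 33) to estimate the probability of a sample mean at least as large as the one observed, given that H0 is true:

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative setup: under H0 the true mean IBI is 30 (sd = 5, n = 10),
# and the observed sample mean is 33. All numbers are hypothetical.
null_mean, sigma, n = 30.0, 5.0, 10
observed_mean = 33.0

# Simulate many sample means under H0 and find the fraction at least as
# large as the observed value -- a Monte Carlo analogue of the p-value.
sim_means = rng.normal(null_mean, sigma, size=(100_000, n)).mean(axis=1)
p_value = float((sim_means >= observed_mean).mean())

print(p_value)   # near 0.03: a mean of 33 is fairly unusual if H0 is true
```

A small value indicates that data like these would rarely arise if H0 were true, which is precisely the evidence a reported p-value conveys.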

Common Assumptions
Virtually all statistical procedures  and tests require
the validity of one or more assumptions. These as-
sumptions concern either the underlying population
being sampled or the distribution  for a test statistic.
Since the failure of an assumption can have a substan-
tial effect on a statistical  test, the  common assump-
tions of  normality,  equality of variances, and
independence are discussed in this section. We must
ask, for example, to what extent can an assumption be
violated  without serious consequences? Or  how
should assumption violations be addressed?
•  Normality. A common assumption of many para-
metric statistical tests is that samples are drawn from
a normal distribution. Alternatively, it may be as-
sumed that the statistic of interest (e.g., a mean) is de-
scribed by a normal sampling distribution. In either
case,  the key distinction between parametric and
nonparametric (or distribution-free) statistical tests is
that a probability model (often normal) is assumed.
    Empirical evidence (e.g., Box et al. 1978) indi-
cates that the significance level but not the power is
robust or not greatly affected by mild violations of the
normality assumption for statistical tests concerned
with the mean. This finding suggests that a test result
indicating "statistical significance" is reliable, but a
"nonsignificant" result may be the result of a lack of
robustness to nonnormality. The normality of a sam-
ple can be checked using a normal probability plot,
chi square test, Kolmogorov-Smirnov test, or by test-
ing for skewness or kurtosis; however, many biologi-
cal surveys  are not designed to produce  enough
samples to make these tests definitive.
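Two of the checks listed above can be sketched with scipy on synthetic data; the probability-plot correlation r and the Kolmogorov-Smirnov result are printed (note that estimating the mean and standard deviation from the sample makes the K-S test approximate):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
sample = rng.normal(loc=50, scale=10, size=30)   # synthetic biosurvey metric

# Normal probability plot fit: r close to 1 suggests approximate normality.
(osm, osr), (slope, intercept, r) = stats.probplot(sample, dist="norm")
print(round(r, 3))

# Kolmogorov-Smirnov test against a normal distribution with the sample's
# own mean and sd (parameter estimation makes this test approximate).
d, p = stats.kstest(sample, "norm", args=(sample.mean(), sample.std(ddof=1)))
print(d, p)   # small d with large p is consistent with normality
```

With only 30 observations, as the text notes, such checks are suggestive rather than definitive.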
    Normality of the sampling distribution for a test
statistic is important because it provides a probability
model for interval estimation and hypothesis tests. In
some  cases, transformation of the data may help the
investigator achieve approximate normality (or sym-
metry) in a sample, if normality is required. Since
nonnegative concentration data cannot be truly nor-
mal, and since empirical evidence suggests that envi-
ronmental contaminant data may be described with a
lognormal distribution, the logarithmic transforma-
tion is a good first choice. Therefore, in the absence of
contrary evidence, we recommend that concentration
data be log-transformed prior to analysis.
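The effect of this transformation can be sketched with synthetic lognormal "concentration" data (parameters are illustrative), showing that skewness drops sharply after taking logarithms:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Synthetic nonnegative concentration data drawn from a lognormal
# distribution, as is often observed for environmental contaminants.
conc = rng.lognormal(mean=1.0, sigma=0.8, size=200)

raw_skew = float(stats.skew(conc))          # strong right skew
log_skew = float(stats.skew(np.log(conc)))  # close to zero after transform

print(round(raw_skew, 2), round(log_skew, 2))
```

The log-transformed values can then be checked with the normality diagnostics described earlier before a parametric analysis proceeds.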

•  Equality  of Variance.  A second common as-
sumption is that when two or more distributions are
involved in  a test, the  variances will  be constant
across distributions. Many tests are also robust to
mild violations of this assumption, particularly if the
sample sizes are nearly identical. To test this assump-
tion, an F test (usually a two-tailed one) can be per-
formed; see Snedecor and Cochran (1967) for an
example, and Miller (1986) for interpretive results.
Conover (1980) provides an alternative, namely,
nonparametric tests of equality of variances. Note
that if two means are being compared based on sam-
ples with vastly different variances, the differences of
interest may be more fundamental than the difference
between the means.

•  Independence. The assumption of greatest gen-
eral concern  is independence. Most statistical tests
(parametric and nonparametric) require  a  random
sample, or a sample composed of independent obser-
vations. Dependency between or among observations
in a data set means that each observation contains
some information already conveyed in other observa-
tions. Thus, there is less new independent informa-
tion in a dependent data set than in an independent
data set of the same sample size. Because statistical
procedures are often not robust to violation of the in-
dependence assumption, adjustments are generally
recommended to address anticipated problems.
     Dependence in a sample can result from spatial
or temporal patterns,  that  is, from persistence
through time and space. In most types of analyses, the
assumption of independence refers to independence
in the disturbances (errors). For example, in a time se-
ries with temporal trend and seasonal pattern, de-
pendence or autocorrelation in the raw data series
may exist because of a  deterministic feature of the
data (e.g., the time trend or seasonal pattern).
    This type of autocorrelation poses no difficulty;
it is addressed by modeling the deterministic features
of the data  and subtracting the modeled component
from the original series. Of particular concern in test-
ing for trend is autocorrelation that remains after all
deterministic features are removed (i.e., autocorrela-
tion in the disturbances). When this situation arises,
an adjustment to the trend test is necessary. Reckhow
et al. (1993) provide guidance and software.
    A similar situation can occur in the estimation of
a regression slope or a central tendency statistic such
as the mean or trimmed mean. In such cases, the inde-
pendence assumption refers  to the errors, as esti-
mated by the residuals, around the regression line or
the mean. If persistence or dependence is found in the
residuals, then the independence assumption is vio-
lated and corrective action is needed. Options to ad-
dress this problem include using an effective sample
size (Reckhow and Chapra, 1983), or generalized
least squares for regression (see Kmenta [1986] or any
standard econometrics regression text).
    If the investigator finds positive autocorrelation
in the disturbances (i.e., if each disturbance is posi-
tively correlated with nearby disturbances in the se-
ries), confidence interval estimates will be too narrow
and may lead to rejection of the  null hypothesis.
Autocorrelation in the disturbances is the most com-
mon and potentially  the most troublesome of the
causes of assumption violations.
    The degree of autocorrelation is a function of the
frequency of sampling; that is, a data set based on an
irregular sampling frequency cannot be characterized
by a single, fixed value for autocorrelation. For biolog-
ical time series, stream data obtained more frequently
than monthly may be expected to be autocorrelated
(after trends and seasonal  cycles are removed).
Stream survey data based on less frequent sampling
are less likely to exhibit sample autocorrelation esti-
mates of significance.
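As an illustration, the sketch below simulates a monthly series with AR(1) persistence, estimates the lag-1 autocorrelation, and applies one common AR(1)-based form of the effective-sample-size adjustment, n(1 - r1)/(1 + r1). The simulation parameters are illustrative, and this particular adjustment formula is an assumption here, not a quotation of Reckhow and Chapra (1983):

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulate 10 years of monthly observations with AR(1) persistence
# (phi = 0.6 is illustrative only).
n, phi = 120, 0.6
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal()

# Lag-1 sample autocorrelation of the centered series.
xc = x - x.mean()
r1 = float((xc[:-1] * xc[1:]).sum() / (xc * xc).sum())

# AR(1)-based adjustment: positive autocorrelation reduces the effective
# number of independent observations below the nominal n.
n_eff = n * (1 - r1) / (1 + r1)

print(round(r1, 2))      # positive, reflecting the built-in persistence
print(round(n_eff, 1))   # well below the nominal n = 120
```

Tests and interval estimates based on n_eff rather than n widen appropriately when persistence is present.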

Parametric Methods — the t Test
Parametric approaches involve a model (e.g., regres-
sion slope) for any deterministic features and a proba-
bility model for the errors.  In some cases,  the
deterministic model will be a linear, curvilinear, or
step function, while the model for the errors is typi-
cally a normal probability distribution with  inde-
pendent, identically distributed errors. In other cases,
the deterministic model may simply be a constant (as
it is when interest focuses on an "upstream/down-
stream" comparison between two sites), though the
probability model may in all cases be a normal proba-
bility distribution. The t test is a typical parametric
test.

Using the t test
A Student's t statistic:

                t = (x̄ - μ) / (s / √n)                  (2.1a)

has a Student's t distribution (n - 1 degrees of freedom);
here, x̄ is the mean of a random sample from a nor-
mal distribution with true mean μ and constant vari-
ance, s is the sample standard deviation, and n is the
sample size. In addition, for two samples:

                t = (x̄1 - x̄2) / √(s1²/n1 + s2²/n2)      (2.1b)

also has a Student's t distribution (n1 + n2 - 2 degrees of
freedom); here, x̄1 and x̄2 are the sample means; s1 and
s2 are the sample standard deviations; and n1 and n2
are the sample sizes. This distribution is widely tabu-
lated, and it is commonly used in hypothesis testing
and confidence interval estimation for a sample mean
(one-sample test; Equation 2.1a) or a comparison of
sample means (two-sample test; Equation 2.1b).
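The tabulated values referred to above can also be obtained from statistical software; for example, a short sketch using scipy:

```python
from scipy import stats

# One-sided critical value of Student's t at the 0.05 significance level
# with 18 degrees of freedom (n1 + n2 - 2 = 18 for two samples of 10).
t_crit = stats.t.ppf(0.95, df=18)
print(round(t_crit, 2))   # 1.73, matching the tabulated value

# Two-sided 95% multiplier for a one-sample interval with n = 10 (df = 9).
print(round(stats.t.ppf(0.975, df=9), 3))   # 2.262
```

The same `stats.t` object also supplies cumulative probabilities (`cdf`) for converting an observed t statistic into a p-value.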
    When Student's t distribution is used in a hy-
pothesis test (a t test), it is assumed that samples are
drawn from a normal distribution, the variances are
constant across distributions, and the observations
are independent. Of these assumptions, Box et al.
(1978) have shown that the t test is robust to limited
violations of the first two (normality and equality of
variances); however, problems will occur if the obser-
vations are dependent. The scientist should probably
be concerned about the first two assumptions only in
situations in which the two data sets have substan-
tially different variances and substantially different
sample sizes (see Snedecor and Cochran [1967] for
F test calculations to compare variances).
    An attractive variation of the t statistic for use in
situations where outliers are of concern was proposed
by Yuen and Dixon (1973; see also Miller, 1986; and
Staudte and Sheather,  1990). They created an out-
lier-resistant, or robust, version of the t  statistic
(Equations 2.1a and 2.1b) using a trimmed mean and a
Winsorized standard deviation. For example, if a t sta-
tistic is used to compare the means of two popula-
tions, the robust (trimmed t) version is

        t = (x̄t1 - x̄t2) / [((sw1 + sw2)/2) √(1/n1 + 1/n2)]        (2.2)

where   x̄ti = trimmed mean for sample i
        swi = Winsorized standard deviation for sample i
        ni  = number of observations in sample i
    A Winsorized statistic is similar to a trimmed sta-
tistic. For trimming, observations are ordered from
lowest to highest, and the k-lowest and k-highest are
removed from the sample for the calculation of the
k-trimmed statistic (e.g., trimmed mean). For
k-Winsorizing, observations are ordered from lowest
to highest, and the k-lowest and k-highest are not re-
moved, but are reassigned the values of the lowest ob-
servation and the highest observation remaining in
the trimmed sample. The following example illus-
trates this.

   A sample of 10 IBI values is obtained for analysis:
       29, 31, 26, 25, 34, 38, 33, 31, 28, 37

   And ordered from lowest to highest:
       25, 26, 28, 29, 31, 31, 33, 34, 37, 38.

   The 10 percent-trimmed sample is
       26, 28, 29, 31, 31, 33, 34, 37.

   The 10 percent-Winsorized sample is
       26, 26, 28, 29, 31, 31, 33, 34, 37, 37.

    If we were to calculate the 10 percent-trimmed t
statistic in Equation 2.2 for this IBI sample, we would
use: (1) the  trimmed sample (eight observations) to
calculate a mean, and (2) the Winsorized sample (10
observations) to calculate a standard deviation. For
the two-sample comparison of means, the trimmed t
statistic has  (l-2k)(n1+n2)-2 degrees of freedom or, in
the above example, 7 degrees of freedom (df). The
trimmed t statistic is an attractive option that should
be considered whenever outliers  are a concern.
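The trimming and Winsorizing steps above can be sketched with scipy, using the ordered 10-value IBI sample from the example:

```python
import numpy as np
from scipy import stats
from scipy.stats import mstats

# Ordered IBI sample from the example above.
ibi = np.array([25, 26, 28, 29, 31, 31, 33, 34, 37, 38])

# 10 percent-trimmed mean: the lowest and highest values are dropped
# before averaging the eight retained observations.
t_mean = stats.trim_mean(ibi, proportiontocut=0.10)

# 10 percent-Winsorized sample: the extremes are replaced by the nearest
# remaining values, and the standard deviation uses all 10 values.
w = mstats.winsorize(ibi, limits=0.10)
w_sd = float(np.std(w, ddof=1))

print(t_mean)          # 31.125
print(list(w))         # [26, 26, 28, 29, 31, 31, 33, 34, 37, 37]
print(round(w_sd, 2))  # about 4.05
```

These two quantities are exactly the ingredients of the trimmed t statistic in Equation 2.2.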
    The parametric approach is appropriate and ad-
vantageous if the deterministic model is a reasonable
characterization of reality and if the model for errors
holds. In such cases, parametric tests should be more
powerful than nonparametric or distribution-free al-
ternatives. Thus, the assumption that deterministic
and probability models are  correct is the basis on
which the superior performance of parametric meth-
ods rests. If the assumptions concerning these models
are incorrect, then the results of the parametric tests
may be invalid and distribution-free procedures may
be more appropriate.

Nonparametric Tests — the W Test
Distribution-free methods, as the name suggests, do
not require an assumption concerning the particular
form of the underlying probability model for the data
generation process. An assumption of independence
is, however, usually made; therefore, autocorrelation
can be as serious a problem in nonparametric meth-
ods as it is for parametric and robust methods. Distri-
bution-free tests are often based on rank (or order); the
sample  observations are arranged from lowest to
highest. The Wilcoxon-Mann-Whitney test or W test is
a typical distribution-free test.

Using the W test
The W test is a two-sample hypothesis test, designed
to test the hypothesis that two random samples are
drawn from identical continuous distributions with
the same center (alternative hypothesis: one distribu-
tion is offset from the other, but otherwise identical).
This test is often presented  as an option  to the
two-sample t test that should be considered if the as-
sumption of normality is believed to be seriously in
error. The W test has its own statistic, which is tabu-
lated in most elementary statistics textbooks (i.e.,
those with a chapter on nonparametric methods).
However, for moderate to large sample sizes (e.g., n >
15), the statistic is approximately normal under the
null hypothesis, so the standard normal table can be
used.
    The scientist should consider the W test for any
situation in which the two-sample t test may be used.
Comparative studies of these two tests indicate that
while the t test is robust to violations of the normality
assumption, the  W test is relatively powerful while
not requiring normality. Situations that appear se-
verely nonnormal might favor the W test; otherwise
the t test may be selected. Some  statisticians (e.g.,
Blalock, 1972) recommend that both tests be con-
ducted as a double check on the hypothesis.
    Unfortunately, violation of the independence as-
sumption appears to be as serious for the W test as for
the t test. If these tests are to be meaningful, the scien-
tist must confirm independence or make other adjust-
ments as noted in Reckhow et al. (1993).
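A sketch of the test using scipy, which implements the equivalent Mann-Whitney U form (U equals the rank sum W minus nA(nA + 1)/2); the two small samples are made-up illustrative values, not data from this guidance:

```python
from scipy import stats

# Hypothetical scores for two sites (illustrative values only).
upstream = [8, 9, 11, 14, 15]
downstream = [3, 4, 5, 7, 10]

# One-sided test: is the downstream distribution shifted below upstream?
res = stats.mannwhitneyu(downstream, upstream, alternative='less',
                         method='exact')
print(res.statistic, res.pvalue)   # U = 2 with a small exact p-value
```

With samples this small, the exact distribution of the statistic is used rather than the normal approximation described below for n > 15.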
    In essence, the W test is used to determine if the
two distributions under study have the same central
tendency, or if one distribution is offset  from the
other. To conduct the W test, the data points from the
samples are combined, while maintaining  the sepa-
rate sample identity. This overall data set is ordered
from low value to high value, and ranks are assigned
according to this ordering.
    To test the null hypothesis of no difference be-
tween the two distributions (f[x] and g[x]), the ranks,
Ri, for the data points in one of the two samples are
summed:

                W = Σ Ri                    (2.3)
    The ranks should be specified as  follows
(Wonnacott  and Wonnacott,  1977):  Start ordering
(low to high, or high to low) from the end (high or low)
at which the observations from the smaller sample
tend to be greater in number, and sum the ranks to es-
timate W from this smaller sample. This estimate
keeps W small as it is reported in most tables. For ei-
ther one-sided or two-sided tests, if ties occur in the
ranks, then all tied observations should be assigned
the same average rank.
    Statistical significance is a function of the degree
to which, under the null hypothesis, the ranks occu-
pied by either data set differ from the ranks expected
as a result of random variation. For small samples, the
W statistic calculated in Equation 2.3 can be com-
pared to tabulated values to determine its signifi-
cance (see Hollander and Wolfe, 1973). For moderate
to large samples (where total n from both samples >
15), W is approximately normal (if the null hypothesis
is true).  Therefore, the  W statistic may be evaluated
using a standard normal table with mean (E[W]) and
variance (Var[W]):

        E(W) = nA(nA + nB + 1) / 2

        Var(W) = nA nB (nA + nB + 1) / 12

    If there are ties in the data, then the variance may
be calculated as

  Var(W) = (nA nB / 12)[nA + nB + 1 - Σj tj(tj² - 1) / ((nA + nB)(nA + nB - 1))]

where tj is the size (number of data points with the
same value) of tied group j. The effect of ties is negli-
gible unless there are several large groups (tj > 3) in
the data set.
    These statistics are used to create the standard
normal deviate:

        z = (W - E(W)) / √Var(W)

where   nA, nB = the number of observations in
        samples A and B (nA ≤ nB).

Example — an IBI case study
IBI data have been obtained from upstream and
downstream  sites surrounding a  wastewater dis-
charge. Assume independence.
     (a) Test the null hypothesis that the true differ-
ence between the upstream and  downstream IBI
means is zero, versus the alternative hypothesis that
the downstream IBI mean is lower than the upstream
IBI mean.

                  H0: μu - μd = 0
                  Ha: μu - μd > 0

    First, some basic statistics for each sample:

        Upstream:    x̄u = 39.8,  su = 7.57,  n = 10
        Downstream:  x̄d = 34.5,  sd = 8.09,  n = 10

    For a comparison of two means based on equal
sample sizes, the t statistic is

    t = (x̄u - x̄d) / (sp √(2/n)) = (39.8 - 34.5) / (7.83 √0.2) = 5.3 / 3.5 = 1.51

where sp = √((su² + sd²)/2) = 7.83 is the pooled standard
deviation.

    At the 0.05 significance level, the one-tailed t sta-
tistic for 18 degrees of freedom is 1.73. Since 1.51 <
1.73, we cannot reject the null hypothesis (at the 0.05
level).
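Part (a) can be reproduced from its summary statistics with scipy; the standard deviations 7.57 and 8.09 are as read from the example, and the two-sided p-value is halved for the one-tailed test:

```python
from scipy import stats

# Summary statistics from the IBI example (upstream vs. downstream).
res = stats.ttest_ind_from_stats(mean1=39.8, std1=7.57, nobs1=10,
                                 mean2=34.5, std2=8.09, nobs2=10,
                                 equal_var=True)

p_one_sided = res.pvalue / 2   # one-tailed p for the directional alternative
print(round(res.statistic, 2))   # 1.51, matching the hand calculation
print(p_one_sided)               # above 0.05, so H0 is not rejected
```

Reporting this one-sided p-value, rather than only the accept/reject decision, lets the reader judge how close the result is to significance.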

    (b) Test the null hypothesis (see part a) using the
10 percent trimmed t (10 percent trimmed from each
end of each sample):

    tt = (x̄t1 - x̄t2) / [((sw1 + sw2)/2) √(2/n)] = (40.5 - 35.5) / (6.11 √0.2)
       = 5.0 / 2.73 = 1.83

where the trimmed means are 40.5 and 35.5, and the
Winsorized standard deviations are sw1 = 5.83 and
sw2 = 6.39.

    At the 0.05 significance level, the one-tailed t sta-
tistic for 14 degrees of freedom is 1.76. Since 1.83 >
1.76, we reject the null hypothesis (at the 0.05 level).

    (c) Test the null hypothesis (see part a) using the
W test.

    [Table: combined upstream and downstream IBI
values with assigned ranks.]
    Here the separate samples have been combined
for the purpose of rank ordering. The Wtest statistic
can then be calculated from the ranks:
         W =
                 = 1+2+ 3.5+ 3+ 5+10+ 11+ 14 + 15+19= 84

 = (nBnA / 12)[nB + nA
                            =10(10+10+ 1)/ 2=105
                                      + nA)(nB .+ nA
In hypothesis testing, the conclusion to not reject H0
(in effect, to accept H0)  should not be  evaluated
strictly on the basis of a, the probability ofrejectingH0
when it is true (Type I error; see Table 2.4). Instead, we
must be concerned with (3, the probability of accept-
ing H0 when it is false (Type II error). Unfortunately, P
does not have a single value, but is dependent on the
true (but unknown) value of the difference between
population means and on the sample size, n. For a par-
ticular testing procedure and sample size, we can de-
termine and plot a  relationship between the  true
difference between means and p. This plot is called
the operating characteristic curve.
     To  understand  the  issues concerning signifi-
cance and power (a  and 1-p), consider  the null hy-
pothesis in the IBI case study:
 = [(io)(io) / i2][io +10 +1 - {(2)(3) + (2)(3) + (2)(3)} / (10 + io)(io +10 -1)] = i74.6i    HO: The population mean IBI at the upstream site is
                                                             the same as the population  mean IBI at the
                                                             downstream site.
         (W-E(W)  84-105
    At the 0.05 significance level, the one-tailed z
statistic is 1.65. Since 1.59 < 1.65, we cannot reject the
null hypothesis (at the 0.05 level.
    A glance at the IBI values and ranks in this exam-
ple indicates a difference between the two samples
(box plots and histograms would provide further sup-
porting evidence). At issue is whether this difference
in the sample is a chance occurrence or an indication
of a true difference between the sites. If we adopt the
conventional 0.05 level for hypothesis testing, then
the conclusions from the three tests are ambiguous.
Still, we can say the following about  both the site
comparisons and the methods:
    (i) The downstream site is slightly impacted.
Even though only one of the three test results yielded
significance (at the 0.05 level), all three were  close,
suggesting a slight difference between the sites.
    (ii) For each site, the lowest IBI value (25 for up-
stream, and 18 for downstream) is influential, partic-
ularly on the standard deviation. As a consequence,
for the conventional t test, the  denominator in the t
statistic is inflated and rejection of the null hypothe-
sis is less likely. Note that the lowest IBI value for the
upstream site (IBI  = 25) also affects  the  distribu-
tion-free Wtest. This IBI value holds a high rank (19)
for the upstream sample, and substantially affects the
test result. If that single IBI value had been 27 instead
of 25, we would have rejected the null  hypothesis at
the 0.05 level.
    (iii) The trimmed t is resistant to unusual obser-
vations or outliers, and thus provides the best  single
indicator of difference between the sites as conveyed
by the bulk of the data from each site.
                                                 In addition,  because of the wastewater dis-
                                            charge, consider the general alternative hypothesis:

                                               HA: The population mean IBI at the upstream site is
                                                    higher than  the population mean IBI at the
                                                    downstream site.

                                                 If we adopt a  =  0.05 (the probability of rejecting
                                            H0 when it is true; Type I error) as our significance
                                            level, then Figure 2.la displays the sampling distribu-
                                            tion for the mean  under H0 with 18 degrees of free-
                                            dom.  The horizontal axis in Figure  2.1  is the
                                            "difference between the means";  thus, the sampling
                                            distribution is centered at zero in Figure 2.la (consis-
                                            tent with zero difference between means under H0).
                                            The 0.05-significant tail area (the "rejection region")
                                            begins at 6.06, which means that the sample differ-
                                            ence must be greater than or equal to 6.06 for us to re-
                                            ject H0. Since the difference between the means in our
                                            sample IBI was only 5.3, we are inclined to accept the
                                            null hypothesis, based on the conventional t test.
                                                 (Note: to find the beginning of the tail area, multiply
                                             the t statistic by the standard error. In this example,
                                             the t statistic is 1.73 [one-sided, 0.05 level, 18
                                            degrees of freedom], and the standard error is 3.5.
                                            Thus, the tail area begins at [1.73][3.5] = 6.06.)
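The arithmetic in this note can be reproduced directly; the following is a minimal sketch using scipy's t distribution (the library choice is ours, not part of the guidance):

```python
from scipy.stats import t

# critical t value: one-sided test, 0.05 level, 18 degrees of freedom
t_crit = t.ppf(1 - 0.05, df=18)     # about 1.73
se = 3.5                            # standard error from the example
threshold = t_crit * se             # about 6.06; the rejection region begins here
```

Since the observed difference of 5.3 falls below this threshold, the conventional t test does not reject H0.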
                                                 Now suppose that the following alternative hy-
                                             pothesis, H1, is actually true for the sample IBI case:

                                                H1: The population mean IBI at the upstream site is
                                                     higher by 5.0 than the population mean IBI at
                                                     the downstream site.

                                                 In addition, suppose that while H1 actually is
                                             true, we propose a hypothesis test for H0 based on the
                                             acceptance region in Figure 2.1a (i.e., accept H0 if the
    [Figure 2.1a and b.—Sampling distributions under different hypotheses. Panel (a): if H0 is true,
with α = 0.05 (5 percent tail area); panel (b): if H1 is true.]
difference  between  the  means is  less than  6.06),
which is exactly what occurred in our example. As we
noted above, consideration of H0 alone (Figure 2.la)
leads us to accept the null hypothesis.
     Yet, with H1 actually true (see Fig. 2.1b), if we
propose a hypothesis test for H0 based on the accep-
tance region in Figure 2.1a, there is a 62 percent
chance that we will accept H0 when it is actually false,
according to Figure 2.1b (given the sample size in the
example). This high likelihood of Type II error (see Ta-
ble 2.4) underscores the danger of concluding the hy-
pothesis test with acceptance of the null hypothesis.
The power of this particular test is 1 − β, or a 38 percent
chance of detecting an IBI change of 5. Note that the
specific alternative hypothesis H1 is one example of
an unlimited number of possibilities associated with
the general alternative hypothesis HA. Associated
with H1, β = 0.62 is one point on the power curve for
this test and sample size. To properly determine the
power of a test, we need to calculate β for a range of
specific alternative hypotheses.
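The Type II error for this specific alternative can be computed the same way; a sketch using scipy with the example's numbers (the library choice is ours):

```python
from scipy.stats import t

se, df = 3.5, 18
cutoff = t.ppf(0.95, df) * se                # acceptance region ends near 6.06
true_diff = 5.0                              # the specific alternative H1
beta = t.cdf((cutoff - true_diff) / se, df)  # P(accept H0 | H1 true), about 0.62
power = 1 - beta                             # about 0.38
```

Repeating the beta calculation over a range of values of true_diff traces out the power curve for this test and sample size.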
     A second issue of concern in hypothesis testing
is the problem of multiple simultaneous hypothesis
testing,  or "multiplicity" (Mosteller and Tukey, 1977).
The classical interpretation of the 0.05 significance
level (for α) associated with a hypothesis test is that
95 percent of the time this testing procedure is ap-
plied, the conclusion to accept the null hypothesis
will not be in error if the null hypothesis is true. That
is, on the average, one in 20 tests under these condi-
tions will  result in Type I errors.
    The problem of multiplicity arises when an in-
vestigator conducts several tests of a similar nature on
a set of data. If all but a few of the tests yield statisti-
cally insignificant results, the scientist should not ig-
nore this in favor of those that are significant. The
error of multiplicity results when one ignores the ma-
jority of the test results and cites only those that are
apparently statistically significant. As Mosteller and
Tukey (1977) note, the multiplicity  error is techni-
cally the incorrect assignment of an a-level. When
multiple tests of a similar nature are run on a set of
data, a collective α should be used, associated with si-
multaneous test results. This tactic is typically re-
ferred to as the Bonferroni correction for multiple comparisons.
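A minimal sketch of the correction (the p-values are hypothetical): each individual test is run at α/m so that the family-wide error rate stays near α:

```python
# p-values from m similar tests on the same data set (hypothetical)
p_values = [0.003, 0.020, 0.041, 0.130, 0.510]
m = len(p_values)
alpha = 0.05
alpha_bonf = alpha / m                       # per-test significance level (0.01)
significant = [p for p in p_values if p < alpha_bonf]
# only the smallest p-value survives the corrected threshold
```

Note that two of the tests would have appeared significant at the uncorrected 0.05 level, illustrating how uncorrected multiple testing inflates the chance of a Type I error.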
    The following comments from  Wonnacott and
Wonnacott (1972, pp. 201-202) summarize our atti-
tude toward hypothesis testing:

    We conclude that although statistical theory
    provides a rationale for rejecting H0, it pro-
    vides no formal rationale for accepting H0. The
    null hypothesis may sometimes be uninterest-
    ing, and one that we neither believe nor wish to
    establish; it is selected because of its simplic-
    ity. In such cases, it is the alternative H1 that
    we are trying to establish, and we prove H1 by
    rejecting H0. We can see now why statistics is
    sometimes called "the science of disproof." H0
    cannot  be  proved, and H1 is proved by dis-
    proving (rejecting) H0. It  follows that if we
    wish to  prove some proposition, we will often
    call it H1 and set up the contrary hypothesis H0
    as the "straw man" we hope to destroy. And of
    course if H0 is only such a straw man,  then it
    becomes absurd to accept it in the face of a
    small sample result that really supports H1.

    Since there are great dangers in accepting H0,
    the decision instead should often be simply to
    "not reject H0," i.e., reserve judgment. This
    means that type II error in its worst form may
    be avoided; but it also means you may be leav-
    ing the scene of the evidence with nothing in
    hand. It is for this reason that either the con-
    struction of a confidence interval or the calcu-
    lation of a prob-value is preferred, since either
    provides a summary of the information pro-
    vided by the sample, useful to sharpen up
    your knowledge of what the underlying popu-
    lation is really like.
                                                If, on the other hand, a simple accept-or-reject
                                                hypothesis test is desired, then we must look
                                                to a far more sophisticated technique. Spe-
                                                cifically, we must explicitly take account not
                                                only of the sample data used in any standard
                                                hypothesis test (along  with the adequacy of
                                                the sample size), but also:

                                                1. Prior belief. How much confidence do we
                                                have in the engineering department that has
                                                assured us that the new process is better? Is
                                                 their vote divided? Have they ever been wrong?

                                                2. Loss involved in making a wrong decision. If
                                                we make a type I error (i.e., decide to reject the
                                                old process in favor of the new, even though
                                                the old is as good), what will be the costs of re-
                                                tooling, etc.?

                                                These  comments amount  to an advocacy of
                                            Bayesian decision theory. While it may be difficult to
                                            interpret a biosurvey in decision analysis terms, prior
                                            information and loss functions should, at a mini-
                                            mum, be considered in an informal manner. It is good
                                            engineering and planning practice to make use of all
relevant information in inference and decision making.

CHAPTER 3. Designing the Sample Survey
Critical Aspects of Survey Design
     Variability
     Representativeness and Sampling Techniques
     Cause and Effect
     Controls
Key Elements
     Pilot Studies
     Location of Sampling Points
     Location of Control Sites
     Estimation of Sample Size
Important Rules

CHAPTER 3. Designing the Sample Survey
     The design of the sample survey is a critical ele-
     ment in the environmental assessment process,
and certain statistical methods are associated with
specific designs. This chapter examines various types
of survey design and shows how the selection of the
design relates to the interpretation and use of data
within the biocriteria program. For information on de-
signs not covered in this chapter, see Cochran, 1963;
Cochran and Cox, 1957; Green, 1979; Williams, 1978;
and Reckhow and Stow, 1990.
    Efforts to design sample surveys frequently re-
sult in situations that force the investigator to evalu-
ate the trade-offs between an increase in uncertainty
and the costs of reducing this uncertainty (Reckhow
and Chapra, 1983). But major components of uncer-
tainty, including variability, error, and bias in biologi-
cal  and  statistical sources, can  sometimes be
controlled by a well-specified survey design.
    For example, variability can be caused by natural
fluctuations in biological indicators over space and
time; error can be associated  with inaccurate data ac-
quisition or reduction; and bias can occur when the
sample is not representative  of the population under
review or when the samples are not randomly col-
lected. These sources of uncertainty should be evalu-
ated before the sampling design is selected because
the best design will minimize the effects of variability,
error,  and bias on decision making.

Critical Aspects of Survey Design
Data collection within the  biocriteria program re-
quires the investigator to address issues associated
with both classical and experimental survey designs.
In general, experimental survey design focuses on the
collection of data that leads to the testing of a specific
hypothesis. Classical survey  design is motivated less
by hypothesis testing than by the "survey" concept.
That is, the investigator gathers a relatively small
amount of data, the sample, and extrapolates from it a
view of the totality of available information.
    In this chapter, we will address issues that over-
lap these design types. In addition, we will focus on
designs appropriate to local, site-specific situations.
For  larger geographic survey designs, see Hunsaker
and Carpenter (1990), or Linthurst et al. (1986).
Variability
A critical aspect of sampling design is to identify and
separate components of variability, including the im-
portant ones of time, space, and random errors. Yearly
and seasonal  variations and spatial variations like
those caused by changes in geographic  patterns
should be accounted for in the survey design. A de-
sign that stratifies the sampling based on knowledge
of spatial and temporal changes in the abundance and
character of biological indicators is preferred to sys-
tematic random sampling. That is, if biological indi-
cators are known to exhibit temporal and spatial
patterns, then sampling locations and times must be
adjusted to  match the biological variability.

Representativeness and Sampling Techniques
The object of a biological survey design is to reduce
the total information available to a small sample: ob-
servations are made and data collected on a relatively
small number of biological variables. Representative-
ness is, therefore, a key  consideration in the design of
sample  collection procedures. The  data generated
during the  survey should be representative of the
population  or process under evaluation. Biased sam-
ples occur when the data are not representative of the
population. For example, a sample mean may be low
(biased) because the investigator failed to sample geo-
graphic areas of high abundance.
    Several techniques can increase the odds of col-
lecting a representative sample; however, the  tech-
nique most frequently used is random sampling.
Theoretically, in simple random sampling, every unit
in the population has the same  chance of being in-
cluded in the sample. Random sampling is a physical
way to introduce independence among environmen-
tal measurements. In addition, random sampling has
the effect of minimizing various types of bias in the
interpretation of results.
    If the  geographic  area sampled is large,  with
known or suspected  environmental patterns, a good
technique is to divide the area into relatively homogeneous
sections and randomly sample within each one.
This technique is known as stratified sampling.  Sam-
ples can be  allocated to each section in  proportion to
the size of the area or to the known abundance  of or-
ganisms within each  area. In still other cases, system-
atic  sampling may be  appropriate.  Systematic
sampling improves precision in the sample estimates,
especially  when known spatial  patterns  exist
(Cochran, 1963). Randomly allocated replicate sam-
ples collected on a grid allow for good spatial coverage
of patchy environments, yet minimize the potential
for sampling bias.

Cause and Effect
In classical statistical experiments, a population is
identified and randomly divided into two groups. The
treatment is administered to one  group; the other
group serves as the control. The difference in the aver-
age response between the two groups indicates the ef-
fect of the treatment, and the random assignment of
individuals to the groups permits an inference of cau-
sality because the observed  difference results from
the treatment and not from some preexisting differ-
ence between the groups.
    In an ecological assessment, the treatment and
control groups  are  not selected at random from a
larger population, since the impacted site cannot be
selected at random. And no matter how carefully the
reference site is matched,  the investigator cannot
compensate for the lack of random selection. In this
sense,  a statistically valid test of the hypothesis that
an observed difference between an impacted site and
a control site results from a specific cause is impossi-
ble. The hypothesis that the two sites are different can
be tested, but the difference cannot be attributed to a
specific cause. In statistical terms,  the stress on the
impacted site is completely confounded with preex-
isting differences between the impact and reference sites.
    Although a firm case can be made that a site is
subject to adverse impacts, investigators must realize
that the site is an experimental unit that cannot be
replicated.  They must  take  care  to avoid
"pseudoreplication" (Hurlbert, 1984) — the testing of
a hypothesis about adverse effects without appropri-
ate statistical design or analysis methods. The prob-
lem is  a misunderstanding or misspecification of the
hypothesis being tested. It is  avoided by understand-
ing that only the hypothesis  of a difference between
sites can be statistically tested. Cause-and-effect is-
sues cannot be resolved using statistical methods. Of
course, establishing that a difference exists is an es-
sential step in the process of demonstrating an ad-
verse  ecological  effect. If there is no detectable
difference, there is no cause to establish.
    Methods used  to establish causality can make
use of statistical techniques, such as regression or cor-
relation. For example, regression can be used to show
that toxicity increases along with the concentration of
some chemical known to originate from a wastewater
outfall. The regression describes the relationship; it
does not imply the cause, though presence of a strong
relationship is evidence that a link exists.
                                                 One way to resolve these issues is to collect both
                                            spatial and temporal data from a control  site. If the
                                            spatial control is missing and  only before and after
                                            impact samples are available at the impacted site, sta-
                                            tistical tests cannot rule out the possibility that the
                                            change would have occurred with or without the im-
                                            pact. If the temporal control is missing, the statistical
                                            tests cannot rule out the possibility that  the differ-
                                            ences between the  control and impact site may have
                                            occurred prior to the impact. In practice, control data
                                            are rarely available in both spatial and temporal di-
                                            mensions. Therefore, most environmental assess-
                                            ments detect only that differences exist between the
                                            control and impact sites. The causal link is more diffi-
                                            cult to discern.

                                            Controls
                                            In environmental assessments, control or reference
                                            data are used in hypothesis tests to evaluate whether
                                            data from the control and impact site are statistically
                                            different. Evidence of impact is based on changes in
                                            the biological community that did not occur in the
                                            control area. Sources of control information include
                                            baseline data, reference site data, and numeric stan-
                                            dards. The case for causality can be strengthened if
                                            the controls are properly selected.
                                                 In an ideal study design, both temporal and spa-
                                            tial control data should be collected (Green, 1979).
                                            The control site should be geographically separated
                                            from the impacted site but have similar physical and
                                            ecological features (e.g., elevation, temperature, wind
                                            patterns,  and habitat type and  disturbance). In
                                            aquatic  habitats, parameters such as stream  order,
                                            flow rate, and stream hydrography should be consid-
                                            ered. Ideally, biological indicators  of impact should
                                            be collected at the control site before and after the im-
                                            pact occurs.
                                                 Statistically, a valid control site should have con-
                                            servative properties. That is, its statistics should be
                                            the same as at the impacted site except for the effects
                                            of the impact. Physical, chemical, and ecological vari-
                                            ables should be measured and  statistically evaluated
                                            to confirm that the impact and  control sites are prop-
                                            erly matched. Investigators should test for mean dif-
                                            ferences  as  well as differences in distribution. In
                                            addition, the variance of the physical and ecological
                                            similarities  between the control and  impact sites
                                            should be the same over time. For example, if the
                                            mean pH between the two sites is consistent but the
                                            impact site  experiences much wider swings in pH
                                            than the control site, then the ability to confidently
                                            detect an impact for a pH-dependent toxicant is com-
                                             promised. Samples within the control and impact
                                             sites should be randomly allocated at some level. For
                                            example, in a random sampling design (Fig. 3.1), the

    [Figure 3.1—Random before and after control impact (BACI) sample design having both temporal and spa-
tial dimensions. Panels show a Control Area and an Impact Area; random samples indicated are from within
areas identified as being of similar habitat. (Adapted from Green, 1979.)]
samples would be randomly allocated in a tempo-
ral/spatial framework that would allow for a number
of different statistical analyses, including analysis of
variance (ANOVA).
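The quantity such an analysis targets can be sketched with hypothetical cell means: the BACI effect is the before-to-after change at the impact site beyond the change seen at the control site:

```python
# hypothetical mean IBI for each site-by-period cell of a BACI design
means = {("control", "before"): 42.0, ("control", "after"): 41.0,
         ("impact",  "before"): 43.0, ("impact",  "after"): 35.0}

impact_change  = means[("impact", "after")]  - means[("impact", "before")]    # -8
control_change = means[("control", "after")] - means[("control", "before")]   # -1
baci_effect = impact_change - control_change  # -7: decline beyond the control
```

In an ANOVA framework this contrast corresponds to the site-by-period interaction term; a nonzero interaction is the statistical signature of an impact-associated change.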
    In an optimal study design, the impact would be
in the future. Thus, baseline data providing a tempo-
ral control would be available to the investigator. In
practice, baseline data are rarely available, and the in-
vestigator cannot be certain whether differences be-
tween the impact and control sites preceded or
followed the impact. Therefore, cause and effect can-
not be determined.  However, the fact that a difference
exists allows the investigator to hypothesize a causal link.
    In some cases, biological variables collected at
an impact site may be compared to a fixed numeric
value rather than to a set of identical measurements
collected at a reference site. Nevertheless, the issues
associated with demonstrating causality remain the
same. In addition,  the investigator should note that
the numeric criterion has no variance. It  is usually
presented as a single number with no associated un-
certainty.  In such cases, a t statistic (see chapter 4)
would be appropriate.  As an alternative to the nu-
meric criterion, investigators could use the data from
which the criterion was derived. Uncertainty esti-
mates from that data set could be used in statistical analyses.

Key Elements
Several specific survey designs are appropriate for
use in a biocriteria program, but designs for a particu-
lar environmental assessment should be developed
with the aid of a consulting statistician. Such plans
should include the following key elements, beginning
with the notion of a pilot study.

Pilot Studies
In a pilot study, the investigator makes a limited sur-
vey of the variables that determine impact at both the
impact and control site. Data from the survey can be
used to estimate sample  sizes,  evaluate sampling
methods, establish important variance components,
and critique or reevaluate the larger design. The sam-
ple size helps determine the particular levels of statis-
tical confidence that can be gleaned from the study. In
general, a pilot study can save time and effort by veri-
fying an investigator's preliminary assumptions and
initial evaluations of the impact site. Current studies
and historical data collected at the site of interest or
similar sites can be used to help establish a good mon-
itoring design.
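One common use of pilot data is a rough sample-size calculation; the sketch below uses a textbook normal-approximation formula with invented pilot numbers, not a prescription from this guidance:

```python
import math

s = 6.0     # pilot-study estimate of the standard deviation of IBI (hypothetical)
d = 3.0     # smallest difference worth detecting (hypothetical)
z = 1.96    # two-sided normal quantile for a 0.05 significance level

# n grows with the variance and shrinks with the square of the detectable difference
n = math.ceil((z * s / d) ** 2)   # samples needed per site
```

Halving the detectable difference d would roughly quadruple the required sample size, which is why pilot variance estimates matter so much to survey cost.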

Location of Sampling Points
A second key issue in the study design is the location
of the sampling points. Many specific designs  and
variations are available, including (1) completely ran-
dom sample designs, (2) systematic sample designs,
and (3) stratified random sample designs.

•  Random Samples. In  complete  random sam-
pling, every potential sampling point has the same
probability of selection. The investigator randomly
assigns the sample points within the impact site and
independently within the control site. No attempt is
made to partition the impact and  control sites either
spatially or temporally except to ensure similar physi-
cal habitats. The  sampling units are  numbered se-
quentially, and the selection is made using a random
number table or computer-generated random numbers.
    The advantage of random sampling is that statis-
tical analysis of data from points located completely
at random is comparatively straightforward. In addi-
tion, the method provides built-in estimates of preci-
sion. On the other hand, random  sampling can miss
important characteristics of the site, spatial coverage
tends to be nonuniform, and some points may be of
little interest.
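The selection step described above can be sketched in a few lines (the 100 numbered units are hypothetical):

```python
import random

units = list(range(1, 101))          # sampling units numbered sequentially
random.seed(42)                      # reproducible draw for illustration
sample = random.sample(units, k=10)  # each unit equally likely, no replacement
```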

•  Systematic Samples.  Systematic  sampling oc-
curs when the investigator locates  samples in a
nonrandom but consistent manner. For example, sam-
ples can be located at the nodes of a grid, at regular in-
tervals along a transect, or at equally spaced intervals
along a streambank. The grid or interval can be gener-
ated randomly, after which the position of all samples
is fixed in space.
    Systematic sampling has two  advantages  over
simple random sampling. First, it is easier to draw,
since only one random number is required. Second,
the sampling points are evenly distributed over the
entire area. For this reason, systematic sampling often
gives more accurate results than random sampling,
particularly for patchy environments or environ-
ments with distinct discontinuous populations.
     Systematic sampling also has its disadvantages.
For example, if the magnitude of the biological vari-
able exhibits a fixed pattern or cycle over space or
time, then systematic sampling is unlikely to repre-
sent the variance of the entire population. Suppose an
organism has several hatches at roughly equally spaced
time intervals during the sampling period; then samples
taken at fixed time intervals may provide a biased
estimate of the average number of individuals
                                           alive at one time. If possible, the population should be
                                           checked for such periodicity. If periodicity is found or
                                           suspected but not verifiable, systematic sampling
                                           should not be used.
                                               Another disadvantage of systematic sampling is
                                           that it is more complicated to estimate the standard
                                           error than if random sampling had been used. Despite
                                           these problems, systematic sampling is often part of a
                                           more complex sampling plan in which it is possible to
                                           obtain unbiased estimates of the sampling errors.
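The single-random-number draw described above can be sketched as follows (the 100-position transect is hypothetical):

```python
import random

units = list(range(100))     # equally spaced positions along a transect
k = 10                       # sampling interval
random.seed(7)               # reproducible draw for illustration
start = random.randrange(k)  # the one random number fixes the whole sample
sample = units[start::k]     # every k-th position from the random start
```

Once the start is drawn, the positions of all remaining samples are fixed, which is both the convenience and the periodicity risk noted above.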

                                           • Stratified Random Samples. Stratified samples
                                           combine the advantages of random and systematic
                                           sampling. Stratified random sampling consists of the
                                           following three steps: (1) the population is divided
                                           into a number of parts, called strata; (2)  a random
                                           sample is drawn independently in each stratum, and
                                           (3) an estimate of the population mean is calculated.
                                                 ȳst = (1/N) Σh Nh ȳh

                                                 where ȳst is the estimate of the population mean,
                                            Nh is the total number of sampling units in the hth
                                            stratum, ȳh is the sample mean in the hth stratum,
                                            and N = Σh Nh is the size of the population. Note that
                                            the Nh are not sample sizes but the total sizes of the
                                            strata, which must be known to calculate this value.
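A small sketch of this estimator with two hypothetical strata (the stratum sizes Nh and sampled values are invented):

```python
# stratum -> (N_h: total units in the stratum, sampled IBI values)
strata = {
    "riffle": (40, [32, 35, 30]),
    "pool":   (60, [24, 26, 22, 28]),
}

N = sum(N_h for N_h, _ in strata.values())   # population size (100)
# weight each stratum's sample mean by its share of the population
y_st = sum(N_h * (sum(ys) / len(ys)) for N_h, ys in strata.values()) / N
```

Here the "pool" stratum carries more weight (60 of 100 units), so the stratified estimate sits closer to the pool sample mean than a simple average of the two stratum means would.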
                                                 Stratification is employed because differences
                                            between the strata means in the population do not
                                            contribute to the sampling error in the estimate ȳst.
                                            In other words, the sampling error of ȳst arises solely
                                            from variations among sampling units that are in the
                                            same stratum. If the strata can be formed so that
                                            they are internally homogeneous, a gain in precision
                                            over simple random sampling can be achieved.
                                                In stratified sampling, the sample size can vary
                                           independently across  strata. Therefore, money and
                                           human resources can be allocated efficiently across
                                           strata. As a general rule, strata with the greatest uncer-
                                           tainty  (i.e.,  with the largest expected variance, or
                                           about which little is known) should receive the great-
                                           est amount of sampling effort.
                                                For environments that are known to be fairly ho-
                                           mogeneous with respect to the biological variable un-
                                           der consideration, stratified random sampling will
                                           not add precision to the population estimates. In fact,
                                           using stratification in these environments may intro-
                                           duce a loss of precision and a possible bias in the pop-
                                           ulation estimates. In these cases, the investigator may
                                           save a great deal of time and effort by using simple
                                           random sampling in the sampling plan.
Biological Criteria: Technical Guidance for Survey Design and Statistical Evaluation ofBiosurvey Data

                               CHAPTER 3. Designing the Sample Survey
Location of Control Sites
Under EPA's biocriteria program, states may establish
either site-specific reference sites or ecologically sim-
ilar regional reference sites for comparison with im-
pacted sites  (U.S.  Environ. Prot. Agency, 1990).
Typical site-specific reference sites may be estab-
lished along a gradient. For example, a reference site
can be established upstream of a wastewater outfall
(Fig. 3.1). Gradients work well for rivers and streams;
for larger waterbodies, reference sites can be estab-
lished on a one-to-one basis with a similar waterbody
in the region not experiencing the impact under eval-
     An important consideration in site-specific ref-
erence conditions is to establish that the control site is
not impaired at all or that it is only  minimally im-
paired. In particular, baseline data should be obtained
to demonstrate that the impact is linked to the differ-
ences detected  between the reference site and the
control site.
     Ideally, a reference site should exhibit no impair-
ment; however, natural variability in biological data
may make the determination of minimal or no impact
difficult, especially if the impact is relatively small.
An interesting method for site selection is to establish
several reference sites based on their physical simi-
larities with the impact site. For example, selecting
one reference site with higher flow than the impact
site and another with lower flow may increase the in-
vestigator's ability to determine the presence of a real
impact. Comparisons of data collected from the im-
pact and reference sites should provide consistent in-
terpretations of the impact, regardless of which
reference site is used in the comparison.
     Minimizing temporal variation in biological
measurements can be critical to the evaluation of con-
trol and impacted sites. A general rule is that samples
should be obtained from the control and reference
sites during the same time periods. It may be feasible
to target an index period (e.g., late spring or summer)
in which the biological variables are  assumed to be
appropriate indicators of ecological health (e.g., the
period of maximum abundance or the period of mini-
mum variation in water chemistry). However, for
some organisms,  periods  of  maximum  abundance
may also be periods of high variability. In this case,
periods of low abundance but stable conditions can
be used to help the investigator detect impairment if it

Estimation of Sample Size
A final key component in developing a survey design
is to determine how many samples are required. In
most plans, the issue involves a trade-off between the
accuracy of the sample estimate and the magnitude of
available monetary and human resources. Conse-
quently, the first step is to determine how large an er-
ror can be tolerated in the sample estimate. This
decision requires careful thought; it depends on how
the collected data will be used and the consequences
of a sizable uncertainty associated with the sample es-
timates. Thus, in reality, selecting a sample size is
somewhat arbitrary and driven by practical consider-
ations of time and money. Investigators should, how-
ever, always approach the selection of sample size
using sound statistical principles.
    The appropriate equations for calculating sam-
ple sizes are often design dependent. Here, we present
a design for simple  random sampling. Suppose that d
is the allowable error in the sample mean, and the in-
vestigator  is willing to take only a 5 percent chance
that the error will exceed d. In other words, the inves-
tigator  wants to be reasonably certain that the error
will not exceed d. The equation for the sample size is
                    n = -
    and t is the f statistic for the level of confidence
required. For a 95 percent confidence level that the
sample mean will not exceed d, t = 1.96. Obviously,
an estimate of the population standard deviation, a, is
necessary to use  this relationship. In many cases, an
estimate of a can be obtained from existing data.
When few data are available about a, it is a good idea
to generate a set of tables to develop a  sense of the
range of samples required.
    Suppose, for example,  that  an  investigator
wishes  to estimate mean pH readings  above  a
wastewater discharge. How many samples are needed
to estimate the true mean pH? At the extremes, the in-
vestigator guesses that the standard deviation might
range between 0.5 and 1.2 pH units. This estimate
leads to Tables 3.1 and 3.2:
Table 3.1. — Number of samples needed to estimate
the true mean (low extreme). j

0.2 pH units
0.5 pH units
Biological Criteria:  Technical Guidance for Survey Design and Statistical Evaluation ofBiosurvey Data

                               CHAPTER 3. Designing the Sample Survey
Table 3.2 — Number of samples needed to estimate
the true mean (high extreme).
0.2 pH units
0.5 pH units
1 pHunit
Note that the number of required samples increases
dramatically as the confidence and precision in the
estimates increase, and as the population standard
deviation increases. As a general rule, the precision of
the estimate is inversely proportional to the square
root of the sample size. Therefore, increasing the sam-
ple size from 10 to 40 will roughly double the preci-
    For a fixed precision, changing the required con-
fidence in the estimate from 95 to 99 percent slightly
more than doubles the sample size. Equation 3.2 can
easily be adopted for binary response variables in
which the responses are expressed as proportions or
percentages (Cochran, 1963). In addition, for  those
situations where the number of sampling units is fi-
nite, a finite population correction for the sample size
is available (Cochran, 1963).
    Equations for calculating sample sizes for ran-
dom, nonrandom, and stratified sample surveys can
be found in the literature. They depend on the sample
design, the available variance estimates, and whether
the environmental assessment has both spatial and
temporal components.

Important Rules
Developing a sample design is frequently driven by
factors other than statistics and biology. For example,
the investigator may  be asked to determine a differ-
ence between upstream and downstream stations of a
municipal treatment plant outfall, long after the sus-
pected impacts began. Even in these cases, creative
sampling strategies  can help develop the link be-
tween  the wastewater outfall and downstream im-
pacts. The  following  rules apply  to  most
environmental assessment scenarios.

    .   Rule 1. Sample designs and their associated
       analytical techniques can be difficult to
       conceptualize and implement. Always
       consult individuals with appropriate
       training before starting a biocriteria study.
    .   Rule 2. State precisely and clearly the
       problem under evaluation before attempting
       to develop a survey design.
                                                  Rule 3. Collect samples from a reference site
                                                  as a basis for inferring impact. In general,
                                                  the sampling scheme used at the impacted
                                                  site should be the same as that employed at
                                                  the reference site.

                                                  Rule 4. To the degree possible, use
                                                  environmental characteristics to minimize
                                                  the error in the sample estimate. For
                                                  example, for patchy environments examine
                                                  the possibility of systematic sampling; for
                                                  heterogeneous populations,  examine the
                                                  possibility of using stratified random
                                                  sampling. In all cases, attempt to minimize
                                                  sample bias by randomly allocating samples
                                                  (either geographically or temporally across
                                                  the entire population, or within strata).

                                                  Rule 5. For seasonally dependent biocriteria,
                                                  collect data for several seasons before
                                                  attempting to determine an impact.  For
                                                  biocriteria that are not seasonally
                                                  dependent, collect sufficient data to
                                                  represent the variability in the population.

                                                  Rule 6. Collect enough data so that the
                                                  accuracy and precision requirements
                                                  associated with using the information are
Biological Criteria: Technical Guidance for Survey Design and Statistical Evaluation ofBiosurvey Data

Technical Guidance for Survey Design and Statistical Evaluation of Biosurvey Data

CHAPTER 4. Detecting Mean Differences                                   21
Cases Involving Two Means	21
    Random sampling model, external value for 6	21
    Random sampling model, internal value for 6	22
    Testing against a Numeric criterion	22
    A Distribution-Free Test	23
    Evaluating Two-Sample Means Testing	23
Multiple Sample Case	23
Parametric or Analysis of Variance Methods	23
    Nonparametric or Distribution Free Procedures	25
    Testing for Broad Alternatives	25
The Kolmogorov-Smirnov Two-Sample Test	26
Relationship of Survey Design to Analysis Techniques	27

CHAPTER 4   Detecting  Mean Differences
      Hypothesis testing methods that seek to detect
      the mean differences arising from two or more
independent samples are among the most common
statistical procedures performed. However, these pro-
cedures are frequently used without regard to some
basic assumptions about the data under investigation
— which, in some cases, leads to errors in interpreta-
    This section describes  and  illustrates several
methods for detecting mean differences. It focuses on
(1) cases in which only two means are involved, and
(2) situations involving more than two means. It also
presents suggestions concerning the use and abuse of
means testing procedures.

Cases Involving Tiro Means
Several scenarios within the biocriteria program re-
quire investigators to compare the mean differences
between two independent populations. Suppose for
example, that we want to use biocriteria in a regula-
tory setting in the following situation:
    A wastewater treatment plant discharges its ef-
fluent into  a stream at a single point. Upstream of the
discharge facility, the stream is in good shape (unaf-
fected by any known sources of pollution). The re-
source agency has sufficient  funds to monitor three
stations upstream of the discharge site and a compara-
ble number of streams downstream  of the discharge
site during the late summer. The agency has chosen to
evaluate aquatic life use impairment using benthic
species richness.
    At each of the six sites, 10 independent measures
of species  richness  were generated  by randomly
placed ponar grabs over a relatively small spatial area
(a sample size of 10 was chosen based on variability
estimates generated at a different, but similar site).
Sites of comparable habitat quality were chosen for
sampling. The upstream sites will serve as a reference
condition against which to compare the downstream
    In addition to the current survey (i.e., sampling
regime, data collection, and interpretation), the regu-
latory agency has identified an additional upstream
site for which it has 10 years of comparable long-term
(historical) data. The investigators have no reason to
believe that a time component exists in the long-term
data. Table 4.1 presents descriptive information asso-
ciated with the upstream and downstream sites and
with the long-term site.
    The question for investigators is this: Do the data
reveal  a  downstream effect associated with the
wastewater discharge? Several methods are available
for  assessing the mean differences between the up-
stream and downstream sites, and each method has
both positive and negative aspects.

Random Sampling Model, External Value
Suppose investigators believe that the 30 measures of
benthic species richness collected at  the upstream
and downstream sites can be treated as random sam-
ples from appropriate populations. In particular, they
Table 4.1 — Descriptive statistics: upstream-downstream measures of benthic species richness.
! Pooled Data
Biological Criteria: Technical Guidance for Survey Design and Statistical Evaluation ofBiosurvey Data

                                CHAPTER 4.  Detecting Mean Differences
believe that the two populations have the same form
(i.e., normal distributions with the same variance, a)
but different means, na and fj.b. How can the investiga-
tors use statistical theory to make inferences about
the effect of the wastewater treatment plant dis-
    If the data were random samples from the popu-
lations, with Na = 30 observations from the upstream
population and Nb observations from the downstream
population, the variances of the calculated averages,
Ya and Yb would be:
                                _ d2         (4.1)
    Likewise, in the random sampling model, Ya and
Yb would be distributed independently, so that:
    Even if the distributions of the original observa-
tions had been moderately nonnormal, the distribu-
tion of the difference Ya-Yb between sample averages
would be nearly normal because of the central limit
effect. Therefore, on the assumption of random sam-
             z = -
  J_  J^
  sT + N7
would be approximately a unit normal deviate.
    Now, CT, the hypothetical population value for
the standard deviation, is unknown. However,  the
historical data yield a standard deviation of 3.4. If this
value is used for the common standard deviation of
the sampled populations, the standard error of the dif-
ference, Ya-Yb = 2.3, is

                aj— + — =0.89
                  V30  30

     Based on the robust estimators (trimmed mean
difference of 2.1 and median absolute difference of
1.6) the standard error of the difference would be
0.41. If the assumptions are appropriate, the approxi-
mate significance level associated with the postulated
difference (//Q—pib) in the population means will then
be obtained by referring
               zn =-
to a table of significance levels of the normal distribu-
tion. In particular, for the null hypothesis (/^.a-^b) = 0,
z0 = 2.3/.S9 = 2.6, and Pr(z < 2.6) < .005. Again, the
upstream/downstream effect seems to be realistic (us-
ing the robust estimators, z = 5.1 and Pr[z < 5.1]
                                            < .001). Note that we use the z distribution in this ex-
                                            ample because the population variance is determined
                                            from an external set of data that represents the popu-
                                            lation of interest — an assumption equivalent to as-
                                            suming that the variance of the population is known
                                            (i.e., not estimated).

                                            Random Sampling Model, Internal Value
                                           for a
                                                Suppose now that the only evidence about CT is
                                            from the Na = 30 samples taken upstream and the Nb
                                            = 30 samples taken downstream. The sample vari-
                                            ances are
                                                         s, =-
                                                                N  -1
                                                                         = 625
                                                                                  = 8.41
                                                On the assumption that the population variances
                                            of the upstream and downstream sites are, to an ade-
                                            quate approximation, equal, these estimates may be
                                            combined to provide a pooled estimate of s2 of this
                                            common a2. This is accomplished by adding the sums
                                            of squares in the numerators and dividing by the sum
                                            of the degrees of freedom,
                                                                                        • = 752
                                                On  the assumption of random sampling from
                                            normal populations with equal variances, in which
                                            the discrepancy [(Ya-Yb) - (^a-/ub)] is compared with
                                            the estimated standard error of Ya-Yb, a t distribution
                                            vfithNa+Nb-2 degrees of freedom is appropriate. The
                                            t statistic in this example is calculated as
                                        t =-
                                                              1     1
                                                           s I— + —
                                                             N.   N,
                                                                              = 32
                                                This statistic is referred to a t table with 58 de-
                                            grees of freedom. In particular, for the null hypothesis
                                            that (MQ-/"b) =  0, Pr(t < 3.2)  < .001. Again, an up-
                                            stream/downstream effect seems plausible. Using the
                                            robust statistics, a pooled estimate of error can be cal-
                                            culated as the average of the median absolute devia-
                                            tions associated with each data set ([I +  1.6 ] / 2 =
                                            1.3). Therefore, the t statistic is 6.3 and the Pr(t< 6.3)<
                                            .001. Note that we use the t distribution in this exam-
                                            ple because the population variance is estimated from
                                            the survey data and not assumed to be known.

                                            Testing against a Numeric Criterion
                                            In the preceding sections, hypothesis tests were pre-
                                            sented for the two-sample case. Similar tests are avail-
Biological Criteria: Technical Guidance for Survey Design and Statistical Evaluation ofBiosurvey Data

                               CHAPTER 4. Detecting Mean Differences
able for testing a sample mean against a fixed numeric
criterion (for which an associated uncertainty does
not exist). In this case, the t statistic can be written as
                                                                V(W) =

                    t =-
    Here, s is the sample standard deviation and n is
the numeric criterion of interest. The probability of a
greater value can be found in a t table using n-1 de-
grees of freedom.

A Distribution-Free  Test
In many instances, the assumption that the raw data
(or paired differences) are normally distributed does
not hold. Even the simplest monitoring design involv-
ing the comparison of two means requires either (1) a
long sequence of relevant previous records that may
not be available or (2) a random sampling assumption
that may not be tenable. One solution to this dilemma
is the use of distribution free statistics such as the  W
rank sum test (Hollander and Wolfe, 1973). The Wtest
is  designed to test the hypothesis that two random
samples are drawn from identical continuous distri-
butions with the same center. An alternative hypothe-
sis is that one distribution is offset from the other, but
otherwise identical. Comparative studies of the t and
W tests indicate that while the t test is somewhat ro-
bust to the normality assumption, the W test is rela-
tively powerful  while not requiring normality. In
many cases, performing both the t and W tests can be
used as a double check on the hypothesis.
    To conduct the Wlest (see Chapter 2), the investi-
gator combines the data points from the samples, but
maintains the separate sample identity. This overall
data set is ordered from low value to high value, and
ranks are assigned according to this ordering. To test
the null hypothesis of no difference between the two
distributions f(x) and g(x)  (i.e., H0: f[x]  = g[x]), the
ranks of the data points in one of the two samples are

                   W=£Rj                (4.5)

    Statistical significance is a function of the degree
to which, under the null hypothesis, the ranks occu-
pied by either data set differ from the ranks expected
as a result of random variation. For small samples, the
W statistic calculated in Equation 4.5 can be com-
pared to tabulated values to determine its signifi-
cance. Alternatively, for moderate to large samples,  W
is  approximately normal with mean E(W) and vari-
ance V(W):
                    Na (Nh +Na +1)
            E(W) = ———	         (4.6)
                                                                     Z —
                                                        In the upstream/downstream case that we have
                                                    been discussing, E(W) = 1,127, z = 3.12, and Pr(
                                CHAPTER 4. Detecting Mean Differences
Table 4.2. — Assumptions, advantages, and disadvantages associated with various two-sample means testing

distribution with
external estimate
of c

distribution with
internal estimate
of CT

Distribution free

Past data can
provide relevant
reference set for
difference Ya-Yb

observations are
as if obtained by
random sampling
from normal
populations with
common standard

observations are
as if obtained by
No assumption of
independence of
errors. No need
for random

Need relevant,
lengthy past
Construction of
distribution can
be tedious

consistency, and
length of data are
deemed to
represent a healthy

Need to know CT. Quality,
reference ; Need assumption consistency, and
distribution that
is easy to

No external data

random sampling :
from normal
populations with

common standard
deviation a
of independence
of individual
errors coming
from random

Need assumption
of independence
of individual
errors coming
from random
length of data are
deemed to be a
sample from a
healthy ecosystem.
transformation may
be necessary to
achieve normality.
Most commonly
used test.
Appropriate if
assumptions hold.
sampling If outliers or

influential data
apprent, consider
i the use of robust
estimated by s

observations are
as if obtained by
random sampling
from populations

Computations are
easy. No external
data needed.
of almost any : sampled need not
be normal.

Need assumption
of independence
or symmetry of
individual errors
estimators of the
mean and variance.
Can be used if
assumptions are
suspect. Can be
arising from : used to verify
random sampling
results of
parametric tests.
Known impacts to
reference site have
occurred, or
physical and
between the impact
and reference site
are identified.
Quality of data is
suspect or impacts
at the external site
are known or

assumptions do not
hold. Generally,
robust estimators
of the mean and
variance can
reduce the
influence of

No real
disadvantage of
these tests. In most
cases, power of the
test is equivalent or
near the parametric
counterpart. |
These decisions include the effects of interest (model
specification — one-way designs, two-way designs,
and so forth); whether the classification variables are
random, fixed, or nested; whether any interactions
(nonadditive effects) are present in the data; how to
handle unbalanced designs (unequal sample sizes for
the various treatments); and the nature of the error
     As we can see from this list, ANOVA procedures
are not simple but require a great deal of thought. In
general, the ANOVA model should follow directly
from the sample design used to collect the biocriteria
                                            data. The following  model illustrates a simple
                                            one-way, fixed block design like that described in the
                                            upstream/downstream case presented here. The over-
                                            all model for the ANOVA is
                                                              Yfl =
                                                     YJJ = the jth response for the 1th site
                                                      /j. = the population mean

                                                     a,- = the effect of site i on Y
Biological Criteria: Technical Guidance for Survey Design and Statistical Evaluation ofBiosurvey Data,

                                CHAPTER 4.  Detecting Mean Differences
         6ij = the error associated with each
               observation in the data.

    The model assumes that the errors are normally
distributed with mean 0 and variance a2. Based on the
model, any  observation is composed of an overall
mean (^), a site effect (a), and a random element (e)
from a normally distributed population. Hypothesis
testing for the ANOVA model is undertaken by calcu-
lating the variance associated with model compo-
nents (sums-of-square differences around the mean
effect). A test  statistic is formed by comparing the
mean square differences associated  with a model
component to  the mean error term. This statistic is
distributed as an F distribution. Table 4.3 presents an
example of this variance breakdown  for the simple
upstream/downstream model.
Table 4.3. — Analysis of variance results for the case
study model.


    As seen in the table, the effect of site means is an
important indicator of the level of benthic species
richness. Therefore, it seems a good idea to explore
the relationship among the site means as a method of
examining  a possible gradient of upstream/down-
stream differences. Several methods are available for
testing the differences between site means. In this ex-
ample, the method  of least  significant difference
(LSD), Duncan's multiple range test, and  Tukey's
studentized range  test are presented. (A review of
these and other multiple comparison methods is in
the SAS/STAT Guide for Personal Computers.) Tables
4.4 through 4.6 present the results of these multiple
comparison tests.
Table 4.4. — Least significant difference multiple
comparison test.



10.4 10
Table 4.5. — Duncan's multiple comparison test.


12.6 10
Table 4.6. — Tukey's multiple comparison test.


- 	 -

! 11.2
L . .
    In the above  tables,  sites  within a specified
grouping are not different at the a = 0.05 level of sig-

Nonparametric or Distribution Free
Distribution free methods for testing multiple sample
means are available in much the same format as for
parametric tests. The Kruskal-Wallis rank sum test
(one-way design)  and the Friedman rank sum test
(two-way design) are frequently used when the nor-
mality assumptions do not hold (see Hollander and
Wolfe [1973] for a review of these methods). Multiple
comparison methods based on the individual rank
scores for each site are available.
    Again, the investigator must develop the model
to match the  experimental design. In  the up-
stream/downstream comparisons of benthic species
richness,  the Kruskal-Wallis  test with a simple
one-way  model results in a chi-square  statistic of
16.38 (Pr < chi-square = 0.006). Again, the up-
stream/downstream sites appear to differ in the mea-
sured biocriteria. Results of the multiple comparison
tests using ranks were similar to those presented in
the ANOVA model.

A Test for Broad Alternatives
Frequently, investigators are faced with situations in
which tests for  mean differences or variance differ-
ences are not sufficient. For example, investigators
Biological Criteria: Technical Guidance for Survey Design and Statistical Evaluation ofBiosurvey Data

                               CHAPTER 4. Detecting Mean Differences
may realize that smaller fish are more sensitive to a
pollutant than larger fish. In such cases, simple test-
ing for mean differences (in which the mean is calcu-
lated without regard to size  class) between reference
and impacted sites may not suffice. Instead, the mea-
sure of toxic effect will be  better reflected through
changes in the distribution of fish caught at the two
sites. Examining the differences in distribution func-
tions among sites may be a more sensitive way to de-
tect effects than relying on population estimates such
as the mean and variance.
    Statistics designed to detect broad classes of al-
ternatives, as in the scenario presented here, are dis-
tribution free tests (i.e., they do not rely on normality
assumptions), although  they do  have parametric
counterparts. For a single  sample,  goodness-of-fit
tests to gage the correspondence between an empiri-
cal distribution function of observations and a spe-
cific probability model or distribution (e.g., normal or
lognormal) may be useful. These tests can also be con-
ducted using the  chi-square statistic (see Snedecor
and Cochran, 1967).

The Kolmogorov-Smirnov
Two-Sample Test
    Within the biocriteria program, investigators will frequently be challenged to evaluate a broad range of differences between two or more populations. The Kolmogorov-Smirnov (KS) two-sample test is easy to implement and can be used to evaluate the relationship between two distribution functions. This test provides graphic and statistical evaluations of two sets of data.
    The KS two-sample test involves the development of two cumulative distribution functions (CDFs) to test the hypothesis that each sample was taken from the same population. The test is based on the difference between the empirical distribution functions. The largest difference between the two functions, Dmax, forms the basis for the test statistic. Dmax is the maximum vertical distance at any horizontal point between the two CDFs (Fig. 4.1).
    To generate a CDF for an individual sample, the data are ordered from lowest to highest, and the rank order of each point is determined. Dividing each rank by the sample size results in a cumulative distribution function ranging from 0 to 1 (or 0 to 100 percent, if multiplying by 100). The two samples need not have the same number of observations. Tabled values of the test statistic are available for various sample sizes (Hollander and Wolfe, 1973). The test can be conducted as either one-sided or two-sided. For the benthic species richness example shown here in Figure 4.1, Dmax is 0.433 (43.3 percent), which occurred at a species richness value of 10.6. The null hypothesis is rejected with a Type I error rate of 0.0072.

Figure 4.1 — Cumulative distribution functions of upstream (solid line) and downstream (dotted line) sites; horizontal axis is species richness.
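The Dmax computation described above can be sketched in a few lines of Python. The sample values below are purely illustrative, not the species richness data behind Figure 4.1:

```python
def ecdf(sample):
    """Return the empirical CDF of `sample` as a function of x."""
    s = sorted(sample)
    n = len(s)
    return lambda x: sum(1 for v in s if v <= x) / n

def ks_two_sample_dmax(a, b):
    """Maximum vertical distance (Dmax) between two empirical CDFs.
    The maximum must occur at an observed value, so only those
    points need to be checked."""
    fa, fb = ecdf(a), ecdf(b)
    return max(abs(fa(x) - fb(x)) for x in list(a) + list(b))

upstream = [9, 11, 12, 14, 15]      # illustrative richness values
downstream = [3, 5, 7, 9, 11]
dmax = ks_two_sample_dmax(upstream, downstream)
```

The computed Dmax would then be compared against the tabled critical value for the two sample sizes (Hollander and Wolfe, 1973) to reach a test decision.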

Relationship of Survey Design
to Analysis Technique
Table 4.7 outlines the relationship between means
testing techniques and selected survey designs as de-
scribed in earlier sections. As a general rule, the data
analysis techniques are driven by the survey design.
The principal decision points are the number of sites,
the available sample size, and the presence or absence
of reference sites. However, investigators should not
be constrained by the survey design. Data explora-
tion, using any technique that fits the data, is encour-
aged and can provide insightful results.
Table 4.7 — Survey design and analysis techniques.

1. Upstream/downstream: random sampling at single sites using current survey data.
   Techniques: t-test using an internal value of the variance; Wilcoxon test; with large data sets, a KS two-sample test may be appropriate.

2. Upstream/downstream: random sampling at multiple sites using current survey data.
   Techniques: one-way ANOVA using an internal value of the variance; KS two-sample test on merged upstream and downstream data; Kruskal-Wallis rank sum test.

3. Upstream/downstream: random sampling within spatial or temporal strata with one or more sites.
   Techniques: two-way (or more complicated) ANOVA tests; Friedman rank sum test (and other more complicated nonparametric tests).

4. Impact site data with large off-site external data; for example, when determination of impact is not clearly definable or no good upstream reference condition exists.
   Techniques: external reference distribution tests, including the two-sample KS test; t-test with an external estimate of the variance.

5. Systematic sampling, such as random sampling along a transect or at nodes of a grid.
   Techniques: ANOVA and t-tests with internal estimates of the variance, and possibly distribution tests (also note that such designs may be subjected to techniques that demonstrate geographical trends and patterns, such as kriging and GIS methods).

6. Regionally impacted sites with one or more reference sites.
   Techniques: two-sample KS test.

CHAPTER 5. Discussion and Examples
Working with Small Sample Sizes
Assessments Involving Several Indicators
Regional Reference Data
Using Background Variability Measures
Final Suggestions for Small Sample Sizes
Decision Analysis and Uncertainty

CHAPTER 5    Discussion and Examples
  In the previous four chapters, standard statistical
  methods were presented,  discussed, and  illus-
trated with simple examples. Those methods and ex-
amples represent conventional analyses or situations
in which sample sizes are relatively large so that hy-
pothesis  testing is essentially straightforward. The
analyses were motivated by available, commonly ap-
plied methods,  and the examples were structured to
fit the methods. The purpose was  to provide  back-
ground statistical  guidance, with examples.
    In this chapter a different approach is taken.
Here,  typical  problems involving biosurvey data are
the starting points, and statistical methods for analy-
sis and hypothesis testing are proposed and applied
specifically to the  problem. In some cases, hypothesis
testing is possible; in  others,  the small sample size
may limit statistical inference. In the latter situation,
the investigator may consider design changes so that
different  statistical analyses can be  undertaken with
biosurvey data in  the future.
    We begin with a general discussion of the importance of small sample size and briefly examine judgmental and statistical options for small sample size, followed by examples of hypothesis testing with small samples. The chapter concludes with "rules of thumb" for working with small samples.

Working with Small Sample Sizes
The conventional methods for statistical hypothesis
testing and interval estimation presented in chapters
1 through 4 work best under conditions that do not al-
ways  exist with biosurvey data. The common  ap-
proaches based on an underlying normal probability
model are clearly not  essential;  distribution-free
methods are versatile and effective. Still, virtually all
confirmatory analyses (i.e., those concerned with hy-
pothesis testing and interval estimation) require esti-
mation of a "location" statistic that is the quantity of
interest (e.g.,  a mean, median, or quartile), and they
also require estimation of a variability statistic (e.g., a
standard  error) that indicates the spread of values for
the location statistic.
    An example of a desirable scenario for confirma-
tory statistical analysis was described in Chapter 2.
Data must be available from the sites of direct interest
in the assessment, and  sample  sizes must be large
enough for hypothesis testing. If the site-specific data
are inadequate (fewer than two observations, which would
prevent direct calculation of a sample variance) or too small
(e.g., fewer than five, which would make the calculated
sample variance quite uncertain), then alternatives to
statistical testing or intervals are possible, but these
alternatives are apt to include additional conditions
or assumptions beyond those  required in conven-
tional analyses.
    For example, a single sampling  might yield a
point estimate for IBI downstream of a wastewater
discharge, but provide  no measure of variability. If
historic data exist on IBI at other impacted sites, then
it is reasonable to assume that  the variability in the
historic data can be used as the variability measure
for testing at the site of interest. If, on the other hand,
the historic data analysis includes an IBI regression
based on predictors, such  as  watershed area  and
physical habitat quality, then the standard error for
this regression is the appropriate variability measure.
The key feature of these hypothetical examples is that
other, relevant information exists that the investigator
believes can be used to  estimate statistics for the site
of interest.
    In the absence of historic data for statistical  esti-
mation (usually for the estimate of variability), hy-
pothesis testing and interval estimation may still be
possible if the scientist is prepared to make certain as-
sumptions. For example, suppose that an aquatic biol-
ogist is confident that he  or she can estimate the
variability in IBI in impacted streams based on experi-
ence and knowledge of the literature. This estimate
could provide the necessary variability measure, but
it is obviously conditional on the judgment of the biologist.
    None of the approaches presented in this docu-
ment are without assumptions; even the example in
Chapter 2 includes the assumption that the sample
data adequately reflect the true  situation. Judg-
ment-based estimates of statistics require a different
assumption, namely, the assumption that the investi-
gator's judgment is good.
    The most serious difficulty in the application of
interval  estimation and  hypothesis testing for
biosurvey data is the small  sample size associated
with many biological surveys. The  strength of infer-
ences from statistical analysis is tied to sample size. If
expert judgment is not available or not acceptable,
then sample size must be large; otherwise, statistical
testing is either not possible or not particularly useful.
But how large is "large enough"? There is no single,
correct answer to that question. As a rule, the standard error drops according to the square root of the
sample size; thus, the answer to the question depends
on the error level that is acceptable in the problem un-
der study.
    In general, sample sizes greater than 10 are usu-
ally desirable, and sample sizes smaller than five may
prevent meaningful  statistical testing. In addition,
since standard error may be expected to drop with the
square root of the sample size, there are diminishing
returns as sample size grows larger.
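The square-root relationship can be made concrete with a short sketch; the standard deviation of 4.5 is of the sort discussed later in this chapter for IBI, and is used here only for illustration:

```python
import math

def standard_error(sd, n):
    """Standard error of a mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)

# Quadrupling the sample size only halves the standard error,
# illustrating the diminishing returns of larger samples:
for n in (4, 9, 16, 36):
    print(n, standard_error(4.5, n))   # prints 2.25, 1.5, 1.125, 0.75
```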
    What can be done when sample size is too small
and expert judgment is either not available or not ac-
ceptable? Any amount of data or evidence can indi-
cate an effect (or the absence of an effect), and this
information can be described in text, presented in ta-
bles, or displayed in graphs. However, in the case of
very small samples, it is important to emphasize that
the analysis is descriptive and not confirmatory. Al-
ternatively, if the investigators have data on biological
and chemical indicators of impairment and criteria
for each of the indicators, then it may still be possible
to test effects across indicators.
    Suppose there is no sample size estimate — only
an estimate of variability based on expert judgment.
How can statistical testing be completed? We actually
have some well-established approaches to elicit judg-
ment-based quantities and error estimates, along with
an effective number of degrees of freedom (Meyer and
Booker, 1991). Alternatively, the scientist may simply
summarize test results in a table with sample size (or
degrees of freedom) and  test results (e.g., p-values)
given for a range from small to large samples. In some
cases, the conclusion may not depend on the effective
sample size; in others, sample size may be  critical,
which places more importance on the goodness of the
judgmental assessment.

Assessments  Involving
Several  Indicators
Suppose that sampling has occurred at a stream site at
which environmental degradation is suspected, but
the sample size for any single indicator is too small for
hypothesis testing. For each indicator, the state has es-
tablished an impairment criterion; thus, the results of
sampling could be presented either as a measurement
(e.g., dissolved oxygen concentration) or as success or
failure in meeting the state's criterion. Each of the in-
dicators is expected to provide an independent mea-
sure or assessment of environmental degradation;
therefore,  several indices cannot be  separately in-
cluded in the analysis if they are based on the same
underlying measurements.
    As an example, Table 5.1 presents three biological indices, the IBI, ICI, and Iwb, based on sampling at a single site on three different dates. The state biocriteria are also given. It is assumed that the two-month period between samplings results in temporal independence between the samples.
Table 5.1 — Biological indices and biocriteria: IBI, ICI, and Iwb values (with attainment or violation of each biocriterion indicated in parentheses) for samplings on June 15, August 15, and October 15.
    Since we have only a single estimate per date on each index, and only three data points per date and per index, statistical inference opportunities are limited. We can, however, treat the nine index estimates in Table 5.1 as nine independent measures by which to assess the underlying condition of biologic impairment, based on biocriteria violations. The indices in Table 5.1 are recorded as a 0-1 variable, in parentheses, indicating attainment (1) or violation (0) of each biocriterion. Next, these nine 0-1 data points can be subjected to statistical analysis to determine the overall biologic impairment reflected in the aggregate of the three indices.
    First, calculate the proportion of violations (p) in the sample as an estimate for the probability of biologic impairment at the site:

    p = (number of biocriteria violations) / 9

p is a point estimate that is uncertain due to natural variability and measurement error. We can calculate a confidence interval for p or test the hypothesis that p is less than a specified critical value. Once it is calculated, a confidence interval or a percentile could serve as a cutoff point indicative of biological impairment. For example, one might define impairment as more than 50 percent violations. As a variation on that idea, Rankin and Yoder (1990) selected the 75th percentile in a histogram of sample IBI deviations (from the mean value) to be the limit of tolerable variation.
    Confidence intervals for p can be determined using binomial tables or graphs like those presented in Hahn and Meeker (1991, Table A.23a), or using Table 1.4.1 in Snedecor and Cochran (1967). Based on the normal approximation (Snedecor and Cochran, 1967), the two-sided 90 percent confidence interval is

    p - 1.645 sqrt(p(1 - p)/n) < P < p + 1.645 sqrt(p(1 - p)/n)

which, for this example (n = 9), is

    p - 1.645 sqrt(p(1 - p)/9) < P < p + 1.645 sqrt(p(1 - p)/9)
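The normal-approximation interval can be computed directly in Python; the count of six violations out of nine results below is purely illustrative, not taken from Table 5.1:

```python
import math

def proportion_ci(violations, n, z=1.645):
    """Two-sided normal-approximation confidence interval for a
    proportion (z = 1.645 gives 90 percent coverage), clipped to [0, 1]."""
    p = violations / n
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

low, high = proportion_ci(6, 9)   # hypothetical: 6 violations of 9
```

With small n, the normal approximation is rough; the binomial tables cited above are preferable when the interval matters for a decision.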
In making this classification, the investigator would
have noticed that little overlap of the distributions oc-
curs in the extreme tails of the impacted and reference
site distributions.

Using Background Variability Measures
    In the previous section, the Ohio ICI biocriteria
were identified as point values between classes (e.g.,
ICI = 35 is the warmwater habitat criterion separating
"good/exceptional" from "fair"). When a single ICI de-
termination is available from a new site, the Ohio cri-
teria can be used to classify the  site,  ignoring
uncertainty.  Beyond that,  if it is  assumed that the
Ohio ICI classification scheme is  fixed and certain,
and if a reliable estimate of site ICI variability is avail-
able, then the classification based on a single ICI
value can be assessed using a hypothesis test.
    In situations with only a single  estimate of a
bioindicator, collateral information must be obtained
to provide the estimate of  variability. There are sev-
eral potential measures of site bioindicator variability
that might be suitable; Rankin and Yoder's (1990) dis-
cussion presents several informative graphs to show,
for example, that the IBI coefficient of variation drops
as IBI increases (Rankin and Yoder, 1990, Fig. 2), and
IBI coefficient of variation increases slightly as drain-
age area increases (ibid., Fig. 7).
    Knowledge and judgment can be quite helpful in
selecting the variability estimate. For example, if it is
believed that the site bioindicator variability is
roughly constant within a  specified category, then a
calculated estimate of variability for the bioindicator
within the appropriate class can be used as the vari-
ability measure for the site of interest. Categories may
be selected on any criterion (e.g., ecoregion, IBI range)
that is scientifically plausible and leads to an accept-
ably large overall sample size for variability estimation.
    Rankin and Yoder's graphs suggest that, while
the IBI coefficient of variation changes with selected
categories (IBI range), the IBI standard deviation may
be roughly constant across IBI classes and across
ecoregions. A median standard deviation between 4
and 5 appears to be quite consistent in the graphs.
Based on this collateral information, it is assumed
that site-specific IBI in Ohio,  under constant condi-
tions (i.e., no change in site factors that determine
IBI), has a standard deviation of 4.5.
    Here is an example of how this estimate is used.
Assume that the single IBI measurement shown in
Figure 5.1 (IBI = 35) was taken in Ohio under the con-
ditions  described. Since the sampling program in
Ohio is quite large, 4.5 is effectively the true standard
deviation for all sites; thus, with a single sample, it may be concluded that the standard error for the mean value (IBI = 35) is also 4.5. To determine whether the sample is taken from the reference or impacted distribution, assume that 18 IBI samples were taken at the reference and impacted sites, and that the following statistics are calculated:

    Reference site sample mean = 42, sample standard deviation = 5;

    Impacted site sample mean = 27, sample standard deviation = 8.

Then, a two-tailed t test using Equation 2.1b (see Chapter 2) evaluating the null hypothesis that the means are the same will result in the following:

    t = 1.43, for the hypothesis that the reference site mean is equal to the third site mean;

    t = 1.245, for the hypothesis that the impacted site mean is equal to the third site mean.

Based on this information, the investigator has some evidence that the sample collected from the third site is closer to the impacted site mean than to the reference site mean. However, as conveyed by the similar t statistic results, the confidence in this conclusion is relatively weak.
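As a sketch of this kind of comparison (using a generic two-sample form built from means and standard errors, not necessarily Equation 2.1b itself), the statistic compares the difference in means to the combined standard errors:

```python
import math

def t_statistic(mean1, se1, mean2, se2):
    """Generic two-sample t statistic from means and their standard
    errors (a sketch; the document's Equation 2.1b may differ in detail)."""
    return (mean1 - mean2) / math.sqrt(se1 ** 2 + se2 ** 2)

# Single IBI observation (35) with the judgmental standard error 4.5,
# against the reference sample (mean 42, sd 5, n = 18):
t_ref = t_statistic(42, 5 / math.sqrt(18), 35, 4.5)
```

A statistic near zero indicates the single observation is consistent with that distribution; larger magnitudes indicate a poorer match.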

                                            Final Suggestions for Small
                                            Sample Sizes
                                                 The discussion and examples  in this chapter,
                                            while intended as useful, general guidance, are not
                                            firmly  rooted in statistical theory and hence not al-
                                            ways to be followed. Rather, they reflect our experi-
                                            ence and observations. Further, they  concern the real
                                            situations that biologists confront — situations that
                                            do not conform to well-established statistical proce-
                                            dures.  However difficult and awkward for statistical
                                            analysis, the problems must be addressed. With this
                                            caveat, the following concluding comments summa-
                                            rize the discussion and examples presented here:

                                               1. If the sample size is 1, a measure of variability
                                                  may still be obtained using expert judgment or
                                                  other data. If no variability measure can be jus-
                                                  tified, then descriptive statistics may be the ex-
                                                  tent of the analysis (i.e., no interval estimation
                                                  or hypothesis testing).

                                               2. If the sample size is more than 1 but still small
                                                  (perhaps 5 or fewer), then it is possible to use
                                                  the  sample  to estimate variability for interval
                                                  estimation or hypothesis testing. However, the
                                                  intervals may be very large and the tests not
      very powerful, because small sample size
      means that the strength of evidence is weak.

   Figure 5.1 — IBI distributions for reference and impacted sites.

   3. Situations may exist with more than a single
     estimate of variability. Perhaps one estimate
     will be based on data and a second estimate on
     expert judgment. In that case, the  two esti-
     mates of variance can be pooled, using an esti-
     mator like  that  in  Chapter 4's "Reference
     Distribution  Based on  Random Sampling
      Model, Internal Value for σ." A difficulty in
     pooling when a judgmental estimate of vari-
     ance  is involved is determination of the de-
     grees of freedom for the judgmental variance
     estimate. Perhaps the best approach is to make
     a reasoned guess as to how  much information
     the judgment contains with  respect to samples
     (the "effective sample size"):
           (a) if the judgment is highly uncertain, as-
       sign it a small number of degrees of freedom
       (perhaps 2-5),
           (b) if there is more confidence in the judg-
       ment, assign the judgment estimate 5+ de-
       grees of freedom.
     If the conclusions from this analysis are not
     particularly sensitive to the  exact choice of the
     effective sample size for the judgmental esti-
     mate, then inferences may be made with some
     confidence. If, however,  the conclusions are
     sensitive to this choice, then the best approach
                         may be to obtain additional information before
                         drawing final conclusions.
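The pooling step in item 3 can be sketched as a degrees-of-freedom-weighted average; the variances and the judgmental degrees of freedom below are illustrative only:

```python
def pooled_variance(var1, df1, var2, df2):
    """Pool two variance estimates, weighting by degrees of freedom."""
    return (df1 * var1 + df2 * var2) / (df1 + df2)

# A data-based variance (25, on 4 df) pooled with a judgmental
# variance estimate (16) assigned an effective 3 degrees of freedom:
pooled = pooled_variance(25, 4, 16, 3)   # (4*25 + 3*16) / 7
```

Rerunning such a calculation with the judgmental degrees of freedom set at, say, 2 and then 5 is a quick sensitivity check of the kind recommended above.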
Decision Analysis and Uncertainty
In the preliminary approach presented here we have advocated the use of classical statistical hypothesis testing to summarize data concerning biological criteria. We assume that a decision and succinct conclusions based on the data are needed. However, alternatives to hypothesis testing may be appropriate in certain situations. For example, statistical and graphic summaries (e.g., confidence intervals, bivariate plots) may be used to summarize and present information when the investigator believes that a classical hypothesis test based on a single parameter is too brief or that more evidence should be presented.
    An alternative is to recast the hypothesis testing problem using a decision analytic framework. Decision analysis (Raiffa, 1968; Reckhow, 1984) begins with the scientific base summarized in the hypothesis test and incorporates the consequences (e.g., costs and benefits) of possible decisions. In an informal analysis, a decision analytic approach may be undertaken by the decision maker if a desired outcome of management action is "to hedge away" from large adverse consequences or losses. Informal considerations and hedging may be most effectively undertaken in an a priori assessment of costs and benefits, which then becomes a primary basis for choosing between various levels of test significance.
Thus, if it seems likely that biological degradation can
be avoided, then the decision maker may request that
the biologist set the significance level for testing (for
H0: no impact) relatively high (e.g., at 0.10 or
0.20). Alternatively, if cleanup costs are high relative
to benefits, then the test significance level (for H0:
no impact) could be set relatively low (e.g., at 0.01 or
0.005).
    Suppose that a measure of biological integrity is
tested for  upstream-downstream differences sur-
rounding wastewater treatment plant discharges from
small treatment plants (less than 5 million gallons per
day) throughout the state. If the per person cost to up-
grade the treatment level for small communities is
generally quite high, and the benefits to be derived
from biological improvements are generally low (rela-
tive to the organisms affected and typical uses of the
streams), hedging away from high cost may be infor-
mally undertaken by setting the significance  (or "ac-
tion") level of the test quite low (e.g., 0.01 or 0.005).
Additional study of biological degradation, costs, and
benefits would be triggered only if an upstream-downstream test result was significant at this level.
    Hedging away from large losses is an option pre-
cisely because of scientific uncertainty. If there were
no scientific uncertainty about biological degrada-
tion, then the analysis would always focus on costs
and benefits, and the management option with the
highest net benefits would be selected. On the other
hand, if scientific uncertainty is extreme, an appro-
priate strategy may be either to hedge farther from
large adverse consequences or to seek more informa-
tion, if possible,  to reduce scientific uncertainty be-
fore new management action is  adopted.
    In more formal applications, decision analysis may be used to combine uncertain scientific information on biocriteria (expressed probabilistically) with an overall measure of net benefits or use associated with management actions. This approach is most effective in a Bayesian context; Reckhow (1984) presents a simple example applied to lake eutrophication management. However, comprehensive Bayesian decision analysis is apt to be prohibitively expensive (in terms of human resources and cost) for all but the most critical and consequential problems.
    One outcome of data analysis may be that the decision maker will desire more information before implementing new management actions. In formal decision analysis, a value of information calculation should be made to help one determine the wisdom of immediate action versus additional data collection and analysis. In informal analysis, one should consider how useful new information would be if action has to be deferred pending its arrival.
    The outcome of hypothesis testing is a statistical summary of evidence on biological degradation. It does not establish cause and effect, although a well-designed test may associate degradation with a candidate cause. The strength of causal conclusions depends on a number of factors, including a priori scientific knowledge and field observation. Scientific support for management actions is greatest when the observation of degradation is accompanied by documentation of a causal relationship.
    In most cases, environmental management decisions reflect a certain limited understanding of causal connections and a certain degree of observational evidence that is more statistical in nature. This combination is a reasonable basis for decision; in fact, it would be unreasonable to expect detailed causal knowledge in support of every decision. However, as management actions are undertaken and biological response is observed after the fact, more observational evidence may be gathered to support earlier decisions.
Biological Criteria: Technical Guidance for Survey Design and Statistical Evaluation of Biosurvey Data


APPENDIX A. Basic Statistics and Statistical Concepts
Measures of Central Tendency
    Mean
    Median
    Trimmed Mean
    Mode
    Geometric Mean
Measures of Dispersion
    Standard Deviation
    Absolute Deviation
    Interquartile Range
    Range
Resistance and Robustness
Graphic Analyses
    Histograms
    Stem and Leaf Displays
    Box and Whisker Plots
    Bivariate Scatter Plots

APPENDIX A. Basic Statistics and Statistical Concepts
     Certain specific features of a data set are charac-
     terized by descriptive statistics. Of these mea-
sures, the center, or central tendency of a set of data, is
probably the most important. Among the candidate
statistics for central tendency are the mean, median,
mode, and geometric mean. Once the center of a data
set is described, the next important feature is the data
distribution: the spread, dispersion, or scale. Among
the candidate estimators of dispersion are range, stan-
dard deviation, and  interquartile range. These two
characteristics of a data set, central tendency and dis-
persion, are the most common descriptive statistics.
Other characteristics, such as skewness and kurtosis,
are occasionally important. The examples that follow
illustrate the choice of descriptive statistics.

Measures of Central Tendency
Probably the single most useful way to summarize a
data set is to indicate the center of the sample. "Cen-
ter" suggests the vague notion of the middle of a clus-
ter of data points or perhaps the region of greatest
concentration. Since samples of data exhibit a variety
of distributions when plotted as bar graphs  (histo-
grams), it is not possible to define the center unambig-
uously. As a result several statistical estimators can
serve as candidates for determining central tendency
or location, and each candidate has advantages and
disadvantages for the task at hand.

Mean
The arithmetic mean, or simply the mean — the sum
of all data values divided by their number — is the
most frequently used central tendency estimator. It is
so commonly used that scientists often lose sight of
the true reason for calculating descriptive statistics.
In some cases,  the mean is calculated as the central
tendency, though another central tendency statistic
would be better.
    The arithmetic mean (x̄) is the sum of the observations (xᵢ) divided by the number of observations (n):

        x̄ = (x₁ + x₂ + ... + xₙ)/n     (A.1)
    Each observation contributes its magnitude to
the sum of the observations and hence to the mean.
For symmetric  distributions (like  the  normal
bell-shaped or Gaussian distribution), the mean cal-
culated from a sample of data (the sample mean) often
comes  quite close  to the center, or peak,  of the
histogram for that sample. However, biological data
are often not symmetrically distributed. The ex-
tremely high or extremely low observations charac-
teristic  of skewed (nonsymmetrical)  data
distributions pull the mean in the direction of the
skew; a few extremely high observations can pull the
mean away from the bulk of the observations and to-
ward the few high data points. In those situations, a
more resistant estimator,  such as the median or the
mode, may be preferred.

Median
The median is the value of the middle observation
when data are arranged in order of size — from lowest
to highest value. The median is therefore known as an
"order  statistic" since it is based on an ordering or
ranking of observations. When the total number of ob-
servations is an even number, leading to two middle
values, the median is then the average of the two mid-
dle values.
    The "order" of the median observation is
         Median Observation = (n + 1)/2     (A.2)

    The  effect on the median of all but  the mid-
dle-ranking observations is simply to hold a place in
the ranking so that outlying observations do not pull
the median toward the extremes. The median is resis-
tant to the influence of any particular observations;
therefore, it is a good statistic to use when the histo-
gram is skewed or unusually shaped.
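The contrast between the mean and the median on skewed data can be sketched with a short example (the sample values are hypothetical, not from the guidance):

```python
# A small right-skewed sample (hypothetical): the mean is pulled
# toward the single high value, the median stays with the bulk.
from statistics import mean, median

sample = [12, 14, 15, 15, 16, 17, 18, 95]
print(mean(sample))    # 25.25, pulled upward by the extreme value
print(median(sample))  # 15.5, the average of the two middle values
```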

Trimmed Mean
The  trimmed  mean is  the  mean  value from a
subsample of the original sample. The subsample is
formed by symmetrically trimming a small percent-
age of the data points from either end of the ordered
observations. For example,  a 10-percent  trimmed
mean is calculated from the subsample remaining af-
ter the highest and lowest 10 percent of the observa-
tions are removed from the set. At the extreme, the
median is the trimmed mean with all but the middle
observation removed.
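A symmetric trimming rule like the one described can be sketched as follows (function name and data are illustrative, not from the guidance):

```python
# A sketch of a symmetric p-trimmed mean (illustrative only).
from statistics import mean

def trimmed_mean(data, proportion):
    """Mean after dropping `proportion` of the ordered data from each end."""
    xs = sorted(data)
    k = int(len(xs) * proportion)  # observations trimmed from each end
    return mean(xs[k:len(xs) - k]) if k else mean(xs)

data = [3, 5, 6, 7, 8, 9, 10, 11, 12, 200]
print(trimmed_mean(data, 0.10))  # drops 3 and 200, leaving a mean of 8.5
```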
    The trimmed mean is an efficient indicator of
central tendency if censoring has occurred or if a few
outlying observations are found  in the  data. Here,
censoring refers to data points reported as "below de-
tection limits." If 15 percent of the data points are be-
low  detection limits, then a 15-percent trimmed
mean estimator (involving 15 percent trimming from
each end) should result in less bias than the arithme-
tic mean, the estimator based on all uncensored observations.

Mode
The mode is the value in the sample that is most fre-
quently observed; it can be used for discrete or cate-
gorical data. If no value occurs more than once, as
is possible for biological data on a continuous scale,
the mode will not be a useful estimator of central ten-
dency. Alternatively, if a histogram is used to repre-
sent the data, then the mode is defined as the range of
values associated with the tallest bar on the histo-
gram. The mode is a good estimator for central ten-
dency because the most frequently observed value is
usually near the center of the distribution. The histo-
gram will indicate visually whether the mode actu-
ally does correspond with the center of the sample.
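Defining the mode of continuous data as the tallest histogram bar can be sketched as follows (the 5-unit bin width and the values are assumptions for illustration):

```python
# Modal interval of continuous data, defined here as the 5-unit
# histogram bin holding the most observations (bin width and values
# are assumptions for illustration).
from collections import Counter

values = [12, 22, 23, 24, 24, 31, 33, 47]  # hypothetical scores
bins = Counter((v - 1) // 5 * 5 + 1 for v in values)  # bin starts: 11, 16, 21, ...
start = max(bins, key=bins.get)
print(f"modal interval: {start} to {start + 4}")  # 21 to 25
```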

Geometric Mean
The geometric mean is a reasonable measure of cen-
tral tendency for a set of data that exhibit a lognormal
distribution.  It is  the antilog of the mean of
logarithmically transformed data. The lognormal data
distribution is skewed in the  original units of mea-
surement, but normal (Gaussian) when the original
measurements are  log-transformed. Several investi-
gators suggest that the  lognormal  distribution is  a
good probability model for concentration data on en-
vironmental contaminants. Data sets described by the
lognormal distribution have a few high values that are
somewhat extreme from the bulk of the observations.
    The geometric mean may be calculated in two ways:

        Geometric Mean = antilog[(Σ log xᵢ)/n]     (A.3)

        Geometric Mean = (Πxᵢ)^(1/n)     (A.4)

where Πxᵢ = x₁ · x₂ · x₃ · ... · xₙ.
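The two equivalent calculations can be checked numerically; the concentration values below are hypothetical:

```python
# The two equivalent routes to the geometric mean, on hypothetical
# concentration data.
import math

data = [2.0, 8.0, 4.0]

# antilog of the mean of the logs
gm_log = math.exp(sum(math.log(x) for x in data) / len(data))
# n-th root of the product of the observations
gm_root = math.prod(data) ** (1 / len(data))

print(gm_log, gm_root)  # both approximately 4.0, since 2 * 8 * 4 = 64
```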
Measures of Dispersion
Measures of central tendency do not by themselves
summarize a data set; measures of dispersion or
spread are also needed. Dispersion in a data set refers to
the variability in the observations around the center
of the distribution. Good measures of dispersion will
be obtained from symmetric distributions. Asymme-
try, or skew, will affect the estimate of dispersion so
                                                     that it overestimates spread in the shorter tail of the
                                                     data distribution (while underestimating it in the lon-
                                                     ger tail). A transformation (e.g., logarithm) should be
                                                     considered in cases of asymmetry in order to create a
                                                     symmetric distribution. Statistics are then calculated
                                                     on the basis of the transformed metric.

                                                     Standard Deviation
                                                     The most commonly used statistic for dispersion is
                                                     the standard deviation. In fact, the standard  devia-
                                                     tion, like the mean, is used so often that it is  some-
                                                     times thought to be the equivalent of dispersion. It is,
                                                     however, a measure of variability that represents the
                                                     average distance of the data from the mean; and, like
                                                     the mean, it  is  strongly affected by extreme values.
                                                     Thus, the standard deviation for a distribution of data
                                                     with a long tail to the right is inflated by the values at
                                                     the extreme right. Investigators may prefer to create a
                                                     symmetric distribution before calculating the stan-
                                                     dard deviation.
    For a sample, the sample variance (s²) is

        s² = Σ(xᵢ − x̄)²/(n − 1)     (A.5)

and the sample standard deviation (s) is the square
root of the variance (√s²).
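A minimal numerical check of the sample variance and standard deviation, on hypothetical data:

```python
# The sample variance with its n - 1 divisor, and the standard
# deviation as its square root (hypothetical data).
import math

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(xs)
xbar = sum(xs) / n                               # 5.0
s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)  # 32/7
s = math.sqrt(s2)
print(s2, s)
```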

Absolute Deviation
The standard deviation is based on  squared error;
squaring the deviation between a data point and the
sample mean increases the influence of the largest
and smallest observations on the estimate of devia-
tion. The absolute deviation can be calculated to re-
duce  the influence of outliers on the dispersion
statistic. To arrive at the absolute deviation, the mean
(or median) is first estimated, and then the absolute
value of the difference between the mean or median
and each data point is calculated. The mean or me-
dian of these absolute deviations is then calculated as
the mean or median absolute deviation.
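The procedure just described can be sketched briefly (hypothetical data; deviations here are taken about the median):

```python
# Mean and median absolute deviation about the median: the outlier
# inflates the mean version, the median version resists it.
# Data hypothetical.
from statistics import mean, median

xs = [4, 5, 6, 7, 8, 50]
center = median(xs)                   # 6.5
devs = [abs(x - center) for x in xs]  # [2.5, 1.5, 0.5, 0.5, 1.5, 43.5]
print(mean(devs))    # mean absolute deviation, about 8.33
print(median(devs))  # median absolute deviation, 1.5
```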

Interquartile Range
Since the standard deviation is unduly influenced by
extreme observations in both symmetric and asym-
metric distributions of data, a resistant alternative to
the standard deviation (as the median is to the mean)
is needed for situations in which the data are skewed
but transformation is undesirable. Fortunately a good
alternative exists — the interquartile range: the range
that includes the central 50 percent  of all observa-
tions in the set. The interquartile range, like the me-
dian, is based on order statistics; thus, it is unaffected
by the magnitude of the extreme observations in ei-
ther tail. It is calculated as the difference between the
         Biological Criteria: Technical Guidance for Survey Design and Statistical Evaluation ofBiosurvey Data

                           APPENDIX A. Basic Statistics and Statistical Concepts
observation at the 75th percentile (upper quartile) and
the observation at the 25th percentile (lower quartile):
   Lower quartile rank order =
       (1/2)(1 + median rank order)

   Upper quartile rank order =
       n + 1 − lower quartile rank

   Interquartile range (I) =
       upper quartile value − lower quartile value.
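These rank rules can be traced numerically. The sketch below assumes Tukey-style hinge ranks (lower rank = (1 + median rank)/2, upper rank = n + 1 − lower rank) and averages the two neighboring order statistics at half-integer ranks; the data are hypothetical:

```python
# Interquartile range via hinge ranks; half-integer ranks average the
# two neighboring order statistics. Data are hypothetical.
def value_at_rank(xs, r):
    """Order statistic of sorted xs at (possibly half-integer) 1-based rank r."""
    return (xs[int(r) - 1] + xs[int(r + 0.5) - 1]) / 2

xs = sorted([12, 15, 18, 21, 24, 27, 33, 41, 47, 58, 22])
n = len(xs)                         # 11 observations
median_rank = (n + 1) / 2           # 6.0
lower_rank = (1 + median_rank) / 2  # 3.5
upper_rank = n + 1 - lower_rank     # 8.5
iqr = value_at_rank(xs, upper_rank) - value_at_rank(xs, lower_rank)
print(iqr)  # 37.0 - 19.5 = 17.5
```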

Range
Range is an easily determined and therefore fre-
quently cited measure of dispersion. The range is sim-
ply the maximum value minus the minimum value.
Since it is clearly affected by the magnitude of the ob-
servations at either extreme, the range should not be
relied on as the sole indicator of variability. Neverthe-
less, it is often informative to list the range along with
another dispersion statistic.

Resistance and  Robustness
In a number of scientific fields, particularly those that
depend on observational (as opposed to  experimen-
tal) data, errors of measurement and natural variabil-
ity are apt to result in  empirical distributions
(histograms) with occasional outliers and shapes that
are more spread-out than  the normal density  func-
tion. This result, which is fairly  common in water
quality studies, makes robustness  and resistance im-
portant considerations when choosing statistics to
summarize data. In some  situations, of  course, the
outliers, rather than central tendency and dispersion,
will be the focus of the study.
    A resistant estimator is one that is insensitive to
data points that are quite different from the rest of the
data (i.e., outliers). A robust estimator is one that per-
forms well (efficiently),  even if an assumption con-
cerning the underlying probability model is wrong.
For central tendency, the mean is neither resistant nor
robust. The median is resistant to outliers but not ro-
bust since it is not as efficient as other options (i.e., it
is subject to large standard error). The trimmed mean
and so-called M-estimators (Hampel et al. 1986) are
both resistant and robust.
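The effect of a single gross outlier on resistant versus nonresistant statistics can be illustrated briefly (hypothetical data; the quartiles here use simple rank positions):

```python
# Illustration of resistance: corrupting one observation moves the
# mean and standard deviation substantially, while the median and a
# simple rank-based interquartile range barely move. Data hypothetical.
from statistics import mean, median, stdev

clean = [10, 11, 12, 13, 14, 15, 16, 17, 18]
dirty = clean[:-1] + [180]  # replace the top value with a gross outlier

for label, xs in (("clean", clean), ("dirty", dirty)):
    xs = sorted(xs)
    q1, q3 = xs[len(xs) // 4], xs[3 * len(xs) // 4]
    print(label, round(mean(xs), 1), round(stdev(xs), 1), median(xs), q3 - q1)
```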
    The most commonly used measure of disper-
sion, the standard deviation (or variance), is nonresis-
tant  (highly sensitive to  outliers) and  not robust
because squaring the deviation emphasizes  deviant
data  points.  The absolute  deviation and the
interquartile range are more resistant but not highly efficient.
    Resistance and robustness provide a measure of
insurance against features of the sample data that may
yield a summary estimate that is not representative of
the data set as a whole. A robust and resistant estima-
tor is not the best choice if, for example, there are no
outliers and the  sample is an exact normal density
function. However, if outliers do occur, and samples
are not normal (or lognormal), then robust and resis-
tant estimators of center and dispersion are wise and
safe choices that will help investigators avoid faulty conclusions.

Graphic Analyses
It is good practice in statistical analysis to begin with
various displays  of the raw data. That is, before de-
scriptive statistics are calculated from a data set, and
before analyses such as hypothesis testing and linear
(regression) model building occur, it is wise to look at
empirical graphs. The graphs recommended for this
task help the investigator identify the need to trans-
form the data before conducting the statistical analysis.
    Most procedures in statistics (e.g., regression
analysis, hypothesis testing) derive summary values
(e.g., mean, trimmed mean) from a data set. If the in-
ferences drawn from statistical procedures are to be
valid for the entire data set, then the summary statis-
tics must represent  the entire set. Graphic displays
guide the choice of  any necessary manipulations of
the data set and help assure the selection of appropri-
ate summary statistics. The examples presented here
underscore the importance of displaying the data at
the beginning of a statistical study.
    Graphs can also be useful during the course of a
statistical study.  For example, bivariate scatter plots
help scientists select independent variables for a re-
gression equation,  and scientists will often wisely
choose to present the results of a statistical analysis in
graphic form. Conclusions are often most effectively
conveyed through graphs.

Histograms
Perhaps the most fundamental level of study is an
analysis of data on a single characteristic. Assume, for
example, that an aquatic biologist has a  data set for
species richness from a stream study and now desires
to summarize this information. The biologist could
calculate the trimmed mean and median absolute de-
viation of the sample; alternatively, she could calcu-
late other statistics representing central tendency and
dispersion. To determine which of these statistics are
most useful, the biologist should first look at a plot of
the data. The histogram is often used to display data
representing a single characteristic (such as IBI).
    For example, suppose that the index of biotic in-
tegrity in Table A.1 has just been determined for a par-
ticular stream from headwaters to mouth, and the
Table A.1.—IBI data for a particular stream.

Figure A.1.—Histogram of IBI data for a particular stream.
    biologist wants to picture the
biotic integrity of this stream. As a
first cut, the histogram in Figure A.1 is plotted. To
construct the histogram, the biologist must first di-
vide the range into equal intervals. In Figure A.1, the
range is approximated by 10 to 60 (actually it is 12 to
58) and is divided into intervals of 5 units. For each in-
terval, 11 to 15, 16 to 20, and so on, the height of the
bar represents the number of data  points that lie
within that interval. So there are four IBI data points
that lie within 31  to 35 and eight within 21 to 25.
Thus, the bar for the 21 to 25 interval is twice the
height of the 31 to 35 bar.
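The interval-counting rule behind the histogram can be sketched as follows; the IBI values below are hypothetical stand-ins, not the Table A.1 data:

```python
# The interval-counting rule behind Figure A.1's histogram: each bar
# counts the observations falling in one 5-unit interval (11-15,
# 16-20, ...). The IBI values are hypothetical stand-ins.
from collections import Counter

ibi = [12, 18, 22, 22, 23, 24, 25, 27, 31, 33, 34, 35, 42, 58]
counts = Counter((v - 11) // 5 for v in ibi)  # bin 0 = 11-15, bin 1 = 16-20, ...

for b in sorted(counts):
    lo = 11 + 5 * b
    print(f"{lo}-{lo + 4}: {'*' * counts[b]}")
```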
    What  does the histogram tell us about this
stream? Basically, it provides us with a visual image of
the distribution  of the sample. In specific terms, it
means that we can quickly see such things as the loca-
tion of the center of the sample, amount of dispersion,
extent of symmetry, and the existence of outliers in
the sample. Outliers need not be errors or aberrations;
they are simply "set apart" from the bulk of the obser-
vations. The reasons why they are set apart may be of
particular interest in some studies.
    In Figure A.1, the center may be visually associ-
ated with the highest bar (mode) at 21 to 25, or it may
be identified as a middle value (median) around 30.
Dispersion could perhaps be characterized by stating
that the range is 12 to 58, and almost 60 percent (actu-
ally 15 of 26) of the data points lie between 20 and 35.
The histogram is not symmetric, however, and one
might want to check on the validity of the two outly-
ing observations on the extreme right.
                                                 The picture created by the histogram is of con-
                                            siderable value in the selection of descriptive statis-
                                            tics. Some care  should be observed  in the
                                            construction of the  histogram, however.  With
                                            changes in interval size, the histogram can assume
                                            different shapes that may affect the inferences. For ex-
ample, the IBI data in Figure A.2 are plotted using an
                                            interval size of 10 units. On that scale, the two highest
                                            data points no longer appear as outliers. In contrast,
the two-unit intervals in Figure A.3 give the impres-
                                            sion of possible outliers on both the right and left ex-
                                            tremes of center. It is probably good practice to scale
                                            the histogram so that the observations are neither too
                                            aggregated (as in Figure A.2) nor too spread out to per-
                                            mit reasonable inferences to be drawn.
                                                 Thus, the histogram provides an impression of
                                            the extent of symmetry in the sample. Symmetry in a
                                            data set is a desirable attribute for two  reasons. First, it
                                            often means that one can characterize the sample as
                                            having  a distribution with a shape similar to one of
                                            the symmetric distributions (e.g., the normal distri-
                                            bution), which is often assumed to be an underlying
                                            model in statistical inference. Stating, for example,
                                            that a sample approximates the normal distribution
                                            conveys useful information. Beyond  that, symmetry
                                            implies that common descriptive statistics are  clear:
                                            central tendency refers to the center of symmetry, and
                                            dispersion characterizes variability without skew.
                                                 Therefore, it may be  useful to apply a transfor-
                                            mation, if necessary, to create symmetry in an asym-
                                            metric  data set. Continuous concentration and

Figure A.2.—Histogram of IBI data with 10-unit intervals.

Figure A.4.—Histogram for log(IBI).

Figure A.5.—Histogram for log(IBI): Alternative scale.
line, the "leaves" are written. For each data point, the
leaf is the next digit lower in place value than the stem's
digit. Since the stems in Figure A.6 are composed of
the tens digit, the leaves are made up of the units dig-
its. Each observation contributes one leaf to the row
containing its stem. For the IBI data points  in Table
A.I, the first observation (12) results in a 2 (the units
digit) placed in the row for the first tens stem (cover-

Figure A.7.—Box and whisker plots. (The example plot is labeled Lake A; annotations mark the minimum, 25% value, median, 75% value, and maximum; the notch indicates statistical significance at the 0.05 level for the median.)


Figure A.8.—Stream IBI box plots (1979, 1989, and 1990).
Box and Whisker Plots
For comparing samples, a useful display is perhaps
one that provides both pictorial and statistical
comparison. One such model is the box and whisker
plot, which is available in many statistical software
packages for the microcomputer.
    Figure A.7 shows the basic struc-
ture of the box plot. For clarification,
note that the "statistical significance of
the median" on Figure A.7 refers to the
degree of vertical overlap of the notch or
indention in one box with the notch in
another box. If the notches do not over-
lap vertically, then the medians may be
considered significantly different at  ap-
proximately the 0.05 level.
    Box plots are based on order statis-
tics which, like the median, are calcu-
lated by ranking the observations from
lowest to highest. Box plots can be used
to convey information  on the  sample
median; dispersion, as conveyed by the
range and the interquartile range; skew,
as conveyed by the symmetry  in the
shape above and below the median; rel-
ative size of the data set, as conveyed by
the width of the box; and statistical sig-
nificance  of the median.
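The order statistics a box plot conveys can be collected into a five-number summary (hypothetical sample; quartiles by simple rank positions, one of several common conventions):

```python
# The five numbers a box and whisker plot displays, computed from a
# hypothetical sample (quartiles use simple rank positions).
from statistics import median

xs = sorted([28, 31, 33, 35, 36, 38, 40, 43, 47])
n = len(xs)
summary = {
    "minimum": xs[0],
    "25% value": xs[n // 4],
    "median": median(xs),
    "75% value": xs[3 * n // 4],
    "maximum": xs[-1],
}
print(summary)
```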
    Figure A.8 shows three sample box
plots for stream IBI data for 1979, 1989,
and 1990. The box and whisker plots in
Figure A.8 provide  a substantial
amount of information on IBI during
the years  of sampling. First, it is appar-
ent that IBI has increased since 1979, as
there is  little vertical  overlap  of  the
1979 box plot with the other two. This
conclusion is further supported by  the
lack of vertical  overlap in  the 1979
notch with the other two notches. In
contrast,  while the medians for 1989
and 1990 differ,  they are not  signifi-
cantly different  (0.05  level) and  the
samples  (boxes) overlap considerably.
None of the years  exhibit substantial
skew in the sample data. The 1989 data
are skewed the most, based on the rela-
tive symmetry of the box and whiskers
around the median.
    Box plots are helpful as diagnostic
tools and as a method of demonstrating
conclusions about  samples  following
the completion  of  a statistical study.
Tukey (1977) and Reckhow (1979)  de-
scribe several interesting applications.
Bivariate Scatter Plots
    Many statistics (e.g., correlation coefficients)
and many statistical methods (e.g., regression analy-
sis) are fundamentally concerned with relationships
between pairs of variables. Without doubt, the best
way to examine a relationship between pairs of vari-
ables (a bivariate relationship) is through a scatter
plot.

    Figure A.9.—IBI bivariate plot for 1989 and 1990 data.

    In Figure A.9, a bivariate scatter plot is presented
for the 1989 and 1990 IBI data for a particular stream.
From the plot, we can examine the distribution of data
for each variable separately and for the two variables
together. For example, we can see from Figure A.9 that
two relatively high observations tend to stand apart
from the rest of the data, particularly in the horizontal
direction. As might be expected, there is an approxi-
mately linear correlation between the IBI estimates
for successive years.
    Two characteristics of a bivariate sample are of-
ten of interest in statistical studies. First, the biologist
may be interested in the pattern or shape (e.g., linear-
ity or nonlinearity) of a relationship. Linear relation-
ships are often desirable for ease of analysis;
correlation analysis and ordinary least squares (OLS)
regression provide measures of the strength of a linear
relationship. If the bivariate relationship is nonlinear,
it is possible that a transformation can be applied to
make it linear, or a nonlinear model may be used.
Without question, the scatter plot is the most impor-
tant diagnostic device for evaluating linearity, and it
is often quite helpful in selecting a transformation.
    A second topic of interest for bivariate samples is
the presence or absence of outliers. Outliers have no
universally accepted objective definition; rather, the
term is used here to identify observations that stand
apart from a cluster of points. We are concerned about
outliers because they are apt to have excessive influ-
ence on nonresistant statistics like the mean, variance,
sample correlation coefficient, and OLS regression co-
efficients. Bivariate plots are valuable for outlier iden-
tification and may suggest approaches (e.g.,
transformation) for correction. In Figure A.9, the two
highest values probably would not be considered out-
liers, since they are compatible with the pattern exhib-
ited in the rest of the data and not substantially
separated from those points.
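The two numerical companions of the scatter plot, the sample (Pearson) correlation and the OLS slope, can be sketched on hypothetical paired IBI values:

```python
# Sample (Pearson) correlation and OLS slope for paired observations
# (the paired IBI values are hypothetical).
import math

x = [20.0, 25.0, 30.0, 35.0, 40.0]  # e.g., 1989 IBI at five sites
y = [22.0, 28.0, 31.0, 38.0, 41.0]  # 1990 IBI at the same sites

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))  # 240.0
sxx = sum((a - mx) ** 2 for a in x)                   # 250.0
syy = sum((b - my) ** 2 for b in y)                   # 234.0

r = sxy / math.sqrt(sxx * syy)  # correlation, about 0.992
slope = sxy / sxx               # OLS slope of y on x, 0.96
print(round(r, 3), slope)
```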

Andrews, D.F., P.J. Bickel, F.R. Hampel, P.J. Huber, W.H. Rog-
   ers, and J.W. Tukey. 1972. Robust Estimates of Location.
   Princeton University Press, Princeton, NJ.
Barnett, V., and T. Lewis. 1984. Outliers in Statistical Data.
   2nd edition. John Wiley and Sons, Chichester, UK.
Blalock, H.M. Jr. 1972. Social Statistics. McGraw-Hill, New
   York.
Box, G.E.P., W.G. Hunter, and J.S. Hunter. 1978. Statistics for
   Experimenters: An Introduction to Design, Data Analy-
   sis, and Model Building. John Wiley and Sons, New
   York.
Cochran, W.G. 1963. Sampling Techniques. John Wiley and
   Sons, New York.
Cochran, W.G., and G. M. Cox. 1957. Experimental Designs.
   John Wiley and Sons, New York.
Conover, W.J. 1980. Practical Nonparametric Statistics. 2nd
   edition. John Wiley and Sons, New York.
Dixon, W.J., and J.W. Tukey. 1968. Approximate behavior of
   the distribution of Winsorized t (trimming/Winsoriza-
   tion 2). Technometrics 10:83-98.
Flury, B., and H. Riedwyl. 1988. Multivariate Statistics, a
   Practical Approach. Chapman and Hall,  London.
Gilbert, R.O. 1987. Statistical Methods for Environmental
   Pollution Monitoring. Van Nostrand Reinhold, New
   York.
Green, R.H. 1979. Sampling Design and Statistical Methods
   for Environmental Biologists. John Wiley and Sons,
   New York.
Hahn, G.J., and W.Q. Meeker. 1991. Statistical Intervals.
   John Wiley and Sons, New York.
Hampel, F.R., E.M.  Ronchetti, P.J. Rousseeuw, and W.A.
   Stahel. 1986. Robust Statistics: The Approach Based on
   Influence Functions. John Wiley and Sons, New York.
Hill, M.A., and W.J. Dixon. 1982. Robustness in real life: a
   study of clinical laboratory data. Biometrics 38:377-96.
Hollander, M., and D.A. Wolfe. 1973. Nonparametric Statis-
   tical Methods. John Wiley and Sons, New York.
Hunsaker, C.T., and D.E. Carpenter, eds. 1990. Environmen-
   tal Monitoring and Assessment Program: Ecological In-
   dicators. EPA/600/3-90/060. Off. Res. Dev., U.S. Environ.
   Prot. Agency, Washington, DC.
Horn, P.S., P.W. Britton, and D.E. Lewis. 1988. On the predic-
   tion of a single future observation from a possibly noisy
   sample. The Statistician 37:165-72.
Huber, P.J. 1981. Robust Statistics. John Wiley and Sons,
   New York.
Hurlbert, S.H. 1984. Pseudoreplication and the design of
   ecological field experiments. Ecolog. Monogr. 54:187-211.
Iglewicz, B. 1983. Robust scale estimators and confidence
   intervals for location. Pages 404-31 in D.C. Hoaglin, F.
   Mosteller, and J.W. Tukey, eds., Understanding Robust
   and Exploratory Data Analysis. John Wiley and Sons,
   New York.
Kmenta, J.  1986.  Elements  of Econometrics.  2d ed.
   Macmillan, New York.
Linthurst, R.A., et al. 1986. Population Descriptions and
   Physico-Chemical Relationships. Vol 1 of Characteris-
   tics of  Lakes  in the  Eastern United  States.
   EPA/600/4-86/007a. U.S. Environ. Prot. Agency, Wash-
   ington, DC.
Meyer, M. A., and J.M. Booker.  1990. Eliciting and Ana-
   lyzing Expert Judgement: A Practical Guide. Academic
   Press, London.
Miller, R.G. Jr. 1986. Beyond ANOVA: Basics of Applied Sta-
   tistics. John Wiley and Sons, New York.
Morgan, M.G.,  and M. Henrion.  1990.  Uncertainty. Cam-
   bridge University Press, UK.
Mosteller, F., and J.W. Tukey. 1977. Data Analysis and Re-
   gression: A Second Course in Statistics. Addison-Wesley,
   Reading, MA.
Ohio Environmental Protection Agency. 1988. The Role of
   Biological Data in Water Quality Assessment. Vol. 1 of
   Biological Criteria for the Protection of Aquatic Life. Div.
   Water Qual. Monitor. Assess., Columbus, OH.
Raiffa, H. 1968. Decision Analysis. Addison-Wesley, Read-
   ing, MA.
Rankin, E.T., and C.O. Yoder. 1990. The nature of sampling
   variability in the index of biotic integrity (IBI) in Ohio
   streams. EPA-905-9-90/005. Pages 9-18 in Proc. 1990
   Midw. Pollut. Meet., Chicago, IL.
Reckhow, K.H. 1979. Techniques for exploring and present-
   ing data applied to lake phosphorus concentration. Can.
   J. Fish. Aquat. Sci. 37(2):290-94.
	. 1984. Decision theory applied to lake management.
   Pages 196-200 in Proc. Fourth Ann. Conf. N. Am. Lake
   Manage. Soc.
Reckhow, K.H., and S.C. Chapra. 1983. Data Analysis and
   Empirical Modeling. Vol 1  of Engineering Approaches
   for Lake Management. Butterworth Pubs., Boston, MA.
Reckhow, K.H., and C. Stow. 1990. Monitoring design and
   data analysis for trend detection. Lake Reserv. Manage.
Reckhow, K.H., K. Kepford, and W. Warren-Hicks. 1993. Sta-
   tistical Methods for the Analysis of Lake Water Quality
   Trends. EPA 841-R-93-003. U.S. Environ. Prot. Agency,
   Washington, DC.
Rey, W.J.J. 1983. Introduction to Robust and Quasi-Robust
   Statistical Methods. Springer-Verlag, Berlin.
Rocke, D.M.  1983.  Robust statistical analysis  of
   interlaboratory studies. Biometrika 70:421-31.
Rocke, D.M., G.W. Downs, and A.J. Rocke.  1982. Are robust
   estimators  really necessary? Technometrics
Snedecor, G.W., and W.G. Cochran. 1967.  Statistical
   Methods. 6th ed. Iowa State University Press, Ames.
Staudte, R.G., and S.J. Sheather. 1990. Robust Estimation
   and Testing. John Wiley and Sons, New York.

Stevens, D. 1989. Field sampling design. In W. Warren-
   Hicks and B. Parkhurst, eds., Ecological Assessment
   of Hazardous Waste Sites: A Field and Laboratory Refer-
   ence. EPA/600/3-89/013. Environ.  Research Lab., U.S.
   Environ. Prot. Agency, Corvallis, OR.
Stigler, S.M. 1977. Do robust estimators  work with real
   data? Ann. Stat. 5(6):1055-98.
Tukey, J.W. 1960. A survey of sampling from contaminated
   distributions. Pages 448-85 in I. Olkin, ed., Contribu-
   tions to Probability and Statistics, Stanford University
   Press, Stanford, CA.
	. 1977. Exploratory Data Analysis. Addison-Wesley,
   Reading, MA.
Tukey, J.W., and D.M. McLaughlin. 1963. Less vulnerable
   confidence and significance procedures  for location
   based on a single sample:  trimming/Winsorization.
   Sankhya A 25:331-52.
U.S. Environmental  Protection Agency. 1990. Biological
   Criteria, National Program Guidance for Surface Waters.
   EPA-440/5-90-004. Off. Water Reg. Stand., Washington, DC.
U.S. Government Printing Office.  1988. The  Clean Water
   Act as amended by the Water Quality Act of 1987. Pub.
   L. 100-4, Washington, DC.
Warren-Hicks, W.J., and J. Messer. 1990. Using Biological
   Indices to Measure Ecological Condition in Regional Re-
   sources. Draft Rep. Prep. for Atmos. Res. Exposure As-
   sess. Lab., Research Triangle Park, NC.
Williams, B. 1978. A Sampler on Sampling. John Wiley and
   Sons, New York.
Wonnacott, T.H., and R.J. Wonnacott. 1977. Introductory
   Statistics. John Wiley and Sons, New York.
Yoder, C.O. 1991. Answering some concerns about biologi-
   cal criteria based on experiences in Ohio. Pages 95-104
   in G.H. Flock, ed., Water Quality Standards for the 21st
   Century. Proc. Off. Water, U.S. Environ. Prot. Agency,
   Washington, DC.
Yuen, K.K., and W.J. Dixon. 1973. The approximate behav-
   iour and performance of the two-sample trimmed t.
   Biometrika 60:369-74.