EPA
United States Environmental Protection Agency
Office of Water (4904)
EPA-822-B-97-002

Biological Criteria: Technical Guidance for Survey Design and Statistical Evaluation of Biosurvey Data

-------

BIOLOGICAL CRITERIA
Technical Guidance for Survey Design and Statistical Evaluation of Biosurvey Data

Prepared for EPA by Tetra Tech, Inc.

Principal authors: Kenneth H. Reckhow, Ph.D., and William Warren-Hicks, Ph.D.

George Gibson, Jr., Ph.D., Office of Science and Technology, Project Leader
Health and Ecological Criteria Division
Office of Water
U.S. Environmental Protection Agency
Washington, D.C. 20460

December 1997

-------

Acknowledgements

This document was developed by the United States Environmental Protection Agency, Office of Science and Technology, Health and Ecological Criteria Division. This text was written by Kenneth H. Reckhow, Ph.D., and William Warren-Hicks, Ph.D. Jeroen Gerritsen, Ph.D., of Tetra Tech, Inc. provided editorial and technical support. George R. Gibson, Jr., Ph.D., of USEPA was Project Leader and co-editor.

-------

Disclaimer

This manual provides technical guidance to States, Indian Tribes, and other users of biological criteria to assist with survey design and statistical evaluation of biosurvey data. While this manual constitutes EPA's scientific recommendations regarding survey designs and statistical analyses, it does not substitute for the CWA or EPA's regulations; nor is it a regulation itself. Thus, it cannot impose legally binding requirements on the EPA, States, Indian Tribes, or the regulated community, and might not apply to a particular situation or circumstance. EPA may change this guidance in the future.

-------

CONTENTS

Foreword  vii

CHAPTER 1. The Biological Criteria Program and Guidance Documents  1
    The Concept of Biological Integrity  1
    Narrative and Numeric Biological Criteria  1
    Biological Criteria and Water Resource Management  2
    An Overview of this Document  2

CHAPTER 2.
Classical Statistical Inference and Uncertainty  3
    Formulating the Problem Statement  3
    Basic Statistics and Statistical Concepts  3
        Descriptive Statistics  3
        Recommendations  4
    Uncertainty  6
    Statistical Inference  7
        Interval Estimation  7
        Hypothesis Testing  7
        Common Assumptions  8
        Parametric Methods: the t Test  9
        Nonparametric Tests: the W Test  10
        Example: an IBI Case Study  11
    Conclusions  12

CHAPTER 3. Designing the Sample Survey  15
    Critical Aspects of Survey Design  15
        Variability  15
        Representativeness and Sampling Techniques  15
        Cause and Effect  16
        Controls  16
    Key Elements  17
        Pilot Studies  17
        Location of Sampling Points  18
        Location of Control Sites  19
        Estimation of Sample Size  19
    Important Rules  20

CHAPTER 4. Detecting Mean Differences  21
    Cases Involving Two Means  21
        Random sampling model, external value for σ  21
        Random sampling model, internal value for σ  22
        Testing against a Numeric Criterion  22
        A Distribution-Free Test  23
        Evaluating Two-Sample Means Testing  23
    Multiple Sample Case  23
        Parametric or Analysis of Variance Methods  23
        Nonparametric or Distribution-Free Procedures  25
    Testing for Broad Alternatives  25
        The Kolmogorov-Smirnov Two-Sample Test  26
    Relationship of Survey Design to Analysis Techniques  27

CHAPTER 5. Discussion and Examples  29
    Working with Small Sample Sizes  29
        Assessments Involving Several Indicators  30
        Regional Reference Data  31
        Using Background Variability Measures  32
        Final Suggestions for Small Sample Sizes  32
    Decision Analysis and Uncertainty  33

APPENDIX A. Basic Statistics and Statistical Concepts  35
    Measures of Central Tendency  35
        Mean  35
        Median  35
        Trimmed Mean  35
        Mode  36
        Geometric Mean  36
    Measures of Dispersion  36
        Standard Deviation  36
        Absolute Deviation  36
        Interquartile Range  36
        Range  37
    Resistance and Robustness  37
    Graphic Analyses  37
        Histograms  37
        Stem and Leaf Displays  39
        Box and Whisker Plots  40
        Bivariate Scatter Plots  41

References  43

-------

LIST OF TABLES

TABLE                                                                 PAGE
2.1. Measures of Central Tendency  4
2.2.
Measures of Dispersion  5
2.3. Useful Graphical Techniques  5
2.4. Possible Outcomes from Hypothesis Testing  7
3.1. Number of samples needed to estimate the true mean (low extreme)  19
3.2. Number of samples needed to estimate the true mean (high extreme)  20
4.1. Descriptive Statistics: Upstream-Downstream Example  21
4.2. Assumptions, Advantages, and Disadvantages Associated with Various Two-Sample Means Testing Procedures  24
4.3. Analysis of Variance Results for the Case Study Model  25
4.4. LSD Multiple Comparison Test  25
4.5. Duncan's Multiple Comparison Test  25
4.6. Tukey's Multiple Comparison Test  25
4.7. Survey Design and Analysis Techniques  27
5.1. Biological Indices and Biocriteria  30
A.1. IBI Data  38

LIST OF FIGURES

FIGURE                                                                PAGE
2.1. Sampling Distributions under Different Hypotheses  13
3.1. Random Sample Design having both Temporal and Spatial Dimensions  17
4.1. Cumulative Distribution Functions of Upstream and Downstream Sites  26
5.1. IBI Distributions for Reference and Impacted Sites  33
A.1. IBI Histogram  38
A.2. IBI Histogram with Ten-Unit Interval Size  39
A.3. IBI Histogram with Two-Unit Interval Size  39
A.4. Histogram for Log(IBI)  40
A.5. Histogram for Log(IBI): Alternative Scale  40
A.6. Stem and Leaf Display  40
A.7. Box and Whisker Plots  41
A.8. Stream IBI Box Plot  41
A.9. IBI Bivariate Plot  42

-------

FOREWORD

Biological Criteria: Technical Guidance for Survey Design and Statistical Evaluation of Biosurvey Data, by Kenneth H. Reckhow and William Warren-Hicks, was prepared for the U.S. Environmental Protection Agency to help states develop their biological criteria for surface waters and specifically to help water resource managers assess the reliability of their data. A good biological criteria program will be practical and cost effective, but above all it will be predicated on valid and scientifically sound information.

The application of the concepts and methods of statistics to the biological criteria process enables us ". . .
to describe variability, to plan research so as to take variability into account, and to analyze data so as to extract the maximum information and also to quantify the reliability of that information" (Samuels, 1989).

This initial guidance document is intended to reintroduce statistics to the natural resources manager who may not be current in the application of this tool (and our ranks are legion, we just don't like to admit it). The emphasis is on the practical application of basic statistical concepts to the development of biological criteria for surface water resource protection, restoration, and management. Subsequent guides will be developed to expand on and refine the ideas presented here.

Address comments on this document and suggestions for future editions to George Gibson, U.S. Environmental Protection Agency, Office of Water, Office of Science and Technology (4304), 401 M Street, S.W., Washington, D.C. 20460.

-------

CHAPTER 1
The Biological Criteria Program and Guidance Documents

Efforts to measure and manage water quality in the United States are an evolving process. Since its simple beginning more than 200 years ago, water monitoring has progressed from observations of the physical impacts of sediments and flotsam, to chemical analyses of the multiple constituents of surface water, to the relatively recent incorporation of biological observations in systematic evaluations of the resource. Further, although biological measurements of the aquatic system have been well-established procedures since the Saprobic system was documented at the turn of this century, such information has only recently been incorporated into the nation's approach to water resource evaluation, management, and protection.

The U.S. Environmental Protection Agency (EPA) is charged in the Clean Water Act (Pub. L. 100-4, §101) "to restore and maintain the chemical, physical, and biological integrity of the Nation's waters."
To incorporate biological integrity into its monitoring program, the Agency established the Biological Criteria Program in the Office of Water.

This program provides technical guidance to the states for measuring biological integrity as an aspect of water resource quality. Biological integrity complements the physical and chemical factors already used to measure and protect the nation's surface water resources. Eventually all surface water types will be included in program technical guidance, including streams, rivers, lakes and reservoirs, wetlands, estuaries, and near coastal marine waters.

States will use this information to establish biological criteria or benchmarks of resource quality against which they may assess the status of their waters, the relative success of their management efforts, and the extent of their attainment or noncompliance with regulatory conditions or water use permits. These criteria are intended to augment, not replace, other physical and chemical methods, to help refine and enhance our water protection efforts.

The Concept of Biological Integrity

Biological integrity is the condition of the aquatic community inhabiting unimpaired waterbodies of a specified habitat as measured by community structure and function (U.S. Environ. Prot. Agency, 1990). Essentially, the concept refers to the naturally dynamic and diverse population of indigenous organisms that would have evolved in a particular area if it had not been affected by human activities. Such integrity or naturally occurring diversity becomes the primary reference condition or source of biological criteria used to measure and protect all waterbodies in a particular region.

Only the careful and systematic measuring of key attributes of the natural aquatic ecosystem and its constituent biological communities can determine the condition of biological integrity. These key attributes or biological endpoints indicate the quality of the waters of concern.
They are established by biosurveys: analyses based on the sampling of fish, invertebrates, plants, and other flora and fauna. Such biosurveys establish the endpoints or measures used to summarize several community characteristics such as taxa richness, numbers of individuals, sensitive or insensitive species, observed pathologies, and the presence or absence of essential habitat elements.

The careful selection and derivation of these measures (hereafter, metrics), together with detailed habitat characterization, is essential to translate the concept of biological integrity into useful biological criteria. That is, the quantitative distillation of the survey data makes it possible to compare and contrast several waterbodies in an objective, systematic, and defensible manner.

Narrative and Numeric Biological Criteria

Two forms of biological criteria are used in EPA's system of water resources evaluation and management. Narrative biological criteria are general statements of attainable or attained conditions of biological integrity and water resource quality for a given use designation. They are qualitative statements of intent, promises formally adopted by the states to protect and restore the most natural forms of the system. Narrative criteria frequently include statements such as "the waters are to be free from pollutants of human origin in so far as achievable," or "to be restored and maintained in the most natural state." The statements must then be operationally defined and implemented by a designated state agency.

Numeric criteria are derived from and predicated on the same objectives as narrative criteria, which are then retained as preliminary statements of intent.
The difference between the two is that the qualitative statement of integrity, the condition to be protected or restored, is refined by the inclusion of quantitative (numeric) endpoints as specific components of the criteria. Compliance with numeric criteria involves meeting stipulated thresholds or quantitative measures of biological integrity.

The formal adoption of criteria of either type into state law (with EPA concurrence) makes the criteria "standards." They are then applicable and enforceable under the provisions of the Clean Water Act.

Biological Criteria and Water Resource Management

Because these criteria will become the basis for resource management and possible regulatory actions, the manner of their design is of utmost importance to the states and EPA. The choice of metrics to represent and measure biological integrity is the responsibility of ecologists, biologists, and water resource managers. The Agency's role is to continue to develop technical guidance documents and manuals to assist in this process.

The purpose of this document is to present methods that will help managers interpret and gauge the confidence with which the criteria can be used to make resource management decisions. Using this guidance, both the technician and the policymaker can objectively convert data into management information that will help protect water resources. However, the use and limits of the information must be clearly understood to ensure coordination and mutual cooperation between science and management.

An Overview of this Document

The focus of this document is on the basic statistical concepts that apply within the biocriteria program. From the program's inception, the problem statement, survey design, and the statistical methods used in the analysis must be correlated to provide functional results. Accordingly, chapter 2 begins with formulation of the problem statement, the focused objective that helps narrow the scope of observations in the ecosystem to those necessary to predict the status and impairment of the biota, and culminates in a discussion of hypothesis testing, the approach advocated in this guidance document. Chapter 2 also refers beginners to Appendix A for a succinct review of the basic statistics and statistical concepts used within the chapter and throughout this document.

Chapter 3 presents key issues associated with the design of the sample survey. Surveys are without doubt the critical element in an environmental assessment. Designs that minimize error, uncertainty, and variability in both biological and statistical measures have a great effect on decision makers. This chapter explores the difference between classical and experimental design and the issues involved with random, systematic, and stratified samples. Sample sizes and how to proceed in confusing circumstances round out the discussion.

Chapter 4 deals with problems that arise from hypothesis testing methods based on detecting the mean differences arising from two or more independent samples. The use and abuse of means testing procedures is an important topic. It should generally be keyed to the survey design, but other information should also be taken into consideration because errors of interpretation often involve assumptions about data.

Chapter 5 is a further discussion, with examples, of the basic concepts introduced in earlier chapters. Though hypothesis testing is generally preferred, this chapter discusses circumstances in which other procedures may be useful. It also introduces the role of cost-benefit assumptions in decision analysis and the limits of data collection and interpretation in the determination of causality. The reader should recall at all times the basic nature of this document.
Advanced practitioners may look to the references used in preparing this document for additional options and discussion.

-------

CHAPTER 2
Classical Statistical Inference and Uncertainty

Before the biological survey can be designed and linked to statistical methods of interpretation, an exact formulation of the problem is needed to narrow the scope of the study and focus investigators on collecting the data. The choice of biological and chemical variables should be made early in the process, and the survey design built around that selection. Fancy statistics and survey designs may be appropriate, but biologically defined objectives should dominate and use the statistics, not the reverse (Green, 1979).

Formulating the Problem Statement

A clear statement of the objective or problem is the necessary basis on which the biological survey is designed. A general question such as "does the effluent from the municipal treatment plant damage the environment?" does little to help decision makers. Consider, however, their response to a more specific statement: "Is the mean abundance of young-of-the-year green sunfish caught in seines above the discharge point greater (with an error rate of 5 percent) than those similarly trapped downstream of the discharge point?"
The precise nature of this question makes it a clear guide for the collection and interpretation of data. The problem statement should minimally include the biological variables that indicate environmental damage, a reference to the comparisons used to determine the impact, and a reference to the level of precision (or uncertainty) that the investigator needs to be confident that an impact has been determined. In the preceding example, green sunfish are the biological indicator of impact, upstream and downstream seine data are the basis of comparison, and an error rate of 5 percent provides an acceptable level of uncertainty.

The problem statement, the survey design, and the statistical methods used to interpret the data are closely linked. Here, the survey design is an upstream/downstream set of samples with the upstream data providing a reference for comparison. A t test or rank sum test may be used to test for mean differences between the sites.

From a statistical standpoint, the biological variables (measures) used to show damage should have low natural variability and respond sharply to an impact relative to any sampling variability. Natural variability contributes to the uncertainty associated with their response to an impact. Lower natural variability permits reliable inferences with smaller sample sizes.

Examining historical data is an excellent means of selecting biological criteria that are sensitive to environmental impacts. Species that exhibit large natural spatial and temporal variations may be suitable indicators of environmental change only in small time scales or localized areas. If so, the use of such variables will limit the investigator's ability to assess environmental change in long-term monitoring programs. Historical data, combined with good scientific judgment, can be used to select biological criteria that exhibit minimal natural variability within the context of the site under evaluation.
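The upstream/downstream comparison described above can be sketched in a short program. The following fragment computes a pooled two-sample t statistic for hypothetical seine-catch counts; the data, the function name, and the quoted critical value are illustrative assumptions, not values from this document:

```python
import math
from statistics import mean, stdev

# Hypothetical young-of-the-year green sunfish counts per seine haul
# (illustrative numbers only).
upstream = [14, 18, 11, 16, 15, 19, 13, 17]
downstream = [10, 12, 9, 14, 8, 11, 13, 10]

def pooled_t(x, y):
    """Two-sample t statistic using a pooled variance estimate."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(sp2 * (1 / nx + 1 / ny))

t = pooled_t(upstream, downstream)
# Compare t to the tabled critical value t(0.05, nx + ny - 2); for 14
# degrees of freedom and a one-sided 5 percent test, that is about 1.761.
print(round(t, 2))
```

If t exceeds the critical value, the null hypothesis of equal means is rejected at the stated error rate; the nonparametric rank sum alternative would replace the t statistic with ranks of the combined samples.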
Basic Statistics and Statistical Concepts

When a data set is quite small, the entire set can be reported. However, for larger data sets, the most effective learning takes place when investigators summarize the data in a few well-chosen statistics. The choice to trade some of the information available in the entire set for the convenience of a few descriptive statistics is usually a good one, provided that the descriptive statistics are carefully selected and correctly represent the original data.

Some descriptive statistics are so commonly used that we forget that they are but one option among many candidate statistics. For example, the mean and the standard deviation (or variance) are statistics used to estimate the center of a data set and the spread on those data. The scientist who uses these statistics has already decided that they are the best choices to describe the data. They work very well, for example, as representatives of symmetrically distributed data that follow an approximately normal distribution. Thus, their use in such circumstances is entirely justified. However, in other situations involving biological data, alternative descriptive statistics may be preferred.

Descriptive Statistics

Before selecting a descriptive statistic, the scientist must understand the purpose of the statistic. Descriptive statistics are often used in biological studies because the convenience of a few summary numbers outweighs the loss of information that results from not using the entire data set. Nevertheless, as much information as possible must be summarized in the descriptive statistics because the alternative may involve a misrepresentation of the original data.
The basic statistics and statistical techniques used in this chapter are further defined, described, and illustrated in the appendix to this document (Appendix A). Readers unfamiliar with descriptive statistics and graphic techniques should read Appendix A now and use it hereafter as a reference. Other readers may proceed directly to the tables in this chapter, which summarize the advantages and disadvantages of the statistical estimators and techniques described in the appendix.

The common measures of the center, or central tendency, of a data set are the mean, median, mode, geometric mean, and trimmed mean. None of these options is the best choice in all situations (see Table 2.1), yet each conveys useful information. The points raised in Table 2.1 are not comprehensive or absolute; they do, however, reflect the authors' experience with these estimators.

Environmental contaminant concentration data are strictly positive, and sample data sets typically exhibit asymmetry (i.e., a few relatively high observations). Therefore, a transformation, in particular the logarithmic transformation, should be applied to concentration and other data that exhibit these characteristics before analysis. When a transformation is used, data analysis and estimation occur within the transformed metric; if appropriate, the results may be converted back to the original metric for presentation.

A measure of dispersion (spread or variability) is another commonly reported descriptive statistic. Common estimators for dispersion are the standard deviation, absolute deviation, interquartile range, and range. These estimators are defined, described, and illustrated with examples in the appendix; Table 2.2 summarizes when and how they may be used.

Table 2.3 summarizes four of the most useful univariate and bivariate graphic techniques, including histograms, stem and leaf displays, box and whisker plots, and bivariate plots. These methods are also illustrated in Appendix A.
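The logarithmic transformation and back-transformation described above can be sketched as follows; the concentration values are hypothetical and chosen to show the effect of a single high observation:

```python
import math
from statistics import mean

# Hypothetical, right-skewed contaminant concentrations (µg/L); the
# final value plays the role of a few relatively high observations.
conc = [1.2, 1.8, 2.1, 2.4, 3.0, 3.6, 5.2, 18.0]

logs = [math.log(c) for c in conc]   # analyze within the transformed metric
geo_mean = math.exp(mean(logs))      # back-transform for presentation

print(round(mean(conc), 2))  # arithmetic mean, pulled upward by the high value
print(round(geo_mean, 2))    # geometric mean, a more representative center here
```

The geometric mean (the back-transformed mean of the logs) sits near the bulk of the observations, while the arithmetic mean is dragged toward the single high value.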
Recommendations

There is no rigorous theoretical or empirical support for using the normal distribution as a population model for chemical and biological measures of water quality or as a model for errors. Instead, the evidence supports using the lognormal model. However, uncertainty about the correctness of the lognormal model suggests that prudent investigators will recommend estimators that perform well even if an assumed model is wrong.

Table 2.1 Measures of central tendency.

Mean
  Advantages: Most widely known and used choice; easy to explain.
  Disadvantages: Not resistant to outliers; not as efficient¹ as some alternatives under deviations from normality.
  Consider for use when: Sample mean is required; distribution is known to be normal.
  Should not use when: Outliers may occur; distribution is not symmetric.

Median
  Advantages: Easy to explain; easy to determine; resistant to outliers.
  Disadvantages: Not as efficient as the mean under normality.
  Consider for use when: Distribution is symmetric; sample median is required; outliers may occur.

Mode
  Advantages: Easy to explain; easy to determine.
  Disadvantages: Not as efficient as the mean under normality.
  Consider for use when: Most frequently observed value is required; data are discrete or can be discretized.
  Should not use when: More efficient options are appropriate.

Geometric Mean
  Advantages: Appropriate for certain skewed (lognormal) distributions.
  Disadvantages: Not as easy to explain as the first three.
  Consider for use when: Distribution appears lognormal.

Trimmed Mean
  Advantages: Resistant to outliers.
  Disadvantages: Not as easy to explain as the first three.
  Consider for use when: Outliers may occur and estimator efficiency is desired.
  Should not use when: More widely known estimators are appropriate.

¹ Higher efficiency means lower standard error.

Table 2.2 Measures of dispersion.
Standard Deviation
  Advantages: Most widely known; routinely calculated by statistics packages.
  Disadvantages: Strongly influenced by outliers; not as efficient¹ as some alternatives under even slight deviations from normality.
  Consider for use when: Sample standard deviation is required; distribution is known to be normal.
  Should not use when: Outliers may occur; sample histogram is even slightly more dispersed than is a normal distribution.

Median Absolute Deviation
  Advantages: Resistant to outliers.
  Disadvantages: Not as efficient as the standard deviation under normality.
  Consider for use when: Outliers may occur.

Interquartile Range
  Advantages: Resistant to outliers; relatively easy to determine.
  Disadvantages: Not as efficient as the standard deviation under normality.
  Consider for use when: Outliers may occur.

Range
  Advantages: Easy to determine.
  Disadvantages: Not as efficient.
  Consider for use when: Range is required.
  Should not use when: Any of the above options is appropriate.

¹ Higher efficiency means lower standard error.

Table 2.3 Useful graphic techniques.

Histogram
  Features: Bar chart for data on a single (univariate) variable; shows shape of empirical distribution.
  Useful for: Visual identification of distribution shape, symmetry, center, dispersion, and outliers.

Stem and Leaf Display
  Features: Same as histogram; presents numeric values in the display.
  Useful for: Same as histogram.

Box and Whisker Plot
  Features: Display of order statistics (extremes, quartiles, and median); may be used to graph the same characteristic (e.g., variable) for several samples (e.g., different sampling sites).
  Useful for: Visual identification of distribution shape, symmetry, center, dispersion, and outliers (single sample); comparison of several samples for symmetry, center, and dispersion.

Bivariate Plot
  Features: Scatter plot of data points (variable x versus variable y).
  Useful for: Visual assessment of the strength of a linear relationship between two variables; evidence of patterns, nonlinearity, and bivariate outliers.

Many books and articles have been written recently concerning the theoretical and empirical evidence in favor of nonparametric methods and robust and resistant estimators.
Books that consider alternative estimators of center and dispersion (e.g., Huber, 1981; Hampel et al., 1986; Key, 1983; Barnett and Lewis, 1984; Miller, 1986; Staudte and Sheather, 1990) build a strong case for more robust estimators than the mean and variance. Indeed, there is good evidence (Tukey, 1960; Andrews et al., 1972) that the mean and variance may be the worst choices among the common estimators for error-contaminated data. Several articles that involve comparisons of estimators on real data (e.g., Stigler, 1977; Rocke et al., 1982; Hill and Dixon, 1982) also favor robust estimators over conventional alternatives.

As a consequence, the median and the trimmed mean are recommended for the routine calculation of a data set's central tendency. The interquartile range and the median absolute deviation are recommended for calculation of the dispersion. These suggestions represent a compromise between robustness, ease of explanation, and calculation simplicity. For the trimmed mean, recommended amounts of trimming range from 10 percent (Stigler, 1977) to over 20 percent (e.g., Rocke et al., 1982). A critical argument in support of the trimmed mean is that interval estimation and hypothesis testing are still possible using the t statistic (Tukey and McLaughlin, 1963; Dixon and Tukey, 1968; Gilbert, 1987).

Uncertainty

In statistics, uncertainty is a measure of confidence. That is, uncertainty provides a measure of precision; it assigns the value of scientific information in ecological studies. Scientific uncertainty is present in all studies concerning biological criteria, but uncertainty does not prevent management and decision making.
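The recommended resistant estimators can be sketched in a few lines. The IBI scores below are hypothetical, with one contaminated value (12); the 10 percent trimming level and the quartile interpolation convention are illustrative choices, not prescriptions from this document:

```python
from statistics import mean, median, stdev

# Hypothetical IBI scores containing one contaminated observation (12).
ibi = [32, 35, 38, 40, 41, 43, 44, 46, 47, 12]

def trimmed_mean(x, prop=0.1):
    """Mean after dropping the lowest and highest `prop` fraction of
    the sorted observations (10 percent trimming)."""
    s = sorted(x)
    k = int(len(s) * prop)
    return mean(s[k:len(s) - k])

def mad(x):
    """Median absolute deviation, a resistant measure of dispersion."""
    m = median(x)
    return median(abs(v - m) for v in x)

def iqr(x):
    """Interquartile range via linear interpolation between order
    statistics (one of several common quartile conventions)."""
    s = sorted(x)
    def quantile(q):
        pos = q * (len(s) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + (pos - lo) * (s[hi] - s[lo])
    return quantile(0.75) - quantile(0.25)

print(round(mean(ibi), 1), round(stdev(ibi), 1))  # both distorted by the outlier
print(median(ibi), trimmed_mean(ibi))             # resistant centers
print(mad(ibi), iqr(ibi))                         # resistant dispersion
```

The mean and standard deviation shift markedly because of the single contaminated value, while the median, trimmed mean, median absolute deviation, and interquartile range remain near the bulk of the data.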
Rather, uncertainty provides a basis for selecting among alternative actions and for deciding whether additional information is needed (and if so, what experimentation or observation should take place).

In ecological studies, scientific uncertainty results from inadequate scientific knowledge, natural variability, measurement error, and sampling error (e.g., the standard error of an estimator). In the actual analysis, uncertainty arises from erroneous specification of a model or from errors in statistics, parameters, initial conditions, inputs for the model, or expert judgment.

In some situations, uncertainty in an unknown quantity (e.g., a model parameter or a biological endpoint) may be estimated using a measure of variability. Likewise, in some situations, model error may be estimated using a measure of goodness-of-fit (predictions versus observations) of the model. In many situations, a judicious estimate of uncertainty is the only option; in these cases, careful estimation is an acceptable alternative, and methods exist to elicit these judgments from experts (Morgan and Henrion, 1990).

In many studies, uncertainty is present in more than one component (e.g., parameters and models), so the investigator must estimate the combined effects of the uncertainties on the endpoint. This exercise, called error propagation, is usually undertaken with Monte Carlo simulation or first-order error analysis.

The outcome of an uncertainty analysis is a probability distribution that reflects uncertainty on the endpoint. However, uncertainty analysis may not always be the most useful expression of risk. Other expressions of uncertainty, such as prediction intervals, confidence intervals, or odds ratios, are easier to understand and interpret. If important error terms are ignored when a probability statement is made, the investigator must report this omission. Otherwise, the probability statement is not representative, and the uncertainties are underestimated.
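Monte Carlo error propagation, mentioned above, can be sketched with a deliberately simple example. The linear model, the parameter distributions, and all numerical values below are hypothetical assumptions used only to show the mechanics:

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical model: predicted IBI = baseline - sensitivity * stressor_load.
# Each uncertain parameter is represented by a normal distribution whose
# mean and standard deviation are illustrative assumptions.
N = 10_000
load = 10.0
draws = []
for _ in range(N):
    baseline = random.gauss(45.0, 3.0)     # reference-site IBI
    sensitivity = random.gauss(0.8, 0.2)   # IBI points lost per unit of load
    draws.append(baseline - sensitivity * load)

draws.sort()
lo, hi = draws[int(0.05 * N)], draws[int(0.95 * N)]
print(round(mean(draws), 1))  # central estimate, near 45 - 0.8 * 10 = 37
print(round(lo, 1), round(hi, 1))  # an empirical 90 percent interval
```

The resulting empirical distribution of the endpoint combines both parameter uncertainties; reporting its central value together with an interval is exactly the kind of uncertainty statement the text recommends.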
Since uncertainty provides a measure of precision or value, it can be used by decision makers to guide management actions. For example, in some cases the uncertainty in a biological impact may be too large to justify management changes. As a consequence, managers may defer action until additional monitoring data can be gathered rather than require pollutant discharge controls. If the uncertainty is large and the estimated costs of additional pollutant controls quite high, it may be wise either to defer action or to look for smaller, relatively less expensive abatement strategies for an interim period while the monitoring program continues.

Though environmental planners at national, state, and local levels have rarely considered uncertainty in their planning efforts, their work has been generally successful over the past 20 years. It is, however, certainly possible that more effective management (that is, less costly, more beneficial management) might have occurred if uncertainty had been explicitly considered.

If overall uncertainty is ignored, the illusion prevails that scientific information is more precise than it actually is. As a consequence, we are surprised and disappointed when biological outcomes are substantially different from predictions. Moreover, if we don't calculate uncertainty, we have no rational basis for specifying the magnitude of our sampling program or the resources (money, time, personnel) that should be allocated to planning. Thus, decisions on planning and analysis are more likely based on convention and whim than on the logical objective of reducing scientific uncertainty.

Statistical analysis is largely concerned with uncertainty and variability. Therefore, uncertainty is an important concept in this guidance manual. The analyses presented here and in subsequent chapters are based on particular measures of uncertainty, for example, confidence intervals.
These measures are "statistics"; they reflect data, and are not always considered in the broader context of uncertainty, that is, as establishing the uncertainty in a quantity of interest. We will, however, consider these statistics in the broader sense, with concern for the theoretical issues raised in this section. Particularly given the small samples that often occur with biocriteria assessments, investigators should ask the following questions:

- Do the data adequately represent uncertainty?
- Are all important sources of uncertainty represented in the data?
- Should expert scientific judgment be used to augment or correct measures of uncertainty?
- If components of uncertainty are ignored because they are not included in the data, are conclusions or decisions affected?

Biological Criteria: Technical Guidance for Survey Design and Statistical Evaluation of Biosurvey Data

CHAPTER 2. Classical Statistical Inference and Uncertainty

Statistical analysis is not a rote exercise devoid of judgment.

Statistical Inference

Statistical inference is gained by two primary approaches: (1) interval estimation, and (2) hypothesis testing. Interval estimation concerns the calculation of a confidence interval or prediction interval that bounds the range of likely values for a quantity of interest. The end product is typically the estimated quantity (e.g., a mean value) plus or minus the upper and lower interval. The same information is used in hypothesis testing; however, in hypothesis testing, the end product is a decision concerning the truth of a candidate hypothesis about the magnitude of the quantity of interest.

In a particular problem, the choice between using interval estimation or hypothesis testing generally depends on the question or issue at hand. For example, if a summary of scientific evidence is requested, confidence intervals are apt to be favored; however, if a choice or decision is to be made, hypothesis tests are likely to be preferred.
Interval Estimation

Statistical intervals, whether confidence or prediction, may be based on an assumed probability model describing the statistic of interest, or they may require no assumption of a particular underlying probability model. Hahn and Meeker (1991) note that the proper choice of statistical interval depends on the problem or issue of concern. As a rule, if the interval is intended to bound a population parameter (e.g., the true mean), then the appropriate choice is the confidence interval. If, however, the interval is to bound a future member of the population (e.g., a forecasted value), then the appropriate choice is the prediction interval. Another statistical interval less frequently used in ecology is the tolerance interval, which bounds a specified proportion of observations.

In conventional (classical, or frequentist) statistical inference, the statistical interval has a particular interpretation that is often incorrectly stated in scientific studies. For example, if a 95 percent statistical interval for the mean is 7 ± 2, it is not correct to say that there is a 95 percent chance that the true mean lies between 5 and 9. Rather, it is correct to say that 95 percent of the time this interval is calculated, the true mean will lie within the computed interval. Although it sounds awkward and not directly relevant to the issue at hand, this interpretation is the correct meaning of a classical statistical interval. In truth, once it is calculated, the interval either does or does not contain the true value. In classical statistics, the inference from interval estimation refers to the procedure for interval calculation, not to the particular interval that is calculated.

Hypothesis Testing

Biosurveys are used for many purposes, one of which is to assess impact or effect. Resource managers may want to assess, for example, the influence of a pollutant discharge or land use change on a particular area.
The effect of the impact can be determined based on the study of trends over time or by comparing upstream and downstream conditions. In some instances, the interest is in magnitude of effect, but concern often focuses simply on the presence or absence of an effect of a specific magnitude. In such cases, hypothesis testing is usually the statistical procedure of choice.

In conventional statistical analysis, hypothesis testing for a trend or effect is often based on a point null hypothesis. Typically, the point null hypothesis is that no trend or effect exists. The position is presented as a "straw man" (Wonnacott and Wonnacott, 1977) that the scientist expects to reject on the basis of evidence. To test this hypothesis, the investigator collects data to provide a sample estimate of the effect (e.g., change in biotic integrity at a single site over time). The data are used to provide a sample estimate of a test statistic, and a table for the test statistic is consulted to estimate how unusual the observed value of the test statistic is if the null hypothesis is true. If the observed value of the test statistic is unusual, the null hypothesis is rejected.

In a typical application of parametric hypothesis testing, a hypothesis, H0, called the null hypothesis, is proposed and then evaluated using a standard statistical procedure like the t test. Competing with this null hypothesis for acceptance is the alternative hypothesis, H1.

Table 2.4. Possible outcomes from hypothesis testing.

                                     STATE OF THE WORLD
  DECISION       H0 is True                         H0 is False (Ha is True)
  ---------      --------------------------------   --------------------------
  ACCEPT H0      Correct decision.                  Type II error.
                 Probability = 1 - α;               Probability = β.
                 corresponds to the
                 confidence level.
  REJECT H0      Type I error.                      Correct decision.
                 Probability = α; also called       Probability = 1 - β;
                 the significance level.            also called power.
Under this simple scheme, there are four possible outcomes of the testing procedure: the hypothesis is either true or false, and the test can lead to its acceptance or rejection (see Table 2.4).

The point null hypothesis is a precise hypothesis that may be symbolically expressed as

    H0: θ = θ0

where θ is a parameter of interest. An example of a point null hypothesis in words is, "no change occurs in mean IBI after the new wastewater treatment plant goes on line." Symbolically, it is expressed as

    H0: μ1 - μ2 = 0
    Ha: μ1 - μ2 ≠ 0

where μ1 is the "before" true mean and μ2 is the "after" true mean. The test of the null hypothesis proceeds with the calculation of the sample means, x̄1 and x̄2. In most cases, the sample means will differ as a consequence of natural variability or measurement error or both, so a decision must be made concerning how large this difference must be before it is considered too large to result from variability or error. In classical statistics, this decision is often based on standard practice (e.g., a Type I error of 0.05 is acceptable), or on informal consideration of the consequences of an incorrect conclusion.

The result of a hypothesis test can be a conclusion or a decision concerning the rejected hypothesis. Alternatively, the result can be expressed as a "p-value," which quantifies the strength of the data evidence in favor of the null hypothesis. The p-value is defined as the probability that "the sample value would be as large as the value actually observed, if H0 is true" (Wonnacott and Wonnacott, 1977). In effect, the p-value provides a measure of how likely a particular value is, assuming that the null hypothesis is true. Thus, the smaller the p-value, the less likely that the sample supports H0.
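A p-value for a before/after comparison of the kind described above can be computed directly. The sketch below assumes SciPy is available; the "before" and "after" samples are hypothetical values invented for illustration, not data from this manual.

```python
import numpy as np
from scipy import stats

# Hypothetical "before" and "after" samples of a biological index.
before = np.array([31.0, 35.0, 29.0, 33.0, 36.0, 30.0])
after = np.array([28.0, 30.0, 27.0, 31.0, 29.0, 26.0])

# Two-sample t test of H0: mu1 - mu2 = 0 against a two-sided alternative.
t_stat, p_value = stats.ttest_ind(before, after)

# The p-value is the probability, computed assuming H0 is true, of a test
# statistic at least as extreme as the one observed. Reporting it lets the
# reader judge the strength of the evidence, not just an accept/reject verdict.
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```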
This is useful information; it suggests that p-values should always be reported to allow the reader to judge the strength of the evidence.

Common Assumptions

Virtually all statistical procedures and tests require the validity of one or more assumptions. These assumptions concern either the underlying population being sampled or the distribution for a test statistic. Since the failure of an assumption can have a substantial effect on a statistical test, the common assumptions of normality, equality of variances, and independence are discussed in this section. We must ask, for example, to what extent can an assumption be violated without serious consequences? Or how should assumption violations be addressed?

Normality. A common assumption of many parametric statistical tests is that samples are drawn from a normal distribution. Alternatively, it may be assumed that the statistic of interest (e.g., a mean) is described by a normal sampling distribution. In either case, the key distinction between parametric and nonparametric (or distribution-free) statistical tests is whether a probability model (often normal) is assumed.

Empirical evidence (e.g., Box et al. 1978) indicates that the significance level, but not the power, is robust (that is, not greatly affected by mild violations of the normality assumption) for statistical tests concerned with the mean. This finding suggests that a test result indicating "statistical significance" is reliable, but a "nonsignificant" result may be the result of a lack of robustness to nonnormality. The normality of a sample can be checked using a normal probability plot, chi-square test, Kolmogorov-Smirnov test, or by testing for skewness or kurtosis; however, many biological surveys are not designed to produce enough samples to make these tests definitive.

Normality of the sampling distribution for a test statistic is important because it provides a probability model for interval estimation and hypothesis tests.
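Several of the normality checks just listed can be applied in a few lines. The sketch below assumes SciPy is available and uses a simulated, deliberately skewed sample in place of real biosurvey data; the Shapiro-Wilk test is included as a small-sample option alongside the Kolmogorov-Smirnov and moment checks named in the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Simulated stand-in for a biosurvey sample; lognormal, so clearly skewed.
sample = rng.lognormal(mean=3.0, sigma=0.5, size=30)

# Moment checks: values far from 0 suggest departure from normality.
print("skewness:", stats.skew(sample))
print("excess kurtosis:", stats.kurtosis(sample))

# Shapiro-Wilk test (well suited to small n); small p argues against normality.
w, p = stats.shapiro(sample)
print("Shapiro-Wilk p:", p)

# Kolmogorov-Smirnov test against a normal fitted to the sample. (Estimating
# the parameters from the same data makes this version conservative.)
d, p_ks = stats.kstest(sample, "norm", args=(sample.mean(), sample.std(ddof=1)))
print("K-S p:", p_ks)

# A log transformation, a common choice for nonnegative environmental data,
# often restores approximate normality:
w_log, p_log = stats.shapiro(np.log(sample))
print("Shapiro-Wilk p after log transform:", p_log)
```

A normal probability plot (e.g., `scipy.stats.probplot`) provides the corresponding graphical check.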
In some cases, transformation of the data may help the investigator achieve approximate normality (or symmetry) in a sample, if normality is required. Since nonnegative concentration data cannot be truly normal, and since empirical evidence suggests that environmental contaminant data may be described with a lognormal distribution, the logarithmic transformation is a good first choice. Therefore, in the absence of contrary evidence, we recommend that concentration data be log-transformed prior to analysis.

Equality of Variance. A second common assumption is that when two or more distributions are involved in a test, the variances will be constant across distributions. Many tests are also robust to mild violations of this assumption, particularly if the sample sizes are nearly identical. To test this assumption, an F test (usually a two-tailed one) can be performed; see Snedecor and Cochran (1967) for an example, and Miller (1986) for interpretive results. Conover (1980) provides an alternative, namely, nonparametric tests of equality of variances. Note that if two means are being compared based on samples with vastly different variances, the differences of interest may be more fundamental than the difference between the means.

Independence. The assumption of greatest general concern is independence. Most statistical tests (parametric and nonparametric) require a random sample, or a sample composed of independent observations. Dependency between or among observations in a data set means that each observation contains some information already conveyed in other observations. Thus, there is less new independent information in a dependent data set than in an independent data set of the same sample size.
Because statistical procedures are often not robust to violation of the independence assumption, adjustments are generally recommended to address anticipated problems. Dependence in a sample can result from spatial or temporal patterns, that is, from persistence through time and space.

In most types of analyses, the assumption of independence refers to independence in the disturbances (errors). For example, in a time series with temporal trend and seasonal pattern, dependence or autocorrelation in the raw data series may exist because of a deterministic feature of the data (e.g., the time trend or seasonal pattern). This type of autocorrelation poses no difficulty; it is addressed by modeling the deterministic features of the data and subtracting the modeled component from the original series. Of particular concern in testing for trend is autocorrelation that remains after all deterministic features are removed (i.e., autocorrelation in the disturbances). When this situation arises, an adjustment to the trend test is necessary. Reckhow et al. (1993) provide guidance and software.

A similar situation can occur in the estimation of a regression slope or a central tendency statistic such as the mean or trimmed mean. In such cases, the independence assumption refers to the errors, as estimated by the residuals, around the regression line or the mean. If persistence or dependence is found in the residuals, then the independence assumption is violated and corrective action is needed. Options to address this problem include using an effective sample size (Reckhow and Chapra, 1983), or generalized least squares for regression (see Kmenta [1986] or any standard econometrics regression text).

If the investigator finds positive autocorrelation in the disturbances (i.e., if each disturbance is positively correlated with nearby disturbances in the series), confidence interval estimates will be too narrow and hypothesis tests may incorrectly reject the null hypothesis.
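The diagnosis and adjustment described above can be sketched briefly. The example below is illustrative: it simulates a first-order autoregressive disturbance series (the AR(1) coefficient is an arbitrary choice, not a value from this manual) to stand in for residuals after trend and seasonality have been removed, then applies a standard first-order effective-sample-size adjustment of the kind cited from Reckhow and Chapra (1983).

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated AR(1) disturbances standing in for detrended, deseasonalized
# residuals; phi = 0.6 is an illustrative degree of persistence.
phi, n = 0.6, 200
e = np.empty(n)
e[0] = rng.normal()
for t in range(1, n):
    e[t] = phi * e[t - 1] + rng.normal()

# Lag-1 sample autocorrelation of the residual series.
r1 = np.corrcoef(e[:-1], e[1:])[0, 1]

# First-order effective sample size under an AR(1) error model: positive
# autocorrelation shrinks the information content of the n observations.
n_eff = n * (1 - r1) / (1 + r1)

print(f"lag-1 autocorrelation: {r1:.2f}")
print(f"nominal n = {n}, effective n = {n_eff:.0f}")
```

Using n_eff in place of n when forming standard errors widens confidence intervals appropriately, countering the too-narrow intervals noted above.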
Autocorrelation in the disturbances is the most common and potentially the most troublesome of the causes of assumption violations. The degree of autocorrelation is a function of the frequency of sampling; that is, a data set based on an irregular sampling frequency cannot be characterized by a single, fixed value for autocorrelation. For biological time series, stream data obtained more frequently than monthly may be expected to be autocorrelated (after trends and seasonal cycles are removed). Stream survey data based on less frequent sampling are less likely to exhibit sample autocorrelation estimates of significance.

Parametric Methods: the t Test

Parametric approaches involve a model (e.g., regression slope) for any deterministic features and a probability model for the errors. In some cases, the deterministic model will be a linear, curvilinear, or step function, while the model for the errors is typically a normal probability distribution with independent, identically distributed errors. In other cases, the deterministic model may simply be a constant (as it is when interest focuses on an "upstream/downstream" comparison between two sites), though the probability model may in all cases be a normal probability distribution. The t test is a typical parametric test.

Using the t test

A Student's t statistic for a single sample,

    t = (x̄ - μ) / (s/√n)                                        (2.1a)

has a Student's t distribution (n - 1 degrees of freedom); here, x̄ is the mean of a random sample from a normal distribution with true mean μ and constant variance, s is the sample standard deviation, and n is the sample size. In addition, for two samples,

    t = (x̄1 - x̄2) / [((s1 + s2)/2) √(1/n1 + 1/n2)]              (2.1b)

also has a Student's t distribution (n1 + n2 - 2 degrees of freedom); here, x̄1 and x̄2 are the sample means; s1 and s2 are the sample standard deviations; and n1 and n2 are the sample sizes.
This distribution is widely tabulated, and it is commonly used in hypothesis testing and confidence interval estimation for a sample mean (one-sample test; Equation 2.1a) or a comparison of sample means (two-sample test; Equation 2.1b).

When Student's t distribution is used in a hypothesis test (a t test), it is assumed that samples are drawn from a normal distribution, the variances are constant across distributions, and the observations are independent. Of these assumptions, Box et al. (1978) have shown that the t test has limited robustness to violations of the first two (normality and equality of variances); however, problems will occur if the observations are dependent. The scientist should probably be concerned about the first two assumptions only in situations in which the two data sets have substantially different variances and substantially different sample sizes (see Snedecor and Cochran [1967] for F test calculations to compare variances).

An attractive variation of the t statistic for use in situations where outliers are of concern was proposed by Yuen and Dixon (1973; see also Miller, 1986; and Staudte and Sheather, 1990). They created an outlier-resistant, or robust, version of the t statistic (Equations 2.1a and 2.1b) using a trimmed mean and a Winsorized standard deviation. For example, if a t statistic is used to compare the means of two populations, the robust (trimmed t) version is

    t_tr = (x̄t1 - x̄t2) / [((sw1 + sw2)/2) √(1/n1 + 1/n2)]       (2.2)

where

    x̄ti = trimmed mean for sample i
    swi = Winsorized standard deviation for sample i
    ni  = number of observations in sample i

A Winsorized statistic is similar to a trimmed statistic. For trimming, observations are ordered from lowest to highest, and the fraction k of lowest and k of highest observations are removed from the sample for the calculation of the k-trimmed statistic (e.g., trimmed mean).
For k-Winsorizing, observations are ordered from lowest to highest, and the k lowest and k highest are not removed, but are reassigned the values of the lowest observation and the highest observation remaining in the trimmed sample. The following example illustrates this. A sample of 10 IBI values is obtained for analysis:

    29, 31, 26, 25, 34, 38, 33, 31, 28, 37

and ordered from lowest to highest:

    25, 26, 28, 29, 31, 31, 33, 34, 37, 38

The 10 percent-trimmed sample is

    26, 28, 29, 31, 31, 33, 34, 37

The 10 percent-Winsorized sample is

    26, 26, 28, 29, 31, 31, 33, 34, 37, 37

If we were to calculate the 10 percent-trimmed t statistic in Equation 2.2 for this IBI sample, we would use: (1) the trimmed sample (eight observations) to calculate a mean, and (2) the Winsorized sample (10 observations) to calculate a standard deviation. For the two-sample comparison of means, the trimmed t statistic has (1 - 2k)(n1 + n2) - 2 degrees of freedom; a one-sample test on the trimmed sample above (eight observations) would have 7 degrees of freedom (df). The trimmed t statistic is an attractive option that should be considered whenever outliers are a concern.

The parametric approach is appropriate and advantageous if the deterministic model is a reasonable characterization of reality and if the model for errors holds. In such cases, parametric tests should be more powerful than nonparametric or distribution-free alternatives. Thus, the assumption that deterministic and probability models are correct is the basis on which the superior performance of parametric methods rests. If the assumptions concerning these models are incorrect, then the results of the parametric tests may be invalid and distribution-free procedures may be more appropriate.

Nonparametric Tests: the W Test

Distribution-free methods, as the name suggests, do not require an assumption concerning the particular form of the underlying probability model for the data generation process.
An assumption of independence is, however, usually made; therefore, autocorrelation can be as serious a problem in nonparametric methods as it is for parametric and robust methods. Distribution-free tests are often based on rank (or order); the sample observations are arranged from lowest to highest. The Wilcoxon-Mann-Whitney test, or W test, is a typical distribution-free test.

Using the W test

The W test is a two-sample hypothesis test, designed to test the hypothesis that two random samples are drawn from identical continuous distributions with the same center (alternative hypothesis: one distribution is offset from the other, but otherwise identical). This test is often presented as an option to the two-sample t test that should be considered if the assumption of normality is believed to be seriously in error. The W test has its own statistic, which is tabulated in most elementary statistics textbooks (i.e., those with a chapter on nonparametric methods). However, for moderate to large sample sizes (e.g., n > 15), the statistic is approximately normal under the null hypothesis, so the standard normal table can be used.

The scientist should consider the W test for any situation in which the two-sample t test may be used. Comparative studies of these two tests indicate that while the t test is robust to violations of the normality assumption, the W test is relatively powerful while not requiring normality. Situations that appear severely nonnormal might favor the W test; otherwise the t test may be selected. Some statisticians (e.g., Blalock, 1972) recommend that both tests be conducted as a double check on the hypothesis.

Unfortunately, violation of the independence assumption appears to be as serious for the W test as for the t test. If these tests are to be meaningful, the scientist must confirm independence or make other adjustments as noted in Reckhow et al. (1993).
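Running the two tests side by side, as suggested above, takes only a few lines. The sketch below assumes SciPy is available and uses illustrative two-site data; note that SciPy reports the Mann-Whitney U statistic, which differs from the rank-sum W only by a constant, so the p-values agree.

```python
import numpy as np
from scipy import stats

# Illustrative upstream/downstream-style samples (hypothetical values).
site_a = np.array([39.0, 45.0, 49.0, 44.0, 37.0, 41.0, 46.0, 40.0])
site_b = np.array([30.0, 36.0, 43.0, 41.0, 26.0, 32.0, 36.0, 18.0])

# Parametric route: two-sample t test.
t_stat, p_t = stats.ttest_ind(site_a, site_b)

# Distribution-free route: Wilcoxon-Mann-Whitney (W) test. SciPy handles
# tied ranks with the average-rank convention described in the text.
u_stat, p_w = stats.mannwhitneyu(site_a, site_b, alternative="two-sided")

# Reporting both provides the double check some statisticians recommend.
print(f"t test: t = {t_stat:.2f}, p = {p_t:.3f}")
print(f"W test: U = {u_stat:.1f}, p = {p_w:.3f}")
```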
In essence, the W test is used to determine if the two distributions under study have the same central tendency, or if one distribution is offset from the other. To conduct the W test, the data points from the samples are combined, while maintaining the separate sample identity. This overall data set is ordered from low value to high value, and ranks are assigned according to this ordering.

To test the null hypothesis of no difference between the two distributions (f[x] and g[x]),

    H0: f(x) = g(x)

the ranks, Ri, for the data points in one of the two samples are summed:

    W = Σ Ri                                                     (2.3)

The ranks should be specified as follows (Wonnacott and Wonnacott, 1977): Start ordering (low to high, or high to low) from the end (high or low) at which the observations from the smaller sample tend to be greater in number, and sum the ranks to estimate W from this smaller sample. This estimate keeps W small as it is reported in most tables. For either one-sided or two-sided tests, if ties occur in the ranks, then all tied observations should be assigned the same average rank.

Statistical significance is a function of the degree to which, under the null hypothesis, the ranks occupied by either data set differ from the ranks expected as a result of random variation. For small samples, the W statistic calculated in Equation 2.3 can be compared to tabulated values to determine its significance (see Hollander and Wolfe, 1973). For moderate to large samples (where total n from both samples > 15), W is approximately normal (if the null hypothesis is true).
Therefore, the W statistic may be evaluated using a standard normal table with mean (E[W]) and variance (Var[W]):

    E(W) = nA(nA + nB + 1)/2                                     (2.4)

    Var(W) = nA·nB(nA + nB + 1)/12                               (2.5a)

If there are ties in the data, then the variance may be calculated as

    Var(W) = (nA·nB/12)[nA + nB + 1 - Σj tj(tj² - 1)/((nA + nB)(nA + nB - 1))]    (2.5b)

where tj is the size (number of data points with the same value) of tied group j. The effect of ties is negligible unless there are several large groups (tj > 3) in the data set. These statistics are used to create the standard normal deviate:

    z = (W - E(W)) / (Var(W))^0.5                                (2.6)

where nA, nB = the number of observations in samples A and B (nA ≤ nB).

Example: an IBI case study

IBI data have been obtained from upstream and downstream sites surrounding a wastewater discharge. Assume independence.

    Upstream    Downstream
    33          26
    34          30
    25          18
    37          32
    39          36
    45          36
    49          43
    47          42
    45          41
    44          41

(a) Test the null hypothesis that the true difference between the upstream and downstream IBI means is zero, versus the alternative hypothesis that the downstream IBI mean is lower than the upstream IBI mean.

    H0: μU - μD = 0
    Ha: μU - μD > 0

First, some basic statistics for each sample:

                                 Upstream    Downstream
    SAMPLE MEAN                  39.8        34.5
    SAMPLE STANDARD DEVIATION    7.57        8.09

For a comparison of two means based on equal sample sizes, the t statistic is

    t = (x̄1 - x̄2) / [((s1 + s2)/2) √(1/n1 + 1/n2)]
      = (39.8 - 34.5) / [((7.57 + 8.09)/2) √(1/10 + 1/10)]
      = 5.3 / (7.83 √0.2) = 1.51

At the 0.05 significance level, the one-tailed t statistic for 18 degrees of freedom is 1.73. Since 1.51 < 1.73, we cannot reject the null hypothesis (at the 0.05 level).

(b) Test the null hypothesis (see part a) using the 10 percent trimmed t (10 percent trimmed from each end).

    t_tr = (x̄t1 - x̄t2) / [((sw1 + sw2)/2) √(1/n1 + 1/n2)]
         = (40.5 - 35.5) / [((5.83 + 6.39)/2) √(1/10 + 1/10)]
         = 5.0 / (6.11 √0.2) = 1.83

At the 0.05 significance level, the one-tailed t statistic for 14 degrees of freedom is 1.76. Since 1.83 > 1.76, we reject the null hypothesis (at the 0.05 level).

(c) Test the null hypothesis (see part a) using the W test.
The combined samples are ordered from high to low and ranked:

    IBI VALUE    SITE          RANK
    49           Upstream      1
    47           Upstream      2
    45           Upstream      3.5
    45           Upstream      3.5
    44           Upstream      5
    43           Downstream    6
    42           Downstream    7
    41           Downstream    8.5
    41           Downstream    8.5
    39           Upstream      10
    37           Upstream      11
    36           Downstream    12.5
    36           Downstream    12.5
    34           Upstream      14
    33           Upstream      15
    32           Downstream    16
    30           Downstream    17
    26           Downstream    18
    25           Upstream      19
    18           Downstream    20

Here the separate samples have been combined for the purpose of rank ordering. The W test statistic can then be calculated from the upstream ranks:

    W = Σ Ri = 1 + 2 + 3.5 + 3.5 + 5 + 10 + 11 + 14 + 15 + 19 = 84

    E(W) = nA(nA + nB + 1)/2 = 10(10 + 10 + 1)/2 = 105

    Var(W) = (nA·nB/12)[nA + nB + 1 - Σj tj(tj² - 1)/((nA + nB)(nA + nB - 1))]
           = [(10)(10)/12][10 + 10 + 1 - {(2)(3) + (2)(3) + (2)(3)}/((10 + 10)(10 + 10 - 1))]
           = 174.61

    z = (W - E(W)) / (Var(W))^0.5 = (84 - 105)/√174.61 = -1.59

At the 0.05 significance level, the one-tailed z statistic is 1.65. Since 1.59 < 1.65, we cannot reject the null hypothesis (at the 0.05 level).

A glance at the IBI values and ranks in this example indicates a difference between the two samples (box plots and histograms would provide further supporting evidence). At issue is whether this difference in the sample is a chance occurrence or an indication of a true difference between the sites. If we adopt the conventional 0.05 level for hypothesis testing, then the conclusions from the three tests are ambiguous. Still, we can say the following about both the site comparisons and the methods:

(i) The downstream site is slightly impacted. Even though only one of the three test results yielded significance (at the 0.05 level), all three were close, suggesting a slight difference between the sites.

(ii) For each site, the lowest IBI value (25 for upstream, and 18 for downstream) is influential, particularly on the standard deviation. As a consequence, for the conventional t test, the denominator in the t statistic is inflated and rejection of the null hypothesis is less likely. Note that the lowest IBI value for the upstream site (IBI = 25) also affects the distribution-free W test. This IBI value holds a high rank (19) for the upstream sample, and substantially affects the test result. If that single IBI value had been 27 instead of 25, we would have rejected the null hypothesis at the 0.05 level.

(iii) The trimmed t is resistant to unusual observations or outliers, and thus provides the best single indicator of difference between the sites as conveyed by the bulk of the data from each site.

Conclusions

In hypothesis testing, the conclusion to not reject H0 (in effect, to accept H0) should not be evaluated strictly on the basis of α, the probability of rejecting H0 when it is true (Type I error; see Table 2.4). Instead, we must be concerned with β, the probability of accepting H0 when it is false (Type II error). Unfortunately, β does not have a single value, but is dependent on the true (but unknown) value of the difference between population means and on the sample size, n. For a particular testing procedure and sample size, we can determine and plot a relationship between the true difference between means and β. This plot is called the operating characteristic curve.

To understand the issues concerning significance and power (α and 1 - β), consider the null hypothesis in the IBI case study:

    H0: The population mean IBI at the upstream site is the same as the population mean IBI at the downstream site.

In addition, because of the wastewater discharge, consider the general alternative hypothesis:

    HA: The population mean IBI at the upstream site is higher than the population mean IBI at the downstream site.

If we adopt α = 0.05 (the probability of rejecting H0 when it is true; Type I error) as our significance level, then Figure 2.1a displays the sampling distribution for the mean under H0 with 18 degrees of freedom. The horizontal axis in Figure 2.1 is the "difference between the means"; thus, the sampling distribution is centered at zero in Figure 2.1a (consistent with zero difference between means under H0).
The 0.05-significant tail area (the "rejection region") begins at 6.06, which means that the sample difference must be greater than or equal to 6.06 for us to reject H0. Since the difference between the means in our sample IBI was only 5.3, we are inclined to accept the null hypothesis, based on the conventional t test. (Note: to find the beginning of the tail area, multiply the t statistic times the standard error. In this example, the t statistic is 1.73 [one-sided, 0.05 level, 18 degrees of freedom], and the standard error is 3.5. Thus, the tail area begins at [1.73][3.5] = 6.06.)

Now suppose that the following alternative hypothesis, H1, is actually true for the sample IBI case:

    H1: The population mean IBI at the upstream site is higher by 5.0 than the population mean IBI at the downstream site.

In addition, suppose that while H1 actually is true, we propose a hypothesis test for H0 based on the acceptance region in Figure 2.1a (i.e., accept H0 if the difference between the means is less than 6.06), which is exactly what occurred in our example. As we noted above, consideration of H0 alone (Figure 2.1a) leads us to accept the null hypothesis.

Figure 2.1a and b. Sampling distributions under different hypotheses. (Panel a: distribution under H0, with the α = 0.05 tail area beginning at 6.06; panel b: distribution under H1.)

Yet, with H1 actually true (see Figure 2.1b), if we propose a hypothesis test for H0 based on the acceptance region in Figure 2.1a, there is a 62 percent chance that we will accept H0 when it is actually false, according to Figure 2.1b (given the sample size in the example). This high likelihood of Type II error (see Table 2.4) underscores the danger of concluding the hypothesis test with acceptance of the null hypothesis.
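The Type II error figure quoted above can be reproduced numerically. The sketch below assumes SciPy is available and uses the normal approximation to the sampling distribution, with the quantities taken from the example (critical t of 1.73, standard error of 3.5, true difference of 5.0 under H1).

```python
from scipy import stats

# Quantities from the IBI example.
se = 3.5            # standard error of the difference between means
t_crit = 1.73       # one-tailed t, alpha = 0.05, 18 degrees of freedom
cutoff = t_crit * se   # 6.06: smallest sample difference that rejects H0
true_diff = 5.0     # difference between means assumed under H1

# If H1 is true, the sample difference is (approximately) centered at 5.0
# with the same standard error. Beta is the chance it nonetheless falls in
# the acceptance region (below the cutoff), i.e., the Type II error.
beta = stats.norm.cdf((cutoff - true_diff) / se)
power = 1 - beta

print(f"cutoff = {cutoff:.2f}")
print(f"Type II error (beta) = {beta:.2f}")   # about 0.62
print(f"power = {power:.2f}")                 # about 0.38
```

Repeating the calculation over a range of true_diff values traces out the power curve (equivalently, the operating characteristic curve) discussed in the text.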
The power of this particular test is 1 - β, or a 38 percent chance of detecting an IBI change of 5. Note that the specific alternative hypothesis H1 is one example of an unlimited number of possibilities associated with the general alternative hypothesis HA. Associated with H1, β = 0.62 is one point on the power curve for this test and sample size. To properly determine the power of a test, we need to calculate β for a range of specific alternative hypotheses.

A second issue of concern in hypothesis testing is the problem of multiple simultaneous hypothesis testing, or "multiplicity" (Mosteller and Tukey, 1977). The classical interpretation of the 0.05 significance level (for α) associated with a hypothesis test is that 95 percent of the time this testing procedure is applied, the conclusion to accept the null hypothesis will not be in error if the null hypothesis is true. That is, on the average, one in 20 tests under these conditions will result in Type I errors.

The problem of multiplicity arises when an investigator conducts several tests of a similar nature on a set of data. If all but a few of the tests yield statistically insignificant results, the scientist should not ignore this in favor of those that are significant. The error of multiplicity results when one ignores the majority of the test results and cites only those that are apparently statistically significant. As Mosteller and Tukey (1977) note, the multiplicity error is technically the incorrect assignment of an α-level. When multiple tests of a similar nature are run on a set of data, a collective α should be used, associated with simultaneous test results. This tactic is typically referred to as the Bonferroni correction.

The following comments from Wonnacott and Wonnacott (1972, pp.
201-202) summarize our attitude toward hypothesis testing:

We conclude that although statistical theory provides a rationale for rejecting H0, it provides no formal rationale for accepting H0. The null hypothesis may sometimes be uninteresting, and one that we neither believe nor wish to establish; it is selected because of its simplicity. In such cases, it is the alternative H1 that we are trying to establish, and we prove H1 by rejecting H0. We can see now why statistics is sometimes called "the science of disproof." H0 cannot be proved, and H1 is proved by disproving (rejecting) H0. It follows that if we wish to prove some proposition, we will often call it H1 and set up the contrary hypothesis H0 as the "straw man" we hope to destroy. And of course if H0 is only such a straw man, then it becomes absurd to accept it in the face of a small sample result that really supports H1.

Since there are great dangers in accepting H0, the decision instead should often be simply to "not reject H0," i.e., reserve judgment. This means that type II error in its worst form may be avoided; but it also means you may be leaving the scene of the evidence with nothing in hand. It is for this reason that either the construction of a confidence interval or the calculation of a prob-value is preferred, since either provides a summary of the information provided by the sample, useful to sharpen up your knowledge of what the underlying population is really like.

If, on the other hand, a simple accept-or-reject hypothesis test is desired, then we must look to a far more sophisticated technique. Specifically, we must explicitly take account not only of the sample data used in any standard hypothesis test (along with the adequacy of the sample size), but also:

1. Prior belief.
How much confidence do we have in the engineering department that has assured us that the new process is better? Is their vote divided? Have they ever been wrong before?

2. Loss involved in making a wrong decision. If we make a type I error (i.e., decide to reject the old process in favor of the new, even though the old is as good), what will be the costs of retooling, etc.?

These comments amount to an advocacy of Bayesian decision theory. While it may be difficult to interpret a biosurvey in decision analysis terms, prior information and loss functions should, at a minimum, be considered in an informal manner. It is good engineering and planning practice to make use of all relevant information in inference and decision making.

CHAPTER 3
Designing the Sample Survey

The design of the sample survey is a critical element in the environmental assessment process, and certain statistical methods are associated with specific designs. This chapter examines various types of survey design and shows how the selection of the design relates to the interpretation and use of data within the biocriteria program. For information on designs not covered in this chapter, see Cochran, 1963; Cochran and Cox, 1957; Green, 1979; Williams, 1978; and Reckhow and Stow, 1990.
Efforts to design sample surveys frequently result in situations that force the investigator to evaluate the trade-offs between an increase in uncertainty and the costs of reducing this uncertainty (Reckhow and Chapra, 1983). But major components of uncertainty, including variability, error, and bias from biological and statistical sources, can sometimes be controlled by a well-specified survey design. For example, variability can be caused by natural fluctuations in biological indicators over space and time; error can be associated with inaccurate data acquisition or reduction; and bias can occur when the sample is not representative of the population under review or when the samples are not randomly collected. These sources of uncertainty should be evaluated before the sampling design is selected because the best design will minimize the effects of variability, error, and bias on decision making.

Critical Aspects of Survey Design

Data collection within the biocriteria program requires the investigator to address issues associated with both classical and experimental survey designs. In general, experimental survey design focuses on the collection of data that leads to the testing of a specific hypothesis. Classical survey design is motivated less by hypothesis testing than by the "survey" concept. That is, the investigator gathers a relatively small amount of data, the sample, and extrapolates from it a view of the totality of available information.

In this chapter, we will address issues that overlap these design types. In addition, we will focus on designs appropriate to local, site-specific situations. For larger geographic survey designs, see Hunsaker and Carpenter (1990), or Linthurst et al. (1986).

Variability

A critical aspect of sampling design is to identify and separate components of variability, including the important ones of time, space, and random errors.
Yearly and seasonal variations and spatial variations like those caused by changes in geographic patterns should be accounted for in the survey design. A design that stratifies the sampling based on knowledge of spatial and temporal changes in the abundance and character of biological indicators is preferred to systematic random sampling. That is, if biological indicators are known to exhibit temporal and spatial patterns, then sampling locations and times must be adjusted to match the biological variability.

Representativeness and Sampling Techniques

The object of a biological survey design is to reduce the total information available to a small sample: observations are made and data collected on a relatively small number of biological variables. Representativeness is, therefore, a key consideration in the design of sample collection procedures. The data generated during the survey should be representative of the population or process under evaluation. Biased samples occur when the data are not representative of the population. For example, a sample mean may be low (biased) because the investigator failed to sample geographic areas of high abundance.

Several techniques can increase the odds of collecting a representative sample; however, the technique most frequently used is random sampling. Theoretically, in simple random sampling, every unit in the population has the same chance of being included in the sample. Random sampling is a physical way to introduce independence among environmental measurements. In addition, random sampling has the effect of minimizing various types of bias in the interpretation of results.

If the geographic area sampled is large, with known or suspected environmental patterns, a good technique is to divide the area into relatively homogeneous sections and randomly sample within each one. This technique is known as stratified sampling.
Samples can be allocated to each section in proportion to the size of the area or to the known abundance of organisms within each area. In still other cases, systematic sampling may be appropriate. Systematic sampling improves precision in the sample estimates, especially when known spatial patterns exist (Cochran, 1963). Randomly allocated replicate samples collected on a grid allow for good spatial coverage of patchy environments, yet minimize the potential for sampling bias.

Cause and Effect

In classical statistical experiments, a population is identified and randomly divided into two groups. The treatment is administered to one group; the other group serves as the control. The difference in the average response between the two groups indicates the effect of the treatment, and the random assignment of individuals to the groups permits an inference of causality because the observed difference results from the treatment and not from some preexisting difference between the groups.

In an ecological assessment, the treatment and control groups are not selected at random from a larger population, since the impacted site cannot be selected at random. And no matter how carefully the reference site is matched, the investigator cannot compensate for the lack of random selection. In this sense, a statistically valid test of the hypothesis that an observed difference between an impacted site and a control site results from a specific cause is impossible. The hypothesis that the two sites are different can be tested, but the difference cannot be attributed to a specific cause. In statistical terms, the stress on the impacted site is completely confounded with preexisting differences between the impact and reference sites.
Although a firm case can be made that a site is subject to adverse impacts, investigators must realize that the site is an experimental unit that cannot be replicated. They must take care to avoid "pseudoreplication" (Hurlbert, 1984), the testing of a hypothesis about adverse effects without appropriate statistical design or analysis methods. The problem is a misunderstanding or misspecification of the hypothesis being tested. It is avoided by understanding that only the hypothesis of a difference between sites can be statistically tested. Cause-and-effect issues cannot be resolved using statistical methods. Of course, establishing that a difference exists is an essential step in the process of demonstrating an adverse ecological effect. If there is no detectable difference, there is no cause to establish.

Methods used to establish causality can make use of statistical techniques, such as regression or correlation. For example, regression can be used to show that toxicity increases along with the concentration of some chemical known to originate from a wastewater outfall. The regression describes the relationship; it does not imply the cause, though the presence of a strong relationship is evidence that a link exists.

One way to resolve these issues is to collect both spatial and temporal data from a control site. If the spatial control is missing and only before and after impact samples are available at the impacted site, statistical tests cannot rule out the possibility that the change would have occurred with or without the impact. If the temporal control is missing, the statistical tests cannot rule out the possibility that the differences between the control and impact sites may have occurred prior to the impact. In practice, control data are rarely available in both spatial and temporal dimensions. Therefore, most environmental assessments detect only that differences exist between the control and impact sites.
The causal link is more difficult to discern.

Controls

In environmental assessments, control or reference data are used in hypothesis tests to evaluate whether data from the control and impact sites are statistically different. Evidence of impact is based on changes in the biological community that did not occur in the control area. Sources of control information include baseline data, reference site data, and numeric standards. The case for causality can be strengthened if the controls are properly selected.

In an ideal study design, both temporal and spatial control data should be collected (Green, 1979). The control site should be geographically separated from the impacted site but have similar physical and ecological features (e.g., elevation, temperature, wind patterns, and habitat type and disturbance). In aquatic habitats, parameters such as stream order, flow rate, and stream hydrography should be considered. Ideally, biological indicators of impact should be collected at the control site before and after the impact occurs.

Statistically, a valid control site should have conservative properties. That is, its statistics should be the same as at the impacted site except for the effects of the impact. Physical, chemical, and ecological variables should be measured and statistically evaluated to confirm that the impact and control sites are properly matched. Investigators should test for mean differences as well as differences in distribution. In addition, the variances of the physical and ecological variables at the control and impact sites should be similar over time. For example, if the mean pH between the two sites is consistent but the impact site experiences much wider swings in pH than the control site, then the ability to confidently detect an impact for a pH-dependent toxicant is compromised. Samples within the control and reference sites should be randomly allocated at some level.
For example, in a random sampling design (Fig. 3.1), the samples would be randomly allocated in a temporal/spatial framework that would allow for a number of different statistical analyses, including analysis of variance (ANOVA).

[Figure 3.1. Random before and after control impact (BACI) sample design having both temporal and spatial dimensions, with control and impact areas sampled before and after the impact. Random samples indicated are from within areas identified as being of similar habitat. (Adapted from Green, 1979.)]

In an optimal study design, the impact would be in the future. Thus, baseline data providing a temporal control would be available to the investigator. In practice, baseline data are rarely available, and the investigator cannot be certain whether differences between the impact and control sites preceded or followed the impact. Therefore, cause and effect cannot be determined. However, the fact that a difference exists allows the investigator to hypothesize a causal link.

In some cases, biological variables collected at an impact site may be compared to a fixed numeric value rather than to a set of identical measurements collected at a reference site. Nevertheless, the issues associated with demonstrating causality remain the same. In addition, the investigator should note that the numeric criterion has no variance. It is usually presented as a single number with no associated uncertainty. In such cases, a t statistic (see chapter 4) would be appropriate. As an alternative to the numeric criterion, investigators could use the data from which the criterion was derived. Uncertainty estimates from that data set could be used in statistical comparisons.
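Because the numeric criterion contributes no variance, comparing site data against it reduces to a one-sample t statistic. A minimal sketch, with entirely hypothetical scores and an invented criterion value of 40 chosen only for illustration:

```python
import math

def one_sample_t(data, criterion):
    """One-sample t statistic for testing a site mean against a fixed
    numeric criterion (the criterion itself has no variance)."""
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / (n - 1)   # sample variance
    se = math.sqrt(var / n)                              # standard error of the mean
    return (mean - criterion) / se

# Hypothetical index scores at an impact site, tested against a
# hypothetical numeric criterion of 40:
scores = [36, 41, 38, 35, 39, 37, 40, 34]
t_stat = one_sample_t(scores, 40)
print(round(t_stat, 2))
```

The resulting statistic is compared with the tabled t value for n − 1 degrees of freedom at the chosen significance level.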
Key Elements

Several specific survey designs are appropriate for use in a biocriteria program, but designs for a particular environmental assessment should be developed with the aid of a consulting statistician. Such plans should include the following key elements, beginning with the notion of a pilot study.

Pilot Studies

In a pilot study, the investigator makes a limited survey of the variables that determine impact at both the impact and control sites. Data from the survey can be used to estimate sample sizes, evaluate sampling methods, establish important variance components, and critique or reevaluate the larger design. The sample size helps determine the particular levels of statistical confidence that can be gleaned from the study. In general, a pilot study can save time and effort by verifying an investigator's preliminary assumptions and initial evaluations of the impact site. Current studies and historical data collected at the site of interest or similar sites can be used to help establish a good monitoring design.

Location of Sampling Points

A second key issue in the study design is the location of the sampling points. Many specific designs and variations are available, including (1) completely random sample designs, (2) systematic sample designs, and (3) stratified random sample designs.

Random Samples. In completely random sampling, every potential sampling point has the same probability of selection. The investigator randomly assigns the sample points within the impact site and independently within the control site. No attempt is made to partition the impact and control sites either spatially or temporally except to ensure similar physical habitats.
The sampling units are numbered sequentially, and the selection is made using a random number table or computer-generated random numbers. The advantage of random sampling is that statistical analysis of data from points located completely at random is comparatively straightforward. In addition, the method provides built-in estimates of precision. On the other hand, random sampling can miss important characteristics of the site, spatial coverage tends to be nonuniform, and some points may be of little interest.

Systematic Samples. Systematic sampling occurs when the investigator locates samples in a nonrandom but consistent manner. For example, samples can be located at the nodes of a grid, at regular intervals along a transect, or at equally spaced intervals along a streambank. The grid or interval can be generated randomly, after which the position of all samples is fixed in space.

Systematic sampling has two advantages over simple random sampling. First, the sample is easier to draw, since only one random number is required. Second, the sampling points are evenly distributed over the entire area. For this reason, systematic sampling often gives more accurate results than random sampling, particularly for patchy environments or environments with distinct discontinuous populations.

Systematic sampling also has its disadvantages. For example, if the magnitude of the biological variable exhibits a fixed pattern or cycle over space or time, then systematic sampling is unlikely to represent the variance of the entire population. If an organism has several hatches, roughly at equally spaced time intervals during the sampling period, then samples taken at fixed-time intervals may provide a biased estimate of the average number of individuals alive at one time. If possible, the population should be checked for such periodicity. If periodicity is found or suspected but not verifiable, systematic sampling should not be used.
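Mechanically, a systematic draw needs only the one random number noted above: the starting position. A short sketch of my own (the transect of 100 numbered sampling units and the sample of 10 are invented for illustration):

```python
import random

def systematic_sample(n_units, n_samples):
    """Select a systematic sample: one random start, then every
    k-th unit, where k is the sampling interval."""
    k = n_units // n_samples          # sampling interval
    start = random.randrange(k)       # the single random number
    return [start + i * k for i in range(n_samples)]

random.seed(1)                        # fixed seed so the example repeats
points = systematic_sample(100, 10)   # 10 evenly spaced units out of 100
print(points)
```

Every unit still has the same inclusion probability, but once the start is drawn the whole sample is fixed, which is exactly why periodic populations are a hazard.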
Another disadvantage of systematic sampling is that it is more complicated to estimate the standard error than if random sampling had been used. Despite these problems, systematic sampling is often part of a more complex sampling plan in which it is possible to obtain unbiased estimates of the sampling errors.

Stratified Random Samples. Stratified samples combine the advantages of random and systematic sampling. Stratified random sampling consists of the following three steps: (1) the population is divided into a number of parts, called strata; (2) a random sample is drawn independently in each stratum; and (3) an estimate of the population mean is calculated as

    ȳst = (1/N) Σh Nh ȳh        (3.1)

where ȳst is the estimate of the population mean, Nh is the total number of sampling units in the hth stratum, ȳh is the sample mean in the hth stratum, and N = Σh Nh is the size of the population. Note that the Nh are not sample sizes but the total sizes of the strata, which must be known to calculate this value.

Stratification is employed if it can be shown that differences between the strata means in the population do not contribute to the sampling error in the estimate of ȳst. In other words, the sampling error of ȳst arises solely from variations among sampling units that are in the same stratum. If the strata can be formed so that they are internally homogeneous, a gain in precision over simple random sampling can occur.

In stratified sampling, the sample size can vary independently across strata. Therefore, money and human resources can be allocated efficiently across strata. As a general rule, strata with the greatest uncertainty (i.e., with the largest expected variance, or about which little is known) should receive the greatest amount of sampling effort.

For environments that are known to be fairly homogeneous with respect to the biological variable under consideration, stratified random sampling will not add precision to the population estimates.
In fact, using stratification in these environments may introduce a loss of precision and a possible bias in the population estimates. In these cases, the investigator may save a great deal of time and effort by using simple random sampling in the sampling plan.

Location of Control Sites

Under EPA's biocriteria program, states may establish either site-specific reference sites or ecologically similar regional reference sites for comparison with impacted sites (U.S. Environ. Prot. Agency, 1990). Typical site-specific reference sites may be established along a gradient. For example, a reference site can be established upstream of a wastewater outfall (Fig. 3.1). Gradients work well for rivers and streams; for larger waterbodies, reference sites can be established on a one-to-one basis with a similar waterbody in the region not experiencing the impact under evaluation.

An important consideration in site-specific reference conditions is to establish that the control site is not impaired at all or that it is only minimally impaired. In particular, baseline data should be obtained to demonstrate that the impact is linked to the differences detected between the reference site and the impacted site.

Ideally, a reference site should exhibit no impairment; however, natural variability in biological data may make the determination of minimal or no impact difficult, especially if the impact is relatively small. An interesting method for site selection is to establish several reference sites based on their physical similarities with the impact site. For example, selecting one reference site with higher flow than the impact site and another with lower flow may increase the investigator's ability to determine the presence of a real impact.
Comparisons of data collected from the impact and reference sites should provide consistent interpretations of the impact, regardless of which reference site is used in the comparison.

Minimizing temporal variation in biological measurements can be critical to the evaluation of control and impacted sites. A general rule is that samples should be obtained from the control and impacted sites during the same time periods. It may be feasible to target an index period (e.g., late spring or summer) in which the biological variables are assumed to be appropriate indicators of ecological health (e.g., the period of maximum abundance or the period of minimum variation in water chemistry). However, for some organisms, periods of maximum abundance may also be periods of high variability. In this case, periods of low abundance but stable conditions can be used to help the investigator detect impairment if it exists.

Estimation of Sample Size

A final key component in developing a survey design is to determine how many samples are required. In most plans, the issue involves a trade-off between the accuracy of the sample estimate and the magnitude of available monetary and human resources. Consequently, the first step is to determine how large an error can be tolerated in the sample estimate. This decision requires careful thought; it depends on how the collected data will be used and the consequences of a sizable uncertainty associated with the sample estimates. Thus, in reality, selecting a sample size is somewhat arbitrary and driven by practical considerations of time and money. Investigators should, however, always approach the selection of sample size using sound statistical principles.

The appropriate equations for calculating sample sizes are often design dependent. Here, we present the case of simple random sampling.
Suppose that d is the allowable error in the sample mean, and the investigator is willing to take only a 5 percent chance that the error will exceed d. In other words, the investigator wants to be reasonably certain that the error will not exceed d. The equation for the sample size is

    n = t²σ²/d²        (3.2)

where σ is the population standard deviation and t is the t statistic for the level of confidence required. For a 95 percent confidence level that the error in the sample mean will not exceed d, t = 1.96. Obviously, an estimate of the population standard deviation, σ, is necessary to use this relationship. In many cases, an estimate of σ can be obtained from existing data. When few data are available about σ, it is a good idea to generate a set of tables to develop a sense of the range of samples required.

Suppose, for example, that an investigator wishes to estimate mean pH readings above a wastewater discharge. How many samples are needed to estimate the true mean pH? At the extremes, the investigator guesses that the standard deviation might range between 0.5 and 1.2 pH units. This estimate leads to Tables 3.1 and 3.2:

Table 3.1. Number of samples needed to estimate the true mean (low extreme, σ = 0.5).

                               CONFIDENCE LEVEL
MARGIN OF ERROR                95%        90%
0.2 pH units                    24         17
0.5 pH units                     4          3
1 pH unit                        1          1

Table 3.2. Number of samples needed to estimate the true mean (high extreme, σ = 1.2).

                               CONFIDENCE LEVEL
MARGIN OF ERROR                95%        90%
0.2 pH units                   138         98
0.5 pH units                    22         16
1 pH unit                        6          4

Note that the number of required samples increases dramatically as the confidence and precision in the estimates increase, and as the population standard deviation increases. As a general rule, the precision of the estimate is inversely proportional to the square root of the sample size. Therefore, increasing the sample size from 10 to 40 will roughly double the precision.
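The table entries follow directly from equation 3.2, rounded to the nearest whole sample. A quick sketch of mine that reproduces the first rows; the text gives t = 1.96 for 95 percent confidence, and the 90 percent column appears consistent with t ≈ 1.65, which is an assumption on my part:

```python
def sample_size(sigma, d, t):
    """Equation 3.2: n = t^2 * sigma^2 / d^2, rounded to the nearest
    whole number of samples."""
    return round((t * sigma / d) ** 2)

T95, T90 = 1.96, 1.65   # t values for 95 and 90 percent confidence

# Reproduce the 0.2 pH unit rows of Tables 3.1 and 3.2:
print(sample_size(0.5, 0.2, T95), sample_size(0.5, 0.2, T90))   # 24 17
print(sample_size(1.2, 0.2, T95), sample_size(1.2, 0.2, T90))   # 138 98
```

Varying `sigma` and `d` over plausible ranges generates the full set of tables described above.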
For a fixed precision, raising the required confidence in the estimate from 95 to 99 percent increases the sample size by roughly 73 percent (the ratio [2.576/1.960]² is about 1.73). Equation 3.2 can easily be adapted for binary response variables in which the responses are expressed as proportions or percentages (Cochran, 1963). In addition, for those situations where the number of sampling units is finite, a finite population correction for the sample size is available (Cochran, 1963).

Equations for calculating sample sizes for random, nonrandom, and stratified sample surveys can be found in the literature. They depend on the sample design, the available variance estimates, and whether the environmental assessment has both spatial and temporal components.

Important Rules

Developing a sample design is frequently driven by factors other than statistics and biology. For example, the investigator may be asked to determine a difference between upstream and downstream stations of a municipal treatment plant outfall, long after the suspected impacts began. Even in these cases, creative sampling strategies can help develop the link between the wastewater outfall and downstream impacts. The following rules apply to most environmental assessment scenarios.

Rule 1. Sample designs and their associated analytical techniques can be difficult to conceptualize and implement. Always consult individuals with appropriate training before starting a biocriteria study.

Rule 2. State precisely and clearly the problem under evaluation before attempting to develop a survey design.

Rule 3. Collect samples from a reference site as a basis for inferring impact. In general, the sampling scheme used at the impacted site should be the same as that employed at the reference site.

Rule 4. To the degree possible, use environmental characteristics to minimize the error in the sample estimate.
For example, for patchy environments examine the possibility of systematic sampling; for heterogeneous populations, examine the possibility of using stratified random sampling. In all cases, attempt to minimize sample bias by randomly allocating samples (either geographically or temporally across the entire population, or within strata).

Rule 5. For seasonally dependent biocriteria, collect data for several seasons before attempting to determine an impact. For biocriteria that are not seasonally dependent, collect sufficient data to represent the variability in the population.

Rule 6. Collect enough data so that the accuracy and precision requirements associated with using the information are achieved.

CHAPTER 4
Detecting Mean Differences

Hypothesis testing methods that seek to detect the mean differences arising from two or more independent samples are among the most common statistical procedures performed. However, these procedures are frequently used without regard to some basic assumptions about the data under investigation, which, in some cases, leads to errors in interpretation. This section describes and illustrates several methods for detecting mean differences.
It focuses on (1) cases in which only two means are involved, and (2) situations involving more than two means. It also presents suggestions concerning the use and abuse of means testing procedures.

Cases Involving Two Means

Several scenarios within the biocriteria program require investigators to compare the mean differences between two independent populations. Suppose, for example, that we want to use biocriteria in a regulatory setting in the following situation:

A wastewater treatment plant discharges its effluent into a stream at a single point. Upstream of the discharge facility, the stream is in good shape (unaffected by any known sources of pollution). The resource agency has sufficient funds to monitor three stations upstream of the discharge site and a comparable number of stations downstream of the discharge site during the late summer. The agency has chosen to evaluate aquatic life use impairment using benthic species richness.

At each of the six sites, 10 independent measures of species richness were generated by randomly placed ponar grabs over a relatively small spatial area (a sample size of 10 was chosen based on variability estimates generated at a different, but similar, site). Sites of comparable habitat quality were chosen for sampling. The upstream sites will serve as a reference condition against which to compare the downstream condition.

In addition to the current survey (i.e., sampling regime, data collection, and interpretation), the regulatory agency has identified an additional upstream site for which it has 10 years of comparable long-term (historical) data. The investigators have no reason to believe that a time component exists in the long-term data. Table 4.1 presents descriptive information associated with the upstream and downstream sites and with the long-term site.

The question for investigators is this: Do the data reveal a downstream effect associated with the wastewater discharge?
Several methods are available for assessing the mean differences between the upstream and downstream sites, and each method has both positive and negative aspects.

Table 4.1 Descriptive statistics: upstream-downstream measures of benthic species richness.

                                                     10%-TRIMMED  MEDIAN ABSOLUTE
SITE          N     MEAN   STD.   MINIMUM   MAXIMUM     MEAN         DEVIATION
Upstream
  1           10    10.0   2.3      7.5      14.8        9.7           1.5
  2           10    12.6   2.5     10.3      18.0       12.2           1.3
  3           10    11.2   2.4      7.2      15.1       11.2           1.0
Downstream
  4           10    10.4   2.4      6.3      13.7       10.5           1.0
  5           10     7.7   3.7      3.4      14.7        7.4           2.7
  6           10     9.0   1.8      5.6      11.1        9.1           1.5
Historical
  7          200    10.4   3.4      0.17     19.4       11.1           2.6
Pooled Data
  1-3         30    11.3   2.5      7.2      18.0       11.1           1.0
  4-6         30     9.0   2.9      3.4      14.7        9.0           1.6

Random Sampling Model, External Value for σ

Suppose investigators believe that the 30 measures of benthic species richness collected at the upstream and downstream sites can be treated as random samples from appropriate populations. In particular, they believe that the two populations have the same form (i.e., normal distributions with the same variance, σ²) but different means, μa and μb. How can the investigators use statistical theory to make inferences about the effect of the wastewater treatment plant discharge?

If the data were random samples from the populations, with Na = 30 observations from the upstream population and Nb observations from the downstream population, the variances of the calculated averages, Ȳa and Ȳb, would be:

    V(Ȳa) = σ²/Na    V(Ȳb) = σ²/Nb    (4.1)

Likewise, in the random sampling model, Ȳa and Ȳb would be distributed independently, so that:

    V(Ȳa − Ȳb) = σ²(1/Na + 1/Nb)    (4.2)

Even if the distributions of the original observations had been moderately nonnormal, the distribution of the difference Ȳa − Ȳb between sample averages would be nearly normal because of the central limit effect.
Therefore, on the assumption of random sampling,

    z = [(Ȳa − Ȳb) − (μa − μb)] / [σ √(1/Na + 1/Nb)]    (4.3)

would be approximately a unit normal deviate. Now σ, the hypothetical population value for the standard deviation, is unknown. However, the historical data yield a standard deviation of 3.4. If this value is used for the common standard deviation of the sampled populations, the standard error of the difference Ȳa − Ȳb = 2.3 is

    σ √(1/30 + 1/30) = 0.89

Based on the robust estimators (trimmed mean difference of 2.1 and median absolute deviation of 1.6), the standard error of the difference would be 0.41.

If the assumptions are appropriate, the approximate significance level associated with the postulated difference (μa − μb) in the population means will then be obtained by referring

    z0 = [(Ȳa − Ȳb) − (μa − μb)] / 0.89

to a table of significance levels of the normal distribution. In particular, for the null hypothesis (μa − μb) = 0, z0 = 2.3/0.89 = 2.6, and Pr(z > 2.6) < .005. Again, the upstream/downstream effect seems to be realistic (using the robust estimators, z = 5.1 and Pr[z > 5.1] < .001). Note that we use the z distribution in this example because the population variance is determined from an external set of data that represents the population of interest, an assumption equivalent to assuming that the variance of the population is known (i.e., not estimated).

Random Sampling Model, Internal Value for σ

Suppose now that the only evidence about σ is from the Na = 30 samples taken upstream and the Nb = 30 samples taken downstream. The sample variances are

    s²a = Σ(Yai − Ȳa)² / (Na − 1) = 6.25

    s²b = Σ(Ybi − Ȳb)² / (Nb − 1) = 8.41

On the assumption that the population variances of the upstream and downstream sites are, to an adequate approximation, equal, these estimates may be combined to provide a pooled estimate s² of this common σ².
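Using the pooled upstream and downstream means and the historical standard deviation from Table 4.1, the external-σ z test above can be sketched as follows (a minimal illustration, not part of the guidance itself):

```python
import math

def two_sample_z(mean_a, mean_b, sigma, n_a, n_b):
    """z statistic for the difference of two means when the common
    population standard deviation sigma is treated as known (external)."""
    se = sigma * math.sqrt(1.0 / n_a + 1.0 / n_b)  # standard error of the difference
    return (mean_a - mean_b) / se

# Table 4.1 values: pooled upstream mean 11.3, pooled downstream mean 9.0,
# historical standard deviation 3.4, 30 observations in each group
z = two_sample_z(11.3, 9.0, 3.4, 30, 30)
p = 0.5 * math.erfc(z / math.sqrt(2))  # one-sided upper-tail normal probability

print(round(z, 1))   # 2.6, matching the text
print(p < 0.005)     # True, consistent with Pr(z > 2.6) < .005
```

Because σ comes from the 10-year historical record rather than the survey itself, the reference distribution is the standard normal, exactly as the text notes.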
This is accomplished by adding the sums of squares in the numerators and dividing by the sum of the degrees of freedom:

    s² = [Σ(Yai − Ȳa)² + Σ(Ybi − Ȳb)²] / (Na + Nb − 2) = 7.52

On the assumption of random sampling from normal populations with equal variances, in which the discrepancy [(Ȳa − Ȳb) − (μa − μb)] is compared with the estimated standard error of Ȳa − Ȳb, a t distribution with Na + Nb − 2 degrees of freedom is appropriate. The t statistic in this example is calculated as

    t = (Ȳa − Ȳb) / [s √(1/Na + 1/Nb)] = 2.3/0.71 = 3.2

This statistic is referred to a t table with 58 degrees of freedom. In particular, for the null hypothesis that (μa − μb) = 0, Pr(t > 3.2) ≈ .001. Again, an upstream/downstream effect seems plausible. Using the robust statistics, a pooled estimate of error can be calculated as the average of the median absolute deviations associated with each data set ([1.0 + 1.6] / 2 = 1.3). Therefore, the t statistic is 6.3 and Pr(t > 6.3) < .001. Note that we use the t distribution in this example because the population variance is estimated from the survey data and not assumed to be known.

Testing against a Numeric Criterion

In the preceding sections, hypothesis tests were presented for the two-sample case. Similar tests are available for testing a sample mean against a fixed numeric criterion (for which an associated uncertainty does not exist). In this case, the t statistic can be written as follows:

    t = (Ȳ − μ0) / (s / √n)    (4.4)

Here, s is the sample standard deviation, n is the sample size, and μ0 is the numeric criterion of interest. The probability of a greater value can be found in a t table using n − 1 degrees of freedom.

A Distribution-Free Test

In many instances, the assumption that the raw data (or paired differences) are normally distributed does not hold.
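The pooled-variance t statistic, and the one-sample test against a fixed numeric criterion, can both be sketched in a few lines. The criterion value of 8.0 below is hypothetical, not from the guidance, and the computed two-sample t of roughly 3.3 differs slightly from the text's 3.2 because the standard deviations in Table 4.1 are rounded.

```python
import math

def pooled_t(mean_a, var_a, n_a, mean_b, var_b, n_b):
    """Two-sample t statistic with a pooled variance estimate
    (assumes equal population variances); returns (t, degrees of freedom)."""
    s2 = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    se = math.sqrt(s2 * (1.0 / n_a + 1.0 / n_b))
    return (mean_a - mean_b) / se, n_a + n_b - 2

def one_sample_t(sample_mean, sample_std, n, criterion):
    """t statistic for testing a sample mean against a fixed numeric
    criterion, referred to a t table with n - 1 degrees of freedom."""
    return (sample_mean - criterion) / (sample_std / math.sqrt(n))

# Table 4.1 pooled values: upstream (11.3, s = 2.5), downstream (9.0, s = 2.9)
t, df = pooled_t(11.3, 2.5**2, 30, 9.0, 2.9**2, 30)
print(round(t, 1), df)

# Downstream pooled mean tested against a hypothetical criterion of 8.0
t1 = one_sample_t(9.0, 2.9, 30, 8.0)
print(round(t1, 1))
```

Note that the pooled estimate uses both samples' sums of squares, exactly as in the formula above, while the one-sample test needs no variance assumption about the criterion, which is a fixed number with no sampling uncertainty.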
Even the simplest monitoring design involving the comparison of two means requires either (1) a long sequence of relevant previous records that may not be available or (2) a random sampling assumption that may not be tenable. One solution to this dilemma is the use of distribution-free statistics such as the W rank sum test (Hollander and Wolfe, 1973). The W test is designed to test the hypothesis that two random samples are drawn from identical continuous distributions with the same center. An alternative hypothesis is that one distribution is offset from the other, but otherwise identical. Comparative studies of the t and W tests indicate that while the t test is somewhat robust to the normality assumption, the W test is relatively powerful while not requiring normality. In many cases, performing both the t and W tests can serve as a double check on the hypothesis.

To conduct the W test (see Chapter 2), the investigator combines the data points from the samples but maintains the separate sample identity. This overall data set is ordered from low value to high value, and ranks are assigned according to this ordering. To test the null hypothesis of no difference between the two distributions f(x) and g(x) (i.e., H0: f[x] = g[x]), the ranks of the data points in one of the two samples are summed:

    W = Σ Rj    (4.5)

Statistical significance is a function of the degree to which, under the null hypothesis, the ranks occupied by either data set differ from the ranks expected as a result of random variation. For small samples, the W statistic calculated in Equation 4.5 can be compared to tabulated values to determine its significance. Alternatively, for moderate to large samples, W is approximately normal with mean E(W) and variance V(W):

    E(W) = Na(Na + Nb + 1) / 2    (4.6)

    V(W) = NaNb(Na + Nb + 1) / 12    (4.7)

    z = [W − E(W)] / √V(W)    (4.8)

In the upstream/downstream case that we have been discussing, W = 1,127, z = 3.12, and Pr( |
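The normal approximation in Equations 4.6 through 4.8 can be sketched as follows. The richness values below are hypothetical illustrations, not the survey data, and this sketch ignores tied ranks (real data would require midranks for ties):

```python
import math

def rank_sum_z(sample_a, sample_b):
    """Normal approximation to the W rank-sum test: rank the combined
    data, sum the ranks of sample_a, and standardize with E(W) and V(W)."""
    combined = sorted((x, i) for i, x in enumerate(sample_a + sample_b))
    n_a, n_b = len(sample_a), len(sample_b)
    # Assign ranks 1..N by sorted order (ties broken arbitrarily in this sketch)
    ranks = {idx: rank for rank, (_, idx) in enumerate(combined, start=1)}
    w = sum(ranks[i] for i in range(n_a))            # Equation 4.5
    e_w = n_a * (n_a + n_b + 1) / 2.0                # Equation 4.6
    v_w = n_a * n_b * (n_a + n_b + 1) / 12.0         # Equation 4.7
    return (w - e_w) / math.sqrt(v_w)                # Equation 4.8

# Hypothetical upstream vs. downstream richness values (not the survey data)
up = [10, 12, 11, 13, 9, 14, 12, 11, 10, 13]
down = [8, 9, 7, 10, 6, 9, 8, 7, 11, 8]
print(round(rank_sum_z(up, down), 2))
```

Because only the ranks enter the statistic, the test requires no assumption about the shape of the underlying distributions, which is precisely its appeal when normality is doubtful.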