EMAP Statistical Methods Manual


                                                 EPA/620/R-96/002
                                                'August  20,  199.6_
EMAP  Statistical Methods Manual
                            by

                      Susan Diaz-Ramos
                      Anteon Corportion
                     200 S.W. 35th Street
                    Corvallis, Oregon 97333

                      Don L. Stevens, Jr.
                     Dynamac Corporation
                     200 S.W. 35th Street
                    Corvallis, Oregon 97333

                      Anthony R. Olsen
               NHEERL Western Ecology Division
              U.S. Environmental Protection  Agency
                     200 S.W. 35th Street
                     Corvallis, OR 97333
                         May 1996
        Environmental Monitoring and Assessment Program
    National Health and Environmental Effects Research Laboratory
               Office of Research and Development
              U.S. Environmental Protection Agency
                     Corvallis, OR 97333

-------
TECHNICAL REPORT D^
(Please read Instructions on the reverse be]
l. REPORT NO.
EPA/620/R-96/002
2.
4. TITLE AND SUBTITLE
EMAP statistical methods manual
7. AUTHORIS)
1S. Diaz-Ramos, 'D.L. Stevens, Jr., 2A. R. Olsen
9. PERFORMING ORGANIZATION NAME AND ADDRESS
'MERSC, Corvallis, OR,
^S EPA, NHEERL, Corvallis, OR
12. SPONSORING AGENCY NAME AND ADDRESS
US EPA ENVIRONMENTAL RESEARCH LABORATORY
200 SW 35tii Street
CorvaJlis, OR 97333
15. SUPPLEMENTARY NOTES
1996. U.S. Environmental Protection Agency, National
Research Laboratory, Corvallis, OR.
tTA 	 	

^'''lil'BliBII |
. PB96-2054S3 1
5. REPORT DATE
8/20/96
6. PERFORMING ORGANIZATION CODE
8. PERFORMING ORGANIZATION REPORT NO.
10. PROGRAM ELEMENT NO.
11. CONTRACT/GRANT NO.
13. TYPE Of REPORT AND PERIOD COVERED
P'lhlJRhed Report
14. SPONSORING AGENCY CODE
Health and Effects Environmental
16. ABSTRACT
As part of the EMAP multi-tier design research effort, a statistical methods manual
was completed to address estimation problems for the survey design tier. The
production of this manual ensures that statistical estimation procedures used in the
survey tier are documented and available to others. A particular focus is ensuring
the methods are available to the Regional EMAP studies. The procedures are detailed
so that a scientific computer programmer can implement the methods . As new methods
are developed they will be added to the manual.
17.
Z. DESCRIPTORS
KEY WORDS AND DOCUMENT ANALYSIS
b.lDENTIFI
Environmental monitoring, Statistical
analysis, survey de.sign, Environmental
Monitoring and Assessment Program (EMAP)
IB. DISTRIBUTION STATEMENT
19. SECURI
20. SECURI
ERS/OPEN ENDED TERMS C. COSATI Field/Group

TY CLASS (This Report) 21. NO. OF PAGES
TY CLASS (This page: 22. PRICE
EPA Form 2220-1 (R»». 4-77) '   PREVIOUS  EDITION is OBSOLETE

-------
ABSTRACT

The Statistical Methods Manual documents statistical analysis methods applicable to
data collected by the Environmental Monitoring and Assessment Program (EMAP). The
methods described give procedures to estimate the current status of ecological resources that
are appropriate for survey designs implemented by EMAP. The methods apply to analyses of
EMAP regional demonstration studies and R-EMAP studiesr-Sufficient information is given
to enable a user to determine if the method is appropriate-for the survey design used in these
studies. Additional methods will be added as appropriate to include updated analyses
procedures or to cover additional EMAP or R-EMAP studiesr- The audience for the manual
are statisticians or scientists with a reasonable background in statistics. The calculations are
detailed so that a scientific computer programmer can implement the methods.

Key Words: survey design, cumulative distribution estimation,'Status estimation, ecological
monitoring, U.S. EPA-EMAP.

Preferred citation:

Diaz-Ramos, S., D.L. Stevens, Jr., and A.R. Olsen. 1996. EMAP Statistics Methods Manual.
EPA/620/R-96/XXX. Corvallis, OR: U.S. Environmental Protection Agency, Office of
Research and Development, National Health and Environmental Effects Research
Laboratory.

ACKNOWLEDGEMENTS

We could not have prepared this methods manual without the cooperation and aid of
many individuals. Over the past five years, we have benefitted from technical discussions
with all participants in the EMAP Design and Statistics research effort. Kathleen Purdy, a
graduate student at Oregon State University, wrote two of the methods on deconvolution of
measurement error. We thank Danny Kugler for his work on the method that presents
simplified algorithms suitable for spreadsheet software. We recognize the work by Doug
Heimbuch, Harold Wilson, and John Seibel of Coastal Environmental Services, Inc. and Steve
Weisberg and Jon Volstad of Versar, Inc. who wrote two of the general overview papers
included. Similarly, the report by Virginia Lesser and Scott Overton was central to our effort
and is included for completeness.

Notice

The research described in this document has been funded by the U.S. Environmental
Protection Agency. This document has been prepared at EPA National Health and
Environmental Effects Research Laboratory, Western Ecology Division, in Corvallis, Oregon,
through Contract #68-C4-0019. It has been subjected to the Agency's peer and administrative
review and approved for publication. Mention of trade names or commercial products does
not constitute endorsement for use.

-------
CONTENTS
ABSTRACT

ACKNOWLEDGEMENTS

INTRODUCTION
BACKGROUND .......... '... ..................................... 3
Survey Design Approach ......................................... 3
Sampling Ecological Resources ............................... 4

ESTIMATION AND ANALYSIS ................................... 6

MISSING DATA .......................................... 10

REFERENCES ......... ................................ 12

STATUS ESTIMATION METHODS

METHOD 1: Cumulative Distribution Function for Proportion of a Discrete or an Extensive
Resource; Horvitz-Thompson Estimator, Normal Approximation

METHOD 2: Cumulative Distribution Function for Total Number of a Discrete or an
Extensive Resource; Horvitz-Thompson Estimator, Normal Approximation

METHOD 3: Size- Weighted Cumulative Distribution Function for Proportion of a Discrete
Resource; Horvitz-Thompson Estimator, Normal Approximation

METHOD 4: Size-Weighted Cumulative Distribution Function for Total of a Discrete
Resource; Horvitz-Thompson Estimator, Normal Approximation

METHOD 5: Variance of the Cumulative Distribution Function for Proportion of a Discrete
Resource; Horvitz-Thompson Variance Estimator

METHOD 6: Variance of the Cumulative Distribution Function for Total Number of a
Discrete Resource; Horvitz-Thompson Variance Estimator

METHOD 7: Variance of the Size-Weighted Cumulative Distribution Function for
Proportion of a Discrete Resource; Horvitz-Thompson Variance Estimator
111

-------
                                    CONTENTS
METHOD 8:  Variance of the Size-Weighted Cumulative Distribution Function for Total of
              a Discrete Resource;  Horvitz-Thompson Variance Estimator

METHOD 9:  Cumulative Distribution Function and Variance for Proportion of a Finite
              Population; Parametric Jackknife Estimator

METHOD 10: Variance of the Cumulative Distribution Function for Proportion of an
              Extensive Resource; Horvitz-Thompson Variance Estimator

METHOD 11: Cumulative Distribution Function and Variance for Proportion of a Resource;
              Simulation-Extrapolation Method

METHOD 12: Variance of the Cumulative Distribution Function for Proportion of a Discrete
              or an Extensive Resource;  Yates-Grundy Variance Formula

METHOD 13: Simplified Variance of the Cumulative Distribution Function for Proportion
              (Discrete or Extensive) and for Total Number of a Discrete Resource, and
              Variance of the Size-Weighted Cumulative Distribution Function for
              Proportion and Total of a Discrete Resource;  Simple Random Sample
              Variance Estimator

APPENDICES

    A.    Answers to Commonly Asked Questions about R-EMAP Sampling Designs and
          Data Analyses

    B.    R-EMAP Data Analysis Approach for Estimating the Proportion of Area that is
          Subnominal

    C.    EMAP Status Estimation:  Statistical Procedures and Algorithms
                                         IV

-------
INTRODUCTION

The Statistical Methods Manual documents statistical analysis methods applicable to data
collected by the Environmental Monitoring and Assessment Program (EMAP). A primary use
of the EMAP data is to estimate the current status of ecological resource characteristics using
scientifically sound procedures. The methods described give procedures to estimate current
status applicable to survey designs implemented by EMAP. A distinct feature of EMAP is
the use of survey designs as the foundation for site selection and subsequent scientific
inference to an ecological resource target population. Consequently, it is essential that the
appropriate statistical analysis method be linked with the survey design used for the collection
of the data.

The audience for the manual are statisticians or scientists with a reasonable background
in statistics. The methods were written with sufficient detail so that a scientific computer
programmer can implement the calculations; for this reason, the methods contain more
simplified notation than that used in this introduction. The appendices A and B are intended
for those with little statistical training who may become involved in the analysis of R-EMAP
studies. See appendix C for more information on the general theoretical development upon
which the algorithms in this manual are based.

The methods in the manual are appropriate to use for analyses of EMAP regional
demonstration studies and R-EMAP studies. The methods give-sufficient information to -
enable a user to determine if the method is appropriate for the survey design used in these
studies. Most methods reference one or more EMAP or R-EMAP studies for which the
method is appropriate.

Most of the methods in the document provide estimators for the cumulative distribution
or its variance for a variety of survey designs and conditions. Method 13 provides simplified
estimation algorithms for those using spreadsheet software. These estimates are to be used
for internal research only and not intended for use in any internal or external documents.
Methods 9 and 11 address the case when substantial measurement error is present in the data
(observations). In this case, the estimator of the cumulative function is biased. The bias may
be substantial and is most prevalent in the tails of the distribution. These two methods
present techniques to adjust for this bias. The following table provides a quick summary of
the methods.

-------
STATISTIC
CDF for proportion
Variance of the CDF for
proportion
CDF for total number
Variance of the CDF for
total number
Size-weighted CDF for
proportion
Variance of the size-
weighted CDF for
proportion
Size-weighted CDF for
total
Variance of the size-
weighted CDF for
total
CDF for proportion or
total in the presence of
measurement error;
Variance
CDF for proportion or
total in the presence of
measurement error;
Variance
RESOURCE
Discrete
Extensive
Discrete
Extensive
Discrete
Discrete
Discrete
Discrete
Discrete
Discrete
Discrete
Discrete
ESTIMATOR
Horvitz-Thompson
Horvitz-Thompson
Yates-Grundy
Simple Random Sample
Horvitz-Thompson
Yates-Grundy
Simple Random Sample
Horvitz-Thompson
Horvitz-Thompson
Simple Random Sample
Horvitz-Thompson
Horvitz-Thompson
Simple Random Sample
Horvitz-Thompson
Horvitz-Thompson
Simple Random Sample
Parametric Jackknife;
Horvitz-Thompson
Simulation-Extrapolation
(SMEX); SIMEX
Variance
METHOD
#
1
5
12
13
10
12
13
2
6
13
3
7
13
4
8
13
9
11
    We highly recommend that any analysis of EMAP regional demonstration study data or
R-EMAP study data be preceded by a thorough reading of reports that document the survey

-------
.-design, field measurement protocols, and indicator descriptions. This information is available
in EMAP reports and should be reviewed.

BACKGROUND

The Environmental Monitoring and Assessment Program is an interagency,
interdisciplinary program that will contribute to decisions on environmental protection and
management by integrating research, monitoring, and assessment. EMAP's strategies use
rigorous science while taking into account social values and policy-relevant questions. It was
initiated by EPA's Office of Research and Development to monitor status and trends in the
condition of ecological resources, to develop innovative methods for anticipating emerging
environmental problems, and in general, to provide a greater capacity for assessing and
monitoring the condition of the nation's ecological resources (Messer et al. 1991).

EMAP was designed to provide information that will enable policy-makers, decision-
makers and the public to:

• Estimate the current status, trends, and changes in selected indicators of the Nation's
ecological resources on a regional basis with known confidence.

• Estimate the geographic coverage and extent of the Nation's ecological resources with
known confidence.

• Seek associations between selected indicators of natural and anthropogenic stresses and
indicators of condition of ecological resources.

• Provide annual statistical summaries and periodic assessments of the Nation's
ecological resources.

A general overview of EMAP in mostly non-technical language is in the EMAP Program
Guide (Thornton et. al., 1993). Additional information on the assessment framework used by
EMAP as a common approach for planning and conducting a wide variety of ecological
assessments is given by EPA (1994). The statistical analysis of data from EMAP is best
undertaken with an understanding of the measurement selection process. Barber (1994)
describes the indicator development strategy used by EMAP in their regional demonstration
studies. The statistical methods in this report were intended primarily for these demonstration
studies and as well as studies conducted by EPA Regions in conjunction with EMAP.

Survey Design Approach

A distinctive feature of EMAP is strict reliance on probability sampling. Overton et al.
(1990) describe the conceptual framework for the sampUng-design approach for EMAP.
Stevens (1994) gives a description of how the conceptual framework is used in research-
demonstration studies for particular ecological resources. The implementation of the

-------
conceptual framework required development of sampling designs directed at environmental
resources distributed over space.

Probability sampling is fundamental to EMAP. Probability sampling provides the basis
for estimating resource extent and condition, for characterizing trends in extent or condition,
and for representing spatial pattern, all with known certainty. A probability sample has some
inherent characteristics that distinguish it from other samples: first, the population being
sampled is explicitly described; second, every element in the population has the opportunity
to be sampled with known probability; and third, the selection is carried out by a process that
includes an explicit random element. A probability sample from an explicitly defined
resource population is a means to certify that the data collected are free from any selection
bias, conscious or not. This probability sample is an essential requirement for a program such
as EMAP that aims to describe the condition of our national ecological resources. Further,
analytical methods that are as free as possible from the appearance of subjectivity are also
required. These two requirements are satisfied in EMAP by adherence to probability-based
sampling protocols and analytical methods that rely on the statistical design for their
inferential soundness. Thus, EMAP relies on design-based inference procedures for basic
estimates of population descriptors. See Hansen et al. (1983), Sarndal (1978), or Smith
(1976) for discussions of the issues involved in design-based versus model-based inference.
These issues are also discussed in a spatial context by de Gruijter and Ter Braak (1990) and
Brus and de Gruijter (1993).

Design-based inference relies on the methodology of statistical survey sampling (Cochran
1977) to extend the results from a sample to the population. This extension is valid only with
a probability sample. The design specifies what information is to be collected at specified
locations; there must also be protocols or methods that are coherent with the design, and that
specify how the inference is drawn. The combination of a sample design and an inference
protocol is called a sampling plan. This plan includes the prescription of not only what and
where to sample, but also how to analyze the resulting data. In many instances in EMAP, the
resource groups used novel sampling designs tailored to the resource. These designs are
documented in Overton et al. (1990) and Stevens (1994, 1995) and in the various research
plans for the particular resources. A general prescription for the analyses is given in Overton
et al. (1990), and specific details of the analyses for some designs are in Lesser and Overton
(Appendix C). However, fhese documents do not cover all of the designs that have been
applied by the EMAP resource groups. This Methods Manual fulfills the second part of the
prescription for a sampling plan by providing detailed descriptions of the methods for
analyzing data collected using any of EMAP's sample designs along with computational
algorithms where appropriate.

Sampling Ecological Resources

The property of a particular ecological resource that has the most impact on a statistical
sampling plan is the dimension of the conceptual representation of the resource in two-
dimensional space. An implementation of the sampling strategy may represent (or model) the

-------
• resource populations as points, lines, or areas. Resources that are represented as points for
sampling purposes are labeled discrete resources. A discrete resource — such as small to
medium-sized lakes—has distinct, natural units. Such a resource is represented in
2-dimensionaJ space as a point because the objective of the sampling is to describe the
resource unit as an entity, even though the resource unit may occupy appreciable area in the
landscape. An attribute associated with a unit of a discrete resource, such as pH or an
indicator of biodiversity, is viewed as a property of the entire unit. The ensemble of all units
of a discrete resource is treated statistically as a finite population. Population inferences for a
discrete resource are most appropriately based on numbers of units that possess some
property. For example, a statement about lakes in good condition would pertain to numbers
of lakes, not, for instance, surface area of lakes. An inference couched in terms of surface
area might be possible, but neither the sampling plan nor the measurements taken on the units
would be well suited for such an inference.

Resources such as streams, riparian wetlands, or forested shelter belts may be given a 1-
dimensional representation in 2-dimensional space, and sampled as linear resources. In fact,
such resources are 2-dimensional, but their area is very small in proportion to landscape area.
These features are much longer than they are wide, and they do not have well-defined natural
units. Inferences are appropriately stated in units of length, e.g., proportion of stream-miles
in poor condition. Attributes are viewed as being defined at a point rather than being
associated with a unit. Thus, a chemical concentration might change continuously along the
length of a stream and be defined and measurable at every point on the stream.

Resources that extend over large regions in a more or less continuous and connected
fashion are treated as 2-dimensional, or extensive resources. Like the linear resource, an
extensive resource does not have distinct natural units. Instead, it covers relatively large
. sections of the landscape and lacks a high degree of functional integration. For example,
forests, arid ecosystems, and large wetlands such as salt marshes or the Everglades fall into
this category. The domain of an extensive resource has area; it does not consist of a
collection of separable points. An attribute of an extensive resource is viewed as a definition
of a surface in the sense that it is possible in principle to assign a value to the attribute at
every point in the domain. Generally, the attribute surface is reasonably smooth, although
there may well be step discontinuities. For example, the domain of a forest could include
stands of 50-year-old timber and adjacent newly clear-cut areas. A parameter measuring
biomass could show a discontinuity as the boundary is crossed. Population inferences are
usually based on area of the resource with some property, e.g., acres of forest with a visual
canopy rating indicative of degraded condition.

The distinctions between discrete, linear, and extensive are not always clear, and in some
cases a resource may be viewed as both: a resource consisting of isolated fragments may be
treated as extensive for sampling but as discrete for analysis, or vice versa. Greater
efficiency, that is, lower variance for a fixed sampling effort, will usually result if the
sampling and analysis are carried out from the same viewpoint. For example, streams could
be sampled as a finite population of stream segments defined by confluences (discrete), but

-------
 analyzed in terms, of miles. of stream channel (linear).  Thus, a simple random sample of a
 stream-segment population results in a variable probability sample of points on streams, and
 is not the most efficient sample to make an inference about miles of streams.

     Linear and extensive resources are sampled somewhat differently but analyzed using
 similar methods.  The important distinction in the analysis is between finite, discrete
 populations and infinite, continuous populations.  Methods for both types of populations are
 provided in this document.

 ESTIMATION AND ANALYSIS

     Each resource to be sampled can be represented by a set, R, whose elements index the
 points where the resource  exists.  Thus, for a discrete resource, R = (s}, s2, ..., SN] where
 each si represents the location of one unit of the resource.  If R is, for example,  a set of lakes,
 then each j,- represents the location of one of the  lakes in R.  For an extensive resource, R is
 the set of points covered by the resource, for example, the area covered by forest or a linear
 stream network.  If R represents a forest,  then each s e R is  a point in the forest; if R is a
 stream network, each s e. R is a. point on  some stream in the network.  Each attribute of
 interest of the resource R is a fixed but unknown function defined on  R; that is,  at each
 element s e R there is a fixed value of the attribute denoted  as z(s).  The population
 parameter to be estimated  is the total of the attribute over R,  that is ZT  -  ^  ^C^-) m tn€
                                                                      5, € R
                       r
 discrete case or IT =  \z(s)ds  in the continuous case.  This is a quite general population
                      J
                      R
 parameter, because estimates of mean values, variances, proportions, and distribution
 functions can all  be formulated as estimates of sums or integrals over R.   For example, the
 distribution function Ft(x)  for z(s) over ^?  is the proportion of R with value of z less than or
 equal to x. For a  discrete  resource, this is
                             FM  - Oil

                                          !,.€ V
For an extensive resource, the distribution function is
where
                                                      1  r f  fi
                                                      0  otherwise

-------
    The methods from finite'-population sampling can be applied to make inferences about ZT
for discrete resources.  Finite population sampling methods are extensively developed and
well-documented (Cassel el al.  1977, Cochran 1977, Kish 1965, Thompson 1992, Yates
1960).  However, environmental populations are, in many instances, more appropriately
conceptualized as continuous, infinite populations rather than discrete and finite.

    Estimates of extent for a resource (e.g., wetlands, forests) or for a subset of a resource
(e.g.,  salt marshes, deciduous forest) can be obtained from classification of a sample.
Estimates of ecological condition for a resource class are generated from condition indicators.
Cumulative distribution functions with confidence bounds are the fundamental method for
describing regional (or national) condition in EMAP.  The essential feature of this approach is
the emphasis on estimating the  cumulative total (or proportion) of a resource class with an
indicator of condition (or area)  less than or equal to a specified value (e.g., the proportion
with indicator value less than or equal to some value of interest).  Although distribution
functions provide the estimates  of condition, the information from them can be presented in
several forms (bar graphs, tables, distribution function plots), with the choice of format
related to the intended  audience.

    The primary theoretical justification for the estimation methods presented in this
document is the  Horvitz-Thompson Theorem (Horvitz and Thompson,  1952)  or its continuous
population analogue (Cordy,  1994; Stevens, 1995 submitted). The sampling background to
Horvitz-Thompson estimation is summarized here very briefly to provide a context for the
estimation methods presented in the body of this document.  The theory and notation are very
similar for discrete and continuous resources.

    The inference paradigm  is based on the inclusion probabilities and the pairwise inclusion
probabilities of the sampled units under the following sampling model:  A sample is selected
from  the universe U by picking the values of n random variables s}, s2, •-, sn from a joint
probability distribution specified by Pr(s}, s2, ..., SN), which is defined by the sampling
design.  (In EMAP, the s, can be thought of as points, as they will be actual points in  an
extensive resource or reference  points that identify the location of a discrete resource.) The
selected points are classified as being in or out of some target population R, and z need be
determined only for those points in R.  In general,  this sampling method gives a fixed total
sample size («),  but the size of  the achieved sample in R is a random variable.  Allowing a
random sample size entails some technical complication but provides valuable flexibility.  In
particular, it provides the ability to make estimates for arbitrary subpopulations, that is, R
could be defined after all the sampling has taken place.  The only difference  between the
discrete and extensive case is the form of the probability distribution:  in the discrete case, the
probability distribution gives  the numerical probability that a particular sample is selected; in
the continuous case, the probability is replaced with a probability density  function for the
samples.

-------
     For the discrete case, the inclusion probability n(k) for unit k is the probability that unit k
 is included in the sample, i.e., Pr(sj = k or s2 = k or.. .or sn = k).   For designs such that
 Pr(sf = Sj) = 0 for all / *;', (e.g., sampling without replacement)
Tht joint inclusion probability for units k and / is the probability that units k and / are
simultaneously in the sample, and is given by
                                               .  = jfc, Sj  =
    In cases where R is not finite, but rather an extensive resource, continuous probability
distributions are used to specify the sampling design. As a result, the inclusion probability
functions used in the discrete case are replaced with inclusion density functions. Let
f(S2,s2,...,sn) be the joint probability density function (pdf) of the sample locations,/^ be the
(marginal) pdf of s,, and let/^j,  t) be the joint pdf of sf and  s-, / *j.  The inclusion density
function is defined  by
                                             n
                                    «(*)  =  E /,- (*)•
                                           i = i
The pairwise inclusion density function for s,  re U is  defined by


                              **>  o  =  E  Tfijd.v-
                                        i = 1 j * i

    Horvitz and Thompson (1952) provided an estimator of the population total for variable-
probability, without-replacement, finite-population sampling design, along with an expression
for the variance of the estimated total and a related variance estimator. Alternative
expressions for the  variance and its estimator were provided by Yates and Grundy (1953) and
Sen (1953). (The variance estimators associated with Horvitz and Thompson are given in
subsequent equations and are denoted by the subscript "HT"; the Yates-Grundy forms are
denoted by the subscript "YG".)  As was shown by Cordy (1993), a version of the Horvitz-
Thompson theorem holds when sampling from V when  the inclusion density and pairwise
inclusion density function are defined as above.

    The Horvitz-Thompson theorem provides estimators of the total (sum or integral) of z
over fl and its associated variance in terms of the quantities  z(sf), n(sf), and n(si ,Sj ).  The

                                            8

-------
 form for the estimator of the  total is the same for both the discrete and continuous versions;
 the only difference between the two is the expression for the variance of the estimator.  The
 (unbiased) estimator of ZT is given by
                                         71   I
                                         -   '
. The estimators of variance of ZT for the discrete case are
                                    (=1
     or
             n   n
                                   :• ";V
            1=1 _/>/
 and for the continuous case,
                      2/ „ \    n   n
            i-l
     or

 All of the above estimators of variance are unbiased, provided n(s, t) > 0 in U.

     An estimator of the mean of z can be obtained by dividing ZT by the size of R (the

 number of units in ^?, or the length or area of /?) , i.e.,  fi.7 = ZT / NR or  \iz = ZT / AR

-------
 The estimator Zj-  will tend to have low variance if z and it are strongly positively correlated.

 Since many environmental surveys have multiple objectives and collect observations on
 multiple attributes at the same location, there will often be little or no correlation between z
 and TI.  A ratio estimator (so-called because it is the ratio of two estimators) for ^ of the
                                    n
                  *           *     ^^>
 form fizr  = ZT I  AR , where  AR  = £, UR(S^ /  ^C^-)] estimates AR , may well be more
                                   1=1

 precise  than  pz .  The two estimators  ZT and AR  are subject to  the same sources of

 sampling variation, and  hence  are likely to be positively correlated. Thus,  if there is
 substantial variability in AR ,  pz r will likely be more precise  than \iz . The ratio estimator

 of the total is then zTr  = jiz r AR .   An approximate variance estimator for ZT r  is obtained

 by applying either the Horvitz-Thompson or  Yates-Grundy formulas with
 d(st) = z(s,) - |Xr in place  of 1(5$.


    The distribution function FZ(X) of the response  z is estimated by applying the Horvitz-
 Thompson theorem to the indicator function  I{teR\z(t)£x}(s) •   An unbiased estimator of the

 size (number, length, or area) of the subset of R with indicator z(t) < x is given by
so that  FZ(X)  - ZT(X) I AR  is an unbiased estimator of Fz(x).  The ratio estimator

 FZ r(x) = Zj(x) I AR avoids the possibility of obtaining estimates that exceed 1, and in

many cases will be more precise than  FZ(X) , for the same reasons as given for ^ r relative

to £LZ .  An approximate variance estimator for FZ T is obtained by applying the Horvitz-

Thompson or Yates-Grundy formulas with d-^x) = ^efli^f)^ }(•*,)  - FZ  r(x) in  place of
                     ~2
   i) and dividing by AR .
MISSING DATA

    All surveys must address the issue of how to handle missing data in statistical estimation.
Missing data should always be investigated for patterns, including why it is missing.  Two
types of missing data are possible in EMAP or R-EMAP surveys.  One  type is a missing
sample unit, such as  a missing lake, stream location, or forest site.  Sample units may be

                                           10

-------
missing due to inaccessibility, land owner refusal, or other reasons. Finding detectable
patterns in missing data could lead to alterations in survey management, including obtaining
access permission and identifying situations where the population inference needs to be
qualified.

The other type of missing data occurs within a sampling unit, such as a missing
observation for an indicator such as a chemical concentration or habitat structure variable.
Observations may be missing due to field collection problems, lost samples, laboratory
analysis problems, or other reasons. Although it is possible to use different statistical
methods to address the two types of missing data, for the purposes of this manual the two
types will be treated the same. We associate all missing data as being a missing sample unit.

Two views may be taken. For each view, the missing sample units unavailable for
measurement can be considered to be a subset of the target population of interest. One view
is to remove this subset from the target population by redefining the target population as the
original target population excluding the missing subset. The statistics methods may then be
applied as given without adjustments. Another view is to assume the data are missing at
random and retain the original definition of the target population. In this case, status
estimators of the cumulative distribution expressed as a proportion or fraction of the total
remain unbiased estimators. Estimators for population totals or cumulative distributions
expressed as amounts (number, length, area) are biased.
11

-------
REFERENCES
Barber, M. C. ed.  1994.  Environmental Monitoring and Assessment Program:  Indicator
    development strategy. EPA/620/R-94/022.  Athens, GA: U.S. Environmental Protection
    Agency, Office of Research and Development.

Bras, D. J.,  and  J. J. de Gruijter.  1993.  Design-based versus model-based estimates of spatial
    means:  Theory and application in environmental soil science. Environmetrics 4:123-152.

Cassel, C. M., C. E. Sarndal, and J. H. Wretman.  1977.  Foundations of inference in survey
    sampling. New York:  John Wiley.

Cochran, W. G.  1977. Sampling techniques. 3rd Edition.  New York:  John Wiley & Sons.

Cordy, C. B.  1993.  An extension  of the Horvitz-Thompson theorem to  point sampling from a
    continuous universe.  Statistics and Probability Letters 18:353-362.

de Gruijter,  J. J., and C.  J. F. Ter Braak.  1990. Model free estimation  from survey samples:;,
    A reappraisal of classical sampling theory.  Mathematical Geology 22:407-415.

Hansen, M.  H., W. G. Madow, and B. J. Tepping.  1983.  An evaluation of model-dependent
    and probability sampling inferences in sample surveys.  Journal of the American Statistical
    Association  78:776-760.

Horvitz, D. G., and D. J. Thompson.  1952.  A generalization of sampling without replacement
    from a finite universe.  Journal of the American Statistical Association 47:663-685.

Kish, L. 1965.  Survey sampling. New York: John Wiley & Sons.

Lesser,  V. M., and W. S.  Overton.  1994.  EMAP status estimation: Statistical procedures and
    algorithms.  EPA 620/R-94/008.  Corvallis,  OR:  U.S. Environmental Protection Agency,
    Environmental Research Laboratory.

Messer, J. J., R.  A. Linthurst, and W. S. Overton.  1991.  An EPA program for monitoring
    ecological status and  trends.  Environmental Monitoring and Assessment  17:67-78.

Overton, W. S., D. White, and D. L. Stevens Jr.  1990. Design report for EMAP,
    Environmental Monitoring and Assessment Program.  EPA 600/3-91/053. Corvallis, OR:
    U.S. Environmental Protection Agency,  Environmental  Research Laboratory.

Sarndal, C.  1978.  Design-based  and  model-based inference for survey sampling.
    Scandinavian Journal of Statistics 5:27-52.

                                          12

-------
Sen, A. R.  1953.  On the estimate of the variance in sampling with varying probabilities.
    Journal of the'Indian Society' of Agricultural Statistics 7:119-127.

Smith, T. H.  1976.  The foundations of survey sampling:  A review. Journal of the Royal
    Statistics  Society A.

Stevens, Jr., D. L.  1994.  Implementation of a national environmental monitoring program.
    Journal of Environmental Management 42:1-29.

Stevens, Jr., D. L.  1995.  A family of designs for sampling continuous spatial populations.
    Environmetrics.   Submitted.

Thompson,  S. K.  1992.   Sampling.  New York:  Wiley.
                 -V                                 '
Thornton, K. W., D. E. Hyatt, and C. B. Chapman, eds.  1993. Environmental Monitoring and
    Assessment Program guide.  EPA/620/R-93/012. Research Triangle Park, NC: U.S.
    Environmental Protection Agency, Office of Research and Development.

U.S. Environmental Protection Agency (EPA).  1994.  Environmental Monitoring and
    Assessment Program assessment framework. EPA/620/R-94/016. Research Triangle Park,
    NC:  U.S. Environmental Protection Agency, Office of Research and Development.

White, D., A.  J. Kimerling, and W. S. Overton. 1992:  Cartographic and geometric
    components of _a  global sampling design for environmental monitoring. Cartography and
    Geographic Information Systems  191:5-22.

Yates, F  1960.  Sampling methods for censuses and sur\>eys. London:  Charles Griffin & Co.

Yates, F., and P. M. Grundy.  1953.  Selection without replacement from within  strata with
    probability proportional to size. Journal of the Royal Statistical Society 815:253-261.
                                          13

-------
              EMAP Estimation Method 1, Rev. No. 0, May 1996, Page 1  of 12
 ESTIMATION METHOD 1: Estimation of the Cumulative Distribution Function for the
 Proportion of a Discrete or an Extensive Resource; Horvitz-Thompson Estimator, Normal
 Approximation

 1  Scope and Application

 This method calculates the estimate of the cumulative distribution function (CDF) for the
 proportion of a discrete or an  extensive resource that has an indicator value equal to or less
 than a given indicator level.  The method applies to any probability sample and presents two
 estimators.  An estimate can be produced for the entire population  or for  an arbitrary
 subpopulation with known or  unknown size.  In the discrete case, this size is the number of
 units in the subpopulation. In the extensive case, this size is the subpopulation extent.
 Suggestions for estimating the CDF over the range of the indicator are included.
 Alternatively, the CDF can be calculated at the indicator levels found in the probability
 sample.  The method uses the Normal approximation to provide confidence bounds or
 intervals for the true cumulative distribution function. This method does  not include variance"
 estimators for the estimated CDF.  For information on appropriate variance estimators, refer
 to Section 7.

 This method has been applied in:

       The 1991 Surface  Waters Pilot Report
       EMAP-Estuaries Louisianian  Province 1991 Annual Statistical Summary
       EMAP-Estuaries Virginian  Province 1991 Annual Statistical Summary
       The Forest Health Monitoring 1992 Annual Statistical Summary

 2  Statistical Estimation Overview

 A sample of size na units is selected from subpopulation a with known inclusion probabilities
 TC = { 7tj, — ,7Cz-,-,7Cn }.  The indicator is evaluated for each unit and represented by

 y ={Ji>'"'>'/''"'3'n }•   When  sampling an extensive resource, the inclusion probabilities are
 replaced by the inclusion  density function evaluated at the sample locations.

 Estimates of the cumulative distribution function are obtained for the indicator levels of
 interest, x = {x^, — ,xk, — ,xm}. Several alternatives are available for choosing x. The
 recommended alternative is the use of equally spaced values across the range of the indicator.
Ideally, this range  is known a  priori  and extends beyond the range  of any particular data set.
 A second alternative is  to use  the set of unique values in the data set.  This alternative gives
 the classical empirical cumulative distribution function.  A third alternative is to use the
midpoints of adjacent ordered  values in y  for the levels x.

-------
EMAP Estimation Method 1, Rev. No. 0, May 1996, Page 2 of 12

A
To obtain the estimated cumulative distribution function, Fa(xk), the Horvitz-Thompson
estimator of a cumulative total is calculated for each xk by summing up the number of
indicators which are less than or equal to the xk value. This total is then divided by the
subpopulation size, Na . When this subpopulation size is unknown, the estimated
subpopulation size, Na , is substituted for the known subpopulation size, Na , to form the
Horvitz-Thompson ratio estimator.

The Horvitz-Thompson ratio estimator may perform better than the estimator using the known
subpopulation size, Na , and may be used even if the subpopulation size is known. Some of
the conditions under which this ratio estimator is recommended are given in Section 9. This
-ratio estimator should always be used in the case of missing data.

Confidence limits for Fa(x/.) are produced by assuming a Normal distribution. These limits
may be used to construct either a lower confidence bound, an upper confidence bound, or a
confidence interval for Fa(xk). Computation of these limits requires an estimated variance

of Fa(xk) which is not provided in this method. Details for computing a suitable estimated

variance of Fa(xk) are found in other methods referenced in Section 7.

The output consists of the estimated cumulative distribution function values with either a one-
sided confidence bound (upper or lower) or a confidence interval for F (xk). *

3 Conditions Under Which This Method Applies

Probability sample with known inclusion probabilities or densities
• Discrete or Extensive resource
• Arbitrary subpopulation
• All units sampled from the subpopulation must be accounted for before applying this
method
• When the indicator value is missing, exclude this missing value and the corresponding
inclusion probability or density; use the Horvitz-Thompson ratio estimator

4 Required Elements

4.1 Input Data

y. = value of the indicator for the i'h unit sampled from subpopulation a.
K, = For discrete resources, the inclusion probability for selecting the ith unit of
subpopulation a. For extensive resources, the inclusion density evaluated at the
location of the ith sample point in subpopulation a.

-------
EMAP Estimation Method 1, Rev. No. 0, May 1996, Page;3 of 12
4.2 Additional Components

na - number of units sampled from subpopulation a.
xk = tfh indicator level of interest.
Na = subpopulation size, if known.

4.3 Graphical Display Considerations

If the empirical CDF is chosen, the number of points plotted is at most na+2. The first
plotted point is (0,0) when the indicator takes on only positive values. Otherwise, choose a
point smaller than yj as the abscissa and assign zero as the ordinate. Choose a point larger
than }' and assign 1 as its ordinate. Where there is more than one occurrence of an
a
indicator level in the data set, plot only one point using the largest cumulative distribution
function value associated with this level as the ordinate.

If the midpoints of adjacent values in y are used for the levels x, at most na+l points are
plotted. To determine the first plotted point, calculate- the distance between y\ and y2- Take
half this distance and subtract it from y^ to obtain the abscissa. If this abscissa is a negative
number and the indicator can never be negative, instead assign -zero as the abscissa. Use zero,,
as the ordinate. Similarly, to determine the last plotted point, calculate the distance between
the largest y values, y _, and yn . Halve this distance, add it to v and plot this abscissa
Q. a a
using 1 as the ordinate.

The recommended approach uses equally spaced levels across the potential range of the
indicator. The levels used should be potential real values that the indicator could attain. For
example, in the case of discrete data, integer values should be used. As mentioned
previously, ideally this range is known a priori and extends beyond the range of any
particular data set. If an informed guess cannot be made for this range, one suggested range
would be to use the midpoint approach for obtaining the first and last plot points as explained
in the previous paragraph. How many points to use is a subjective decision and should take
into account the chosen range, the size of the data set, and sometimes the data distribution
itself must be examined. The following suggestions are given to help decide how many
points to use.

-------
              EMAP Estimation Method 1, Rev. No. 0, May 1996, Page 4 of 12


.na+2 points should be adequate. Begin  by using 100 points for these larger data sets.  The
 range of the indicator will have a part in determining if this is  an adequate number of points.
 Trying the plots with differing numbers  of points may be useful to see if the graph changes
 significantly.

 The y-axis (CDF) should range in values from zero to 1.  This method may result in
 confidence limits which drop below zero or exceed  1.  These limits should not appear on the
 plot. Instead, truncate the plotted upper limit at 1 and the plotted lower  limit at zero.

 5 Formulas and Definitions

 The estimated CDF (proportion) for indicator value xk in subpopulation a, Fa(xk), with
 known subpopulation size, Na;  Horvitz-Thompson estimator is
             1 = 1   i
                                                                       J\
 The estimated CDF (proportion) for indicator value xk in subpopulation a,  Fa(xk), with

 estimated subpopulation size, Na;  Horvitz-Thompson ratio estimator is
             1 = 1   I

 The one-sided 100(l-a)% upper confidence bound, By(xk) is





 The one-sided 100(l-a)% lower confidence bound, BL(xk) is
The two-sided 100(l-a)% confidence interval, C(xk] is

-------
             EMAP Estimation Method 1, Rev. No. 0, May 1996, Page^5 of 12
For these equations:

 ^   /\
 V [Fa(xk)] = estimated variance of the estimated CDF (proportion) for indicator value xk in

              subpopulation a.
xk
na
za
a
       [ 0, otherwise
=  tfh indicator level of interest.
=  value of the indicator for the fh unit sampled from subpopulation a.
=  For discrete resources, the inclusion probability for selecting the ith unit of
   subpopulation a. For extensive resources, the inclusion density evaluated  at the
   location of the ith sample point in subpopulation a.
=  number of units sampled from  subpopulation a.
= z-score  from the standard Normal distribution.
=  level of significance.
6  Procedure

6.1    Enter Data

Input the sample data consisting of the indicator values, y, ,  and their associated inclusion
probabilities, ft,.  For example,
Calcium
>•
1.5992
2.3707
1.5992
2.0000
7.0000
2.8196
1.2204
1.5992
2.9399
.7395
Inclusion
Probability
*;
'.07734
.00375
.75000
.75000
.00375
.02227
.01406
.03750
.00586
.00375

-------
               EMAP Estimation Method 1, Rev. No. 0, May 1996, Page 6 of 12
 6.2    Sort Data

 Sort the sample data in nondecreasing order based on the y; indicator values.  Keep all
 occurrences of an indicator value to obtain correct results.
Calcium
?«
.7395
1.2204
1.5992
1.5992
1.5992
2.0000
2.3707
2.8196
2.9399
7.0000
Inclusion
Probability
*;
.00375
.01406
.07734
.75000
.03750
.75000
.00375
.02227
.00586
.00375
 6.3    Obtain Subpopulation Size

,. Input Na if using a known subpopulation size.

           s.
 Calculate Na from the sample data only if using the Horvitz-Thompson ratio estimator.  Sum

 the reciprocals of the inclusion probabilities or densities, rc,, for all units in the sample a to
 obtain NQ .

 Na = (1/.00375) + (1/.01406) + (1/.07734) + . . . + (1/.00375) = 1128.939 for this data set.


 6.4    Input Indicator Levels of Interest

 Assign indicator levels of interest, x, based on graphical display considerations.  Choose one
 of the three methods previously discussed in Section 4.3.

 6.4.1  The Recommended Approach — Levels of Interest

 Form an expected range of the indicator before looking at the data.  Next, examine the data
 set to see if the estimated range encompasses all y  values.  If not, increase the range  to
 encompass the outlying y values.  If there are large outliers, more points than na+2 may be

-------
              EMAP Estimation Method 1, Rev. No. 0, May 1996, Page' 7 of 12


needed to retain good resolution in the body of the plot.  Determine evenly spaced x values
across the chosen range.

For this example, the estimated range was .5 to 9.5 mg/L.  The range does not have to be
adjusted because it includes the observed yf values.  The point spacing interval for x, xint -
(*ma* - xminV(na - 1) = (9.5 - .5)7(10 - 1) = 9/9 =  1.0.  The 10 x  values = (xmin , xmin+l.Q,
jcmi-n+2(1.0), ... ) = (.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5).  Try obtaining the CDF first
with these x values and then again with an increased number of x values spaced closer
together.  More points across the range may be needed because all but one of the yi values
are less than 3.0.

6.4.2 The Empirical CDF — Levels of Interest

For the empirical CDF, x values = (.7395, 1.2204, 1.5992, 2, 2.3707, 2.8196, 2.9399, 7).
Duplicate values  in the data set, 1.5992, do not have to be repeated when forming x .

6.4.3 The Midpoint Approach — Levels of Interest

Calculate the  midpoints of each pair of yi values to form x . The first x value is
(.7395+1.2204)72 = .9800. In this particular data set, there are three occurrences of 1.5992.
As a result, there are two midpoints of 1.5992.  Regardless of how many times a  midpoint is -
repeated, include it only once in x.  The x values = (.9800, 1.4098, 1.5992, 1.7996, 2.1854,
2.5952, 2.8798, 4.9700).

6.5    Compute Cumulative Distribution Function Values

          s*                                                . .
Calculate Fa(xk) for each element in x using the formulas from Section 5.


To calculate Fa(x^), compare each y(- to x,. If yi is less than or equal to x^ then I/JT, is
                            .A
added to the computation of Fa(x^ ) until yt exceeds x\  (when using sorted data).  Divide the
                                 Si
cumulative sum of these I/TC/S by Na or Na (depending on the estimator used) to obtain
Similarly, to calculate Ffl(x2), compare each yi to Xj, add the  1/rc/s until yi exceeds x2, and

then divide this sum by Na  or Na.


Do this for every value  in x.

                                                             /v
Below is an example for obtaining the cumulative sum for each Fa(xk).  Complete results

for the example data are in Section 6.7.

-------
EMAP Estimation Method 1, Rev. No. 0, May 1996, Page 8 of 12
Calcium
>'•
.7395
1.2204
1.5992
1.5992
1.5992
2.0000
Inclusion
Probability
*,-
.00375
.01406
.07734
.75000
.03750
.75000
tlndicator Level
of Interest
xk
.7395
1.2204
1.5992

2.0000
Cumulative Sum for Fa(xk)
1/.00375
1/.00375+1/.01406
1/.00375+1/.01406+1/.07734+1/.75000+1/.03750

1/.00375+1/.01406+1/.07734+1/.75000+1/.03750+1/.75000
6.6 Compute Confidence Limits

Calculate the confidence bound (upper or lower) or confidence interval for each Fa(xk)
using the formulas from Section 5.

Fa(xk) to obtain the lower bound, BL(xk). For the confidence interval, obtain both BL(xk)
and 5(/*;.). For example, 1.645 would be the za for a one-sided 95% upper or lower
confidence bound, and the za/2 f°r a two-sided 90% confidence interval. A two-sided 95%
confidence interval would use 1.96 for z^-

6.7 Output Results

Output the indicator levels of interest, the associated CDF value, and either a confidence
bound (upper or lower) or a confidence interval for Fa(xk~). If the output generated will be
used for graphing the CDF, append the first and last graph points to this output as directed
for the three methods below. The tables in Section 6.7.1 - 6.7.3 contain results for the ratio
estimator applied to the example data. A hypothetical variance is used in confidence bound
and interval calculations.
When upper bounds exceed 1, they are set equal to 1. Lower bounds less than zero are set
equal to zero.