EPA/620/R-96/002
                                                August 20, 1996
EMAP  Statistical Methods Manual
                            by

                      Susan Diaz-Ramos
                       Anteon Corporation
                     200 S.W. 35th Street
                    Corvallis, Oregon 97333

                      Don L. Stevens, Jr.
                     Dynamac Corporation
                     200 S.W. 35th Street
                    Corvallis, Oregon 97333

                      Anthony R. Olsen
               NHEERL Western Ecology Division
              U.S. Environmental Protection  Agency
                     200 S.W. 35th Street
                     Corvallis, OR 97333
                         May 1996
        Environmental Monitoring and Assessment Program
    National Health and Environmental Effects Research Laboratory
               Office of Research and Development
              U.S. Environmental Protection Agency
                     Corvallis, OR 97333

-------
TECHNICAL REPORT DATA
(Please read instructions on the reverse before completing)

1.  REPORT NO.                          EPA/620/R-96/002
    NTIS ACCESSION NO.                  PB96-205453
4.  TITLE AND SUBTITLE                  EMAP Statistical Methods Manual
5.  REPORT DATE                         8/20/96
7.  AUTHOR(S)                           1S. Diaz-Ramos, 1D.L. Stevens, Jr., 2A.R. Olsen
9.  PERFORMING ORGANIZATION NAME AND ADDRESS
                                        1MERSC, Corvallis, OR
                                        2US EPA, NHEERL, Corvallis, OR
12. SPONSORING AGENCY NAME AND ADDRESS
                                        US EPA Environmental Research Laboratory
                                        200 SW 35th Street
                                        Corvallis, OR 97333
13. TYPE OF REPORT AND PERIOD COVERED   Published Report
15. SUPPLEMENTARY NOTES
    1996. U.S. Environmental Protection Agency, National Health and Environmental Effects
    Research Laboratory, Corvallis, OR.
16. ABSTRACT
    As part of the EMAP multi-tier design research effort, a statistical methods manual
    was completed to address estimation problems for the survey design tier.  The
    production of this manual ensures that statistical estimation procedures used in the
    survey tier are documented and available to others.  A particular focus is ensuring
    the methods are available to the Regional EMAP studies.  The procedures are detailed
    so that a scientific computer programmer can implement the methods.  As new methods
    are developed they will be added to the manual.
17. KEY WORDS AND DOCUMENT ANALYSIS
    a. DESCRIPTORS:  Environmental monitoring, Statistical analysis, survey design,
       Environmental Monitoring and Assessment Program (EMAP)

EPA Form 2220-1 (Rev. 4-77)             PREVIOUS EDITION IS OBSOLETE

-------
                                      ABSTRACT

       The Statistical Methods Manual documents statistical analysis methods applicable to
 data collected by the Environmental Monitoring and Assessment Program (EMAP).  The
 methods described give procedures to estimate the current status of ecological resources that
 are appropriate for survey designs implemented by EMAP. The methods apply to analyses of
 EMAP regional demonstration studies and R-EMAP studies.  Sufficient information is given
 to enable a user to determine if the method is appropriate for the survey design used in these
 studies.  Additional methods will be added as appropriate to include updated analysis
 procedures or to cover additional EMAP or R-EMAP studies.  The audience for the manual
 is statisticians or scientists with a reasonable background in statistics.  The calculations are
 detailed so that  a scientific computer programmer can implement the methods.

 Key Words:  survey design, cumulative distribution estimation, status estimation, ecological
 monitoring, U.S. EPA-EMAP.

 Preferred citation:

 Diaz-Ramos, S., D.L. Stevens, Jr., and A.R. Olsen.   1996.  EMAP Statistical Methods Manual.
        EPA/620/R-96/002.  Corvallis, OR: U.S. Environmental Protection Agency, Office of
       Research and Development, National Health and Environmental Effects Research
       Laboratory.

                               ACKNOWLEDGEMENTS

       We could not have prepared this methods  manual without the cooperation and aid of
many individuals.  Over the past five years, we have benefitted from technical discussions
with all participants in the EMAP Design and Statistics  research effort. Kathleen Purdy, a
graduate student at Oregon State University, wrote two of the methods on deconvolution  of
measurement error.   We thank Danny  Kugler for  his work on the method that presents
simplified algorithms suitable for spreadsheet software.  We recognize the work by Doug
Heimbuch, Harold Wilson, and John Seibel of Coastal Environmental Services, Inc. and Steve
Weisberg and Jon Volstad of Versar, Inc. who  wrote two of the general overview papers
included.  Similarly, the report by Virginia Lesser and Scott Overton was central to our effort
and is  included for completeness.

                                        Notice

       The research described  in this document has been funded by the U.S.  Environmental
Protection Agency.  This document has been prepared at EPA National Health and
Environmental Effects Research Laboratory, Western Ecology Division, in Corvallis, Oregon,
through Contract #68-C4-0019. It has been subjected to the Agency's peer and administrative
review and approved for publication.   Mention  of trade names  or commercial products does
not constitute endorsement for use.

-------
                                  CONTENTS
ABSTRACT

ACKNOWLEDGEMENTS

INTRODUCTION
BACKGROUND .....................................................  3
      Survey Design Approach ....................................  3
      Sampling Ecological Resources .............................  4

ESTIMATION AND ANALYSIS .........................................  6

MISSING DATA ....................................................  10

REFERENCES ......................................................  12

STATUS ESTIMATION METHODS

METHOD  1:   Cumulative Distribution Function for Proportion of a Discrete or an Extensive
              Resource; Horvitz-Thompson Estimator, Normal Approximation

METHOD  2:   Cumulative Distribution Function for Total Number of a Discrete or an
              Extensive Resource;  Horvitz-Thompson Estimator, Normal Approximation

METHOD  3:   Size-Weighted Cumulative Distribution Function for Proportion of a Discrete
              Resource; Horvitz-Thompson Estimator, Normal Approximation

METHOD  4:   Size-Weighted Cumulative Distribution Function for Total of a Discrete
              Resource; Horvitz-Thompson Estimator, Normal Approximation

METHOD  5:   Variance of the Cumulative Distribution Function for Proportion of a Discrete
              Resource; Horvitz-Thompson Variance Estimator

METHOD  6:   Variance of the Cumulative Distribution Function for Total Number of a
              Discrete Resource; Horvitz-Thompson Variance Estimator

METHOD  7:   Variance of the Size-Weighted Cumulative Distribution Function for
              Proportion of a Discrete Resource;  Horvitz-Thompson Variance Estimator
                                        iii

-------
                                    CONTENTS
METHOD 8:  Variance of the Size-Weighted Cumulative Distribution Function for Total of
              a Discrete Resource;  Horvitz-Thompson Variance Estimator

METHOD 9:  Cumulative Distribution Function and Variance for Proportion of a Finite
              Population; Parametric Jackknife Estimator

METHOD 10: Variance of the Cumulative Distribution Function for Proportion of an
              Extensive Resource; Horvitz-Thompson Variance Estimator

METHOD 11: Cumulative Distribution Function and Variance for Proportion of a Resource;
              Simulation-Extrapolation Method

METHOD 12: Variance of the Cumulative Distribution Function for Proportion of a Discrete
              or an Extensive Resource;  Yates-Grundy Variance Formula

METHOD 13: Simplified Variance of the Cumulative Distribution Function for Proportion
              (Discrete or Extensive) and for Total Number of a Discrete Resource, and
              Variance of the Size-Weighted Cumulative Distribution Function for
              Proportion and Total of a Discrete Resource;  Simple Random Sample
              Variance Estimator

APPENDICES

    A.    Answers to Commonly Asked Questions about R-EMAP Sampling Designs and
          Data Analyses

    B.    R-EMAP Data Analysis Approach for Estimating the Proportion of Area that is
          Subnominal

    C.    EMAP Status Estimation:  Statistical Procedures and Algorithms
                                          iv

-------
INTRODUCTION

    The Statistical Methods Manual documents statistical analysis methods applicable to data
collected by the Environmental Monitoring and Assessment Program (EMAP).  A primary use
of the EMAP data is to estimate the current status of ecological resource characteristics using
scientifically sound procedures.  The methods described give procedures to estimate current
status applicable to survey designs implemented by EMAP.  A distinct feature of EMAP is
the use  of survey designs as the foundation for site selection and subsequent scientific
inference to an ecological resource target population.  Consequently, it is essential that the
appropriate statistical analysis method be linked with the survey design used for the collection
of the data.

    The audience for the manual is statisticians or scientists with a reasonable background
in statistics.  The methods were written with sufficient detail so that a scientific computer
programmer can implement the calculations; for this reason, the methods contain more
simplified notation than that used in this introduction.  Appendices A and B are intended
for those with little statistical training who may become involved in the analysis of R-EMAP
studies.  See Appendix C for more information on the general theoretical development upon
which the algorithms in this manual are based.

    The methods in the manual are appropriate to use for analyses of EMAP regional
demonstration studies and R-EMAP studies.  The methods give sufficient information to
enable a user to determine if the method is appropriate for the survey design used in these
studies.   Most methods reference one or more EMAP  or R-EMAP studies  for which the
method  is appropriate.

    Most of the methods in the document provide estimators for the cumulative distribution
or its variance for a variety of survey designs and conditions.  Method 13  provides simplified
estimation algorithms for those using spreadsheet software.  These estimates are to be used
for internal research only and are not intended for use in any internal or external documents.
Methods 9  and 11 address the case when substantial measurement error is  present in the data
(observations).  In this case, the estimator of the cumulative function is biased.  The bias may
be substantial and is most prevalent in the tails of the distribution.  These  two methods
present  techniques to adjust for this bias.  The following table  provides a quick  summary of
the methods.

-------
 STATISTIC                          RESOURCE     ESTIMATOR                    METHOD #
 -------------------------------    ---------    -------------------------    --------
 CDF for proportion                 Discrete,    Horvitz-Thompson                 1
                                    Extensive

 Variance of the CDF for            Discrete     Horvitz-Thompson                 5
 proportion                                      Yates-Grundy                    12
                                                 Simple Random Sample            13
                                    Extensive    Horvitz-Thompson                10
                                                 Yates-Grundy                    12
                                                 Simple Random Sample            13

 CDF for total number               Discrete     Horvitz-Thompson                 2

 Variance of the CDF for            Discrete     Horvitz-Thompson                 6
 total number                                    Simple Random Sample            13

 Size-weighted CDF for              Discrete     Horvitz-Thompson                 3
 proportion

 Variance of the size-weighted      Discrete     Horvitz-Thompson                 7
 CDF for proportion                              Simple Random Sample            13

 Size-weighted CDF for total        Discrete     Horvitz-Thompson                 4

 Variance of the size-weighted      Discrete     Horvitz-Thompson                 8
 CDF for total                                   Simple Random Sample            13

 CDF for proportion or total in     Discrete     Parametric Jackknife;            9
 the presence of measurement                     Horvitz-Thompson
 error; Variance

 CDF for proportion or total in     Discrete     Simulation-Extrapolation        11
 the presence of measurement                     (SIMEX); SIMEX Variance
 error; Variance

    We highly recommend that any analysis of EMAP regional demonstration study data or
R-EMAP study data be preceded by a thorough reading of reports that document the survey

-------
 design, field measurement protocols, and indicator descriptions.  This information is available
 in EMAP reports and should be reviewed.

 BACKGROUND

     The Environmental Monitoring and Assessment Program is an interagency,
 interdisciplinary program that will  contribute to decisions on environmental protection and
 management by integrating research, monitoring, and assessment.  EMAP's strategies use
 rigorous science while taking into  account social values and policy-relevant questions. It  was
 initiated by EPA's Office of Research and Development to monitor status and trends in the
 condition of ecological resources, to develop innovative methods for anticipating emerging
 environmental problems, and in general, to provide a greater capacity for assessing and
 monitoring the condition of the nation's ecological resources (Messer et al.  1991).

     EMAP was designed to provide information that will enable policy-makers, decision-
 makers and the public to:

     •  Estimate the current  status,  trends, and changes in selected indicators of the Nation's
       ecological resources  on a regional basis with known confidence.

     •  Estimate the geographic coverage and extent of the Nation's ecological resources with
       known confidence.

     •  Seek associations between selected indicators of natural and anthropogenic stresses and
       indicators of condition of ecological resources.

     •  Provide annual statistical summaries and periodic assessments of the Nation's
       ecological resources.

     A general overview of EMAP in mostly non-technical language is in the EMAP Program
 Guide (Thornton et al., 1993).  Additional information on the assessment framework used by
 EMAP as a common  approach for  planning and conducting a wide variety of ecological
 assessments is given by EPA (1994).  The statistical analysis of data  from EMAP is best
 undertaken with an understanding of the measurement selection process.  Barber (1994)
 describes the indicator development strategy used by EMAP in their regional demonstration
 studies.  The statistical methods in this report were intended  primarily for these  demonstration
 studies as well as for studies conducted by EPA Regions in conjunction with EMAP.

 Survey Design Approach

     A distinctive feature of EMAP is strict reliance on probability sampling.  Overton et al.
 (1990) describe the conceptual framework for the sampling design approach for EMAP.
 Stevens (1994) gives  a description of how the conceptual framework is used in research-
 demonstration studies for particular ecological resources.  The implementation of the

-------
 conceptual framework required development of sampling designs directed at environmental
 resources distributed over space.

    Probability sampling is fundamental to EMAP. Probability sampling provides the basis
 for estimating resource extent and condition, for characterizing trends in extent or condition,
 and for representing spatial pattern, all with known certainty. A probability sample has some
 inherent characteristics that distinguish it from other samples:  first, the population being
 sampled is explicitly described; second, every element in the population has the opportunity
 to be sampled with known probability; and third, the selection is carried out by a process that
 includes an explicit random element.  A probability sample from an explicitly defined
 resource population is  a means to certify that the data collected are free from any  selection
 bias, conscious or not. This probability sample is an essential requirement for a program such
 as EMAP that aims to describe the condition of our national ecological resources.   Further,
 analytical methods that are as free as possible from the appearance of subjectivity are also
 required.  These two requirements are satisfied in EMAP by adherence to probability-based
 sampling protocols and analytical methods that rely on the statistical  design for their
 inferential soundness.  Thus,  EMAP relies on design-based inference procedures for basic
 estimates of population descriptors.  See Hansen et al. (1983), Sarndal (1978), or Smith
 (1976) for discussions  of the  issues involved in design-based versus model-based inference.
 These issues  are also discussed in a spatial context by de Gruijter and Ter Braak (1990) and
 Brus and de Gruijter (1993).

    Design-based  inference relies on the methodology of statistical survey sampling (Cochran
 1977) to extend the results from a sample to the population. This extension is valid only  with
 a probability  sample.  The  design specifies what information is to be collected at specified
 locations; there must also be  protocols or methods  that are coherent with the design, and that
 specify  how the inference is drawn. The combination of a sample design and an inference
 protocol is called a sampling  plan. This plan includes the prescription of not only what and
 where to sample,  but also  how to analyze the resulting data. In many instances in EMAP, the
 resource groups used novel sampling designs tailored to the resource.  These designs are
 documented in Overton et al. (1990) and Stevens (1994, 1995) and in the various research
 plans for the  particular resources. A general prescription for the analyses is  given in Overton
 et al. (1990),  and specific details of the  analyses for some designs are in Lesser and Overton
 (Appendix C). However, these documents do not cover all of the designs that have been
 applied  by the EMAP resource groups.  This Methods Manual fulfills the second part of the
prescription for a sampling plan by providing detailed descriptions of the methods for
 analyzing data collected using any of EMAP's sample designs along with computational
 algorithms where appropriate.

Sampling Ecological Resources

    The property of a particular ecological resource that has the most impact on a statistical
sampling plan is the dimension of the conceptual representation of the resource in  two-
dimensional space.  An implementation of the sampling strategy may represent (or  model) the

-------
• resource populations as points, lines, or areas.  Resources that are represented as points for
 sampling purposes are labeled discrete resources.   A discrete resource — such as small to
 medium-sized lakes—has distinct, natural units. Such a resource is represented in
 2-dimensional space as a point because the objective of the sampling is to describe the
 resource unit as an entity, even though the resource unit may occupy appreciable area in the
 landscape.  An attribute associated with a unit of a discrete resource, such as pH or an
 indicator of biodiversity, is viewed as a property of the entire  unit.  The ensemble of all units
 of a discrete resource is treated statistically as a finite population.  Population inferences for a
 discrete resource are most appropriately based on numbers of units that possess some
 property.  For example, a statement about lakes in good condition would pertain to numbers
 of lakes, not, for instance, surface area of lakes. An inference couched in terms of surface
 area might be possible, but neither the sampling plan  nor the measurements  taken on the units
 would be well suited for such an inference.

     Resources such  as streams, riparian wetlands,  or forested shelter belts may be given a 1-
 dimensional representation in 2-dimensional space, and sampled as linear resources. In fact,
 such resources are 2-dimensional, but their area is very small in proportion to landscape area.
 These features are much longer than they are wide, and they do not have well-defined  natural
 units.  Inferences are appropriately stated in units of length, e.g., proportion  of stream-miles
 in  poor condition. Attributes are viewed as being  defined at a point rather than being
 associated with a unit.  Thus,  a chemical concentration might change continuously along the
 length of a stream and be defined and measurable at every point on the stream.

     Resources that extend over large regions in a more or less continuous and connected
 fashion are treated as 2-dimensional, or extensive resources.  Like the linear resource, an
 extensive resource does not have distinct natural units.  Instead, it covers relatively large
 sections of the landscape and lacks a high degree of functional integration.  For example,
 forests, arid ecosystems, and large wetlands such as salt marshes or the Everglades fall  into
 this category.   The  domain of an extensive resource has area; it does not consist of a
 collection of separable points.  An attribute of an extensive resource is viewed as a definition
 of a surface in the sense that it is possible in principle to assign a value to the attribute at
 every point in the domain.  Generally, the attribute surface is reasonably smooth, although
 there may well be step discontinuities.  For example,  the domain of a forest could include
 stands of 50-year-old timber and adjacent newly clear-cut areas. A parameter measuring
 biomass could show a discontinuity as the boundary is crossed.  Population  inferences are
 usually based on area of the resource with some property, e.g., acres of forest with a visual
 canopy rating indicative of degraded condition.

     The distinctions between  discrete, linear, and extensive are not always clear, and in some
 cases a resource may be viewed as both: a resource consisting of isolated fragments may be
 treated as extensive  for sampling but as discrete for analysis, or vice versa.  Greater
 efficiency, that is, lower variance for a fixed sampling effort, will usually result if the
 sampling and analysis are carried out from the same viewpoint.  For example, streams could
 be sampled as a finite population of stream segments defined  by confluences (discrete), but

-------
 analyzed in terms of miles of stream channel (linear).  Thus, a simple random sample of a
 stream-segment population results in a variable probability sample of points on streams, and
 is not the most efficient sample to make an inference about miles of streams.

     Linear and extensive resources are sampled somewhat differently but analyzed using
 similar methods.  The important distinction in the analysis is between finite, discrete
 populations and infinite, continuous populations.  Methods for both types of populations are
 provided in this document.

 ESTIMATION AND ANALYSIS

     Each resource to be sampled can be represented by a set, R, whose elements index the
 points where the resource exists.  Thus, for a discrete resource, R = {s_1, s_2, ..., s_N}, where
 each s_i represents the location of one unit of the resource.  If R is, for example, a set of lakes,
 then each s_i represents the location of one of the lakes in R.  For an extensive resource, R is
 the set of points covered by the resource, for example, the area covered by forest or a linear
 stream network.  If R represents a forest, then each s ∈ R is a point in the forest; if R is a
 stream network, each s ∈ R is a point on some stream in the network.  Each attribute of
 interest of the resource R is a fixed but unknown function defined on R; that is, at each
 element s ∈ R there is a fixed value of the attribute denoted as z(s).  The population
 parameter to be estimated is the total of the attribute over R, that is,

     Z_T = \sum_{s_i \in R} z(s_i)

 in the discrete case or

     Z_T = \int_{R} z(s) \, ds

 in the continuous case.  This is a quite general population parameter, because estimates of
 mean values, variances, proportions, and distribution functions can all be formulated as
 estimates of sums or integrals over R.  For example, the distribution function F_z(x) for z(s)
 over R is the proportion of R with value of z less than or equal to x.  For a discrete resource,
 this is

     F_z(x) = \frac{1}{N_R} \sum_{s_i \in R} I_{\{t \in R \,|\, z(t) \le x\}}(s_i) .

 For an extensive resource, the distribution function is

     F_z(x) = \frac{1}{A_R} \int_{R} I_{\{t \in R \,|\, z(t) \le x\}}(s) \, ds ,

 where

     I_{\{t \in R \,|\, z(t) \le x\}}(s) =
         \begin{cases} 1 & \text{if } z(s) \le x \\ 0 & \text{otherwise} \end{cases}

-------
    The methods from finite-population sampling can be applied to make inferences about Z_T
for discrete resources.  Finite population sampling methods are extensively developed and
well-documented (Cassel et al. 1977, Cochran 1977, Kish 1965, Thompson 1992, Yates
1960).  However, environmental populations are, in many instances, more appropriately
conceptualized as continuous, infinite populations rather than discrete and finite.

    Estimates of extent for a resource (e.g., wetlands, forests) or for a subset of a resource
(e.g.,  salt marshes, deciduous forest) can be obtained from classification of a sample.
Estimates of ecological condition for a resource class are generated from condition indicators.
Cumulative distribution functions with confidence bounds are the fundamental method for
describing regional (or national) condition in EMAP.  The essential feature of this approach is
the emphasis on estimating the  cumulative total (or proportion) of a resource class with an
indicator of condition (or area)  less than or equal to a specified value (e.g., the proportion
with indicator value less than or equal to some value of interest).  Although distribution
functions provide the estimates  of condition, the information from them can be presented in
several forms (bar graphs, tables, distribution function plots), with the choice of format
related to the intended  audience.

    The primary theoretical justification for the estimation methods presented in this
document is the Horvitz-Thompson Theorem (Horvitz and Thompson, 1952) or its continuous
population analogue (Cordy, 1993; Stevens, 1995, submitted).  The sampling background to
Horvitz-Thompson estimation is summarized here very briefly to provide a context for the
estimation methods presented in the body of this document.  The theory and notation are very
similar for discrete and continuous resources.

    The inference paradigm  is based on the inclusion probabilities and the pairwise inclusion
probabilities of the sampled units under the following sampling model:  A sample is selected
from the universe U by picking the values of n random variables s_1, s_2, ..., s_n from a joint
probability distribution specified by Pr(s_1, s_2, ..., s_n), which is defined by the sampling
design.  (In EMAP, the s_i can be thought of as points, as they will be actual points in an
extensive resource or reference points that identify the location of a discrete resource.)  The
selected points are classified as being in or out of some target population R, and z need be
determined only for those points in R.  In general, this sampling method gives a fixed total
sample size (n), but the size of the achieved sample in R is a random variable.  Allowing a
random sample size entails some technical complication but provides valuable flexibility.  In
particular, it provides the ability to make estimates for arbitrary subpopulations, that is, R
could be defined after all the sampling has taken place.  The only difference  between the
discrete and extensive case is the form of the probability distribution:  in the discrete case, the
probability distribution gives  the numerical probability that a particular sample is selected; in
the continuous case, the probability is replaced with a probability density  function for the
samples.

-------
     For the discrete case, the inclusion probability π(k) for unit k is the probability that unit k
 is included in the sample, i.e., Pr(s_1 = k or s_2 = k or ... or s_n = k).  For designs such that
 Pr(s_i = s_j) = 0 for all i ≠ j (e.g., sampling without replacement),

     \pi(k) = \sum_{i=1}^{n} \Pr(s_i = k) .

 The joint inclusion probability for units k and l is the probability that units k and l are
 simultaneously in the sample, and is given by

     \pi(k, l) = \sum_{i=1}^{n} \sum_{j \ne i} \Pr(s_i = k,\, s_j = l) .
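
As a concrete illustration, for equal-probability designs these quantities reduce to familiar
closed forms.  The short Python sketch below, with hypothetical function names, gives the
values for simple random sampling without replacement of n units from a population of N.

    def srs_inclusion_prob(n, N):
        """pi(k) for simple random sampling without replacement of n units from N."""
        return n / N

    def srs_pairwise_inclusion_prob(n, N):
        """pi(k, l), k != l, for simple random sampling without replacement."""
        return n * (n - 1) / (N * (N - 1))

    # For example, a sample of 10 lakes drawn from a population of 1000 lakes:
    print(srs_inclusion_prob(10, 1000))           # 0.01
    print(srs_pairwise_inclusion_prob(10, 1000))  # roughly 9.0e-05
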
    In cases where R is not finite, but rather an extensive resource, continuous probability
distributions are used to specify the sampling design. As a result, the inclusion probability
functions used in the discrete case are replaced with inclusion density functions. Let
f(s_1, s_2, ..., s_n) be the joint probability density function (pdf) of the sample locations, f_i be the
(marginal) pdf of s_i, and let f_ij(s, t) be the joint pdf of s_i and s_j, i ≠ j.  The inclusion density
function is defined by

    \pi(s) = \sum_{i=1}^{n} f_i(s) .

The pairwise inclusion density function for s, t ∈ U is defined by

    \pi(s, t) = \sum_{i=1}^{n} \sum_{j \ne i} f_{ij}(s, t) .


    Horvitz and Thompson (1952) provided an estimator of the population total for variable-
probability, without-replacement, finite-population sampling design, along with an expression
for the variance of the estimated total and a related variance estimator. Alternative
expressions for the  variance and its estimator were provided by Yates and Grundy (1953) and
Sen (1953). (The variance estimators associated with Horvitz and Thompson are given in
subsequent equations and are denoted by the subscript "HT"; the Yates-Grundy forms are
denoted by the subscript "YG".)  As was shown by Cordy (1993), a version of the Horvitz-
Thompson theorem holds when sampling from U when the inclusion density and pairwise
inclusion density functions are defined as above.

    The Horvitz-Thompson theorem provides estimators of the total (sum or integral) of z
over R and its associated variance in terms of the quantities z(s_i), π(s_i), and π(s_i, s_j).  The

                                            8

-------
 form for the estimator of the total is the same for both the discrete and continuous versions;
 the only difference between the two is the expression for the variance of the estimator.  The
 (unbiased) estimator of Z_T is given by

     \hat{Z}_T = \sum_{i=1}^{n} \frac{z(s_i)}{\pi(s_i)} .

 The estimators of variance of Ẑ_T for the discrete case are

     \hat{V}_{HT}(\hat{Z}_T) = \sum_{i=1}^{n} \bigl(1 - \pi(s_i)\bigr) \frac{z^2(s_i)}{\pi^2(s_i)}
         + \sum_{i=1}^{n} \sum_{j \ne i}
           \frac{\pi(s_i, s_j) - \pi(s_i)\,\pi(s_j)}{\pi(s_i, s_j)} \,
           \frac{z(s_i)\, z(s_j)}{\pi(s_i)\, \pi(s_j)}

     or

     \hat{V}_{YG}(\hat{Z}_T) = \sum_{i=1}^{n} \sum_{j > i}
           \frac{\pi(s_i)\,\pi(s_j) - \pi(s_i, s_j)}{\pi(s_i, s_j)}
           \left[ \frac{z(s_i)}{\pi(s_i)} - \frac{z(s_j)}{\pi(s_j)} \right]^2 ,

 and for the continuous case,

     \hat{V}_{HT}(\hat{Z}_T) = \sum_{i=1}^{n} \frac{z^2(s_i)}{\pi^2(s_i)}
         + \sum_{i=1}^{n} \sum_{j \ne i}
           \frac{\pi(s_i, s_j) - \pi(s_i)\,\pi(s_j)}{\pi(s_i, s_j)} \,
           \frac{z(s_i)\, z(s_j)}{\pi(s_i)\, \pi(s_j)}

     or

     \hat{V}_{YG}(\hat{Z}_T) = \frac{1}{2} \sum_{i=1}^{n} \sum_{j \ne i}
           \frac{\pi(s_i)\,\pi(s_j) - \pi(s_i, s_j)}{\pi(s_i, s_j)}
           \left[ \frac{z(s_i)}{\pi(s_i)} - \frac{z(s_j)}{\pi(s_j)} \right]^2 .

 All of the above estimators of variance are unbiased, provided π(s, t) > 0 in U.

     An estimator of the mean of z can be obtained by dividing Ẑ_T by the size of R (the
 number of units in R, or the length or area of R), i.e., μ̂_z = Ẑ_T / N_R or μ̂_z = Ẑ_T / A_R.
-------
 The estimator Ẑ_T will tend to have low variance if z and π are strongly positively correlated.
 Since many environmental surveys have multiple objectives and collect observations on
 multiple attributes at the same location, there will often be little or no correlation between z
 and π.  A ratio estimator (so-called because it is the ratio of two estimators) for μ_z of the
 form

     \hat{\mu}_{z,r} = \hat{Z}_T / \hat{A}_R ,
     \qquad \text{where } \hat{A}_R = \sum_{i=1}^{n} \frac{I_R(s_i)}{\pi(s_i)} \text{ estimates } A_R ,

 may well be more precise than μ̂_z.  The two estimators Ẑ_T and Â_R are subject to the same
 sources of sampling variation, and hence are likely to be positively correlated.  Thus, if there
 is substantial variability in Â_R, μ̂_{z,r} will likely be more precise than μ̂_z.  The ratio
 estimator of the total is then Ẑ_{T,r} = μ̂_{z,r} A_R.  An approximate variance estimator for
 Ẑ_{T,r} is obtained by applying either the Horvitz-Thompson or Yates-Grundy formulas with
 d(s_i) = z(s_i) - μ̂_{z,r} in place of z(s_i).


     The distribution function F_z(x) of the response z is estimated by applying the Horvitz-
 Thompson theorem to the indicator function I_{t ∈ R | z(t) ≤ x}(s).  An unbiased estimator of the
 size (number, length, or area) of the subset of R with indicator z(t) ≤ x is given by

     \hat{Z}_T(x) = \sum_{i=1}^{n} \frac{I_{\{t \in R \,|\, z(t) \le x\}}(s_i)}{\pi(s_i)} ,

 so that F̂_z(x) = Ẑ_T(x) / A_R is an unbiased estimator of F_z(x).  The ratio estimator
 F̂_{z,r}(x) = Ẑ_T(x) / Â_R avoids the possibility of obtaining estimates that exceed 1, and in
 many cases will be more precise than F̂_z(x), for the same reasons as given for μ̂_{z,r} relative
 to μ̂_z.  An approximate variance estimator for F̂_{z,r} is obtained by applying the Horvitz-
 Thompson or Yates-Grundy formulas with d_i(x) = I_{t ∈ R | z(t) ≤ x}(s_i) - F̂_{z,r}(x) in place
 of z(s_i) and dividing by Â_R².
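
As an illustrative sketch, assuming Python with the numpy library as the computing
environment, the estimators above can be coded directly from arrays of attribute values and
inclusion probabilities (or densities) evaluated at the sampled locations; the numerical values
at the end are hypothetical.

    import numpy as np

    def ht_total(z, pi):
        """Horvitz-Thompson estimator of the population total Z_T."""
        z, pi = np.asarray(z, float), np.asarray(pi, float)
        return np.sum(z / pi)

    def cdf_ratio_estimate(z, pi, x):
        """Ratio estimator of the distribution function at each level in x.

        The numerator is the Horvitz-Thompson estimate of the size of the subset
        with z(s) <= x; the denominator is the estimated size of R, the sum of the
        reciprocal inclusion probabilities.
        """
        z, pi, x = np.asarray(z, float), np.asarray(pi, float), np.asarray(x, float)
        A_hat = np.sum(1.0 / pi)
        totals = np.array([np.sum((z <= xk) / pi) for xk in x])
        return totals / A_hat

    # Hypothetical example values:
    z = [1.2, 3.4, 0.7, 2.9]
    pi = [0.05, 0.02, 0.05, 0.01]
    print(ht_total(z, pi))
    print(cdf_ratio_estimate(z, pi, x=[1.0, 2.0, 3.0, 4.0]))
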
MISSING DATA

    All surveys must address the issue of how to handle missing data in statistical estimation.
Missing data should always be investigated for patterns, including why it is missing.  Two
types of missing data are possible in EMAP or R-EMAP surveys.  One  type is a missing
sample unit, such as  a missing lake, stream location, or forest site.  Sample units may be

                                           10

-------
missing due to inaccessibility, land owner refusal, or other reasons.  Finding detectable
patterns in missing data could lead to alterations in survey management, including obtaining
access permission and identifying situations where the population inference needs to be
qualified.

    The other type of missing  data occurs within a sampling unit, such as a missing
observation for an indicator such as a chemical concentration or habitat structure variable.
Observations may be missing due to field collection problems, lost samples, laboratory
analysis problems, or other reasons. Although it is possible to use different statistical
methods to address the two types of missing data, for the purposes of this manual the two
types will be treated the same.  We treat all missing data as representing missing sample units.

    Two views may be taken.  In either view, the sample units unavailable for
measurement can be considered to be a subset of the target population of interest.  One view
is to remove this subset from the target population by redefining the target population as the
original target population excluding the missing subset.  The statistical methods may then be
applied as given without adjustments.  Another view is to assume the data are missing at
random and retain the original  definition  of the target population.  In this case, status
estimators of the cumulative distribution  expressed as a proportion  or fraction of the total
remain unbiased estimators.  Estimators for population totals or cumulative distributions
expressed as amounts  (number, length, area) are biased.
                                            11

-------
REFERENCES
Barber, M. C. ed.  1994.  Environmental Monitoring and Assessment Program:  Indicator
    development strategy. EPA/620/R-94/022.  Athens, GA: U.S. Environmental Protection
    Agency, Office of Research and Development.

Brus, D. J., and J. J. de Gruijter.  1993.  Design-based versus model-based estimates of spatial
    means:  Theory and application in environmental soil science. Environmetrics 4:123-152.

Cassel, C. M., C. E. Sarndal, and J. H. Wretman.  1977.  Foundations of inference in survey
    sampling. New York:  John Wiley.

Cochran, W. G.  1977. Sampling techniques. 3rd Edition.  New York:  John Wiley & Sons.

Cordy, C. B.  1993.  An extension  of the Horvitz-Thompson theorem to  point sampling from a
    continuous universe.  Statistics and Probability Letters 18:353-362.

de Gruijter, J. J., and C. J. F. Ter Braak.  1990.  Model free estimation from survey samples:
    A reappraisal of classical sampling theory.  Mathematical Geology 22:407-415.

Hansen, M.  H., W. G. Madow, and B. J. Tepping.  1983.  An evaluation of model-dependent
    and probability sampling inferences in sample surveys.  Journal of the American Statistical
    Association 78:776-793.

Horvitz, D. G., and D. J. Thompson.  1952.  A generalization of sampling without replacement
    from a finite universe.  Journal of the American Statistical Association 47:663-685.

Kish, L. 1965.  Survey sampling. New York: John Wiley & Sons.

Lesser,  V. M., and W. S.  Overton.  1994.  EMAP status estimation: Statistical procedures and
    algorithms.  EPA 620/R-94/008.  Corvallis,  OR:  U.S. Environmental Protection Agency,
    Environmental Research Laboratory.

Messer, J. J., R.  A. Linthurst, and W. S. Overton.  1991.  An EPA program for monitoring
    ecological status and  trends.  Environmental Monitoring and Assessment  17:67-78.

Overton, W. S., D. White, and D. L. Stevens Jr.  1990. Design report for EMAP,
    Environmental Monitoring and Assessment Program.  EPA 600/3-91/053. Corvallis, OR:
    U.S. Environmental Protection Agency,  Environmental  Research Laboratory.

Sarndal, C.  1978.  Design-based  and  model-based inference for survey sampling.
    Scandinavian Journal of Statistics 5:27-52.

                                          12

-------
Sen, A. R.  1953.  On the estimate of the variance in sampling with varying probabilities.
    Journal of the Indian Society of Agricultural Statistics 7:119-127.

Smith, T. H.  1976.  The foundations of survey sampling:  A review.  Journal of the Royal
    Statistical Society, Series A.

Stevens, Jr., D. L.  1994.  Implementation of a national environmental monitoring program.
    Journal of Environmental Management 42:1-29.

Stevens, Jr., D. L.  1995.  A family of designs for sampling continuous spatial populations.
    Environmetrics.   Submitted.

Thompson,  S. K.  1992.   Sampling.  New York:  Wiley.
Thornton, K. W., D. E. Hyatt, and C. B. Chapman, eds.  1993. Environmental Monitoring and
    Assessment Program guide.  EPA/620/R-93/012. Research Triangle Park, NC: U.S.
    Environmental Protection Agency, Office of Research and Development.

U.S. Environmental Protection Agency (EPA).  1994.  Environmental Monitoring and
    Assessment Program assessment framework. EPA/620/R-94/016. Research Triangle Park,
    NC:  U.S. Environmental Protection Agency, Office of Research and Development.

White, D., A. J. Kimerling, and W. S. Overton.  1992.  Cartographic and geometric
    components of a global sampling design for environmental monitoring.  Cartography and
    Geographic Information Systems 19(1):5-22.

Yates, F.  1960.  Sampling methods for censuses and surveys.  London:  Charles Griffin & Co.

Yates, F., and P. M. Grundy.  1953.  Selection without replacement from within  strata with
    probability proportional to size.  Journal of the Royal Statistical Society B 15:253-261.
                                          13

-------
              EMAP Estimation Method 1, Rev. No. 0, May 1996, Page 1  of 12
 ESTIMATION METHOD 1: Estimation of the Cumulative Distribution Function for the
 Proportion of a Discrete or an Extensive Resource; Horvitz-Thompson Estimator, Normal
 Approximation

 1  Scope and Application

 This method calculates the estimate of the cumulative distribution function (CDF) for the
 proportion of a discrete or an  extensive resource that has an indicator value equal to or less
 than a given indicator level.  The method applies to any probability sample and presents two
 estimators.  An estimate can be produced for the entire population  or for  an arbitrary
 subpopulation with known or  unknown size.  In the discrete case, this size is the number of
 units in the subpopulation. In the extensive case, this size is the subpopulation extent.
 Suggestions for estimating the CDF over the range of the indicator are included.
 Alternatively, the CDF can be calculated at the indicator levels found in the probability
 sample.  The method uses the Normal approximation to provide confidence bounds or
 intervals for the true cumulative distribution function.  This method does not include variance
 estimators for the estimated CDF.  For information on appropriate variance estimators, refer
 to Section 7.

 This method has been applied in:

       The 1991 Surface  Waters Pilot Report
       EMAP-Estuaries Louisianian  Province 1991 Annual Statistical Summary
       EMAP-Estuaries Virginian  Province 1991 Annual Statistical Summary
       The Forest Health Monitoring 1992 Annual Statistical Summary

 2  Statistical Estimation Overview

 A sample of size n_a units is selected from subpopulation a with known inclusion probabilities
 π = {π_1, ..., π_i, ..., π_{n_a}}.  The indicator is evaluated for each unit and represented by
 y = {y_1, ..., y_i, ..., y_{n_a}}.  When sampling an extensive resource, the inclusion probabilities
 are replaced by the inclusion density function evaluated at the sample locations.

 Estimates of the cumulative distribution function are obtained for the indicator levels of
 interest, x = {x_1, ..., x_k, ..., x_m}.  Several alternatives are available for choosing x.  The
 recommended alternative is the use of equally spaced values across the range of the indicator.
Ideally, this range  is known a  priori  and extends beyond the range  of any particular data set.
 A second alternative is  to use  the set of unique values in the data set.  This alternative gives
 the classical empirical cumulative distribution function.  A third alternative is to use the
midpoints of adjacent ordered  values in y  for the levels x.

-------
              EMAP Estimation Method 1, Rev. No. 0, May 1996, Page 2 of 12

 To obtain the estimated cumulative distribution function, F̂_a(x_k), the Horvitz-Thompson
 estimator of a cumulative total is calculated for each x_k by summing up the number of
 indicators which are less than or equal to the x_k value.  This total is then divided by the
 subpopulation size, N_a.  When this subpopulation size is unknown, the estimated
 subpopulation size, N̂_a, is substituted for the known subpopulation size, N_a, to form the
 Horvitz-Thompson ratio estimator.

 The Horvitz-Thompson ratio estimator may perform better than the estimator using the known
 subpopulation size, N_a, and may be used even if the subpopulation size is known.  Some of
 the conditions under which this ratio estimator is recommended are given in Section 9.  This
 ratio estimator should always be used in the case of missing data.

 Confidence limits for F̂_a(x_k) are produced by assuming a Normal distribution.  These limits
 may be used to construct either a lower confidence bound, an upper confidence bound, or a
 confidence interval for F̂_a(x_k).  Computation of these limits requires an estimated variance
 of F̂_a(x_k), which is not provided in this method.  Details for computing a suitable estimated
 variance of F̂_a(x_k) are found in other methods referenced in Section 7.

 The output consists of the estimated cumulative distribution function values with either a one-
 sided confidence bound (upper or lower) or a confidence interval for F̂_a(x_k).

 3  Conditions Under Which This Method Applies

 •   Probability sample with known inclusion probabilities or densities
 •   Discrete or Extensive resource
 •   Arbitrary subpopulation
 •   All units sampled from the subpopulation must be accounted for before applying this
     method
 •   When the indicator value is missing, exclude this missing value and the corresponding
     inclusion probability or density; use the  Horvitz-Thompson ratio  estimator

 4  Required Elements

 4.1 Input Data

 y_i  =  value of the indicator for the ith unit sampled from subpopulation a.
 π_i  =  For discrete resources, the inclusion probability for selecting the ith unit of
         subpopulation a.  For extensive resources, the inclusion density evaluated at the
         location of the ith sample point in subpopulation a.

-------
               EMAP Estimation Method 1, Rev. No. 0, May 1996, Page 3 of 12
 4.2  Additional Components

 n_a  =  number of units sampled from subpopulation a.
 x_k  =  kth indicator level of interest.
 N_a  =  subpopulation size, if known.

 4.3  Graphical Display Considerations

 Two issues should be resolved before graphing the CDF:  1) how many points to use and 2)
 what are the first and  last points on the plot.  The following are guidelines for the three
 alternatives mentioned in Section 2.  In all three approaches, the plotted points are connected
 by line segments.  Confidence limits are not to exceed one or to drop below zero.  The
 sample y is understood to be in ascending order for this discussion.

 If the empirical CDF is chosen, the number of points plotted is at most n_a+2.  The first
 plotted point is (0,0) when the indicator takes on only positive values.  Otherwise, choose a
 point smaller than y_1 as the abscissa and assign zero as the ordinate.  Choose a point larger
 than y_{n_a} and assign 1 as its ordinate.  Where there is more than one occurrence of an
 indicator level in the data set, plot only one point, using the largest cumulative distribution
 function value associated with this level as the ordinate.

 If the midpoints of adjacent values in y are used for the levels x, at most n_a+1 points are
 plotted.  To determine the first plotted point, calculate the distance between y_1 and y_2.  Take
 half this distance and subtract it from y_1 to obtain the abscissa.  If this abscissa is a negative
 number and the indicator can never be negative, instead assign zero as the abscissa.  Use zero
 as the ordinate.  Similarly, to determine the last plotted point, calculate the distance between
 the two largest y values, y_{n_a - 1} and y_{n_a}.  Halve this distance, add it to y_{n_a}, and
 plot this abscissa using 1 as the ordinate.

 The recommended  approach uses equally  spaced levels across the potential range of the
 indicator.  The levels used should be potential real  values that the  indicator could attain.  For
 example, in the case of discrete data, integer values should be used.  As mentioned
 previously, ideally this range is known a priori and extends beyond the range of any
 particular data set.  If an informed guess cannot be made for this range, one suggestion is to
 use the midpoint approach for obtaining the first and last plot points, as explained in the
 previous paragraph.  How many points to use is a subjective decision that should take into
 account the chosen range, the size of the data set, and sometimes the data distribution itself.
 The following suggestions are given to help decide how many
points to use.

In most cases, using the same number of points as used in the empirical distribution, n_a+2
points, will be sufficient for plotting the CDF. Extreme outliers in a particular data  set may
have a great influence on the graph. In this case, more points may be needed to achieve
 greater resolution within the body of the data. In the case of large data sets, plotting less than

-------
              EMAP Estimation Method 1, Rev. No. 0, May 1996, Page 4 of 12


 n_a+2 points should be adequate.  Begin by using 100 points for these larger data sets.  The
 range of the indicator will have a part in determining if this is  an adequate number of points.
 Trying the plots with differing numbers  of points may be useful to see if the graph changes
 significantly.

 The y-axis (CDF) should range in values from zero to 1.  This method may result in
 confidence limits which drop below zero or exceed  1.  These limits should not appear on the
 plot. Instead, truncate the plotted upper limit at 1 and the plotted lower  limit at zero.
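
One possible way to draw such a plot, assuming Python with matplotlib as the plotting
environment, is sketched below using the example results reported later in Section 6.7.1.

    import numpy as np
    import matplotlib.pyplot as plt

    # Plotted points taken from the results table in Section 6.7.1 (calcium example)
    x     = np.array([0, 0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5])
    cdf   = np.array([0, 0, .2992, .5729, .7638, .7638, .7638, .7638, 1, 1, 1])
    lower = np.array([0, 0, 0, .1957, .4160, .4160, .4160, .4160, 1, 1, 1])
    upper = np.array([0, 0, .6520, .9500, 1, 1, 1, 1, 1, 1, 1])

    # Connect the plotted points with line segments; limits are already truncated at 0 and 1
    plt.plot(x, cdf, marker="o", label="Estimated CDF")
    plt.plot(x, lower, linestyle="--", label="Lower 95% confidence bound")
    plt.plot(x, upper, linestyle="--", label="Upper 95% confidence bound")
    plt.ylim(0, 1)
    plt.xlabel("Calcium (mg/L)")
    plt.ylabel("Estimated proportion")
    plt.legend()
    plt.show()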

 5 Formulas and Definitions

 The estimated CDF (proportion) for indicator value x_k in subpopulation a, F̂_a(x_k), with
 known subpopulation size, N_a (Horvitz-Thompson estimator), is

     \hat{F}_a(x_k) = \frac{1}{N_a} \sum_{i=1}^{n_a} \frac{I(y_i \le x_k)}{\pi_i} .

 The estimated CDF (proportion) for indicator value x_k in subpopulation a, F̂_a(x_k), with
 estimated subpopulation size, N̂_a (Horvitz-Thompson ratio estimator), is

     \hat{F}_a(x_k) = \frac{1}{\hat{N}_a} \sum_{i=1}^{n_a} \frac{I(y_i \le x_k)}{\pi_i} ,
     \qquad \hat{N}_a = \sum_{i=1}^{n_a} \frac{1}{\pi_i} .

 The one-sided 100(1-α)% upper confidence bound, B_U(x_k), is

     B_U(x_k) = \hat{F}_a(x_k) + z_{\alpha} \sqrt{ \hat{V}[\hat{F}_a(x_k)] } .

 The one-sided 100(1-α)% lower confidence bound, B_L(x_k), is

     B_L(x_k) = \hat{F}_a(x_k) - z_{\alpha} \sqrt{ \hat{V}[\hat{F}_a(x_k)] } .

 The two-sided 100(1-α)% confidence interval, C(x_k), is

-------
              EMAP Estimation Method 1, Rev. No. 0, May 1996, Page 5 of 12
     C(x_k) = \left( \hat{F}_a(x_k) - z_{\alpha/2} \sqrt{ \hat{V}[\hat{F}_a(x_k)] } ,\;
               \hat{F}_a(x_k) + z_{\alpha/2} \sqrt{ \hat{V}[\hat{F}_a(x_k)] } \right) .

 For these equations:

 V̂[F̂_a(x_k)]  =  estimated variance of the estimated CDF (proportion) for indicator value x_k
                 in subpopulation a.
 I(y_i ≤ x_k)  =  1 if y_i ≤ x_k; 0 otherwise.
 x_k           =  kth indicator level of interest.
 y_i           =  value of the indicator for the ith unit sampled from subpopulation a.
 π_i           =  For discrete resources, the inclusion probability for selecting the ith unit of
                 subpopulation a.  For extensive resources, the inclusion density evaluated at
                 the location of the ith sample point in subpopulation a.
 n_a           =  number of units sampled from subpopulation a.
 z_α           =  z-score from the standard Normal distribution.
 α             =  level of significance.
6  Procedure

6.1    Enter Data

Input the sample data consisting of the indicator values, y_i, and their associated inclusion
probabilities, π_i.  For example,

     Calcium      Inclusion
                  Probability
       y_i            π_i
     -------      -----------
     1.5992         .07734
     2.3707         .00375
     1.5992         .75000
     2.0000         .75000
     7.0000         .00375
     2.8196         .02227
     1.2204         .01406
     1.5992         .03750
     2.9399         .00586
      .7395         .00375

-------
               EMAP Estimation Method 1, Rev. No. 0, May 1996, Page 6 of 12
 6.2    Sort Data

Sort the sample data in nondecreasing order based on the y_i indicator values.  Keep all
occurrences of an indicator value to obtain correct results.

     Calcium      Inclusion
                  Probability
       y_i            π_i
     -------      -----------
      .7395         .00375
     1.2204         .01406
     1.5992         .07734
     1.5992         .75000
     1.5992         .03750
     2.0000         .75000
     2.3707         .00375
     2.8196         .02227
     2.9399         .00586
     7.0000         .00375
 6.3    Obtain Subpopulation Size

Input N_a if using a known subpopulation size.

Calculate N̂_a from the sample data only if using the Horvitz-Thompson ratio estimator.  Sum
the reciprocals of the inclusion probabilities or densities, π_i, for all units in the sample a to
obtain N̂_a.

N̂_a = (1/.00375) + (1/.01406) + (1/.07734) + ... + (1/.00375) = 1128.939 for this data set.
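
The same sum can be scripted directly, as in the short Python sketch below for the example
data.

    # Reciprocal inclusion probabilities for the sorted calcium example data (Section 6.2)
    pi = [.00375, .01406, .07734, .75000, .03750, .75000, .00375, .02227, .00586, .00375]
    N_hat = sum(1.0 / p for p in pi)
    print(round(N_hat, 3))   # 1128.939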


 6.4    Input Indicator Levels of Interest

 Assign indicator levels of interest, x, based on graphical display considerations.  Choose one
 of the three methods previously discussed in Section 4.3.

 6.4.1  The Recommended Approach — Levels of Interest

 Form an expected range of the indicator before looking at the data.  Next, examine the data
 set to see if the estimated range encompasses all y  values.  If not, increase the range  to
encompass the outlying y values.  If there are large outliers, more points than n_a+2 may be

-------
              EMAP Estimation Method 1, Rev. No. 0, May 1996, Page 7 of 12


needed to retain good resolution in the body of the plot.  Determine evenly spaced x values
across the chosen range.

For this example, the estimated range was .5 to 9.5 mg/L.  The range does not have to be
adjusted because it includes the observed y_i values.  The point spacing interval for x is
x_int = (x_max - x_min)/(n_a - 1) = (9.5 - .5)/(10 - 1) = 9/9 = 1.0.  The 10 x values =
(x_min, x_min + 1.0, x_min + 2(1.0), ...) = (.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5).
Try obtaining the CDF first with these x values and then again with an increased number of
x values spaced closer together.  More points across the range may be needed because all but
one of the y_i values are less than 3.0.

6.4.2 The Empirical CDF — Levels of Interest

For the empirical CDF, x values = (.7395, 1.2204, 1.5992, 2, 2.3707, 2.8196, 2.9399, 7).
Duplicate values  in the data set, 1.5992, do not have to be repeated when forming x .

6.4.3 The Midpoint Approach — Levels of Interest

Calculate the midpoints of each pair of adjacent y_i values to form x.  The first x value is
(.7395 + 1.2204)/2 = .9800.  In this particular data set, there are three occurrences of 1.5992.
As a result, there are two midpoints of 1.5992.  Regardless of how many times a midpoint is
repeated, include it only once in x.  The x values = (.9800, 1.4098, 1.5992, 1.7996, 2.1854,
2.5952, 2.8798, 4.9700).
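
All three choices of levels can be generated programmatically; the Python sketch below does
so for the sorted example data, assuming the recommended range of .5 to 9.5.

    import numpy as np

    y = np.array([.7395, 1.2204, 1.5992, 1.5992, 1.5992, 2.0000,
                  2.3707, 2.8196, 2.9399, 7.0000])       # sorted indicator values

    x_recommended = np.linspace(0.5, 9.5, 10)             # evenly spaced across the chosen range
    x_empirical   = np.unique(y)                          # unique observed values
    x_midpoint    = np.unique((y[:-1] + y[1:]) / 2)       # midpoints of adjacent ordered values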

6.5    Compute Cumulative Distribution Function Values

Calculate F̂_a(x_k) for each element in x using the formulas from Section 5.


To calculate F̂_a(x_1), compare each y_i to x_1.  If y_i is less than or equal to x_1, then 1/π_i is
added to the computation of F̂_a(x_1) until y_i exceeds x_1 (when using sorted data).  Divide the
cumulative sum of these 1/π_i's by N_a or N̂_a (depending on the estimator used) to obtain
F̂_a(x_1).

Similarly, to calculate F̂_a(x_2), compare each y_i to x_2, add the 1/π_i's until y_i exceeds x_2,
and then divide this sum by N_a or N̂_a.

Do this for every value in x.

Below is an example for obtaining the cumulative sum for each F̂_a(x_k).  Complete results
for the example data are in Section 6.7.

-------
             EMAP Estimation Method  1, Rev. No. 0, May 1996, Page 8 of 12
 Calcium   Inclusion     Indicator Level
           Probability   of Interest
   y_i         π_i            x_k         Cumulative Sum for F̂_a(x_k)
 -------   -----------   ---------------  ---------------------------------------------------
  .7395      .00375          .7395        1/.00375
 1.2204      .01406         1.2204        1/.00375 + 1/.01406
 1.5992      .07734         1.5992        1/.00375 + 1/.01406 + 1/.07734 + 1/.75000 + 1/.03750
 1.5992      .75000
 1.5992      .03750
 2.0000      .75000         2.0000        1/.00375 + 1/.01406 + 1/.07734 + 1/.75000 + 1/.03750
                                             + 1/.75000
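
The ratio-estimator calculation can also be scripted; the Python sketch below computes
F̂_a(x_k) for the sorted calcium example, and its output should agree with the results
reported in Section 6.7.1.

    import numpy as np

    # Sorted calcium example data (Section 6.2)
    y  = np.array([.7395, 1.2204, 1.5992, 1.5992, 1.5992, 2.0000,
                   2.3707, 2.8196, 2.9399, 7.0000])
    pi = np.array([.00375, .01406, .07734, .75000, .03750, .75000,
                   .00375, .02227, .00586, .00375])

    x = np.arange(0.5, 10.0, 1.0)        # indicator levels of interest (Section 6.4.1)
    N_hat = np.sum(1.0 / pi)             # estimated subpopulation size (Section 6.3)

    # Horvitz-Thompson ratio estimator (Section 5): for each x_k, sum 1/pi_i over the
    # units with y_i <= x_k, then divide by the estimated subpopulation size
    F_hat = np.array([np.sum((y <= xk) / pi) for xk in x]) / N_hat

    for xk, f in zip(x, F_hat):
        print(f"{xk:4.1f}  {f:.4f}")
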
6.6    Compute Confidence Limits

Calculate the confidence bound (upper or lower) or confidence interval for each F̂_a(x_k)
using the formulas from Section 5.

Estimate the variance of F̂_a(x_k) using an applicable method listed in Section 7.  Next, take
the square root of the variance and multiply this square root by the z-score from the standard
Normal distribution corresponding to the desired confidence level.  Add this quantity to
F̂_a(x_k) to obtain the upper bound, B_U(x_k).  Subtract this quantity from F̂_a(x_k) to obtain
the lower bound, B_L(x_k).  For the confidence interval, obtain both B_L(x_k) and B_U(x_k).
For example, 1.645 would be the z_α for a one-sided 95% upper or lower confidence bound,
and the z_{α/2} for a two-sided 90% confidence interval.  A two-sided 95% confidence
interval would use 1.96 for z_{α/2}.
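
Continuing the sketch from Section 6.5, the limits can be formed from any suitable variance
estimate; the Python sketch below uses the hypothetical variances of Section 6.7.1 and
assumes scipy for the Normal quantile.

    import numpy as np
    from scipy.stats import norm

    # Estimated CDF and hypothetical variances at x = .5, 1.5, ..., 9.5 (Section 6.7.1)
    F_hat = np.array([0, .2992, .5729, .7638, .7638, .7638, .7638, 1, 1, 1])
    var_F = np.array([0, .046005, .052579, .044710, .044710, .044710, .044710, 0, 0, 0])

    z = norm.ppf(0.95)                         # 1.645 for one-sided 95% bounds
    half_width = z * np.sqrt(var_F)
    lower = np.clip(F_hat - half_width, 0, 1)  # truncate lower limits at zero
    upper = np.clip(F_hat + half_width, 0, 1)  # truncate upper limits at one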

6.7    Output Results

Output the indicator  levels of interest, the associated CDF value, and either a confidence
bound (upper or lower) or a confidence interval for F̂_a(x_k).  If the output generated will be
used for graphing the CDF, append the first and last graph points  to this output as directed
for the three methods below.  The tables  in Section  6.7.1  - 6.7.3 contain results for the ratio
estimator applied to the example data.  A hypothetical variance is  used in confidence bound
and interval calculations.
When upper bounds exceed 1, they are set equal to 1.  Lower bounds less than zero are set
equal to zero.

-------
             EMAP Estimation Method 1, Rev. No. 0, May 1996, Page 9 of 12
6.7.1   The Recommended Approach — Results

Append the point (0,0) to the output file for graphing purposes.  Since x_max, 9.5, exceeds the
maximum y_i, 7, no other points are appended.
              CDF for            Hypothetical    One-sided 95%   One-sided 95%   Two-sided 90%
   Calcium    Proportion,        Variance        Lower Conf.     Upper Conf.     Conf. Interval
              Ratio Estimator                    Bound           Bound
     x_k      F̂_a(x_k)           V̂[F̂_a(x_k)]     B_L(x_k)        B_U(x_k)        C(x_k)
   -------    ---------------    ------------    -------------   -------------   --------------
     0*            0*                0*              0*              0*          (0,0)*
     0.5           0                 0               0               0           (0,0)
     1.5          .2992             .046005          0**            .6520        (0,.6520)
     2.5          .5729             .052579         .1957           .9500        (.1957,.9500)
     3.5          .7638             .044710         .4160           1**          (.4160,1)
     4.5          .7638             .044710         .4160           1**          (.4160,1)
     5.5          .7638             .044710         .4160           1**          (.4160,1)
     6.5          .7638             .044710         .4160           1**          (.4160,1)
     7.5          1                  0               1               1           (1,1)
     8.5          1                  0               1               1           (1,1)
     9.5          1                  0               1               1           (1,1)

    * appended
   ** set to 0 or 1

-------
            EMAP Estimation Method  1, Rev. No. 0, May 1996, Page 10 of 12
6.7.2   The Empirical CDF — Results

Append the point (0,0) to the output file for graphing purposes.  Append also a point slightly
larger than the largest x value and assign an ordinate of 1.  For this example, the point (7.5,1)
is appended.
              CDF for            Hypothetical    One-sided 95%   One-sided 95%   Two-sided 90%
   Calcium    Proportion,        Variance        Lower Conf.     Upper Conf.     Conf. Interval
              Ratio Estimator                    Bound           Bound
     x_k      F̂_a(x_k)           V̂[F̂_a(x_k)]     B_L(x_k)        B_U(x_k)        C(x_k)
   -------    ---------------    ------------    -------------   -------------   --------------
     0*            0*                0*              0*              0*          (0,0)*
     0.7395       .2362             .044710          0**            .5840        (0,.5840)
     1.2204       .2992             .046005          0**            .6520        (0,.6520)
     1.5992       .3355             .046453          0**            .6900        (0,.6900)
     2.0000       .3366             .046467          0**            .6912        (0,.6912)
     2.3707       .5729             .052579         .1957           .9500        (.1957,.9500)
     2.8196       .6126             .052209         .2368           .9885        (.2368,.9885)
     2.9399       .7638             .044710         .4160           1**          (.4160,1)
     7.0000       1                  0               1               1           (1,1)
     7.5000*      1*                 0*              1*              1*          (1,1)*

    * appended
   ** set to 0 or 1
 6.7.3  The Midpoint Approach — Results

Determine the first plotted point by calculating the distance between the first two y_i values,
.7395 and 1.2204.  Take half this distance and subtract it from .7395 to obtain .7395 -
[(1.2204 - .7395)/2] = .4991.  Append (.4991,0) to the output as the first plotted point.  If a
negative number were obtained and the indicator can never be negative, append (0,0) as the
first plotted point.  Similarly, to determine the last plotted point, calculate the distance
between the two largest y_i values, 2.9399 and 7.  Take half this distance and add it to 7 to
obtain 7 + [(7 - 2.9399)/2] = 9.0301.  Because the distance between these last two y_i values is
relatively large, choosing the last point slightly above 7 with an ordinate of 1 may be
preferable to appending (9.0301,1) to the output.  For this example, (7.5,1) was appended.
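The first and last plot points can be computed as in the brief Python sketch below (illustrative only; y is assumed sorted and the indicator nonnegative).

    y = [0.7395, 1.2204, 1.5992, 1.5992, 1.5992, 2.0000,
         2.3707, 2.8196, 2.9399, 7.0000]

    # Half the gap below the smallest value, floored at zero for a nonnegative indicator.
    first_x = max(y[0] - (y[1] - y[0]) / 2.0, 0.0)   # approximately .4991
    # Half the gap above the largest value.
    last_x = y[-1] + (y[-1] - y[-2]) / 2.0           # approximately 9.0301
    print(first_x, last_x)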
    Calcium   CDF for          Hypothetical    One-sided 95%   One-sided 95%   Two-sided 90%
              Proportion,      Variance        Lower Conf.     Upper Conf.     Conf. Interval
    x_k       Ratio Estimator  V̂[F̂_a(x_k)]     Bound B_L(x_k)  Bound B_U(x_k)  C(x_k)
              F̂_a(x_k)
    .4991*    0*               0*              0*              0*              (0,0)*
    .9800     .2362            .044710         0**             .5840           (0,.5840)
    1.4098    .2992            .046005         0**             .6520           (0,.6520)
    1.5992    .3355            .046453         0**             .6900           (0,.6900)
    1.7996    .3355            .046453         0**             .6900           (0,.6900)
    2.1854    .3366            .046467         0**             .6912           (0,.6912)
    2.5952    .5729            .052579         .1957           .9500           (.1957,.9500)
    2.8798    .6126            .052209         .2368           .9885           (.2368,.9885)
    4.9700    .7638            .044710         .4160           1**             (.4160,1)
    7.5000*   1*               0*              1*              1*              (1,1)*

    *appended
    **set to 0 or 1
7  Associated Methods

An appropriate variance estimator for this estimated CDF for discrete resources may be found
in Method 5 (Horvitz-Thompson Variance Estimator) or Method 12 (Yates-Grundy Variance
Estimator).  For extensive resources, use Method 10 (Horvitz-Thompson Variance Estimator) or
Method 12 (Yates-Grundy Variance Estimator).

8  Validation Data
Actual data with results, EMAP Design and Statistics Dataset #1, are available for comparing
results from other versions of these algorithms.

9  Notes

The method which uses the ratio estimator may perform better under certain conditions and
may be used even if the subpopulation size is known.  Sampling done with variable
probability and a variable sample size, n_a, are two of these conditions.  The ratio estimator
retains its stability in these cases, which can be seen by comparing the two equations.
The ratio estimator tends to have smaller variance than the other estimator because the
numerator and denominator tend to be positively correlated.  The estimator using the known
subpopulation size does not compensate for variability in the numerator.

The ratio estimator should be used in the case of missing data.  The estimated CDF applies
only to the subpopulation for which data were obtained.  Because the size (number or extent)
of this subpopulation is not known, it must be estimated.  Therefore, the ratio estimator is the
only alternative for estimating the CDF.  All graphs should be labeled as applying only to the
population that was sampled and not to the original target population.

10  References

Cochran, W. G.  1977.  Sampling techniques. 3rd edition.  New York: John Wiley & Sons.

U.S. Environmental Protection Agency (EPA).  1993.  Surface waters 1991 pilot report.
   EPA/620/R-93/003.  Washington, DC: U.S. Environmental Protection Agency.

U.S. Environmental Protection Agency (EPA).  1993.  Statistical summary: EMAP-Estuaries
   Louisianian province - 1991.  EPA/620/R-93/007.  Washington, DC: U.S. Environmental
   Protection Agency.

U.S. Environmental Protection Agency (EPA).  1994.  Statistical summary: EMAP-Estuaries
   Virginian province - 1991.  EPA/620/R-94/005.  Washington, DC: U.S. Environmental
   Protection Agency.

U.S. Environmental Protection Agency (EPA).  1994.  Forest Health Monitoring 1992 annual
   statistical summary.  EPA/620/R-94/010.  Washington, DC: U.S. Environmental Protection
   Agency.

Lesser, V. M., and W. S. Overton.  1994.  EMAP status estimation: Statistical procedures
   and algorithms.  EPA/620/R-94/008.  Washington, DC: U.S. Environmental Protection
   Agency.

Overton, W. S.  1987.  A sampling and analysis plan for streams in the National Surface
   Water Survey.  Technical Report 117.  Corvallis, OR: Oregon State University, Department
   of Statistics.



ESTIMATION METHOD 2:  Estimation of the Cumulative Distribution Function for the
Total Number of a Discrete Resource;  Horvitz-Thompson Estimator, Normal Approximation

 1  Scope and Application

 This method calculates the estimate of the cumulative distribution function (CDF) for the total
 number of a discrete resource that has an indicator value equal to or less than a given
 indicator level.   The method applies to  any probability sample and presents two estimators.
 An estimate can be produced for the entire population or for an  arbitrary subpopulation with
 known or unknown size. This size is the number of units in the subpopulation. Suggestions
 for estimating the CDF over the range of the indicator are included.  Alternatively, the CDF
can be calculated at the indicator levels found in the probability sample.  The method uses the
 Normal approximation to provide confidence bounds or intervals for the true cumulative
 distribution function.  This method does not include variance estimators for the estimated
 CDF.  For information on appropriate variance estimators, refer  to Section 7.

 This method has  been applied in:

       The  1991  Surface Waters Pilot Report

 2  Statistical Estimation Overview

A sample of size n_a units is selected from subpopulation a with known inclusion probabilities
π = {π_1, ..., π_i, ..., π_na}.  The indicator is evaluated for each unit and represented by
y = {y_1, ..., y_i, ..., y_na}.

Estimates of the cumulative distribution function are obtained for the indicator levels of
interest, x = {x_1, ..., x_k, ..., x_m}.  Several alternatives are available for choosing x.  The
recommended alternative is the use of equally spaced values across the range of the indicator.
Ideally, this range is known a priori and extends beyond the range of any particular data set.
A second alternative is to use the set of unique values in the data set.  This alternative gives
the classical empirical cumulative distribution function.  A third alternative is to use the
midpoints of adjacent ordered values in y for the levels x.

To obtain the estimated cumulative distribution function, F̂_a(x_k), the Horvitz-Thompson
estimator of a cumulative total is calculated for each x_k by summing up the number of
indicators which are less than or equal to the x_k value.  Alternatively, when the subpopulation
size is known, first form the Horvitz-Thompson ratio estimator by dividing this cumulative
total by the estimated subpopulation size, N̂_a, and then multiply this ratio by the known
subpopulation size, N_a, to obtain F̂_a(x_k).

The Horvitz-Thompson ratio estimator may perform better than the estimator which does not
use the known subpopulation size, N_a.  Some of the conditions under which this ratio
estimator is recommended are given in Section 9.  This ratio estimator cannot be used in the
case of missing data.

Confidence limits for F̂_a(x_k) are produced by assuming a Normal distribution.  These limits
may be used to construct either a lower confidence bound, an upper confidence bound, or a
confidence interval for F̂_a(x_k).  Computation of these limits requires an estimated variance
of F̂_a(x_k), which is not provided in this method.  Details for computing a suitable estimated
variance of F̂_a(x_k) are found in other methods referenced in Section 7.

The output consists of the estimated cumulative distribution function values with either a one-
sided confidence bound (upper or lower) or a confidence interval for F̂_a(x_k).

 3  Conditions Under Which This Method Applies

 •    Probability sample  with known inclusion probabilities
 •    Discrete resource
 •    Arbitrary subpopulation
 •    All units sampled from the subpopulation must be accounted for before applying this
     method
 •    When the indicator value is missing, exclude this missing value and the corresponding
     inclusion probability;  use  the Horvitz-Thompson estimator of a total

 4  Required Elements

 4.1 Input Data

y_i  = value of the indicator for the ith unit sampled from subpopulation a.
π_i  = inclusion probability for selecting the ith unit of subpopulation a.

4.2 Additional Components

n_a  = number of units sampled from subpopulation a.
x_k  = kth indicator level of interest.
N_a  = subpopulation size, if known.

 4.3 Graphical Display  Considerations

 Two issues should be resolved before graphing the CDF:  1) how many points to use and 2)
 what are  the first and last points on the plot. The following are guidelines for the three
 alternatives mentioned in Section 2. In all three approaches, the plotted points are connected
 by line segments.  The  sample y is understood to be in ascending order for this discussion.

-------
              EMAP Estimation Method 2, Rev. No. 0, May 1996, Page "3 of 12
If the empirical CDF is chosen, the number of points plotted is at most n_a+2.  The first
plotted point is (0,0) when the indicator takes on only positive values.  Otherwise, choose a
point smaller than y_1 as the abscissa and assign zero as the ordinate.  Where there is more
than one occurrence of an indicator level in the data set, plot only one point using the largest
cumulative distribution function value associated with this level as the ordinate.

If the midpoints of adjacent values in y are used for the levels x, at most n_a+1 points are
plotted.  To determine the first plotted point, calculate the distance between y_1 and y_2.  Take
half this distance and subtract it from y_1 to obtain the abscissa.  If this abscissa is a negative
number and the indicator can never be negative, instead assign zero as the abscissa.  Use zero
as the ordinate.  Similarly, to determine the last plotted point, calculate the distance between
the two largest y values, y_(n_a-1) and y_(n_a).  Halve this distance, add it to y_(n_a), and plot this abscissa
using the cumulative total number associated with y_(n_a) as its ordinate.


The recommended approach uses equally spaced levels across the potential range of the
indicator.  The levels used should be potential real values that the indicator could attain.  In
the case of discrete data, integer values should be used.  As mentioned previously, ideally
this range is known a priori and extends beyond the range of any particular data set.  If an
informed guess cannot be made for this range, one suggestion is to use the
midpoint approach for obtaining the first and last plot points as explained in the previous
paragraph.  How many points to use is a subjective decision that should take into account the
chosen range and the size of the data set; sometimes the data distribution itself must be
examined.  The following suggestions are given to help decide how many points to use.

In most cases, using the same number of points as used in the empirical distribution, n_a+2
points, will be sufficient for plotting the CDF.  Extreme outliers in a particular data set may
have a great influence on the graph.  In this case, more points may be needed to achieve
greater resolution within the body of the data.  In the case of large data sets, plotting fewer than
n_a+2 points should be adequate.  Begin by using 100 points for these larger data sets.  The
range of the indicator will have a part in determining if this is an adequate number of points.
Trying the plots with differing numbers of points may be useful to see if the graph changes
significantly.

The y-axis (CDF) should range in values from zero to either the known or estimated
subpopulation size, depending on the estimator used.  This size will be the cumulative total
number associated with y_(n_a).  This method may result in confidence limits which drop below
zero or exceed the applicable subpopulation size.  These limits should not appear on the plot.
Instead, truncate the plotted upper limit at F̂_a(y_(n_a)).  Truncate the plotted lower limit at zero.



5  Formulas and Definitions

The estimated CDF (total number) for indicator value x_k in subpopulation a, F̂_a(x_k);
Horvitz-Thompson estimator of a total is

    \hat{F}_a(x_k) = \sum_{i=1}^{n_a} \frac{I(y_i \le x_k)}{\pi_i}

The estimated CDF (total number) for indicator value x_k in subpopulation a, F̂_a(x_k), with
known subpopulation size, N_a, and estimated subpopulation size, N̂_a; Horvitz-Thompson
ratio estimator is

    \hat{F}_a(x_k) = \frac{N_a}{\hat{N}_a} \sum_{i=1}^{n_a} \frac{I(y_i \le x_k)}{\pi_i} , \qquad \hat{N}_a = \sum_{i=1}^{n_a} \frac{1}{\pi_i}

The one-sided 100(1-α)% upper confidence bound, B_U(x_k), is

    B_U(x_k) = \hat{F}_a(x_k) + z_{\alpha} \sqrt{\hat{V}[\hat{F}_a(x_k)]}

The one-sided 100(1-α)% lower confidence bound, B_L(x_k), is

    B_L(x_k) = \hat{F}_a(x_k) - z_{\alpha} \sqrt{\hat{V}[\hat{F}_a(x_k)]}

The two-sided 100(1-α)% confidence interval, C(x_k), is

    C(x_k) = \left( \hat{F}_a(x_k) - z_{\alpha/2} \sqrt{\hat{V}[\hat{F}_a(x_k)]} ,\; \hat{F}_a(x_k) + z_{\alpha/2} \sqrt{\hat{V}[\hat{F}_a(x_k)]} \right)

For these equations:

V̂[F̂_a(x_k)] = estimated variance of the estimated CDF (total number) for indicator value x_k
              in subpopulation a.
I(y_i ≤ x_k) = 1 if y_i ≤ x_k, and 0 otherwise.
x_k  = kth indicator level of interest.
y_i  = value of the indicator for the ith unit sampled from subpopulation a.
π_i  = inclusion probability for selecting the ith unit of subpopulation a.
n_a  = number of units sampled from subpopulation a.
z_α  = z-score from the standard Normal distribution.
α    = level of significance.

6  Procedure

6.1    Enter Data

Input the sample data consisting of the indicator values, y_i, and their associated inclusion
probabilities, π_i.  For example,

    Calcium    Inclusion
    y_i        Probability π_i
    1.5992     .07734
    2.3707     .00375
    1.5992     .75000
    2.0000     .75000
    7.0000     .00375
    2.8196     .02227
    1.2204     .01406
    1.5992     .03750
    2.9399     .00586
    .7395      .00375

6.2    Sort Data

Sort the sample data in nondecreasing order based on the y_i indicator values.  Keep all
occurrences of an indicator value to obtain correct results.

    Calcium    Inclusion
    y_i        Probability π_i
    .7395      .00375
    1.2204     .01406
    1.5992     .07734
    1.5992     .75000
    1.5992     .03750
    2.0000     .75000
    2.3707     .00375
    2.8196     .02227
    2.9399     .00586
    7.0000     .00375
6.3    Obtain Subpopulation Size

If using the Horvitz-Thompson ratio estimator, input N_a and calculate N̂_a from the sample
data.  Sum the reciprocals of the inclusion probabilities, π_i, for all units in the sample a to
obtain N̂_a.

N̂_a = (1/.00375) + (1/.01406) + (1/.07734) + . . . + (1/.00375) = 1128.939 for this data set.
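A brief Python sketch of this calculation (illustrative, not from the manual):

    pi_i = [0.00375, 0.01406, 0.07734, 0.75000, 0.03750,
            0.75000, 0.00375, 0.02227, 0.00586, 0.00375]

    # Estimated subpopulation size = sum of reciprocal inclusion probabilities.
    N_hat = sum(1.0 / p for p in pi_i)
    print(round(N_hat, 3))   # 1128.939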

6.4    Input Indicator Levels of Interest

Assign indicator levels of interest, x, based on graphical display considerations.  Choose one
of the three methods previously discussed in Section 4.3.

6.4.1  The  Recommended Approach — Levels of Interest

Form an expected range of the indicator before looking at the data.  Next, examine the data
set to see if the estimated range encompasses all y values. If not, increase the range to
encompass the outlying y values.   If there are large outliers, more points than na+2 may be
needed to retain good resolution in the body of the plot.  Determine evenly spaced x values
across the chosen range.

For this example, the estimated range was .5 to 9.5 mg/L.  The range does not have to be
adjusted because it includes the observed y_i values.  The point spacing interval for x, x_int =
(x_max - x_min)/(n_a - 1) = (9.5 - .5)/(10 - 1) = 9/9 = 1.0.  The 10 x values = (x_min, x_min+1.0,
x_min+2(1.0), ... ) = (.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5).  Try obtaining the CDF first
with these x values and then again with an increased number of x values spaced closer
together.  More points across the range may be needed because all but one of the y_i values
are less than 3.0.
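A short Python sketch of the spacing calculation (illustrative only; x_min, x_max, and n_points are the analyst's choices):

    x_min, x_max, n_points = 0.5, 9.5, 10

    step = (x_max - x_min) / (n_points - 1)            # 1.0 for this example
    x_levels = [x_min + k * step for k in range(n_points)]
    print(x_levels)   # [0.5, 1.5, 2.5, ..., 9.5]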

6.4.2  The Empirical CDF — Levels of Interest

For the empirical CDF, x values = (.7395, 1.2204, 1.5992, 2, 2.3707, 2.8196, 2.9399, 7).
Duplicate values in the data set, 1.5992, do not have to be repeated when forming x.

6.4.3  The Midpoint Approach — Levels of Interest

Calculate the midpoints of each pair of adjacent y_i values to form x.  The first x value is
(.7395+1.2204)/2 = .9800.  In this particular data set, there are three occurrences of 1.5992.
As a result, there are two midpoints of 1.5992.  Regardless of how many times a midpoint is
repeated, include it only once in x.  The x values = (.9800, 1.4098, 1.5992, 1.7996, 2.1854,
2.5952, 2.8798, 4.9700).

6.5    Compute Cumulative Distribution Function Values

Calculate F̂_a(x_k) for each element in x using the formulas from Section 5.

To calculate F̂_a(x_1), compare each y_i to x_1.  If y_i is less than or equal to x_1, then 1/π_i is
added to the computation of F̂_a(x_1) until y_i exceeds x_1 (when using sorted data).  Multiply
the cumulative sum of these 1/π_i's by N_a/N̂_a to obtain F̂_a(x_1) if using the Horvitz-
Thompson ratio estimator.  Otherwise, this cumulative sum is F̂_a(x_1) if using the Horvitz-
Thompson estimator of a total.

Similarly, to calculate F̂_a(x_2), compare each y_i to x_2, add the 1/π_i's until y_i exceeds x_2, and
then multiply this sum by N_a/N̂_a if applicable.

Do this for every value in x.

Below is an example of obtaining the cumulative sum for each F̂_a(x_k).  Complete results
for the example data are in Section 6.7.

    Calcium    Inclusion        Indicator Level    Cumulative Sum for F̂_a(x_k)
    y_i        Probability π_i  of Interest x_k
    .7395      .00375           .7395              1/.00375
    1.2204     .01406           1.2204             1/.00375 + 1/.01406
    1.5992     .07734           1.5992             1/.00375 + 1/.01406 + 1/.07734 + 1/.75000 + 1/.03750
    1.5992     .75000
    1.5992     .03750
    2.0000     .75000           2.0000             1/.00375 + 1/.01406 + 1/.07734 + 1/.75000 + 1/.03750 + 1/.75000
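A minimal Python sketch of this computation for the total-number CDF is given below (illustrative only); N_a = 1130 is the known subpopulation size used in the example tables of Section 6.7.

    y    = [0.7395, 1.2204, 1.5992, 1.5992, 1.5992, 2.0000,
            2.3707, 2.8196, 2.9399, 7.0000]
    pi_i = [0.00375, 0.01406, 0.07734, 0.75000, 0.03750, 0.75000,
            0.00375, 0.02227, 0.00586, 0.00375]
    N_a   = 1130.0                                   # known subpopulation size (example)
    N_hat = sum(1.0 / p for p in pi_i)               # estimated size, about 1128.939

    def cdf_total(xk, ratio=True):
        # Cumulative Horvitz-Thompson total; rescaled by N_a/N_hat for the ratio estimator.
        total = sum(1.0 / p for yi, p in zip(y, pi_i) if yi <= xk)
        return total * N_a / N_hat if ratio else total

    print(round(cdf_total(1.5), 2))                  # approximately 338.1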
6.6    Compute Confidence Limits

Calculate the confidence bound (upper or lower) or confidence interval for each F̂_a(x_k)
using the formulas from Section 5.

Estimate the variance of F̂_a(x_k) using an applicable method listed in Section 7.  Next, take
the square root of the variance and multiply this square root by the
z-score from the standard Normal distribution corresponding to the desired confidence level.
Add this quantity to F̂_a(x_k) to obtain the upper bound, B_U(x_k).  Subtract this quantity from
F̂_a(x_k) to obtain the lower bound, B_L(x_k).  For the confidence interval, obtain both B_L(x_k)
and B_U(x_k).  For example, 1.645 would be the z_α for a one-sided 95% upper or lower
confidence bound, and the z_α/2 for a two-sided 90% confidence interval.  A two-sided 95%
confidence interval would use 1.96 for z_α/2.

6.7    Output Results

Output the indicator levels of interest, the associated CDF value, and either a confidence
bound (upper or lower) or a confidence interval for F̂_a(x_k).  If the output generated will be
used for graphing the CDF, append the first and last graph points to this output as directed
for the three methods below.  The tables in Sections 6.7.1 - 6.7.3 contain results for the ratio
estimator applied to the example data.  A hypothetical variance is used in confidence bound
and interval calculations.
When upper bounds exceed the known subpopulation size, they are set equal to that size.
Lower bounds less than zero are set equal to zero.

6.7.1   The Recommended Approach — Results

Append the point (0,0) to the output file for graphing purposes.  Since x_max, 9.5, exceeds the
maximum y_i, 7, no other points are appended.
    Calcium   CDF for Total    Hypothetical    One-sided 95%   One-sided 95%   Two-sided 90%
              Number, Ratio    Variance        Lower Conf.     Upper Conf.     Conf. Interval
    x_k       Estimator        V̂[F̂_a(x_k)]     Bound B_L(x_k)  Bound B_U(x_k)  C(x_k)
              F̂_a(x_k)
    0*        0*               0*              0*              0*              (0,0)*
    0.5       0                0               0               0               (0,0)
    1.5       338.10           58744           0**             736.80          (0,736.80)
    2.5       647.38           67138           221.14          1073.62         (221.14,1073.62)
    3.5       863.09           57090           470.04          1130**          (470.04,1130)
    4.5       863.09           57090           470.04          1130**          (470.04,1130)
    5.5       863.09           57090           470.04          1130**          (470.04,1130)
    6.5       863.09           57090           470.04          1130**          (470.04,1130)
    7.5       1130             0               1130            1130            (1130,1130)
    8.5       1130             0               1130            1130            (1130,1130)
    9.5       1130             0               1130            1130            (1130,1130)

    *appended
    **set to 0
    **set to 1130

6.7.2   The Empirical CDF — Results

Append the point (0,0) to the output file for graphing purposes.
    Calcium   CDF for Total    Hypothetical    One-sided 95%   One-sided 95%   Two-sided 90%
              Number, Ratio    Variance        Lower Conf.     Upper Conf.     Conf. Interval
    x_k       Estimator        V̂[F̂_a(x_k)]     Bound B_L(x_k)  Bound B_U(x_k)  C(x_k)
              F̂_a(x_k)
    0*        0*               0*              0*              0*              (0,0)*
    0.7395    266.91           57090           0**             659.96          (0,659.96)
    1.2204    338.10           58744           0**             736.80          (0,736.80)
    1.5992    379.12           59316           0**             779.76          (0,779.76)
    2.0000    380.36           59334           0**             781.06          (0,781.06)
    2.3707    647.38           67138           221.14          1073.62         (221.14,1073.62)
    2.8196    692.24           66666           267.51          1116.98         (267.51,1116.98)
    2.9399    863.09           57090           470.04          1130**          (470.04,1130)
    7.0000    1130             0               1130            1130            (1130,1130)

    *appended
    **set to 0
    **set to 1130

6.7.3  The Midpoint Approach — Results

Determine the first plotted point by calculating the distance between the first two y_i values,
.7395 and 1.2204.  Take half this distance and subtract it from .7395 to obtain .7395 -
[(1.2204 - .7395)/2] = .4991.  Append (.4991,0) to the output as the first plotted point.  If a
negative number were obtained and the indicator can never be negative, append (0,0) as the
first plotted point.  Similarly, to determine the last plotted point, calculate the distance
between the two largest y_i values, 2.9399 and 7.  Take half this distance and add it to 7 to
obtain 7 + [(7 - 2.9399)/2] = 9.0301.  Because the distance between these last two y_i values is
relatively large, choosing the last point slightly above 7 with an ordinate of F̂_a(7) may be
preferable to appending (9.0301, F̂_a(7)) to the output.  For this example, (7.5,1130) was
appended.
    Calcium   CDF for Total    Hypothetical    One-sided 95%   One-sided 95%   Two-sided 90%
              Number, Ratio    Variance        Lower Conf.     Upper Conf.     Conf. Interval
    x_k       Estimator        V̂[F̂_a(x_k)]     Bound B_L(x_k)  Bound B_U(x_k)  C(x_k)
              F̂_a(x_k)
    .4991*    0*               0*              0*              0*              (0,0)*
    .9800     266.91           57090           0**             659.96          (0,659.96)
    1.4098    338.10           58744           0**             736.80          (0,736.80)
    1.5992    379.12           59316           0**             779.76          (0,779.76)
    1.7996    379.12           59316           0**             779.76          (0,779.76)
    2.1854    380.36           59334           0**             781.06          (0,781.06)
    2.5952    647.38           67138           221.14          1073.62         (221.14,1073.62)
    2.8798    692.24           66666           267.51          1116.98         (267.51,1116.98)
    4.9700    863.09           57090           470.04          1130**          (470.04,1130)
    7.5000*   1130*            0*              1130*           1130*           (1130,1130)*

    *appended
    **set to 0
    **set to 1130
7  Associated Methods

An appropriate variance estimator for this estimated CDF for discrete resources may be found
in Method 6 (Horvitz-Thompson  Variance Estimator).



 8  Validation Data

 Actual data with results, EMAP Design and Statistics Dataset #2, are available for comparing
 results from other versions of these algorithms.

 9  Notes

The method which uses the ratio estimator may perform better under certain conditions and
may be used only if the subpopulation size is known.  Sampling done with variable
probability and a variable sample size, n_a, are two of these conditions.  The ratio estimator
retains its stability in these cases and tends to have smaller variance than the other
estimator because the numerator and denominator tend to be positively correlated.

 In the case of missing data, the ratio estimator cannot be used because the size of the
 subpopulation is not known.  All graphs should  be labeled as applying only to the population
 that was sampled and not to the original target population.

 10  References

U.S. Environmental Protection Agency (EPA).  1993.  Surface waters 1991 pilot report.
   EPA/620/R-93/003.  Washington, DC: U.S. Environmental Protection Agency.

Lesser, V. M., and W. S. Overton.  1994.  EMAP status estimation: Statistical procedures
   and algorithms.  EPA/620/R-94/008.  Washington, DC: U.S. Environmental Protection
   Agency.

Overton, W. S.  1987.  A sampling and analysis plan for streams in the National Surface
   Water Survey.  Technical Report 117.  Corvallis, OR: Oregon State University, Department
   of Statistics.

ESTIMATION METHOD 3:  Estimation of the Size-Weighted Cumulative Distribution
Function for Proportion of a Discrete Resource;  Horvitz-Thompson Estimator, Normal
Approximation

1  Scope and Application

This method calculates the estimate of the size-weighted cumulative distribution function
(CDF) for the proportion of a discrete resource that has an indicator value equal to or less
than a given indicator level. The size-weight value is a measurement of the discrete resource
such as area of a lake. The method applies to any probability sample and presents two
estimators.  An estimate can be produced for the entire population or for an arbitrary
subpopulation with known or unknown size, where this size is the size-weighted total in the
subpopulation. Suggestions for estimating the CDF over the range of the indicator are
included.  Alternatively, the CDF can be calculated at the indicator levels found in the
probability sample. The method uses the Normal approximation to provide confidence
bounds or intervals for the true cumulative distribution function.  This method does not
include variance estimators for the estimated CDF.  For information on  appropriate variance
estimators, refer to Section 7.

This method has been applied in:

       The 1991 Surface Waters Pilot Report

2  Statistical Estimation  Overview

A sample of size n_a units is selected from subpopulation a with known inclusion probabilities
π = {π_1, ..., π_i, ..., π_na} and size-weight values w = {w_1, ..., w_i, ..., w_na}.  The indicator is
evaluated for each unit and represented by y = {y_1, ..., y_i, ..., y_na}.

Estimates of the cumulative distribution function are obtained for the indicator levels of
interest, x = {x_1, ..., x_k, ..., x_m}.  Several alternatives are available for choosing x.  The
recommended alternative is the use of equally spaced values across the range of the indicator.
A second alternative is to use the set of unique values in the data set.  This alternative gives
the classical empirical cumulative distribution function.  A third alternative is to use the
midpoints of adjacent ordered values in y for the levels x.

To obtain the estimated size-weighted cumulative distribution function, F̂_a(x_k), the
Horvitz-Thompson estimator of a cumulative total, weighted by the size-weight values w_i, is
divided by the known subpopulation size, W_a.  Alternatively, when the subpopulation size is
unknown, the estimated subpopulation size, Ŵ_a, is
substituted for the known subpopulation size, W_a, to form the Horvitz-Thompson ratio
estimator.

The Horvitz-Thompson ratio estimator may perform better than the estimator using the known
subpopulation size, W_a, and may be used even if the subpopulation size is known.  Some of
the conditions under which this ratio estimator is recommended are given in Section 9.  This
ratio estimator should always be used in the case of missing data.

Confidence limits for F̂_a(x_k) are produced by assuming a Normal distribution.  These limits
may be used to construct either a lower confidence bound, an upper confidence bound, or a
confidence interval for F̂_a(x_k).  Computation of these limits requires an estimated variance
of F̂_a(x_k), which is not provided in this method.  Details for computing a suitable estimated
variance of F̂_a(x_k) are found in other methods referenced in Section 7.

The output consists of the estimated cumulative distribution function values with either a one-
sided confidence bound (upper or lower) or a confidence interval for F̂_a(x_k).

3  Conditions Under Which This Method Applies

•   Probability sample with known inclusion probabilities
•   Discrete resource
•   Arbitrary subpopulation
•   All units sampled from the subpopulation must be accounted for before applying this
    method
•   When the indicator value is missing, exclude this missing value and the corresponding
    inclusion probability and size-weight; use the Horvitz-Thompson ratio estimator

 4  Required Elements

4.1 Input Data

y_i  = value of the indicator for the ith unit sampled from subpopulation a.
π_i  = inclusion probability for selecting the ith unit of subpopulation a.
w_i  = size-weight value for the ith unit sampled from subpopulation a.

4.2 Additional Components

n_a  = number of units sampled from subpopulation a.
x_k  = kth indicator level of interest.
W_a  = subpopulation size (size-weighted total), if known.



4.3  Graphical Display Considerations

Two issues should be resolved before graphing the CDF:  1) how many points to use and 2)
what are the first and last points on the plot.  The following are guidelines for the three
alternatives mentioned in Section 2.  In all three approaches, the plotted points are connected
by line segments.  Confidence limits are not to exceed one or to drop  below zero.  The
sample y is understood to be in ascending order for this discussion.

If the empirical CDF is chosen, the number of points plotted is at most n_a+2.  The first
plotted point is (0,0) when the indicator takes on only positive values.  Otherwise, choose a
point smaller than y_1 as the abscissa and assign zero as the ordinate.  Choose a point larger
than y_(n_a) and assign 1 as its ordinate.  Where there is more than one occurrence of an
indicator level in the data set, plot only one point using the largest cumulative distribution
function value associated with this level as the ordinate.

If the midpoints of adjacent values in y are used for the levels x, at most n_a+1 points are
plotted.  To determine the first plotted point, calculate the distance between y_1 and y_2.  Take
half this distance and subtract it from y_1 to obtain the abscissa.  If this abscissa is a negative
number and the indicator can never be negative, instead assign zero as the abscissa.  Use zero
as the ordinate.  Similarly, to determine the last plotted point, calculate the distance between
the two largest y values, y_(n_a-1) and y_(n_a).  Halve this distance, add it to y_(n_a), and plot this abscissa
using 1 as the ordinate.
The recommended approach uses equally spaced levels across the potential range of the
indicator.  The levels used should be potential real values that the indicator could attain.  In
this case of discrete data, integer values  should be used.  As mentioned previously, ideally
this range is known  a priori and extends beyond the range of any particular data set. If an
informed guess cannot be made for this range, one suggestion is to use the
midpoint approach for obtaining the first and last plot points as explained in the previous
paragraph.  How many points to use is a subjective decision that should take into account the
chosen range and the size of the data set; sometimes the data distribution itself must be
examined.  The following suggestions are given to help decide how many points to use.

In most cases, using the same number of points as used in the empirical distribution, na+2
points, will be sufficient for plotting the  CDF.  Extreme outliers in a particular data  set may
have a great influence on the graph.  In  this case, more points may be needed to achieve
greater resolution within the body of the data. In the case of large data sets, plotting less than
na+2 points should be adequate.  Begin by using 100 points for these larger data sets. The
range of the indicator will have a part in determining if this is an adequate number of points.
Trying the plots with differing numbers of points may be  useful to see if the graph changes
significantly.




The y-axis (CDF) should range in values from zero to  1. This method may result in
confidence limits which drop below zero or exceed 1.  These limits should not appear on the
plot.  Instead, truncate the plotted upper limit at 1  and  the plotted lower limit at zero.


5  Formulas and Definitions

The estimated size-weighted CDF (proportion) for indicator value x_k in subpopulation a,
F̂_a(x_k), with known subpopulation size, W_a; Horvitz-Thompson estimator is

    \hat{F}_a(x_k) = \frac{1}{W_a} \sum_{i=1}^{n_a} \frac{w_i \, I(y_i \le x_k)}{\pi_i}

The estimated size-weighted CDF (proportion) for indicator value x_k in subpopulation a,
F̂_a(x_k), with estimated subpopulation size, Ŵ_a; Horvitz-Thompson ratio estimator is

    \hat{F}_a(x_k) = \frac{1}{\hat{W}_a} \sum_{i=1}^{n_a} \frac{w_i \, I(y_i \le x_k)}{\pi_i} , \qquad \hat{W}_a = \sum_{i=1}^{n_a} \frac{w_i}{\pi_i}

The one-sided 100(1-α)% upper confidence bound, B_U(x_k), is

    B_U(x_k) = \hat{F}_a(x_k) + z_{\alpha} \sqrt{\hat{V}[\hat{F}_a(x_k)]}

The one-sided 100(1-α)% lower confidence bound, B_L(x_k), is

    B_L(x_k) = \hat{F}_a(x_k) - z_{\alpha} \sqrt{\hat{V}[\hat{F}_a(x_k)]}

The two-sided 100(1-α)% confidence interval, C(x_k), is

    C(x_k) = \left( \hat{F}_a(x_k) - z_{\alpha/2} \sqrt{\hat{V}[\hat{F}_a(x_k)]} ,\; \hat{F}_a(x_k) + z_{\alpha/2} \sqrt{\hat{V}[\hat{F}_a(x_k)]} \right)

For these equations:

V̂[F̂_a(x_k)] = estimated variance of the estimated size-weighted CDF (proportion) for
              indicator value x_k in subpopulation a.
I(y_i ≤ x_k) = 1 if y_i ≤ x_k, and 0 otherwise.
x_k  = kth indicator level of interest.
y_i  = value of the indicator for the ith unit sampled from subpopulation a.
π_i  = inclusion probability for selecting the ith unit of subpopulation a.
w_i  = size-weight value for the ith unit sampled from subpopulation a.
n_a  = number of units sampled from subpopulation a.
z_α  = z-score from the standard Normal distribution.
α    = level of significance.
6  Procedure

6.1    Enter Data

Input the sample data consisting of the indicator values, y_i, their associated inclusion
probabilities, π_i, and their size-weights, w_i.  For example,

    Calcium    Inclusion        Lake
    y_i        Probability π_i  Area w_i
    1.5992     .07734           24.249
    2.3707     .00375           92.251
    1.5992     .75000           28.018
    2.0000     .75000           52.953
    7.0000     .00375           362.254
    2.8196     .02227           140.671
    1.2204     .01406           7.758
    1.5992     .03750           29.702
    2.9399     .00586           149.276
    .7395      .00375           1.081

6.2    Sort Data

Sort the sample data in nondecreasing order based on the y_i indicator values.  Keep all
occurrences of an indicator value to obtain correct results.

    Calcium    Inclusion        Lake
    y_i        Probability π_i  Area w_i
    .7395      .00375           1.081
    1.2204     .01406           7.758
    1.5992     .07734           24.249
    1.5992     .75000           28.018
    1.5992     .03750           29.702
    2.0000     .75000           52.953
    2.3707     .00375           92.251
    2.8196     .02227           140.671
    2.9399     .00586           149.276
    7.0000     .00375           362.254
6.3    Obtain Subpopulation Size (Size-Weighted Total)

Input W_a if using a known subpopulation size.  W_a = 156000 for this data set.

Calculate Ŵ_a from the sample data only if using the Horvitz-Thompson ratio estimator.
Divide each w_i by the inclusion probability, π_i, for all units in the sample a.  Sum each of
these quantities to obtain Ŵ_a.

Ŵ_a = (1.081/.00375) + (7.758/.01406) + (24.249/.07734) + . . . + (362.254/.00375) =
155045.265 for this data set.
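A brief Python sketch of this calculation (illustrative, not from the manual):

    w    = [1.081, 7.758, 24.249, 28.018, 29.702,
            52.953, 92.251, 140.671, 149.276, 362.254]
    pi_i = [0.00375, 0.01406, 0.07734, 0.75000, 0.03750,
            0.75000, 0.00375, 0.02227, 0.00586, 0.00375]

    # Estimated size-weighted total = sum of w_i / pi_i.
    W_hat = sum(wi / p for wi, p in zip(w, pi_i))
    print(round(W_hat, 3))   # 155045.265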

 6.4    Input Indicator Levels  of Interest

 Assign indicator  levels of interest, x, based on graphical display considerations.  Choose one
 of the three methods  previously discussed in Section 4.3.

 6.4.1 The Recommended Approach — Levels of Interest
Form an expected range of the indicator before looking at the data.  Next, examine the data
set to see if the estimated range encompasses all y values.  If not, increase the range to

encompass the  outlying y values. If there are large outliers, more points than na+2 may be
needed to retain good resolution in  the body  of the plot.  Determine evenly spaced x values
across the chosen range.

For this example, the estimated range was .5 to 9.5 mg/L.  The range does not have to be
adjusted because it includes the observed y_i values.  The point spacing interval for x, x_int =
(x_max - x_min)/(n_a - 1) = (9.5 - .5)/(10 - 1) = 9/9 = 1.0.  The 10 x values = (x_min, x_min+1.0,
x_min+2(1.0), ... ) = (.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5).  Try obtaining the cumulative
distribution function first with these x values and then again with an increased number of x
values spaced closer together.  More points across the range may be needed because all but
one of the y_i values are less than 3.0.

6.4.2 The  Empirical CDF — Levels of Interest

For the empirical CDF, x values = (.7395, 1.2204, 1.5992, 2, 2.3707, 2.8196, 2.9399, 7).
Duplicate values in the data set, 1.5992, do not have to be repeated when forming x.

6.4.3 The Midpoint Approach — Levels of Interest

Calculate the midpoints of each pair of adjacent y_i values to form x.  The first x value is
(.7395+1.2204)/2 = .9800.  In this particular data set, there are three occurrences of 1.5992.
As a result, there are two midpoints of 1.5992.  Regardless of how many times a midpoint is
repeated, include it only once in x.  The x values = (.9800, 1.4098, 1.5992, 1.7996, 2.1854,
2.5952, 2.8798, 4.9700).

6.5    Compute Cumulative Distribution Function Values

Calculate F̂_a(x_k) for each element in x using the formulas from Section 5.

To calculate F̂_a(x_1), compare each y_i to x_1.  If y_i is less than or equal to x_1, then w_i/π_i is
added to the computation of F̂_a(x_1) until y_i exceeds x_1 (when using sorted data).  Divide the
cumulative sum of these w_i/π_i's by W_a or Ŵ_a (depending on the estimator used) to obtain
F̂_a(x_1).

Similarly, to calculate F̂_a(x_2), compare each y_i to x_2, add the w_i/π_i's until y_i exceeds x_2, and
then divide this sum by W_a or Ŵ_a.

Do this for every value in x.

Below is an example of obtaining the cumulative sum for each F̂_a(x_k).  Complete results
for the example data are in Section 6.7.
    Calcium    Inclusion        Lake      Indicator Level    Cumulative Sum for F̂_a(x_k)
    y_i        Probability π_i  Area w_i  of Interest x_k
    .7395      .00375           1.081     .7395              1.081/.00375
    1.2204     .01406           7.758     1.2204             1.081/.00375 + 7.758/.01406
    1.5992     .07734           24.249    1.5992             1.081/.00375 + 7.758/.01406 + 24.249/.07734 + 28.018/.75000 + 29.702/.03750
    1.5992     .75000           28.018
    1.5992     .03750           29.702
    2.0000     .75000           52.953    2.0000             1.081/.00375 + 7.758/.01406 + 24.249/.07734 + 28.018/.75000 + 29.702/.03750 + 52.953/.75000
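A minimal Python sketch of the size-weighted ratio-estimator calculation just described (illustrative only; variable names are not from the manual):

    y    = [0.7395, 1.2204, 1.5992, 1.5992, 1.5992, 2.0000,
            2.3707, 2.8196, 2.9399, 7.0000]
    pi_i = [0.00375, 0.01406, 0.07734, 0.75000, 0.03750, 0.75000,
            0.00375, 0.02227, 0.00586, 0.00375]
    w    = [1.081, 7.758, 24.249, 28.018, 29.702,
            52.953, 92.251, 140.671, 149.276, 362.254]

    # Estimated size-weighted subpopulation total.
    W_hat = sum(wi / p for wi, p in zip(w, pi_i))

    def cdf_size_weighted(xk):
        # Cumulative sum of w_i/pi_i for units with y_i <= xk, divided by W_hat.
        return sum(wi / p for yi, wi, p in zip(y, w, pi_i) if yi <= xk) / W_hat

    print(round(cdf_size_weighted(2.5), 4))   # approximately .1719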
6.6    Compute Confidence Limits

Calculate the confidence bound (upper or lower) or confidence interval for each F̂_a(x_k)
using the formulas from Section 5.

Estimate the variance of F̂_a(x_k) using an applicable method listed in Section 7.  Next, take
the square root of the variance and multiply this square root by the z-score from the standard
Normal distribution corresponding to the desired confidence level.
Add this quantity to F̂_a(x_k) to obtain the upper bound, B_U(x_k).  Subtract this quantity from
F̂_a(x_k) to obtain the lower bound, B_L(x_k).  For the confidence interval, obtain both B_L(x_k)
and B_U(x_k).  For example, 1.645 would be the z_α for a one-sided 95% upper or lower
confidence bound, and the z_α/2 for a two-sided 90% confidence interval.  A two-sided 95%
confidence interval would use 1.96 for z_α/2.

6.7    Output Results

Output the indicator levels of interest, the associated size-weighted CDF value, and either a
confidence bound (upper or lower) or a confidence interval for F̂_a(x_k).  If the output
generated will be used for graphing the CDF, append the first and last graph points to this
output as directed for the three methods below.  The tables in Sections 6.7.1 - 6.7.3 contain
results for the ratio estimator applied to the example data.  A hypothetical variance is used in
confidence bound and interval calculations.

When upper bounds exceed 1, they are set equal to 1.  Lower bounds less than zero are set
equal to zero.

6.7.1   The Recommended Approach — Results

Append the point (0,0) to the output file for graphing purposes.  Since x_max, 9.5, exceeds the
maximum y_i, 7, no other points are appended.
    Calcium   Size-Weighted    Hypothetical    One-sided 95%   One-sided 95%   Two-sided 90%
              CDF, Ratio       Variance        Lower Conf.     Upper Conf.     Conf. Interval
    x_k       Estimator        V̂[F̂_a(x_k)]     Bound B_L(x_k)  Bound B_U(x_k)  C(x_k)
              F̂_a(x_k)
    0*        0*               0*              0*              0*              (0,0)*
    0.5       0                0               0               0               (0,0)
    1.5       .0054            .000032         0**             .0147           (0,.0147)
    2.5       .1719            .032777         0**             .4697           (0,.4697)
    3.5       .3769            .084169         0**             .8542           (0,.8542)
    4.5       .3769            .084169         0**             .8542           (0,.8542)
    5.5       .3769            .084169         0**             .8542           (0,.8542)
    6.5       .3769            .084169         0**             .8542           (0,.8542)
    7.5       1                0               1               1               (1,1)
    8.5       1                0               1               1               (1,1)
    9.5       1                0               1               1               (1,1)

    *appended
    **set to 0

6.7.2   The Empirical CDF — Results

Append the point (0,0) to the output file for graphing purposes.  Append also a point slightly
larger than the largest x value and assign an ordinate of 1. For this example, the point (7.5,1)
is appended.
    Calcium   Size-Weighted    Hypothetical    One-sided 95%   One-sided 95%   Two-sided 90%
              CDF, Ratio       Variance        Lower Conf.     Upper Conf.     Conf. Interval
    x_k       Estimator        V̂[F̂_a(x_k)]     Bound B_L(x_k)  Bound B_U(x_k)  C(x_k)
              F̂_a(x_k)
    0*        0*               0*              0*              0*              (0,0)*
    0.7395    .0019            .000006         0**             .0057           (0,.0057)
    1.2204    .0054            .000032         0**             .0147           (0,.0147)
    1.5992    .0128            .000129         0**             .0314           (0,.0314)
    2.0000    .0132            .000134         0**             .0323           (0,.0323)
    2.3707    .1719            .032777         0**             .4697           (0,.4697)
    2.8196    .2127            .039204         0**             .5383           (0,.5383)
    2.9399    .3769            .084169         0**             .8542           (0,.8542)
    7.0000    1                0               1               1               (1,1)
    7.5000*   1*               0*              1*              1*              (1,1)*

    *appended
    **set to 0

6.7.3  The Midpoint Approach — Results

Determine the first plotted point by calculating the distance between the first two y_i values,
.7395 and 1.2204.  Take half this distance and subtract it from .7395 to obtain .7395 -
[(1.2204 - .7395)/2] = .4991.  Append (.4991,0) to the output as the first plotted point.  If a
negative number were obtained and the indicator can never be negative, append (0,0) as the
first plotted point.  Similarly, to determine the last plotted point, calculate the distance
between the two largest y_i values, 2.9399 and 7.  Take half this distance and add it to 7 to
obtain 7 + [(7 - 2.9399)/2] = 9.0301.  Because the distance between these last two y_i values is
relatively large, choosing the last point slightly above 7 with an ordinate of 1 may be
preferable to appending (9.0301,1) to the output.  For this example, (7.5,1) was appended.
    Calcium   Size-Weighted    Hypothetical    One-sided 95%   One-sided 95%   Two-sided 90%
              CDF, Ratio       Variance        Lower Conf.     Upper Conf.     Conf. Interval
    x_k       Estimator        V̂[F̂_a(x_k)]     Bound B_L(x_k)  Bound B_U(x_k)  C(x_k)
              F̂_a(x_k)
    .4991*    0*               0*              0*              0*              (0,0)*
    .9800     .0019            .000006         0**             .0057           (0,.0057)
    1.4098    .0054            .000032         0**             .0147           (0,.0147)
    1.5992    .0128            .000129         0**             .0314           (0,.0314)
    1.7996    .0128            .000129         0**             .0314           (0,.0314)
    2.1854    .0132            .000134         0**             .0323           (0,.0323)
    2.5952    .1719            .032777         0**             .4697           (0,.4697)
    2.8798    .2127            .039204         0**             .5383           (0,.5383)
    4.9700    .3769            .084169         0**             .8542           (0,.8542)
    7.5000*   1*               0*              1*              1*              (1,1)*

    *appended
    **set to 0
7  Associated Methods

An appropriate variance estimator for this estimated size-weighted CDF for discrete resources
may be found in Method 7 (Horvitz-Thompson Variance Estimator).

8  Validation Data

Actual data with results, EMAP Design and Statistics Dataset #3, are available for comparing
results from other versions of these algorithms.

9  Notes

The method which uses the ratio estimator may perform better under certain conditions and
may be used even if the subpopulation size is known.  Sampling done with variable
probability and a variable sample size, n_a, are two of these conditions.  The ratio estimator
retains its stability in these cases, which can be seen by comparing the two equations.
The ratio estimator tends to have smaller variance than the other estimator because the
numerator and denominator tend to be positively correlated.  The estimator using the known
subpopulation size does not compensate for variability in the numerator.

The ratio estimator should be used in the case of missing data.  The estimated CDF applies
only to the subpopulation for which data were obtained.  Because the size of this
subpopulation is not known, it must be estimated.  Therefore, the ratio estimator is the only
alternative for estimating the CDF.  All graphs should be labeled as applying only to the
population that was sampled and not to the original target population.

10  References

U.S. Environmental Protection Agency (EPA).  1993.  Surface waters 1991 pilot report.
   EPA/620/R-93/003.  Washington, DC: U.S. Environmental Protection Agency.

Lesser, V. M., and W. S. Overton.  1994.  EMAP status estimation: Statistical procedures
   and algorithms.  EPA/620/R-94/008.  Washington, DC: U.S. Environmental Protection
   Agency.

Overton, W. S.  1987.  A sampling and analysis plan for streams in the National Surface
   Water Survey.  Technical Report 117.  Corvallis, OR: Oregon State University,
   Department of Statistics.



ESTIMATION METHOD 4:  Estimation of the Size-Weighted Cumulative Distribution
Function for Total of a Discrete Resource;  Horvitz-Thompson Estimator, Normal
Approximation

1  Scope and Application

This method calculates the estimate of the size-weighted cumulative distribution function
(CDF) for  the total of a discrete resource that has an indicator value equal to or less than a
given indicator level.  The size-weight value is a measurement of the discrete resource such
as area of a lake.  The method  applies to any probability sample and presents two estimators.
An estimate can be produced for the entire population or for an arbitrary subpopulation with
known  or unknown size, where this size is the size-weighted total in the subpopulation.
Suggestions for estimating the CDF over the range  of the indicator  are included.
Alternatively, the CDF can be calculated at the indicator levels found in the probability
sample.  The method uses the Normal approximation to provide confidence bounds or
intervals for the true cumulative distribution function.  This method does not include variance
estimators  for the estimated CDF.  For information  on appropriate variance estimators, refer
to Section  7.

This method has been applied in:

       The 1991 Surface Waters Pilot Report

2  Statistical Estimation Overview

A sample of size n_a units is selected from subpopulation a with known inclusion probabilities
π = {π_1, ..., π_i, ..., π_na} and size-weight values w = {w_1, ..., w_i, ..., w_na}.  The indicator is
evaluated for each unit and represented by y = {y_1, ..., y_i, ..., y_na}.

Estimates of the cumulative distribution function are obtained for the indicator levels of
interest, x = {x_1, ..., x_k, ..., x_m}.  Several alternatives are available for choosing x.  The
recommended alternative is the use of equally spaced values across the range of the indicator.
Ideally, this range is known a priori and extends beyond the range of any particular data set.
A second alternative is to use the set of unique values in the data set.  This alternative gives
the classical empirical cumulative distribution function.  A third alternative is to use the
midpoints of adjacent ordered values in y for the levels x.

To obtain the estimated size-weighted cumulative distribution function, F̂_a(x_k), the Horvitz-
Thompson estimator of a cumulative total is calculated for each x_k by summing up the
number of indicators which are less than or equal to the x_k value, weighted by the size-weight
values w_i.  Alternatively, when the subpopulation size (size-weighted total) is known, first
form the Horvitz-Thompson ratio estimator by dividing this cumulative total by the estimated
subpopulation size, Ŵ_a, and then multiply this ratio by the known subpopulation size, W_a,
to obtain F̂_a(x_k).

The Horvitz-Thompson ratio estimator may perform better than the estimator that does not
use the known subpopulation size, W_a.  Some of the conditions under which this ratio
estimator is recommended are given in Section 9.  This ratio estimator cannot be used in the
case of missing data.

Confidence limits for F̂_a(x_k) are produced by assuming a Normal distribution.  These limits
may be used to construct either a lower confidence bound, an upper confidence bound, or a
confidence interval for F̂_a(x_k).  Computation of these limits requires an estimated variance
of F̂_a(x_k), which is not provided in this method.  Details for computing a suitable estimated
variance of F̂_a(x_k) are found in other methods referenced in Section 7.

The output consists of the estimated cumulative distribution function values with either a one-
sided confidence bound (upper or lower) or a confidence interval for F̂_a(x_k).

3  Conditions Under Which This Method Applies

•   Probability sample with known inclusion probabilities
•   Discrete resource
•   Arbitrary subpopulation
•   All units sampled from the  subpopulation must be accounted for before applying this
    method
•   When the indicator value is missing, exclude this missing value and  the corresponding
    inclusion probability and size-weight;  use the Horvitz-Thompson estimator of a total

4  Required Elements

4.1 Input Data

y_i  = value of the indicator for the ith unit sampled from subpopulation a.
π_i  = inclusion probability for selecting the ith unit of subpopulation a.
w_i  = size-weight value for the ith unit sampled from subpopulation a.

4.2 Additional Components

n_a  = number of units sampled from subpopulation a.
x_k  = kth indicator level of interest.
W_a  = subpopulation size (size-weighted total), if known.

4.3  Graphical Display Considerations

Two issues should be resolved before graphing the CDF:  1) how many points to use and 2)
what are the first and last points on the plot. The following are guidelines for the three
alternatives mentioned in Section 2.  In all three approaches,  the plotted points are connected
by line segments.  The sample y is understood to be in ascending order for this discussion.

If the empirical CDF is chosen, the number of points plotted is at most n_a+2.  The first
plotted point is (0,0) when the indicator takes on only positive values.  Otherwise, choose a
point smaller than y_1 as the abscissa and assign zero as the ordinate.  Where there is more
than one occurrence of an indicator level in the data set, plot only one point using the largest
cumulative distribution function value associated with this level as the ordinate.

If the midpoints of adjacent values in y are used for the levels x, at most n_a+1 points are
plotted.  To determine the first plotted point, calculate the distance between y_1 and y_2.  Take
half this distance and subtract it from y_1 to obtain the abscissa.  If this abscissa is a negative
number and the indicator can never be negative, instead assign zero as the abscissa.  Use zero
as the ordinate.  Similarly, to determine the last plotted point, calculate the distance between
the two largest y values, y_(n_a-1) and y_(n_a).  Halve this distance, add it to y_(n_a), and plot this abscissa
using the cumulative size-weighted total associated with y_(n_a) as its ordinate.


The recommended approach uses equally spaced levels across the potential range of the
indicator.  The levels used should be potential real values that the indicator could attain.  In
the case of discrete data, integer values should be used.  As mentioned previously, ideally
this range is known a priori and extends beyond the range of any particular data set.  If an
informed guess cannot be made for this range, one suggestion is to use the
midpoint approach for obtaining the first and last plot points as explained in the previous
paragraph.  How many points to use is a subjective decision that should take into account the
chosen range and the size of the data set; sometimes the data distribution itself must be
examined.  The following suggestions are given to help decide how many points to use.

In most cases, using the same number of points as used in the empirical distribution, n_a+2
points, will be sufficient for plotting the CDF.  Extreme outliers in a particular data set may
have a great influence on the graph.  In this case, more points may be needed to achieve
greater resolution within the body of the data.  In the case of large data sets, plotting fewer than
n_a+2 points should be adequate.  Begin by using 100 points for these larger data sets.  The
range of the indicator will have a part in determining if this is an adequate number of points.
Trying the plots with differing numbers of points may be useful to see if the graph changes
significantly.

The y-axis (CDF) should range in values from zero to either the known or estimated
subpopulation size, depending on the estimator used.  This size will be the cumulative
size-weighted total associated with y_(n_a).  This method may result in confidence limits which
drop below zero or exceed the applicable subpopulation size.  These limits should not appear
on the plot.  Instead, truncate the plotted upper limit at F̂_a(y_(n_a)).  Truncate the plotted lower
limit at zero.


5  Formulas and Definitions

The estimated size-weighted CDF (total) for indicator value x_k in subpopulation a, F̂_a(x_k);
Horvitz-Thompson estimator of a total is

    \hat{F}_a(x_k) = \sum_{i=1}^{n_a} \frac{w_i \, I(y_i \le x_k)}{\pi_i}

The estimated size-weighted CDF (total) for indicator value x_k in subpopulation a, F̂_a(x_k),
with known subpopulation size, W_a, and estimated subpopulation size, Ŵ_a; Horvitz-
Thompson ratio estimator is

    \hat{F}_a(x_k) = \frac{W_a}{\hat{W}_a} \sum_{i=1}^{n_a} \frac{w_i \, I(y_i \le x_k)}{\pi_i} , \qquad \hat{W}_a = \sum_{i=1}^{n_a} \frac{w_i}{\pi_i}

The one-sided 100(1-α)% upper confidence bound, B_U(x_k), is

    B_U(x_k) = \hat{F}_a(x_k) + z_{\alpha} \sqrt{\hat{V}[\hat{F}_a(x_k)]}

The one-sided 100(1-α)% lower confidence bound, B_L(x_k), is

    B_L(x_k) = \hat{F}_a(x_k) - z_{\alpha} \sqrt{\hat{V}[\hat{F}_a(x_k)]}

The two-sided 100(1-α)% confidence interval, C(x_k), is

    C(x_k) = \left( \hat{F}_a(x_k) - z_{\alpha/2} \sqrt{\hat{V}[\hat{F}_a(x_k)]} ,\; \hat{F}_a(x_k) + z_{\alpha/2} \sqrt{\hat{V}[\hat{F}_a(x_k)]} \right)

For these equations:

V̂[F̂_a(x_k)] = estimated variance of the estimated size-weighted CDF (total) for indicator
              value x_k in subpopulation a.
I(y_i ≤ x_k) = 1 if y_i ≤ x_k, and 0 otherwise.
x_k  = kth indicator level of interest.
y_i  = value of the indicator for the ith unit sampled from subpopulation a.
π_i  = inclusion probability for selecting the ith unit of subpopulation a.
w_i  = size-weight value for the ith unit sampled from subpopulation a.
n_a  = number of units sampled from subpopulation a.
z_α  = z-score from the standard Normal distribution.
α    = level of significance.
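To make the two estimators concrete, a minimal Python sketch follows (illustrative only); W_a = 156000 is the known size-weighted total used in the example of Section 6.3.

    y    = [0.7395, 1.2204, 1.5992, 1.5992, 1.5992, 2.0000,
            2.3707, 2.8196, 2.9399, 7.0000]
    pi_i = [0.00375, 0.01406, 0.07734, 0.75000, 0.03750, 0.75000,
            0.00375, 0.02227, 0.00586, 0.00375]
    w    = [1.081, 7.758, 24.249, 28.018, 29.702,
            52.953, 92.251, 140.671, 149.276, 362.254]
    W_a   = 156000.0                                  # known size-weighted total (example)
    W_hat = sum(wi / p for wi, p in zip(w, pi_i))     # estimated total, about 155045.265

    def cdf_size_weighted_total(xk, ratio=True):
        # Horvitz-Thompson size-weighted cumulative total; the ratio form rescales by W_a/W_hat.
        total = sum(wi / p for yi, wi, p in zip(y, w, pi_i) if yi <= xk)
        return total * W_a / W_hat if ratio else total

    print(round(cdf_size_weighted_total(2.5), 1))     # ratio estimate at x = 2.5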

6  Procedure
6.1    Enter Data

Input the sample data consisting of the indicator values, y_i, their associated inclusion
probabilities, π_i, and their size-weights, w_i.  For example,
    Calcium    Inclusion        Lake
    y_i        Probability π_i  Area w_i
    1.5992     .07734           24.249
    2.3707     .00375           92.251
    1.5992     .75000           28.018
    2.0000     .75000           52.953
    7.0000     .00375           362.254
    2.8196     .02227           140.671
    1.2204     .01406           7.758
    1.5992     .03750           29.702
    2.9399     .00586           149.276
    .7395      .00375           1.081

6.2    Sort Data

Sort the sample data in nondecreasing order based on the y_i indicator values.  Keep all
occurrences of an indicator value to obtain correct results.

    Calcium    Inclusion        Lake
    y_i        Probability π_i  Area w_i
    .7395      .00375           1.081
    1.2204     .01406           7.758
    1.5992     .07734           24.249
    1.5992     .75000           28.018
    1.5992     .03750           29.702
    2.0000     .75000           52.953
    2.3707     .00375           92.251
    2.8196     .02227           140.671
    2.9399     .00586           149.276
    7.0000     .00375           362.254
6.3    Obtain Subpopulation Size (Size-Weighted Total)

If using the Horvitz-Thompson ratio estimator, input W_a and calculate Ŵ_a from the sample
data.  Divide each w_i by its inclusion probability, π_i, for all units in the sample from a.  Sum
these quotients to obtain Ŵ_a.

Ŵ_a = (1.081/.00375) + (7.758/.01406) + (24.249/.07734) + . . . + (362.254/.00375) =
155045.265, and W_a = 156000 for this data set.
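
As a check on this step, the sum can be computed directly.  The short sketch below is an
illustration only (not part of the original manual; variable names are assumptions) and
reproduces Ŵ_a for the example data.

    # Illustrative sketch: estimated subpopulation size W_hat_a = sum(w_i / pi_i).
    import numpy as np

    w = np.array([1.081, 7.758, 24.249, 28.018, 29.702,
                  52.953, 92.251, 140.671, 149.276, 362.254])   # size-weights w_i
    pi = np.array([.00375, .01406, .07734, .75000, .03750,
                   .75000, .00375, .02227, .00586, .00375])     # inclusion probabilities pi_i

    W_hat = np.sum(w / pi)      # estimated size-weighted total, about 155045.27
    W_known = 156000.0          # known subpopulation size W_a for this data set
    print(W_hat, W_known)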

6.4    Input Indicator Levels of Interest

Assign indicator levels of interest, x, based on graphical display considerations.  Choose one
of the three methods previously discussed in Section 4.3.

6.4.1  The Recommended Approach — Levels of Interest

Form an expected range of the indicator before looking  at the data.  Next, examine the data
set to see if the estimated range encompasses all y values. If not, increase the range to
encompass the outlying y values.  If there are large outliers, more points than n_a+2 may be
needed to retain good resolution in the body of the plot.  Determine evenly spaced x values
across the chosen range.

For this example, the estimated range was .5 to 9.5 mg/L.  The range does not have to be
adjusted because it includes the observed y_i values.  The point spacing interval for x is
x_int = (x_max − x_min)/(K − 1) = (9.5 − .5)/(10 − 1) = 9/9 = 1.0.  The 10 x values = (x_min, x_min+1.0,
x_min+2(1.0), ... ) = (.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5).  Try obtaining the cumulative
distribution function first with these x values and then again with an increased number of x
values spaced closer together.  More points across the range may be needed because all but
one of the y_i values are less than 3.0.

6.4.2 The Empirical CDF — Levels of Interest

For the empirical CDF, x values = (.7395, 1.2204, 1.5992, 2, 2.3707, 2.8196, 2.9399, 7).
Duplicate values in the data set, 1.5992, do not have to be repeated when forming x.

6.4.3 The Midpoint Approach — Levels  of Interest

Calculate the midpoints of each pair of adjacent y_i values to form x.  The first x value is
(.7395 + 1.2204)/2 = .9800.  In this particular data set, there are three occurrences of 1.5992.
As a result, there are two midpoints of 1.5992.  Regardless of how many times a midpoint is
repeated, include it only once in x.  The x values = (.9800, 1.4098, 1.5992, 1.7996, 2.1854,
2.5952, 2.8798, 4.9700).
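
The three ways of forming x described in Sections 6.4.1 - 6.4.3 are straightforward to script.
The fragment below is an illustration only (not part of the original manual; array names and
the use of NumPy are assumptions).

    # Illustrative sketch: forming the indicator levels of interest x three ways.
    import numpy as np

    y = np.array([.7395, 1.2204, 1.5992, 1.5992, 1.5992,
                  2.0000, 2.3707, 2.8196, 2.9399, 7.0000])   # sorted indicator values

    # Recommended approach: K evenly spaced levels over an expected range (.5 to 9.5 mg/L here).
    x_recommended = np.linspace(0.5, 9.5, 10)

    # Empirical CDF approach: the distinct observed values.
    x_empirical = np.unique(y)

    # Midpoint approach: midpoints of adjacent observed values, duplicates removed.
    x_midpoint = np.unique((y[:-1] + y[1:]) / 2.0)

    print(x_recommended, x_empirical, x_midpoint, sep="\n")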

6.5    Compute Cumulative Distribution Function Values

Calculate F̂_a(x_k) for each element in x using the formulas from Section 5.

To calculate F̂_a(x_1), compare each y_i to x_1.  If y_i is less than or equal to x_1, then w_i/π_i is
added to the computation of F̂_a(x_1), until y_i exceeds x_1 (when using sorted data).  Multiply
the cumulative sum of these w_i/π_i's by W_a/Ŵ_a to obtain F̂_a(x_1), if using the Horvitz-
Thompson ratio estimator.  Otherwise, this cumulative sum is F̂_a(x_1), if using the Horvitz-
Thompson estimator of a total.

Similarly, to calculate F̂_a(x_2), compare each y_i to x_2, add the w_i/π_i's until y_i exceeds x_2, and
then multiply this sum by W_a/Ŵ_a if applicable.

Do this for every value in x.
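
A compact way to carry out this accumulation is to sum w_i/π_i over the units with y_i ≤ x_k for
each level.  The sketch below is an illustration only (not part of the original manual; the
function name and arguments are assumptions) and computes both the Horvitz-Thompson and
ratio forms.

    # Illustrative sketch: size-weighted CDF (total), Horvitz-Thompson and ratio forms.
    import numpy as np

    def sw_cdf_total(x_levels, y, pi, w, W_known=None):
        """Return F_hat_a(x_k) for each level x_k.

        y, pi, w : indicator values, inclusion probabilities, size-weights
        W_known  : known subpopulation size W_a; if given, the ratio estimator
                   multiplies the Horvitz-Thompson total by W_a / W_hat_a.
        """
        y, pi, w = map(np.asarray, (y, pi, w))
        x_levels = np.asarray(x_levels)
        # Horvitz-Thompson total of w_i * I(y_i <= x_k) / pi_i for each x_k.
        cdf = np.array([np.sum((w / pi) * (y <= xk)) for xk in x_levels])
        if W_known is not None:                       # ratio estimator
            cdf *= W_known / np.sum(w / pi)
        return cdf

    y  = [.7395, 1.2204, 1.5992, 1.5992, 1.5992, 2.0, 2.3707, 2.8196, 2.9399, 7.0]
    pi = [.00375, .01406, .07734, .75, .0375, .75, .00375, .02227, .00586, .00375]
    w  = [1.081, 7.758, 24.249, 28.018, 29.702, 52.953, 92.251, 140.671, 149.276, 362.254]
    x  = np.unique(y)                                 # empirical levels of interest
    print(sw_cdf_total(x, y, pi, w, W_known=156000))  # about 290, 845, 1995, ..., 156000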

Below is an example of obtaining the cumulative sum for each F̂_a(x_k) from the sorted data.
Complete results for the example data are in Section 6.7.

    Indicator Level       Cumulative Sum for F̂_a(x_k)
    of Interest x_k
    .7395                 1.081/.00375
    1.2204                1.081/.00375 + 7.758/.01406
    1.5992                1.081/.00375 + 7.758/.01406 + 24.249/.07734
                            + 28.018/.75000 + 29.702/.03750
    2.0000                1.081/.00375 + 7.758/.01406 + 24.249/.07734
                            + 28.018/.75000 + 29.702/.03750 + 52.953/.75000
6.6    Compute Confidence Limits

Calculate the confidence bound (upper or lower) or confidence interval for each F̂_a(x_k)
using the formulas from Section 5.

Estimate the variance of F̂_a(x_k) using an applicable method listed in Section 7.  Next, take
the square root of the variance and multiply this square root by the
z-score from the standard Normal distribution corresponding to the desired confidence level.
Add this quantity to F̂_a(x_k) to obtain the upper bound, B_U(x_k).  Subtract this quantity from
F̂_a(x_k) to obtain the lower bound, B_L(x_k).  For the confidence interval, obtain both B_L(x_k)
and B_U(x_k).  For example, 1.645 would be the z_α for a one-sided 95% upper or lower
confidence bound, and the z_{α/2} for a two-sided 90% confidence interval.  A two-sided 95%
confidence interval would use 1.96 for z_{α/2}.
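
The bound calculation is a one-line adjustment of the CDF estimate once the variance is
available.  The sketch below is an illustration only (not part of the original manual; the
function name is an assumption, and scipy.stats.norm is used for the z-score); it applies the
truncation at zero and at the subpopulation size described earlier.

    # Illustrative sketch: confidence bounds for the size-weighted CDF (total).
    import numpy as np
    from scipy.stats import norm

    def confidence_limits(cdf, var, level=0.95, two_sided=False, size_total=None):
        """Return (lower, upper) limits for F_hat_a(x_k) given estimated variances."""
        cdf, var = np.asarray(cdf, float), np.asarray(var, float)
        alpha = 1.0 - level
        z = norm.ppf(1.0 - alpha / 2.0) if two_sided else norm.ppf(1.0 - alpha)
        half = z * np.sqrt(var)
        lower = np.clip(cdf - half, 0.0, None)        # plotted lower limit truncated at zero
        upper = cdf + half
        if size_total is not None:                    # truncate at the subpopulation size W_a
            upper = np.minimum(upper, size_total)
        return lower, upper

    cdf = np.array([290., 845., 1995., 2066., 26818., 33174., 58804., 156000.])
    var = np.array([133932., 775618., 3128985., 3270403.,
                    797663932., 954063204., 2048346898., 0.])
    print(confidence_limits(cdf, var, level=0.95, size_total=156000.0))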

6.7    Output Results

Output the indicator levels of interest, the associated size-weighted CDF value, and either a
confidence bound (upper or lower) or a confidence interval for F̂_a(x_k).  If the output
generated will be used for graphing the CDF, append the first and last graph points to this
output as directed for the three methods below.  The tables in Sections 6.7.1 - 6.7.3 contain
results for the ratio estimator applied to the example data.  A hypothetical variance is used in
the confidence bound and interval calculations.

Lower bounds less than zero are set equal to zero.

6.7.1   The Recommended Approach — Results

Append the point (0,0) to the output file for graphing purposes.  Since xmax , 9.5, exceeds the
maximum yi , 7, no other points are appended.
    Calcium    Size-Weighted    Hypothetical     One-sided 95%
      x_k      CDF, Ratio       Variance         Lower Conf.
               Estimator        V̂[F̂_a(x_k)]      Bound
               F̂_a(x_k)                          B_L(x_k)

    0*               0*                  0*            0*
    0.5              0                   0             0
    1.5            845              775618             0**
    2.5          26818           797663932             0**
    3.5          58804          2048346898             0**
    4.5          58804          2048346898             0**
    5.5          58804          2048346898             0**
    6.5          58804          2048346898             0**
    7.5         156000                   0        156000
    8.5         156000                   0        156000
    9.5         156000                   0        156000

    *appended
    **set to 0
6.7.2   The Empirical CDF — Results




Append the point (0,0) to the output file for graphing purposes.
    Calcium    Size-Weighted   Hypothetical    One-sided 95%  One-sided 95%  Two-sided 90%
      x_k      CDF, Ratio      Variance        Lower Conf.    Upper Conf.    Conf. Interval
               Estimator       V̂[F̂_a(x_k)]     Bound          Bound          C(x_k)
               F̂_a(x_k)                        B_L(x_k)       B_U(x_k)

    0*               0*                 0*          0*             0*        (0,0)*
     .7395         290            133932            0**          892         (0,892)
    1.2204         845            775618            0**         2294         (0,2294)
    1.5992        1995           3128985            0**         4905         (0,4905)
    2.0000        2066           3270403            0**         5041         (0,5041)
    2.3707       26818         797663932            0**        73278         (0,73278)
    2.8196       33174         954063204            0**        83984         (0,83984)
    2.9399       58804        2048346898            0**       133255         (0,133255)
    7.0000      156000                 0       156000         156000         (156000,156000)

    *appended
    **set to 0

-------
             EMAP Estimation Method 4, Rev. No. 0, May 1996, Page -11 of 12
6.7.3  The Midpoint Approach — Results

Determine the first plotted point by calculating the distance between the first two y_i values,
.7395 and 1.2204.  Take half this distance and subtract it from .7395 to obtain .7395 −
[(1.2204 − .7395)/2] = .4991.  Append (.4991, 0) to the output as the first plotted point.  If a
negative number were obtained and the indicator can never be negative, append (0,0) as the
first plotted point.  Similarly, to determine the last plotted point, calculate the distance
between the two largest y_i values, 2.9399 and 7.  Take half this distance and add it to 7 to
obtain 7 + [(7 − 2.9399)/2] = 9.0301.  Because the distance between these last two y_i values is
relatively large, choosing a last point slightly above 7 with an ordinate of F̂_a(7) may be
preferable to appending (9.0301, F̂_a(7)) to the output.  For this example, (7.5, 156000) was
appended.
    Calcium    Size-Weighted   Hypothetical    One-sided 95%  One-sided 95%  Two-sided 90%
      x_k      CDF, Ratio      Variance        Lower Conf.    Upper Conf.    Conf. Interval
               Estimator       V̂[F̂_a(x_k)]     Bound          Bound          C(x_k)
               F̂_a(x_k)                        B_L(x_k)       B_U(x_k)

     .4991*          0*                 0*          0*             0*        (0,0)*
     .9800         290            133932            0**          892         (0,892)
    1.4098         845            775618            0**         2294         (0,2294)
    1.5992        1995           3128985            0**         4905         (0,4905)
    1.7996        1995           3128985            0**         4905         (0,4905)
    2.1854        2066           3270403            0**         5041         (0,5041)
    2.5952       26818         797663932            0**        73278         (0,73278)
    2.8798       33174         954063204            0**        83984         (0,83984)
    4.9700       58804        2048346898            0**       133255         (0,133255)
    7.5000*     156000*                 0*     156000*        156000*        (156000,156000)*

    *appended
    **set to 0
7  Associated Methods

An appropriate variance estimator for this estimated size-weighted CDF for discrete resources
may be found in Method 8 (Horvitz-Thompson Variance Estimator).

-------
             EMAP Estimation Method 4, Rev. No. 0, May 1996, Page  12 of 12
 8  Validation Data

 Actual data with results, EMAP Design and Statistics Dataset #4, are available for comparing
 results from other  versions of these algorithms.

 9  Notes

The method that uses the ratio estimator may be used only if the subpopulation size is known
and may perform better under certain conditions.  Sampling with variable probability and with
a variable sample size, n_a, are two of these conditions.  The ratio estimator remains stable in
these cases and tends to have smaller variance than the other estimator because the numerator
and denominator tend to be positively correlated.

 In  the case of missing data, the ratio estimator cannot be used because the size of the
 subpopulation is not known.  All graphs should be labeled as applying only to the population
 that was sampled and not to the original target population.

 10 References

U.S. Environmental Protection Agency (EPA).  1993.  Surface waters 1991 pilot report.
  EPA/620/R-93/003.  Washington, DC: U.S. Environmental Protection Agency.

Lesser, V. M., and W. S. Overton.  1994.  EMAP status estimation:  Statistical procedures
  and algorithms.  EPA/620/R-94/008.  Washington, DC: U.S. Environmental Protection
  Agency.

Overton, W. S.  1987.  A sampling and analysis plan for streams in the National Surface
  Water Survey.  Technical Report 117.  Corvallis, OR:  Oregon State University, Department
  of Statistics.

-------
              EMAP Estimation Method 5, Rev. No. 0, May 1996, Page 1 of 8


ESTIMATION METHOD 5:  Estimation of Variance of the Cumulative Distribution
Function for the Proportion of a Discrete Resource;  Horvitz-Thompson Variance Estimator

1  Scope and  Application

This method calculates the estimated variance of the estimated cumulative distribution
function (CDF) for the proportion of a discrete resource that has an indicator value equal to
or less than a  given indicator level.  There are two variance estimators presented in this
method. An estimate can be produced for the entire population or  for an arbitrary
subpopulation with known or unknown size.  This size is the total number of units in the
subpopulation. The method applies to any probability sample and the variance estimate will
be produced at the supplied indicator levels of interest. This method does not include
estimators for  the CDF.  For information on CDF estimators, refer  to Section 7.

2  Statistical Estimation Overview

A sample of size na units is selected from subpopulation a with known inclusion probabilities
π = {π_1, ..., π_i, ..., π_{n_a}} and joint inclusion probabilities given by π_ij, where i ≠ j.  The indicator is
evaluated for each unit and represented by y = {y_1, ..., y_i, ..., y_{n_a}}.  The inclusion probabilities
are design dependent and should be furnished with the design points.  See Section 9 for
further discussion.

The Horvitz-Thompson variance estimator of the CDF, V̂[F̂_a(x_k)], is calculated for each
value of the indicator levels of interest, x_k.  There are two Horvitz-Thompson variance
estimators presented in this method.  The first is a variance estimator of the Horvitz-
Thompson estimator of a proportion.  The second is a variance estimator of a Horvitz-
Thompson ratio estimator.  The former estimator calculates the variance of the Horvitz-
Thompson estimator of a total and divides this variance by the known subpopulation size
squared, N_a².  The latter estimator requires as input the CDF estimates produced using the
Horvitz-Thompson ratio estimator of the CDF for proportion.

The output consists of the estimated variance values.

3  Conditions Under Which This Method Applies

•       Probability sample with known inclusion probabilities and joint inclusion probabilities
•       Discrete resource
•       Arbitrary subpopulation
•       All units sampled from the subpopulation must be accounted for before applying this
       method

-------
             EMAP Estimation Method 5, Rev. No. 0, May 1996, Page 2 of 8






4  Required Elements




4.1 Input Data




y_i  =  value of the indicator for the ith unit sampled from subpopulation a.
π_i  =  inclusion probability for selecting the ith unit of subpopulation a.
π_ij =  joint inclusion probability for selecting both the ith and jth units of subpopulation a.
F̂_a(x_k) =  estimated CDF (proportion) for indicator value x_k in subpopulation a.


4.2 Additional Components

n_a  =  number of units sampled from subpopulation a.
x_k  =  kth indicator level of interest.
N_a  =  subpopulation size, if known.




5  Formulas and Definitions




The estimated variance of the estimated CDF (proportion) for indicator value x_k in
subpopulation a, V̂[F̂_a(x_k)], with known subpopulation size, N_a; Horvitz-Thompson
variance estimator of the Horvitz-Thompson estimator of a CDF is

$$\hat{V}[\hat{F}_a(x_k)] = \frac{1}{N_a^2}\left[\sum_{i=1}^{n_a}\frac{(1-\pi_i)\,I(y_i\le x_k)}{\pi_i^2} + \sum_{i=1}^{n_a}\sum_{j\ne i}\frac{\pi_{ij}-\pi_i\pi_j}{\pi_{ij}}\,\frac{I(y_i\le x_k)}{\pi_i}\,\frac{I(y_j\le x_k)}{\pi_j}\right]$$

The estimated variance of the estimated CDF (proportion) for indicator value x_k in
subpopulation a, V̂[F̂_a(x_k)], with estimated subpopulation size, N̂_a; Horvitz-Thompson
variance estimator of the Horvitz-Thompson ratio estimator of a CDF is

$$\hat{V}[\hat{F}_a(x_k)] = \frac{1}{\hat{N}_a^2}\left[\sum_{i=1}^{n_a}\frac{(1-\pi_i)\,d_i^2}{\pi_i^2} + \sum_{i=1}^{n_a}\sum_{j\ne i}\frac{\pi_{ij}-\pi_i\pi_j}{\pi_{ij}}\,\frac{d_i}{\pi_i}\,\frac{d_j}{\pi_j}\right]$$

where d_i = I(y_i ≤ x_k) − F̂_a(x_k) and N̂_a = Σ_{i=1}^{n_a} 1/π_i.
For these equations:

F̂_a(x_k) = estimated CDF (proportion) for indicator value x_k in subpopulation a.

$$I(y_i \le x_k) = \begin{cases} 1, & y_i \le x_k \\ 0, & \text{otherwise} \end{cases}$$

x_k  = kth indicator level of interest.
y_i  = value of the indicator for the ith unit sampled from subpopulation a.
π_i  = inclusion probability for selecting the ith unit of subpopulation a.
π_ij = joint inclusion probability for selecting both the ith and jth units of subpopulation a.
n_a  = number of units sampled from subpopulation a.

6  Procedure
6.1    Enter Data

Input the sample data consisting of the indicator values, y_i, and their associated inclusion
probabilities, π_i.  For example,

    Calcium     Inclusion
      y_i       Probability
                   π_i
    1.5992        .07734
    2.3707        .00375
    1.5992        .75000
    2.0000        .75000
    7.0000        .00375
    2.8196        .02227
    1.2204        .01406
    1.5992        .03750
    2.9399        .00586
     .7395        .00375

-------
               EMAP Estimation Method 5, Rev. No. 0, May 1996, Page 4 of 8
6.2    Sort Data

Sort the sample data in nondecreasing order based on the y_i indicator values.  Keep all
occurrences of an indicator value to obtain correct results.

    Calcium     Inclusion
      y_i       Probability
                   π_i
     .7395        .00375
    1.2204        .01406
    1.5992        .07734
    1.5992        .75000
    1.5992        .03750
    2.0000        .75000
    2.3707        .00375
    2.8196        .02227
    2.9399        .00586
    7.0000        .00375
 6.3    Compute or Input Joint Inclusion Probabilities

The required joint inclusion probabilities are given in the following table.  For this example,
they were computed by the formula π_ij = 2(n_a − 1)π_i π_j / (2n_a − π_i − π_j).

Joint Inclusion Probability π_ij = π_ji ,  π_ii = π_i

  i\j      1         2         3         4         5         6         7         8         9
   2    .000047
   3    .000262   .000983
   4    .002630   .009867   .054457
   5    .000127   .000476   .002625   .026350
   6    .002630   .009867   .054457   .547297   .026350
   7    .000013   .000047   .000262   .002630   .000127   .002630
   8    .000075   .000282   .001558   .015636   .000754   .015636   .000075
   9    .000020   .000074   .000410   .004111   .000198   .004111   .000020   .000118
  10    .000013   .000047   .000262   .002630   .000127   .002630   .000013   .000075   .000020
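
The joint inclusion probabilities above come from Overton's approximation.  A minimal sketch
of that calculation is given below (illustrative Python, not part of the original manual; the
function name is an assumption).

    # Illustrative sketch: Overton's approximation
    #   pi_ij = 2 (n_a - 1) pi_i pi_j / (2 n_a - pi_i - pi_j),  i != j.
    import numpy as np

    def overton_joint_probs(pi):
        pi = np.asarray(pi, dtype=float)
        n = pi.size
        pij = 2.0 * (n - 1) * np.outer(pi, pi) / (2.0 * n - pi[:, None] - pi[None, :])
        np.fill_diagonal(pij, pi)     # pi_ii = pi_i by convention
        return pij

    pi = [.00375, .01406, .07734, .75, .0375, .75, .00375, .02227, .00586, .00375]
    print(np.round(overton_joint_probs(pi), 6))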
6.4    Obtain Subpopulation Size

Input N_a if using a known subpopulation size.  N_a = 1130 for this data set.

Calculate N̂_a from the sample data only if using the variance estimator of the Horvitz-
Thompson ratio estimator of a CDF.  Sum the reciprocals of the inclusion probabilities, π_i,
for all units in the sample from a to obtain N̂_a.

N̂_a = (1/.00375) + (1/.01406) + (1/.07734) + . . . + (1/.00375) = 1128.939 for this data set.

6.5    Input Indicator Levels of Interest and Estimated CDF Values
For this example data, the variance of the empirical CDF is of interest; xk values = (.7395,
1.2204, 1.5992, 2, 2.3707, 2.8196, 2.9399, 7).

Input F̂_a(x_k) for each x_k if the Horvitz-Thompson ratio estimator was used to estimate the
CDF.

    Calcium     CDF for Proportion,
      x_k       Ratio Estimator
                F̂_a(x_k)
     .7395        .2362
    1.2204        .2992
    1.5992        .3355
    2.0000        .3366
    2.3707        .5729
    2.8196        .6126
    2.9399        .7638
    7.0000       1
6.6    Compute Estimated Variance Values

Calculate V̂[F̂_a(x_k)] for each x_k using the formulas from Section 5.

Compare each y_i to x_k.  Set I(y_i ≤ x_k) = 1 if y_i ≤ x_k.  If this is not the case, set this term
equal to zero.  Sum across the y_i data values and divide by the applicable subpopulation size
squared (N_a² or N̂_a²).

Do this for each x_k.  Results for the example data are given below.  For the example using a
known subpopulation size, N_a = 1130 is used.
    Calcium    Estimated Variance of    Estimated Variance of
      x_k      CDF for Proportion,      CDF for Proportion,
               Ratio Estimator          N_a = 1130
               V̂[F̂_a(x_k)]              V̂[F̂_a(x_k)]

     .7395         .044710                  .055482
    1.2204         .046005                  .056116
    1.5992         .046453                  .054400
    2.0000         .046467                  .054346
    2.3707         .052579                  .092363
    2.8196         .052209                  .088936
    2.9399         .044710                  .091247
    7.0000        0                         .106996
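
For readers scripting this method, the sketch below (illustrative Python, not part of the
original manual; the function name and arguments are assumptions) evaluates both variance
forms of Section 5 at a single level x_k, using the residuals d_i for the ratio case.

    # Illustrative sketch: Horvitz-Thompson variance of the CDF (proportion) at one level.
    import numpy as np

    def ht_var_cdf_proportion(xk, y, pi, pij, N_known=None):
        """pij is the n_a x n_a joint inclusion probability matrix (pi_i on its diagonal)."""
        y, pi, pij = np.asarray(y), np.asarray(pi), np.asarray(pij)
        I = (y <= xk).astype(float)
        if N_known is not None:                     # non-ratio form, known N_a
            z, denom = I, N_known ** 2
        else:                                       # ratio form, estimated N_hat_a
            N_hat = np.sum(1.0 / pi)
            F_hat = np.sum(I / pi) / N_hat
            z, denom = I - F_hat, N_hat ** 2        # residuals d_i = I(y_i <= xk) - F_hat
        t = z / pi                                  # expanded values z_i / pi_i
        var = np.sum((1.0 - pi) * t ** 2)           # single-sum term
        cross = (pij - np.outer(pi, pi)) / pij * np.outer(t, t)
        np.fill_diagonal(cross, 0.0)                # exclude i = j from the double sum
        return (var + np.sum(cross)) / denom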
7  Associated Methods

An appropriate estimator for the estimated CDF for discrete resources may be found in
Method 1 (Horvitz-Thompson Estimator).

8  Validation Data

Actual data with results, EMAP Design and Statistics Dataset #5, are available for comparing
results from other versions of  these algorithms.

9  Notes

Inclusion probabilities, π_i, and joint inclusion probabilities, π_ij, are determined by the design
and should be furnished with the design points.  In some instances, the joint inclusion
probabilities may be calculated from a formula such as Overton's approximation, where
π_ij = 2(n_a − 1)π_i π_j / (2n_a − π_i − π_j), which is used in Section 6.3.

10  References

Cochran, W. G.  1977.  Sampling techniques. 3rd Edition.  New York: John Wiley & Sons.

Lesser, V. M., and W. S. Overton.  1994.  EMAP status estimation:  Statistical procedures
 and algorithms.  EPA/620/R-94/008.  Washington, DC:  U.S. Environmental Protection
 Agency.

-------
             EMAP Estimation Method 5, Rev. No. 0, May 1996, Page 8 of 8


Overton, W. S., D. White, and D. L. Stevens, Jr.  1990.  Design report for EMAP,
 Environmental Monitoring and Assessment Program.  EPA 600/3-91/053.  Corvallis, OR:
 U.S. Environmental Protection Agency, Environmental Research Laboratory.

Sarndal, C. E., B. Swensson, and J. Wretman.  1992.  Model assisted survey sampling.  New
 York:  Springer-Verlag.

-------
              EMAP Estimation Method 6, Rev. No. 0, May 1996, Page?! of 8
ESTIMATION METHOD 6:  Estimation of Variance of the Cumulative Distribution
Function for the Total Number of a Discrete Resource;  Horvitz-Thompson Variance
Estimator

1  Scope and Application

This method calculates the estimated variance of the estimated cumulative distribution
function (CDF) for the total number of a discrete resource that has an indicator value  equal to
or less than a given indicator level.  There are two variance estimators presented in this
method.  An estimate can be produced for the entire population or for an arbitrary
subpopulation with known or unknown size.  This size is the total number of units in  the
subpopulation. The method applies to any probability sample and the variance estimate will
be produced at the supplied indicator levels of interest. This method does not include
estimators  for the  CDF.  For information on CDF estimators, refer to Section 7.

2  Statistical Estimation Overview

A sample of size na units is selected from subpopulation a with known inclusion probabilities
π = {π_1, ..., π_i, ..., π_{n_a}} and joint inclusion probabilities given by π_ij, where i ≠ j.  The indicator is
evaluated for each unit and represented by y = {y_1, ..., y_i, ..., y_{n_a}}.  The inclusion probabilities
are design dependent and should be furnished with the design points.  See Section 9 for
further discussion.

The Horvitz-Thompson variance estimator of the CDF for total number, V̂[F̂_a(x_k)], is
calculated for each value of the indicator levels of interest, x_k.  There are two Horvitz-
Thompson  variance estimators presented in this method. The first is a variance estimator of
the Horvitz-Thompson estimator of a total. The second is a variance estimator of a Horvitz-
Thompson  ratio estimator.  This variance estimator requires as input the CDF estimates
produced using the Horvitz-Thompson ratio estimator of the CDF for total number, along with
the known  subpopulation size.

The output consists of  the estimated variance values.

3  Conditions Under Which This Method Applies

•       Probability  sample with known inclusion probabilities and joint inclusion probabilities
•       Discrete resource
•       Arbitrary subpopulation
•       All  units sampled from the subpopulation must be accounted for before applying this
       method

-------
             EMAP Estimation Method 6, Rev. No. 0, May 1996, Page 2 of 8


4  Required Elements

4.1 Input Data

y_i  =  value of the indicator for the ith unit sampled from subpopulation a.
π_i  =  inclusion probability for selecting the ith unit of subpopulation a.
π_ij =  joint inclusion probability for selecting both the ith and jth units of subpopulation a.
F̂_a(x_k) =  estimated CDF (total number) for indicator value x_k in subpopulation a.

4.2 Additional Components

n_a  =  number of units sampled from subpopulation a.
x_k  =  kth indicator level of interest.
N_a  =  subpopulation size, if known.

5  Formulas and Definitions

The estimated variance of the estimated CDF (total number) for indicator value x_k in
subpopulation a, V̂[F̂_a(x_k)], with known subpopulation size, N_a; Horvitz-Thompson
variance estimator of the Horvitz-Thompson estimator of a CDF is

$$\hat{V}[\hat{F}_a(x_k)] = \sum_{i=1}^{n_a}\frac{(1-\pi_i)\,I(y_i\le x_k)}{\pi_i^2} + \sum_{i=1}^{n_a}\sum_{j\ne i}\frac{\pi_{ij}-\pi_i\pi_j}{\pi_{ij}}\,\frac{I(y_i\le x_k)}{\pi_i}\,\frac{I(y_j\le x_k)}{\pi_j}$$

The estimated variance of the estimated CDF (total number) for indicator value x_k in
subpopulation a, V̂[F̂_a(x_k)], with estimated subpopulation size, N̂_a; Horvitz-Thompson
variance estimator of the Horvitz-Thompson ratio estimator of a CDF is

$$\hat{V}[\hat{F}_a(x_k)] = \frac{N_a^2}{\hat{N}_a^2}\left[\sum_{i=1}^{n_a}\frac{(1-\pi_i)\,d_i^2}{\pi_i^2} + \sum_{i=1}^{n_a}\sum_{j\ne i}\frac{\pi_{ij}-\pi_i\pi_j}{\pi_{ij}}\,\frac{d_i}{\pi_i}\,\frac{d_j}{\pi_j}\right]$$

where d_i = I(y_i ≤ x_k) − F̂_a(x_k)/N_a and N̂_a = Σ_{i=1}^{n_a} 1/π_i.
For these equations:

F̂_a(x_k) = estimated CDF (total number) for indicator value x_k in subpopulation a.

$$I(y_i \le x_k) = \begin{cases} 1, & y_i \le x_k \\ 0, & \text{otherwise} \end{cases}$$

x_k  = kth indicator level of interest.
y_i  = value of the indicator for the ith unit sampled from subpopulation a.
π_i  = inclusion probability for selecting the ith unit of subpopulation a.
π_ij = joint inclusion probability for selecting both the ith and jth units of subpopulation a.
n_a  = number of units sampled from subpopulation a.

6  Procedure

6.1    Enter Data
Input the sample data consisting of the indicator values, y_i, and their associated inclusion
probabilities, π_i.  For example,

    Calcium     Inclusion
      y_i       Probability
                   π_i
    1.5992        .07734
    2.3707        .00375
    1.5992        .75000
    2.0000        .75000
    7.0000        .00375
    2.8196        .02227
    1.2204        .01406
    1.5992        .03750
    2.9399        .00586
     .7395        .00375

-------
               EMAP Estimation Method 6, Rev. No. 0, May 1996, Page 4 of 8
6.2    Sort Data

Sort the sample data in nondecreasing order based on the y_i indicator values.  Keep all
occurrences of an indicator value to obtain correct results.

    Calcium     Inclusion
      y_i       Probability
                   π_i
     .7395        .00375
    1.2204        .01406
    1.5992        .07734
    1.5992        .75000
    1.5992        .03750
    2.0000        .75000
    2.3707        .00375
    2.8196        .02227
    2.9399        .00586
    7.0000        .00375
 6.3     Compute or Input Joint Inclusion Probabilities

The required joint inclusion probabilities are given in the following table.  For this example,
they were computed by the formula π_ij = 2(n_a − 1)π_i π_j / (2n_a − π_i − π_j).

Joint Inclusion Probability π_ij = π_ji ,  π_ii = π_i

  i\j      1         2         3         4         5         6         7         8         9
   2    .000047
   3    .000262   .000983
   4    .002630   .009867   .054457
   5    .000127   .000476   .002625   .026350
   6    .002630   .009867   .054457   .547297   .026350
   7    .000013   .000047   .000262   .002630   .000127   .002630
   8    .000075   .000282   .001558   .015636   .000754   .015636   .000075
   9    .000020   .000074   .000410   .004111   .000198   .004111   .000020   .000118
  10    .000013   .000047   .000262   .002630   .000127   .002630   .000013   .000075   .000020
6.4    Obtain Subpopulation Size


Input N_a if using a known subpopulation size.  N_a = 1130 for this data set.

Calculate N̂_a from the sample data only if using the variance estimator of the Horvitz-
Thompson ratio estimator of a CDF.  Sum the reciprocals of the inclusion probabilities, π_i,
for all units in the sample from a to obtain N̂_a.

N̂_a = (1/.00375) + (1/.01406) + (1/.07734) + . . . + (1/.00375) = 1128.939 for this data set.



6.5    Input Indicator Levels of Interest and Estimated CDF Values
For this example data, the variance of the empirical CDF is of interest; xk values = (.7395,
1.2204, 1.5992, 2, 2.3707, 2.8196, 2.9399, 7).

Input F̂_a(x_k) for each x_k if the Horvitz-Thompson ratio estimator was used to estimate the
CDF.

    Calcium     CDF for Total Number,
      x_k       Ratio Estimator
                F̂_a(x_k)
     .7395        266.91
    1.2204        338.10
    1.5992        379.12
    2.0000        380.36
    2.3707        647.38
    2.8196        692.24
    2.9399        863.09
    7.0000       1130
6.6    Compute Estimated Variance Values

Calculate V̂[F̂_a(x_k)] for each x_k using the formulas from Section 5.

Compare each y_i to x_k.  Set I(y_i ≤ x_k) = 1 if y_i ≤ x_k.  If this is not the case, set this term
equal to zero.

Do this for each x_k.  Results for the example data are given below.  For the example using a
known subpopulation size, N_a = 1130 is used.
    Calcium    Estimated Variance of      Estimated Variance of
      x_k      CDF for Total Number,      CDF for Total Number,
               Ratio Estimator            N_a = 1130
               V̂[F̂_a(x_k)]                V̂[F̂_a(x_k)]

     .7395          57090                      70845
    1.2204          58744                      71655
    1.5992          59316                      69463
    2.0000          59334                      69394
    2.3707          67138                     117938
    2.8196          66666                     113562
    2.9399          57090                     116513
    7.0000              0                     136623
7  Associated Methods

An appropriate estimator for the estimated CDF may be found in Method 2 (Horvitz-
Thompson Estimator).

8  Validation Data

Actual data with results, EMAP Design and Statistics Dataset #6, are available for comparing
results from other versions of these, algorithms.

9  Notes

Inclusion probabilities, π_i, and joint inclusion probabilities, π_ij, are determined by the design
and should be furnished with the design points.  In some instances, the joint inclusion
probabilities may be calculated from a formula such as Overton's approximation, where
π_ij = 2(n_a − 1)π_i π_j / (2n_a − π_i − π_j), which is used in Section 6.3.

10  References

Cochran, W. G. 1977. Sampling techniques.  3rd Edition. New York:  John Wiley & Sons.

Lesser, V. M., and W.  S. Overton.  1994. EMAP status estimation:  Statistical procedures
 and algorithms. EPA/620/R-94/008.  Washington, DC:  U.S. Environmental Protection
 Agency.

-------
             EMAP Estimation Method 6, Rev. No. 0, May  1996, Page 8 of 8
Overton, W. S., D. White, and D. L. Stevens, Jr.  1990.  Design report for EMAP,
 Environmental Monitoring and Assessment Program.  EPA 600/3-91/053.  Corvallis, OR:
 U.S. Environmental Protection Agency, Environmental Research Laboratory.

Sarndal, C. E., B. Swensson, and J. Wretman.  1992.  Model assisted survey sampling.  New
 York:  Springer-Verlag.

-------
              EMAP Estimation Method 7, Rev. No. 0, May 1996, Page"' 1 of 8


ESTIMATION METHOD 7:  Estimation of Variance of the Size-Weighted Cumulative
Distribution Function for the Proportion of a Discrete Resource; Horvitz-Thompson Variance
Estimator

1  Scope and Application

This method calculates the estimated variance of the estimated size-weighted cumulative
distribution  function (CDF) for the proportion of a discrete resource that has an indicator
value equal  to or less than a given indicator level. The size-weight is a measurement of the
discrete resource such as area of a lake. There are two variance estimators presented in this
method.  An estimate can be produced  for the entire population or for an arbitrary
subpopulation with known or unknown size.  This size is the size-weighted total in the
subpopulation.  The method applies to any probability  sample and  the variance estimate will
be produced at the supplied indicator levels of interest.  This method does  not include
estimators for the CDF.  For information on CDF estimators, refer to Section 7.

2  Statistical Estimation Overview

A sample of size na units is selected from subpopulation a with known inclusion probabilities
π = {π_1, ..., π_i, ..., π_{n_a}}, joint inclusion probabilities given by π_ij, where i ≠ j, and size-weight
values w = {w_1, ..., w_i, ..., w_{n_a}}.  The indicator is evaluated for each unit and represented by
y = {y_1, ..., y_i, ..., y_{n_a}}.  The inclusion probabilities are design dependent and should be
furnished with the design points.  See Section 9 for further discussion.

The Horvitz-Thompson variance estimator of the size-weighted CDF, V̂[F̂_a(x_k)], is
calculated for each value of the indicator levels of interest, xk.  There are two Horvitz-
Thompson variance estimators  presented in this method.  The first  is a variance estimator of
the Horvitz-Thompson estimator of a proportion.  The second is a variance estimator of a
Horvitz-Thompson ratio estimator.  The former estimator calculates the variance of the
Horvitz-Thompson estimator of a total and divides this variance by the known subpopulation
size (size-weighted total) squared, Wa2.  The latter estimator requires as input the  CDF
estimates produced using the Horvitz-Thompson ratio estimator of  the  size-weighted CDF for
proportion.

The output consists of the estimated variance values.

3  Conditions Under Which This Method Applies

•      Probability sample with known inclusion probabilities and joint inclusion probabilities
•      Discrete resource
•      Arbitrary subpopulation
•      All units sampled from  the subpopulation must  be accounted for before applying this
      method

-------
              EMAP Estimation Method 7, Rev. No. 0, May 1996, Page 2 of 8


4  Required Elements

4.1 Input Data

y_i  =  value of the indicator for the ith unit sampled from subpopulation a.
π_i  =  inclusion probability for selecting the ith unit of subpopulation a.
π_ij =  joint inclusion probability for selecting both the ith and jth units of subpopulation a.
w_i  =  size-weight value for the ith unit sampled from subpopulation a.
F̂_a(x_k) =  estimated size-weighted CDF (proportion) for indicator value x_k in
            subpopulation a.

4.2 Additional Components

n_a  =  number of units sampled from subpopulation a.
x_k  =  kth indicator level of interest.
W_a  =  subpopulation size (size-weighted total), if known.
5  Formulas and Definitions

The estimated variance of the estimated size-weighted CDF (proportion) for indicator value x_k
in subpopulation a, V̂[F̂_a(x_k)], with known subpopulation size, W_a; Horvitz-Thompson
variance estimator of the Horvitz-Thompson estimator of a CDF is

$$\hat{V}[\hat{F}_a(x_k)] = \frac{1}{W_a^2}\left[\sum_{i=1}^{n_a}\frac{(1-\pi_i)\,[w_i I(y_i\le x_k)]^2}{\pi_i^2} + \sum_{i=1}^{n_a}\sum_{j\ne i}\frac{\pi_{ij}-\pi_i\pi_j}{\pi_{ij}}\,\frac{w_i I(y_i\le x_k)}{\pi_i}\,\frac{w_j I(y_j\le x_k)}{\pi_j}\right]$$

The estimated variance of the estimated size-weighted CDF (proportion) for indicator value x_k
in subpopulation a, V̂[F̂_a(x_k)], with estimated subpopulation size, Ŵ_a; Horvitz-Thompson
variance estimator of the Horvitz-Thompson ratio estimator of a CDF is

$$\hat{V}[\hat{F}_a(x_k)] = \frac{1}{\hat{W}_a^2}\left[\sum_{i=1}^{n_a}\frac{(1-\pi_i)\,d_i^2}{\pi_i^2} + \sum_{i=1}^{n_a}\sum_{j\ne i}\frac{\pi_{ij}-\pi_i\pi_j}{\pi_{ij}}\,\frac{d_i}{\pi_i}\,\frac{d_j}{\pi_j}\right]$$

where d_i = w_i[I(y_i ≤ x_k) − F̂_a(x_k)] and Ŵ_a = Σ_{i=1}^{n_a} w_i/π_i.
For these equations:
F̂_a(x_k) = estimated size-weighted CDF (proportion) for indicator value x_k in
           subpopulation a.

$$I(y_i \le x_k) = \begin{cases} 1, & y_i \le x_k \\ 0, & \text{otherwise} \end{cases}$$

x_k  = kth indicator level of interest.
y_i  = value of the indicator for the ith unit sampled from subpopulation a.
π_i  = inclusion probability for selecting the ith unit of subpopulation a.
π_ij = joint inclusion probability for selecting both the ith and jth units of subpopulation a.
w_i  = size-weight value for the ith unit sampled from subpopulation a.
n_a  = number of units sampled from subpopulation a.

6  Procedure

6. 1    Enter Data

Input the sample data consisting of the indicator values, y_i, and their associated inclusion
probabilities, π_i, and size-weights, w_i.  For example,

    Calcium     Inclusion      Lake
      y_i       Probability    Area
                   π_i          w_i
    1.5992        .07734        24.249
    2.3707        .00375        92.251
    1.5992        .75000        28.018
    2.0000        .75000        52.953
    7.0000        .00375       362.254
    2.8196        .02227       140.671
    1.2204        .01406         7.758
    1.5992        .03750        29.702
    2.9399        .00586       149.276
     .7395        .00375         1.081

-------
               EMAP Estimation Method 7, Rev. No. 0, May 1996, Page 4 of 8
6.2    Sort Data

Sort the sample data in nondecreasing order based on the y_i indicator values.  Keep all
occurrences of an indicator value to obtain correct results.

    Calcium     Inclusion      Lake
      y_i       Probability    Area
                   π_i          w_i
     .7395        .00375         1.081
    1.2204        .01406         7.758
    1.5992        .07734        24.249
    1.5992        .75000        28.018
    1.5992        .03750        29.702
    2.0000        .75000        52.953
    2.3707        .00375        92.251
    2.8196        .02227       140.671
    2.9399        .00586       149.276
    7.0000        .00375       362.254
6.3    Compute or Input Joint Inclusion Probabilities

The required joint inclusion probabilities are given in the following table.  For this example,
they were computed by the formula π_ij = 2(n_a − 1)π_i π_j / (2n_a − π_i − π_j).

Joint Inclusion Probability π_ij = π_ji ,  π_ii = π_i

  i\j      1         2         3         4         5         6         7         8         9
   2    .000047
   3    .000262   .000983
   4    .002630   .009867   .054457
   5    .000127   .000476   .002625   .026350
   6    .002630   .009867   .054457   .547297   .026350
   7    .000013   .000047   .000262   .002630   .000127   .002630
   8    .000075   .000282   .001558   .015636   .000754   .015636   .000075
   9    .000020   .000074   .000410   .004111   .000198   .004111   .000020   .000118
  10    .000013   .000047   .000262   .002630   .000127   .002630   .000013   .000075   .000020
6.4    Obtain Subpopulation Size

Input W_a if using a known subpopulation size.  W_a = 156000 for this data set.

Calculate Ŵ_a from the sample data only if using the variance estimator of the Horvitz-
Thompson ratio estimator of a CDF.  Divide each w_i by its inclusion probability, π_i, for all
units in the sample from a.  Sum these quotients to obtain Ŵ_a.

Ŵ_a = (1.081/.00375) + (7.758/.01406) + (24.249/.07734) + . . . + (362.254/.00375) =
155045.265 for this data set.

6.5    Input Indicator Levels of Interest and Estimated CDF Values
For this example data, the variance of the empirical CDF is of interest; xk values = (.7395,
1.2204, 1.5992, 2, 2.3707, 2.8196, 2.9399, 7).

Input F̂_a(x_k) for each x_k if the Horvitz-Thompson ratio estimator was used to estimate the
CDF.

    Calcium     Size-Weighted CDF
      x_k       for Proportion,
                Ratio Estimator
                F̂_a(x_k)
     .7395        .0019
    1.2204        .0054
    1.5992        .0128
    2.0000        .0132
    2.3707        .1719
    2.8196        .2127
    2.9399        .3769
    7.0000       1
6.6    Compute Estimated Variance Values

Calculate V̂[F̂_a(x_k)] for each x_k using the formulas from Section 5.

Compare each y_i to x_k.  Set I(y_i ≤ x_k) = 1 if y_i ≤ x_k.  If this is not the case, set this term
equal to zero.

Calculate the numerator of the variance formula by summing across all the y_i data values.
Divide by the applicable subpopulation size squared.

When the variance of the non-ratio form of the CDF estimator is used, the calculation may be
simplified.  Sum across the y_i data values until y_i exceeds x_k (when using sorted data) instead
of across all the y_i data values, because each additional term will contribute zero to the sum.
Divide this sum by the subpopulation size squared.

Do this for each x_k.  Results for the example data are in Section 6.7.  For the example using
a known subpopulation size, W_a = 156000 is used.
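
A size-weighted version of this computation can be written by letting w_i I(y_i ≤ x_k) play the
role of the variable.  The sketch below is an illustration only (not part of the original manual;
names are assumptions) and mirrors the formulas in Section 5.

    # Illustrative sketch: variance of the size-weighted CDF (proportion) at one level.
    import numpy as np

    def ht_var_sw_cdf_proportion(xk, y, pi, w, pij, W_known=None):
        """pij is the n_a x n_a joint inclusion probability matrix (pi_i on its diagonal)."""
        y, pi, w, pij = map(np.asarray, (y, pi, w, pij))
        z = w * (y <= xk)                          # weighted indicator w_i * I(y_i <= xk)
        if W_known is not None:                    # non-ratio form, known W_a
            denom = W_known ** 2
        else:                                      # ratio form, estimated W_hat_a
            W_hat = np.sum(w / pi)
            F_hat = np.sum(z / pi) / W_hat         # size-weighted CDF (proportion)
            z = z - w * F_hat                      # residuals d_i = w_i [I(...) - F_hat]
            denom = W_hat ** 2
        t = z / pi
        var = np.sum((1.0 - pi) * t ** 2)          # single-sum term
        cross = (pij - np.outer(pi, pi)) / pij * np.outer(t, t)
        np.fill_diagonal(cross, 0.0)               # exclude i = j from the double sum
        return (var + np.sum(cross)) / denom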

6.7    Output Results

Output the indicator levels of interest and at least the associated estimated variance.

    Calcium    Estimated Variance of      Estimated Variance of
      x_k      Size-Weighted CDF          Size-Weighted CDF
               for Proportion,            for Proportion,
               Ratio Estimator            W_a = 156000
               V̂[F̂_a(x_k)]                V̂[F̂_a(x_k)]

     .7395         .000006                    .000003
    1.2204         .000032                    .000014
    1.5992         .000129                    .000032
    2.0000         .000134                    .000031
    2.3707         .032777                    .024361
    2.8196         .039204                    .024451
    2.9399         .084169                    .043356
    7.0000        0                           .374148
7  Associated Methods

An appropriate estimator for the estimated CDF for discrete resources may be found in
Method 3 (Horvitz-Thompson Estimator).

8  Validation Data

Actual data with results, EMAP Design and Statistics Dataset #7, are available for comparing
results from other versions of these algorithms.

9  Notes

Inclusion probabilities, π_i, and joint inclusion probabilities, π_ij, are determined by the design
and should be furnished with the design points.  In some instances, the joint inclusion
probabilities may be calculated from a formula such as Overton's approximation, where
π_ij = 2(n_a − 1)π_i π_j / (2n_a − π_i − π_j), which is used in Section 6.3.

10  References

Cochran, W. G.  1977. Sampling techniques.  3rd Edition.  New York: John Wiley & Sons.

Lesser, V. M., and W.  S. Overton. 1994. EMAP status estimation: Statistical procedures
 and algorithms.  EPA/620/R-94/008.  Washington, DC: U.S. Environmental Protection
 Agency.

-------
             EMAP Estimation Method 7, Rev. No. 0, May 1996, Page 8 of 8
Overton, W. S., D. White, and D. L. Stevens Jr.  1990. Design report for EMAP,
 Environmental Monitoring and Assessment Program. EPA 600/3-91/053.  Corvallis, OR:
 U.S. Environmental Protection Agency, Environmental Research Laboratory.

Sarndal, C. E., B. Swensson, and J. Wretman, 1992. Model assisted survey sampling. New
 York:  Springer-Verlag.

-------
              EMAP Estimation Method 8, Rev. No. 0, May 1996, Page'l  of 8


ESTIMATION METHOD 8:  Estimation of Variance of the Size-Weighted Cumulative
Distribution Function for the Total of a Discrete Resource; Horvitz-Thompson Variance
Estimator

1  Scope and Application

This method calculates the estimated variance of the estimated size-weighted cumulative
distribution  function (CDF) for the total of a discrete resource that has an indicator value
equal to or less than a given indicator level.  The size-weight is a measurement of the discrete
resource such as area of a lake. There are two variance estimators presented in this method.
An estimate can be produced for the entire population or for an arbitrary subpopulation  with
known or unknown size.  This size is the size-weighted total in the subpopulation. The
method applies to any probability sample and the variance estimate will be produced at  the
supplied indicator levels of interest.  This method does not include estimators for the CDF.
For information on  CDF estimators, refer to Section 7.

2  Statistical Estimation Overview

A sample of size na units is selected from subpopulation a with known inclusion probabilities
π = {π_1, ..., π_i, ..., π_{n_a}}, joint inclusion probabilities given by π_ij, where i ≠ j, and size-weight
values w = {w_1, ..., w_i, ..., w_{n_a}}.  The indicator is evaluated for each unit and represented by
y = {y_1, ..., y_i, ..., y_{n_a}}.  The inclusion probabilities are design dependent and should be
furnished with the design points.  See Section 9 for further discussion.

The Horvitz-Thompson variance estimator of the size-weighted CDF for total, V̂[F̂_a(x_k)], is
calculated for each value of the indicator levels of interest, xk .  There are two Horvitz-
Thompson variance estimators  presented in this method. The first is a variance estimator of
the Horvitz-Thompson estimator of a total. The second is a variance estimator of a Horvitz-
Thompson ratio estimator.  This variance estimator requires as input the CDF estimates
produced using the Horvitz-Thompson ratio estimator of the size-weighted CDF for total,
along with the known subpopulation size.

The output consists of the estimated variance values.

3  Conditions Under Which This Method Applies

•      Probability sample with known inclusion probabilities and joint inclusion probabilities
•      Discrete resource
•      Arbitrary subpopulation
•      All units sampled from  the subpopulation must be accounted for before applying this
      method

-------
             EMAP Estimation Method 8, Rev. No. 0, May 1996, Page 2 of 8
4  Required Elements

4.1 Input Data

y_i  =  value of the indicator for the ith unit sampled from subpopulation a.
π_i  =  inclusion probability for selecting the ith unit of subpopulation a.
π_ij =  joint inclusion probability for selecting both the ith and jth units of subpopulation a.
w_i  =  size-weight value for the ith unit sampled from subpopulation a.
F̂_a(x_k) =  estimated size-weighted CDF (total) for indicator value x_k in subpopulation a.

4.2 Additional Components

n_a  =  number of units sampled from subpopulation a.
x_k  =  kth indicator level of interest.
W_a  =  subpopulation size (size-weighted total), if known.

5  Formulas and Definitions

The estimated variance of the estimated size-weighted CDF (total) for indicator value x_k in
subpopulation a, V̂[F̂_a(x_k)], with known subpopulation size, W_a; Horvitz-Thompson
variance estimator of the Horvitz-Thompson estimator of a CDF is

$$\hat{V}[\hat{F}_a(x_k)] = \sum_{i=1}^{n_a}\frac{(1-\pi_i)\,[w_i I(y_i\le x_k)]^2}{\pi_i^2} + \sum_{i=1}^{n_a}\sum_{j\ne i}\frac{\pi_{ij}-\pi_i\pi_j}{\pi_{ij}}\,\frac{w_i I(y_i\le x_k)}{\pi_i}\,\frac{w_j I(y_j\le x_k)}{\pi_j}$$

The estimated variance of the estimated size-weighted CDF (total) for indicator value x_k in
subpopulation a, V̂[F̂_a(x_k)], with estimated subpopulation size, Ŵ_a; Horvitz-Thompson
variance estimator of the Horvitz-Thompson ratio estimator of a CDF is

$$\hat{V}[\hat{F}_a(x_k)] = \frac{W_a^2}{\hat{W}_a^2}\left[\sum_{i=1}^{n_a}\frac{(1-\pi_i)\,d_i^2}{\pi_i^2} + \sum_{i=1}^{n_a}\sum_{j\ne i}\frac{\pi_{ij}-\pi_i\pi_j}{\pi_{ij}}\,\frac{d_i}{\pi_i}\,\frac{d_j}{\pi_j}\right]$$

where d_i = w_i[I(y_i ≤ x_k) − F̂_a(x_k)/W_a] and Ŵ_a = Σ_{i=1}^{n_a} w_i/π_i.

For these equations:

F̂_a(x_k) = estimated size-weighted CDF (total) for indicator value x_k in subpopulation a.

$$I(y_i \le x_k) = \begin{cases} 1, & y_i \le x_k \\ 0, & \text{otherwise} \end{cases}$$

x_k  = kth indicator level of interest.
y_i  = value of the indicator for the ith unit sampled from subpopulation a.
π_i  = inclusion probability for selecting the ith unit of subpopulation a.
π_ij = joint inclusion probability for selecting both the ith and jth units of subpopulation a.
w_i  = size-weight value for the ith unit sampled from subpopulation a.
n_a  = number of units sampled from subpopulation a.

6  Procedure

6.1    Enter Data

Input the sample data consisting of the indicator values, y_i, and their associated inclusion
probabilities, π_i, and size-weights, w_i.  For example,

    Calcium     Inclusion      Lake
      y_i       Probability    Area
                   π_i          w_i
    1.5992        .07734        24.249
    2.3707        .00375        92.251
    1.5992        .75000        28.018
    2.0000        .75000        52.953
    7.0000        .00375       362.254
    2.8196        .02227       140.671
    1.2204        .01406         7.758
    1.5992        .03750        29.702
    2.9399        .00586       149.276
     .7395        .00375         1.081

-------
              EMAP Estimation Method 8, Rev. No. 0, May 1996, Page 4 of 8
6.2    Sort Data

Sort the sample data in nondecreasing order based on the y_i indicator values.  Keep all
occurrences of an indicator value to obtain correct results.

    Calcium     Inclusion      Lake
      y_i       Probability    Area
                   π_i          w_i
     .7395        .00375         1.081
    1.2204        .01406         7.758
    1.5992        .07734        24.249
    1.5992        .75000        28.018
    1.5992        .03750        29.702
    2.0000        .75000        52.953
    2.3707        .00375        92.251
    2.8196        .02227       140.671
    2.9399        .00586       149.276
    7.0000        .00375       362.254
6.3    Compute or Input Joint Inclusion Probabilities

The required joint inclusion probabilities are given in the following table.  For this example,
they were computed by the formula π_ij = 2(n_a − 1)π_i π_j / (2n_a − π_i − π_j).

Joint Inclusion Probability π_ij = π_ji ,  π_ii = π_i

  i\j      1         2         3         4         5         6         7         8         9
   2    .000047
   3    .000262   .000983
   4    .002630   .009867   .054457
   5    .000127   .000476   .002625   .026350
   6    .002630   .009867   .054457   .547297   .026350
   7    .000013   .000047   .000262   .002630   .000127   .002630
   8    .000075   .000282   .001558   .015636   .000754   .015636   .000075
   9    .000020   .000074   .000410   .004111   .000198   .004111   .000020   .000118
  10    .000013   .000047   .000262   .002630   .000127   .002630   .000013   .000075   .000020
6.4    Obtain Subpopulation Size



Input W_a if using a known subpopulation size.  W_a = 156000 for this data set.

Calculate Ŵ_a from the sample data only if using the variance estimator of the Horvitz-
Thompson ratio estimator of a CDF.  Divide each w_i by its inclusion probability, π_i, for all
units in the sample from a.  Sum these quotients to obtain Ŵ_a.

Ŵ_a = (1.081/.00375) + (7.758/.01406) + (24.249/.07734) + . . . + (362.254/.00375) =
155045.265 for this data set.



6.5    Input Indicator Levels of Interest and Estimated CDF Values
For this example data, the variance of the empirical CDF is of interest; xk values = (.7395,

1.2204, 1.5992, 2,  2.3707, 2.8196, 2.9399, 7).

Input F̂_a(x_k) for each x_k if the Horvitz-Thompson ratio estimator was used to estimate the
CDF.

    Calcium     Size-Weighted CDF
      x_k       for Total,
                Ratio Estimator
                F̂_a(x_k)
     .7395          290
    1.2204          845
    1.5992         1995
    2.0000         2066
    2.3707        26818
    2.8196        33174
    2.9399        58804
    7.0000       156000
6.6    Compute Estimated Variance Values

Calculate V̂[F̂_a(x_k)] for each x_k using the formulas from Section 5.

Compare each y_i to x_k.  Set I(y_i ≤ x_k) = 1 if y_i ≤ x_k.  If this is not the case, set this term
equal to zero.

Calculate the variance of the Horvitz-Thompson ratio estimator of the CDF by calculating the
numerator portion of the equation that sums across all the y_i data values.  Multiply this
quantity by W_a²/Ŵ_a² to obtain the variance.

When the variance of the non-ratio form of the CDF estimator is used, the calculation is
simpler.  Sum across the y_i data values until y_i exceeds x_k (when using sorted data) instead of
across all the y_i data values, because each additional term will contribute zero to the sum.

Do this for each x_k.  Results for the example data are in Section 6.7.  For the example using
a known subpopulation size, W_a = 156000 is used.

-------
              EMAP Estimation Method 8, Rev. No. 0, May 1996, Page*7 of 8


6.7    Output Results

Output the indicator levels of interest and at least the associated estimated variance.

    Calcium    Estimated Variance of      Estimated Variance of
      x_k      Size-Weighted CDF          Size-Weighted CDF
               for Total,                 for Total,
               Ratio Estimator            W_a = 156000
               (× 100,000)                (× 100,000)
               V̂[F̂_a(x_k)]                V̂[F̂_a(x_k)]

     .7395         1.33932                     .82786
    1.2204         7.75618                    3.47933
    1.5992        31.28985                    7.80689
    2.0000        32.70403                    7.63202
    2.3707      7976.63932                 5928.54471
    2.8196      9540.63204                 5950.33953
    2.9399     20483.46898                10551.14618
    7.0000         0                      91052.68363
7  Associated Methods

An appropriate estimator for the estimated CDF may'be found in Method 4 (Horvitz-
Thompson Estimator).

8  Validation Data

Actual data with results, EMAP Design and  Statistics Dataset #8, are available for comparing
results from other versions of these algorithms.

9  Notes

Inclusion probabilities, π_i, and joint inclusion probabilities, π_ij, are determined by the design
and should be furnished with the design points.  In some instances, the joint inclusion
probabilities may be calculated from a formula such as Overton's approximation, where
π_ij = 2(n_a − 1)π_i π_j / (2n_a − π_i − π_j), which is used in Section 6.3.

-------
             EMAP Estimation Method 8, Rev. No. 0, May 1996, Page 8 of 8


10  References

Cochran, W. G.  1977.  Sampling techniques. 3rd Edition. New York: John Wiley & Sons.

Lesser, V. M., and W. S. Overton.  1994.  EMAP status estimation:  Statistical procedures
 and algorithms. EPA/620/R-94/008.  Washington, DC:  U.S. Environmental Protection
 Agency.

Overton, W. S., D. White, and D. L. Stevens Jr.  1990.  Design report for EMAP,
 Environmental Monitoring and Assessment Program. EPA 600/3-91/053. Corvallis, OR:
 U.S. Environmental Protection Agency, Environmental Research Laboratory.

Sarndal, C. E., B. Swensson, and J. Wretman.  1992.  Model assisted survey sampling.  New
 York:  Springer-Verlag.

-------
 ESTIMATION METHOD  9: The Parametric Jackknife Estimator
 Estimation of the Cumulative Distribution of a Finite Population.

 1.    Scope and Application
 An important aspect of environmental statistics is to measure specific indicators in
 order to monitor the status of the environment.  Frequently these indicators are subject
 to measurement error.  When sample units are measured with error, the naive estimator
 of the population cumulative distribution obtained when the measurement error is
 ignored is biased and may be misleading.  The purpose of this report is to describe a
 bias-adjusted estimator proposed by Stefanski and Bay [3] for the cumulative
 distribution of a finite population in the presence of measurement error.  This estimator,
 called the parametric Jackknife, reduces much of  the bias induced by the measurement
 error.
      A variance estimator for the parametric Jackknife estimator is obtained using
 Horvitz-Thompson estimation.

 2.    Statistical Estimation Overview
 A sampling model is assumed in which a sample of size n is selected from a population
U = {U_1, U_2, ..., U_N} with inclusion probabilities {π_i}_{i=1}^{N} and joint inclusion
probabilities {π_ij}, 1 ≤ i < j ≤ N.  The observed data consist of {X_i}_{i=1}^{n}, where X_i is a
measured value of U_i and is subject to measurement error.  It is assumed that X_i =
U_i + σZ_i for i = 1, ..., n, where {Z_i}_{i=1}^{n} are mutually independent, independent of the
random sampling, and identically distributed standard normal random variables.  Thus,
the measurement errors in the observed sample are normally distributed with mean zero
and variance σ².
      The estimation procedure involves adding additional measurement error in known
increments to the observed data, computing cumulative distribution estimates from
these contaminated data, establishing a relationship between these estimates and the
measurement error  variance, and extrapolating this relationship back to the case of no
measurement error.
Computing X_i(λ) = X_i + σ√λ Z_i* for i = 1, ..., n, where Z_i* is a standard normal
pseudo-random variable and λ > 0 is a constant, increases the variability of the
measurement error.  The total measurement error variance of the resulting data is
σ²(1 + λ).  Cumulative distribution estimates are calculated at a fixed argument from
{X_i(λ)}_{i=1}^{n} over a range of values of λ.  The expectation of these estimates is
approximated by a quadratic function in λ.  Least squares regression of the cumulative
distribution estimates on λ estimates the parameters of this quadratic model.
Extrapolation to the case of no measurement error, i.e., λ = −1, gives the parametric
Jackknife estimator.
      Refer to § 8.3. for a more detailed explanation of this estimation procedure and for
details on calculating a variance estimate of a parametric Jackknife  estimate.
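
For orientation, the extrapolation idea can be sketched in a few lines of Python.  The fragment
below is an illustration only, written for an equal-probability sample with known σ; it is not
the exact estimator of Stefanski and Bay [3] (which uses Horvitz-Thompson weighting and a
closed-form extrapolation), and all names are assumptions.

    # Illustrative sketch: contaminate, estimate, fit a quadratic in lambda, extrapolate.
    import numpy as np

    def jackknife_cdf_at_t(x, sigma, t, lambdas=np.linspace(0.05, 2.0, 8), rng=None):
        rng = np.random.default_rng(rng)
        x = np.asarray(x, dtype=float)
        # Naive CDF estimates from data contaminated with added variance sigma^2 * lambda.
        f_lam = np.array([np.mean(x + sigma * np.sqrt(lam) * rng.standard_normal(x.size) <= t)
                          for lam in lambdas])
        # Quadratic model E[F(lambda)] ~ b0 + b1*lambda + b2*lambda^2, fit by least squares.
        D = np.vander(lambdas, 3, increasing=True)      # columns: 1, lambda, lambda^2
        b, *_ = np.linalg.lstsq(D, f_lam, rcond=None)
        # Extrapolate to lambda = -1 (no measurement error): b0 - b1 + b2.
        return b[0] - b[1] + b[2]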

3.    Conditions Under Which This Method Applies
      • Probability sample with known inclusion and joint inclusion probabilities.
      • Finite population.
      • Data observed with error.
      • Additive independent and normally distributed measurement errors with mean
        zero and common variance a2.

4.    Required Elements
4.1.   Input Data
      N = population size.
      {X_1, ..., X_n} = probability sample of size n, where X_i is a measured value of U_i,
           the ith element in population U.
      {π_1, ..., π_n} = vector of inclusion probabilities; π_i is the probability of selecting
           element U_i from population U.
      [π_ij] = matrix of joint inclusion probabilities, where π_ij, i, j = 1, ..., n, is the
           probability of selecting elements U_i and U_j from population U; π_ii = π_i and
           π_ij = π_ji for i, j = 1, ..., n.
      σ² = measurement error variance.
      Var(σ̂²) = variance estimate of σ² when σ² is estimated; zero otherwise.

5.    Definitions and Formulas
Let t denote a fixed argument in the following definitions and formulas.  Define

      • F_{U,N}(t) = estimand (i.e., the population cumulative distribution (CD)).

      • F̂_{U,n}(t) = estimator of the population CD in the absence of measurement error.

      • F̂_{X,λ,n}(t) = the CD estimator based on the data {X_i + σ√λ Z_i*}_{i=1}^{n}, where {X_i}_{i=1}^{n}
        is the observed sample, {Z_i*}_{i=1}^{n} are standard normal pseudo-random variables,
        and λ > 0 is a constant.

      • F̂_{X,n,JK}(t) = parametric Jackknife estimator.

      • Var{F̂_{X,n,JK}(t)} = variance estimator of the parametric Jackknife estimator.

      • L(t) = lower 100(1 − α)% confidence limit for F̂_{X,n,JK}(t).

      • U(t) = upper 100(1 − α)% confidence limit for F̂_{X,n,JK}(t).


The formulas for the above definitions are as follows.
If σ² is known,

$$\hat{F}_{X,n,\mathrm{JK}}(t) = \frac{1}{N}\sum_{i=1}^{n}\frac{G(t;\,X_i,\,\sigma^2)}{\pi_i}$$

If σ² is estimated by σ̂², the same form is used with σ̂² in place of σ²,

$$\hat{F}_{X,n,\mathrm{JK}}(t) = \frac{1}{N}\sum_{i=1}^{n}\frac{G(t;\,X_i,\,\hat{\sigma}^2)}{\pi_i}$$

with

      φ = standard normal density function,
      Φ = standard normal cumulative distribution function,

      G(t; X_i, τ) = (1, −1, 1)(DᵀD)⁻¹DᵀY   (i.e., the least squares solution),

$$D = \begin{bmatrix} 1 & \lambda_1 & \lambda_1^2 \\ \vdots & \vdots & \vdots \\ 1 & \lambda_m & \lambda_m^2 \end{bmatrix}, \qquad Y = \left[\Phi\!\left(\frac{t - X_i}{\sqrt{\tau\lambda_1}}\right), \ldots, \Phi\!\left(\frac{t - X_i}{\sqrt{\tau\lambda_m}}\right)\right]^{\mathsf T},$$

and g(t; X_i, τ) = (1, −1, 1)(DᵀD)⁻¹Dᵀy, with D as defined above and y the corresponding
vector whose elements involve the density φ evaluated at (t − X_i)/√(τλ_j).  The variance
estimators and the confidence limits L(t) and U(t) are described in § 8.3 and in [3].
6.    Procedure
6.1.   Generate a sequence of k grid points, t_1 < t_2 < ··· < t_k, spanning the range of
       the observed data.
       E.g., suppose min{X_1, X_2, ..., X_n} = 0 and max{X_1, X_2, ..., X_n} = 25.  We could
       let k = 51 and define t_1 = 0, t_2 = 0.5, t_3 = 1.0, t_4 = 1.5, ..., t_50 = 24.5, t_51 = 25.0.

6.2.   Generate a sequence of m values 0 < λ_1 < λ_2 < ··· < λ_m.
       See § 8.1 for more information.
6.3.   For each grid point t_h, h = 1, ..., k,

6.3.1.     For each data value X_i, i = 1, ..., n,

6.3.1.1.       Calculate G(t_h; X_i, σ²)  (or G(t_h; X_i, σ̂²) when σ² is estimated),

               G(t_h; X_i, σ²) = (1, −1, 1)(DᵀD)⁻¹DᵀY,

               where

$$D = \begin{bmatrix} 1 & \lambda_1 & \lambda_1^2 \\ \vdots & \vdots & \vdots \\ 1 & \lambda_m & \lambda_m^2 \end{bmatrix}, \qquad Y = \left[\Phi\!\left(\frac{t_h - X_i}{\sigma\sqrt{\lambda_1}}\right), \ldots, \Phi\!\left(\frac{t_h - X_i}{\sigma\sqrt{\lambda_m}}\right)\right]^{\mathsf T}$$

               (Note, Φ is the standard normal cumulative distribution function).

6.3.1.2.       If σ² is estimated, calculate g(t_h; X_i, σ̂²) = (1, −1, 1)(DᵀD)⁻¹Dᵀy,
               where D is defined in § 6.3.1.1 above and the elements of y involve the
               standard normal density function φ evaluated at (t_h − X_i)/(σ̂√λ_j); see [3].
6.4.   Calculate the parametric Jackknife estimate F̂_{X,n,JK}(t_h), its variance estimate, and the
       confidence limits L(t_h) and U(t_h), using the formulas in § 5.

6.5.   Apply isotonic regression to {F̂_{X,n,JK}(t_1), ..., F̂_{X,n,JK}(t_k)} on {t_1, ..., t_k} [1].

6.6.   Restrict the range of {F̂_{X,n,JK}(t_h)}_{h=1}^{k} to [0, 1].
       Set h = 1.  While F̂_{X,n,JK}(t_h) < 0, set it to 0 and increment h.  End of While.
       Set h = k.  While F̂_{X,n,JK}(t_h) > 1, set it to 1 and decrement h.  End of While.

6.7.   Apply isotonic regression to {L(t_1), ..., L(t_k)} on {t_1, ..., t_k} and restrict the range of
       {L(t_h)}_{h=1}^{k} to [0, 1].
       See § 6.5 and § 6.6 above.

6.8.   Apply isotonic regression to {U(t_1), ..., U(t_k)} on {t_1, ..., t_k} and restrict the range of
       {U(t_h)}_{h=1}^{k} to [0, 1].
       See § 6.5 and § 6.6 above.
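
The isotonic regression step in §§ 6.5 - 6.8 can be carried out with the pool-adjacent-violators
algorithm.  The sketch below is an illustration only (not part of the original manual; the
function name is an assumption) and is followed by the [0, 1] range restriction.

    # Illustrative sketch: pool-adjacent-violators isotonic regression, then clip to [0, 1].
    import numpy as np

    def isotonic_pava(values, weights=None):
        """Return the nondecreasing sequence closest to `values` in weighted least squares."""
        v = np.asarray(values, dtype=float)
        w = np.ones_like(v) if weights is None else np.asarray(weights, dtype=float)
        # Each block holds (mean, weight, count); merge blocks while a decrease remains.
        blocks = [[v[i], w[i], 1] for i in range(v.size)]
        i = 0
        while i < len(blocks) - 1:
            if blocks[i][0] > blocks[i + 1][0]:        # violation: pool the two blocks
                m1, w1, n1 = blocks[i]
                m2, w2, n2 = blocks[i + 1]
                blocks[i] = [(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2, n1 + n2]
                del blocks[i + 1]
                i = max(i - 1, 0)                      # re-check against the previous block
            else:
                i += 1
        fitted = np.concatenate([np.full(n, m) for m, _, n in blocks])
        return np.clip(fitted, 0.0, 1.0)               # restrict range to [0, 1]

    print(isotonic_pava([0.1, 0.3, 0.25, 0.6, 0.55, 1.05]))
    # -> [0.1, 0.275, 0.275, 0.575, 0.575, 1.0]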

7     Associated Methods
A related method for estimating a cumulative distribution in the presence of
measurement error is described in Estimation Method 2: The Simulation- Extrapolation
Method.  This method does not assume a particular sampling model nor does it require
a finite population.

8.     Notes
8.1.   The algorithm given in § 6 requires specification of 0 < λ_1 < ··· < λ_m.  Stefanski
and Bay [3] propose taking equally spaced values over the interval [0.05, 2.00].  They
also suggest that m > 5, although the exact number of values is not critical.

-------
8.2.   The algorithm in § 6 calculates estimates of the cumulative proportion.  Estimates
of the cumulative total may be obtained by multiplying the estimates {F_{X,n,JK}(t_h)}_{h=1}^{k}
by N, the population size.  The variance estimator for the cumulative total is equal to
the variance estimator for the cumulative proportion times N².  Confidence limits
would need to be recalculated.  Additionally, the range of the estimates of the
cumulative total and its confidence limits would be [0, N] rather than [0, 1] as specified
for the cumulative proportion.

8.3.   This method of bias-adjustment is closely related to Quenouille's Jackknife.  The
usual Jackknife increases sampling variance by decreasing sample size.  In this method
measurement error variance is increased by adding pseudo-random errors to the
observed data, achieving the same "variance-inflation" effect as in the Jackknife
method.  This is done by calculating

                        X*_i(λ) = X_i + σ√λ Z_i,   i = 1, ..., n,

where {Z_i} are standard normal pseudo-random variables and λ > 0 is a constant.  The
variance of the additional error is σ²λ, so the total measurement error variance in X*_i(λ)
is σ²(1 + λ).  The relationship between λ and the CD estimator is estimated by least squares
regression of {F_{X,λ_j,n}(t)}_{j=1}^{m} on {λ_j}_{j=1}^{m}, where 0 < λ_1 < ··· < λ_m are fixed
constants.  Extrapolating to λ = −1, we obtain the parametric Jackknife estimator F_{X,n,JK}(t),
which may also be expressed as

                        F_{X,n,JK}(t) = (1/N) Σ_{i=1}^{n} G(t; X_i, σ²) / π_i,

where

         π_i = inclusion probability for selecting the ith element in population U,

                        G(t; X_i, σ²) = (1, −1, 1)(DᵀD)⁻¹DᵀY,

such that D is the m × 3 matrix whose jth row is (1, λ_j, λ_j²) and Y is the m-vector whose
jth element is Φ((t − X_i)/(σ√λ_j)).
      When σ² is known, the variance of F_{X,n,JK}(t) is estimated by the Horvitz-
Thompson estimator [2, p. 43] applied to {G(t; X_i, σ²)/π_i}_{i=1}^{n},
where
                               π_i and G are given above,
    π_ij = joint inclusion probability for selecting elements i and j from population U.

     When σ² is estimated, the Horvitz-Thompson estimator is still used to estimate
the variance of a parametric Jackknife estimate.  However, the additional variation due to
estimating σ² must also be accounted for.  Hence when σ² is estimated the variance
estimator is the Horvitz-Thompson variance estimator plus an additional term involving
g(t; X_i, σ̂²) and the estimated variance of σ̂²,
where
                               π_i, π_ij, and G are given above,
                               and g is defined in § 6.3.1.2.
See [3] for more detail.

9.    References
[1]    Barlow, R. E., Bartholomew, D. J., Bremner, J. M., and Brunk, H. D. (1972),
Statistical Inference under Order Restrictions, New York: John Wiley & Sons.
[2]    Sarndal, C. E., Swensson, B., and Wretman, J. (1992), Model Assisted Survey
Sampling, New York: Springer-Verlag.
[3]    Stefanski, L. A. and Bay, J. M. (1994), "Parametric Jackknife Deconvolution of
Finite Population CDF Estimators," in review.

-------
             EMAP Estimation Method 10, Rev. No. 0, May 1996, Page 1 of 8


ESTIMATION METHOD 10:  Estimation of Variance of the Cumulative Distribution
Function for the Proportion of an Extensive Resource;  Horvitz-Thompson Variance Estimator

1  Scope and Application

This method calculates the estimated variance of the estimated cumulative distribution
function (CDF) for the proportion of an extensive resource that has an indicator value equal
to or less than a given indicator level. There are two variance estimators presented in this
method.  An estimate can be produced for the entire population or for an arbitrary
subpopulation with known or unknown size.  This size is the  extent of the resource in the
subpopulation. The method applies to any probability sample and the variance estimate will
be produced at the supplied indicator levels of interest. This method  does not include
estimators for the  CDF.  For information on CDF estimators,  refer to Section 7.

2  Statistical Estimation Overview

A sample of size n_a units is selected from subpopulation a with known inclusion densities
π = {π_1, ..., π_i, ..., π_{n_a}} and joint inclusion densities given by π_ij, where i ≠ j.  The indicator is
evaluated for each unit and represented by y = {y_1, ..., y_i, ..., y_{n_a}}.  The inclusion densities are
design dependent and should be furnished with the design points.  See Section 9 for further
discussion.

The Horvitz-Thompson variance estimator of the CDF, V̂[F̂_a(x_k)], is calculated for each
value of the indicator levels of interest, x_k.  There are two Horvitz-Thompson variance
estimators presented in this method.  The first is a variance estimator of the Horvitz-
Thompson estimator of a proportion.  The second is a variance estimator of a Horvitz-
Thompson ratio estimator.  The former estimator calculates the variance of the Horvitz-
Thompson estimator of a total and divides this variance by the known subpopulation size
squared, N_a².  The latter estimator requires as input the CDF estimates produced using the
Horvitz-Thompson ratio estimator of the CDF for proportion.

The output consists of the estimated variance values.

3  Conditions  Under Which This Method Applies

•       Probability sample with known inclusion densities and joint inclusion densities
•       Extensive resource
•       Arbitrary subpopulation
•       All units sampled from the subpopulation must be accounted for before applying this
        method


 4  Required Elements

 4.1  Input Data

 y_i   = value of the indicator for the ith unit sampled from subpopulation a.
 π_i   = inclusion density evaluated at the location of the ith sample point in subpopulation a.
 π_ij  = joint inclusion density evaluated at the locations of the ith and jth sample points in
        subpopulation a.
 F̂_a(x_k) = estimated CDF (proportion) for indicator value x_k in subpopulation a.

 4.2  Additional Components

 n_a  = number of units sampled from subpopulation a.
 x_k  = kth indicator level of interest.
 N_a  = subpopulation size, if known.

 5  Formulas and Definitions

 The estimated variance of the estimated CDF (proportion) for indicator value x_k in
 subpopulation a, V̂[F̂_a(x_k)], with known subpopulation size, N_a (Horvitz-Thompson
 variance estimator of the Horvitz-Thompson estimator of a CDF), is

      V̂[F̂_a(x_k)] = (1/N_a²) { Σ_{i=1}^{n_a} I(y_i ≤ x_k)/π_i²
                      + Σ_{i=1}^{n_a} Σ_{j≠i} [(π_ij − π_i π_j)/(π_i π_j π_ij)] I(y_i ≤ x_k) I(y_j ≤ x_k) }.

 The estimated variance of the estimated CDF (proportion) for indicator value x_k in
 subpopulation a, V̂[F̂_a(x_k)], with estimated subpopulation size, N̂_a (Horvitz-Thompson
 variance estimator of the Horvitz-Thompson ratio estimator of a CDF), is

      V̂[F̂_a(x_k)] = (1/N̂_a²) { Σ_{i=1}^{n_a} d_i²/π_i²
                      + Σ_{i=1}^{n_a} Σ_{j≠i} [(π_ij − π_i π_j)/(π_i π_j π_ij)] d_i d_j },

      where d_i = I(y_i ≤ x_k) − F̂_a(x_k)  and  N̂_a = Σ_{i=1}^{n_a} 1/π_i.

For these equations:
F̂_a(x_k) = estimated CDF (proportion) for indicator value x_k in subpopulation a.
I(y_i ≤ x_k) = 1 if y_i ≤ x_k; 0 otherwise.
x_k  = kth indicator level of interest.
y_i  = value of the indicator for the ith unit sampled from subpopulation a.
π_i  = inclusion density evaluated at the location of the ith sample point in subpopulation a.
π_ij = joint inclusion density evaluated at the locations of the ith and jth sample points in
       subpopulation a.
n_a  = number of units sampled from subpopulation a.

6  Procedure
6.1    Enter Data

Input the sample data consisting of the indicator values, y_i, and their associated inclusion
densities, π_i.  For example,

      Calcium     Inclusion
        y_i       Density π_i
      -------     -----------
      1.5992        .07734
      2.3707        .00375
      1.5992        .75000
      2.0000        .75000
      7.0000        .00375
      2.8196        .02227
      1.2204        .01406
      1.5992        .03750
      2.9399        .00586
       .7395        .00375

6.2    Sort Data

Sort the sample data in nondecreasing order based on the y_i indicator values.  Keep all
occurrences of an indicator value to obtain correct results.

      Calcium     Inclusion
        y_i       Density π_i
      -------     -----------
       .7395        .00375
      1.2204        .01406
      1.5992        .07734
      1.5992        .75000
      1.5992        .03750
      2.0000        .75000
      2.3707        .00375
      2.8196        .02227
      2.9399        .00586
      7.0000        .00375
6.3    Compute or Input Joint Inclusion Densities
The required joint inclusion densities are displayed in the following table.  For this example,
they were computed by the formula π_ij = (n_a − 1) π_i π_j / n_a.

              Joint Inclusion Density  π_ij = π_ji ,  π_ii = π_i

   i \ j       1         2         3         4         5         6         7         8         9
     2     .000047
     3     .000261   .000979
     4     .002531   .009491   .052205
     5     .000127   .000475   .002610   .025313
     6     .002531   .009491   .052205   .506250   .025313
     7     .000013   .000047   .000261   .002531   .000127   .002531
     8     .000075   .000282   .001550   .015032   .000752   .015032   .000075
     9     .000020   .000074   .000408   .003955   .000198   .003955   .000020   .000117
    10     .000013   .000047   .000261   .002531   .000127   .002531   .000013   .000075   .000020
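The simple random sampling approximation used above can be computed directly; the short
Python sketch below (NumPy assumed, names illustrative) reproduces, for example, the (1, 2)
entry of the table.

    import numpy as np

    def joint_inclusion_srs(pi):
        # pi_ij = (n - 1) * pi_i * pi_j / n, with the diagonal set to pi_ii = pi_i
        pi = np.asarray(pi, dtype=float)
        n = pi.size
        pij = (n - 1) / n * np.outer(pi, pi)
        np.fill_diagonal(pij, pi)
        return pij

    pi = [.00375, .01406, .07734, .75000, .03750, .75000, .00375, .02227, .00586, .00375]
    print(round(joint_inclusion_srs(pi)[0, 1], 6))   # 0.000047, as in the table above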
6.4    Obtain Subpopulation Size

Input N_a if using a known subpopulation size.  N_a = 1130 for this data set.

Calculate N̂_a from the sample data only if using the variance of the Horvitz-Thompson ratio
estimator of a CDF.  Sum the reciprocals of the inclusion densities, π_i, for all units in the
sample a to obtain N̂_a.

N̂_a = (1/.00375) + (1/.01406) + (1/.07734) + . . . + (1/.00375) = 1128.939 for this data set.


6.5    Input Indicator Levels of Interest and Estimated CDF Values
For this example data, the variance of the empirical CDF is of interest; xk values = (.7395,
1.2204, 1.5992, 2, 2.3707, 2.8196, 2.9399, 7).

Input F̂_a(x_k) for each x_k if the Horvitz-Thompson ratio estimator was used to estimate the
CDF.
      Calcium     CDF for Proportion,
        x_k       Ratio Estimator F̂_a(x_k)
      -------     ------------------------
       .7395            .2362
      1.2204            .2992
      1.5992            .3355
      2.0000            .3366
      2.3707            .5729
      2.8196            .6126
      2.9399            .7638
      7.0000           1
6.6    Compute Estimated Variance Values

Calculate V̂[F̂_a(x_k)] for each x_k using the formulas from Section 5.

Compare each y_i to x_k.  Set I(y_i ≤ x_k) = 1 if y_i ≤ x_k; otherwise set I(y_i ≤ x_k) = 0.
Results for the example data are given in the following table.
      Calcium     Estimated Variance of CDF      Estimated Variance of CDF
        x_k       for Proportion,                for Proportion,
                  Ratio Estimator V̂[F̂_a(x_k)]    N_a = 1130  V̂[F̂_a(x_k)]
      -------     ---------------------------    -------------------------
       .7395            .044888                        .055690
      1.2204            .046211                        .056351
      1.5992            .046672                        .054565
      2.0000            .046687                        .054479
      2.3707            .052820                        .092531
      2.8196            .052442                        .089057
      2.9399            .044888                        .091322
      7.0000           0                               .106996
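As an illustration of §§ 6.4–6.6, the Python sketch below computes the ratio-estimator variance
at a single indicator level, assuming the continuous-population Horvitz-Thompson variance form
given in Section 5 (after Cordy 1993).  It is a sketch with illustrative names, not the
reference implementation; results should be verified against the validation data set of
Section 8.

    import numpy as np

    def ht_ratio_cdf_variance(y, pi, pij, xk):
        # Residuals d_i = I(y_i <= xk) - Fhat combined with the Horvitz-Thompson
        # variance form using inclusion densities pi_i and joint densities pi_ij.
        y, pi, pij = np.asarray(y, float), np.asarray(pi, float), np.asarray(pij, float)
        n_hat = np.sum(1.0 / pi)                       # estimated subpopulation size
        ind = (y <= xk).astype(float)
        f_hat = np.sum(ind / pi) / n_hat               # ratio estimator of the CDF
        d = ind - f_hat
        v = np.sum(d**2 / pi**2)                       # single-sum term
        for i in range(y.size):                        # double-sum term over i != j
            for j in range(y.size):
                if i != j:
                    v += d[i] * d[j] * (pij[i, j] - pi[i] * pi[j]) / (pi[i] * pi[j] * pij[i, j])
        return v / n_hat**2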
7  Associated Methods

An appropriate estimator for the estimated CDF for extensive resources may be found in
Method 1  (Horvitz-Thompson Estimator).

8  Validation Data

Actual data with results, EMAP Design and Statistics Dataset #10, are available for
comparing results from other versions of these algorithms.

9  Notes

Inclusion densities, π_i, and joint inclusion densities, π_ij, are determined by the design and
should be furnished with the design points.  In some instances, the joint inclusion densities
may be calculated from a formula that uses the location of the design points or they may be
approximated by a formula that assumes simple random sampling.  This simple random
sampling formula, π_ij = (n_a − 1) π_i π_j / n_a, is used in Section 6.3.



10  References

Cordy, C. B.  1993. An extension of the Horvitz-Thompson theorem to point sampling from a
 continuous universe. Statistics & Probability Letters 18:353-362.

Lesser, V. M., and W. S. Overton.  1994.  EMAP status estimation: Statistical procedures
 and algorithms.  EPA/620/R-94/008.  Washington, DC: U.S. Environmental Protection
 Agency.

Sarndal, C. E., B. Swensson, and J. Wretman.  1992.  Model assisted survey sampling. New
 York:  Springer-Verlag.

-------
 ESTIMATION METHOD 11:  The Simulation-Extrapolation Method;
 Estimation of a Population Cumulative Distribution

 1.     Scope and Application
 This report describes an estimation procedure called Simulation-Extrapolation [2] used
 to  estimate a population cumulative distribution when sample units are measured with
 error.  Estimates obtained when the measurement error is ignored are biased and may
 be misleading.  The Simulation-Extrapolation (SIMEX) method reduces the bias
 induced by measurement error by establishing a relationship between measurement
 error-induced bias  and  the variance of the error.  Extrapolating this relationship back to
 the case of no measurement error, an estimator  with smaller bias is produced. The
 method assumes that the variance of the measurement error in the observed sample is
 known or at least well estimated.
       A variance estimator of the SIMEX estimator is also described.

 2.     Summary of Method
 Let U = {U_1, U_2, ..., U_n} be the true (unobserved) data subject to measurement error
 and X = {X_1, X_2, ..., X_n} denote the observed data, where X_i is a measure of U_i.  A
 functional measurement error model with additive independent normal error is assumed.
 That is, X_i = U_i + σZ_i, for i = 1, ..., n, where {Z_i}_{i=1}^{n} are mutually independent,
 independent of random sampling, and identically distributed standard normal random
 variables.  Hence, the measurement errors in the observed sample have mean zero and
 variance σ².
       For a fixed λ > 0 and for b = 1, ..., B, pseudo-data sets X*_{b,i}(λ) = X_i + σ√λ Z*_{b,i},
 i = 1, ..., n, are generated, where the {Z*_{b,i}} are mutually independent, independent of
 the data {X_i}_{i=1}^{n}, and identically distributed standard normal pseudo-random variables.
 For λ fixed, the measurement error variance of the additional errors {σ√λ Z*_{b,i}}_{i=1}^{n}
 is σ²λ.  Therefore, the total measurement error in X*_{b,i}(λ), for 1 ≤ i ≤ n and 1 ≤ b ≤ B,
 has variance σ²(1 + λ).  The estimates F_{X,λ,b}(t) = T({X*_{b,i}(λ)}_{i=1}^{n}) are then calculated
 for b = 1, ..., B.  The average of these estimates is used to estimate the expectation of
 F_{X,λ,b}(t) with respect to the distribution of the pseudo-random variates {Z*_{b,i}}.  This
 is the simulation step of the SIMEX method.
       Next, the expectation F_{X,λ}(t) is computed for each value in a sequence
 0 < λ_1 < ··· < λ_m, and the relationship between λ and the expectation is extrapolated
 back to λ = −1, the case of no measurement error, to obtain the SIMEX estimator.
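The simulation and extrapolation steps can be sketched in a few lines of Python.  The sketch
below assumes the CD estimator T(·) is the simple empirical proportion of pseudo-data values
less than or equal to t and fits a quadratic in λ, extrapolating to λ = −1; it ignores survey
weights and the variance estimation, and all names are illustrative.

    import numpy as np

    def simex_cdf(x, t, sigma2, lambdas=(0.05, 0.5, 1.0, 1.5, 2.0), B=200, seed=None):
        # SIMEX estimate of P(U <= t) from data x observed with N(0, sigma2) error.
        rng = np.random.default_rng(seed)
        x = np.asarray(x, dtype=float)
        lam = np.asarray(lambdas, dtype=float)
        f_lambda = []
        for l in lam:                                  # simulation step
            z = rng.standard_normal((B, x.size))
            x_star = x + np.sqrt(sigma2 * l) * z       # pseudo-data, added error variance sigma2 * l
            f_lambda.append(np.mean(x_star <= t))      # average of the B CD estimates at t
        D = np.column_stack([np.ones_like(lam), lam, lam**2])
        beta, *_ = np.linalg.lstsq(D, np.array(f_lambda), rcond=None)
        return beta @ np.array([1.0, -1.0, 1.0])       # extrapolation to lambda = -1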
-------
5.     Definitions and Formulas
 Let t denote a fixed argument in the following definitions and formulas.  Define

      • F_U(t) = estimator of the population cumulative distribution (CD) in the
        absence of measurement error.

      • F_{X,λ,b}(t) = CD estimator based on the data {X_i + σ√λ Z*_{b,i}}_{i=1}^{n}, where
        {X_i}_{i=1}^{n} is the observed sample, {Z*_{b,i}}_{i=1}^{n} are standard normal pseudo-random
        variables, σ² is the measurement error variance, and λ > 0 is a constant.

      • τ̂²_b(λ) = estimator of the variance of F_{X,λ,b}(t).

      • F_{X,λ}(t) = estimator of the expectation of F_{X,λ,b}(t) with respect to the
        distribution of the pseudo-random errors {Z*_{b,i}}_{i=1}^{n}.

      • τ²(λ) = estimator of the expectation of τ̂²_b(λ) with respect to the distribution of
        {Z*_{b,i}}_{i=1}^{n}.

      • ŝ²(λ) = estimator of Var{F_{X,λ,b}(t) − F_{X,λ}(t)}.

      • F_{X,ε,λ,b}(t) = CD estimator based on the data {X_i + √((σ²+ε)λ) Z*_{b,i}}_{i=1}^{n}, where
        {X_i}_{i=1}^{n} is the observed sample, {Z*_{b,i}}_{i=1}^{n} are standard normal pseudo-random
        variables, σ² is the measurement error variance, and ε > 0 (ε ≈ 0) and λ > 0 are
        constants.

      • F_{X,ε,λ}(t) = estimator of the expectation of F_{X,ε,λ,b}(t) with respect to the
        distribution of {Z*_{b,i}}_{i=1}^{n} only.

      • Ḟ_{X,λ}(t) = estimator of the derivative of F_{X,λ}(t) with respect to the measurement
        error variance σ².

      • F_SIMEX(t) = SIMEX estimator.

      • Var{F_SIMEX(t)} = variance estimator of the SIMEX estimator.

      • L(t) = lower 100(1 − α)% confidence limit for F_SIMEX(t).

      • U(t) = upper 100(1 − α)% confidence limit for F_SIMEX(t).

The formulas for the above definitions are as follows.

      F_{X,λ}(t) = (1/B) Σ_{b=1}^{B} F_{X,λ,b}(t).

If σ² is known, Var{F_SIMEX(t)} consists of the extrapolated sampling-variance component
alone.  If σ² is estimated, Var{F_SIMEX(t)} also includes a component for the variation due
to estimating σ², involving Var(σ̂²) and Ḟ_{X,λ}(t) (see § 6.3.3).

      L(t) = F_SIMEX(t) − z_{1−α/2} √(Var{F_SIMEX(t)}),

      U(t) = F_SIMEX(t) + z_{1−α/2} √(Var{F_SIMEX(t)}),

where

                     U = {U_i}_{i=1}^{n} = (true) unobserved data values,
                     X = {X_i}_{i=1}^{n} = sample observed with error,
      {{Z*_{b,i}}_{i=1}^{n}}_{b=1}^{B} = independent and identically distributed standard normal
                                 pseudo-random variables,
                        σ² = variance of measurement error,
                  Var(σ̂²) = estimated variance of σ̂² when σ² is estimated,
          z_{1−α/2} = 100(1 − α/2)th percentile in the standard normal distribution.
6.     Procedure
6.1.   Generate a sequence of k grid points, t_1 < t_2 < ··· < t_k, spanning the range of
      the observed data.
      E.g., suppose min{X_1, ..., X_n} = 0 and max{X_1, ..., X_n} = 25.  We could let
      k = 51 and define t_1 = 0, t_2 = 0.5, t_3 = 1.0, ..., t_50 = 24.5, and t_51 = 25.0.

6.2.   Generate a sequence of m values 0 < λ_1 < λ_2 < ··· < λ_m.
      See § 8.1 for more information.

6.3.   For each grid point t_h, h = 1, ..., k,

6.3.1.     For each λ_j, j = 1, ..., m,

6.3.1.1.        For b = 1, ..., B,

6.3.1.1.1.           Generate n standard normal pseudo-random variates {Z*_{b,1}, ..., Z*_{b,n}}.

6.3.1.1.2.           Calculate the pseudo-data set {X*_{b,i}(λ_j)}_{i=1}^{n},

                                X*_{b,i}(λ_j) = X_i + σ√λ_j Z*_{b,i},  for i = 1, ..., n.

6.3.1.1.3.           Calculate F_{X,λ_j,b}(t_h).

6.3.1.1.4.           Calculate τ̂²_b(λ_j), the variance estimate of F_{X,λ_j,b}(t_h).

6.3.1.1.5.           If Var{σ̂²} > 0,

6.3.1.1.5.1.               Calculate the data set {X*_{b,i}(ε, λ_j)}_{i=1}^{n},

                                X*_{b,i}(ε, λ_j) = X_i + √((σ̂² + ε)λ_j) Z*_{b,i},  for i = 1, ..., n.

6.3.1.1.5.2.               Calculate F_{X,ε,λ_j,b}(t_h).

6.3.1.2.        Calculate F_{X,λ_j}(t_h),

                                F_{X,λ_j}(t_h) = (1/B) Σ_{b=1}^{B} F_{X,λ_j,b}(t_h).

6.3.1.3.        Calculate τ²(λ_j),

                                τ²(λ_j) = (1/B) Σ_{b=1}^{B} τ̂²_b(λ_j).

6.3.1.4.        Calculate ŝ²(λ_j), the sample variance of {F_{X,λ_j,b}(t_h)}_{b=1}^{B} about F_{X,λ_j}(t_h).

6.3.1.5.        If Var{σ̂²} > 0,

6.3.1.5.1.           Calculate F_{X,ε,λ_j}(t_h),

                                F_{X,ε,λ_j}(t_h) = (1/B) Σ_{b=1}^{B} F_{X,ε,λ_j,b}(t_h).

6.3.1.5.2.           Calculate Ḟ_{X,λ_j}(t_h), the estimated derivative of F_{X,λ_j}(t_h) with
                     respect to σ².

6.3.2.     Calculate F_SIMEX(t_h) by extrapolating {F_{X,λ_j}(t_h)}_{j=1}^{m} to λ = −1.

6.3.3.     Calculate Var{F_SIMEX(t_h)},

                Var{F_SIMEX(t_h)} = Var_τ + Var_σ,

            where Var_τ is the sampling-variance component obtained by extrapolating τ²(λ_j)
            and ŝ²(λ_j), defined above, to λ = −1, and Var_σ is the additional component,
            involving Var{σ̂²} and Ḟ_{X,λ}(t_h), that is included when σ² is estimated
            (Var_σ = 0 when σ² is known).

6.3.4.     Calculate approximate 100(1 − α)% confidence limits, L(t_h) and U(t_h),

                L(t_h) = F_SIMEX(t_h) − z_{1−α/2} √(Var{F_SIMEX(t_h)}),
                U(t_h) = F_SIMEX(t_h) + z_{1−α/2} √(Var{F_SIMEX(t_h)}),

           where z_{1−α/2} is the 100(1 − α/2)th percentile in the standard normal
           distribution.
6.4.   Apply isotonic regression to {F_SIMEX(t_1), ..., F_SIMEX(t_k)} on {t_1, ..., t_k}.

       If1 {F_SIMEX(t_h)}_{h=1}^{k} is NOT non-decreasing
          Let i = 1 and j = 2
          While2 j ≤ k
             While3 F_SIMEX(t_i) > F_SIMEX(t_j)
                 Let j = j + 1
             End of While3
             For h = i, ..., j−1,
                 Replace F_SIMEX(t_h) by the average of F_SIMEX(t_i), ..., F_SIMEX(t_{j−1})
             End of For
             Let i = j and j = j + 1
          End of While2
       End of If1

6.5.  Restrict range of {F_SIMEX(t_h)}_{h=1}^{k} to [0, 1].

      Set h = 1
      While F_SIMEX(t_h) < 0, set F_SIMEX(t_h) = 0 and h = h + 1
      End of While
      Set h = k
      While F_SIMEX(t_h) > 1, set F_SIMEX(t_h) = 1 and h = h − 1
      End of While
      (Note, isotonic regression simply ensures that F_SIMEX is a non-decreasing function
      on [t_1, t_k]; see [1].)
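Steps 6.4 and 6.5 amount to a pool-adjacent-violators pass followed by truncation to [0, 1].
The Python sketch below is one common, equal-weight implementation offered only as an
illustration of the pseudo-code above; the names are illustrative.

    def isotonic_nondecreasing(values):
        # Pool adjacent violators: return the non-decreasing sequence closest to `values`.
        merged = []
        for v in values:
            merged.append([float(v), 1])               # [block mean, block size]
            while len(merged) > 1 and merged[-2][0] > merged[-1][0]:
                m2, s2 = merged.pop()
                m1, s1 = merged.pop()
                merged.append([(m1 * s1 + m2 * s2) / (s1 + s2), s1 + s2])
        out = []
        for mean, size in merged:
            out.extend([mean] * size)
        return out

    def restrict_to_unit_interval(values):
        # Truncate the estimated CD to [0, 1] as in step 6.5.
        return [min(1.0, max(0.0, v)) for v in values]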
-------
 6.6.   Apply isotonic regression to {L(t_1), ..., L(t_k)} on {t_1, ..., t_k} and restrict range of
        {L(t_h)}_{h=1}^{k} to [0, 1].
        See § 6.4. and § 6.5. above.

 6.7.   Apply isotonic regression to {U(t_1), ..., U(t_k)} on {t_1, ..., t_k} and restrict range of
        {U(t_h)}_{h=1}^{k} to [0, 1].
        See § 6.4. and § 6.5. above.
  7.    Associated Methods
 A similar procedure for estimating the cumulative distribution of a finite population in
 the presence of measurement error is described in Estimation Method 1: The Parametric
 Jackknife Estimator.  That method assumes a particular sampling model which allows
 the expectation of sample cumulative distributions to be obtained analytically,
 rather than by simulation as in the SIMEX method.

 8.     Notes
 8.1.   The procedure outlined in § 6 requires specification of 0 < λ_1 < ··· < λ_m.  Cook
 and Stefanski [2] propose taking equally spaced values over the interval [0.05, 2.00].
 They also suggest that m > 5, although the exact number of values is not critical.

 8.2.   The algorithm in § 6 is designed for calculating estimates of the cumulative
 proportion.  A slight variation of this algorithm would allow for estimating the
 cumulative total.  In this case we assume that F_U(t) = T(U) is an unbiased estimator of
 the cumulative total.  The algorithm is modified by changing the upper bound of the
 SIMEX estimate and the confidence limits from one to the population size, if the
 population is finite, or ∞ if the population is infinite.  This modification is required in
 § 6.5 through § 6.7.

 9.     References
 [1]    Barlow, R. E., Bartholomew, D. J., Bremner, J. M., and Brunk, H. D. (1972),
 Statistical Inference under Order Restrictions, New York: John Wiley & Sons.
 [2]    Cook, J. R. and Stefanski, L. A. (1994), "Simulation-Extrapolation Estimation in
 Parametric Measurement Error Models," Journal of the American Statistical
 Association, 89, 1314-1328.

-------
             EMAP Estimation Method 12, Rev. No. 0, May 1996, Page 1 of 8


ESTIMATION METHOD 12:  Estimation of Variance of the Cumulative Distribution
Function for the Proportion of a Discrete or an Extensive Resource; Yates-Grundy Variance
Estimator

1  Scope and Application

This method calculates the estimated variance of the estimated cumulative distribution
function (CDF) for the proportion of a discrete or an extensive resource that has an indicator
value equal to or less than a given indicator level.  There are two variance estimators
presented in this method.  An estimate can be produced for the population with known or
unknown size.  In the discrete case, this size is the number of units in the population.  In the
extensive case, this size is the population extent.  The method applies to any probability
sample with fixed sample  size and the variance estimate will be produced at the supplied
indicator levels of interest. This method does not include estimators for the CDF. For
information on  CDF  estimators, refer to Section 7.

2  Statistical Estimation Overview

A sample of size n units is selected from population a with known inclusion probabilities
π = {π_1, ..., π_i, ..., π_n} and joint inclusion probabilities given by π_ij, where i ≠ j.  The indicator is
evaluated for each unit and represented by y = {y_1, ..., y_i, ..., y_n}.  When sampling an
extensive resource, the inclusion probabilities are replaced by the inclusion density function
evaluated at the sample locations.  The inclusion probabilities are design dependent and
should be furnished with the design points.  See Section 9 for further discussion.

The Yates-Grundy variance estimator of the CDF, V̂[F̂_a(x_k)], is calculated for each value
of the indicator levels of interest, x_k.  There are two Yates-Grundy variance estimators
presented in this method.  The first is a variance estimator of the Horvitz-Thompson estimator
of a proportion.  The second is a variance estimator of a Horvitz-Thompson ratio estimator.
The former estimator calculates the variance of the Horvitz-Thompson estimator of a total and
divides this variance by the known population size squared, N².  The latter estimator requires
as input the CDF estimates produced using the Horvitz-Thompson ratio estimator of the CDF
for proportion.

The output consists of the  estimated variance values.

3  Conditions Under Which This Method  Applies

•       Probability sample  with known inclusion probabilities (or densities) and joint inclusion
       probabilities (or densities)
•       Discrete or Extensive resource
•       Arbitrary population
•       All  units sampled from the population must be accounted for before applying this
       method

4  Required Elements

4.1  Input Data

y_i  = value of the indicator for the ith unit sampled from population a.
π_i  = For discrete resources, the inclusion probability for selecting the ith unit of population
       a.  For extensive resources, the inclusion density evaluated at the location of the ith
       sample point in population a.
π_ij = For discrete resources, the inclusion probability for selecting both the ith and jth units
       of population a.  For extensive resources, the inclusion density evaluated at the
       locations of the ith and jth sample points in population a.
F̂_a(x_k) = estimated CDF (proportion) for indicator value x_k in population a.


4.2  Additional Components

n    = number of units sampled from population a.
x_k  = kth indicator level of interest.
N    = population size, if known.

5  Formulas and Definitions

 The estimated variance of the estimated CDF (proportion) for indicator value x_k in population
 a, V̂[F̂_a(x_k)], with known population size, N (Yates-Grundy variance estimator of the
 Horvitz-Thompson estimator of a CDF), is

      V̂[F̂_a(x_k)] = (1/N²) Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} [(π_i π_j − π_ij)/π_ij] [I(y_i ≤ x_k)/π_i − I(y_j ≤ x_k)/π_j]².

 The estimated variance of the estimated CDF (proportion) for indicator value x_k in population
 a, V̂[F̂_a(x_k)], with estimated population size, N̂ (Yates-Grundy variance estimator of the
 Horvitz-Thompson ratio estimator of a CDF), is

      V̂[F̂_a(x_k)] = (1/N̂²) Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} [(π_i π_j − π_ij)/π_ij] [d_i/π_i − d_j/π_j]²,

      where d_i = I(y_i ≤ x_k) − F̂_a(x_k)  and  N̂ = Σ_{i=1}^{n} 1/π_i.
For these equations:
F̂_a(x_k) = estimated CDF (proportion) for indicator value x_k in population a.
I(y_i ≤ x_k) = 1 if y_i ≤ x_k; 0 otherwise.
x_k  = kth indicator level of interest.
y_i  = value of the indicator for the ith unit sampled from population a.
π_i  = For discrete resources, the inclusion probability for selecting the ith unit of population
       a.  For extensive resources, the inclusion density evaluated at the location of the ith
       sample point in population a.
π_ij = For discrete resources, the inclusion probability for selecting both the ith and jth units
       of population a.  For extensive resources, the inclusion density evaluated at the
       locations of the ith and jth sample points in population a.
n    = number of units sampled from population a.
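A minimal Python sketch of the Yates-Grundy form, assuming the fixed-sample-size expression
reconstructed in Section 5 (z_i = I(y_i ≤ x_k) with known N, or z_i = I(y_i ≤ x_k) − F̂_a(x_k) with
N replaced by N̂ for the ratio estimator); all names are illustrative only.

    import numpy as np

    def yates_grundy_variance(z, pi, pij):
        # Sum over i < j of (pi_i*pi_j - pi_ij)/pi_ij * (z_i/pi_i - z_j/pi_j)^2
        z, pi, pij = np.asarray(z, float), np.asarray(pi, float), np.asarray(pij, float)
        v = 0.0
        for i in range(z.size):
            for j in range(i + 1, z.size):
                v += (pi[i] * pi[j] - pij[i, j]) / pij[i, j] * (z[i] / pi[i] - z[j] / pi[j]) ** 2
        return v

    def cdf_variance_known_N(y, pi, pij, xk, N):
        # Variance of the Horvitz-Thompson CDF estimator with known population size N.
        z = (np.asarray(y, float) <= xk).astype(float)
        return yates_grundy_variance(z, pi, pij) / N**2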
6  Procedure

6.1    Enter Data

Input the sample data consisting of the indicator values, y_i, and their associated inclusion
probabilities (or densities), π_i.  For example,

      Calcium     Inclusion
        y_i       Probability π_i
      -------     ---------------
      1.5992        .07734
      2.3707        .00375
      1.5992        .75000
      2.0000        .75000
      7.0000        .00375
      2.8196        .02227
      1.2204        .01406
      1.5992        .03750
      2.9399        .00586
       .7395        .00375

6.2    Sort Data

Sort the sample data in nondecreasing order based on the y_i indicator values.  Keep all
occurrences of an indicator value to obtain correct results.

      Calcium     Inclusion
        y_i       Probability π_i
      -------     ---------------
       .7395        .00375
      1.2204        .01406
      1.5992        .07734
      1.5992        .75000
      1.5992        .03750
      2.0000        .75000
      2.3707        .00375
      2.8196        .02227
      2.9399        .00586
      7.0000        .00375
6.3    Compute or Input Joint Inclusion Probabilities (or Densities)

The required joint inclusion probabilities are displayed in the following table.  For this
example, they were computed by the formula π_ij = {2(n−1)π_i π_j} / {2n − π_i − π_j}.

              Joint Inclusion Probability  π_ij = π_ji ,  π_ii = π_i

   i \ j       1         2         3         4         5         6         7         8         9
     2     .000047
     3     .000262   .000983
     4     .002630   .009867   .054457
     5     .000127   .000476   .002625   .026350
     6     .002630   .009867   .054457   .547297   .026350
     7     .000013   .000047   .000262   .002630   .000127   .002630
     8     .000075   .000282   .001558   .015636   .000754   .015636   .000075
     9     .000020   .000074   .000410   .004111   .000198   .004111   .000020   .000118
    10     .000013   .000047   .000262   .002630   .000127   .002630   .000013   .000075   .000020
6.4    Obtain Population Size

Input N if using a known population size.  N = 1130 for this data set.

Calculate N̂ from the sample data only if using the variance of the Horvitz-Thompson ratio
estimator of a CDF.  Sum the reciprocals of the inclusion probabilities (or densities), π_i, for
all units in the sample a to obtain N̂.

N̂ = (1/.00375) + (1/.01406) + (1/.07734) + . . . + (1/.00375) = 1128.939 for this data set.



6.5    Input Indicator Levels of Interest and Estimated CDF Values
For this example data, the variance of the empirical CDF is of interest; xk values = (.7395,

1.2204, 1.5992, 2,  2.3707, 2.8196, 2.9399, 7).

Input F̂_a(x_k) for each x_k if the Horvitz-Thompson ratio estimator was used to estimate the
CDF.
      Calcium     CDF for Proportion,
        x_k       Ratio Estimator F̂_a(x_k)
      -------     ------------------------
       .7395            .2362
      1.2204            .2992
      1.5992            .3355
      2.0000            .3366
      2.3707            .5729
      2.8196            .6126
      2.9399            .7638
      7.0000           1
6.6    Compute Estimated Variance Values

Calculate V̂[F̂_a(x_k)] for each x_k using the formulas from Section 5.

Compare each y_i to x_k.  Set I(y_i ≤ x_k) = 1 if y_i ≤ x_k; otherwise set I(y_i ≤ x_k) = 0.
Results for the example data are given in the following table.
      Calcium     Estimated Variance of CDF      Estimated Variance of CDF
        x_k       for Proportion,                for Proportion,
                  Ratio Estimator V̂[F̂_a(x_k)]    N = 1130  V̂[F̂_a(x_k)]
      -------     ---------------------------    -------------------------
       .7395            .044710                        .055482
      1.2204            .046005                        .056116
      1.5992            .046453                        .054400
      2.0000            .046467                        .054346
      2.3707            .052579                        .092363
      2.8196            .052209                        .088936
      2.9399            .044710                        .091247
      7.0000           0                               .106996
7  Associated Methods

An appropriate estimator for the estimated CDF for discrete or extensive resources may be
found in Method 1 (Horvitz-Thompson Estimator).

8  Validation Data

Actual data with results, EMAP Design and Statistics Dataset #12, are available for
comparing results from other versions of these algorithms.

9  Notes

Inclusion probabilities (or densities), π_i, and joint inclusion probabilities (or densities), π_ij,
are determined by the design and should be furnished with the design points.  In some
instances, the joint inclusion probabilities may be calculated from a formula such as
Overton's approximation, π_ij = {2(n−1)π_i π_j} / {2n − π_i − π_j}, which is used in Section
6.3.  In some instances, the joint inclusion densities may be calculated from a formula that
uses the location of the design points, or they may be approximated by the formula
π_ij = (n−1)π_i π_j / n that assumes simple random sampling.
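Both approximations mentioned in this note can be written directly; the Python sketch below
(NumPy assumed, names illustrative) computes a full matrix of approximate joint inclusion
probabilities from the first-order probabilities.

    import numpy as np

    def joint_inclusion_overton(pi):
        # Overton's approximation: pi_ij = 2*(n-1)*pi_i*pi_j / (2*n - pi_i - pi_j)
        pi = np.asarray(pi, dtype=float)
        n = pi.size
        pij = 2 * (n - 1) * np.outer(pi, pi) / (2 * n - pi[:, None] - pi[None, :])
        np.fill_diagonal(pij, pi)
        return pij

    def joint_inclusion_srs(pi):
        # Simple random sampling approximation: pi_ij = (n-1)*pi_i*pi_j / n
        pi = np.asarray(pi, dtype=float)
        n = pi.size
        pij = (n - 1) / n * np.outer(pi, pi)
        np.fill_diagonal(pij, pi)
        return pij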



10  References

Cochran, W. G.   1977.  Sampling techniques. 3rd Edition. New York: John Wiley & Sons.

Cordy, C. B.  1993. An extension of the Horvitz-Thompson theorem to point sampling from a
 continuous universe. Statistics & Probability Letters 18:353-362.

Lesser, V. M., and W.  S. Overton.  1994.  EMAP status estimation: Statistical procedures
 and algorithms. EPA/620/R-94/008.  Washington, DC:  U.S. Environmental Protection
 Agency.

Overton, W. S.,  D. White, and D. L. Stevens Jr. 1990.  Design report for EMAP,
 Environmental Monitoring and Assessment Program.  EPA 600/3-91/053.  Corvallis, OR:
 U.S. Environmental Protection  Agency, Environmental Research Laboratory.

Sarndal, C. E., B. Swensson, and J. Wretman.  1992.  Model assisted survey sampling. New
 York:  Springer-Verlag.

Stevens, Jr., D. L. 1995.  A family of designs for sampling continuous spatial populations.
 Environmetrics.  Submitted.

-------
ESTIMATION METHOD 13:  Simplified Variance of the Cumulative Distribution Function
for Proportion (Discrete or Extensive) and for Total Number of a Discrete Resource, and
Variance of the Size-Weighted Cumulative Distribution Function for Proportion and Total of
a Discrete Resource;  Simple Random Sample Variance Estimator

1  Scope and Application

This method calculates the estimated variance of the estimated cumulative distribution
function (CDF) for the proportion and total number of a discrete (or in  the case of proportion,
extensive) resource that has an indicator value equal to or less than a given indicator level.
The method also calculates the estimated size-weighted versions of these CDFs for a discrete
resource.  All of these CDFs are produced using Horvitz-Thompson estimators found in other
methods.  An estimate can be produced for the entire population or for a geographic
subpopulation with unknown size.  This size is the total number of units or extent in the
subpopulation.

The estimation  algorithms have been simplified for use in spreadsheet software such as Lotus
1-2-3 and Quattro  Pro; however, because of this simplification, the use  of these variability
estimates is restricted. This method provides a mechanism for generating quick summaries of
indicators to assist in internal research and is distributed with  the restriction that results for
inclusion in peer-reviewed documents or EPA reports should be cleared by EMAP
statisticians.  The variance estimates will be produced at the supplied indicator levels of
interest. For information on the Horvitz-Thompson estimators  of the CDF, refer to Section 7.

2  Statistical Estimation Overview
A sample of size n_a units is selected from subpopulation a with known inclusion probabilities
π = {π_1, ..., π_i, ..., π_{n_a}} and, if applicable, the size-weight values w = {w_1, ..., w_i, ..., w_{n_a}}.
The indicator is evaluated for each unit and represented by y = {y_1, ..., y_i, ..., y_{n_a}}.  When
sampling an extensive resource, the inclusion probabilities are replaced by the inclusion density
function evaluated at the sample locations.  The inclusion probabilities are design dependent
and should be furnished with the design points.  See Section 9 for further discussion.

The variance estimators of the CDF are calculated for each value of the indicator levels of
interest, x_k.  The units are assumed to come from an independent sampling design that
reduces the usually required joint inclusion probabilities given by π_ij, where i ≠ j, to
π_ij = [(n_a − 1)π_i π_j] / n_a.  Under the independent random sampling model, the Horvitz-
Thompson variance estimator simplifies to the usual simple random sampling variance
estimator, s², applied to a cumulative total.  This total differs depending upon whether the
CDF is for proportion or for total number.  In the case of proportion, the Horvitz-Thompson
ratio estimator is used to calculate the CDF, and because both the numerator and denominator
of the proportion are estimated, there is more variability in the estimate.  As a result, the
variance estimators of the CDF for proportion and the size-weighted CDF for proportion
require as input the CDF estimates produced using the Horvitz-Thompson ratio estimator.



 The output consists of the estimated variance values.

 3 Conditions Under Which This Method Applies

 •      Independent random sample (IRS) with fixed sample size and known inclusion
       probabilities or densities
 •      Discrete resource (or extensive, in the case of proportion)
 •      Subpopulation is defined geographically,  or the number of sites within the
       subpopulation of interest is known; examples: by ecoregion or first order stream length
 •      All units sampled from the subpopulation must be accounted for before applying this
        method; missing values are excluded

 3.1  Restrictions

 Variability estimates of the CDF for non-geographic subpopulation estimates cannot be made
 using the supplied calculation routines. For example, the supplied routine does not apply for
 the estimates of variability of the percentage of lakes that are hypereutrophic and have ANC
 < 200, or the estimated number of streams containing a species of fish for a subset of the
 sample that is determined by a chemistry response. A more sophisticated variance estimator is
 needed for these cases;  contact EMAP Design and Statistics for assistance.

 4 Required Elements

 4.1  Input Data

 y_i  = value of the indicator for the ith unit sampled from subpopulation a.
 π_i  = inclusion probability for selecting the ith unit of subpopulation a.
 w_i  = size-weight value for the ith unit sampled from subpopulation a.  This applies to
        discrete resources only.  An example would be area of a lake.

 4.2  Additional Components

 n_a  = number of units sampled from subpopulation a.
 x_k  = kth indicator level of interest.
 I(y_i ≤ x_k) = 1 if y_i ≤ x_k; 0 otherwise.

For the estimated variance of the estimated CDF for proportion, also input

      F̂_a(x_k) = [Σ_{i=1}^{n_a} I(y_i ≤ x_k)/π_i] / N̂_a,  the estimated CDF for proportion for indicator
      value x_k in subpopulation a with estimated subpopulation size, N̂_a = Σ_{i=1}^{n_a} 1/π_i.

For the estimated variance of the estimated size-weighted CDF for proportion, also input

      Ĝ_a(x_k) = [Σ_{i=1}^{n_a} I(y_i ≤ x_k) w_i/π_i] / Ŵ,  the estimated size-weighted CDF for proportion
      for indicator value x_k in subpopulation a with estimated subpopulation size-weighted
      total, Ŵ = Σ_{i=1}^{n_a} w_i/π_i.

5  Formulas and Definitions

The estimated variance of the estimated CDF (proportion) for indicator value x_k in
subpopulation a, V̂[F̂_a(x_k)] (simple random sample variance estimator of the Horvitz-
Thompson ratio estimator of a CDF), is

      V̂[F̂_a(x_k)] = n_a s² / N̂_a²,   s² = Σ_{i=1}^{n_a} (r_i − r̄)² / (n_a − 1),   r_i = [I(y_i ≤ x_k) − F̂_a(x_k)] × 1/π_i.

The estimated variance of the estimated CDF (total number) for indicator value x_k in
subpopulation a, V̂[N̂_a F̂_a(x_k)] (simple random sample variance estimator of the Horvitz-
Thompson estimator of a CDF), is

      V̂[N̂_a F̂_a(x_k)] = n_a s²,   s² = Σ_{i=1}^{n_a} (r_i − r̄)² / (n_a − 1),   r_i = I(y_i ≤ x_k) × 1/π_i.

The estimated variance of the estimated size-weighted CDF (proportion) for indicator value x_k
in subpopulation a, V̂[Ĝ_a(x_k)] (simple random sample variance estimator of the Horvitz-
Thompson ratio estimator of a CDF), is

      V̂[Ĝ_a(x_k)] = n_a s² / Ŵ²,   s² = Σ_{i=1}^{n_a} (r_i − r̄)² / (n_a − 1),   r_i = [I(y_i ≤ x_k) − Ĝ_a(x_k)] × w_i/π_i.

The estimated variance of the estimated size-weighted CDF (total) for indicator value x_k in
subpopulation a, V̂[Ŵ Ĝ_a(x_k)] (simple random sample variance estimator of the Horvitz-
Thompson estimator of a CDF), is

      V̂[Ŵ Ĝ_a(x_k)] = n_a s²,   s² = Σ_{i=1}^{n_a} (r_i − r̄)² / (n_a − 1),   r_i = I(y_i ≤ x_k) × w_i/π_i.

For these equations:

Ŵ    = estimated subpopulation size-weighted total.
N̂_a  = estimated subpopulation size.
F̂_a(x_k) = estimated CDF (proportion) for indicator value x_k in subpopulation a.
Ĝ_a(x_k) = estimated size-weighted CDF (proportion) for indicator value x_k in
           subpopulation a.
I(y_i ≤ x_k) = 1 if y_i ≤ x_k; 0 otherwise.
x_k  = kth indicator level of interest.
y_i  = value of the indicator for the ith unit sampled from subpopulation a.
π_i  = inclusion probability for selecting the ith unit of subpopulation a.
w_i  = size-weight value for the ith unit sampled from subpopulation a.  This applies to
       discrete resources only.  An example would be area of a lake.
s²   = sample variance of r.
n_a  = number of units sampled from subpopulation a.
r̄    = (Σ_{i=1}^{n_a} r_i) / n_a,  the arithmetic mean of r.
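Each estimator in Section 5 reduces to the sample variance of a single derived column.  A
compact Python sketch follows (NumPy assumed; ddof=1 gives the n_a − 1 divisor, matching the
spreadsheet VAR( ) function referenced in Section 6; all names are illustrative).

    import numpy as np

    def srs_cdf_variances(y, pi, xk, w=None):
        # Returns (V[F], V[N*F]) and, if size-weights w are supplied, also (V[G], V[W*G]).
        y, pi = np.asarray(y, float), np.asarray(pi, float)
        ind = (y <= xk).astype(float)
        n = y.size

        def pair(weight):                     # weight is 1/pi_i (unweighted) or w_i/pi_i (size-weighted)
            total = weight.sum()              # N_hat or W_hat
            cdf = (ind * weight).sum() / total
            v_prop = n * np.var((ind - cdf) * weight, ddof=1) / total**2
            v_total = n * np.var(ind * weight, ddof=1)
            return v_prop, v_total

        results = pair(1.0 / pi)
        if w is not None:
            results += pair(np.asarray(w, float) / pi)
        return results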
6  Procedure

6.1 Enter Data

Input the sample data consisting of the indicator values, y_i, their associated inclusion
probabilities, π_i, and, if applicable, the size-weight values, w_i, and CDF estimates.  For this
example data, the variance of the empirical CDF is of interest; x_k values are equal to y_i.

6.2 Sort Data

Sort the sample data in nondecreasing order based on the y_i indicator values.  Keep all
occurrences of an indicator value to obtain correct results.  Our sample data is

      Indicator     Inclusion        Size-weight
         y_i        Probability      (ex. area)
                       π_i              w_i         F̂_a(x_k)
        (1)            (2)              (3)           (4)
      ---------     -----------      -----------    --------
        1.9          .042201           117.85         .1219
        6.0          .059245           147.30         .2087
        9.8          .023847           185.55         .4244
       10.9          .060562            55.55         .5093
       11.0          .037023           239.91         .6482
       11.8          .055115           165.09         .7415
       12.0          .102785           129.83         .7916
       12.3          .059545            51.42         .8779
       13.6          .084789           262.33         .9386
       14.2          .083752            74.58        1.0000

6.3 Compute estimated variance of the estimated CDF for proportion, V̂[F̂_a(x_k)], and for
     total, V̂[N̂_a F̂_a(x_k)]

Compare each y_i to x_k.  For the first indicator level, x_k = 1.9, set I(y_i ≤ 1.9) = 1 if
y_i ≤ 1.9.  If this is not the case, set I(y_i ≤ 1.9) = 0.  In the following table, I(y_i ≤ 1.9) is
abbreviated as I(1.9).

      Indicator   Inclusion      I(1.9)   I(1.9) − F̂_a(1.9)   [I(1.9) − F̂_a(1.9)] / π_i   I(1.9) / π_i
         y_i      Probability
                     π_i
        (1)          (2)          (3)      (4) = (3) − .1219      (5) = (4) ÷ (2)          (6) = (3) ÷ (2)
        1.9        .042201         1            .8781                  20.8                    23.696
        6.0        .059245         0           −.1219                  −2.1                     0
        9.8        .023847         0           −.1219                  −5.1                     0
       10.9        .060562         0           −.1219                  −2.0                     0
       11.0        .037023         0           −.1219                  −3.3                     0
       11.8        .055115         0           −.1219                  −2.2                     0
       12.0        .102785         0           −.1219                  −1.2                     0
       12.3        .059545         0           −.1219                  −2.0                     0
       13.6        .084789         0           −.1219                  −1.4                     0
       14.2        .083752         0           −.1219                  −1.5                     0

 For estimating the variance of the estimated CDF for proportion for x_k = 1.9, calculate the
 sample variance, s², of column (5).  (In Excel, use the VAR( ) function.)  s² = 54.765.

      V̂[F̂_a(1.9)] = V̂[.1219] = n_a s² / N̂_a² = (10)(54.765) / (194.43)² = 0.0145 .

 For estimating the variance of the estimated CDF for total number for x_k = 1.9, calculate the
 sample variance, s², of column (6).  s² = 56.15.

      V̂[N̂_a F̂_a(1.9)] = V̂[23.70] = n_a s² = (10)(56.15) = 561.5 .

 Do this same procedure for the next x_k value, x_k = 6.0.  The table now becomes

      Indicator   Inclusion      I(6.0)   I(6.0) − F̂_a(6.0)   [I(6.0) − F̂_a(6.0)] / π_i   I(6.0) / π_i
         y_i      Probability
                     π_i
        (1)          (2)          (3)      (4) = (3) − .2087      (5) = (4) ÷ (2)          (6) = (3) ÷ (2)
        1.9        .042201         1            .7913                 18.751                  23.696
        6.0        .059245         1            .7913                 13.357                  16.879
        9.8        .023847         0           −.2087                 −8.751                   0
       10.9        .060562         0           −.2087                 −3.446                   0
       11.0        .037023         0           −.2087                 −5.637                   0
       11.8        .055115         0           −.2087                 −3.786                   0
       12.0        .102785         0           −.2087                 −2.030                   0
       12.3        .059545         0           −.2087                 −3.505                   0
       13.6        .084789         0           −.2087                 −2.461                   0
       14.2        .083752         0           −.2087                 −2.492                   0

For estimating the variance of the estimated CDF for proportion for x_k = 6.0, calculate the
sample variance, s², of column (5).  s² = 77.026.

      V̂[F̂_a(6.0)] = V̂[.2087] = n_a s² / N̂_a² = (10)(77.026) / (194.43)² = 0.0204 .

For estimating the variance of the estimated CDF for total number for x_k = 6.0, calculate the
sample variance, s², of column (6).  s² = 75.75.

      V̂[N̂_a F̂_a(6.0)] = V̂[40.58] = n_a s² = (10)(75.75) = 757.5 .

Repeat this process for the remaining x_k values.

6.4 Compute estimated variance of the estimated size-weighted CDF for proportion,
     V̂[Ĝ_a(x_k)], and for total, V̂[Ŵ Ĝ_a(x_k)]

The procedure for calculating the variance estimates for the size-weighted CDFs is the same
as the one used in Section 6.3.  The only difference between the estimates is that w_i/π_i is
substituted for 1/π_i in every part of the calculation.  The following example is for x_k = 6.0.

Create a new table of 6 columns.  Use column (1) from the table in Section 6.2 for the first
column.  In the second column, enter the result from dividing column (3) by column (2) of
the table in Section 6.2.  Insert the I(y_i ≤ x_k) values in column (3), where I(y_i ≤ x_k) = 1 if
y_i ≤ 6.0.  If this is not the case, set I(y_i ≤ x_k) = 0.  In column (4), insert the difference
between column (3) and the size-weighted CDF value corresponding to x_k = 6.0.  This CDF
value is .1786, from the table in Section 6.2.  In column (5), put the result from multiplying
column (4) by column (2).  In column (6), put the result from multiplying column (3) by
column (2).  Results are in the following table; I(y_i ≤ 6.0) is abbreviated as I(6.0).
      Indicator     w_i / π_i     I(6.0)   I(6.0) − Ĝ_a(6.0)   [I(6.0) − Ĝ_a(6.0)] × w_i/π_i   I(6.0) × w_i/π_i
         y_i
        (1)            (2)          (3)      (4) = (3) − .1786        (5) = (4) × (2)            (6) = (3) × (2)
        1.9        2792.5879         1            .8214                  2293.8317                 2792.5879
        6.0        2486.2858         1            .8214                  2042.2351                 2486.2858
        9.8        7780.8529         0           −.1786                 −1389.6603                    0
       10.9         917.2418         0           −.1786                  −163.8194                    0
       11.0        6480.0259         0           −.1786                 −1157.3326                    0
       11.8        2995.3733         0           −.1786                  −534.9737                    0
       12.0        1263.1221         0           −.1786                  −225.5936                    0
       12.3         863.5486         0           −.1786                  −154.2298                    0
       13.6        3093.9155         0           −.1786                  −552.5733                    0
       14.2         890.4862         0           −.1786                  −159.0408                    0
For estimating the variance of the estimated size-weighted CDF for proportion for x_k = 6.0,
calculate the sample variance, s², of column (5).  s² = 1491256.3.

      V̂[Ĝ_a(6.0)] = V̂[.1786] = n_a s² / Ŵ² = (10)(1491256.3) / (29563.60)² = 0.0171 .

For estimating the variance of the estimated size-weighted CDF for total for x_k = 6.0,
calculate the sample variance, s², of column (6).  s² = 1243723.68.

      V̂[Ŵ Ĝ_a(6.0)] = V̂[5280.06] = n_a s² = (10)(1243723.68) = 12,437,237 .

Repeat this process for the remaining x_k values.
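As a usage example, the x_k = 6.0 size-weighted figures above can be reproduced from the
Section 6.2 data with a few lines of Python (NumPy assumed; the printed rounding is
illustrative only).

    import numpy as np

    y  = np.array([1.9, 6.0, 9.8, 10.9, 11.0, 11.8, 12.0, 12.3, 13.6, 14.2])
    pi = np.array([.042201, .059245, .023847, .060562, .037023,
                   .055115, .102785, .059545, .084789, .083752])
    w  = np.array([117.85, 147.30, 185.55, 55.55, 239.91,
                   165.09, 129.83, 51.42, 262.33, 74.58])

    wp = w / pi                                   # column (2): w_i / pi_i
    ind = (y <= 6.0).astype(float)                # column (3): I(6.0)
    W_hat = wp.sum()                              # approximately 29563.6
    G_hat = (ind * wp).sum() / W_hat              # approximately .1786

    V_G  = 10 * np.var((ind - G_hat) * wp, ddof=1) / W_hat**2   # approximately 0.017
    V_WG = 10 * np.var(ind * wp, ddof=1)                        # approximately 12,400,000
    print(round(G_hat, 4), round(V_G, 4), round(V_WG))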



7  Associated Methods

An appropriate estimator for the estimated CDF for proportion for discrete or extensive
resources may be found in Method 1 (Horvitz-Thompson Estimator).  For the estimated CDF
for total number, size-weighted CDF for proportion, and size-weighted CDF for total (which
apply only to discrete resources), see Methods 2, 3, and 4, respectively.

8  Validation Data

Actual data with results, EMAP Design and Statistics Dataset #13, are available for
comparing results from other versions of these algorithms.

9  Notes

Inclusion probabilities, Tt,, are determined by the design and should be furnished with the
design points.

Population estimates are calculated using inclusion probabilities or densities and differ by
indicator.  For example, in the 1993 stream sample, periphyton and full physical habitat
(P-hab) were measured only on the 1X grid streams, requiring use of the 1X inclusion
probabilities.  Water chemistry measurements were taken on both 1X and 7X streams, and in
this case, the 1X inclusion probabilities should be used.  Reference/test sites (both lakes and
streams) were hand picked and cannot be used to make population estimates.  These
restrictions apply to all sampling years.

If estimates across multiple years are required, responses for sites sampled in  multiple years
should only be included for the initial year and the  inclusion probabilities should be
multiplied by the number of years of data.

10  References

Lesser, V. M., and W. S. Overton.  1994.  EMAP status estimation:  Statistical procedures
    and algorithms.  EPA/620/R-94/008.  Washington, DC: U.S. Environmental Protection
    Agency.

Overton, W.  S.,  D.  White,  and D. L. Stevens Jr.  1990. Design report for EMAP,
    Environmental Monitoring and Assessment Program.  EPA 600/3-91/053. Corvallis, OR:
    U.S. Environmental Protection Agency,  Environmental Research Laboratory.

Stevens, Jr., D. L.   1995.  A family of designs for sampling continuous spatial populations.
    Environmetrics.  Submitted.

-------
ANSWERS TO COMMONLY ASKED QUESTIONS
     ABOUT R-EMAP SAMPLING  DESIGNS
            AND DATA ANALYSES
                   Prepared for

                  Victor Serveiss
         U.S. Environmental Protection Agency
             Research Triangle Park, NC
                   Prepared by

                  Jon H. Volstad
                  Steve Weisberg

                   Versar, Inc.
               Columbia, MD 21045

                      and

                Douglas Heimbuch
                  Harold Wilson
                   John Seibel

         Coastal Environmental Services, Inc.
                  Linthicum, MD
                   March 1995

-------
ANSWERS TO  COMMONLY ASKED  QUESTIONS ABOUT
  R-EMAP  SAMPLING  DESIGNS AND DATA ANALYSES
                                                    INTRODUCTION

                                   The Environmental Monitoring and Assessment Program
                                   (EMAP) is an innovative, long-term research and moni-
                                   toring program designed to measure the current and
                                   changing conditions of the nation's ecological resources.
                                   EMAP achieves this goal by using statistical survey
                                   methods that allow scientists to assess the condition of
                                   large areas based on data collected from a representative
                                   sample of locations. Statistical survey methods are very
                                   efficient because they require sampling relatively few
                                   locations  to make valid scientific statements about the
                                   condition of large areas (e.g., all wadable streams within
                                   an EPA Region).

                                   Regional-EMAP  (R-EMAP)  is  a  partnership between
                                   EMAP, EPA Regional offices, states, and other federal
                                   agencies  to adapt  EMAP's  broad-scale  approach to
                                   produce ecological assessments at regional, state, and
                                   local levels.  R-EMAP is based on the same statistical
                                   survey techniques used in EMAP,  which have proven
                                   successful in  many disciplines of science.  Applying
                                   these techniques effectively requires recognizing several
                                   key principles of survey sampling and using specialized,
                                   although not difficult, data analysis methods.

                                   This document provides a nontechnical overview of the
                                   survey sampling and data analysis concepts underlying
                                   R-EMAP  projects.  It is intended for regional resource
                                   managers who have had little statistical training, but
                                   who feel they would benefit from a better understanding
                                   of the statistical and scientific underpinnings of R-EMAP.
                                   Familiarity with these concepts is helpful for understand-
                                   ing the kinds of information R-EMAP can provide and
                                   appreciating the strengths of R-EMAP.  Several  addi-
                                   tional documents are being prepared for scientists with
                                   some statistical  training who may become involved in
                                   analyzing  R-EMAP data.

                                   This document is organized in two sections. The first
                                   section explains the  general  principles  of  survey
                                   sampling and its application to determining ecological
                                   condition.   Terms such as target population, sampling

-------
frame, and random sampling are defined.  The second
section addresses questions frequently asked about the
R-EMAP sampling  design and data analysis methods.
Throughout  the document,  the concepts of survey
design are illustrated first with examples from everyday
life,  and then with examples from a typical R-EMAP
study.  The  R-EMAP examples involve a stream study;
however, the concepts are equally applicable to assess-
ing the  condition  of  other resources  such  as lakes,
estuaries, wetlands, or forests.
PRINCIPLES OF SURVEY DESIGN

There  are two generally  accepted  data  collection
schemes for studying the characteristics of a population.
The first is a census, which entails examining  every unit
in the  population  of  interest.   For most  ecological
studies, however, a census is impractical.  For example,
measuring fish assemblages everywhere to assess condi-
tions within a watershed that has 1000 kilometers of
stream would be prohibitively expensive.

A more practical approach for studying  an  extensive
resource, such as a watershed, is to examine  parts of it
through probability  (or random) sampling.  Studies based
on statistical samples rather than complete coverage (or
enumeration) are referred to as sample surveys. Sample
surveys are highly cost-effective,  and the  principles
underlying such surveys are well developed and docu-
mented.  The principles of survey design provide the
basis for (a) selecting  a  subset of sampling units from
which to  collect data, and (b) choosing  methods for
analyzing the data.

One example of a  sample survey is  an opinion poll to
estimate the percentage of eligible voters who plan to
vote Democratic in  a presidential election.  Such opinion
polls are based on interviews with only a small fraction
of all eligible voters. Nevertheless, by using statistically
sound survey methods, highly accurate estimates can be
obtained by interviewing  a representative sample of only
around 1200 voters. If 700 of the polled  voters plan to
vote Democratic, then the fraction 700/1200,  or 58 per-
cent, is a reliable estimate of the percent of  all voters
who plan to vote Democratic.

-------
[Figure:  A target population of enrolled students at a university.  Sampling unit = individual
student.]
[Figure:  A target population of perennial, wadable streams in a watershed.  Sampling unit =
point location and associated plot.]
The approach  used in conducting a R-EMAP stream
survey  is  basically  the same as  in  an opinion poll.
Instead of collecting the opinions of a sample of people,
a R-EMAP  project might collect data about fish assem-
blages from a representative sample of point locations
along the stream length of a watershed to determine the
percent of kilometers of streams in which ecological con-
ditions are  degraded. If data are collected from plots of,
say, 40 times the stream width in  length at each of 40
randomly selected sites, and 16  of the 40 sites exhibit
degraded conditions, then the estimated proportion of
degraded stream kilometers in  the watershed would be
40% (i.e., 16/40).
STEPS FOR IMPLEMENTING A SAMPLE SURVEY

The survey design is a  plan  for selecting  the  sample
appropriately so that it provides valid data for developing
accurate estimates for the entire population or  area of
interest.    Planning  and  executing a  sample  survey
involves three primary steps:   (1) creating  a list of all
units of the target population  from which to select the
sample, (2) selecting a random sample of units from this
list, and (3) collecting data from the selected units.  The
same techniques used to select the sample of people to
interview in an opinion poll are used to select the  sample
of sites from which to collect  field data.
                                       Developing a Sampling Frame

                                       Before the sample  survey can be conducted, a clear,
                                       concise description of the  target population is needed.
                                       In statistical terminology the target  population (often
                                       shortened to "population")  does not necessarily refer to
                                       a population of people. It could be a population  of
                                       schools, area units of farm land, freshwater lakes, or
                                       length-segments of streams.

                                       The list or map that identifies every unit within the popu-
                                       lation of interest is the sampling frame.  Such a list is
                                       needed so that every individual member of the popula-
                                       tion  can be identified unambiguously.   The individual
                                       members of the target population whose characteristics
                                       are to be measured are the sampling units.

-------
[Figure:  A random sample of students from the target population.  The poll results in "yes"
or "no" responses.]
 For example, if we were conducting a sample survey to
 estimate the percentage of students at a university who
 participate in intramural sports, the target population
 would consist of all the enrolled students. The individual
 students would be the sampling units, and the registrar's
 office could provide a list of students to serve  as the
 sampling frame.  We could draw a representative (ran-
 dom) sample  of  students  from  this list  and interview
 them about their participation in sports. Their responses
 would be "yes" or "no."  The percentage of interviewed
 students who participate in intramural sports would yield
 an estimate of the "true" percentage for all students.

 For a stream survey, the target population might be all
 perennial, wadable streams in a watershed.  The sam-
 pling  unit is  a point along the stream length, and an
 associated plot,  e.g.  40 times the  stream width  in
 length. The response  variable might be "degraded" or
 "non-degraded" based on  measures of water quality.
 Conceptually,  the collection  of  all  possible  point
 locations along these streams serve as a sampling frame,
 similar to the list of students in the previous example.
 The  sampling frame  for  streams  typically would  be
 established  by using  U.S.  Geological Survey stream
 reach  files through a  geographic  information system
 (GIS).
A random sample of locations from the target
population.
Selecting a Representative Sample

Survey sampling is intended to characterize the entire
population  of interest; therefore,  all  members of the
target population must have a known chance  of being
included in the sample.  Conducting an election poll by
asking only your neighbors' opinions probably would not
enable you to predict the outcome of a national election
accurately.

Simple  random  selection  ensures that  the sample is
representative because all members  of  the population
have an equal chance of being selected.  Random selec-
tion can be thought of as a kind of lottery drawing to
determine which stream reaches, for example, are in-
cluded in the sample. The selection is non-preferential
towards any  particular reach or group of reaches. One
way to make a random selection would be to place
uniquely numbered ping-pong balls (one for each sam-
pling unit) into a drum, blindly mix the drum, and then

-------
                                        blindly pick one ball corresponding to each stream reach
                                        (i.e., sampling unit) from which data are to be collected.
                                        In practice, computers are used to  make the random
                                        selections. Either way, the result is a subset of sampling
                                        units randomly selected from the sampling frame.
Students polled at the entrance to the
gymnasium are not representative of all
students on the university campus.
  A biased sample of locations from the
  target population of all streams in the
  shaded area.
                                                 FREQUENTLY ASKED QUESTIONS
Upon thoughtful  consideration of the sample survey
approach, several questions may come to mind.  This
section  answers  several commonly asked  questions.
Some of them concern survey sampling, and  some of
them concern  data analysis.   These questions are
addressed in fairly general terms.  As noted in the intro-
duction, additional technical detail will  be available in a
series of methods manuals.
Why is it so important to select sampling sites ran-
domly?

The way we select the sample (i.e., choose the units
from which to collect data)  is  crucial  for  obtaining
accurate estimates of population parameters. We clearly
would  not get a good estimate of the percentage of all
students at  a  university who  participate in intramural
sports  if we polled  students  at  the entrance to  the
gymnasium. This preferential sample would, most likely,
include a much higher proportion of athletes than the
general population of students.

Similarly in a stream study, preferential sampling occurs
if the sample includes only sites downstream of sewage
outfalls in a watershed where sewage outfalls affect
only a small percentage of total stream length. This kind
of sampling program may  provide  useful  information
about conditions downstream of sewage outfalls, but it
will not produce estimates that accurately represent the
condition of the whole watershed.

Preferential selection can be avoided by taking random
samples. Simple random sampling ensures that no par-
ticular  portion of the sampling frame  (i.e., groups of
students or kinds of river reaches) is favored.  Within
streams, the chance of selecting a sampling unit that
has degraded ecological conditions  would be proportional

-------
to the number of sampling units within the target popu-
lation that have degraded conditions.  For example, if
30% of the target population has degraded conditions,
then on average 30% of the  (randomly selected) units in
the sample will exhibit degraded conditions.  This pro-
perty of random sampling allows estimates (based only
on the sample) to be used to  draw conclusions about the
target population as a whole.
For 305b reports, I need to estimate the total number of
stream miles in my EPA Region that are degraded.  Can
I do this from sample survey data?

The number of degraded stream miles can be calculated
in two steps.  First, the proportion of stream miles that
are degraded  is calculated as  illustrated earlier.  Then,
that fraction is multiplied by the total number of stream
miles in the population.  The total number of stream
miles is  available  from  the  sampling frame,  which
delineates all members of the target population.

Defining "degraded" is an important part of the calcula-
tion, regardless of whether it is for the percent or the
absolute number of stream miles.  "Degraded" can be
defined if a threshold value or goal for each measurement
variable can be established.  Most of the variables
measured in stream surveys, such as pH, have continuous
ranges of response (e.g., between 1 and 14 for pH).
Calculating the proportion of stream miles that are
degraded requires converting these continuous data into
binary, or yes/no (e.g., degraded or not degraded), form.
The question of how many stream miles are degraded,
therefore, must be rephrased to include a threshold value
for the relevant measurement variable.  For pH, the
question might be rephrased as "What is the total number
of stream miles in my Region with pH below 6.5?"
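
For readers who want to see the two-step calculation spelled out, the short
Python sketch below is an illustration only (the pH values, threshold, and
total mileage are hypothetical and not from any EMAP study).  It converts
continuous pH measurements to a yes/no indicator, estimates the proportion
of degraded stream miles, and multiplies by the total miles in the sampling
frame.

    # Hypothetical example: estimate total degraded stream miles.
    sample_ph = [7.1, 6.2, 6.8, 5.9, 7.4, 6.4, 7.0, 6.1]  # pH at randomly selected sites
    threshold = 6.5                                        # "degraded" means pH below 6.5
    total_stream_miles = 1200.0                            # total miles in the sampling frame

    degraded = [1 if ph < threshold else 0 for ph in sample_ph]  # continuous -> yes/no
    p_hat = sum(degraded) / len(degraded)        # estimated proportion degraded
    degraded_miles = p_hat * total_stream_miles  # estimated total degraded stream miles

    print(round(p_hat, 2), round(degraded_miles))   # 0.5 and 600 for these made-up values
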
I am accustomed to seeing estimates of average condi-
tion instead of estimates of proportion.  Can R-EMAP
data be used to estimate average condition?

Yes, estimates of average condition, such as the average
pH in a watershed, provide valuable information and can
be  calculated with  R-EMAP data as a simple mean.
The  principles  of  survey sampling,  particularly the
emphasis on selecting a  representative sample,  also

-------
                                      apply to estimating a population mean. Just as an esti-
                                      mate of the percent of stream miles in a Region in which
                                      pH is below 6.5 is biased if data are collected only from
                                      sites downstream of sewage outfalls, so is the estimate
                                      of mean pH.

                                      EMAP  emphasizes estimating spatial extent (e.g., per-
                                      cent of river miles) because it has  several advantages
                                      over estimating the mean.  For instance, a Region with
                                      an average stream  pH of 7 might be composed entirely
                                      of streams with a pH of 7; however, the same average
                                      would  occur if half the streams have a pH of 6 and the
                                      other half a pH of 8.  Estimating the spatial extent of the
                                      resource that fails to meet some standard (e.g., pH of at
                                      least 6.5) provides more information about the condition
                                      of the resource and is consistent with EPA initiatives to
                                      establish  environmental  goals and measure progress
                                      toward meeting them.
Distribution of sampling locations along a
transect for different sampling schemes:
random sampling, restricted random sampling,
and systematic sampling.
Many EMAP documents refer to hexagons in describing
the sampling design. How are hexagons involved?

In geographic  studies,  such as a stream survey, it is
often desirable to-distribute samples  throughout  the
study area.  Often this is accomplished using a syste-
matic design in which  samples are placed at regular
intervals.  In EMAP, this is  accomplished by a special
kind of  random sampling known as restricted random
sampling. This type of random sampling has a syste-
matic component. The systematic element causes the
selected sampling units to be spread out geographically.
The random element ensures that  every sampling unit
has an equal chance of being selected.  The illustration
at left compares the typical allocations of sampling units
along a transect for random,  restricted  random,  and
systematic sampling designs.

In EMAP, hexagons are used to add the systematic  ele-
ment to the design.  The hexagonal grid is positioned
randomly on the map of the target resource, and sam-
pling  units  from within each grid  cell  are  selected
randomly.   The  grid  ensures spatial  separation  of
selected sampling units; randomization ensures that each
sampling unit has an equal chance of being  selected.
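
EMAP's actual procedure positions a hexagonal grid randomly over a map of
the resource, but the restricted random idea can be sketched in one
dimension.  The Python fragment below is illustrative only (the function
name and transect values are made up and are not EMAP's grid algorithm): a
transect is divided into equal-length cells and one random location is drawn
within each cell, so the sample is spread out while every location retains
an equal chance of selection.

    import random

    def restricted_random_sample(transect_length, n_cells, seed=None):
        """One random sampling location within each of n_cells equal-length cells."""
        rng = random.Random(seed)
        cell = transect_length / n_cells
        return [i * cell + rng.uniform(0, cell) for i in range(n_cells)]

    print(restricted_random_sample(transect_length=100.0, n_cells=10, seed=1))
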

-------
Target population: all eligible voters in all
states.  Area of special interest (stratum):
voters in Rhode Island.
Target population: watershed with 1000 km
of streams.  Area of special interest (stratum):
200 km of streams.
EMAP documents suggest that the sampling design is
"flexible to enhancement."  What does this mean?

One goal of a sample survey may be to compare a sub-
population with the target population.  For instance, an
opinion poll might be used to determine if a higher per-
centage of the people living in Rhode Island are likely to
vote Democratic than in the nation as a whole. Given its
small size, Rhode Island probably would receive  very
little attention in a national poll if samples are allocated
randomly.  One way to achieve a sample  of people  in
Rhode Island that is  sufficient to make this comparison
is to increase sampling effort for the nation as a whole
until enough people  from Rhode Island are included in
the randomly selected national sample.  This option is
not very cost-effective because it requires considerable,
unnecessary sampling effort  in other areas to achieve a
desired sample size in one small area.

Another, preferable,  alternative would be to divide the
entire target population into two subpopulations, or
strata.  Voters in the United States  could  be stratified
into (1) those living in Rhode Island, and (2) those living
elsewhere.   A simple random sample of  desired size
could then be selected from each of these groups. Stat-
isticians refer to this as stratified random sampling.
Stratified sampling designs  can have any number of
strata with a different level of sampling effort in each.

Stratified sampling could be used in a stream survey to
enhance sampling effort in a watershed of special inter-
est so that its condition could be compared with that of
a larger area. In a study area with 1000 kilometers of
streams, for example, an area of special interest  may
contain 200 kilometers of streams. If budget constraints
limit the size of the total sample to 60 sampling units,
30 could be randomly selected from the special interest
area,  and 30 from the rest of the sampling frame.  If
simple random  sampling is used, the area of  special
interest, which represents  20%  of  the  area, will
contain only about 12 of the 60 selected sampling units.
A sample of 12 would be insufficient to estimate the
condition of the special interest area reliably.

-------
Doesn't enhancing the sampling intensity for an area of
special interest bias the overall estimate?

No.  Sampling units inside an area of special interest
usually have a higher chance of being selected than sam-
pling units outside the special interest area. Within each
stratum, however, the chance of selecting any location
is equal; therefore, a separate (unbiased) estimate can
be computed for each stratum.

With stratified random sampling, estimates are generated
first for individual strata, then  the stratum-specific
estimates are combined into an overall estimate for the
whole target population.  Stratum-specific estimates are
combined by weighting each one by the fraction of all
sampling units that are within the stratum.   For the
simple two-stratum example given above, the weights
would be 200/1000 for stratum  1  and 800/1000 for
stratum 2.  So, if the stratum-specific estimates are 0.5
for  stratum 1 and 0.25 for stratum 2, the overall esti-
mate is 0.30 [(0.5 x 2/10) + (0.25 x 8/10)].  This
approach ensures that the overall estimate is corrected
for the intentional selection emphasis within a particular
subpopulation.
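
The weighting arithmetic can be written out as a minimal sketch; the numbers
below simply restate the two-stratum stream example in the text.

    # Two strata: 200 km (special interest) and 800 km of streams,
    # with stratum-specific estimates of 0.5 and 0.25.
    strata = [
        {"extent_km": 200.0, "p_hat": 0.50},  # special-interest watershed
        {"extent_km": 800.0, "p_hat": 0.25},  # remainder of the study area
    ]
    total = sum(s["extent_km"] for s in strata)
    overall = sum(s["p_hat"] * s["extent_km"] / total for s in strata)
    print(round(overall, 2))   # 0.30, the overall estimate for the whole target population
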
EMAP's objectives state that estimates are made with
known confidence.  What is "known confidence"?

An estimate of a population parameter is of limited value
without some indication of how confident one should be
in it.  Scientists typically describe the  appropriate level
of confidence in an estimate derived from  a sample sur-
vey by defining confidence limits or margins of error.
This description of statistical confidence is used fre-
quently in reporting the results of opinion polls, using
statements such as "this poll has a margin of error of
±4%."  Provided random sampling is used, similar
statements can be made about estimates from biological
sample surveys.

Sample surveys provide estimates that  are used to make
inferences  about  parameters for the  population  as  a
whole. Two types of estimates are commonly provided:
the  point estimate and the interval estimate.  For ex-
ample, the estimated proportion of voters that support
a party is a point estimate. It is important  to know how
likely it is that such a point estimate deviates from the

-------
Percent of Democratic voters estimated from
a sample of 30; note the wide confidence
interval.
  Polled responses     14 of 30 (47%)
  Confidence interval  29% - 65%
  Margin of error      ±18%

A sample of 300 people produces a better
estimate; the confidence interval is narrower.
  Polled responses     140 of 300 (47%)
  Confidence interval  41% - 53%
  Margin of error      ±6%
true population parameter by no more than a given
amount.  An interval estimate for a parameter is defined
by upper and lower limits estimated from the sample
values.  A confidence interval is constructed so that the
probability of the interval containing the parameter of
interest can be specified.  We do not know with cer-
tainty whether an individual interval, specified as a
sample estimate plus or minus a margin of error, includes
the true population parameter.  For repeated sampling,
however, the estimated 95% confidence intervals would
include the true parameter 95% of the time.  The
width of the confidence interval is a measure of how
precisely the parameter is being estimated: a narrow
interval signifies high precision.  The margin of error is
often used to define the upper and lower limits of the
confidence interval; it is half the width of the confidence
interval.  Thus, if a poll estimates that 55% of the popu-
lation will vote Democratic and the margin of error is
±4%, then the estimated 95% confidence interval
ranges from 51% to 59%.

                                     A great advantage of using  a random sampling design is
                                     that statisticians have developed procedures for calcu-
                                     lating confidence intervals for the estimates.  For most
                                     R-EMAP projects,  in which the goal is to estimate the
                                     proportion of the resource that is degraded, a standard
                                     probability  distribution  known  as the binomial distri-
                                     bution can be used to determine the upper and lower
                                     bounds of confidence intervals.
What are the most important factors affecting the size
of the confidence interval?

The sample size (# of sampling units collected) and the
proportion of yes answers are the primary factors affect-
ing the  size of the  confidence interval with  binary
(yes/no)  data.  The effect of  sample size can be  illu-
strated with a  pre-election poll of  voters.  If only 30
people are  sampled, and 14 indicate that they will vote
Democratic, it would  be unwise to  predict the winner.
With such a small sample size, the margin of error would
be about ±18% for a 95% confidence interval.  The
degree of confidence would be higher if 140 people out
of a sample of 300 say they will vote Democratic (47%
±6%), and higher still if 1400 people out of a sample
of 3000 say they will vote Democratic (47% ± 2%). In
this example, the estimated proportion of sampled voters

-------
A sample of 3000 people produces a very
accurate estimate, with a narrow confidence
interval.
  Polled responses     1400 of 3000 (47%)
  Confidence interval  45% - 49%
  Margin of error      ±2%

Margin of error as a function of the percent
yes responses for fixed sample sizes of 30 and
100 (90% confidence interval).

Plot of margin of error versus sample size
when 20% of the population is in the YES
category (P = 0.2).
who will vote Democratic stays the same (p = 47%), but
the width of the confidence interval decreases with
increasing sample size.

Confidence intervals for estimated percentages (p) are
affected to a lesser degree by the proportion of yes
answers (P)  in the population.  The widest confidence
interval occurs  for P equal to  50%.   For values of P
ranging from 20% to 80%, the margin of error will not
vary much with P; it will be determined mainly by the
sample size. The fact that there is a maximum margin
of error for binomial estimates of proportions is very
useful for planning a survey.  If we plan for the  worst
case (i.e., when  half of the population is in  the yes
category) we can select a sample size that ensures that
the confidence interval  for P  will  be smaller than a
specified limit.
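
As a rough planning aid, the worst-case margin of error can be approximated
with the familiar Normal approximation to the Binomial, roughly
1.96 x sqrt(P(1 - P)/n) for a 95% confidence interval.  The sketch below is
an illustration only (it is not the exact Binomial method described in the
companion manuals), but it reproduces values close to the ±18%, ±6%, and
±2% figures cited above.

    import math

    def margin_of_error(p, n, z=1.96):
        # Normal approximation for a 95% confidence interval
        return z * math.sqrt(p * (1 - p) / n)

    # Worst case for planning: half the population in the "yes" category (P = 0.5).
    for n in (30, 100, 300, 3000):
        print(n, round(100 * margin_of_error(0.5, n), 1), "% margin of error")
    # A sample of about 100 keeps the worst-case margin of error near +/-10%.
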
Doesn't  the size  of  the target population affect
confidence in the estimates?

The size of the target population theoretically affects the
precision of the estimates.  For most sample surveys,
however, the effect is negligible because the sampled
fraction of the target population is so small.  When the
sampled fraction is small, the size  of the  sample rather
than the size of the target population determines the
precision of the estimate. Polling 1000 people in the
state  of  Rhode  Island,  for example,  would yield as
precise an estimate as polling 1000 people in the state
of Texas.  In both cases, a very small proportion of the
total population is polled.

If the sample includes a large proportion of the popu-
lation,  in  contrast, the accuracy of the estimate  is
improved. For instance, if a local town has a population
of 1400 people, then a sample of 1200 people would
produce a substantially more accurate  estimate than a
sample of 1200 people from a population of 100 million.
As the size of the sample approaches the  size of the
population, statisticians adjust the confidence interval
using the finite population correction factor.  In practice,
however,  most sampling efforts don't sample a large
enough  fraction of the  population for this  correction
factor to become important. That is why pollsters inter-
view approximately the  same number  of people for a
local election as for a presidential election.
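
A minimal sketch of that correction factor (illustrative only): the standard
error of the estimate is multiplied by sqrt((N - n)/(N - 1)), which is
essentially 1 unless the sampled fraction n/N is appreciable.

    import math

    def finite_population_correction(N, n):
        # Multiplier applied to the standard error when n is a large share of N
        return math.sqrt((N - n) / (N - 1))

    print(round(finite_population_correction(N=1400, n=1200), 3))         # small town: about 0.378
    print(round(finite_population_correction(N=100_000_000, n=1200), 6))  # national poll: essentially 1
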

-------
 For R-EMAP projects, the fraction of the population that
 is sampled is generally very small. Fish assemblages, for
 example,  are  generally  sampled  from   100-meter
 segments.  If 50 such  samples are  collected from a
 Region with  1000 miles of streams, the sampled fraction
 is 0.0031.
               CLOSING COMMENTS

The approaches and concepts described in this overview
document  are  generally  applicable to  all  R-EMAP
projects. They are appropriate whether the purpose of
sampling is to estimate the proportion of the number of
resource units (e.g., numbers of lakes), the proportion of
total length of a resource (e.g., miles of streams), the
proportion of area of a resource (e.g., square miles of an
estuary), or the proportion of volume of a resource (e.g.,
cubic meters of one of the Great Lakes). The approaches
and concepts can  be  applied without modification to
each of these situations.

This overview document purposefully was written non-
technically; it does not contain enough detail to help
someone analyze data. Three companion documents are
being prepared  to  provide  additional technical  detail
about recommended methods. These manuals describe
data analysis methods (1)  for assessing  status  (e.g.,
proportion of area  with  degraded conditions),  (2) for
assessing differences in proportions  between two sub-
populations of interest  (e.g., deep versus shallow  areas,
two different states, two different stream orders), and
(3) for assessing long-term trends.  The methods  manu-
als  are  intended for  scientists with some statistical
training.  Technical documentation targeted for statis-
ticians is available from the EMAP Statistics and Design
Team in Corvallis, Oregon.
                  BIBLIOGRAPHY

Cochran, W. G. 1977.  Sampling  Techniques. 3rd ed.
  John Wiley and Sons. New York.

Gilbert,   R.  O.   1987.  Statistical  Methods  for
  Environmental Monitoring.  Van Nostrand Reinhold.
  New York.

-------
Jessen, R. J. 1978. Statistical Survey Techniques. John
   Wiley and Sons. New York.

Stuart, A. 1984.  The Ideas of Sampling.  MacMillan
   Publishing Company. New York.

-------
                       R-EMAP
             Data Analysis Approach for
Estimating the Proportion of Area that is Subnominal
                     Prepared for

                    Victor Serveiss
           U.S. Environmental Protection Agency
              Research Triangle Park, NC
                     Prepared by

                  Douglas Heimbuch
          Coastal Environmental Services, Inc.
                    Linthicum, MD

                    Harold Wilson
          Coastal Environmental Services, Inc.
                    Linthicum, MD

                     John Seibel
            Coastal Environmental Services, Inc.
                    Linthicum, MD

                   Steve Weisberg
                     Versar, Inc.
                    Columbia, MD
                     April 1995

-------
                            TABLE OF CONTENTS

I.    Introduction .......................................................  1

II.   Estimation of Proportion of Area that is Subnominal ................  2
      II.A.  The Resource, the Sample and the Estimate ...................  3
      II.B.  Probability Distribution for Possible Values of the Estimate  5
      II.C.  Factors Affecting the Estimated Proportion ..................  7
             II.C.1.  The True Proportion Subnominal .....................  7
             II.C.2.  Sample Size and Variance ...........................  9

III.  Construction of Confidence Limits .................................. 10
      III.A.  What are Confidence Limits? ................................ 11
      III.B.  Factors Affecting Width of the Confidence Interval ......... 14
      III.C.  How to Compute Confidence Limits ........................... 16
              III.C.1.  Standard Graphs and Tables for Confidence Limits . 16
              III.C.2.  Normal Approximation ............................. 17

IV.   Data Analysis for Stratified Random Sampling ....................... 19

V.    Closing Comments ................................................... 21

-------
 I.  Introduction

       The  Environmental  Monitoring
 and Assessment Program (EMAP) is an
 innovative,   long-term  research  and
 monitoring program that is designed to
 measure  the current and  changing
 conditions of the nation's  ecological
 resources.  EMAP achieves this goal by
 utilizing  sample survey  approaches
 which allow scientific statements to be
 made   for   large  areas   based   on
 measurements taken at a sample of
 locations. Regional-EMAP (R-EMAP) is
 a  partnership  among  EMAP,  EPA
 Regional    offices,    other    federal
 agencies,  and   states.   R-EMAP  is
 adapting EMAP's broad-scale  approach
 to  produce  ecological  assessments at
 regional, state, and local levels.

      The sample survey approaches
 utilized by R-EMAP are very efficient in
 terms   of   the   (small)   number   of
 locations that need  to be sampled in
 order   to    make   valid    scientific
 statements  about the  condition of a
 large area  (e.g., all estuarine waters
 within  a  Region).    This  efficiency
 carries with it a small  additional cost,
 however.    Specialized data analysis
  methods must be applied to ensure that
 the results are scientifically valid.

      This document is the  first in a
 series   of  methods  manuals   being
prepared to assist the R-EMAP partners
 in  implementing  EMAP's   sampling
 approach.  These manuals  build upon
 basic concepts that were addressed in
the document "Answers to Commonly
Asked   Questions  About  the  EMAP
 Sampling Design" by providing a more
 thorough discussion of specific topics.
  The intended audience of the manuals
  is scientists without extensive
 statistical training who may become
 involved in analysis of R-EMAP  data.
 Technical documentation, written  for
 statisticians, is also being prepared.

    This  manual  describes  two  data
 analysis  methods  for  assessing  the
 status of ecological  condition.  One
 primary measure of ecological condition
 addressed by EMAP and R-EMAP is  the
 proportion of area that has subnominal
 (i.e., not meeting some environmental
 criterion)  conditions.    This manual
 describes methods for:

  o Estimation of the  proportion of
     area that has subnominal
    conditions, and

  o Construction   of   confidence
    intervals for the  estimates of
    proportion of area.

These  methods are equally applicable
to any type  of proportion  including
proportion of numbers  (e.g., numbers
of lakes), proportion of length (e.g.,
miles of  streams), proportion of area
 (e.g., square miles of estuaries), or
proportion  of  volume  (e.g.,  cubic
meters of a lake) that has subnominal
conditions.       The  methods    are
appropriate  only  for  a   sampling
program  in which 1)  every location
within the resource of interest has the
same chance  of being selected  for
sampling, and 2) the selection of any
one location does not affect the chance
of selection for any other location. The

-------
methods can  also be applied to data
from  stratified sampling if these two
conditions  are satisfied within  each
defined stratum.

      These  methods  are  described
along with an in-depth discussion  of
underlying   concepts.     Underlying
concepts   (rather  than  'cook-book'
instructions) are emphasized for two
reasons.   The first is that  proper
interpretation  of the results from the
data     analyses     requires     an
 understanding of the underlying
concepts.  Correct interpretation of the
results  of data analyses is a key link
between  quality  data  and  sound
resource management decisions.  The
second  reason is that each of  the R-
EMAP  projects is  unique  and  may
require custom application of the data
analysis    methods.       Thoughtful
application  of  these  methods cannot
occur without  an understanding of the
underlying  concepts.  Furthermore, a
solid  understanding of the underlying
concepts can  be a  great help when
defending results and conclusions.
II.    Estimation of Proportion of
      Area that is Subnominal

      In this section, the recommended
method for estimating  proportion of
area that  is subnominal is  described
and  the  rationale  for the  method  is
presented.   Also, properties of the
estimates are discussed. To make this
information easier  to understand, the
distribution pattern of  the response
variable within the  resource of interest
is  treated  as  if  it  is  known  with
certainty  (i.e., a map of the response
variable  for the  entire  resource of
interest  is   presented).     Clearly,
complete information  of this  kind is
never available in practice; if it were,
there would be no need to sample!

    Although the recommended method
will   provide  an  estimate  of  the
proportion of area that is subnominal, it
does  not provide  an estimate of the
location of the subnominal areas.  The
only locations that are truly known to
be subnominal are the specific sampled
locations.  Other analysis approaches
may  be  used  to  map  the  actual
subnominal areas.

    This section is organized into three
 parts.  The first part contains a
 discussion of the general relationship
 between a) the true distribution pattern
 of the response variable, b) the sample
 of response variables from selected
locations,  and  c)   the  estimate  of
proportion of area that is subnominal.
Next,  the  probability  of  observing
different estimated values,  based on
which (randomly selected) locations are
included in the sample,  is discussed.
Finally, the  last  section contains a
discussion of factors that affect  the
estimate of the proportion of area that
is subnominal, and the kinds of effects
generated by these factors.

-------
                                        II.A.  The Resource, the Sample and the
                                              Estimate
 Figure 1. Hypothetical resource with a
 subnominal area proportion of 0.2.
Figure 2. Hypothetical resource with 15
sampling locations.
     For  the  purposes of  assessing  the
 proportion  of  area  that  is  subnominal,  a
 simple map of the resource  of interest can
 be  envisioned  in which  areas  that  are
 subnominal are shaded and  all other areas
 are  left unshaded (Figure 1).  The resource
 depicted in Figure 1 is a hypothetical estuary
 with the upstream and downstream limits of
 the  resource of interest marked with dotted
 lines. The shaded areas, in this case, might
  represent areas with concentrations of
 metals  in sediments  that are in  excess of
 some standard. For the example depicted in
 Figure 1, a total  of  200 square miles are
 subnominal  and  the  total   area  of  the
 resource of interest (i.e., shaded plus non-
 shaded areas)  is  1000 square miles.  The
 true proportion of area that is subnominal is
 the  ratio of a) the extent of shaded areas
  (e.g., 200 square miles) divided by b) the
 entire  extent of  the  resource of  interest
 (e.g., 1000 square miles).  Therefore, 0.2 is
 the  true   proportion  of  area  that  is
  subnominal for this example.  The true
  proportion is often referred to (e.g., in
  textbooks) as the 'population parameter' P.

     Now  suppose that  a  sample  of 15
 locations within the resource of interest is
 selected at random.  Furthermore, suppose
the random selection of locations is made so
that a) every location  within the resource of
 interest  has  an  equal  chance  of  being
selected,  and b)  after each  selection of  a
location, all locations are again equally likely
to be chosen as the next selected location
 (Figure  2).   This would  be  like  blindly
throwing a dart at the map 15 times,  each

-------
                                        time  ignoring  where  the  previous  darts
                                        landed.
Figure 3. Hypothetical resource with 3 of 15
samples in subnominal areas.
Figure 4. Hypothetical resource with 5 of 15
samples in subnominal areas.
     After the locations  are  selected,  the
condition of the resource (e.g., whether or
not the  metals concentration  exceeds  the
standard) is recorded. In practice this might
be accomplished by sending a field crew to
the  location   to   collect  a   sample   for
laboratory  analysis.  For this example, a
selected  location is designated  subnominal if
it is in a  shaded area of the map  (Figure 3).
Next, the total number of  selected locations
with subnominal condition is  recorded  (call
this number 'x'), and the total sample size
(call this number 'n') is noted.  These  two
numbers, x and n, are all that are needed to
estimate  the  proportion  of  area that  is
subnominal, and to construct the confidence
limits for the estimated proportion.

     The estimate of the proportion of area
that is subnominal (referred to as p̂) is simply
the ratio of x divided by n:

                p̂ = x / n .

For the sample depicted in Figure 3, x = 3
and n = 15.  Accordingly, the estimated
proportion for this example is 3 divided by
15, which is equal to 0.2.  In this case, the
estimate is the same as the true population
proportion.  This will not always occur.  For
example, the randomly selected locations
could have produced 5 samples that were
subnominal instead of 3 (Figure 4).  In this
case the estimate would have been 5 divided
by 15, which is equal to 0.33.  This estimate
is not equal to the true proportion.  In fact,
the estimated value can be any one of 16
numbers from 0 to 1 (i.e., 0/15, 1/15, 2/15,
..., and 15/15).  However, it is much more
likely for the estimate to be near 0.2 (i.e.,
the true proportion) than any other value.
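
A minimal sketch of the estimator just described (the numbers simply restate
the Figure 3 and Figure 4 examples):

    def estimate_proportion(x, n):
        """Estimated proportion subnominal: subnominal locations / sample size."""
        return x / n

    print(estimate_proportion(3, 15))   # 0.2, as in Figure 3
    print(estimate_proportion(5, 15))   # about 0.33, as in Figure 4
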

-------
                                           II.B.  Probability Distribution of Possible
                                                  Values of the Estimate
Figure 5. Frequency distribution of p̂ based on
one realization of 1000 random selections (15
samples, P = 0.2).

Figure 6. Probability distribution of p̂ based on
the exact Binomial distribution (15 samples,
P = 0.2).
      The chance of observing each  of the
 possible  values  of the  estimate can  be
 summarized  in  what  is  referred  to  as  a
 probability    distribution    (or     sampling
 distribution) of the estimate. The probability
 distribution  depicts  the  likelihood of each
 possible  outcome  (i.e.,  estimate  of  the
 proportion) of the random sampling compared
 to all other possible outcomes and provides a
 basis  for  assessing  the  reliability of  the
 estimate.

     The probability distribution of possible
values of the estimate can be approximated
by repeating the random selection of locations
over and over.  For each random selection of
15 locations, the estimate p̂ is recorded and a
cumulative tally is kept of the number of times
each possible value of p̂ is observed.  For
example, with 1000 random selections (of 15
locations) the frequency distribution depicted
in Figure 5 is produced.  Because each of the
1000 random selections (of 15 locations) is
equally likely, the value of p̂ with the highest
tally (or frequency) is the most likely value of
p̂.  Furthermore, the probability of observing
any particular value of p̂ is the limit (as the
number of random selections goes to infinity)
of the ratio of a) the number of selections
producing that value of p̂, to b) the total
number of selections of 15 locations (Figure
6).  Accordingly, the y-axis in Figure 6 has a
possible range from 0 to 1.

     In  practice,  it  is   not  necessary  to
 repeatedly sample the resource in order to
 construct the probability distribution of the
 estimated values. The probability distribution
 of estimates of  proportion  (based on data
 from  the type of random sampling  addressed
 in this document) can always be described by
 a standard distribution  called the  Binomial

-------
Figure 7. Average value of p̂ (15 samples, P =
0.2).
 distribution.  The Binomial distribution is fully
 defined by  only  two  parameters: the true
 proportion and the sample size.   Therefore,
 the probability distribution of possible values
 of the  proportion of area can be constructed
 by plugging a value for the true  proportion
 (assumed to be known in this section of the
 manual) and the sample size into the formula
 for the Binomial distribution.
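
The construction of Figures 5 and 6 can be sketched in a few lines of Python
(assuming the NumPy and SciPy libraries are available; this sketch is
illustrative and is not part of the manual): the exact Binomial
probabilities for n = 15 and P = 0.2 are compared with the frequencies from
1000 simulated random selections.

    import numpy as np
    from scipy.stats import binom

    n, P = 15, 0.2
    p_hat_values = np.arange(n + 1) / n            # possible estimates 0/15, 1/15, ..., 15/15
    exact = binom.pmf(np.arange(n + 1), n, P)      # exact Binomial probabilities (Figure 6)

    rng = np.random.default_rng(1)
    simulated = rng.binomial(n, P, size=1000) / n  # 1000 realizations of p_hat (Figure 5)

    for k in range(n + 1):
        freq = np.mean(simulated == p_hat_values[k])
        print(f"p_hat = {p_hat_values[k]:.2f}  exact = {exact[k]:.3f}  simulated = {freq:.3f}")
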

     Also, it is worth noting that the average
of all possible values of p̂ is equal to the true
proportion.  This can be seen by envisioning
the probabilities (Figure 6) as weights on a
board.  The average is the center of mass for
the weights (Figure 7) and is equal to the true
proportion.  In general, if the mean of the
probability distribution of an estimate is equal
to the parameter being estimated (in this case
the true proportion), then the method of
estimation is referred to as being unbiased.
Because this condition is satisfied for the
recommended method of estimation, this
method is unbiased.
Figure 8.  Hypothetical resource with scattered
subnominal areas (P = 0.2).
     Another important property of this
method of estimation is that the specific
pattern of subnominal areas on the map does
not affect the estimates (provided that the
true proportion remains unchanged).  For
example, with the shaded areas more
scattered (Figure 8), the probability
distribution of values of p̂ is exactly the same
as the probability distribution (Figure 6)
associated with the map in Figure 1.
Therefore the same sampling design can be
used regardless of the underlying spatial
pattern, which generally is unknown.  The
specific pattern of the response variable
within the resource of interest does not affect
the probability distribution of the estimate (p̂).
However, as described in the next part of this
section, the probability distribution of p̂ is
affected by the true proportion and by the
sample size (n).

-------
Figure 9. Hypothetical resource with a
subnominal area proportion of 0.3.
                                           II.C.  Factors Affecting the Estimated
                                                  Proportion

                                                II.C.1.  The True Proportion Subnominal
     There is a different probability
distribution of values of p̂ for every possible
value of the true proportion.  For example, if
the total area that is subnominal is 300
square miles (Figure 9), the true proportion
is 0.3 (i.e., 300 square miles divided by
1000 square miles).  The probability
distribution for estimated values of the
proportion can be generated from the
formula for the Binomial distribution.  In this
case the probability distribution is as
depicted in Figure 10.  Notice that the
distribution has shifted to the right.  The
most likely values are near 0.3 and the mean
of the distribution is exactly 0.3 (Figure 11).
Figure 10. Probability distribution of p̂ (15
samples, P = 0.3).

Figure 11. Average value of p̂ (15 samples, P
= 0.3).

-------
Figure 12. Hypothetical resource with
subnominal area proportion of 0.4.
                                             The same  exercise  can be  conducted
                                        for  a map in which 400 square miles are
                                        subnominal (Figure 12). In this case the true
                                        proportion is 0.4  (i.e.,  400  square  miles
                                        divided  by  1000  square  miles).    The
                                        probability distribution for a true  proportion
                                        equal to 0.4  is depicted in Figure 13.  Now
                                        the  most likely values are near 0.4 and the
                                        mean of the probability distribution is exactly
                                        0.4  (Figure 14).

     The mean of the probability distribution
of values of p̂ is always exactly equal to the
true proportion.  This is true for any value of
the true proportion (from 0.0 to 1.0), and is
true regardless of the spatial pattern of
subnominal areas within the resource of
interest.  Also, the most likely values of p̂
are always near the true proportion.
Furthermore, by increasing the sample size,
the probability that the estimate will be very
close to the true proportion can be
increased.
Figure 13. Probability distribution of p̂ (15
samples, P = 0.4).

Figure 14. Average value of p̂ (15 samples, P
= 0.4).

-------
                                              II.C.2.  Sample Size and Variance
  Figure 15. Hypothetical resource with 30
  samples, subnominal area proportion of 0.2.
Figure 16. Probability distribution of p̂ (30
samples, P = 0.2).
     Intuitively, it seems that estimates
based on larger samples should be more
reliable than estimates based on just a few
locations.  The effect of sample size on the
probability distribution of values of p̂
supports this position.  As can be seen from
the following examples, the effect of sample
size on the probability distribution of values
of p̂ can be quite dramatic.

     First consider a random sample of 30
locations from a resource with a true
proportion of subnominal area equal to 0.2
(Figure 15).  The probability distribution of
values of p̂ in this case is depicted in Figure
16.  The probability distribution is much
more concentrated around the true
proportion of 0.2.  Also, notice that instead
of 16 possible values of p̂, there are now 31
possible values (i.e., 0/30, 1/30, 2/30, ...,
and 30/30).  This provides a finer scale of
resolution for the estimates.

     Now consider a random sample of 50
locations from the same resource (Figure
17).  The probability distribution of values of
p̂ is extremely concentrated around the true
proportion of 0.2 (Figure 18).  The scale of
resolution for the estimates is improved as
well.  There are now 51 possible values of p̂
(i.e., 0/50, 1/50, 2/50, ..., and 50/50) with
a step size between possible values of only
0.02 (i.e., 1/50).

     The benefits of increased sample size
are readily apparent from these examples.
The scale of resolution of the estimates is
improved and the spread or dispersion of the
probability distribution is reduced with
increased sample size.  More specifically, the
dispersion of the probability distribution of
-------
Figure 17. Hypothetical resource with 50
samples, subnominal area proportion of 0.2.

Figure 18. Probability distribution of p̂ (50
samples, P = 0.2).
possible values of p̂ (as measured by the
variance) is inversely proportional to the
sample size.  For example, if the true
proportion is 0.2 and the sample size is 30,
then the variance is equal to 0.0053 (i.e.,
[0.2 x 0.8] / 30), whereas if the sample size
is 50, then the variance is equal to only
0.0032 (i.e., [0.2 x 0.8] / 50).
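
The calculation above is simply Var(p̂) = P(1 - P)/n; a minimal sketch:

    def variance_of_p_hat(P, n):
        # Variance of the estimated proportion under Binomial sampling
        return P * (1 - P) / n

    print(round(variance_of_p_hat(0.2, 30), 4))   # 0.0053
    print(round(variance_of_p_hat(0.2, 50), 4))   # 0.0032
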

     The variance is a useful summary
measure of the spread of a probability
distribution.  It can also be used in
constructing confidence limits in some
cases.  For example, if the probability
distribution of the estimate is known to be
approximately Normal in shape (Figure 19),
then knowledge of the variance can be used
to construct confidence limits.  However, in
general, the Normal approximation may not
be accurate.  This is particularly true for
estimates based on relatively small sample
sizes.  The following section describes a
method for constructing confidence limits
that produces accurate limits for any sample
size.  Confidence limits based on the Normal
approximation are also described, and
conditions under which the Normal
approximation is appropriate are discussed.
                                          III.  Construction of Confidence Limits
Figure 19. Probability distribution of p̂ that is
approximately Normal.
     After reviewing Section II (Estimation of
Proportion of Area that is Subnominal) it
should be clear why confidence limits, in
addition to the estimate (p̂), are sometimes
needed.  Although values of p̂ near the true
proportion are the most likely values, it is
possible to observe a value of p̂ that is as
small as 0.0 or as large as 1.0, regardless of
the value of the true proportion.  This raises
the question: even though you have an

-------
 estimate  of  the proportion of area  that  is
 subnominal,  how can you  be  sure that the
 true  proportion isn't some other  number
 entirely?  Unfortunately, you can't be sure.
 However,  you can put confidence  limits
 around the estimate.

      In  this  section,  the recommended
 method for constructing confidence limits  is
 described and the  rationale for  using  the
 method is presented.  For  the purposes of
 this section,  the pattern of the response
 variable within the resource of interest  is
 treated as if  it is not known.  This is in
 contrast  to   the  previous section,  and
 represents the more realistic situation.

     This section  is  organized into  three
 parts.  In the first part,  the concept  of
 confidence limits for estimates of
 proportions    is   discussed,    and    the
 recommended  approach for  constructing
 confidence limits is presented.  Next, factors
 that affect the width of confidence limits are
 described. In the final part of this section,
 standard  graphs  and   tables  for  exact
 confidence  limits,  and the   use  of  an
 approximation (the  Normal  approximation)
 are discussed.
III.A. What are Confidence Limits?

     Confidence limits are bounds around
the estimate that are  determined such that
there is a known probability that the bounds
will  bracket  the  true  proportion.   For
example, 90% confidence limits have the
property that  over  many replications  of
sample  selection and confidence  interval
calculation,  90% of the  resulting intervals
will cover  the true value.  Therefore, with

-------
Figure 20. Probability distribution of p̂ (30
samples, P = 0.2); the shaded bars show
prob{p̂ ≥ 0.3}.
Figure 21. Construction of the lower 5%
confidence bound: cumulative probabilities of
p̂ ≥ 0.3 for true proportions of 0.19, 0.18,
0.17, and 0.16.
symmetric confidence limits there is  a  5%
chance that the  lower limit will be greater
than the true proportion and there is  a  5%
chance that the upper limit will  be less than
the true proportion.

     The approach for constructing
confidence limits for estimates of proportion
may be understood by considering a simple
example.  Suppose 30 locations are randomly
selected and measurement at these locations
generates an estimate of the proportion of
area that is subnominal equal to 0.3.  As is
almost universally the case, the true
proportion subnominal is not known.  To
place a lower bound on the estimate we can
start by asking the question: if the true
proportion were 0.20, what would be the
probability of observing an estimate of 0.3 or
larger?  The answer to the question can be
found in the probability distribution of values
of p̂ for the case of a true proportion equal to
0.20 and a sample size of 30 (Figure 16).
The answer is the sum of the probabilities
from 0.3 through 1.0 (Figure 20), which for
this example is 0.13.  This means that even
if the true proportion were as low as 0.2 there
would be a 13% chance of observing an
estimate of 0.3 or larger.  Therefore, 0.2 can
be taken as the lower 13% confidence
bound.

     If some pre-determined probability level
(e.g., 5%) is desired, a range of hypothesized
values of the true proportion could be
evaluated.  For example, the cumulative
probabilities (for values of p̂ of 0.3 through
1.0) could be computed for cases of the true
proportion being 0.19, 0.18, 0.17 and 0.16
(Figure 21).  The cumulative probabilities of
greater values of p̂ for these four scenarios
are 0.10, 0.08, 0.06, and 0.04.  Therefore,
the lower 5% confidence bound is between
0.17 and 0.16 (further evaluation can show

-------
                                          that, to two decimal places,  the bound is
                                          0.17).
Figure 22. Probability distribution of p̂ (30
samples, P = 0.45); the shaded bars show
prob{p̂ ≤ 0.3}.
Figure 23. Construction of the upper 5%
confidence bound: cumulative probabilities of
p̂ ≤ 0.3 for true proportions of 0.46, 0.47,
0.48, and 0.49.
     Similarly, to place an upper bound on
the estimate, we can start by asking the
question: if the true proportion were 0.45,
what would be the probability of observing
an estimate of 0.3 or smaller?  In this case
we need to examine the probability
distribution of p̂ for a sample size of 30,
assuming the true proportion is equal to
0.45.  The answer in this case is the sum of
probabilities from 0.0 through 0.3 (Figure
22), which for this example is 0.07.
Therefore, 0.45 can be taken as the upper
7% confidence bound.

     Again, if a pre-determined probability
level (e.g., 5%) is desired, a range of
hypothesized values of the true proportion
could be evaluated.  In this case, cumulative
probabilities (for values of p̂ of 0.0 through
0.3) could be computed for true proportion
values of 0.46, 0.47, 0.48 and 0.49 (Figure
23), for example.  The cumulative
probabilities for these four scenarios are
0.06, 0.04, 0.04 and 0.03.  Therefore, the
upper 5% confidence bound is between
0.46 and 0.47 (further evaluation can show
that, to two decimal places, the bound is
0.47).

     The upper 5% confidence bound and
the lower 5% confidence bound form the
90% confidence limits for the proportion of
area that is subnominal.  For this example
(i.e., with a sample size of 30 and an
observed value of p̂ equal to 0.3), the 90%
confidence limits are 0.17 and 0.47.  There
is a 90% chance that confidence limits
constructed in this manner will bracket
the true proportion, and a 10% chance
that the limits will miss the true proportion.
This result can be demonstrated by repeating

-------
 the random selection of 30 locations from a
 known pattern of subnominal areas (as was
 discussed in Section II.A) over and over. For
 each of the random selections, the value of p̂
 can  be  computed  and  the  corresponding
 confidence limits determined.   In 90 out of
 100 iterations (on average), the confidence
 limits  will  bracket the  true  proportion,
 regardless of the value of the true proportion.
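
The search over hypothesized true proportions described above can be carried
out directly with the Binomial distribution.  The sketch below is
illustrative only (it assumes the SciPy library is available, and the
function name and grid step are made up): the lower bound is taken as the
largest hypothesized P whose probability of producing p̂ ≥ x/n is at most
the chosen tail level, and the upper bound analogously for p̂ ≤ x/n.  This
is essentially the Clopper-Pearson construction walked through step by step
in the text.

    from scipy.stats import binom

    def exact_confidence_limits(x, n, confidence=0.90, step=0.0001):
        """Exact Binomial confidence limits for a proportion, by grid search."""
        tail = (1 - confidence) / 2
        grid = [i * step for i in range(1, int(1 / step))]
        # lower bound: largest hypothesized P with Pr(X >= x) <= tail
        lower = max((P for P in grid if 1 - binom.cdf(x - 1, n, P) <= tail), default=0.0)
        # upper bound: smallest hypothesized P with Pr(X <= x) <= tail
        upper = min((P for P in grid if binom.cdf(x, n, P) <= tail), default=1.0)
        return lower, upper

    print(exact_confidence_limits(x=9, n=30))   # about (0.17, 0.47), as in the example
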
III.B.  Factors  Affecting  Width  of  the
      Confidence Interval

     The interval between the lower and
upper 90% confidence limits is the 90%
confidence interval.  In the example
discussed above, the width of the 90%
confidence interval is 0.30 (0.47 minus
0.17).  Two factors (given a specific
estimate, p̂) affect the width of the
confidence interval: the confidence level
(e.g., 90%) and the sample size.  If a higher
level of confidence had been specified, say
95%, then the confidence interval would
have been wider.  On the other hand, if the
sample size had been larger, then the
confidence interval would have been
narrower.

    The fact that the  width of a confidence
interval  increases  as  the  confidence level
increases  is intuitively appealing.  This is
consistent with being  more confident about
making  general statements (wide intervals)
and being less confident about making more
specific  statements (narrow  intervals).  The
reason for this effect of confidence level on
confidence intervals is clear from the way the
confidence limits are determined.

 Figure 24. Probability distribution of p̂ (30 samples, P = 0.15).  [Bars show prob(p̂ ≥ 0.3); x-axis: Estimated Proportion of Subnominal Area.]

     For the example discussed above, if a higher confidence level is desired (e.g., 95% rather than 90%), then the upper and lower 2.5% confidence bounds would be used.  The true proportion would have to be 0.15 in order for the cumulative probability (from 0.3 through 1.0) to equal only 2.5% (Figure 24).  Similarly, the true proportion would have to be 0.49 in order for the cumulative probability (from 0.0 through 0.3) to equal only 2.5% (Figure 25).  The effect of varying confidence levels from 75% to 99% on the size of the confidence intervals is summarized in Figure 26 (for a sample size of 30 and an observed p̂ of 0.3).
 Figure 25. Probability distribution of p̂ (30 samples, P = 0.49).  [Bars show prob(p̂ ≤ 0.3); x-axis: Estimated Proportion of Subnominal Area.]
     Figure 26. Confidence interval widths as a function of confidence level (30 samples, p̂ = 0.3).  [Y-axis: confidence interval width; x-axis: confidence level.]

-------
Figure 27. Confidence interval width as a function of sample size (90% confidence, p̂ = 0.3).  [Y-axis: confidence interval width; x-axis: sample size, 20 to 100.]
     The fact that the width of the confidence interval decreases as the sample size increases is also intuitively appealing.  Increased sample size, as discussed in Section II, increases the reliability of estimates and should allow more detailed statements to be made without diminishing the level of confidence.  For example, with a sample size of 60 and assuming the true proportion is 0.17 (the lower 5% bound for a sample size of 30), there is only a 1% chance that the observed value of p̂ would be 0.3 or greater.  The 5% lower confidence bound in this case is 0.20.  Similar effects are exhibited with the upper confidence bound.  The effect of varying sample size on the size of confidence intervals is summarized in Figure 27 (for 90% confidence and an observed p̂ of 0.3) for a range of sample sizes from 10 to 100.
                                    III.C.  How to Compute Confidence Limits

                                        III.C. 1.   Standard Graphs and Tables
                                                  for Confidence Limits
                                        No special  calculations  are  needed  to
                                    determine confidence limits for the situations
                                    described above.  The  required  confidence
                                    limits   are  tabulated  and  summarized  in
                                    standard graphs in many textbooks  and
                                    handbooks on statistics (e.g., see W.H. Beyer
                                    [ed.] 1976. CRC  Handbook of  Tables  for
                                    Probability  and Statistics. CRC Press).  The
                                    confidence   limits  are  referred   to    as
                                    "Confidence Limits  for  Proportions" for  the
                                    "Binomial Distribution".  Separate tables  are
                                    published   for  different  confidence  levels
                                    (usually 90%, 95% and 99%). The tables are
 read by specifying x (referred to as the numerator or the number of successes) and n (referred to as the denominator or the sample size).  The corresponding
 table entries  are the lower and upper
 confidence limits.  This information is
 also summarized in graphs that depict
 the upper and lower confidence limits
 on  the  y-axis   and   the  estimated
 proportion on the x-axis.
   III.C.2.   Normal Approximation

       An alternative approach to the
 one  based on the Binomial distribution
 (described above)  is to construct the
 confidence limits based on the Normal
 distribution.        Construction     of
 confidence limits based on the Normal
 distribution provides a greater degree
 of    flexibility    which    can    be
 advantageous   for  more   complex
 sampling   designs   (e.g.,   stratified
 random designs discussed in  the next
 section).

       As  discussed  in  the  previous
 section the Binomial distribution exactly
 describes the probability distribution  of
 possible   values   of   the  estimate.
 However, under certain circumstances,
 the  Normal  distribution is  a  good
 approximation    to   the    Binomial
 distribution.  In these cases, confidence
 limits based on the Normal distribution
 may  be used  instead of those based on
 the Binomial  distribution.

      Approximate confidence limits,
 based on  the Normal distribution, are
 computed  from   a simple  formula.
Therefore,  confidence  limits  do  not
 have to be  restricted  to  confidence
 levels  and  sample  sizes  listed  in
 standard   tables   and  graphs,  and
 interpolations between tabulated values
 are  not required.   For  example,  a
 standard table of Binomial  confidence
 limits  may  list confidence  limits  for
 confidence levels  of 95% and 99%,
 and  may list sample sizes in steps of
 10 (e.g., 10, 20, 30, etc.). By using
 the   Normal  approximation,   any
 combination  of confidence  level  and
 sample size can be addressed directly
 (e.g., 85% confidence  and a sample
 size of 53). The Normal  approximation
 requires  information  on  only  two
 quantities:  a  coefficient based on the
 Normal distribution, and the variance of
 the probability distribution of possible values of the estimate (p̂).

   The variance  of  the  probability
 distribution of possible  values  of the
 estimate can be  estimated  as  the
 product of the estimated  proportion
 times   one  minus   the  estimated
 proportion, all divided by the sample
 size:

            \hat{p} (1 - \hat{p}) / n .

This is the  same as the expression for
 the  variance  presented  in   Section
 II.C.2, except that the true proportion
 (which in practice  is unknown) in this
 case is replaced by the estimate of the proportion (p̂).  For example, if p̂ is 0.4 and the sample size is 50, then the estimate of the variance is 0.0048 (i.e., (0.4 × 0.6)/50).

-------
     The  required  Normal  coefficients are
 tabulated  in  most  introductory  statistics
 textbooks  as well  as in  advanced  texts.
 Generally the tabulations are presented in
 steps of 1% or less (e.g., 90%, 91%, 92%,
 etc.).  For the standard confidence levels of
 90%, 95% and 99%, the  Normal coefficients
 (c) are 1.65, 1.96, and 2.58 respectively.

     The lower confidence limit based on the
 Normal approximation is simply the estimated
 value minus a quantity equal  to the product of
 the  Normal coefficient (c)  for  the desired
 confidence level and the square root of the
 estimated variance:
            \hat{p} - c \sqrt{ \hat{p} (1 - \hat{p}) / n } .

Similarly, the upper confidence limit based on
this approximation is the estimated value plus
that same quantity:
            \hat{p} + c \sqrt{ \hat{p} (1 - \hat{p}) / n } .

For a 95% confidence  interval based on the
example in the previous paragraph, the lower
confidence limit is 0.26,
            0.4 - 1.96 \sqrt{0.0048} ,


and the upper confidence limit is 0.54,
            0.4 + 1.96 \sqrt{0.0048} .
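     Because the Normal approximation involves only this simple formula, it is easy to compute directly.  The following sketch (not part of the original manual) shows one way; the function name normal_limits and the default coefficient are illustrative assumptions.

      from math import sqrt

      def normal_limits(p_hat, n, c=1.96):
          """Approximate limits: p_hat +/- c * sqrt(p_hat * (1 - p_hat) / n)."""
          half_width = c * sqrt(p_hat * (1.0 - p_hat) / n)
          return p_hat - half_width, p_hat + half_width

      # The example from the text: p_hat = 0.4, n = 50, 95% confidence
      print(normal_limits(0.4, 50))   # approximately (0.26, 0.54)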
     As   noted  previously,   the   Normal
approximation  is not always  accurate.   In
particular,  it is not accurate if the sample
size is too small.   Working rules have been

-------
established (e.g., see W.G. Cochran. 1977. Sampling Techniques. John Wiley and Sons) regarding the minimum sample size that is required when using the Normal approximation.  The required minimum sample size is larger for estimates of the proportion (values of p̂) close to 0.0 or 1.0, and is smallest if the estimate is 0.5 (Table 1).  The required minimum sample size for all situations is never less than 30 and is as large as 1400 when the estimated proportion is 0.05 (or 0.95).  Clearly, the Normal approximation must be applied with caution.  Whenever possible, the exact confidence limits based on the Binomial distribution should be used.

 Table 1. Minimum sample sizes required for the Normal approximation.

        p̂ (or 1 - p̂)          n
        0.5                   30
        0.4                   50
        0.3                   80
        0.1                  600
        0.05                1400
                                     IV.  Data    Analysis   for   Stratified
                                          Random Samples

                                          The discussion of data analysis methods
                                     up to this point has assumed that all locations
                                     within the resource of interest have an equal
                                     chance of being selected for sampling.  This
                                     may not  always be the case.   In  some
                                     situations,  areas  of special  interest (strata)
                                     may be  identified and additional  sampling
                                     effort expended in  these areas.  Although
                                     every location within a stratum may have an
                                     equal chance of being selected for sampling,
                                     a location within a special interest area would
                                     have a higher chance of being selected than
                                     a location outside the special interest area.
                                     To ensure that estimates are unbiased, the
                                     analysis of data from this type of sampling
                                     design  (referred to as  a stratified  random
                                     sampling  design)  must account  for  the
                                      different levels of sampling effort in the
                                     different strata.

                                          The recommended method for analyzing
                                     data from  stratified sampling  designs  is
                                     straight-forward and   intuitively   appealing.

-------
                                       The  method  is illustrated in the following
                                       paragraphs with a hypothetical example for
                                       the  case of  two  strata.    The  method,
                                       however, is not limited to two strata and can
                                       be extended to analyze data from a stratified
                                       random sample with any number of strata.
Figure 28. Hypothesized resource divided into two strata.

Figure 29. Stratified resource with 20 samples in Stratum 1 and 10 samples in Stratum 2.  [Legend: sampling points in non-subnominal area and in subnominal area.]
     Suppose the resource being studied is
divided into two strata: a 200 square mile
area of special interest (labeled Stratum 1 in
Figure 28) and the remaining  800 square
miles of the  resource (labeled  Stratum  2).
The intention is to be  able  to  characterize
the entire resource but also to  characterize
just the area  of  special interest.   For this
reason,  suppose  that  most  samples  are
allocated to stratum  1.   For this example,
the total sample size  of 30 is split between
the two strata with  20 samples  going  to
stratum 1 and 10 samples going to stratum
2.   Within  each  stratum,  the sampling
locations are selected  randomly (Figure 29).

     Two steps are needed to estimate the
proportion  of the total area  (i.e., the entire
resource)  that  is subnominal in this case.
First, a separate  estimate is computed for
each of the two strata (say p̂1 and p̂2) using
the  methods  described  in  the  foregoing
sections. For this example, the  estimate for
stratum 1 is based on 20 samples,  and the
estimate for stratum 2  is based  on   10
samples.  The second step is to compute a
weighted average of  these two estimates.
The weight associated with each stratum is
the ratio of the area of the stratum divided
by the total area of the resource.   For this
example the weight for stratum 1 is 0.2 (i.e.,
200/1000) and the weight for stratum 2  is

-------
0.8 (i.e., 800/1000).  Therefore, the weighted average (p̂) is:

        \hat{p} = 0.2 \hat{p}_1 + 0.8 \hat{p}_2 .

The resulting estimate for the entire resource is unbiased.
     A confidence interval for the overall estimate can be calculated based on the Normal approximation.  Using the previous example, the estimated variance of the estimate from stratum 1 is [p̂1(1-p̂1)]/20, while the estimated variance of the estimate from stratum 2 is [p̂2(1-p̂2)]/10.  The estimated variance of the weighted average is the weighted sum of the variances from the two strata:

        Var(\hat{p}) = (0.2)^2 [ \hat{p}_1 (1 - \hat{p}_1) / 20 ] + (0.8)^2 [ \hat{p}_2 (1 - \hat{p}_2) / 10 ] .

Each weight used to compute the overall variance is simply the square of the corresponding weight that was used to compute the overall proportion (p̂).

     The lower limit of the confidence interval is the weighted average proportion minus the quantity equal to the product of the Normal coefficient (c) for the desired confidence level, and the square root of the variance of the weighted average:

        \hat{p} - c \sqrt{ Var(\hat{p}) } .

Similarly, the upper confidence limit is the weighted average proportion plus that same quantity.  As with any use of the Normal approximation, adequate sample sizes in each stratum are necessary for reliable results.
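     The two-step procedure above (stratum estimates, then an area-weighted combination) can be written compactly.  The sketch below (not part of the original manual) illustrates it; the function name stratified_estimate, its argument layout, and the example stratum estimates are illustrative assumptions.

      from math import sqrt

      def stratified_estimate(p_hats, n_samples, areas, c=1.65):
          """Area-weighted estimate with a Normal-approximation interval."""
          total_area = sum(areas)
          weights = [a / total_area for a in areas]
          # Overall proportion: weighted average of the stratum estimates
          p_hat = sum(w * p for w, p in zip(weights, p_hats))
          # Overall variance: squared weights times the stratum variances
          var = sum(w**2 * p * (1.0 - p) / n
                    for w, p, n in zip(weights, p_hats, n_samples))
          half_width = c * sqrt(var)
          return p_hat, (p_hat - half_width, p_hat + half_width)

      # The layout from the text: 200 and 800 square mile strata with 20 and
      # 10 samples; the stratum estimates 0.35 and 0.25 are made up.
      print(stratified_estimate([0.35, 0.25], [20, 10], [200, 800]))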
                                         V.  Closing Comments

                                             The methods  described  in  this
                                         document are appropriate for analyzing
                                         data from simple random samples and
                                         stratified random samples.  However,
                                         some R-EMAP  programs use neither
                                         simple random  nor stratified  random
                                          sampling designs.  In these cases, the described methods should only be used as a last resort, and will only produce approximations.  A more general method of data analysis should be used that is consistent with the complexity of the sampling design.  This more general approach is conceptually similar to the described methods, but is more involved and requires that the probability of selecting each location and the probability of selecting every pair of locations be known.  This additional information may not always be available.  If any doubt exists regarding which method to use, the EMAP Statistics and Design Team at the EPA Corvallis Laboratory can make the determination.

                                            The hypothetical example  referred
                                         to  throughout  this   document  is
                                         intended simply as an instructional tool.
                                         The methods described are not limited
                                         to analyzing data from estuaries.  They
                                         are appropriate whether the purpose of

-------
sampling is to estimate the proportion of the number of resource units (e.g.,
numbers of lakes), the proportion of
total length of a resource (e.g., miles of
streams),  the  proportion of area of a
resource  (e.g.,  square  miles  of  an
estuary), or the proportion of volume of
a resource (e.g., cubic meters of one of
the Great Lakes).  The methods  can be
applied without modification to each of
these  situations.     More   detailed,
technical   documentation  on   the
methods described in this document is
available from the EMAP Statistics and
Design Team in Corvallis.

-------
                       EMAP Status Estimation:
                Statistical Procedures and Algorithms
                        V.M. LESSER AND W.S. OVERTON

                   Department of Statistics, Oregon State University,

                                 Corvallis, Oregon
                                 Project Officer

                                Anthony R. Olsen

                       U.S. Environmental Protection Agency

                        Environmental Research Laboratory

                       200 SW 35th Street, Corvallis, Oregon
The  information in this  document has been funded wholly or in part by  the  U.S.

Environmental  Protection  Agency  under cooperative agreement CR-816721 with Oregon

State University at  Corvallis.  It has been subject to the agency's peer and administrative

review. It has been approved for publication as an EPA document.

-------
                                  CONTENTS
INTRODUCTION                                                              1
1.1 Overall Design                                                              2
1.2 Resources                                                                  2
1.3 Response Variables of Interest                                                3
1.4 Statistical Methods                                                          4
1.5 Use of this Manual                                                          6

GENERAL THEORETICAL DEVELOPMENT                                    9
2.1 Design-Based Estimation Methods                                             9
    2.1.1  Discrete Resources                                                     9
          2.1.1.1 General Estimator and Assumptions                             10
          2.1.1.2 Tiered Structure                                              14
    2.1.2  Extensive Resources                                                   22
          2.1.2.1 Areal Samples                                                23
          2.1.2.2 Point Samples                                                25
          2.1.2.3 Alternative Variance Estimators                                 29
2.2 Model-Based Estimation Methods                                            33
    2.2.1  Prediction Estimator                                                  34
    2.2.2  Double Samples                                                      36
    2.2.3  Calibration                                                           37
2.3 Other Issues                                                               38
    2.3.1  Missing Data                                                         38
          2.3.1.1 Missing Sampling Units                                        38
          2.3.1.2 Missing Values within Sampling Units                            39
    2.3.2  Censored Data                                                        39
    2.3.3  Combining Strata                                                     41
          2.3.3.1 Discrete Resources                                             42
          2.3.3.2 Extensive Resources                                            43
    2.3.4  Additional Sources of Error                                             43
    2.3.5  Supplementary Units or Sites                                           44

DISTRIBUTION FUNCTION ALGORITHMS                                     45
3.1  Discrete Resources                                                          46
    3.1.1  Estimation of Numbers                                                47
          Equal probability sampling                                            47
                Case 1 - N known/unknown, Na known                               48
                Case 2 - N known, Na unknown                                     53
                Case 3 - N known/unknown, Na known/unknown                       55
          Variable probability sampling                                          56
                Case 4 - Na unknown, or Na known and equal to the estimate       57
                Case 5 - Na known and not equal to the estimate                  60
    3.1.2 Proportions of Numbers                                                 61
          Equal probability sampling                                             61
                Case 6 - Na known/unknown                                        62
                Case 7 - Na known                                                67
          Variable probability sampling                                          68
                Case 8 - Na unknown, or Na known and not equal to the estimate   69
                Case 9 - Na known and equal to the estimate                      72
    3.1.3  Rationales for the Algorithms in Section 1.1 and 1.2                        74

-------
                                       SECTION 1




                                    INTRODUCTION











       The Environmental  Protection Agency (EPA) has  initiated a  program  to  monitor




 ecological status and  trends and  to establish  baseline environmental  conditions against




 which  future  changes  can be  monitored  (Messer et al.,  1991).   The  objective of  this




 environmental program, referred to as EMAP (Environmental Monitoring and  Assessment




 Program),  is to assess  the status  of  a  number  of different ecological resources, including




 surface  waters,  wetlands,  Great  Lakes,  near-coastal   waters,  forests,  arid   lands,  and




 agroecosystems.






       A  design  plan and a number  of support documents have been  prepared  to  guide




 design development for  EMAP (Overton et al., 1990; Overton  and Stehman, 1990;  Stehman




 and  Overton,  in press; Stevens,  in  press).   The statistical  methods outlined in earlier




 documents, such as those analyzing  the  EPA  National Surface Water Surveys,  are also




 relevant to  EMAP (Linthurst et al., 1986; Landers et al., 1987;  Messer et al., 1986).





       This report  presents statistical procedures  collected from other EMAP documents, as




 well  as from  Oregon  State University technical  reports describing data  analyses for  other




 EPA designs.  By integrating  this  information,  this  manual and the EMAP design report




 will serve as reference  sources  for statisticians  who  implement  an ecological  monitoring




 program  based on  the EMAP design  framework.  Spatial and temporal analyses of EMAP




 data are  not covered in  this version  of the report.   A brief discussion of the four-point




 moving average, which  combines data  over the interpenetrating sample,  is  presented in




Overton et  al. (1990; Section 4.3.7).  Algorithms listed in this  report cover most design




options discussed in the  EMAP design  report. It  is expected that  any further realizations of




 the EMAP design will also include documentation of corresponding variance estimators.

-------
 1.1  Overall Design






       The EMAP design  is a probability sample of resource units or  sites that is based  on




 two tiers of sampling.   The first  tier (Tier 1) primarily supports landuse characterization




 and description  of  population structure,  while the  second tier  (Tier 2) supports status




 assessment by the indicators.  The second tier sample is  a probability  subsample of the first




 tier sample; such a sample is referred to as  a double sample.  Across the ecological resource




 groups, it is expected that discrete, continuous, and extensive populations will be monitored.




 The statistical methods outlined  in  this report address  these different population types  at




 both sampling tiers.   A description of the  sampling design is presented  in Overton et  al.




 (1990).










 1.2  Resources






       EMAP is designed to provide the capability of sampling any ecological resource.  To




 achieve this objective, explicit design plans must be specific for a particular resource and all




 resources to be  characterized must be  identified.  Currently, the resources to be sampled




 within EMAP include:  surface waters, wetlands, forests, agroecosystems,  arid lands, Great




 Lakes, and near-coastal wetlands.  These resources are further divided by major classes  to




 represent the specific 'resource'  that will be  addressed by  the sampling effort.  For example,




 surface waters can be partitioned  into  classes such as very small lakes,  intermediate-sized




 lakes, very  large lakes, very small streams, intermediate-sized streams,  rivers, and very large




 rivers.   Because  each class  potentially generates different sampling issues, each would  be




considered a different entity. The design structure meets this condition by identifying each




such class as a resource, thereby  resulting in 6 to 12 surface water resources.   Each major




resource group may also have as many divisions.






       Most resources will  be sampled via the basic EMAP  grid and  associated structures.




However, other  resources,  such as very large lakes and  very large rivers, represent  unique

-------
 ecological entities and cannot  be treated as  members of a  population of entities to  be




 described via a sample of the set. For example, Lake Superior and the Mississippi River are




 unique, although the tributaries of the Mississippi might be treated as members of a wider




 class of tributaries.






       Resources sampled by  the  EMAP grid  will be associated with  an  explicit domain in




 space, within which the resource is confined.  This domain should  be established early in the




 design process. Within the defined domain, it  is not expected  that the resource will occupy




 all space  or  that  no other resource will  occur.  Domains of different resources  will overlap,




 but the domain  of  a particular  resource  is an  important parameter of its design.   For




 purposes  of  nomenclature, the resource  domain is a region containing the resources.   The




 resource universe is either a point set with one point for each resource unit (discrete




 resource)  or  the continuous space actually  occupied by the resource (for extensive resources).




 A resource class will be a subset of its universe. Such a class may or may not be treated as




 a sampling stratum and may or may not have an  identified subdomain.









 1.3  Response Variables of Interest





        The term response variable is used generally for the measured characteristic of




 interest in the sample survey.  In EMAP, a special class of response variables is referred to




 as indicators, such as indicators of ecological status (Hunsaker and Carpenter, 1990). These




 indicators are  the environmental and ecological variables measured in the field on  resource




 units or at  resource  sites;  they may be measured  directly or  modified via  formulae or




analytic protocols.





       The term, indicators, should not be applied to the structural variables defined at Tier




 1.   The Tier 1 variables are used to estimate population structure and to partition  the Tier




 1 sample  into  the  necessary  structural  parts for  Tier  2.  Then the indicator variables are




determined  on the  Tier 2 sample.   When the Tier 2  sample includes  the entire Tier 1

-------
sample, it is  still  appropriate  to  make the distinction  between indicator and structural




variables,  both of which are response variables.  Because of this distinction, it is sometimes




appropriate to distinguish  Tier 1 and Tier 2 in terms of the variables, rather than strictly in




terms of a subsample.










 1.4  Statistical Methods






       The primary statistic used  to summarize population characteristics  is the estimated




 distribution function.  This distribution function estimates the number or proportion of




units or sites for which the value of the indicator is  equal to  or less than y.   For discrete




resources,  the  estimated distribution  function for numbers is designated as N(y), while the




estimated  distribution  function  for the proportion of numbers is designated as  F(y).  The




estimated  distribution  function of size-weighted totals in discrete resources is designated  as




Z(y), while a size-weighted proportion is designated as G(y).   Examples of size-weights are




 lake area, stream miles, and stream discharge.  There are no distribution functions




comparable to Z(y) and G(y)  in the continuous and extensive populations because there are




no objects in  these resources.   Therefore, there are  no object sizes to use as weights.  In




 extensive resources, the estimated distribution function representing actual areal extent for




which the value of the indicator is equal to or  less than y is designated as A(y), while the




 proportion of areal extent is designated as F(y).  Thus A(y) is analogous to N(y); A is the




size of an  extensive resource and N is the size of a finite resource.






       A number  of estimates of interest, which can be obtained from  the  distribution




function, have been used  quite successfully in the National Surface Water Surveys (NSWS)




 (Linthurst et al., 1986; Landers et al., 1987; Kaufmann et al., 1988).  For example, any




quantile,  including  the median of  the  distribution,  can  be  interpolated  easily from the




distribution  function.    In addition, the distribution function can be supplemented with




tables of means,  quantiles, or  any other statistics of  particular interest, providing  greater




 accuracy than can be obtained from the plotted distribution function.

-------
        The basic formula for estimated distribution functions is

        \hat{F}_a(y) = \frac{\sum_{S_a^y} w_i}{\sum_{S_a} w_i} = \frac{\sum_{S_a^y} w_i}{\hat{N}_a} ,

where S is the sample of units representing the universe (U) and the variable y represents any response variable.  The subscript a denotes a subpopulation of interest; S_a is the portion of the sample in subpopulation a, and S_a^y is the portion of the sample in subpopulation a having values <= y.  We associate the inclusion probability, π_i, with each ith sampling unit.  Each sampling unit is a representation of a subset of the universe, and the weight (w_i = 1/π_i) accounts for the size of the subset.  N_a denotes the estimated population size for the subpopulation a.  F_a(y) is calculated for each value of y appearing in the sample.

        As given, F_a(y) is a step function not suitable to general EMAP needs; a smoothed version is desirable.  Thus, we propose the following method.  In this method, F_a(y) is replaced by [F_a(y) + F_a(y')]/2, where y' is the next lesser value to y.  For the minimum value of y, F_a(y) is replaced by F_a(y)/2.  A linear interpolation is then made between these points

to  generate the  plot  or  to determine  quantiles.   For  each  of the distribution  function

algorithms provided in this report, two successive values are averaged in this manner and

used to develop an interpolated distribution function.  Confidence bounds are constructed on

F0(y) and then averaged and interpolated in the same manner.  We rest justification for this

procedure on the interpretation of the initial and  final values  of the resulting distribution

function.   The initial value is our best estimate of the proportion of the population below

the minimum observed value, and  similarly, one minus the last  point is our best estimate of

the proportion of the population above the maximum observation.
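       The estimation and smoothing steps described above can be sketched in a few lines of code.  The sketch below is not part of the original manual; the function name weighted_cdf_points, the handling of tied values, and the example data are illustrative assumptions.

       import numpy as np

       def weighted_cdf_points(y, weights):
           """Return (y, smoothed F_a(y)) pairs for linear interpolation."""
           y = np.asarray(y, dtype=float)
           w = np.asarray(weights, dtype=float)     # w_i = 1 / pi_i
           order = np.argsort(y)
           y, w = y[order], w[order]
           n_hat = w.sum()                          # estimated subpopulation size
           step_f = np.cumsum(w) / n_hat            # step-function values at each y
           # Average each value with the preceding one; halve the minimum.
           prev = np.concatenate(([0.0], step_f[:-1]))
           smooth_f = (step_f + prev) / 2.0
           return y, smooth_f

       # Example with made-up indicator values and equal weights of 16
       ys, fs = weighted_cdf_points([4.2, 1.0, 3.5, 2.8], [16, 16, 16, 16])
       print(list(zip(ys, fs)))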


       Computation of the distribution function and the associated confidence bounds differs

 slightly for specific resource groups, reflecting differences in detail of the sampling design.



-------
For  example, in  some  cases simplifications of the computing algorithms result  from equal

probability designs.  Some algorithms are presented in this report to accommodate the range

of conditions and objectives anticipated  across the resource groups.  These algorithms have

been previously discussed in greater detail in other documents; references are given for those

requiring  a  more in-depth  approach.   Table  1 provides  an  outline of  the  distribution

functions, which  are presented in greater detail  throughout this  report.  Table 2 provides a

table of notation  used throughout this document.



1.5  Use of this Manual


       The body  of this manual has been separated into two sections.  Section 2 includes the

 general theoretical development upon which the algorithms are based.  Formulae for

discrete,  continuous, and  extensive resources are presented;  the mathematical notation  is

introduced and defined; and both  design-based and model-based approaches for  computing

the distribution  functions are  discussed.  Other issues relevant  to  the analysis of EMAP

data, such as handling of missing data, are also discussed in this section.


       Section 3  includes  the  algorithms used  to  produce  the  distribution functions,  the

conditions that provide for the application of the algorithms, and the rationales that support

the choice and derivation  of  the  algorithms presented.   References will be made  to  the

general formulae  (discussed in Section 2)  used to develop these algorithms.


       The following list  outlines a  step-by-step sequence for  obtaining  the distribution

functions:

       1.  Determine whether the data represent a discrete or extensive resource.
       2.  Determine the type of distribution function to compute.  For example, for discrete
          resources, the distribution  of numbers and/or proportions of numbers will be
          of interest.
       3.  Determine whether the sampling units were collected with equal or variable
           probability of selection.  The inclusion probabilities, π_i and π_ij, discussed in
          Section 2.1.1, are to be a permanent part of a datum record, as are the
          identification code of the sampling unit and the variable of interest. In some
          cases,  it is also necessary to identify the grid point, which can be included as
          part of the identification code.
       4.  Determine whether the size of the subpopulation of interest is known or unknown.


-------
           The subpopulation is the group of population units about which one wishes to
           draw inference.
       5.  Using the conditions from steps 1-4, refer to the example of that specific algorithm
           in Section 3.
       6.  Optional, but suggested:  Refer to the formulae referred to in Section 2 for a
           description of the formulae and for clarification of any notational  problems.
       7.  Optional:  An algorithm to obtain specific quantiles is presented in Section 3.

       This manual is expected  to be updated as research  continues in the development of

statistical  procedures for  EMAP,  as  EMAP adapts  to changing concerns and orientation,

and as EMAP  makes and accumulates more in-depth frame materials.   For  example, efforts

to date  have been focused on  design-based  approaches  to  confidence bound  estimation,

therefore this version  reflects a fairly in-depth  approach to design-based estimation  over all

 resources.  Further discussion of model-based approaches currently under development is

expected in future versions of this manual.

-------
                                       SECTION 2




                      GENERAL THEORETICAL DEVELOPMENT










       Two approaches are commonly used to draw inferences from a sample survey relative




 to  a  real  population.   In  the design-based  approach,  described  in  Section  2.1, the




 probabilities of selection are accounted for by the estimators and the properties of inferences




 are derived from the design and analytical protocol.  In contrast, the model-based approach




 (Section 2.2) assumes  a model and requires knowledge of auxiliary variables  for inference.




 Properties of  model-based inference  are  derived from the assumed model and  analytical




 protocol.  A model-based estimator takes into account only model features, while a model-




 assisted  estimator  takes into account  both model and design features.   For a discussion of




 the relationship between these two approaches and  the way they are used together, refer to




 Hansen et  al. (1983).  The paper by Smith (1976) also provides useful insight.
2.1  Design-Based Estimation Methods




 2.1.1  Discrete Resources





       A  population of natural  units  readily identified  as  objects is  defined  as  a discrete




resource.    For  example,  lakes,  stream segments,  farms, and  prairie  potholes are  all




considered discrete resources.  Populations of a large  number of discrete resource  units that




can  be described by a sample are considered for EMAP  representation.  It is suggested,  for




example,  that lakes less  than 2,000 hectares  be characterized as populations  of discrete




resources.   Distribution functions of the numbers of units or proportions  of these numbers




may be  of interest.  On the other hand,  very  large lakes are  unique,  and less usefully




characterized as members of populations of lakes.

-------
2.1.1.1  General Estimator and Approximations of Design-Based Formulae


       Because the  EMAP design is based on a probability sample, design-based estimators,

which  account for this  structure, are applicable.  The Horvitz-Thompson theorem (Horvitz

and  Thompson, 1952)  provides general estimators  of the  population attributes for general

probability  samples  and  for  estimators of variance  of  these estimators (Overton and

Stehman, 1993a; Overton et al., 1990).


        In Horvitz-Thompson estimation, the probability of inclusion, π_i, is associated with

the ith sampling unit.  Each  sampling unit is a representation  of a subset of  the universe,

 and the weight (w_i = 1/π_i) accounts for the size of the subset.  Therefore, estimates of

population parameters,  such as totals or means, simply  sum the variables collected over the

sampling  units, expanding them by the sampling weights.  The Horvitz-Thompson  estimator

proposed  for EMAP is unbiased for the population (and subpopulation)  totals and means, if

 π_i > 0 for all units in the population.


        The general form of the Horvitz-Thompson estimator is

        \hat{T}_y = \sum_{S} w_i y_i ,                                                                        (2)

where S is the sample of units representing the universe (U), w_i is the weight, and the variable y represents any response variable.  The total of y on the universe is defined as T_y = \sum_{U} y_i and is generally referred to as the population total.  This estimator (Equation 2) yields estimates of many parameters simply by the definition of y.  For example, if y_i = 1 for all units in the population, then T_y = N, the population size; it follows that \hat{N} = \sum_{S} w_i.

       Suppose  further  that we are interested in a subpopulation, a.  The  portion of the

sample, Sa, that  came from  this subpopulation  is also  a probability sample from this

 subpopulation.  To obtain parameter estimates for a subpopulation, Equation 2 is simply summed over the subpopulation sample,

        \hat{T}_{ya} = \sum_{S_a} w_i y_i .                                                                   (3)
       The Horvitz-Thompson theorem also  provides  for an  unbiased estimator of the



 variance of these estimators under the condition that π_ij > 0 for all pairs of units in the



 population.  The quantity π_ij is the probability that unit i and unit j are simultaneously



 included  in the sample.  The estimator of the variance is designated  in lower case as var,



 and w_ij is the inverse of the pairwise inclusion probability.  The variance of T_ya or N_a is



 obtained  by the choice of y,:
        var(\hat{T}_{ya}) = \sum_{S_a} y_i^2 \, w_i (w_i - 1) + \sum_{S_a} \sum_{S_a,\, j \ne i} y_i y_j (w_i w_j - w_{ij}) ,             (4a)

        var(\hat{N}_a) = \sum_{S_a} w_i (w_i - 1) + \sum_{S_a} \sum_{S_a,\, j \ne i} (w_i w_j - w_{ij}) .                                  (4b)
This  presentation shows that the form  of the variance estimator does not  change when



estimating variance  for the estimator based on a full sample or a subset of the sample.  The



subsetting device, with summation  over  the appropriate subset of the  sample, will  always



represent the appropriate estimator.  The principal reason for using the Horvitz-Thompson



form  (Equation 4)  is its  subsetting capability; the  commonly used form  for  the  Yates-



Grundy variance estimator does not  permit the convenience of subsetting.
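       The subsetting device is easy to express in code: the same formulas are simply summed over the part of the sample that falls in the subpopulation.  The sketch below is not part of the original manual; the function names, and passing the pairwise weights w_ij as a matrix, are illustrative assumptions.

       import numpy as np

       def ht_total(y, w, in_subpop):
           """Equation 3: sum of w_i * y_i over the subpopulation sample."""
           y, w = np.asarray(y, float), np.asarray(w, float)
           s = np.asarray(in_subpop, bool)
           return float(np.sum(w[s] * y[s]))

       def ht_variance(y, w, w_pair, in_subpop):
           """Equation 4a summed over the subpopulation sample."""
           s = np.asarray(in_subpop, bool)
           y = np.asarray(y, float)[s]
           w = np.asarray(w, float)[s]
           wp = np.asarray(w_pair, float)[np.ix_(s, s)]
           var = float(np.sum(y**2 * w * (w - 1.0)))
           for i in range(len(y)):
               for j in range(len(y)):
                   if i != j:
                       var += y[i] * y[j] * (w[i] * w[j] - wp[i, j])
           return var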




      The  EMAP  design  is  based on  a systematic  sampling  scheme.   The  Horvitz-



Thompson theorem does not provide a design-unbiased estimator of variance based on this



design, because some pairwise inclusion probabilities are zero.  The following sections include



a discussion of assumptions and approximations applied to Equation 4 in order to apply this



variance estimator in EMAP.




-------
Systematic sampling






       Because  EMAP units are selected by  a systematic sampling design, many  of the




 pairwise inclusion probabilities (π_ij) equal zero and an unbiased variance estimator is not




available.  However, it has been established that in many  cases the variance of a systematic




design can be satisfactorily approximated by the variance  that applies to a sample taken on




a  randomly  ordered list  (cf., Wolter,  1985).   A  common  systematic sample selected on  a




randomly ordered list  is a simple random sample.  Therefore, a simple random sample is an




approximate  model  for  an  equiprobable systematic  sample.    The  randomized  model




proposed  here  provides  approximate variance  estimation for  a  systematic  variable




probability design.






       A  modification  of the  randomized  sampling model  provides  only   for  'local'




randomization of the  position of the population units, rather than global  randomization.




Good  behavior of  the variance estimator results  from  this  assumption  (Overton  and




Stehman, 1993b).  As a consequence, we can justify use of the suggested pairwise inclusion




probability  with less  restriction as compared  with  the global randomization assumption.




We will refer to the local randomization model  as the weak randomization assumption.






       The  Horvitz-Thompson  estimator  of  variance,  Equation 4, is thus  proposed  for




EMAP indicators under  the  weak  randomization  assumption.    The simulation  studies




conducted on the behavior of this estimator suggested that  this assumption was adequate in




most situations expected for EMAP (Overton,  1987a; Overton and Stehman, 1987;  Overton




and  Stehman,  1992; Stehman and  Overton, in  press). In a few situations  the estimator




overestimated the true variance, thus providing for a conservative estimate of variance.  In




certain circumstances,  as discussed  in Section  2.1.2.3, it is  appropriate  to modify  the




estimation methods to account for the spatial patterns.

-------
 Pairwise inclusion probability




       Approximations for the pairwise inclusion  probability  under the randomized model



 have been proposed in the literature (Hartley and Rao, 1962).  A major disadvantage with



 these approximations, as discussed by Stehman and Overton (1989), is the requirement that



 all inclusion probabilities  for the population must  be known.   For large populations such as



 those studied in EMAP,  it is practically impossible to obtain inclusion probabilities for all



 units in the  populations.  Another approximation for this  pairwise  inclusion probability



 requires that the inclusion probabilities be known only on the sample (Overton,  1987b).



 The formula for the inverse of this pairwise inclusion probability is

        w_{ij} = \frac{2 n w_i w_j - w_i - w_j}{2(n - 1)} ,                                                   (5)

 where n is the sample size.




       Investigation of this approximation indicates that it performs at least as well as other



 commonly recommended  approximations (Overton and  Stehman,  1992; Overton,  1987a).



 Therefore, this pairwise inclusion probability  will be used  in  the approximation  of the



 variance  estimator for the population parameter  estimates collected  in EMAP,  for those



 circumstances in which the randomization assumption is justified.
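       As a small illustration (not part of the original manual), the approximation in Equation 5 is a one-line computation; the function name pairwise_weight is an illustrative assumption.

       def pairwise_weight(w_i, w_j, n):
           """Equation 5: w_ij = (2*n*w_i*w_j - w_i - w_j) / (2*(n - 1))."""
           return (2.0 * n * w_i * w_j - w_i - w_j) / (2.0 * (n - 1))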




       This variance estimator (Equation 4) accommodates variable probability of selection,



 but it is also appropriate for equal probability designs.  The approximation for the pairwise



 weight given in Equation 5 is also appropriate for randomized equal probability designs.  As



 a consequence, Equation 4 with 5  is valid for either equal or variable  probability  selection,



 under  the weak assumption of a randomized model,  as  discussed above under systematic



sampling.




       When the randomization  model  is not  acceptable,  alternative  variance  estimators,



based on  the mean square successive difference, have been developed for use with an equal

-------
probability systematic design and regular spacing (Overton  and Stehman,  1993a).   The




conditions  and assessment  of these  and other  variance  estimators  are presently under




investigation; subsequent  versions  of  this document  will discuss  these alternate methods.




Extension  must be made  to account for irregular spacing.  It should also be noted that in




some circumstances the methods of spatial  statistics may provide adequate assessment of




variance.






       The  confidence  bounds obtained  using  the Horvitz-Thompson variance estimator




(Equation  4)  are based on normal  approximation.  This approximation may be inadequate




for estimating confidence  bounds at the tails of the distribution, even  for moderate sample




sizes.    In  the special  case  of equal  probability  of  selection  and  the  randomization




assumption, confidence bounds can  be  obtained by exact methods (see Section 3). However,




exact methods may also yield inadequate confidence bounds at the tails of the distribution




(also discussed in Section 3).











2.1.1.2  Tiered Structure






       The  following description of the  tiered structure was summarized in  the EMAP design




 report (Overton et al., 1990).










The Tier 1 sample






       The  EMAP  sample design  partitions the area of the  United States into hexagons,




each comprising approximately 635 square kilometers (Overton et al., 1990), and selects a




point at an identical  position in  each hexagon;  selection  of  this one  position is  random




(equiprobable) over the hexagon.  This method results in a triangular grid of equally spaced




points.   An areal sample of a 40-km2  hexagon (40-hex) is imposed on  each point, with  the




 sampled hexagonal area containing 1/16 of the total area of the larger hexagon.  This fraction, 1/16, therefore represents a constant inclusion probability, π, and 16 represents a constant

-------
 weight,  w, to  be applied to each fixed-size areal sample.  Because other enhancements of the
 grid  are expected,  possibly  with different sized areal samples,  the following formulae will
 incorporate general notation.

       No  detailed  characterization  of indicators is collected at  Tier 1, so no distribution
 functions  will  be  computed based on the Tier 1 data.  It is of interest, however, to estimate
 the total number of discrete resource units in specific populations at Tier 1.  This  estimation
 is possible for  any resource class for  which units can be uniquely located by a position point.
 The  following formula can  be  used  to estimate the total number of units  for a particular
 resource (r) at Tier 1:

        \hat{N}_r = w \sum_{S_{D_r}} n_{ri} ,                                                                  (6)

 where D_r is the domain for resource r.  A domain of a resource is a feature of the spatial

frame that delineates the entire area within which  a sample might encounter  the resource

(Section 1.1).   In these formulae, the quantity nri  represents the number of units for  the

particular  resource at grid point i.  The variance can be estimated using Equation 4b, as

follows:
        var(\hat{N}_r) = w(w - 1) \left[ \sum_{S_{D_r}} n_{ri}^2 - \frac{1}{n_r - 1} \sum_{S_{D_r}} \sum_{j \ne i} n_{ri} n_{rj} \right] ,        (7)
where nr is the number of grid points for which the areal sample hexagon includes part of
the domain of the resource.  It is worth noting again that the estimates of variance are often
expected to slightly overestimate variance if the systematic design results in  greater precision
than would a randomized design, thus providing for a conservative estimate  of variance.

-------
The reduced Tier 1 sample


       In preparation for selecting the Tier 2 sample, resource classes are identified.  Some of

these classes will be treated as sampling strata, and hence be designated as 'resources'.  The

Tier 1 sample for  such a 'resource' is reduced so that it contains only  one unit at  each 40-

hex  at  which  that resource is represented.  This condition effectively changes the sample

from a set of systematic areal  samples to a spatially well-distributed subset of units  from the

population of units for the particular  resource.


       A  consequence of this sample  reduction step is the introduction  of variable inclusion

probabilities in  the Tier 1 sample,  reflecting  the scheme  used to reduce  nri  to  1.   For

example, if a random  sample  of size  1  is selected from the nri units of hexagon z,  then the

 selected unit will have π_1ri = π / n_ri.  A consequence of this is that \hat{N}_r = \sum_{S_{1r}} w_{1ri} = w \sum_{S_{1r}} n_{ri},
 where S_1r is now the sample of points for which n_ri > 0; at each of these points, the sample

now consists of one unit of resource r.  Because this estimate, Nr, is identical to the original

 Tier 1 estimate, it has the same variance.  This sample, S_1r, is then subsampled to generate

 the Tier 2 sample, S_2r.  Again, note that it is now a resource-specific sample of units, not

the original areal sample.




The Tier 2 sample


      The Tier 2  sample, S2, is a probability subsample, a double sample, of the Tier 1

sample of resource units.  At this  tier, a specific resource has been identified and the reader

should remember  that subsequent equations are  for specific resources.   The reader should

 also be aware that the subscript i will now index a resource unit, not the grid point.  All

Tier 2 samples for  discrete resources consist of individual units  from the universe of discrete

resource  units.


      With  these  changes,  the estimator presented in Equation  2 is  appropriate for  the

sample  collected at  the second tier.   The indicator values are summed over  the  samples

-------
 surveyed at the second tier, expanded by the assigned weights.  The inclusion probabilities account for




 the probability structure of  this double sample.   Overton et  al.  (1990) identified  the




 probability of the inclusion of the ith unit in the Tier 2 sample as the product of the Tier 1




 inclusion probability and the conditional Tier 2 inclusion  probability.  The conditional Tier




 2 inclusion probability is defined as  the  probability of inclusion  at Tier 2, given  that the




 unit was included at Tier 1.  This product is still conditional on the Tier  1 sample and leads




 to conditional Horvitz-Thompson estimation.






       In subsequent equations,  the subscripts 1 and 2 represent the first and second tiers,




 respectively.  The weighting factor for unit i at Tier  2 is defined as










                                     w_{2ri} = w_{1ri} \, w_{2 \cdot 1ri} ,                                (8)









 where w_1ri is the weighting factor for the ith unit in the Tier 1 reduced sample and w_{2·1ri} is




 the inverse of the conditional Tier 2 inclusion probability for resource r.






       Selection of the Tier 2 sample from the reduced  Tier 1 sample and calculation of the




 conditional Tier 2 inclusion  probabilities  are  discussed in Section 4.0  of the EMAP  design




 report (Overton et al.,  1990).  This procedure generates a list in a specific order, based on




 spatial clusters.  Clusters of  40-hexes are arbitrarily constructed  with uniform  size  of the




 initial Tier 1 sample of the specific resource. The reduced Tier 1 sample is sorted at random




 within clusters, and then the clusters are arranged in an arbitrary order.  A subsample of




fixed size, n2r, is selected  from Slr by ordered  variable probability systematic sampling from




this list.  The purpose of this elaborate procedure is to generate a spatially well-distributed





Tier 2 sample.
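
       For illustration only, a minimal Python sketch of ordered variable probability systematic
selection is given below.  It assumes the reduced Tier 1 sample has already been arranged in the
cluster-based order described above and that certainty units have been removed; the function and
argument names are illustrative and are not taken from any EMAP software.

import random

def systematic_pps(weights, n):
    """Select n units by ordered variable-probability systematic sampling.

    weights : Tier 1 weights (selection probability proportional to weight)
              of the already-ordered reduced sample.
    Assumes every weight is smaller than the sampling interval, i.e. that
    'certainty' units have been set aside beforehand.
    Returns the positions of the selected units.
    """
    total = sum(weights)
    step = total / n                        # one selection per interval of this size
    start = random.uniform(0, step)         # single random start
    targets = [start + k * step for k in range(n)]

    chosen, cum, t = [], 0.0, 0
    for i, w in enumerate(weights):
        cum += w
        while t < n and targets[t] <= cum:  # target falls inside this unit's interval
            chosen.append(i)
            t += 1
    return chosen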






       The Tier 2  conditional inclusion probabilities are proportional to the weights at  Tier





1:




                                     $\pi_{2\cdot 1ri} = n_{2r}\, w_{1ri} \big/ \hat{N}_r$   ,                      (9)

-------
where $\hat{N}_r$ was defined for a specific resource in Equation 6.  However, for some units

$w_{1ri} > \hat{N}_r / n_{2r}$.  To obtain conditional inclusion probabilities that do not exceed 1, these units are placed into an

artificial 'certainty' stratum, all having $\pi_{2\cdot 1ri} = 1$.  This step takes place prior to the cluster


formation.   For the remaining units, the selection protocol and achieved  probabilities are


modified  to adjust for  the  number of units having probability 1. These remaining units now


have conditional inclusion  probabilities:
                                    $\pi_{2\cdot 1ri} = n'_{2r}\, w_{1ri} \Big/ \sum_{S'_{1r}} w_{1ri}$   ,                               (10)



where $n'_{2r}$ equals $n_{2r}$ less the number of units entering $S_{2r}$ with probability 1, and $S'_{1r}$ equals

$S_{1r}$ less these same units that were included with probability 1.



       Note  that this selection protocol is designed to create Tier 2 inclusion probabilities as


nearly equal as possible:




                       $\pi_{2ri} = \pi_{1ri}$ ,   if i is in the artificial stratum with $\pi_{2\cdot 1ri} = 1$ ;

                       $\pi_{2ri} = n'_{2r} \Big/ \sum_{S'_{1r}} w_{1ri}$ ,   otherwise,                                (11)

and if no units are in the artificial stratum,

                                     $\pi_{2ri} = n_{2r} \big/ \hat{N}_r$   ,                                (12)

where $\hat{N}_r$ is the Tier 1 estimate of the total number of population units in resource r.  For


generality, we  will retain  the  variable probability notation, but ideally  the sample will now


be equal  probability.  If there is great deviation from  equiprobability, then consideration


should be given to enhancement  of the grid,  perhaps by reducing the size of the Tier  1 areal


sample, in order to better achieve the goal of equiprobability.
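
       A minimal Python sketch of the conditional inclusion probabilities in Equations 9 and 10 is
shown below.  The manual simply states that over-large units are placed in a certainty stratum; the
iterative re-check used here to catch units that exceed probability 1 after rescaling is an added
assumption, and the names are illustrative only.

def tier2_conditional_probs(w1, n2):
    """Conditional Tier 2 inclusion probabilities (Eqs. 9-10).

    w1 : Tier 1 weights w_1ri over the reduced sample S_1r
    n2 : desired Tier 2 sample size n_2r
    """
    N_hat = sum(w1)                                   # Tier 1 estimate of N_r
    certain = [i for i, w in enumerate(w1) if n2 * w >= N_hat]
    changed = True
    while changed:                                    # re-check after each removal (assumption)
        changed = False
        n2_prime = n2 - len(certain)
        rest_sum = sum(w for i, w in enumerate(w1) if i not in certain)
        for i, w in enumerate(w1):
            if i not in certain and n2_prime * w >= rest_sum:
                certain.append(i)
                changed = True
    n2_prime = n2 - len(certain)
    rest_sum = sum(w for i, w in enumerate(w1) if i not in certain)
    return {i: (1.0 if i in certain else n2_prime * w / rest_sum)
            for i, w in enumerate(w1)}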



       The variance  estimator presented  in  Equation 4  is also appropriate for estimating


variance from  the Tier  2 sample,  using  inclusion  probabilities defined above for Tier 2.

-------
When no units enter with $\pi_{2\cdot 1ri} = 1$, then

                          $w_{2rij} = \dfrac{2\, n_{2r}\, w_{2ri}\, w_{2rj} - w_{2ri} - w_{2rj}}{2\,(n_{2r} - 1)}$   .                               (13)

However, if unit i enters with $\pi_{2\cdot 1ri} = 1$, then

                          $w_{2rij} = w_{2\cdot 1rj} \left[ \dfrac{2\, n_{1r}\, w_{1ri}\, w_{1rj} - w_{1ri} - w_{1rj}}{2\,(n_{1r} - 1)} \right]$   .                               (14)

Because the term in the bracket equals $w_{1rij}$, Equation 14 simplifies to $w_{2rij} = w_{1rij}\, w_{2\cdot 1rj}$.



Special case: The Tier 2 point sample of lakes

       We assume a stratified design with equiprobable selection within strata.  If a quasi-

stratified  design is used instead, appropriate analysis can condition on the realized sample

sizes in the classes and use post-stratification.



Special case: The Tier 2 point sample of streams

       A  point sample of streams at Tier 2, rather than a sample of stream  reaches, has

been  proposed.   With  a  few  simple  changes,  that  point  sample  will  be  a  rigorous

equiprobable  point  sample  of  the  stream  population  with  a  very simple estimation

algorithm.   A  probability  sample  of stream  reaches,  on  which  the sample points are

represented and from which other estimates of population  structure can  be obtained, will

also be provided.  The protocol provided will apply to  the sample of stream reaches and the

point sample design proposed to us.

       We assume  a stratified design  with equiprobable selection within strata.  If a quasi-

stratified  design is used instead (as has  been proposed), appropriate analysis can  condition

on the realized sample sizes in the classes and use post-stratification.

-------
       Slr  is the Tier 1 collection  of  reaches in a resource stratum identified via the 40-

hexes.  $S_{2r}$ is generated by selecting $n_{2r}$ points from this set using the frame representation

of stream length.  This process results in (1) the selection of $n_{2r}$ frame reaches with

probability proportional  to frame length, and (2) the random  selection of 0, 1, 2,  ... points

in each selected  reach with inclusion density, given reach selection, inversely proportional to

length.  The resultant point sample is equiprobable on the population  of stream reaches.

Then, in terms of the sample of reaches,

                              $\hat{L}_r = \dfrac{w D_r}{n_{2r}} \sum_{S_{2r}} \sum_{j=1}^{k_{ri}} \dfrac{l_{rij}}{l^{*}_{ri}} = \dfrac{w D_r}{n_{2r}} \sum_{S_{2r}} \dfrac{l_{ri\cdot}}{l^{*}_{ri}}$               (15a)



estimates the total length of population reaches, where for resource r, $l_{rij}$ represents the

length of the j-th actual reach in the i-th sampling unit, $l^{*}_{ri}$ represents the length of the i-th

frame reach, and $l_{ri\cdot}$ represents the sum of the lengths of all reaches in the i-th sampling unit.

Recall that a sampling unit is a frame reach.  Also, $D_r$ is the total frame reach length in the

Tier 1 sample of resource r and $\hat{L}^{*} = w D_r$ is the Tier 1 estimate of $L^{*}$.  Because $L^{*}$ is known

on the frame, $w D_r$ is replaced by $L^{*}$, resulting in:



                                        $\hat{L}_r = L^{*} \hat{R}$                                  (15b)


where $\hat{R} = \dfrac{1}{n_{2r}} \sum_{S_{2r}} \dfrac{l_{ri\cdot}}{l^{*}_{ri}}$.  Also,

                              $\hat{N}_r = \dfrac{w D_r}{n_{2r}} \sum_{S_{2r}} \dfrac{k_{ri}}{l^{*}_{ri}}$

where $k_{ri}$ represents the number of actual reaches in the i-th sampling unit.  Again, $w D_r$ can

be replaced by $L^{*}$, which is known for the frame, resulting in:

                                        $\hat{N}_r = L^{*} \hat{R}_N$ ,   with   $\hat{R}_N = \dfrac{1}{n_{2r}} \sum_{S_{2r}} \dfrac{k_{ri}}{l^{*}_{ri}}$ .

-------
       For these estimates, the variance estimators for $\hat{L}_r$ and $\hat{N}_r$ are given by $L^{*2}\,\mathrm{var}(\hat{R})$,

where the variance of the ratio can be approximated by

                              $\mathrm{var}(\hat{R}) = \dfrac{1}{n_{2r}\,(n_{2r}-1)} \sum_{S_{2r}} (u_{ri} - \hat{R})^2$   ,

where $u_{ri} = l_{ri\cdot}/l^{*}_{ri}$ when computing $\mathrm{var}(\hat{L}_r)$, or $u_{ri} = k_{ri}/l^{*}_{ri}$ when computing $\mathrm{var}(\hat{N}_r)$.  Note

that this formula is different from most ratio variance estimators in this report.
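
       As an illustrative sketch in Python, the length and numbers estimates and their simple
ratio variances may be computed as below.  The variance uses the mean-of-ratios form discussed
above; because Equation 17 is only partially legible in this copy, that form is an assumption,
and the argument names are illustrative.

def stream_totals(l_star, l_dot, k, L_star, n2):
    """Length and numbers estimates from the Tier 2 point sample of streams.

    l_star : frame-reach lengths l*_ri for the sampled reaches
    l_dot  : total actual reach length l_ri. represented by each frame reach
    k      : number of actual reaches k_ri represented by each frame reach
    L_star : known total frame length L* of the resource
    n2     : number of Tier 2 sample points n_2r
    """
    uL = [ld / ls for ld, ls in zip(l_dot, l_star)]   # l_ri. / l*_ri
    uN = [ki / ls for ki, ls in zip(k, l_star)]       # k_ri  / l*_ri
    RL = sum(uL) / n2
    RN = sum(uN) / n2
    varRL = sum((u - RL) ** 2 for u in uL) / (n2 * (n2 - 1))
    varRN = sum((u - RN) ** 2 for u in uN) / (n2 * (n2 - 1))
    return {"L": L_star * RL, "var_L": L_star ** 2 * varRL,
            "N": L_star * RN, "var_N": L_star ** 2 * varRN}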


       The distribution function  is estimated from the data collected  at the sample points,

 not from the set of sample reaches, as in the above estimation of N and L.  Recall that each

 selected  frame reach  will have an associated sample point; this may result in 0, 1, 2, or more

sample points for the actual streams represented by the frame reach. Let

                                    $\hat{F}_r(y) = n_r(y) \big/ n_r$   ,                               (18)

where $n_r$ is the total number of sample points in the resource, as realized in stream reaches,

and $n_r(y)$ is the number of these for which the observed indicator value is less than y;

$n_r(y) = \sum_i \sum_j I(y_{rij} < y)$, with summation over all frame reaches, i, and all points, j, for each

frame reach.  Then, rewrite $n_r(y) = \sum_{i \in S_{2r}} z_{ri}(y)$, so that $\hat{F}_r(y) = \sum_{S_{2r}} z_{ri}(y) \big/ n_r$.

       For $\hat{F}_r(y)$, under the randomization assumption, it is appropriate to treat $n_r(y)$ as

conditionally binomial($n_r$, $F_r(y)$) and to use the binomial algorithm.  Alternatively, the

variance of $\hat{F}_r(y)$ can be estimated in the manner of ratio estimators.  For the equiprobable

point sample, this is:

            $\mathrm{var}(\hat{F}_r(y)) = \Big[ \sum d_{ri}^2\, w_{2i}\,(w_{2i} - 1) + \sum \sum d_{ri}\, d_{rj}\,(w_{2i} w_{2j} - w_{2ij}) \Big] \Big/ n_r^2$  ,        (19a)

-------
where $d_{ri}(y) = \big( z_{ri}(y) - \hat{F}_r(y)\, x_{ri} \big)$, $w_{2i} = w D_r / n_{2r}$, and $w_{2ij}$ is the corresponding pairwise weight.  Here, $x_{ri}$ equals the


number of sample points for the i-th frame reach.  This then simplifies to:
                                                                                    (19b)
Then it is necessary to estimate $L_r(y)$ as a product, $\hat{L}_r(y) = \hat{L}_r\, \hat{F}_r(y)$, with variance



estimator, $\mathrm{var}(\hat{L}_r(y)) = \hat{L}_r^2\, \mathrm{var}(\hat{F}_r(y)) + \hat{F}_r^2(y)\, \mathrm{var}(\hat{L}_r)$.



      This analysis presumes that there  are no strata for stream  reaches.  For two strata



(1st and 2nd order), simple modification of these formulae will suffice.  The numbers and



length of reaches in  the  cross-classified strata are  estimated  and  then combined.   For F,



sample  points  from  units  in the wrong stratum  are simply combined  with  the correct



stratum.  If more than these two strata are desired, the general method of frame correction



via sample unit correction is not feasible, and the method prescribed here is not appropriate.






2.1.2 Extensive Resources



      The universe of an extensive resource is a continuous spatial  region.  If the domain is



correctly identified, the universe of the resource will be a subset of the domain and may be



fragmented over  that domain.   Extensive resources may  have  populations  of two  kinds,



continuous  or  discontinuous.   Because these discontinuous  populations are defined  on a



continuous universe, they are referred to as extensive resources.  Continuous populations are



referred to  as extensive resources as well.  Section 3.3.4 of the EMAP design report (Overton



et al., 1990) describes two methods for sampling extensive populations, via a point or areal



sample.  For each resource, the design provides for the classification of a large areal sample



(40-hex) at each grid point; these areal samples are also subject to subsampling via points or



areal subsamples.

-------
       At Tier  2,  two distinct  directions are available, depending on the nature  of the




 resource.  Specifically, if  the domain of the resource  is well known  from existing materials,




 as are boundaries of the Chesapeake Bay or Lake Superior, then the  Tier 1 areal sample is of




 little value either  in  estimating  extent or in  obtaining a sample at Tier 2.   In these cases,




 the domain should correspond to the  universe.  Conversely,  if the spatial distribution  or




 pattern  of a resource is poorly known, as it will be  for certain  arid land types or  for certain




 types of wetlands, then the Tier  1 areal sample may provide the best  basis for obtaining a




 well-distributed  sample at Tier 2. Other  factors, such as size of the domain and degree of




 correspondence of universe and  domain,  will influence  the  sampling  design.  In  the  first




 circumstance,  the  Tier 2  sample will be selected from the area! sample obtained at Tier 1.




 In the other,  the Tier 2 sample  will  be selected  from  the  known universe by a  higher




 resolution point  sample that contains the base Tier 1 sample.











 2.1.2.1  Areal Samples




 Tier 1





       All Tier  1  areal  samples  are  expected  to  be  collected  with  equal  probability.




 Enhancement  of the  grid may be made  for any  resource, but any resource should have




 uniform grid density over its domain.  Further, the areal sample imposed on the grid points




 will be of the same size for  any  resource, so that  algorithms  are presented  only for equal




 probability sampling. The following formula  estimates the total areal extent of a particular




 resource (r) over its domain $D_r$:

                                    $\hat{A}_r = w \sum_{i} a_{ri}$   ,                               (20)

where the domain was discussed in Section 2.1.1.2.  The value $a_{ri}$ defines the area of




resource r in the areal sample at grid point i, and w is the inverse of the density of the grid




divided  by the size of the areal sample.   For equal  probability sampling, the variance

-------
estimator is

               $\mathrm{var}(\hat{A}_r) = \dfrac{n\, w^2}{n - 1} \Big\{ \sum_i a_{ri}^2 - \big( \sum_i a_{ri} \big)^2 \big/ n \Big\}$   ,                               (21)

where n is the number of whole or partial areal sample hexagons located in $D_r$. As with the



discrete resources, even though the sample is selected by a systematic grid, we assume, in



order  to estimate variance, that the sample  was taken from  a locally randomized scheme.



The justification of this assumption is similar  to  that  for  discrete resources.   Alternate



procedures are available when the assumption is questioned (see Section 2.1.2.3).
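
       A minimal Python sketch of Equations 20 and 21 under equal-probability sampling is given
below; it assumes the local-randomization form of the variance discussed above, and the names
are illustrative only.

def areal_extent(a, w):
    """Tier 1 areal extent of resource r (Eq. 20) and its variance (Eq. 21).

    a : list of areas a_ri of resource r observed in each 40-hex areal sample
    w : inverse of the grid density divided by the size of the areal sample
    """
    n = len(a)
    A_hat = w * sum(a)                               # Eq. 20
    s = sum(a)
    var_A = (n * w ** 2 / (n - 1)) * (sum(x * x for x in a) - s * s / n)   # Eq. 21
    return A_hat, var_A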







Tier 2




       At the second stage of sampling for extensive resources, the distribution function for a



particular resource is estimated.  To identify  the  objective of Tier 2 sampling, we  can  write



estimating  equations as  though a complete census  were made at  Tier  2.  The general



conceptual  formula for  the distribution function  of  areal extent for  a specific  resource  (r)



over its domain $D_r$, is

                                    $\hat{A}_r(y) = w \sum_{i} a_{ri}(y)$   ,                               (22)

where $a_{ri}(y)$ is the area of the resource in areal sample i such that the value of the indicator

is less than y.  The estimated variance follows Equation 21 as

               $\mathrm{var}(\hat{A}_r(y)) = \dfrac{n_r\, w^2}{n_r - 1} \Big\{ \sum_i a_{ri}^2(y) - \big( \sum_i a_{ri}(y) \big)^2 \big/ n_r \Big\}$   .        (23)








       The estimate of areal  proportion  for an extensive population  divides Equation 22 by



the estimate of total areal extent:

-------
                                    $\hat{F}_r(y) = \hat{A}_r(y) \big/ \hat{A}_r$   .                               (24)

In the rare instance in which $A_r$ is known, an improved estimator of $A_r(y)$ is given by


                                      $\hat{A}_r(y) = A_r\, \hat{F}_r(y)$ .                               (25)





 Ordinarily,  these  distribution  functions  will  be calculated  at each  distinct  value of  y


 appearing in the sample.   The variance associated with the areal proportion is  the general


 form  for a  ratio  estimator  (Sarndal  et  al.,  1992,  Equation  7.2.11).   In  writing this


 expression, it is necessary to identify the specific value, $y_l$, at which $\hat{F}_r(y)$ is being assessed.


               $\mathrm{var}(\hat{F}_r(y_l)) = \Big[ \sum_j d_j^2\, w_j\,(w_j - 1) + \sum_j \sum_{k \ne j} d_j\, d_k\,(w_j w_k - w_{jk}) \Big] \Big/ \hat{A}_r^2$  ,         (26)


 where $d_j = \big[ a_{rj}(y_l) - a_{rj}\, \hat{F}_r(y_l) \big]$, $a_{rj}$ is the area of sample j in resource r, $w_{jk}$ is defined as

 in Equation 5, and $\hat{A}_r^2$ is replaced by $A_r^2$ when $A_r$ is known.
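
       A short Python sketch of Equations 24 and 26 follows.  The pairwise weight of Equation 5
appears earlier in the manual and is not reproduced here, so it is passed in as a user-supplied
function; all names are illustrative.

def areal_proportion(a, a_y, w, w_pair):
    """Areal proportion F_r(y) (Eq. 24) and its ratio variance (Eq. 26).

    a      : areas a_rj of resource r in each areal sample j
    a_y    : areas a_rj(y) whose indicator value is less than y
    w      : per-sample weights w_j
    w_pair : function (j, k) -> pairwise weight w_jk (Eq. 5, assumed supplied)
    """
    A_hat = sum(wj * aj for wj, aj in zip(w, a))
    A_y = sum(wj * ay for wj, ay in zip(w, a_y))
    F = A_y / A_hat                                           # Eq. 24
    d = [ay - aj * F for ay, aj in zip(a_y, a)]
    n = len(a)
    var = sum(d[j] ** 2 * w[j] * (w[j] - 1) for j in range(n))
    var += sum(d[j] * d[k] * (w[j] * w[k] - w_pair(j, k))
               for j in range(n) for k in range(n) if k != j)
    return F, var / A_hat ** 2                                # Eq. 26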



       In practice, the Tier 2 assessment will be based on a subsample of some kind, and the


 above ideal  estimation will not be available. The only method  proposed for subsampling is


 use of a Tier 2  point sample.





 2.1.2.2  Point Samples



       Two methods of directly sampling objects from the  grid points are discussed  in the


.EMAP design report (Overton  et al., 1990, Section 4.3.3.2).   A Tier 1 reduced sample  of


discrete resource units can be selected by choosing the units into  whose areas of influence the


points fall; this method is  not currently scheduled for use, but it is a viable method for


several discrete resources.  The same procedure can be used to select areal sampling units


from an arbitrary spatial partitioning of the United States.  The agroecosystem component

-------
of  EMAP  provides  such  an example:  the units selected  for  the sample are  secondary




sampling units of the National Agricultural Statistics Service  (NASS)  frame, and estimates




are of totals over subsets of the  frame units.  Each selected unit is a mosaic of fields and




other  land  use structures.   These structures  are then classified  and sampled  to provide




ecological indicators  for characterizing the sampling unit.  Essentially, this  areal sample is




analyzed exactly like the 40-hex fixed areal unit discussed in the previous section, with the





exception that  inclusion  probabilities are now  proportional to the size of the unit,  and the




general formulae (e.g., Equations 2-4) must be used.






       An alternate use of the point sample  can be applied to an  extensive resource, with the




ecological  indicators of  the resource  measured  at  the  grid  points.    For  continuous




populations, such as temperature or pH, the response can be measured exactly at a selected




point.  For other populations, it is necessary  to make observations  on a  support region




surrounding the point, like a quadrat.   For  example, the wetlands resource group could




obtain an indicator, such as plant diversity,  from a quadrat sample centered on a grid point.




The indicator measured in the quadrat can  be  treated like a point measurement.  A cluster




of quadrats centered on the grid  point provides yet another method for  sampling extensive




resources.






       This point sample will  be applied  at Tier 2 in either of two ways.  For resources that




depend on the  Tier  1 areal sample to provide  a sample frame, a high-resolution  sample of




points is to be imposed  on  each  40-hex containing the resource; this arrangement will




generate an  equiprobable point sample  of  the areal fragments of all resources  that  were




identified at Tier 1.  For a resource in which the universe is clearly identified, such  as  Lake




Superior, a better spatial pattern  of sample points will be obtained by imposing an enhanced




grid on the entire universe.  In the latter case, the universe is known, whereas in the former




case, the Tier 1 sample provides a sample of the universe, which is then sampled by a Tier 2




point sample.

-------
       In  either case,  an equiprobable sample of points  is obtained  from which resource




 indicators will  be  measured, and  the  estimation equations will  differ  only  by the weights.




 Variance estimators will  differ, as one is a single-stage sample and the other is a double




 sample.









 Point sample for a universe with well-defined boundaries





       For a resource in which the universe is known (e.g.,  the  Chesapeake Bay), the general




 formula for equiprobable point samples for a resource class is presented.  A  resource class is




 defined as a subset of  the resource.  For  example, two classes of substrate, sand and  mud,




 can be defined  in  the  Chesapeake Bay.  The  distribution function  of the  proportion  of a




 specific class of a specific resource (rc) having the indicator $\le y$ reduces to

                                    $\hat{F}_{rc}(y) = n_{rc}(y) \big/ n_{rc}$   ,                               (27)

where $n_{rc}(y)$ is the number of points in resource class rc with the indicator equal to or less




than a specific  value, y,  and nrc is the total number of sample points in the resource class




rc.  Under the randomization assumption, the conditional distribution of $n_{rc}(y)$, given $n_{rc}$,



is Binomial($n_{rc}$, $F_{rc}(y)$), so that confidence bounds are readily set by the binomial




algorithm   in   those  instances   in  which   spatial  patterns indicate  adequacy of the




randomization  model (Overton et al., 1990, Section 4.3.5). Alternate protocols are available




when the randomization model cannot be assumed (Section 2.1.2.3).





       Estimation of the area occupied by an extensive resource class is provided by









                                  $\hat{A}_{rc} = A_r\, \dfrac{n_{rc}}{n_r} = A_r\, \hat{p}_{rc}$  ,                             (28)









where  nr is the  number of grid points falling  into the domain of the resource, and  Ar is the




area of the resource.   Under the randomization assumption, nrc, conditional on  nr, is a

-------
 binomial random variable;  bounds on prc are again set by the  binomial  algorithm, as are
 bounds on Arc.
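
       As an illustration, the sketch below computes Equations 27 and 28 with one-sided binomial
bounds.  The manual says only that 'the binomial algorithm' is used; the Clopper-Pearson (beta)
inversion shown here is one common realization of that algorithm, assumed for this sketch, and
it relies on scipy.

from scipy.stats import beta

def class_proportion_bounds(n_y, n_rc, alpha=0.05):
    """F_rc(y) = n_rc(y)/n_rc (Eq. 27) with one-sided 95% binomial bounds."""
    p_hat = n_y / n_rc
    lower = 0.0 if n_y == 0 else beta.ppf(alpha, n_y, n_rc - n_y + 1)
    upper = 1.0 if n_y == n_rc else beta.ppf(1 - alpha, n_y + 1, n_rc - n_y)
    return p_hat, lower, upper

def class_area(A_r, n_rc, n_r):
    """Area occupied by the class (Eq. 28): A_rc = A_r * n_rc / n_r."""
    return A_r * n_rc / n_r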
 Point sample for universe with poorly defined boundaries


       When  the  universe of the resource is not known and one must use the Tier  1 areal

 sample as a base for the Tier 2 sample,  then Equation 19 provides  the estimates of Ar at

 Tier  1.  Then the Tier 2 sample is an equiprobable sample of points selected from the area

 of  the  resource class  contained in the  40-hexes.   This procedure is implemented as  a

 tessellation stratified sample in each 40-hex, with k=l to 6 sample points per 40-hex.  With

 only  1 point per 40-hex,  the binomial algorithm will be appropriate under the randomization

 assumption; multiple points per 40-hex will require an explicit design-based expression for

 variance.  In all cases,




                                  $\hat{A}_{rc} = \hat{A}_r\, \dfrac{n_{rc}}{n_r} = \hat{A}_r\, \hat{p}_{rc}$   ,                           (29a)

                                    $\hat{F}_{rc}(y) = \dfrac{n_{rc}(y)}{n_{rc}}$   ,                              (29b)

                        $\hat{A}_{rc}(y) = \hat{A}_{rc}\, \hat{F}_{rc}(y) = \hat{A}_r\, \dfrac{n_{rc}(y)}{n_r} = \hat{A}_r\, \hat{R}$  .                   (29c)




It should  be recognized that Equation 29a is a special case of Equation 29c.


       When k>l, the following variance formula is appropriate:




                                                                                    (30)
The outside summation is over the 40-hexes and $d_{ij} = \big( I(rc,\, y_{ij} < y) - \hat{F}_{rc}(y)\, I(rc) \big)$. This

expression   is  derived  from  the  general  Horvitz-Thompson  formula  used  with  ratio

estimators.  The formula can  be recognized as the usual stratified random sample variance

-------
 formula, applied to d, .





        In addition,





                            $\mathrm{var}(\hat{A}_{rc}(y)) = \mathrm{var}(\hat{A}_r)\, \hat{R}^2 + \hat{A}_r^2\, \mathrm{var}(\hat{R})$  ,                   (31a)




               where $\mathrm{var}(\hat{R})$ follows,
                                                                                   (31b)
               where $d_{ij} = \big[ I(rc,\, y_{ij} < y) - I(r)\, \hat{R} \big]$.









 Note that $\mathrm{var}(\hat{A}_r(y)) = \mathrm{var}(\hat{A}_r)\, \hat{F}_r^2(y) + \hat{A}_r^2\, \mathrm{var}(\hat{F}_r(y))$; $\hat{F}$ replaces $\hat{R}$ in Equation 31a as well




 as in Equation  29c.









 2.1.2.3  Alternative Variance Estimators





       Confidence  bounds for distribution functions based on point samples of continuous




 and extensive populations can be computed by several methods.  The choice of a method  is




 determined by  the pattern of the resource area.  First, the binomial approach is suggested




 for fragmented  area distributed randomly across the domain.  When this condition has been




 met, the  randomization  assumption  holds  and  the binomial  model is  appropriate  for




 computing confidence bounds.





       If the area,  Ar(y),  is in an entire block, rather than fragmented, then  the binomial




algorithm  will  overestimate  variance,  and alternative  estimators will be  needed.  Other




methods allow for a nonfragmented area  and  the randomization assumption is  not required.




The mean  square successive difference (MSSD) is suggested for a strict  systematic sampling




scheme.    Another  method,  the  probability  sampling  method  using the Yates-Grundy




variance estimator, requires that the design have all positive pairwise inclusion  probabilities.

-------
One such design that provides this structure is  the  two-stage tessellation stratified model.




The MSSD is  discussed by Overton and Stehman  (1993a) and the probability estimator is




discussed by Cordy (in press).  Methods of spatial statistics are also available for estimating




this variance.










Mean square successive difference estimator






       The variance estimator based on the mean square successive difference is intended to




provide  an  estimate of variance  for either  the mean of values from  a set  of points  on a




triangular  grid or  obtained  from  a random  positioning of the tessellation  cells of the




hexagonal dual to  the triangular  grid.  In  the  latter  case, the data are analyzed as though




the values were taken from  the center  of the tessellation  cell.  The  data set consists of all




points falling in the target resource. The MSSD  has  not been developed for this tessellation




formed by triangular decomposition of the hexagons.










Smoothing






       Smoothing often  results in  improved variance  estimation (Overton  and  Stehman,




1993a).   The  following  method is from  that report.    For each datum,  y, calculate  a




'smoothed' value, y*, as a weighted average of the datum and its immediate neighbors (i.e.,




distance of one sampling interval).   Weighting for this  procedure is provided  below.   As a




result, two new statistics are generated at each point:  y*  and A.

-------
       Number of Neighbors        y* values             A values
               6               (6y_i + sum y_j)/12         7/24
               5               (7y_i + sum y_j)/12         5/24
               4               (8y_i + sum y_j)/12         5/36
               3               (9y_i + sum y_j)/12         1/12
               2               (10y_i + sum y_j)/12        1/24
               1               (11y_i + sum y_j)/12        1/72
               0                      y_i                    0
Given these smoothed values, summing over all data points,
                                   ~2 =             .                              (32)
Mean Square Successive Difference





      Identify  the data along the three axes of the triangular  grid;  each point will appear




once in the analyses of each axis.  Analyze  the y*,  not the  original y.   For each axis,




calculate









                                 «= £(y;-y£)2  -                              <33)








where $y^{*}_j$ and $y^{*}_k$ represent members of a pair of adjacent points, and where the summation




is over all adjacent pairs identified on this axis.  Also, calculate  for each axis,
                                                                                  (34)

-------
 where it  is necessary that all pair differences be taken  in the same direction.  From  these

 statistics, calculate for each axis,
                                           •*  ~_m)
                                                                                   (35a)

               and

                                    A^(S2-*2)  ^



 where m denotes the number of pairs in the above summations.

       These statistics are  then combined over the three  axes, where summation is over all

 successive pairs in the k-th axis.



                                    V!=2E ^ ,                               (36a)
                                        i	  .                             (36b)
                                        E'(mfc-l)
Lastly,  the following are computed to provide estimates of the variance of the mean values.
                                        y.) = vi  + TT7                              (37a)

               and
where $n_r$ equals the number of sample points in resource r.  This method has not been

extended to distribution functions, but the extension is straightforward.
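
       The per-axis building blocks of Equations 33 and 34 are illustrated in the Python sketch
below.  The combining formulae (Equations 35-37) are only partially legible in this copy, so only
these building blocks are shown; the names are illustrative.

def mssd_axis_stats(y_star):
    """Per-axis quantities for the MSSD estimator.

    y_star : smoothed values y*, ordered along one axis of the triangular grid.
    Returns the sum of squared successive differences (Eq. 33), the sum of the
    signed successive differences taken in one direction (Eq. 34), and the
    number of successive pairs on the axis.
    """
    diffs = [b - a for a, b in zip(y_star[:-1], y_star[1:])]
    Q = sum(d * d for d in diffs)     # Eq. 33
    D = sum(diffs)                    # Eq. 34
    return Q, D, len(diffs)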

-------
 Yates-Grundy variance estimator for tessellation stratified probability samples






       Investigation  of this variance estimator is continuing.  The method will  be included




 in the next version of this manual.
 2.2  Model-Based Estimation Methods






       The previous section was devoted to design-based methods used to derive population




 estimates, distribution functions, and confidence bounds.  Model-based estimation is another




 common  approach   to  computing  population  estimates.    In  this  approach,  certain




 assumptions with  regard  to the underlying model are made, and the information provided




 by auxiliary variables often provides greater precision of the estimates.





       Within EMAP,  these model-based  methods have not  been  developed to  the  same




 degree as  the  design-based methods.  No  algorithms for confidence  bounds of distribution




 functions using model-based  methods are presented in this  report, although they are under




 development.  The  purpose  of including  this section is to provide a  brief description  of




 currently available model-based methods.  Further, application of the model-based methods




 has so far been restricted  to discrete populations.  Investigation of the applicability of these




 methods in continuous and extensive populations is under way.





       Three ways in which model-based methods can be used within EMAP are discussed:




 (1) data collected on the full frame  across  the  population  can be incorporated into the




estimation process using prediction estimators to improve precision;   (2) because the EMAP




design is a double sample (Section  2.1.2.2), auxiliary variables on the first-stage  sample can




be used  to improve  the precision  at  the  second stage;  and (3) a calibration method  is




described for  modifying an indicator variable to adjust for changes in instrumentation  or




protocol  - such methods are needed to maintain  the viability  of a long-term  monitoring




program.





       The strategy is to begin with the basic design-based methods and to incorporate

-------
model-based methods as the opportunity to do so becomes apparent and the necessary frame




materials are developed.  The design-based  methodology will  be enhanced by the  use  of




models whenever feasible.








2.2.1 Prediction Estimator





       If auxiliary data  that can  be used to  predict certain  indicators are available on the




entire  frame, model-based prediction techniques can be  used to obtain predictions  of the




response variable for the population. These predictions  then  become the base for population




inference.





       These methods require a vector of predictor variables  defined on the frame,  while the




response variable is  measured on  the Tier 2  sample.   A model is postulated for the




relationship between the response variable, y,  and the vector of predictor variables, x:








                           $y = g(x) + \varepsilon$,  with $\mathrm{Var}(\varepsilon) = h(x)$  .                       (38)








Based on this model, a predictor equation, $\hat{y} = \hat{g}(x)$, is estimated from the Tier 2 sample.





       The equation for  the  basic estimator,  which is referred to as the general  regression




estimator, is defined as
                                 $\hat{T}_y = \sum_{\mathcal{U}} \hat{y}_i + \sum_{S_2} w_{2i}\,(y_i - \hat{y}_i)$  ,                       (39)

where $\mathcal{U}$ and $S_2$ designate the universe of units and sample units at Tier 2, respectively.



The variance of this estimator is estimated by

               $\mathrm{var}(\hat{T}_y) = \sum_{S_2} d_i^2\, w_{2i}\,(w_{2i} - 1) + \sum_{S_2} \sum_{j \ne i} d_i\, d_j\,(w_{2i} w_{2j} - w_{2ij})$   ,        (40)

-------
 where $d_i = (y_i - \hat{y}_i)$ (Sarndal et al., 1992, Equation 7.2.11).  Our experience (Overton and




 Stehman, 1993b) suggests that this equation slightly underestimates the variance; this result




 is not unexpected because Equation 40 is based only on the variance of the second term of




 Equation 39.
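
       A minimal Python sketch of Equation 39 is given below.  It assumes a simple linear working
model for Equation 38 fitted by ordinary least squares; the fitting choice and all names are
illustrative, not prescribed by the manual.

import numpy as np

def regression_estimator(x_frame, x_s2, y_s2, w2):
    """General regression (prediction) estimator of a total (Eq. 39).

    x_frame : auxiliary variable for every unit on the frame (universe)
    x_s2    : auxiliary variable for the Tier 2 sample units
    y_s2    : response measured on the Tier 2 sample
    w2      : Tier 2 weights w_2i
    """
    b, a = np.polyfit(np.asarray(x_s2), np.asarray(y_s2), 1)  # working model y = a + b*x
    y_hat_frame = a + b * np.asarray(x_frame)
    y_hat_s2 = a + b * np.asarray(x_s2)
    resid = np.asarray(y_s2) - y_hat_s2
    # Prediction total over the frame plus weighted sum of sample residuals.
    return y_hat_frame.sum() + np.sum(np.asarray(w2) * resid)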






       One model-based estimator of the distribution function  of the proportion of numbers,




 as established by Rao et al.  (1990), is  based on the general regression estimator and defined
 as
 where N  is the  target population size, and l(y,
-------
2.2.2  Double Samples


       As  mentioned  previously,  the  EMAP  design  is  a  double  sample  with  Tier  1


representing the first stage (or phase) and Tier 2 the second stage.   Through  most of this

document,  design-based methods are  provided  for the  Tier  2 sample;  these  methods  are

similar to those described for single-stage samples.  However, where  model-based  methods

are used, double sampling formulae  can be  quite  different from single-stage formulae.   An

elementary  discussion of double sampling with model-based methods is presented in  Cochran

(1977).


       Existence of an auxiliary  variable on the Tier  1  sample  will enable  model-based

double-sample methods at Tier 2.  EMAP does  not require a resource-specific frame, but it

does allow for acquisition of more detailed information for many resources.  There is a Tier

1 sample for all resources, and for most resources, the Tier  2 sample is a subset of the Tier 1

sample,  thus providing the basis for model-based double-sample methods.


       The  model specification follows  the developments  under the  general prediction model

(Equation 38).  The  basic  estimator, derived  from  the  general  regression  estimator,  is

defined as
                               $\hat{T}_y = \Big\{ \sum_{S_1} w_{1i}\, \hat{y}_i + \sum_{S_2} w_{2i}\,(y_i - \hat{y}_i) \Big\}$  ,                     (42)

where $S_1$ and $S_2$ define the sample at Tiers 1 and 2, respectively.  The form of this estimator

allows equal or variable probability at Tier 1.  The variance estimator for Equation 42

follows Sarndal et al. (1992, p. 365, Eq. 9.7.28):

                $\mathrm{var}(\hat{T}_y) = \sum_{S_2}\sum_{S_2} (w_{1i} w_{1j} - w_{1ij})\, \hat{y}_i\, \hat{y}_j\, w_{2\cdot 1ij} + \sum_{S_2}\sum_{S_2} (w_{2\cdot 1i} w_{2\cdot 1j} - w_{2\cdot 1ij})\, d_i w_{1i}\, d_j w_{1j}$  ,   (43)

where $d_i = (y_i - \hat{y}_i)$.
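
       For the point estimate only, Equation 42 can be sketched in a few lines of Python; the
predictions are assumed to come from a working model fitted as in Equation 38, and the names are
illustrative.

def double_sample_estimator(y_hat_s1, w1, y_s2, y_hat_s2, w2):
    """Double-sample regression estimator of a total (Eq. 42):
    sum over S1 of w_1i * y_hat_i plus sum over S2 of w_2i * (y_i - y_hat_i).
    """
    t1 = sum(w * yh for w, yh in zip(w1, y_hat_s1))
    t2 = sum(w * (y - yh) for w, y, yh in zip(w2, y_s2, y_hat_s2))
    return t1 + t2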

-------
       The estimate of the distribution function of the proportion of numbers is developed as




an extension of Equation 41,
                                             S2
 When N is unknown, $\hat{N}$ is a suitable replacement.  Smoothed versions of Equations 41 and




 44, along with confidence bound algorithms, are under development.









 2.2.3 Calibration






       Calibration is defined as the replacement of one variable in the data set by a function




 of that variable representing  another  variable.   For example, in  a long-term monitoring




 program such as EMAP, it is expected that some laboratory or data management protocols




 will  change over time.  Using this analytical tool, data from old protocols can be calibrated




 to represent data from the  new protocols, thereby allowing assessment of trends across the




 transition.





      Overton (1987a, 1987b, 1989a) described the application of calibration issues for the




 National  Surface Water Surveys.   In  that instance, protocols  were  unchanged  but  the




 extensive data of 1984 were calibrated to the same variable in 1986 to take advantage of the




strong predictive relationship through the double sample.  The algorithms are provided for




this calibration in Technical Report 130 (Overton, 1989a).  Tailoring of these methods to




the specific needs of EMAP will be required in certain instances.  However, each application




is likely to present some unique issues and properties, so  that general development does not




appear feasible.

-------
2.3  Other Issues




2.3.1 Missing Data






       Two types of missing data are expected to arise in EMAP.  One type  is a missing




sampling unit, such as a missing lake.  The other type of missing value occurs within a




sampling unit,  such as a missing observation on a specific chemical variable or a missing




suite of chemical  variables for a lake.  In this situation, information is available on  some,




but not all, indicators for a specific unit or site.










2.3.1.1  Missing Sampling Units






       There appears to be no basis for imputation of a missing sampling unit where no Tier




2 information is available  to predict that observation.  Therefore, missing sampling units




should  be considered  as  representing a  subset  of the subpopulation  of  interest  that is




unavailable for measurement.  All procedures outlined in this document accommodate data




sets  that contain missing  units.  No  adjustments to the weighting factors  are  necessary;




summation is over the observed portion of the sample, and the estimates produced apply to




the subpopulation  represented by the sample analyzed.  When Yates-Grundy estimation of




variance is used, it will be necessary  to modify the equation; this requirement  is  the primary




reason for using the Horvitz-Thompson variance estimator when  possible.






       In a  long-term  program, this approach  of  classifying  missing  units  with  the




subpopulation not  represented  by the sample  is  clearly  appropriate;  such units  can  be




sampled in  subsequent  years  without having to  modify sample  weights  again.    This




approach is  also  consistent  with  the  practice  of allowing  sampling units  to  change




subpopulation classes from time to  time.   Comparisons must  take  this into account,  but




such class changes will always be a feature of long-term monitoring programs.






       A general problem remains when  a substantial number  of  resource sites  cannot be




measured;  EMAP must find a  way to provide  indicator values for  such sites.   When the

-------
 problem  is severe, it might be  possible to develop an alternate indicator suite  that can  be




 obtained via aerial television or photography.  Perhaps it will be possible to impose a higher




 (lower resolution) sample level that will provide for model-based methods and predictors of




 the indicator. (This option will be difficult because the predictor relation must be developed




 specifically for the subpopulation of concern.)  But whatever the solution, some method is




 required to provide representation of these sites.  Until then, it is appropriate for these to  be




 identified  in  the  subpopulation for which no sample  has been  generated and about which




 nothing is known.









 2.3.1.2 Missing Values within Sampling Units





       It is advantageous to use information collected  on a specific sampling unit to impute




 any missing observations for that sampling unit. To minimize error, a multivariate analysis




 is suggested, utilizing  the  data collected for that particular unit.   No specific procedure is




 suggested for this analysis, because most standard analyses will impute similar values, and




 because the method  must  be  tailored to the  circumstances.  Some multivariate procedures




 are discussed in statistics books that concentrate on imputation  of missing values (cf., Little




 and Rubin, 1987).









 2.3.2  Censored Data





       For certain measurements, values for indicators will be less than the identified




detection limit; exact values cannot be measured for such units or sites. This problem is not




uncommon and has  been discussed frequently  in the literature applying to water quality




management  programs (cf., Porter  et al., 1988). Caution is prescribed when characterizing




data  that  consist of many observations below  the detection limit.   Proper analysis  and




reporting can prevent improper  inference for these data; specifically, it must be noted that




although  reliable  values  are not provided, a great deal  is known about the site that has a




value at or below  the detection limit.

-------
       To  guide the data analyst in  the  treatment of the indicator that  contains censored




observations,  the proximity of the detection limit to the critical value of the indicators needs




to be considered.  Indicators, such as chemical variables, that have detection limits  near or




above the critical  value should  not  be considered  meaningful  indicators; the information




supplied by such an indicator  is too fuzzy to justify  inferences.   In  such cases, the most




meaningful parameters  are  those whose  estimates  are not  affected by  censoring.   Other




indicators  have a detection  limit well below  the critical value.  For  these indicators, it  is




suggested that values below the detection limit should be scored to the detection limit  and




analyzed with the uncensored data.






       The mean is  a poorly defined statistic to describe censored data.  However, the scored




mean  can  be  interpreted, even though  it is slightly biased.   Another statistic, the scored




mean  minus the detection limit,  is unbiased for the mean in excess of the detection limit,




which is a well-defined population parameter.  If the distribution below the detection  limit  is




modeled, and  the mean  value below the detection limit is calculated, then the scored mean




can be converted into an  unbiased estimate of  the true mean, given the model.
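
       The scoring rule described above can be sketched in Python as follows; the function name
is illustrative, and no modeling of the distribution below the detection limit is attempted.

def scored_mean(values, detection_limit):
    """Score observations below the detection limit to the limit and summarize.

    Returns the scored mean and the scored mean minus the detection limit; the
    latter is unbiased for the mean in excess of the detection limit.
    """
    scored = [max(v, detection_limit) for v in values]
    m = sum(scored) / len(scored)
    return m, m - detection_limit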






       On  the other hand,  the  median  is less ambiguous  than  the mean  and  is more




appropriate for  characterizing these indicators.  Usually the median will not be  affected by




scoring.  Distribution functions  also should not be described below the  detection limit. This




restriction  is  another reason for scoring; standard  analyses of the scored  data yield  the




desired distribution  function, emphasizing that the  shape of the curve below  the detection




limit  is unknown.   Because  the critical level  changes with circumstance,  it is desirable to




present the truncated  (scored) distribution function, to be interpreted  as the situation




dictates.   In  fact,  the  capacity  to  truncate  the distribution  function without impairing




inferences is one of  the  strong  arguments for  choosing this parameter to characterize these




data.

-------
       Modeling  the function  below  the  detection  limit  is  one method  proposed  in  the




 statistical literature to modify estimates from censored data (Cox and Oakes,  1984;  Miller,




 1981).   However, a hypothetical  distribution must be assumed  to  represent  the  censored




 data.  In EMAP, distributions are defined  on real populations and are unlikely to follow any




 distributional  law.  We propose that the distribution function reflect the data alone  and that




 the  unsupported  portion of the distribution function  is not described.  Use of the  scored




 mean is somewhat less justifiable, but  generally consistent with this position.









 2.3.3 Combining Strata





       The strata that form the structure of the Tier 2 sample are established from classes of




 resources identified  at  Tier  1, on  the Tier  1 sample.   The seven  basic  resources are  the




 foundation of  this structure,  but there  is provision for further classification leading to several




 strata for lakes, several for forests, and so on.  These strata are referred to as resources in




 this report.





       Tier 2  selection  is  then stratum (resource) specific and independent  among strata.




 This structure is  chosen to provide inferences within  strata, with  the  thought that few




 occasions will  arise for inferences  involving  combined  strata.  For example, a distribution




 function [F(y)] combining small and middle-sized lakes will be dominated by small lakes.  If




 the population of large lakes  is of interest, it must be characterized separately.  Further, a




 wide range  of sizes  makes  the  frequency  distribution less useful  in characterizing the




 population.  Still,  because there may be interest in a population  consisting of the largest of




the small  lakes and the  smaller  of the middle-sized lakes, analysis of combined strata is




needed.

-------
2.3.3.1 Discrete Resources

       Samples are combined across strata to compute the Tier 2 estimates.   Weights will

not be uniform, so the Horvitz-Thompson algorithms using weights are needed. Estimation

of $N_a(y)$ and $F_a(y)$ is identical to the estimation algorithms for a single stratum, but

estimation of variance requires modification.  The basic formula for estimating variance is

also unchanged; only the $w_{2ij}$ must be modified.  Specifically,


        if i and j are from the same stratum, then

                                    $w_{2ij} = \dfrac{2\, n_2\, w_{2i}\, w_{2j} - w_{2i} - w_{2j}}{2\,(n_2 - 1)}$   ;                        (45)

        or if i and j are from different strata, then

                                  $w_{2ij} = w_{1ij}\, w_{2\cdot 1i}\, w_{2\cdot 1j}$   ,                        (46)

               where, if i and j are from different 40-hexes, then

                                    $w_{1ij} = \dfrac{2\, n_1\, w_{1i}\, w_{1j} - w_{1i} - w_{1j}}{2\,(n_1 - 1)}$   ;                        (47)

               or, if i and j are from the same 40-hex, then
                                    $w_{1ij} = w_{1i}\, w_{1j} \big/ w$   ,                                (48)

        where w is the weight associated with the basic Tier 1 areal sample.
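
       A Python sketch of the pairwise-weight rules in Equations 45-48 is given below.  The data
structure, the choice of Tier 1 sample size for the different-stratum case, and the same-hex form
(which is only partially legible in this copy) are assumptions of the sketch, not prescriptions
of the manual.

def combined_w2_pair(i, j, unit):
    """Pairwise Tier 2 weight for units i and j from a combined-strata sample.

    'unit' maps each unit index to a record holding its stratum, 40-hex id,
    weights w1, w2, w2_1, the per-stratum sample sizes n1 and n2, and the
    basic Tier 1 areal-sample weight w (all illustrative field names).
    """
    a, b = unit[i], unit[j]
    if a["stratum"] == b["stratum"]:                       # Eq. 45
        n2 = a["n2"]
        return (2 * n2 * a["w2"] * b["w2"] - a["w2"] - b["w2"]) / (2 * (n2 - 1))
    if a["hex"] != b["hex"]:                               # Eq. 47 (Tier 1 size of unit i's stratum assumed)
        n1 = a["n1"]
        w1ij = (2 * n1 * a["w1"] * b["w1"] - a["w1"] - b["w1"]) / (2 * (n1 - 1))
    else:                                                  # Eq. 48 (same 40-hex; assumed form)
        w1ij = a["w1"] * b["w1"] / a["w"]
    return w1ij * a["w2_1"] * b["w2_1"]                    # Eq. 46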

       In  the case  of  the  quasi-stratified  design   used for  lakes  and  streams,  the

recommendation  is that the sample be conditioned on the realized sample sizes in the several

distinct classes having equal inclusion probabilities (within class).  This  approach leads  to a

-------
 post-stratified sample that can be analyzed exactly like the sample from a stratified design.




 The  gain in  precision will  carry  over into analysis  of  combined strata in the manner




 discussed in this section.
 2.3.3.2  Extensive Resources
       Procedures for combining strata for point samples in extensive resources are the same




 as those outlined for  discrete  resources (Section 2.3.3.1).  Methods to combine strata for




 areal samples in the extensive resources are still under consideration and will be addressed at




 a later time.
 2.3.4  Additional Sources of Error





       Other  potential sources of error  can  be expected in the process of developing the




 distribution  function  and  confidence  bounds.  Some  of  these  have been  discussed  after




 evaluation of the Eastern  and  Western Lake Surveys (Overton,  1989a,  1989b).   These




 additional sources of error  add  to the uncertainty  and bias of the estimated distribution




 function.  Research is presently under way to investigate methods, such as deconvolution, to




 correct for  these added  components  of  error  and   bias.    Preliminary   methods are




 unsatisfactory, and two different approaches are being  followed  to  improve results.  These




 methods will be introduced to EMAP analyses as they become available.





       The  rounding  of  measurements reduces  precision  in  quantiles  and  distribution




function estimation.    Analyses  of  the  National Surface  Water  Surveys suggested  that




reporting  data at two decimal  places beyond  the  inherent  accuracy  of  the  indicator




satisfactorily reduces bias attributed  to rounding error (Overton, 1989b).   It is recommended




that additional decimal places  be carried into the  data set if they are provided  by the




instrumentation.  Additional rounding should be made only at  the  reporting step, and the

-------
rule for  rounding  should take into  account gain in  precision from  averaging and other




statistical practices.










2.3.5  Supplementary Units or Sites






      Supplementary units, in addition to the yearly EMAP grid points, have been selected




and measured or remeasured by some resource groups.  For example, a set of supplementary




units  can   be  selected   as  a  subset of  one  of the  interpenetrating  replicates.   The




remeasurement of these supplementary units is directed at specific issues, such as estimation




of variance, and  the  selection  procedure  is likely to be influenced by this purpose.  If data




from supplementary probability samples  are combined with  the general EMAP sample, it is




necessary to use  a protocol  for combining two  probability  samples.  If the  supplementary




data are  not from a probability sample, then it  is necessary  to use a protocol  for combining




found data with probability sample data  (Overton et  al.,  1993).  Ordinarily, a good strategy




will be to  use  these  supplementary data only  for analyses initially  intended.   The  effort




necessary to satisfactorily combine  supplementary data within the general sample  analysis,




such as  the distribution functions, is sufficiently great that  one  should be reluctant to




attempt  this combination.  On the  other hand,  there will be certain circumstances  in which




this effort is justifiable.

-------
                                        SECTION 3




                       DISTRIBUTION FUNCTION ALGORITHMS
       The types of distribution function algorithms, along with their associated conditions




for application,  are presented  in  Table 1.  The first  part  of  this table  (A) presents the




various cases yielding the distribution of numbers, N(y).  Part B presents  the various cases




discussed in this  report yielding the  distribution functions for the  proportions of numbers.




The methods of obtaining the distribution  functions for  size-weighted statistics are presented




in Part C.





       To explain the notation presented in the following algorithms, some terminology  is




introduced.  The  target  population size,  N, is the size of the  target  subset of the universe  of




units, defined as $\mathcal{U}$.  The following algorithms are written to obtain estimates over a




particular subpopulation of interest.  For a particular subpopulation (a), the distribution  of




numbers is denoted as Na(y)  and the  distribution of the proportion of numbers is denoted  as




$F_a(y)$.  $N_a$ denotes the subpopulation size over the subpopulation, a.  In addition, the n and



$n_a$ refer to the sample size from the population and subpopulation, respectively.





       The  variance estimator  discussed in  Section 2  is based on  the  Horvitz-Thompson




theorem and is appropriate for both equal and  variable probability sampling, independent  of




a known population or  subpopulation size.   The confidence bounds  using this  variance




estimator are then based on  the normal  approximation.  Therefore,  for any  condition, the




general Horvitz-Thompson algorithms for  Na(y) and Fa(y), as presented  in the  following




subsections under variable probability sampling, are appropriate.





       Estimation of these bounds  simplifies under equal probability sampling when the size




of either the population  or the subpopulation is known.  For example,  an exact confidence




bound  for Fa(y)  can be  based on  the  hypergeometric distribution in the  case  of equal

-------
probability sampling when the subpopulation size is known.  When the subpopulation size is




unknown,  these bounds can be based on the binomial distribution.






       It should be emphasized  that  there are no differences in  the distribution functions




obtained from  the alternative design-based approaches discussed in this report.  Further, the




distribution functions obtained under the same conditions based on the Horvitz-Thompson,




the binomial, or the hypergeometric algorithm are the same. The differences occur in the




computation of the confidence bounds.   Note,  however,  that  model-based distribution




functions will be different from those obtained from design-based methods.






       In all situations, the algorithms in this report provide two one-sided 95%  confidence




bounds.  The combined upper and  lower confidence bounds enable two-sided 90%  confidence




bounds on the  distribution function.  The Horvitz-Thompson algorithm estimates standard




errors  from  which  the confidence bound  is based  on  a  normal  approximation.   The




alternative methods directly  provide confidence  bounds  based on the  exact  binomial or




hypergeometric distributions.   All  design-based methods suggested for discrete populations




assume the randomized model, as discussed  in Section 2.  Because  exact methods are usually




preferred over approximate methods, the exact methods are suggested for those cases in




which  the  conditions justify their use.






       A test  data set  was  applied  to the following  algorithms.   Any  resource group




interested  in comparing their versions  of these algorithms to the ones  provided in  this report




is encouraged to contact the authors.  A copy of the test data set will be provided in order




to compare results from other programs.










3.1 Discrete Resources






       In this section, examples are provided for each of the possible approaches to obtaining




$N_a(y)$ and $F_a(y)$ for discrete resources.  For each of these approaches, the conditions and




assumptions of the selection of the sampling units are defined. For quick reference, Table 1

-------
 (Section  4)  presents  this information in condensed form.   An  interest  in  obtaining the




 distribution  function  of  numbers and proportion  of numbers across the subpopulation  is




 expected  for all  resource groups.   For example, the lakes and streams resource group can




 compute  the numbers or proportions of numbers of lakes with some attribute based on this




 algorithm.









 3.1.1  Estimation of Numbers






       A  number of algorithms are  presented for  computing the distribution function for




 numbers.  The choice of the algorithm is dependent on  whether  the  units were chosen by




 either equal or variable selection.  The first three cases (algorithms) in this section derive the




 distribution functions based on an equal probability selection of units and the latter  two




 cases (algorithms) are based on an unequal probability selection of units.









 Equal Probability of Selection





       In  this subsection, three  cases are  provided  based on information that is known or




unknown.  For the first algorithm, N is either known or unknown and $N_a$ is known; this




 algorithm produces  confidence bounds  based  on the hypergeometric  distribution.  For the




second algorithm, N is  known, but  Na is  unknown; this algorithm  is also  based on the




hypergeometric distribution.  For  the third algorithm, both N and Na can be either known




or unknown; this algorithm produces confidence bounds based on the Horvitz-Thompson




variance estimator and the normal approximation.

-------
                                                                               (Case 1)
Case 1 -- Estimation of $N_a(y)$: Discrete Resource, Equal Probabilities, $N_a$ known, $n = n_a$.
           Confidence Bounds by Hypergeometric Algorithm.
   Conditions for approach

        1.  The frame population size, N, can be known or unknown.
        2.  The subpopulation size, $N_a$, is known.
        3.  There is an equal probability of selection of units from the subpopulation.
        4.  Sample size condition: $n = n_a$.
    Outline for Algorithm


       Under the given conditions,  the  confidence bounds can be obtained by either the

exact hypergeometric distribution or by  the normal approximation.  This case  provides the

confidence bounds for $N_a(y)$ by the hypergeometric distribution, when $N_a$ is known.  The

normal approximation bounds are provided in the next subsection (see Examples 4 and  5).


       This  algorithm computes the confidence bounds for each point along the curve using

the hypergeometric distribution.  In the following formula, $N_a$ is the subpopulation size; $n_a$

is the sample size from the subpopulation; $N_a(y)$ refers to the number of units, u, in the

subpopulation, $\mathcal{U}_a$, for which $y_u \le y$; and $n_a(y)$ refers to the number of units in the sample

from $N_a$, $S_a$, for which $y_u \le y$.  Under the conditions, $n_a(y)$ has the following hypergeometric

distribution.  Let X represent the random variable for which $n_a(y)$ is a realization:

                 $\mathrm{Prob}[X = x] = \dbinom{N_a(y)}{x} \dbinom{N_a - N_a(y)}{n_a - x} \Bigg/ \dbinom{N_a}{n_a}$   .                               (49)

The upper confidence bound is computed by obtaining the largest value of $N_a(y)$ for which

$\mathrm{Prob}[X \le n_a(y)] > 0.05$.  The lower confidence bound is computed by obtaining the smallest

value of $N_a(y)$ for which $\mathrm{Prob}[X \ge n_a(y)] > 0.05$.

-------
                                                                   (Case 1)

      A GAUSS program is presented here that derives the confidence bounds based on the

hypergeometric  distribution.   Comments  in  capital  letters  in  braces explain  the

programming steps.  Under the conditions of Case 1, the upper and lower halves of the

confidence bounds are symmetric.
          CALCULATION OF CONFIDENCE BOUNDS ON Na(y) BY THE
                     HYPERGEOMETRIC DISTRIBUTION
load x[a,b] = data;  {LOADS DATA FILE WHICH INCLUDES LABEL CODE AND
                      VARIABLE TO BE ANALYZED. HERE a DESIGNATES
                      THE SAMPLE SIZE, n0, AND b DESIGNATES THE
                      NUMBER OF COLUMN VECTORS}
x=x[.,2];  {IDENTIFIES THE VARIABLE OF INTEREST IN SECOND COLUMN]
nm=rows(x);  {NUMBER OF ELEMENTS OF INTEREST IN  SUBPOPULATION, nc
                      IN THIS ALGORITHM, n=nm=n0}
n=rows(x);
NN=Na;  {DEFINES TOTAL SUBPOPULATION SIZE HERE, Nfl }

x=sortc(x.2);           {SORTS VARIABLE OF INTEREST}
y=seqa(U,nm);         {CREATES SEQUENCE OF NUMBERS}
x2=x[.,2j;              {DEFINES VARIABLE OF INTEREST AS X2}
x=y x2;                {CREATES MATRIX x}
zz=x;                  {DEFINES MATRIX x as zz}


{THE FOLLOWING COMBINES RECORDS WITH DUPLICATE VALUES
      OF THE VARIABLE}
xx=zeros(l,2);
q=0;
 i=1;
 do while i < nm;
    if x[i,2]==x[i+1,2];
      q=q+1;
    else;
      xx=xx|x[i,.];
    endif;
    i=i+1;
 endo;
xx=xx|x[nm,.];

-------
                                                                       (Case 1)
{THE FOLLOWING STEPS BEGIN CONFIDENCE BOUND ESTIMATION)
r=rows(x);
z=zeros(r,1);
x1=x[.,1];
x2=x[.,2];
x=x2~x1~(NN*x1/nm)~z;
{THE FOLLOWING STEPS GENERATE THE UPPER CONFIDENCE BOUND}
i=1;
do while i <= r;   {BEGINS INITIAL DO LOOP)
   rr=x[i,2];
   mm=trunc(NN*rr/nm);
      if mm >= NN;
        goto three;
      endif;

one:;
 mm=mm+1;
   if NN <= 160;
      aa=n!*mm!/NN!*(NN-mm)!*(NN-n)!;
   else;
      aa=lnfact(n) + lnfact(mm) - lnfact(NN) + lnfact(NN-mm) + lnfact(NN-n);
   endif;
 j=0;
    if (NN-mm-n) < 0;
      j=-(NN-mm-n);
   endif;
 s=0;
 do while j <= rr;
   if NN <= 160;
   else;
      a=aa - lnfact(j) - lnfact(mm-j) - lnfact(n-j) - lnfact(NN-mm-n+j);
      a=exp(a);
   endif;
   s=s+a;
   j=j+1;
 endo;
   if s>= .05;
      goto one;
   endif;

three:;
 if mm>=NN;
   x[i,4] = NN;
 else;
   x[i,4]=mm-1;
 endif;
 i=i+1;
ENDO;   {ENDS INITIAL DO LOOP}

-------
                                                                       (Case 1)

{THE FOLLOWING STEPS ADD AN EXTRA LINE TO DATA MATRIX NEEDED IN
  CONFIDENCE BOUND ADJUSTMENT COMPUTED AT END OF ALGORITHM]
r=rows(x);
y=zeros(r,l)
x=x~y;
y=zeros(l,5);
y[l(2:4]=x[r,2:4];
x=x|y;
{THE FOLLOWING STEPS GENERATE THE LOWER CONFIDENCE BOUND}
r=rows(x);
i=1;
do while i <= r;  {BEGINS SECOND DO LOOP}
 rr=x[i,2];
 mm=trunc(NN*rr/n);
    if mm==0;
      goto six;
    endif;

four:;
mm=mm-1;
 if NN <= 160;
    aa=n!*mm!/NN!*(NN-mm)!*(NN-n)!;
 else;
    aa=lnfact(n) + lnfact(mm) - lnfact(NN) + lnfact(NN-mm) + lnfact(NN-n);
 endif;
 j=rr;
 mnm=minc(n|mm);
 s=0;
do while j <= mnm;
 if NN <= 160;
    a=aa/(j!*(mm-j)!*(n-j)!*(NN-mm-n+j)!);
 else;
    a=aa - lnfact(j) - lnfact(mm-j) - lnfact(n-j) - lnfact(NN-mm-n+j);
    a=exp(a);
 endif;
   s=s+a;
   j=j+1;
endo;
 if s >= .05;
   goto four;
 endif;

six:;
 if mm==0;
   x[i,5]=0;
 else;
   x[i,5]=mm+1;
 endif;
 i=i+1;
ENDO;   {ENDS SECOND DO LOOP}

-------
                                                                     (Case 1)
{ASSIGN LABELS}
 "N = "  NN  ",  n = "  n;
x;
OUTPUT OFF;
{ADJUST Na(y) AND CONFIDENCE BOUNDS - AVERAGE SUCCESSIVE VALUES}
r=rows(x);
xx=x;
i=2;
do while i <= r-1;
 xx[i,3:5] = (x[i,3:5] + x[i-1,3:5])/2;
 i=i+1;
endo;

{OUTPUT FILE AND PRINT MATRIX x}
OUTPUT FILE=NAME;
OUTPUT reset;
" x "  " Sequence # "  " F(x) "  " F-lower(x) "  " F-upper(x) ";
format /m1/rd 12,7;
print x;
OUTPUT OFF;

end;
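
       The search that the GAUSS program performs can also be sketched outside GAUSS.
The fragment below is an illustrative Python sketch only, not part of the manual's programs;
it assumes the SciPy library is available and that the arguments (Na, na, and the observed
count na_y) are supplied by the analyst.

   from scipy.stats import hypergeom

   def hypergeom_bounds(Na, na, na_y, alpha=0.05):
       # One-sided 95% bounds on Na(y); Na is the known subpopulation size,
       # na the sample size, na_y the count of sampled units with y_u <= y.
       # Lower bound: smallest Na(y) with P[X >= na_y] > alpha.
       lower = na_y
       for m in range(na_y, Na + 1):
           if hypergeom.sf(na_y - 1, Na, m, na) > alpha:
               lower = m
               break
       # Upper bound: largest Na(y) with P[X <= na_y] > alpha.
       upper = na_y
       for m in range(na_y, Na + 1):
           if hypergeom.cdf(na_y, Na, m, na) > alpha:
               upper = m
           else:
               break
       return lower, upper

   # Example: 120 subpopulation units, 30 sampled, 12 with y_u <= y.
   print(hypergeom_bounds(120, 30, 12))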

-------
                                                                              (Case 2)
Case 2 - Estimation of Na(y): Discrete Resource, Equal Probabilities,
           N known, Na unknown, n > na.
           Confidence Bounds by Hypergeometric Algorithm.
   Conditions for approach

       1.  The frame population size, N, is known.
       2.  The subpopulation size, Na, is unknown.
       3.  There is an equal probability of selection of units from the subpopulation.
       4.  Sample size condition: n > na.
   Outline for Algorithm


       Under the given conditions, the confidence bounds can be obtained either from the
exact hypergeometric distribution or from the normal approximation.  This example provides
the confidence bounds for Na(y) by the hypergeometric distribution, when N is known but
Na is unknown.  Normal approximation bounds are provided in the next subsection (see
Examples 4 and 5).

       This algorithm computes the confidence bounds for each point along the curve using
the hypergeometric distribution.  In the following formula, N is the frame population size; n
is the sample size from the frame population; Na(y) refers to the number of units, u, in the
subpopulation, a, for which yu < y; and na(y) refers to the number of units in the sample
from Na, Sa, for which yu < y.  Under the conditions, na(y) has the following hypergeometric
distribution.  Let X represent the random variable for which na(y) is a realization.  Note
that na(y) < n and that Na(y) < Na < N.

       Prob[X = x] = \binom{N_a(y)}{x}\binom{N - N_a(y)}{n - x} \Big/ \binom{N}{n}                (50)

-------
                                                                         (Case 2)


The upper confidence bound is computed by obtaining the largest value of Na(y) for which
Prob[X < na(y)] > 0.05.  The lower confidence bound is computed by obtaining the smallest
value of Na(y) for which Prob[X > na(y)] > 0.05.

       To obtain the distribution function, the data file needs to be sorted on the indicator,
either in ascending or descending order.  When the data file is sorted in ascending order
on the indicator, the distribution function of numbers, Na(y), denotes the number of units in
the target population that have a value less than or equal to the specific y.  Conversely, if
it is of interest to obtain bounds on the number of units in the target population with
indicator values greater than or equal to y, the data file must be sorted and analyzed in
descending order on this variable.  The distribution function generated by the analysis in
descending order is [Na - Na(y)].


      A  GAUSS program  provided  in Case 1 derives the confidence bounds  based on the

hypergeometric distribution. However, under the conditions  discussed here, the sample size

and population sizes are defined  as follows.
           CALCULATION OF CONFIDENCE BOUNDS ON N0(y) BY THE
                       HYPERGEOMETRIC DISTRIBUTION
load x[a,b] = data;  {LOADS DATA FILE WHICH INCLUDES LABEL CODE AND
                        VARIABLE TO BE ANALYZED. HERE a DESIGNATES
                        THE OBSERVED SAMPLE SIZE, na, AND b DESIGNATES
                        THE NUMBER OF COLUMN VECTORS}
x=x[.,1];
nm=rows(x);    {NUMBER OF ELEMENTS OBSERVED, na.  IN THIS ALGORITHM,
                  n > na}
n=#;        {FULL SAMPLE SIZE}
NN=N;      {DEFINES TOTAL POPULATION SIZE HERE }
REFER TO CASE 1 (AFTER LINE 13) FOR THE REMAINING STEPS
   IN THIS PROGRAM.

-------
                                                                               (Case 3)
Case 3 - Estimation of Na(y): Discrete Resource, Equal Probabilities.
           Confidence Bounds by Horvitz-Thompson Standard Error
           and Normal Approximation.
   Conditions for approach

       1.  The frame population size, N, can be known or unknown.
       2.  The subpopulation size, Na, can be known or unknown.
       3.  There is an equal probability of selection of units from the subpopulation.
          [Note that this algorithm can also be applied to those cases presented in
          Examples 1 and 2.]
   Outline for Algorithm


       The algorithm recommended, given the foregoing conditions, is based on the Horvitz-
Thompson formulae, which were discussed in Section 2.  The algorithm presented for the
general case of variable probability of selection (the following subsection) is appropriate to
use given the foregoing conditions.

       Equal probability selection is a special case of variable probability selection.  In equal
probability of selection of units, the weighting factors are equal for all units, wi = wj = w.  If
the weights, w1i and w2.1i, are appropriately identified, then the general algorithm presented
in Example 4 will not need any modification.  The Tier 2 weight, w2i, computed by
Equation 4 is the same for all units.
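
       As a brief check of this reduction (a restatement, not an addition to the manual's
formulae), substituting a common weight w for every unit of the subpopulation sample Sa
into the design-based estimator of Na given in Case 4 (Equation 51) yields

       \hat{N}_a \;=\; \sum_{S_a} w_{2i} \;=\; n_a\, w ,

so the general variable-probability algorithm reproduces the equal-probability estimate
without modification.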

-------
Variable Probability Selection


       In this subsection, two examples are provided to demonstrate variable probability of
selection.  For both cases, the frame population size can be known or unknown.  In Case 4,
Na can be unknown, or known and equal to the estimate N̂a.  For Case 5, Na is known and
not equal to N̂a.  Both algorithms produce confidence bounds based on the Horvitz-Thompson
variance estimator and the normal approximation.

-------
                                                                              (Case 4)
Case 4 - Estimation of Na(y): Discrete Resource, Variable Probabilities,
           Na unknown, or Na known and equal to N̂a.
           Confidence Bounds by Horvitz-Thompson Standard Error
           and Normal Approximation.
   Conditions for approach

       1. The frame population size, N, can be known or unknown.
       2. The subpopulation size, Na, is unknown, or known and equal to N̂a.
       3. There is a variable probability of selection of units from the subpopulation.
   Outline for Algorithm


       The algorithm supplied for this example is based on the Horvitz-Thompson formulae,
which were discussed in Section 2.  This algorithm is appropriate for a sample subset for any
subpopulation a that is of interest.  It is useful to identify the estimator of Na from Tier 2.
The design-based estimator of Na is

                                  \hat{N}_a = \sum_{S_a} w_{2i} ,                                (51)

where Sa is the portion of the sample from the subpopulation, a, over which the weighting
factors (w2i) are summed.  The variance estimator for N̂a is presented in Equation 3b of
Section 2.



   Calculation of confidence bounds on Na(y) by the Horvitz-Thompson formulae

      For each indicator, the following algorithm  derives the  distribution function and the

confidence bound for N(y) or Na(y).  This algorithm is similar to the algorithm defined for

the National Surface Water Surveys (Overton, 1987a,b).  The Horvitz-Thompson variance

estimator,  discussed in Section 2.1, is used to compute the variance in  this algorithm.  The

confidence bounds are computed  based on a normal approximation.

-------
                                                                          (Case 4)
1. Data set
     a. Unit identification code
     b. Tier 1 weighting factor, w1i
     c. Tier 2 conditional weighting factor, w2.1i
     d. Indicator of interest (y)

2. Sorting of data
       The data file needs to be sorted on the indicator, either in ascending or
descending order.  When the data file is sorted in ascending order on the indicator,
the distribution function of numbers, Na(y), denotes the number of units in the
target population that have a value less than or equal to the y for a specific
indicator.  Conversely, if it is of interest to estimate the number of units in the
target population with indicator variables greater than or equal to y, the data file
would be sorted in descending order on this variable.  The distribution function
generated by the analysis in descending order is [Na - Na(y)].

3. Computation of weighting factors
       The Tier 1 and Tier 2 weights are included for each observation in the data
set.  These weights are used to compute the total weight of selecting the ith unit in
the Tier 2 sample.  Compute the following weight for each observation:

                                  w2i = w1i * w2.1i ,

where w1i is the weighting factor for the ith unit in the Tier 1 sample (the inverse of
its Tier 1 inclusion probability) and w2.1i is the inverse of the conditional Tier 2
inclusion probability.

4.  Algorithm for Na(y)

     a.  Define a matrix of q column vectors, as follows.
         There is one row for each data record and five statistics for each row.
             q1 = value of y variable for the record
             q2 = Na(y)
             q3 = var[Na(y)]
             q4 = upper confidence bound for Na(y)
             q5 = lower confidence bound for Na(y)

     b.  Index rows using i from 1 to n; the ith row will contain q-values
         corresponding to the ith record in the file, as analyzed.

     c.  Read the first observation (first row of the data matrix), following with the
         successive observations, one at a time.  Accumulate the q-statistics as each
         observation is read into the file.  Continue this loop until the end of file is
         reached.  At that time, store these vectors and go to d.  This algorithm is
         calculating the distribution for the number of units [Na(y)] in the
         subpopulation.  It is necessary to identify the records for which w2.1i = 1.

-------
                                                                     (Case 4)
       i.   q1[i] = y[i]

       ii.  q2[i] = q2[i-1] + w2i

       iii. q3[i] = q3[i-1] + w2i*(w2i - 1) + 2 * sum over j < i of (w2i*w2j - w2ij)

              where, if neither w2.1i nor w2.1j = 1:

                     w2ij = w1ij * (2*n2*w2.1i*w2.1j - w2.1i - w2.1j) / (2*(n2 - 1)) ;

              where, if either w2.1i or w2.1j = 1:

                     w2ij = w1ij * w2.1i * w2.1j ;

                        and where:

                             w1ij = (2*n1*w1i*w1j - w1i - w1j) / (2*(n1 - 1))

       iv. q4[i] = q2[i] + 1.645*sqrt(q3[i])

       v.  q5[i] = q2[i] - 1.645*sqrt(q3[i])
   Multiple observations with one y value create multiple records in the above
   analysis for one distinct value of y.  The last record for that y contains all
   the information needed for Na(y).  Therefore, at this stage of the analysis,
   eliminate all but the last record for  those y values that have multiple
   records.

d. Output of interest
   From the last entry of the row of q-vectors just computed:
       i.   q1 = largest value of y (or smallest if the analysis is descending)
       ii.  q2 = N̂a
       iii. q3 = var(N̂a)
       iv.  Standard error of N̂a = sqrt(q3)
   From the q column vectors:
       i.   q1 represents the ordered vector of distinct values of y
       ii.  q2 represents the estimated distribution function, Na(y),
            corresponding to the values of y
       iii. q4 represents the 95% one-sided upper confidence
            bound of the distribution function, Na(y)
       iv.  q5 represents the 95% one-sided lower confidence
            bound of the distribution function, Na(y)
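
   The record-by-record accumulation in steps 4a through 4c can be sketched in Python as
follows.  This is an illustrative sketch only, not part of the manual; the function name and
the pairwise_weight argument (a user-supplied routine returning the pairwise inclusion
weight w2ij from the Section 2 formulae) are hypothetical.

   import numpy as np

   def ht_distribution_of_numbers(y, w2, pairwise_weight, z=1.645):
       # y and w2: indicator values and total weights (w2i = w1i*w2.1i), already
       # in analysis order; pairwise_weight(wi, wj): hypothetical user-supplied
       # routine returning the pairwise inclusion weight w2ij.
       q2 = q3 = 0.0
       rows = []
       for i, (yi, wi) in enumerate(zip(y, w2)):
           q2 += wi                              # q2[i] = q2[i-1] + w2i
           q3 += wi * (wi - 1.0)                 # own-unit variance contribution
           for wj in w2[:i]:                     # pairwise contributions, j < i
               q3 += 2.0 * (wi * wj - pairwise_weight(wi, wj))
           half = z * np.sqrt(q3)
           rows.append((yi, q2, q3, q2 + half, q2 - half))
       last = {}
       for r in rows:                            # keep the last record per y value
           last[r[0]] = r
       return list(last.values())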

-------
                                                                               (Case 5)
Case 5 - Estimation of Na(y): Discrete Resource, Variable Probabilities,
          Na known and not equal to N̂a.
          Confidence Bounds by Horvitz-Thompson Standard Error
          and Normal Approximation.
   Conditions for approach

       1.  The frame population size, N, can be known or unknown.
       2.  The subpopulation size, Na, is known and not equal to N̂a.
       3.  There is a variable probability of selection of units from the subpopulation.
   Outline for Algorithm


       The algorithm supplied in this section  is based on the Horvitz-Thompson  formulae,

which  were discussed in Section 2.  This algorithm is appropriate for a sample subset for any

subpopulation a  that  is of interest.  The algorithm for  the distribution  function for the

proportion  of numbers,  Fa(y), given exactly the same conditions listed above, is presented  in

Case 8.  To compute the distribution function of numbers, Na(y), first use the algorithm in
Case 8 to compute the distribution function with the corresponding confidence bounds for
the proportion of numbers.  Then, compute the following:

                                  N_a(y) = F_a(y) * N_a ,                              (52)

where Na is the known subpopulation size.  To compute the confidence bounds for Na(y),
simply multiply the upper and lower confidence limits of Fa(y) by Na.
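
       If the Case 8 output is held as rows of (y, Fa, var, upper, lower), the rescaling in
Equation 52 is a one-line operation.  The Python fragment below is an illustrative sketch
with hypothetical names, not part of the manual.

   def numbers_from_proportions(case8_rows, Na):
       # Equation 52: scale the estimate and both confidence limits by the known Na.
       return [(y, Fa * Na, upper * Na, lower * Na)
               for (y, Fa, _var, upper, lower) in case8_rows]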

-------
3.1.2 Proportions of Numbers


       A number of algorithms are presented to compute the distribution function for the
proportion of numbers.  For any case in a resource group, the choice of the algorithm is first
determined by the method by which the units were selected.  The first two algorithms in
this section derive the distribution functions based on an equal probability selection of units,
and the latter two algorithms are based on an unequal probability selection of units.


Equal Probability of Selection


       In this subsection, two examples are provided based on whether the subpopulation
size is known or unknown.  For the first algorithm, Na can be known or unknown; this
algorithm produces confidence bounds based on the binomial distribution.  For the second
algorithm, Na is known; this algorithm is based on the hypergeometric distribution.

-------
                                                                               (Case 6)
Case 6 - Estimation of Fa(y): Discrete Resource, Equal Probabilities,
           Na known or unknown.
           Confidence Bounds by Binomial Algorithm.
   Conditions for approach


       1.  The frame population size, N, can be known or unknown.
       2.  The subpopulation size, Na, can be known or unknown.
       3.  There is an equal probability of selection of units from the subpopulation.
   Outline for Algorithm


       Under the given conditions, in which Na may not be known, the confidence bounds
can be based on the binomial distribution.  In addition, Example 8 provides the normal
approximation approach to the confidence bound estimation.

       A program, based on the binomial distribution and written in the GAUSS language,
is presented in this section.  We assume X has the binomial distribution,
X ~ Binomial[na, Fa(y)], where na(y) is the observed realization of X, na represents the
number of "trials", the true finite population proportion of "successes" is Na(y)/Na, and
Fa(y) is the infinite population parameter.  The estimated distribution function is denoted
as F̂a(y) = na(y)/na, where na is the sample size from the subpopulation and na(y) refers to
the number of units in the sample for which yu < y.  The upper confidence bound is
computed by obtaining the largest value of Fa(y) for which Prob[X < na(y)] > 0.05.  The
lower confidence bound is computed by obtaining the smallest value of Fa(y) for which
Prob[X > na(y)] > 0.05.  As written, the algorithm calculates the upper and lower confidence
bounds to three decimal places.

       Comments in capital letters in braces explain the programming steps.  Under these
conditions, the upper and lower halves of the confidence bounds are symmetric.

-------
                                                               (Case 6)
          CALCULATION OF CONFIDENCE BOUNDS ON Fa(y) BY THE
                         BINOMIAL DISTRIBUTION
 load x[a,b] = data;  {LOADS DATA FILE FOR THE TARGET SUBPOPULATION
                     WHICH INCLUDES LABEL CODE AND VARIABLE TO
                     BE ANALYZED. HERE a DESIGNATES THE SAMPLE
                     SIZE, na, AND b DESIGNATES THE NUMBER OF
                     COLUMN VECTORS}
 n=rows(x);            {SAMPLE SIZE IN TARGET SUBPOPULATION, na}
 x=sortc(x,2);           {SORTS VARIABLE OF INTEREST}
 y=seqa(1,1,n);          {CREATES SEQUENCE OF NUMBERS}
 x2=x[.,2];              {DEFINES VARIABLE OF INTEREST AS x2}
 x=y~x2;                 {CREATES MATRIX x}
 {THE FOLLOWING STEPS COMBINE RECORDS WITH COMMON y-VALUES}
 xx=zeros(1,2);
 q=0;
 i=1;
do while i < n;
 if x[i,2]==x[i+1,2];
   q=q+1;
   else; xx=xx|x[i,.];
 endif;
 i=i+1;
endo;

xx=xx|x[n,.];
r=rows(xx);
x=xx;
{THE FOLLOWING STEPS FORM DATA MATRIX x}
r=rows(x);
z=zeros(r,1);
x1=x[.,1];
x2=x[.,2];
x=x2~x1~(x1/n)~z;
{THESE STEPS GENERATE BINOMIAL COMBINATION TERMS}
f=zeros(n+1,1);
i=0;
do while i<=n;
 if n<160;
    f[i+1,1]=n!/(i!*(n-i)!);
 else;
    f[i+1,1]=lnfact(n) - lnfact(i) - lnfact(n-i);
 endif;
 i=i+1;
endo;

-------
                                                                        (Case 6)
 {THE FOLLOWING STEPS GENERATE UPPER CONFIDENCE BOUND}
 i=1;
 do while i <= r;  {BEGINS INITIAL DO LOOP}
  rr=x[i,2];
  p=(trunc(100*x[i,3]))/100;
    if p==1.0;
      p=p-.001;
      goto three;
    endif;

 one:;
 p=p+.01;
 j=0;
 s=0;
 do while j <= rr;
  a=f[j+1,1]*p^j*(1-p)^(n-j);
  s=s+a;
  j=j+1;
 endo;
  if s >= .05;
    goto one;
  endif;

 two:;
 p=p-.001;
 j=0;
 s=0;
 do while j <= rr;
  a=f[j+1,1]*p^j*(1-p)^(n-j);
  s=s+a;
  j=j+1;
 endo;
  if s <= .05;
    goto two;
  endif;

 three:;
 x[i,4]=p+.001;
 i=i+1;
 ENDO;   {ENDS INITIAL DO LOOP}
                                     "64

-------
                                                                    (Case 6)
{THE FOLLOWING STEPS ADD AN EXTRA LINE TO DATA MATRIX NEEDED IN
   CONFIDENCE BOUND ADJUSTMENT COMPUTED AT END OF ALGORITHM}
r=rows(x);
y=zeros(r,l);
x=x~y;
y=zeros(l,5);
y[l,2]=n;
x=x|y;

{THE FOLLOWING STEPS GENERATE LOWER CONFIDENCE BOUND}
r=rows(x);
i=1;
do while i <= r;   {BEGINS SECOND DO LOOP}
 rr=x[i,2];
 p=(trunc(100*x[i,3]))/100;
    if p==0;
      p=.001;
      goto six;
    endif;

four:;
p=p-.01;
 if p<=0;
    p=.001;
    goto six;
 endif;
j=rr;
s=0;
do while j <= n;
 a=f[j+1,1]*p^j*(1-p)^(n-j);
 s=s+a;
 j=j+1;
endo;
 if s >= .05;
    goto four;
 endif;

five:;
p=p+.001;
j=rr;
s=0;
do while j <= n;
 a=f[j+1,1]*p^j*(1-p)^(n-j);
 s=s+a;
 j=j+1;
endo;
 if s <= .05;
    goto five;
 endif;

-------
                                                                    (Case 6)
six:;
x[i,5]=p-.001;
i=i+1;
ENDO;   {ENDS SECOND DO LOOP}
{ADJUST Fa(y) AND CONFIDENCE BOUNDS - AVERAGE SUCCESSIVE VALUES}
r=rows(x);
xx=x;
i=2;
do while i <= r-1;
 xx[i,3:5] = (x[i,3:5] + x[i-1,3:5])/2;
 i=i+1;
endo;

{OUTPUT FILE AND PRINT MATRIX x}
OUTPUT FILE=NAME;
OUTPUT ON;
"x"   "Sequence #"   "F(x)"   "F-upper(x)"   "F-lower(x)";
format /m1/rd 12,7;
print x;
OUTPUT OFF;

end;
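
       The same grid search can be sketched outside GAUSS.  The fragment below is an
illustrative Python sketch only (it assumes the SciPy library is available), not part of the
manual's programs.

   from scipy.stats import binom

   def binomial_bounds(na, na_y, alpha=0.05, step=0.001):
       # One-sided 95% bounds on Fa(y) from X ~ Binomial(na, Fa(y)), searched on
       # a grid of width `step`; na_y is the observed count with y_u <= y.
       grid = [round(k * step, 3) for k in range(int(1.0 / step) + 1)]
       # Upper bound: largest p with P[X <= na_y] > alpha.
       upper = max((p for p in grid if binom.cdf(na_y, na, p) > alpha), default=0.0)
       # Lower bound: smallest p with P[X >= na_y] > alpha.
       lower = min((p for p in grid if binom.sf(na_y - 1, na, p) > alpha), default=1.0)
       return lower, upper

   # Example: 30 sampled units, 12 with y_u <= y.
   print(binomial_bounds(30, 12))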

-------
                                                                               (Case 7)
Case 7 - Estimation of Fa(y): Discrete Resource, Equal Probabilities, Na known.
           Confidence Bounds by Hypergeometric Algorithm.
   Conditions for approach

       1.  The frame population size, N, can be known or unknown.
       2.  The subpopulation size, Na, is known.
       3.  There is an equal probability of selection of units from the subpopulation.
   Outline for Algorithm


       Under  the given  conditions,  the confidence  bounds can  be based  either on  the

binomial  or on  the  hypergeometric distribution.   The  binomial  algorithm presented in

Example  6  is appropriate to use given  the foregoing conditions.  In addition,  Example 9

provides the normal approximation approach, which  is also applicable,  given the foregoing

conditions, to the confidence bound estimation.


       To obtain confidence bounds for Fa(y) based on the hypergeometric distribution, refer
to the algorithm provided for the confidence bound calculation for Na(y) in Example 1.
Simply divide the lower and upper confidence bounds, and Na(y), by the known
subpopulation size, Na.  No further changes are necessary to this algorithm to provide
confidence bounds for Fa(y) based on the hypergeometric distribution.

-------
Variable Probability Selection


       In this subsection, two cases are provided to demonstrate variable probability of
selection.  For both cases, the frame population size can be known or unknown.  In Case 8,
Na can be unknown, or known and not equal to the estimate N̂a; this algorithm produces
confidence bounds based on the Horvitz-Thompson ratio standard error and the normal
approximation.  For Case 9, Na is known and equal to N̂a; this algorithm produces
confidence bounds based on the Horvitz-Thompson variance estimator and the normal
approximation.
-------
                                                                               (Case 8)
Case 8 - Estimation of Fa(y): Discrete Resource, Variable Probabilities,
           Na unknown, or known and not equal to N̂a.
           Confidence Bounds by Horvitz-Thompson Ratio Standard Error
           and Normal Approximation.
   Conditions for approach

       1.  The frame population size, N, can be known or unknown.
       2.  The subpopulation size, Na, is unknown, or known and not equal to N̂a.
       3.  There is a variable probability of selection of units from the subpopulation.
   Outline for Algorithm


       The algorithm supplied in  this section is based on the  Horvitz-Thompson formulae,

which were discussed in Section 2.  This algorithm is appropriate for a sample subset for any

subpopulation a that is of interest.



   Calculation of confidence bounds on Fa(y) by the Horvitz-Thompson formulae


       For each indicator, the following algorithm derives the distribution function and the
confidence bound for Na(y) similar to that given in Example 4.  In this section, however,
the interest is in obtaining a distribution function for proportions.  Therefore, the variance
of a ratio estimator is used in this algorithm.  The confidence bounds are computed based
on a normal approximation.
        1.  Data set
             a. Unit identification code
             b. Tier 1 weighting factor, w1i
             c. Tier 2 conditional weighting factor, w2.1i
             d. Indicator of interest (y)
             e. The subset of data corresponding to the subpopulation of interest, indexed
                  by a.

        2.  Computation of weighting factors

              This step does not  have to be made with  each use of the  datum,  as  the
        weights  are permanent  attributes of a sampling unit.  The following details  are

-------
                                                                         (Case 8)
given for completeness.

       The Tier 1 and Tier 2 weights are included for each record in the data set.
These weights are used to compute the total weight of selecting the ith unit in the
Tier 2 sample.  Compute the following weight for each record:

                                  w2i = w1i * w2.1i ,

where w1i is the weighting factor for the ith unit in the Tier 1 sample (the inverse of
its Tier 1 inclusion probability) and w2.1i is the inverse of the conditional Tier 2
inclusion probability.  The pairwise inclusion weight is defined below.  The sample
size at Tier 2, n2, is not subpopulation specific.
3.  Algorithm for Fa(y) and Confidence Intervals

     a.  Sorting of data.  The data file needs to be sorted on the indicator, either in
         ascending or descending order.  When the data file is sorted in ascending
         order on the indicator, the distribution function of proportions, Fa(y),
         denotes the proportion of units in the target population that have a value
         less than or equal to the y for a specific indicator.  Conversely, if it is of
         interest to estimate the proportion of units in the target population with
         indicator variables greater than or equal to y, the data file would be sorted
         in descending order on this variable.  The distribution function generated
         by the analysis in descending order is [1 - Fa(y)].

     b.  First, compute N̂a = sum over Sa of w2i  (this sums over the data matrix).

     c.  Define a matrix of q column vectors, as follows.
         There is one row for each data record and five statistics for each row.
             q1 = value of y variable for the record
             q2 = Fa(y)
             q3 = var[Fa(y)]
             q4 = upper confidence bound for Fa(y)
             q5 = lower confidence bound for Fa(y)

     d.  Index rows using i from 1 to n; the ith row will contain q-values
         corresponding to the ith record in the file, as analyzed.

     e.  Read the first observation (first row of the data matrix), following with the
         successive observations, one at a time.  Accumulate and store the
         q-statistics, below, as each observation is read into the file.  Continue this
         loop until the end of file is reached.

             i.   q1[i] = y[i]

             ii.  q2[i] = q2[i-1] + w2i/N̂a

         Multiple observations with one y-value create multiple records in the
         preceding analysis for one distinct value of y.  The last record for that y

-------
                                                                    (Case 8)
   contains all the information needed for Fa(y).  Therefore, at this stage of
   the analysis, eliminate from the q-file all but the last record for those y
   values that have multiple records.

f. Entries in the first column (q1) of the q-matrix identify the vector of
   y-values for the remainder of the calculations.  For each such y-value, yi,
   make the following calculations.  Note that this part of the algorithm is not
   recursive; each calculation is made over the entire sample.

        iii. q3[i] = var[Fa(yi)], computed with the Horvitz-Thompson ratio variance
             estimator of Section 2, using the pairwise inclusion weights

                    w2jk = (2*n2*w2j*w2k - w2j - w2k) / (2*(n2 - 1))

             and the deviations dj (and similarly dk).

        iv. q4[i] = q2[i] + 1.645*sqrt(q3[i])

        v.  q5[i] = q2[i] - 1.645*sqrt(q3[i])

   Output of interest
   From the q column vectors:
       i.   q1 represents the ordered vector of distinct values of y.
       ii.  q2 represents the estimated distribution function, Fa(y),
            corresponding to the values of y.
       iii. q4 represents the 95% one-sided upper confidence
            bound of the distribution function, Fa(y).
       iv.  q5 represents the 95% one-sided lower confidence
            bound of the distribution function, Fa(y).
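
   The ratio form of the estimator can be sketched in Python as follows.  This is an
illustrative sketch only; the ratio_variance argument stands in for the Section 2
Horvitz-Thompson ratio variance computation (its interface here is hypothetical), and the
function name is not from the manual.

   import numpy as np

   def ratio_distribution_of_proportions(y, w2, ratio_variance, z=1.645):
       # y, w2: indicator values and total weights for the subpopulation sample Sa.
       w2 = np.asarray(w2, dtype=float)
       Nhat_a = w2.sum()                     # step b: Nhat_a = sum of w2i over Sa
       rows = []
       for yi in sorted(set(y)):             # ascending analysis; reverse for descending
           indic = np.array([1.0 if yj <= yi else 0.0 for yj in y])
           Fa = (w2 * indic).sum() / Nhat_a  # ratio of Horvitz-Thompson totals
           v = ratio_variance(indic, w2)     # hypothetical user-supplied routine
           half = z * np.sqrt(v)
           rows.append((yi, Fa, v, Fa + half, Fa - half))
       return rows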

-------
                                                                               (Case 9)
Case &- Estimation of Ffl(y): Discrete Resource, Variable Probabilities,
          Na known and equal to Nfl.
          Confidence Bounds by Horvitz-Thompeon Standard Error
          and Normal Approximation.
   Conditions for approach

       1.  The frame population size, N, can be known or unknown.
       2.  The subpopulation size, Nfl, is known and equal to NQ.
       3.  There is a variable probability of selection of units from the subpopulation.
   Outline for Algorithm


       The algorithm supplied in this section is based  on the  Horvitz-Thompson formulae,

which  were discussed in Section 2.  This algorithm is appropriate for a sample subset1 for any

subpopulation a that is of interest.



   Calculation of confidence bounds on Fa(y) by the Horvitz-Thompson formulae


       For each  indicator, the following algorithm derives the distribution function and  the

confidence bounds for Na(y) exactly as given in Example 4.  Because Na is known and equal

to N̂a, it is not necessary to use the ratio estimator applied in Case 8.  The distribution

function of Fa(y) is obtained by dividing the distribution function, Na(y), and the associated

bounds, by Na.  (These additional steps are included in this algorithm.)  The Horvitz-

Thompson variance estimator, discussed in  Section 2.1, is used to compute  the  variance in

this algorithm. The confidence bounds are computed based on a normal approximation.
        1.  Data set
             a. Unit identification code
             b. Tier 1  weighting factor, w]t.
             c. Tier 2  conditional  weighting factor, w2j,
             d. Indicator of interest (y)

-------
                                                                         (Case 9)
2. Sorting of data

       The data file  needs to be sorted on  the indicator, either in an ascending or
descending order.  When the data file is sorted in ascending order on the indicator,
the distribution  function of proportions,  FQ(y), denotes the proportion of units in
the target population that have a value  less than or equal to the  y  for a specific
indicator.  Conversely, if it is of interest  to estimate the proportion of units  in the
target population with indicator variables greater than or equal to  y, the data file
would be sorted in descending  order on  this variable.   The  distribution function
generated by the analysis in descending order is [1  - F0(y)].

3. Computation of weighting factors

       For this step, refer  to the  program  steps  given in Example  4  to derive the
distribution function and the confidence bound for N0(y).  Follow the steps labeled
3  and  4.   Additional  steps,  shown  here,  are  needed  to  obtain  Fa(y)  and  its
corresponding confidence bounds.  Proceed with the following steps after conducting
steps 3 and 4  from Example 4:
     e. The operations that follow generate the q vectors to compute the estimated
        distribution function and appropriate confidence bounds for Fa(y).  These
        are denoted by q6 through q8.  Each element of q6-q8 is computed by
        performing the following operations on the corresponding elements of q2, q4,
        and q5.

            i.   q6 = Divide each element of q2 by the known subpopulation size
            ii.  q7 = Divide each element of q4 by the known subpopulation size
            iii. q8 = Divide each element of q5 by the known subpopulation size
        From the q vectors:

             i.   q6 represents the estimated distribution function, Fa(y)
             ii.  q7 represents the 95% one-sided upper confidence
                  bound of the distribution function, Fa(y).
             iii. q8 represents the 95% one-sided lower confidence
                  bound of the distribution function, Fa(y).

-------
3.1.3 Rationales for Approaches


       Justification for the variance estimators used in the algorithms in Sections 3.1.1 and
3.1.2 was presented in Section 2 of this report.  The different choices proposed for confidence
bound estimation, under some conditions, were also discussed.  For example, both the
hypergeometric and binomial approaches to the confidence bound calculation for Fa(y),
when Na is known, were provided in the above cases.  Choice of one of the approaches
presented to compute confidence bounds for Fa(y), when the subpopulation size is known,
depends in part on the available information and in part on the purpose of inference.  The
bounds based on the hypergeometric distribution provide for inferences directed to the finite
population.  For example, if data are available for every lake in a small population of lakes,
there is no uncertainty relative to this attribute for this population (in the absence of
measurement error).  Bounds based on the hypergeometric or on the normal approximation
approach will reduce to zero width as n approaches N, because of the finite population
correction.  These bounds are more relevant for management purposes.  In contrast, bounds
based on the binomial distribution provide for inferences directed to the superpopulation
parameter.  In this situation, the entire population is considered as a sample from the
superpopulation.  Statements about the set of high mountain lakes in New England are
finite, but general statements about high mountain lakes, based on those found in New
England, are relative to a hypothetical, infinite superpopulation.  Therefore, the confidence
bounds obtained by the binomial distribution are broader than those provided by the
hypergeometric distribution, to account for this additional level of variability.

-------
3.1.4 Estimation of Size-Weighted Statistics


       A few algorithms are presented to compute the distribution functions for size-
weighted totals and size-weighted proportions of totals.  The following subsection describes
algorithms to compute the distribution function for size-weighted totals.  The next
subsection presents algorithms to compute the distribution function for the proportions of
size-weighted totals.


Estimation of Size-Weighted Totals


       In this subsection, two examples are provided based on information that is known or
unknown.  For the first algorithm, the size-weighted total, Za, is unknown, or known and
equal to the estimate Ẑa; this algorithm produces confidence bounds based on the Horvitz-
Thompson standard error and the normal approximation.  For the second algorithm, Za is
known but not equal to Ẑa; this algorithm produces confidence bounds based on the
Horvitz-Thompson ratio standard error and the normal approximation.

-------
                                                                              (Case 10)
Case 10 - Estimation of Za(y): Discrete Resource, Size-Weighted Estimate,
            Equal or Variable Probabilities.  Za unknown, or known and
            equal to Ẑa.
            Confidence Bounds by Horvitz-Thompson Standard Error
            and Normal Approximation.
   Conditions for approach

       1. The frame population size-weighted total, Z, can be known or unknown.
       2. The subpopulation size-weighted total, Za, is unknown, or known and
          equal to Ẑa.
       3. There can be an equal or variable probability selection of units from
          the subpopulation.
   Outline for Algorithm


       General formulae for Tier 1 estimates were provided in Section 2.1.1.  The general
form of a size-weighted estimate in a subpopulation at Tier 1, denoted as Ẑa, is similar to
Equation 2.  The yi in that equation refers to the size-weight value, now denoted as zi:

                                  \hat{Z}_a = \sum_{S_a} w_i z_i ,

where zi defines a size-weight, such as the area of a lake or the stream length in miles, and
wi is the inverse of the inclusion probability at Tier 1.  Using these same definitions, the
variance estimator for Ẑa is similar to Equation 3a.



   Estimation of Za(y) by the Horvitz-Thompson formulae

      For  each indicator,  the following algorithm derives the distribution function and  the

confidence  bound  for Za(y).  This algorithm is similar to the algorithm defined  for  the

National Surface  Water  Surveys  (Overton,  1987a,b).   The  Horvitz-Thompson  variance

estimator, discussed in Section 2.1, is used to compute the variance in this algorithm.  The



-------
                                                                                (Case 10)



confidence bounds  are  computed based on  a normal approximation.   This algorithm is

appropriate for a sample subset for any subpopulation  a that is of interest.
        1.  Data set
             a. Unit identification code
             b. Tier 1 weighting factor, w](
             c. Tier 2 conditional weighting factor, w2 ,(
             d.  Size-weighted value (z)
             e. Indicator of interest (y)

        2.  Sorting of data

               The data file needs  to be sorted on the indicator, either in an ascending  or
        descending order.   When the data  file is sorted in ascending order on the indicator,
        the distribution function of size-weighted totals,  Za(y), denotes the size-weights  in
        the target population  that have a value less than  or  equal to  the  y  for a  specific
        indicator.  Conversely, if  it is of interest to estimate  the size-weight  in the target
        population with indicator  variables greater than  or equal to y,  the  data file would
        be sorted in descending order on this variable.  The distribution function generated
        by the analysis in descending order is [Za -  Za(y)].

        3.  Computation of additional weighting factors

               The Tier 1 and Tier 2 weights are included for each observation in the data
        set.  These weights are used to compute the total weight of selecting the ith unit in
        the Tier 2 sample.  First, compute this weight for each observation:

                                        w2i = w1i * w2.1i ,


        where w1i is the weighting factor for the ith unit in the Tier 1 sample and w2.1i is
        the inverse of the conditional Tier 2 inclusion probability.

        4.  Algorithm for Za(y)
             a. Define a matrix of q column vectors, as follows.
                There is one row for each data record and five statistics for each row.
                    q1 = value of y variable for the record
                    q2 = Za(y)
                    q3 = var[Za(y)]
                    q4 = upper confidence bound for Za(y)
                    q5 = lower confidence bound for Za(y)

             b. Index rows using i from 1 to n; the ith row will contain q-values
                corresponding to the ith record in the file, as analyzed.

             c. Read the first observation (first row of the data matrix), following with the
                successive observations, one at a time.  Accumulate the q-statistics as each
                observation is read into the file.  Continue this loop until the end of file is
                reached.  At that time, store these vectors and go to d.  It is necessary, as
                shown below for q3, to identify the records for which w2.1i = 1.

-------
                                                                    (Case 10)
       i.   q1[i] = y[i]

       ii.  q2[i] = q2[i-1] + w2i*zi

       iii. q3[i] = q3[i-1] + zi^2*w2i*(w2i - 1) + 2 * sum over j < i of zi*zj*(w2i*w2j - w2ij)

              where, if neither w2.1i nor w2.1j = 1:

                     w2ij = w1ij * (2*n2*w2.1i*w2.1j - w2.1i - w2.1j) / (2*(n2 - 1)) ;

              where, if either w2.1i or w2.1j = 1:

                     w2ij = w1ij * w2.1i * w2.1j ;

                         and where:

                             w1ij = (2*n1*w1i*w1j - w1i - w1j) / (2*(n1 - 1))

       iv.  q4[i] = q2[i] + 1.645*sqrt(q3[i])

       v.   q5[i] = q2[i] - 1.645*sqrt(q3[i])

   Multiple observations with one y value create multiple records in the
   preceding analysis for one distinct value of y.  The last record for that y
   contains all the information needed for Za(y).  Therefore, at this stage of
   the analysis, eliminate all but the last record for those y values that have
   multiple records.

d. Output of interest
   From the last entry of the row of q-vectors just computed:
       i.   q1 = largest value of y (or smallest if the analysis is descending).
       ii.  q2 = Ẑa
       iii. q3 = var(Ẑa)
       iv.  Standard error of Ẑa = sqrt(q3)
   From the q column vectors:
       i.   q1 represents the ordered vector of distinct values of y
       ii.  q2 represents the estimated distribution function, Za(y),
            corresponding to the values of y.
       iii. q4 represents the 95% one-sided upper confidence
            bound of the distribution function, Za(y).
       iv.  q5 represents the 95% one-sided lower confidence
            bound of the distribution function, Za(y).
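
   A Python sketch of the Case 10 accumulation, parallel to the one shown for Case 4,
follows.  It is illustrative only; the names and the user-supplied pairwise_weight routine
are hypothetical.

   import numpy as np

   def ht_size_weighted_totals(y, z_size, w2, pairwise_weight, zval=1.645):
       # y, z_size, w2: indicator values, size-weights (e.g. lake area), and total
       # weights, in analysis order; pairwise_weight(wi, wj) returns w2ij.
       q2 = q3 = 0.0
       rows = []
       for i, (yi, zi, wi) in enumerate(zip(y, z_size, w2)):
           q2 += wi * zi                              # q2[i] = q2[i-1] + w2i*zi
           q3 += zi * zi * wi * (wi - 1.0)            # own-unit variance term
           for zj, wj in zip(z_size[:i], w2[:i]):     # pairwise terms, j < i
               q3 += 2.0 * zi * zj * (wi * wj - pairwise_weight(wi, wj))
           half = zval * np.sqrt(q3)
           rows.append((yi, q2, q3, q2 + half, q2 - half))
       last = {}
       for r in rows:                                 # keep last record per y value
           last[r[0]] = r
       return list(last.values())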

-------
                                                                              (Case 11)
 Case 11 - Estimation of Za(y): Discrete Resource, Size-Weighted Estimate,
            Equal or Variable Probabilities.  Za known and not
            equal to Ẑa.
            Confidence Bounds by Horvitz-Thompson Ratio Standard Error
            and Normal Approximation.
    Conditions for approach

       1. The frame population size-weighted total, Z, can be known or unknown.
       2. The subpopulation size-weighted total, Za, is known and not equal
          to Ẑa.
       3. There can be an equal or variable probability selection of units from the
          subpopulation.
   Outline for Algorithm


       The algorithm supplied  in this section is based on the Horvitz-Thompson formulae,

which were discussed in Section 2.  This algorithm  is appropriate for a sample subset for any

subpopulation a that is of interest.  The algorithm for the distribution function for the
proportion of size-weighted totals, Ga(y), given exactly the same conditions listed here, is
presented in Case 12.  To compute the distribution function of size-weighted totals, Za(y),
first use the algorithm in Case 12 to compute the distribution function with the
corresponding confidence bounds for the proportion of size-weighted totals.  Then, compute
the following:

                                  Z_a(y) = G_a(y) * Z_a ,                              (54)

where Za is the known size-weighted total.  To compute the confidence bounds for Za(y),
simply multiply the upper and lower confidence limits of Ga(y) by Za.

-------
                                                                             (Case 11)










Estimation of Proportion of Size-Weighted Totals


       In this subsection, two examples are provided based on varying conditions.  For the
first algorithm, the size-weighted total, Za, is unknown, or known and not equal to Ẑa; this
algorithm produces confidence bounds based on the Horvitz-Thompson ratio standard error
and the normal approximation.  For the second algorithm, Za is known and equal to Ẑa;
this algorithm produces confidence bounds based on the Horvitz-Thompson standard error
and the normal approximation.

-------
                                                                              (Case 12)
 Case 12 - Estimation of Ga(y): Discrete Resource, Size-Weighted Estimate,
            Equal or Variable Probabilities.  Za unknown, or known and not
            equal to Ẑa.
            Confidence Bounds by Horvitz-Thompson Ratio Standard Error
            and Normal Approximation.
   Conditions for approach

       1.  The frame population size-weighted total, Z, can be known or unknown.
       2.  The subpopulation size-weighted total, Za, is unknown, or known and not
           equal to Ẑa.
       3.  There can be an equal or variable probability selection of units from the
           subpopulation.
   Outline for Algorithm


       The algorithm  supplied in this section is based on the Horvitz-Thompson  formulae,

which were discussed in Section 2.  This algorithm is appropriate for a sample subset for any

subpopulation a that is of interest.  Another discussion of the formulae is presented in the
previous section, Estimation of Size-Weighted Totals.



   Calculation of confidence bounds on Ga(y) by the Horvitz-Thompson formulae


       For each indicator, the following algorithm derives the distribution function and the
confidence bound for Za(y) similar to that given in Case 10.  Because Za is unknown, or
known and not equal to Ẑa, in this example, however, the variance of a ratio estimator is
used in this algorithm.  The confidence bounds are based on a normal approximation.
        1.  Data set
             a. Unit identification code
             b. Tier 1 weighting factor, w,(
             c. Tier 2 conditional weighting factor, w2 1(
             d. Size-weighted value (z)
             e. Indicator of interest (y)
             f. The subset  of data corresponding to the subpopulation of interest, indexed
               by a.

-------
                                                                        (Case 12)
2. Computation of additional weighting factors

       This step does not have to be made with each use of the datum, as the
weights are permanent attributes of a sampling unit.  The following details are
given for completeness.

       The Tier 1 and Tier 2 weights are included for each observation in the data
set.  These weights are used to compute the total weight of selecting the ith unit in
the Tier 2 sample.  First, compute this weight for each observation:

                                  w2i = w1i * w2.1i ,

where w1i is the weighting factor for the ith unit in the Tier 1 sample and w2.1i is
the inverse of the conditional Tier 2 inclusion probability.  The pairwise inclusion
weight is defined below.  The sample size at Tier 2, n2, is not subpopulation
specific.

3.  Algorithm for Ga(y) and Confidence Intervals

     a. Sorting  of data. The data file needs to be sorted on the indicator, either in
        an ascending or descending order. When  the data file is sorted in ascending
        order on the indicator, the distribution function of size-weighted
        proportions, Ga(y), denotes  the  proportion of size-weights in the target
        population, such as stream miles, that have a value  less than or equal to the
        y for a specific indicator.  Conversely,  if it is of interest to estimate the
        proportion of size-weights in  the target population with indicator variables
        greater than or equal to y, the data file would be sorted in descending order
        on this variable.  The distribution function generated by the analysis in
        descending order is [1 - Ga(y)].

     b.  Compute Ẑa = sum over Sa of w2i * zi

     c.  Define a matrix of q column vectors, as follows.
         There is one row for each data record and five statistics for each row.
             q1 = value of y variable for the record
             q2 = Ga(y)
             q3 = var[Ga(y)]
             q4 = upper confidence bound for Ga(y)
             q5 = lower confidence bound for Ga(y)

     d.  Index rows using i from 1 to n; the ith row will contain q-values
         corresponding to the ith record in the file, as analyzed.

     e.  Read the first observation (first row of the data matrix), following with the
         successive observations, one at a time.  Accumulate and store the
         q-statistics as each observation is read into the file.  Continue this loop until
         the end of file is reached.

             i.   q1[i] = y[i]

             ii.  q2[i] = q2[i-1] + w2i*zi/Ẑa

-------
                                                                     (Case 12)
   Multiple observations with one y-value create multiple records in the
   preceding analysis for one distinct value of y.  The last record for that y
   contains all the information needed for Ga(y).  Therefore, at this stage of
   the analysis, eliminate from the q-file all but the last record for those y
   values that have multiple records.

f.  Entries in the first column (q1) of the q-matrix identify the vector of
    y-values for the remainder of the calculations.  For each such y-value, yi,
    make the following calculations.  Note that this part of the algorithm is not
    recursive; each calculation is made over the entire sample.

        iii. q3[i] = var[Ga(yi)], computed with the Horvitz-Thompson ratio variance
             estimator of Section 2, using the pairwise inclusion weights

                    w2jk = (2*n2*w2j*w2k - w2j - w2k) / (2*(n2 - 1))

             and the deviations dj (and similarly dk).

        iv. q4[i] = q2[i] + 1.645*sqrt(q3[i])

        v.  q5[i] = q2[i] - 1.645*sqrt(q3[i])

    Output of interest
    From the q column vectors:
        i.   q1 represents the ordered vector of distinct values of y.
        ii.  q2 represents the estimated distribution function, Ga(y),
             corresponding to the values of y
        iii. q4 represents the 95% one-sided upper confidence
             bound of the distribution function, Ga(y)
        iv.  q5 represents the 95% one-sided lower confidence
             bound of the distribution function, Ga(y)

-------
                                                                              (Case 13)
Case 13 - Estimation of Ga(y): Discrete Resource, Size-Weighted Estimate,
            Equal or Variable Probabilities.  Za known and equal to Ẑa.
            Confidence Bounds by Horvitz-Thompson Standard Error
            and Normal Approximation.
   Conditions for approach

       1. The frame population size-weighted total, Z, can be known or unknown.
       2. The subpopulation size-weighted total, Za, is known and equal to Ẑa.
       3. There can be an equal or variable probability selection of units from the
          subpopulation.
   Outline for Algorithm


       The algorithm supplied in  this section is based on  the Horvitz-Thompson formulae,

which  were discussed in Section 2. This algorithm is appropriate for a sample subset for any

subpopulation a that is of interest.



   Calculation of confidence bounds on Ga(y) by the Horvitz-Thompson formulae


       For each indicator, the following algorithm derives the distribution function and the
confidence bound for Za(y) exactly as given in Case 10.  Because Za is known and equal to
Ẑa, it is not necessary to use the ratio estimator.  The distribution function of Ga(y) is
obtained by dividing the distribution function, Za(y), and the associated confidence bounds
by Za.  (These additional steps are included in this algorithm.)  The Horvitz-Thompson

variance estimator,  discussed  in  Section  2.1,  is used  to compute the  variance  in this

algorithm. The  confidence bounds are computed based on a normal approximation.
        1.  Data set
             a. Unit identification code
             b. Tier 1 weighting factor, w](
             c. Tier 2 conditional weighting factor, w2 ](
             d.  Size-weighted value (z)
             e. Indicator of interest  (y)


-------
                                                                       (Case 13)
2. Sorting of data

       The data file needs to be sorted on the indicator, either in an ascending or
descending order.  When the data  file  is sorted in ascending order on the indicator,
the distribution function of size-weighted proportions, Ga(y), denotes the proportion
of size-weights in  the target population, such as lake area, that have  a value less
than  or equal  to the  y  for a specific indicator.  Conversely,  if it  is of interest to
estimate  the  proportion of  size-weights in  the  target  population with  indicator
variables  greater than or equal to y,  the data file would be sorted  in descending
order on this variable.  The distribution function generated by the analysis in
descending order is [1 - Ga(y)].

3. Computation of weighting factors

       For this step,  refer to  the  program steps given in Case  10  to derive the
distribution function and the confidence bound for Za(y).  Follow the steps labeled 3
and  4.   Additional  steps,  shown  here,  are needed  to  obtain  Ga(y)  and  its
corresponding confidence bounds.  Proceed with the following steps  after conducting
steps  3 and 4 from Case  10:
     e. The operations that follow generate the q vectors to compute the estimated
        distribution function and appropriate confidence bounds for Ga(y).  These
        are denoted by q6 through q8.  Each element of q6-q8 is computed by
        performing the following operations on the corresponding elements of q2, q4,
        and q5.

            i.   q6 = Divide each element of q2 by the known size-weighted total, Za
            ii.  q7 = Divide each element of q4 by the known size-weighted total, Za
            iii. q8 = Divide each element of q5 by the known size-weighted total, Za
        From the q vectors:

             i.   q6 represents the estimated distribution function, Ga(y)
             ii.  q7 represents the 95% one-sided upper confidence
                  bound of the distribution function, Ga(y)
             iii. q8 represents the 95% one-sided lower confidence
                  bound of the distribution function, Ga(y)
                                  "85,

-------
3.2  Extensive Resources


       A detailed discussion of the formulae for obtaining area and proportion of areal
extent for continuous and extensive resources was presented in Section 2.1.2.  Formulae were
presented for both areal and point samples.


3.2.1 Estimation of Proportion of Areal Extent


       As discussed in Section 2.1.2, the confidence bounds for the proportion of areal extent
in continuous and extensive resources can be based on the binomial distribution.  This
algorithm was presented in Section 3.1.2, Case 6, for discrete resources.  No changes in this
algorithm are needed.










3.2.2 Estimation of Area


       Formulae for the estimation of total areal extent of the surveyed resources were
proposed in Section 2.1.2.  Proposed methods to compute areal extent for point and areal
samples are discussed in the following subsections.


Point Samples


       Formulae for the estimation of areal extent based on point samples were presented in
Section 2.1.2.2.  To obtain confidence bounds for Aa(y) based on the binomial distribution,
refer to the algorithm provided for the confidence bound calculation for Fa(y) in Section
3.1.2, Case 6.  Simply multiply the lower and upper confidence bounds, and Fa(y), by the
known area or estimated area of the resource.  No further changes are necessary to this
algorithm to provide confidence bounds for Aa(y) based on the binomial distribution.

-------
Areal Samples


       Formulae for the estimation of areal extent based on areal samples are still under
development.  However, some preliminary formulae are proposed in Section 2.1.2.1.  Work
in this area is continuing and will be included in the next version of this report.


3.3  Estimation of Quantiles






       Overton (1987a) defines the calculations for both the ascending and descending sorted
indicators.  For the algorithm used in this report, it is not necessary to employ a different
definition of percentiles for an ascending or descending analysis; distributions are identical
as generated either way.  The general algorithm computes the linear interpolation of the
distribution function for both types of analyses.  In the following equation, let r represent
the proportion of the desired percentile.  The fraction in this equation can be interpreted as
the slope of the line.  The coefficient of this fraction interpolates to the value [Q(r) - a].
The lower bound, a, is added to this piece, [Q(r) - a], to obtain the quantile of interest.

       Assuming an ascending analysis and that the generated distribution function is F(y):

              Q(r) = a + [r - F(a)] * \frac{b - a}{F(b) - F(a)} ,                    (55)

where F(a) is the greatest value of F(y) < r and F(b) is the least value of F(y) > r.





       For a descending analysis, the distribution function generated was F*(y) = [1 - F(y)].
To obtain the percentiles, calculate F(y) = 1 - F*(y); the foregoing formula is then appropriate
for the analysis.
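
       Equation 55 can be applied directly to the (y, F(y)) pairs produced by any of the
preceding algorithms.  The Python fragment below is an illustrative sketch only; the
function name and argument layout are not from the manual.

   def quantile_from_cdf(r, y_values, F_values):
       # y_values, F_values: distinct indicator values and the corresponding
       # estimated F(y), sorted ascending; r is the desired proportion.
       # a: greatest y with F(y) <= r;  b: least y with F(y) >= r.
       a, Fa = max(((y, F) for y, F in zip(y_values, F_values) if F <= r),
                   default=(y_values[0], F_values[0]))
       b, Fb = min(((y, F) for y, F in zip(y_values, F_values) if F >= r),
                   default=(y_values[-1], F_values[-1]))
       if Fb == Fa:                     # r falls exactly on a step of F
           return a
       return a + (r - Fa) * (b - a) / (Fb - Fa)

   # Example: F steps through 0.2, 0.5, 0.9 at y = 10, 20, 40; the median is 20.
   print(quantile_from_cdf(0.5, [10, 20, 40], [0.2, 0.5, 0.9]))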

-------
                                      SECTION 4

                                       TABLES
Table 1.  Reference to Distribution Function Algorithms

A.  Distribution Functions for Numbers - Estimation of Na(y)

 Equal Probability Selection:

   Population Size     Subpopulation Size       Algorithm            Case
   Known/unknown       Known                    Hypergeometric1      1
                                                HT-NA2               3
   Known               Unknown                  Hypergeometric       2
                                                HT-NA                3

 Variable Probability Selection:

   Population Size     Subpopulation Size       Algorithm            Case
   Known/unknown       Unknown, or known        HT-NA                4
                         and = N̂a
   Known/unknown       Known and ≠ N̂a           HTR-NA3              5

1  Hypergeometric refers to the exact hypergeometric distribution algorithm
   used to obtain confidence bounds.
2  HT-NA refers to Horvitz-Thompson variance with normal approximation
   to obtain confidence bounds.
3  HTR-NA refers to Horvitz-Thompson ratio estimator of variance with
   normal approximation to obtain confidence bounds.
4  Binomial refers to the exact binomial distribution algorithm
   used to obtain confidence bounds.

-------
Table 1  Continued.

B.  Distribution Functions for Proportions of Numbers - Estimation of Fa(y)

 Equal Probability Selection:

   Population Size     Subpopulation Size        Algorithm          Case
   Known/unknown       Known/unknown             Binomial4          6
   Known/unknown       Known                     Hypergeometric     7

 Variable Probability Selection:

   Population Size     Subpopulation Size        Algorithm          Case
   Known/unknown       Unknown, or known         HTR-NA             8
                         and ≠ N̂a
   Known/unknown       Known and = N̂a            HT-NA              9

1  Hypergeometric refers to the exact hypergeometric distribution algorithm
   used to obtain confidence bounds.
2  HT-NA refers to Horvitz-Thompson variance with normal approximation
   to obtain confidence bounds.
3  HTR-NA refers to Horvitz-Thompson ratio estimator of variance with
   normal approximation to obtain confidence bounds.
4  Binomial refers to the exact binomial distribution algorithm
   used to obtain confidence bounds.

-------
Table 1  Continued.

C.  Distribution Functions for Size-Weighted Statistics for Both Equal and
    Variable Probability Selection

 Estimation of Za(y):

   Population Size     Subpopulation Size        Algorithm          Case
   Known/unknown       Unknown, or known         HT-NA              10
                         and = Ẑa
   Known/unknown       Known and ≠ Ẑa            HTR-NA             11

 Estimation of Ga(y):

   Population Size     Subpopulation Size        Algorithm          Case
   Known/unknown       Unknown, or known         HTR-NA             12
                         and ≠ Ẑa
   Known/unknown       Known and = Ẑa            HT-NA              13

1  Hypergeometric refers to the exact hypergeometric distribution algorithm
   used to obtain confidence bounds.
2  HT-NA refers to Horvitz-Thompson variance with normal approximation
   to obtain confidence bounds.
3  HTR-NA refers to Horvitz-Thompson ratio estimator of variance with
   normal approximation to obtain confidence bounds.
4  Binomial refers to the exact binomial distribution algorithm
   used to obtain confidence bounds.
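
As a reading aid for part B of the table, the sketch below encodes the algorithm choice for
estimating Fa(y) as a simple lookup.  The pairings follow the table as reconstructed above,
and the function and argument names are illustrative assumptions, not part of the manual.

    # Illustrative lookup for Table 1, part B: which algorithm the table names
    # for estimating Fa(y), given the selection type and what is known about
    # the subpopulation size.
    def algorithm_for_Fa(selection, subpop_size_known, subpop_equals_ht_estimate=False):
        if selection == "equal":
            # Part B, equal probability: binomial in general; hypergeometric
            # when the subpopulation size is known.
            return "Hypergeometric" if subpop_size_known else "Binomial"
        if selection == "variable":
            # Part B, variable probability: HT-NA only when the subpopulation
            # size is known and equals the Horvitz-Thompson estimate;
            # otherwise the ratio form, HTR-NA.
            if subpop_size_known and subpop_equals_ht_estimate:
                return "HT-NA"
            return "HTR-NA"
        raise ValueError("selection must be 'equal' or 'variable'")

    print(algorithm_for_Fa("equal", subpop_size_known=False))      # Binomial
    print(algorithm_for_Fa("variable", subpop_size_known=True))    # HTR-NA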

-------
Table 2.  Summary of Notation Used in Formulae and Algorithms

Symbol                    Definition

Populations:
 N                        Population size
 Na                       Subpopulation size

Distribution Functions:
 Discrete Resources:
  N(y)                    Estimated distribution function for total number
  F(y)                    Estimated distribution function for proportion of numbers
  Z(y)                    Estimated distribution function of size-weighted totals
  G(y)                    Estimated distribution function for a size-weighted proportion

 Continuous and Extensive Resources:
  A(y)                    Estimated distribution function for areal extent
  F(y)                    Estimated distribution function for proportion of areal extent

Inclusion Probabilities:
 πi                       Probability of inclusion of unit i
 πij                      Probability that unit i and j are simultaneously included
 π1i                      Probability of inclusion of unit i at Tier 1
 π2i                      Probability of inclusion of unit i at Tier 2
 π2i|1                    Conditional Tier 2 inclusion probability

Weights:
 w                        Inverse of the above inclusion probabilities
                          (Same definitions apply with corresponding subscripts)

Sample Notation:
 n                        General notation for sample size
 n1                       Sample size at Tier 1
 n2                       Sample size at Tier 2
 S1                       Sample of units at Tier 1
 S2                       Sample of units at Tier 2

(These may be made specific for subpopulations or resources by appending
    an a or r.  For example:)
 na                       Sample size for subpopulation a
 nri                      Sample size for a resource r at grid point i
 S1r                      Sample of units at Tier 1 for resource r
 S2r                      Sample of units at Tier 2 for resource r

-------
                                      SECTION 5




                                    REFERENCES









Chambers, R.L., and R. Dunstan. 1986. Estimating distribution functions from survey




   data.  Biometrika, 73, 597-604.









Cochran, W.G.  1977.  Sampling Techniques, Third Edition.  Wiley, New York.









Cordy, C.B.  In press.  An extension of the Horvitz-Thompson theorem to point sampling




   from a continuous universe.  Accepted in Statistics and Probability Letters.









Cox, D.R., and D. Oakes.  1984.  Analysis of Survival Data.  Chapman and Hall, New




   York.









Hansen,  M.H., W.G. Madow, and B.J. Tepping.  1983.  An evaluation of model-dependent




   and probability-sampling inferences in sample surveys.  J. Amer. Stat. Assoc. 78:




   776-793.









Hartley,  H.O., and J.N.K. Rao. 1962. Sampling with unequal probability and without




   replacement.  The Annals of Mathematical Statistics, 33, 350-374.









Horvitz,  D.G., and D.J. Thompson.  1952. A generalization of sampling  without




   replacement from a finite universe. J. Amer. Stat. Assoc. 47: 663-685.









Hunsaker, C.T., and  D.E. Carpenter, eds. 1990.  Environmental Monitoring and




   Assessment Program:  Ecological Indicators. EPA/600/3-90/060. U.S.EPA, Office of




   Research and Development, Washington, DC.

-------
Kaufmann, P.R., A.T. Herlihy, J.W. Elwood, M.E. Mitch, W.S. Overton, M.J. Sale, K.A.




   Cougan, D.V. Peck, K.H. Reckhow, A.J. Kinney, S.J. Christie, D.D. Brown, C.A. Hagley




   and H.I. Jager.  1988.  Chemical Characteristics of Streams in the Mid-Atlantic and




   Southeastern United States. Volume I: Population Descriptions and Physico-Chemical




   Relationships.  EPA/600/3-88/021a. U.S. EPA, Washington, DC.









Landers, D.H.,  J.M.  Eilers,  D.F. Brakke, W.S. Overton,  P.E. Kellar, M.E. Silverstein, R.D.




   Schonbrod, R.E. Crowe,  R.A. Linthurst, J.M. Omernik, S.A. Teague, and E.P. Meier.




   1987.  Characteristics of Lakes in the Western United States.  Volume I: Population




   Descriptions  and Physico-Chemical Relationships.  EPA-600/3-86/054a. U.S. EPA,




   Washington, DC.









Linthurst, R.A., D.H. Landers, J.M. Eilers, D.F. Brakke, W.S. Overton, E.P. Meier, and




   R.E. Crowe.   1986.  Characteristics of Lakes in the Eastern United States, Volume I:




   Population Descriptions and Physico-Chemical Relationships. EPA-600/4-86/007a.




   U.S. EPA, Washington, DC.









Little, R.J.A., and D.B. Rubin.  1987.  Statistical Analysis with  Missing Data.  Wiley,




   New York.









Messer, J.J., R.A. Linthurst, and W.S. Overton.  1991.  An EPA Program for Monitoring




   Ecological  Status and Trends.  Environ. Monit. and Assess. 17, 67-78.









Messer, J.J., C.W. Ariss, R. Baker, S.K. Drouse, K.N. Eshelman, P.R. Kaufmann, R.A.




   Linthurst,  J.M. Omernik, W.S. Overton, M.J. Sale, R.D. Schonbrod,  S.M. Stambaugh,




   and J.R. Tutshall, Jr.  1986.  National Surface Water Survey: National Stream Survey,




   Phase 1  Pilot Survey. EPA/600/4-86/026.  U.S.  EPA, Washington, D.C.

-------
 Miller, R.G.  1981.  Survival Analysis.  Wiley, New York.


















 Overton, W.S.  1987a.  Phase II Analysis Plan, National Lake Survey, Working




   Draft. Technical Report 115, Department of Statistics, Oregon State University.









 Overton, W.S.  1987b.  A Sampling and Analysis Plan for Streams in the National




   Surface Water Survey. Technical Report 117, Department of Statistics, Oregon State




   University.









 Overton, W.S.  1989a.  Calibration Methodology for the Double Sample Structure of the




   National  Lake Survey Phase II Sample.  Technical Report  130, Department of Statistics,




   Oregon State University.









 Overton, W.S.  1989b.  Effects of Measurements and Other Extraneous Errors on Estimated




   Distribution Functions in the National Surface Water Surveys. Technical Report 129,




   Department of Statistics, Oregon State University.









Overton, W.S., and S.V. Stehman.  1987.  An Empirical Investigation of Sampling and




   Other Errors in National Stream Survey: Analysis of a Replicated Sample of Streams.




   Technical Report  119, Department of Statistics, Oregon State University.









Overton, W.S., and S.V. Stehman.  1992.  The Horvitz-Thompson theorem as a unifying




   perspective  for sampling.  Proceedings of the Section on Statistical Education of the




   American Statistical Association, pp. 182-187.

-------
Overton, W.S., and S.V. Stehman.  1993a. Properties of designs for sampling




   continuous spatial resources from a triangular grid. Communications in Statistics -




   Theory and Methods, 22, 2641-2660.










Overton, W.S., and S.V. Stehman.  1993b. Improvement of Performance  of Variable




   Probability Sampling Strategies Through Application of the Population Space and the




   Facsimile Population Bootstrap.  Technical Report 148, Department of Statistics, Oregon




   State University.










Overton, W.S., D. White, and D.L. Stevens.  1990.  Design  Report for EMAP,




   Environmental Monitoring  and Assessment Program.  EPA/600/3-91/053.




   U.S. EPA, Washington, DC.










Overton, J.M., T.C. Young, and W.S. Overton.  1993.  Using found data to augment a




   probability sample: procedure and case study. Environ. Monitoring and




   Assessment, 26, 65-83.










Porter, P.S.,  R.C. Ward, and H.F. Bell.  1988.  The detection limit.  Environ. Sci.




   Technol., 22, 856-861.










Rao, J.N.K.,  J.G. Kovar, and  H.J. Mantel. 1990. On estimating distribution functions and




   quantiles from survey data using auxiliary  information.  Biometrika, 77, 365-375.










Sarndal, C.E., B. Swensson, and J. Wretman.  1992.  Model Assisted Survey Sampling.




   Springer-Verlag, New York.

-------
Smith, T.M.F.  1976.  The foundations of survey sampling: a review.  J. of Roy. Stat.




   Soc., A, 139, Part 2, 183-195.









Stehman, S.V., and W.S. Overton.  1989. Pairwise Inclusion Probability Formulas in




   Random-order, Variable Probability, Systematic Sampling. Technical Report  131,




   Department of Statistics, Oregon State University.









Stehman, S.V., and W.S. Overton.  In press.  Comparison of Variance Estimators of




   the Horvitz-Thompson Estimator for Randomized Variable Probability Systematic




   Sampling.  Jour. of Amer. Stat. Assoc.









Stevens, D.L.  In press.  Implementation of a National Monitoring Program. Jour. of Envir.




   Management.









Thomas, Dave.  Oregon  State University, Statistics Department, Corvallis, OR.









Wolter, K.M.  1985.  Introduction to Variance Estimation.  Springer-Verlag, New York.

-------
                                       SECTION 6
                      GLOSSARY OF COMMONLY USED TERMS
 Continuous attribute:  an attribute that is represented as  a  continuous  surface over some
 region.  Examples are  certain  attributes of large bodies of water, such as chemical variables
 of estuaries or lakes.

 Discrete resource:  resources consisting of discrete resource units, such as lakes  or  stream
 reaches.  Such a resource will be described as a finite population of such units.

 Distribution  function:    a  mathematical  expression  describing a  random variable or  a
 population.   For  real-world finite populations,  these  distributions are knowable  attributes
 (parameters)  of the population, and may be determined  exactly by  a census, or  estimated
 from a sample.  The general form  will be the  proportion (or other measure, like numbers,
 length, or area) of the  resource having a  value  of an attribute  equal  to or less than  a
 particular  value.  Proportions may also be of the different possible  measures, like number
 (frequency distributions), area (areal distributions), length, or volume.

 Domain: a frame feature that includes the entire area within which a potential sample might
 encounter the resource.   The domain of any one resource can include other resources.

 Extensive resource:  resources without natural  units.  Examples of extensive  resources are
 grasslands  or  marshes.

 40-hex: a term for the landscape description hexagon or areal sampling unit centered on
 each of the grid points in the EMAP sampling grid.  The area of each hexagon is
 approximately 40 km².

 Inclusion probability (πi): the probability of including the ith sampling unit within a
sample.

 Pairwise inclusion probability (πij): the probability that both element i and element j are
 included in the sample.

-------
Population:  often  used  interchangeably with  the term  universe to designate the total  set of
entities addressed  in  a sampling effort.  The term  population  is defined  in  this report to
designate any collection of units of a specific discrete resource, or any subset of a specific
extensive resource, about which inferences are desired or made.

Randomized  model: a model  invoked in analysis, assuming the population units have been
randomly arranged prior to sample selection.  In many cases, this is equivalent to assuming
simple random sampling.

Resource: an ecological entity that is identified as  a  target of sampling, description, and
analysis by EMAP.  Such an entity will ordinarily be thought of and described as a
population.  Two  resource types,  discrete and extensive,  recognized in EMAP pose different
problems of sampling and representation.   EMAP resources are ordinarily  treated as strata
at Tier 2.

Resource class: a subset of a resource, represented  as  a subpopulation.  For  example, two
classes of substrate, sand  and mud, can be defined  in  the Chesapeake Bay.  Subpopulation
estimates require only that the classification be known on the sample.

Stratum: a stratum is a sampling structure that restricts sample-randomization/selection to
a  subset of the  frame.    Samples from different strata are  independent.    Inclusion
probabilities may or may not differ among strata.

Tier 1/Tier 2: these terms represent different phases of the EMAP program. Relative to the
EMAP sample, they refer  to the two phases (stages) of the EMAP double sample.  The Tier
1 sample is common to all resources and  provides for each a sample from  which the Tier  2
sample  is selected.  The Tier  2 sample for any resource is a set  of resource units or sites at
which field data will be obtained.

Weights:    in a  probability  sample,  the  sample  weights are  inverses   of  the  inclusion
probabilities;  these are always known for a probability sample.

-------