EMAP Status Estimation:
Statistical Procedures and Algorithms
V.M. LESSER AND W.S. OVERTON
Department of Statistics, Oregon State University,
Corvallis, Oregon
Project Officer
Anthony R. Olsen
U.S. Environmental Protection Agency
Environmental Research Laboratory
200 SW 35th Street, Corvallis, Oregon
The information in this document has been funded wholly or in part by the U.S.
Environmental Protection Agency under cooperative agreement CR-816721 with Oregon
State University at Corvallis. It has been subject to the agency's peer and administrative
review. It has been approved for publication as an EPA document.
-------
EMAP Status Estimation:
Statistical Procedures and Algorithms
V.M. LESSER AND W.S. OVERTON
Department of Statistics, Oregon State University,
Corvallis, Oregon
Project Officer
Anthony R. Olsen
U.S. Environmental Protection Agency
Environmental Research Laboratory
200 SW 35th Street, Corvallis, Oregon
The information in this document has been funded wholly or in part by the U.S.
Environmental Protection Agency under cooperative agreement CR-816721 with Oregon
State University at Corvallis. It has been subject to the agency's peer and administrative
review. It has been approved for publication as an EPA document
-------
CONTENTS
1 INTRODUCTION 1
1.1 Overall Design 2
1.2 Resources 2
1.3 Response Variables of Interest 3
1.4 Statistical Methods 4
1.5 Use of this Manual 6
2 GENERAL THEORETICAL DEVELOPMENT 9
2.1 Design-Based Estimation Methods 9
2.1.1 Discrete Resources 9
2.1.1.1 General Estimator and Assumptions 10
2.1.1.2 Tiered Structure 14
2 1.2 Extensive Resources 22
2.1.2.1 Areal Samples 23
2.1.2.2 Point Samples 25
2.1.2.3 Alternative Variance Estimators 29
2.2 Model-Based Estimation Methods 33
2.2.1 Prediction Estimator 34
2.2.2 Double Samples 36
2.2.3 Calibration 37
2.3 Other Issues 38
2.3.1 Missing Data 38
2 3 1.1 Missing Sampling Units 36
2.3.1.2 Missing Values within Sampling Units 39
2.3 2 Censored Data 39
2.3.3 Combining Strata 41
2.3.3.1 Discrete Resources 42
2.3.3 2 Extensive Resources 43
2.3.4 Additional Sources of Error 43
2.3.5 Supplementary Units or Sites 44
3 DISTRIBUTION FUNCTION ALGORITHMS 45
3.1 Discrete Resources 46
3.1.1 Estimation of Numbers 47
Equal probability sampling 47
Case 1 - N known/unknown, Na known 48
Case 2 - N known, Na unknown 53
Case 3 - N known/unknown, Nfl known/unknown 55
Variable probability sampling ^ 56
Case 4 - Na unknown or Na knownand equal Na 57
Case 5 - Na known and not equal N0 60
3.1.2 Proportions of Numbers 61
Equal probability sampling 61
Case 6 - N0 known/unknown 62
Case 7 - N0 known 67
Variable probability sampling ^ 68
Case 8 - N0 unknown or Na known and not equal N0 69
Case 9 - Na known and equal Na 72
3.1.3 Rationales for the Algorithms in Section 1.1 and 1.2 74
ii
-------
SECTION 1
INTRODUCTION
The Environmental Protection Agency (EPA) has initiated a program to monitor
ecological status and trends and to establish baseline environmental conditions against
which future changes can be monitored (Messer et al., 1991). The objective of this
environmental program, referred to as EMAP (Environmental Monitoring and Assessment
Program), is to assess the status of a number of different ecological resources, including
surface waters, wetlands, Great Lakes, near-coastal waters, forests, arid lands, and
agroecosystems.
A design plan and a number of support documents have been prepared to guide
design development for EMAP (Overton et al., 1990, Overton and Stehman, 1990; Stehman
and Overton, in press; Stevens, in press). The statistical methods outlined in earlier
documents, such as those analyzing the EPA National Surface Water Surveys, are also
relevant to EMAP (Linthurst et al., 1986, Landers et al., 1987; Messer et al., 1986).
This report presents statistical procedures collected from other EMAP documents, as
well as from Oregon State University technical reports describing data analyses for other
EPA designs. By integrating this information, this manual and the EMAP design report
will serve as reference sources for statisticians who implement an ecological monitoring
program based on the EMAP design framework. Spatial and temporal analyses of EMAP
data are not covered in this version of the report. A brief discussion of the four-point
moving average, which combines data over the interpenetrating sample, is presented in
Overton et al. (1990; Section 4.3.7). Algorithms listed in this report cover most design
options discussed in the EMAP design report. It is expected that any further realizations of
the EMAP design will also include documentation of corresponding variance estimators.
1
-------
1.1 Overall Design
The EMAP design is a probability sample of resource units or sites that is based on
two tiers of sampling. The first tier (Tier 1) primarily supports landuse characterization
and description of population structure, while the second tier (Tier 2) supports status
assessment by the indicators. The second tier sample is a probability subsample of the first
tier sample; such a sample is referred to as a double sample. Across the ecological resource
groups, it is expected that discrete, continuous, and extensive populations will be monitored.
The statistical methods outlined in this report address these different population types at
both sampling tiers. A description of the sampling design is presented in Overton et al.
(1990)
1.2 Resources
EMAP is designed to provide the capability of sampling any ecological resource. To
achieve this objective, explicit design plans must be specific for a particular resource and all
resources to be characterized must be identified Currently, the resources to be sampled
within EMAP include surface waters, wetlands, forests, agroecosystems, arid lands, Great
Lakes, and near-coastal wetlands. These resources are further divided by major classes to
represent the specific 'resource' that will be addressed by the sampling effort. For example,
surface waters can be partitioned into classes such as very small lakes, intermediate-sized
lakes, very large lakes, very small streams, intermediate-sized streams, rivers, and very large
rivers. Because each class potentially generates different sampling issues, each would be
considered a different entity. The design structure meets this condition by identifying each
such class as a resource, thereby resulting in 6 to 12 surface water resources. Each major
resource group may also have as many divisions.
Most resources will be sampled via the basic EMAP grid and associated structures.
However, other resources, such as very large lakes and very large rivers, represent unique
2
-------
ecological entities and cannot be treated as members of a population of entities to be
described via a sample of the set. For example, Lake Superior and the Mississippi River are
unique, although the tributaries of the Mississippi might be treated as members of a wider
class of tributaries.
Resources sampled by the EMAP grid will be associated with an explicit domain in
space, within which the resource is confined. This domain should be established early in the
design process. Within the defined domain, it is not expected that the resource will occupy
all space or that no other resource will occur. Domains of different resources will overlap,
but the domain of a particular resource is an important parameter of its design. For
purposes of nomenclature, the resource domain is a region containing the resources. The
resource universe is either a point set within one point for each resource unit (discrete
resource) or the continuous space actually occupied by the resource (for extensive resources)
A resource class will be a subset of its universe. Such a class may or may not be treated as
a sampling stratum and may or may not have an identified subdomain.
1.3 Response Variables of Interest
The term response variable is used generally for the measured characteristic of
interest in the sample survey. In EMAP, a special class of response variables is referred to
as indicators, such as indicators of ecological status (Hunsaker and Carpenter, 1990). These
indicators are the environmental and ecological variables measured in the field on resource
units or at resource sites; they may be measured directly or modified via formulae or
analytic protocols.
The term, indicators, should not be applied to the structural variables defined at Tier
1. The Tier 1 variables are used to estimate population structure and to partition the Tier
1 sample into the necessary structural parts for Tier 2. Then the indicator variables are
determined on the Tier 2 sample. When the Tier 2 sample includes the entire Tier 1
3
-------
sample, it is still appropriate to make the distinction between indicator and structural
variables, both of which are response variables. Because of this distinction, it is sometimes
appropriate to distinguish Tier 1 and Tier 2 in terms of the variables, rather than strictly in
terms of a subsample.
1.4 Statistical Methods
The primary statistic used to summarize population characteristics is the estimated
distribution function. This distribution function estimates the number or proportion of
units or sites for which the value of the indicator is equal to or less than y. For discrete
resources, the estimated distribution function for numbers is designated as N(y), while the
estimated distribution function for the proportion of numbers is designated as F(y). The
estimated distribution function of size-weighted totals in discrete resources is designated as
Z(y), while a size-weighted proportion is designated as G(y). Examples of size-weights are
lake area, stream miles, and stream discharge. There are no distribution functions
comparable to Z(y) and G(y) in the continuous and extensive populations because there are
no objects in these resources. Therefore, there are no object sizes to use as weights. In
extensive resources, the estimated distribution function representing actual areal extent for
which the value of the indicator is equal to or less than y is designated as A(y), while the
proportion of areal extent is designated as F(y). Thus A(y) is analogous to N(y); A is the
size of an extensive resource and N is the size of a finite resource.
A number of estimates of interest, which can be obtained from the distribution
function, have been used quite successfully in the National Surface Water Surveys (NSWS)
(Linthurst et al., 1986; Landers et al., 1987; Kaufmann et al., 1988). For example, any
quantile, including the median of the distribution, can be interpolated easily from the
distribution function. In addition, the distribution function can be supplemented with
tables of means, quantiles, or any other statistics of particular interest, providing greater
accuracy than can be obtained from the plotted distribution function.
4
-------
The basic formula for estimated distribution functions is
(I>.) -
F(y)-
°(} "TTwJ n
s„
(1)
where S is the sample of units representing the universe (11) and the variable y represents
any response variable. The subscript a denotes a subpopulation of interest; Sn is the portion
of the sample in subpopulation a, and Say is the portion of the sample in subpopulation a
having values < y. We associate the inclusion probability, 7T,, with each Ith sampling unit.
Each sampling unit is a representation of a subset of the universe, and the weight (wt = ^-)
accounts for the size of the subset. N0 denotes the estimated population size for the
subpopulation a. Fa(y) is calculated for each value of y appearing in the sample.
As given, Fa(y) is a step function not suitable to general EMAP needs; a smoothed
version is desirable. Thus, we propose the following method. In this method, Fa(y) is
F (y) + F (y')
replaced by 0 2—' where y is the next lesser value to y. For the minimum values
- F (y)
of y, Fa(y) is replaced by —^— . A linear interpolation is then made between these points
to generate the plot or to determine quantiles. For each of the distribution function
algorithms provided in this report, two successive values are averaged in this manner and
used to develop an interpolated distribution function. Confidence bounds are constructed on
F (y) and then averaged and interpolated in the same manner. We rest justification for this
procedure on the interpretation of the initial and final values of the resulting distribution
function. The initial value is our best estimate of the proportion of the population below
the minimum observed value, and similarly, one minus the last point is our best estimate of
the proportion of the population above the maximum observation.
Computation of the distribution function and the associated confidence bounds differs
slightly for specific resource groups, reflecting differences in detail of the sampling design.
5
-------
For example, in some cases simplifications of the computing algorithms result from equal
probability designs. Some algorithms are presented in this report to accommodate the range
of conditions and objectives anticipated across the resource groups. These algorithms have
been previously discussed in greater detail in other documents; references are given for those
requiring a more in-depth approach. Table 1 provides an outline of the distribution
functions, which are presented in greater detail throughout this report. Table 2 provides a
table of notation used throughout this document.
1.5 Use of this Manual
The body of this manual has been separated into two sections. Section 2 includes the
general theoretical development upon which the algorithms are based. Formulae for
discrete, continuous, and extensive resources are presented, the mathematical notation is
introduced and defined; and both design-based and model-based approaches for computing
the distribution functions are discussed. Other issues relevant to the analysis of EMAP
data, such as handling of missing data, are also discussed in this section.
Section 3 includes the algorithms used to produce the distribution functions, the
conditions that provide for the application of the algorithms, and the rationales that support
the choice and derivation of the algorithms presented. References will be made to the
general formulae (discussed in Section 2) used to develop these algorithms.
The following list outlines a step-by-step sequence for obtaining the distribution
functions:
1. Determine whether the data represent a discrete or extensive resource.
2. Determine the type of distribution function to compute. For example, for discrete
resources, the distribution of numbers and/or proportions of numbers will be
of interest.
3. Determine whether the sampling units were collected with equal or variable
probability of selection. The inclusion probabilities, 7rlt and 7T21,, discussed in
Section 2.1.1, are to be a permanent part of a datum record, as are the
identification code of the sampling unit and the variable of interest. In some
cases, it is also necessary to identify the grid point, which can be included as
part of the identification code.
4. Determine whether the size of the subpopulation of interest is known or unknown
6
-------
The subpopulation is the group of population units about which one wishes to
draw inference.
5. Using the conditions from steps 1-4, refer to the example of that specific algorithm
in Section 3.
6. Optional, but suggested: Refer to the formulae referred to in Section 2 for a
description of the formulae and for clarification of any notational problems.
7. Optional: An algorithm to obtain specific quantiles is presented in Section 3.
This manual is expected to be updated as research continues in the development of
statistical procedures for EMAP, as EMAP adapts to changing concerns and orientation,
and as EMAP makes and accumulates more in-depth frame materials. For example, efforts
to date have been focused on design-based approaches to confidence bound estimation,
therefore this version reflects a fairly in-depth approach to design-based estimation over all
resources Further discussion of model-based approaches currently under development are
expected in future versions of this manual.
7
-------
8
-------
SECTION 2
GENERAL THEORETICAL DEVELOPMENT
Two approaches are commonly used to draw inferences from a sample survey relative
to a real population. In the design-based approach, described in Section 2.1, the
probabilities of selection are accounted for by the estimators and the properties of inferences
are derived from the design and analytical protocol. In contrast, the model-based approach
(Section 2.2) assumes a model and requires knowledge of auxiliary variables for inference.
Properties of model-based inference are derived from the assumed model and analytical
protocol A model-based estimator takes into account only model features, while a model-
assisted estimator takes into account both model and design features. For a discussion of
the relationship between these two approaches and the way they are used together, refer to
Hansen et al. (1983). The paper by Smith (1976) also provides useful insight.
2.1 Design-Based Estimation Methods
2.1.1 Discrete Resources
A population of natural units readily identified as objects is defined as a discrete
resource. For example, lakes, stream segments, farms, and prairie potholes are all
considered discrete resources. Populations of a large number of discrete resource units that
can be described by a sample are considered for EMAP representation. It is suggested, for
example, that lakes less than 2,000 hectares be characterized as populations of discrete
resources. Distribution functions of the numbers of units or proportions of these numbers
may be of interest. On the other hand, very large lakes are unique, and less usefully
characterized as members of populations of lakes.
9
-------
2.1.1.1 General Estimator and Approximations of Design-Based Formulae
Because the EMAP design is based on a probability sample, design-based estimators,
which account for this structure, are applicable. The Horvitz-Thompson theorem (Horvitz
and Thompson, 1952) provides general estimators of the population attributes for general
probability samples and for estimators of variance of these estimators (Overton and
Stehman, 1993a; Overton et al., 1990).
In Horvitz-Thompson estimation, the probability of inclusion, 7T,, is associated with
the Ith sampling unit. Each sampling unit is a representation of a subset of the universe,
and the weight (w( = ^-) accounts for the size of the subset Therefore, estimates of
population parameters, such as totals or means, simply sum the variables collected over the
sampling units, expanding them by the sampling weights. The Horvitz-Thompson estimator
proposed for EMAP is unbiased for the population (and subpopulation) totals and means, if
7Tt>0 for all units in the population.
The general form of the Horvitz-Thompson estimator is
where S is the sample of units representing the universe (CU), w, is the weight, and the
variable y represents any response variable. The total of y on the universe is defined as Ty
= £y ai)d is generally referred to as the population total. This estimator (Equation 2)
yields estimates of many parameters simply by the definition of y. For example, if y,=l for
all units in the population, then Tv = N, the population size; it follows that N=2Jw(.
J S
Suppose further that we are interested in a subpopulation, a. The portion of the
sample, S0, that came from this subpopulation is also a probability sample from this
subpopulation. To obtain parameter estimates for a subpopulation, Equation 2 is simply
summed over the subpopulation sample,
(2)
S
10
-------
Tya= £y.w,
s„
a
(3)
The Horvitz-Thompson theorem also provides for an unbiased estimator of the
variance of these estimators under the condition that 7r,j>0 for all pairs of units in the
population. The quantity 7TtJ is the probability that unit i and unit j are simultaneously
included in the sample. The estimator of the variance is designated in lower case as var,
and w(j is the inverse of the pairwise inclusion probability. The variance of Ty0 or Na is
obtained by the choice of y,:
This presentation shows that the form of the variance estimator does not change when
estimating variance for the estimator based on a full sample or a subset of the sample. The
subsetting device, with summation over the appropriate subset of the sample, will always
represent the appropriate estimator. The principal reason for using the Horvitz-Thompson
form (Equation 4) is its subsetting capability; the commonly used form for the Yates-
Grundy variance estimator does not permit the convenience of subsetting.
The EMAP design is based on a systematic sampling scheme. The Horvitz-
Thompson theorem does not provide a design-unbiased estimator of variance based on this
design, because some pairwise inclusion probabilities are zero. The following sections include
a discussion of assumptions and approximations applied to Equation 4 in order to apply this
variance estimator in EMAP.
(4a)
J*'
(4b)
it'
11
-------
Systematic sampling
Because EMAP units are selected by a systematic sampling design, many of the
pairwise inclusion probabilities (7r(J) equal zero and an unbiased variance estimator is not
available. However, it has been established that in many cases the variance of a systematic
design can be satisfactorily approximated by the variance that applies to a sample taken on
a randomly ordered list (cf., Wolter, 1985) A common systematic sample selected on a
randomly ordered list is a simple random sample. Therefore, a simple random sample is an
approximate model for an equiprobable systematic sample. The randomized model
proposed here provides approximate variance estimation for a systematic variable
probability design.
A modification of the randomized sampling model provides only for 'local'
randomization of the position of the population units, rather than global randomization.
Good behavior of the variance estimator results from this assumption (Overton and
Stehman, 1993b). As a consequence, we can justify use of the suggested pairwise inclusion
probability with less restriction as compared with the global randomization assumption
We will refer to the local randomization model as the weak randomization assumption.
The Horvitz-Thompson estimator of variance, Equation 4, is thus proposed for
EMAP indicators under the weak randomization assumption. The simulation studies
conducted on the behavior of this estimator suggested that this assumption was adequate in
most situations expected for EMAP (Overton, 1987a; Overton and Stehman, 1987; Overton
and Stehman, 1992; Stehman and Overton, in press). In a few situations the estimator
overestimated the true variance, thus providing for a conservative estimate of variance. In
certain circumstances, as discussed in Section 2.1.2.3, it is appropriate to modify the
estimation methods to account for the spatial patterns.
12
-------
Pairwifle inclusion probability
Approximations for the pairwise inclusion probability under the randomized model
have been proposed in the literature (Hartley and Rao, 1962). A major disadvantage with
these approximations, as discussed by Stehman and Overton (1989), is the requirement that
all inclusion probabilities for the population must be known. For large populations such as
those studied in EMAP, it is practically impossible to obtain inclusion probabilities for all
units in the populations. Another approximation for this pairwise inclusion probability
requires that the inclusion probabilities be known only on the sample (Overton, 1987b).
The formula for the inverse of this pairwise inclusion probability is
2nw w - w - w,
w - !_J I (5)
2(n -1) W
where n is sample size.
Investigation of this approximation indicates that it performs at least as well as other
commonly recommended approximations (Overton and Stehman, 1992, Overton, 1987a).
Therefore, this pairwise inclusion probability will be used in the approximation of the
variance estimator for the population parameter estimates collected in EMAP, for those
circumstances in which the randomization assumption is justified.
This variance estimator (Equation 4) accommodates variable probability of selection,
but it is also appropriate for equal probability designs. The approximation for the pairwise
weight given in Equation 5 is also appropriate for randomized equal probability designs. As
a consequence, Equation 4 with 5 is valid for either equal or variable probability selection,
under the weak assumption of a randomized model, as discussed above under systematic
sampling.
When the randomization model is not acceptable, alternative variance estimators,
based on the mean square successive difference, have been developed for use with an equal
13
-------
probability systematic design and regular spacing (Overton and Stehman, 1993a). The
conditions and assessment of these and other variance estimators are presently under
investigation; subsequent versions of this document will discuss these alternate methods.
Extension must be made to account for irregular spacing. It should also be noted that in
some circumstances the methods of spatial statistics may provide adequate assessment of
variance.
The confidence bounds obtained using the Horvitz-Thompson variance estimator
(Equation 4) are based on normal approximation. This approximation may be inadequate
for estimating confidence bounds at the tails of the distribution, even for moderate sample
sizes. In the special case of equal probability of selection and the randomization
assumption, confidence bounds can be obtained by exact methods (see Section 3). However,
exact methods may also yield inadequate confidence bounds at the tails of the distribution
(also discussed in Section 3).
2.1.1.2 Tiered Structure
The following description of the tiered structure was summarized in the EMAP design
report (Overton et al., 1990).
The Tier 1 sample
The EMAP sample design partitions the area of the United States into hexagons,
each comprising approximately 635 square kilometers (Overton et al., 1990), and selects a
point at an identical position in each hexagon; selection of this one position is random
-(equiprobable) over the hexagon. This method results in a triangular grid of equally spaced
points. An areal sample of a 40-km2 hexagon (40-hex) is imposed on each point, with the
sampled hexagonal area containing ^ of the total area of the larger hexagon. This fraction,
g, therefore represents a constant inclusion probability, jt, and 16 represents a constant
14
-------
weight, w, to be applied to each fixed-size areal sample. Because other enhancements of the
grid are expected, possibly with different sized areal samples, the following formulae will
incorporate general notation.
No detailed characterization of indicators is collected at Tier 1, so no distribution
functions will be computed based on the Tier 1 data. It is of interest, however, to estimate
the total number of discrete resource units in specific populations at Tier 1. This estimation
is possible for any resource class for which units can be uniquely located by a position point.
The following formula can be used to estimate the total number of units for a particular
resource (r) at Tier 1:
where ^Dr is the domain for resource r. A domain of a resource is a feature of the spatial
frame that delineates the entire area within which a sample might encounter the resource
(Section 1.1). In these formulae, the quantity nr| represents the number of units for the
particular resource at grid point i. The variance can be estimated using Equation 4b, as
follows:
where nr is the number of grid points for which the areal sample hexagon includes part of
the domain of the resource. It is worth noting again that the estimates of variance are often
expected to slightly overestimate variance if the systematic design results in gieater precision
than would a randomized design, thus providing for a conservative estimate of variance.
r
(6)
V"(N,)
r Cfl
¦*> r
(7)
15
-------
The reduced Tier 1 sample
In preparation for selecting the Tier 2 sample; resource classes are identified. Some of
these classes will be treated as sampling strata, and hence be designated as 'resources'. The
Tier 1 sample for such a 'resource' is reduced so that it contains only one unit at each 40-
hex at which that resource is represented. This condition effectively changes the sample
from a set of systematic areal samples to a spatially well-distributed subset of units from the
population of units for the particular resource.
A consequence of this sample reduction step is the introduction of variable inclusion
probabilities in the Tier 1 sample, reflecting the scheme used to reduce nr, to 1. For
example, if a random sample of size 1 is selected from the nrt units of hexagon t, then the
selected unit will have 7Tlr, = w^—. A consequence of this is that Nr = ^w,n = w£nri,
" Slr Slr
where Slr is now the sample of points for which nri>0, at each of these points, the sample
now consists of one unit of resource r. Because this estimate, Nr, is identical to the original
Tier 1 estimate, it has the same variance. This sample, Slr, is then subsampled to generate
the Tier 2 sample, S2r. Again, note that it is now a resource-specific sample of units, not
the original areal sample.
The Tier 2 sample
The Tier 2 sample, S2, is a probability subsample, a double sample, of the Tier 1
sample of resource units. At this tier, a specific resource has been identified and the reader
should remember that subsequent equations are for specific resources. The reader should
also be aware that the subscript i will now index a resource unit, not the grid point. All
Tier 2 samples for discrete resources consist of individual units from the universe of discrete
resource units.
With these changes, the estimator presented in Equation 2 is appropriate for the
sample collected at the second tier. The indicator values are summed over the samples
16
-------
surveyed at the second tier by the assigned weights. The inclusion probabilities account for
the probability structure of this double sample. Overton et al. (1990) identified the
probability of the inclusion of the t''1 unit in the Tier 2 sample as the product of the Tier 1
inclusion probability and the conditional Tier 2 inclusion probability. The conditional Tier
2 inclusion probability is defined as the probability of inclusion at Tier 2, given that the
unit was included at Tier 1. This product is still conditional on the Tier 1 sample and leads
to conditional Horvitz-Thompson -estimation.
In subsequent equations, the subscripts 1 and 2 represent the first and second tiers,
respectively. The weighting factor for unit t at Tier 2 is defined as
w2r, = wlr,w2.1r, ' (8)
where wlr( is the weighting factor for the ith unit in the Tier 1 reduced sample and w2 lr( is
the inverse of the conditional Tier 2 inclusion probability for resource r.
Selection of the Tier 2 sample from the reduced Tier 1 sample and calculation of the
conditional Tier 2 inclusion probabilities are discussed in Section 4.0 of the EMAP design
report (Overton et al., 1990) This procedure generates a list in a specific order, based on
spatial clusters. Clusters of 40-hexes are arbitrarily constructed with uniform size of the
initial Tier 1 sample of the specific resource. The reduced Tier 1 sample is sorted at random
within clusters, and then the clusters are arranged in an arbitrary order. A subsarr.ple of
fixed size, n2r, is selected from Slr by ordered variable probability systematic sampling from
this list. The purpose of this elaborate procedure is to generate a spatially well-distributed
Tier 2 sample.
The Tier 2 conditional inclusion probabilities are proportional to the weights at Tier
1:
_ _ n2rwlr« _ n2rWlrt fa\
2-ir,~r^;~ n • ( )
S,r
17
-------
where Nr was defined for a specific resource in Equation 6. However, for some units
N
w, >5r-^. To obtain conditional inclusion probabilities < 1, these units are placed into an
lrl 2r ~
artificial 'certainty' stratum, all having 7T2 lr,= l. This step takes place prior to the cluster
formation. For the remaining units, the selection protocol and achieved probabilities are
modified to adjust for the number of units having probability 1. These remaining units now
have conditional inclusion probabilities:
_ °2rWlri (10)
2-,r'" ' ( }
lr
where njr equals n2r less the number of units entering S2r with probability 1, and SJr equals
Slr less these same units that were included with probability 1.
Note that this selection protocol is designed to create Tier 2 inclusion probabilities as
nearly equal as possible:
7Tlr, , if t is in the artificial stratum with n"2.]ri=l
(")
n2rt ~
^ 2r— , otherwise,
I>lr,
Sir
and if no units are in the artificial stratum,
_ ni
2" ~ Nr
*2r. = # , (12)
where Nr is the Tier 1 estimate of the total number of population units in resource r. For
generality, we will retain the variable probability notation, but ideally the sample will now
be equal probability. If there is great deviation from equiprobability, then consideration
should be given to enhancement of the grid, perhaps by reducing the size of the Tier 1 areal
sample, in order to better achieve the goal of equiprobability.
The variance estimator presented in Equation 4 is also appropriate for estimating
variance from the Tier 2 sample, using inclusion probabilities defined above for Tier 2.
18
-------
When no units enter with ^en
2(n2r - 1)
However, if unit l enters with l,2.in=^' then,
2n2rw2r,w2 -w2r,-w2rj
w,_.. = ~
[2nlrwlrtwlrj-wlr,-wlrj|
2(nlr -1) J"21'' 1 j
Because the term in the bracket equals wlr(J, Equation 14 simplifies to w2rtJ = wlriJw2 lrj.
Special case: The Tier 2 point sample of lakes
We assume a stratified design with equiprobable selection within strata. If a quasi-
stratified design is used instead, appropriate analysis can condition on the realized sample
sizes in the classes and use post-stratification.
Special case: The Tier 2 point sample of streams
A point sample of streams at Tier 2, rather than a sample of stream reaches, has
been proposed. With a few simple changes, that point sample will be a rigorous
equiprobable point sample of the stream population with a very simple estimation
algorithm. A probability sample of stream reaches, on which the sample points are
represented and from which other estimates of population structure can be obtained, will
also be provided. The protocol provided will apply to the sample of stream reaches and the
point sample design proposed to us.
We assume a stratified design with equiprobable selection within strata. If a quasi-
stratified design is used instead (as has been proposed), appropriate analysis can condition
on the realized sample sizes in the classes and use post-stratification.
19
-------
sir is the Tier 1 collection of reaches in a resource stratum identified via the 40-
hexes. S2r is generated by selecting njr points from this set using the frame representation
of stream length This process results in (1) the selection of nj,. frame reaches with
probability proportional to frame length, and (2) the random selection of 0, 1, 2, ... points
in each selected reach with inclusion density, given reach selection, inversely proportional to
length. The resultant point sample is equiprobable on the population of stream reaches.
Then, in terms of the sample of reaches,
*.
wD Si £ "J wn r
L-= ^ E ir—=# £-r <15a)
2r , = 1 n n2r , = 1 'ri
estimates the total length of population reaches, where for resource r, lri} represents the
length of the Jth actual reach in the Ith sampling unit, /*, represents the length of the Ith
frame reach, and zTt represents the sum of length of all reaches in the ith sampling unit.
Recall that a sampling unit is a frame reach. Also, Dr is the total frame reach length in the
Tier 1 sample of resource r and L"=wDr is the Tier 1 estimate of L*. Because L* is known
on the frame, wDr is replaced by L*, resulting in-
L=L! R (15b)
_ "2r z
where R= 7^. Also,
"2' , = 1 rt
-------
~ L" 23r L
- * ^ " *-K (16b)
s'=sf =L'e-
2r , = i 'r»
For these estimates, the variance estimators for Lr and Nr, are given by L*2var(R),
where the variance of the ratio can be approximated by:
n2r
var(R) = I £(vr, - R)2 , (17)
2r( 2r ~ ) ,=i
2 k -
where vri= , when computing var(Lr), or Ur,= -p* < when computing var(Nr). Note
ri r«
that this formula is different from most ratio variance estimators in this report.
The distribution function is estimated from the data collected at the sample points,
not from the set of sample reaches, as in the above estimation of N and L. Recall that each
selected frame reach will have an associated sample point; this may result in 0, 1, 2, or more
sample points for the actual streams represented by the frame reach. Let:
Fr(y) = ^P, (18)
where nr is the total number of sample points in the resource, as realized in stream reaches,
and nr(y) is the number of these for which the observed indicator value is less than y;
nr(y)= ^I(ytJ
-------
Here, xr| equals the
number of sample points for the l"1 frame reach. This then simplifies to:
(19b)
Then it is necessary to estimate L(y) as a product, Lr(y) = LrFr(y), with variance
estimator, var(Lr(y))=L? var(Fr(y)) + F?(y) var(Lr).
This analysis presumes that there are no strata for stream reaches For two strata
(l" and 2 + order), simple modification of these formulae will suffice. The numbers and
length of reaches in the cross-classified strata are estimated and then combined. For F,
sample points from units in the wrong stratum are simply combined with the correct
stratum If more than these two strata are desired, the general method of frame correction
via sample unit correction is not feasible, and the method prescribed here is not appropriate.
2.1.2 Extensive Resources
The universe of an extensive resource is a continuous spatial region. If the domain is
correctly identified, the universe of the resource will be a subset of the domain and may be
fragmented over that domain Extensive resources may have populations of two kinds,
continuous ort>discontmuous. Because these discontinuous populations are defined on a
continuous universe, they are referred to as extensive resources. Continuous populations are
referred to as extensive resources as well. Section 3.3.4 of the EMAP design report (Overton
et al., 1990) describes two methods for sampling extensive populations, via a point or areal
sample. For each resource, the design provides for the classification of a large areal sample
(40-hex) at each grid point; these areal samples are also subject to subsampling via points or
areal subsamples.
22
-------
At Tier 2, two distinct directions are available, depending on the nature of the
resource. Specifically, if the domain of the resource is well known from existing materials,
as are boundaries of the Chesapeake Bay or Lake Superior, then the Tier 1 areal sample is of
little value either in estimating extent or in obtaining a sample at Tier 2. In these cases,
the domain should correspond to the universe. Conversely, if the spatial distribution or
pattern of a resource is poorly known, as it will be for certain arid land types or for certain
types of wetlands, then the Tier 1 areal sample may provide the best basis for obtaining a
well-distributed sample at Tier 2. Other factors, such as size of the domain and degree of
correspondence of universe and domain, will influence the sampling design. In the first
circumstance, the Tier 2 sample will be selected from the areal sample obtained at Tier 1.
In the other, the Tier 2 sample will be selected from the known universe by a higher
resolution point sample that contains the base Tier 1 sample.
2.1.2.1 Area! Samples
All Tier 1 areal samples are expected to be collected with equal probability.
Enhancement of the grid may be made for any resource, but any resource should have
uniform grid density over its domain. Further, the areal sample imposed on the grid points
will be of the same size for any resource, so that algorithms are presented only for equal
probability sampling. The following formula estimates the total areal extent of a particular
resource (r) over its domain
where the domain was discussed in Section 2.1.1.2. The value ari defines the area of
resource r in the areal sample at grid point i, and w is the inverse of the density of the grid
divided by the size of the areal sample. For equal probability sampling, the variance
Tier 1
r
(20)
23
-------
estimator is
r S[ OJ
r r
(21)
where n is the number of whole or partial areal sample hexagons located in ^r. As with the
discrete resources, even though the sample is selected by a systematic grid, we assume, in
order to estimate variance, that the sample was taken from a locally randomized scheme.
The justification of this assumption is similar to that for discrete resources. Alternate
procedures are available when the assumption is questioned (see Section 2.1.2.3).
At the second stage of sampling for extensive resources, the distribution function for a
particular resource is estimated. To identify the objective of Tier 2 sampling, we can write
estimating equations as though a complete census were made at Tier 2. The general
conceptual formula for the distribution function of areal extent for a specific resource (r)
over its domain ^Dr is
where ar|(y) is the area of the resource in areal sample t such that the value of the indicator
is less than y. The estimated variance follows Equation 21 as
Tier 2
(22)
(23)
The estimate of areal proportion for an extensive population divides Equation 22 by
the estimate of total areal extent:
24
-------
Fr(y) = ^ • (24)
Ar
In the rare instance in which Ar is known, then an improved estimator of Ar(y) is given by
Ar(y) = ArFr(y) . (25)
Ordinarily, these distribution functions will be calculated at each distinct value of y
appearing in the sample. The variance associated with the areal proportion is the genera)
form for a ratio estimator (Sarndal et al., 1992, Equation 7.2.11) In writing this
expression, it is necessary to identify the specific value, y,, at which Fr(y) is being assessed.
var(Fr(y,)) = [£d*w (w,-l) + EZ . (26)
} } *
* * }
where d; = [arj(y,) - arjFr(y,)], &rj is the area of sample j in resource r, wjfc is defined as
in Equation 5, and A^ is replaced by A^ when Ar is known
In practice, the Tier 2 assessment will be based on a subsample of some kind, and the
above ideal estimation will not be available. The only method proposed for subsampling is
use of a Tier 2 point sample.
2.1.2.2 Point Samples
Two methods of directly sampling objects from the grid points are discussed in the
EMAP design report (Overton et al., 1990, Section 4.3.3.2). A Tier 1 reduced sample of
discrete resource units can be selected by choosing the units into whose areas of influence the
points fall; this method is not currently scheduled for use, but it is a viable method for
several discrete resources. The same procedure can be used to select areal sampling units
from an arbitrary spatial partitioning of the United States. The agroecosystem component
25
-------
of EMAP provides such an example: the units selected for the sample are secondary
sampling units of the National Agricultural Statistics Service (NASS) frame, and estimates
are of totals over subsets of the frame units. Each selected unit is a mosaic of fields and
other land use structures. These structures are then classified and sampled to provide
ecological indicators for characterizing the sampling unit. Essentially, this area] sample is
analyzed exactly like the 40-hex fixed areal unit discussed in the previous section, with the
exception that inclusion probabilities are now proportional to the size of the unit, and the
general formulae (e.g., Equations 2-4) must be used.
An alternate use of the point sample can be applied to an extensive resource, with the
ecological indicators of the resource measured at the grid points. For continuous
populations, such as temperature or pH, the response can be measured exactly at a selected
point. For other populations, it is necessary to make observations on a support region
surrounding the point, like a quadrat. For example, the wetlands resource group could
obtain an indicator, such as plant diversity, from a quadrat sample centered on a grid point.
The indicator measured in the quadrat can be treated like a point measurement. A cluster
of quadrats centered on the grid point provides yet another method for sampling extensive
resources.
This point sample will be applied at Tier 2 in either of two ways. For resources that
depend on the Tier 1 areal sample to provide a sample frame, a high-resolution sample of
points is to be imposed on each 40-hex containing the resource; this arrangement will
generate an equiprobable point sample of the areal fragments of all resources that were
identified at Tier 1. For a resource in which the universe is clearly identified, such as Lake
Superior, a better spatial pattern of sample points will be obtained by imposing an enhanced
grid on the entire universe. In the latter case, the universe is known, whereas in the former
case, the Tier 1 sample provides a sample of the universe, which is then sampled by a Tier 2
point sample.
26
-------
In either case, an equiprobable sample of points is obtained from which resource
indicators will be measured, and the estimation equations will differ only by the weights.
Variance estimators will differ, as one is a single-stage sample and the other is a double
sample.
Point sample for a universe with well-defined boundaries
For a resource in which the universe is known (e.g., the Chesapeake Bay), the general
formula for equiprobable point samples for a resource class is presented. A resource class is
defined as a subset of the resource. For example, two classes of substrate, sand and mud,
can be defined in the Chesapeake Bay. The distribution function of the proportion of a
specific class of a specific resource (re) having the indicator < y reduces to
where nrc(y) is the number of points in resource class rc with the indicator equal to or less
than a specific value, y, and nrc is the total number of sample points in the resource class
rc. Under the randomization assumption, the conditional distribution of nrc(y), given nrc,
is Binomial(nrc, Frc(y)), so that confidence bounds are readily set by the binomial
algorithm in those instances in which spatial patterns indicate adequacy of the
randomization model (Overton et al., 1990, Section 4.3.5). Alternate protocols are available
when the randomization model cannot be assumed (Section 2.1.2.3).
Estimation of the area occupied by an extensive resource class is provided by
where nr is the number of grid points falling into the domain of the resource, and Ar is the
area of the resource. Under the randomization assumption, nrc, conditional on nr, is a
F,(y)= ^
(27)
(28)
27
-------
binomial random variable; bounds on prc are again set by the binomial algorithm, as are
bounds on Arc.
Point sample for universe with poorly defined boundaries
When the universe of the resource is not known and one must use the Tier 1 areal
sample as a base for the Tier 2 sample, then Equation 19 provides the estimates of Ar at
Tier 1. Then the Tier 2 sample is an equiprobable sample of points selected from the area
of the resource class contained in the 40-hexes. This procedure is implemented as a
tessellation stratified sample in each 40-hex, with k= 1 to 6 sample points per 40-hex. With
only 1 point per 40-hex, the binomial algorithm will be appropriate under the randomization
assumption; multiple points per 40-hex will require an explicit design-based expression for
variance. In all cases,
Kc = Ar^ = Arprc , (29a)
Frc(y)= . (29b)
Arc(y) = ArcFrc(y) = Ar ^ = ArR . (29c)
It should be recognized that Equation 29a is a special case of Equation 29c.
When Jk>l, the following variance formula is appropriate:
The outside summation is over the 40-hexes and dtJ = (I(rc, ytJ
-------
formula, applied to dfJ.
In addition,
var(Arc(y)) = var(Ar) R2 + A? var(R) (31a)
where var(R) follows,
var(R) = £{,?,d;'+ (.?1d')!/t} (3,b)
where dtJ = [I(rc, ytJ < y) - I(r)R]).
Note that var(Ar(y)) = var(Ar)F^(y) + A^var(Fr(y)); F replaces R in Equation 31a as well
as in Equation 29c.
2.1.2.3 Alternative Variance Estimators
Confidence bounds for distribution functions based on point samples of continuous
and extensive populations can be computed by several methods. The choice of a method is
determined by the pattern of the resource area. First, the binomial approach is suggested
for fragmented area distributed randomly across the domain. When this condition has been
met, the randomization assumption holds and the binomial model is appropriate for
computing confidence bounds.
If the area, Ar(y), is in an entire block, rather than fragmented, then the binomial
algorithm will overestimate variance, and alternative estimators will be needed. Other
methods allow for a nonfragmented area and the randomization assumption is not required.
The mean square successive difference (MSSD) is suggested for a strict systematic sampling
scheme. Another method, the probability sampling method using the Yates-Grundy
variance estimator, requires that the design have all positive pairwise inclusion probabilities.
29
-------
One such design that provides this structure is the two-stage tessellation stratified model.
The MSSD is discussed by Overton and Stehman (1993a) and the probability estimator is
discussed by Cordy (in press). Methods of spatial statistics are also available for estimating
this variance.
Mean square successive difference estimator
The variance estimator based on the mean square successive difference is intended to
provide an estimate of variance for either the mean of values from a set of points on a
triangular grid or obtained from a random positioning of the tessellation cells of the
hexagonal dual to the triangular grid. In the latter case, the data are analyzed as though
the values were taken from the center of the tessellation cell. The data set consists of all
points falling in the target resource The MSSD has not been developed for this tessellation
formed by triangular decomposition of the hexagons
Smoothing
Smoothing often results in improved variance estimation (Overton and Stehman,
1993a). The following method is from that report. For each datum, y, calculate a
'smoothed' value, y", as a weighted average of the datum and its immediate neighbors (i.e.:
distance of one sampling interval). Weighting for this procedure is provided below. As a
result, two new statistics are generated at :»ch point: y* and A.
30
-------
Number of Neighbors
y* values
A values
6
(6y, + £yj)/12
7/24
5
(7y, + £yj)/i2
5/24
4
(8y,+ £y,)/i2
5/36
3
(»y.+ Ey,)/i2
1/12
2
(lOy, + Ey,)/"
1/24
1
(iiy. + Eyp/12
1/72
0
y.
0
Given these smoothed values, summing over all data points,
= m
Mean Square Successive Difference
Identify the data along the three axes of the triangular grid; each point will appear
once in the analyses of each axis. Analyze the y", not the original y. For each axis,
calculate
^ = E(y» - yfc)2 - (33)
where ya and y^ represent members of a pair of adjacent points, and where the summation
is over all adjacent pairs identified on this axis. Also, calculate for each axis,
s2 = (E(y; - yJ)]2- (34)
31
-------
where it is necessary that all pair differences be taken in the same direction. From these
statistics, calculate for each axis,
s =
2_(*2"S
2(m -1)
(35a)
and
A2 = , (35b)
m(m - 1)
where m denotes the number of pairs in the above summations.
These statistics are then combined over the three axes, where summation is over all
successive pairs in the fc''1 axis.
«. = % to '
£(mfc-l)s£ t
s2=^ . (36b)
k= 1
Lastly, the following are computed to provide estimates of the variance of the mean values.
v(y ,„,) = vi + (37a)
and
%.«r)=C ' (3?b)
where nr equals the number of sample points in resource r. This method has not been
extended to distribution functions, but the extension is straightforward.
32
-------
Yates-Grundy variance estimator for tessellation stratified probability samples
Investigation of this variance estimator is continuing. The method will be included
in the next version of this manual.
2.2 Model-Based Estimation Methods
The previous section was devoted to design-based methods used to derive population
estimates, distribution functions, and confidence bounds. Model-based estimation is another
common approach to computing population estimates. In this approach, certain
assumptions with regard to the underlying model are made, and the information provided
by auxiliary variables often provides greater precision of the estimates.
Within EMAP, these model-based methods have not been developed to the same
degree as the design-based methods. No algorithms for confidence bounds of distribution
functions using model-based methods are presented in this report, although they are under
development. The purpose of including this section is to provide a brief description of
currently available model-based methods. Further, application of the model-based methods
has so far been restricted to discrete populations. Investigation of the applicability of these
methods in continuous and extensive populations is under way.
Three ways in which model-based methods can be used within EMAP are discussed:
(1) data collected on the full frame across the population can be incorporated into the
estimation process using prediction estimators to improve precision; (2) because the EMAP
design is a double sample (Section 2.1.2.2), auxiliary variables on the first-stage'sample can
be used to improve the precision at the second stage; and (3) a calibration method is
described for modifying an indicator variable to adjust for changes in instrumentation or
protocol such methods are needed to maintain the viability of a long-term monitoring
program.
The strategy is to begin with the basic design-based methods and to incorporate
33
-------
model-based methods as the opportunity to do so becomes apparent and the necessary frame
materials are developed. The design-based methodology will be enhanced by the use of
models whenever feasible.
2.2.1 Prediction Estimator
If auxiliary data that can be used to predict certain indicators are available on the
entire frame, model-based prediction techniques can be used to obtain predictions of the
response variable for the population. These predictions then become the base for population
inference
These methods Tequire a vector of predictor variables defined on the frame, while the
response variable is measured on the Tier 2 sample. A model is postulated for the
relationship between the response variable, y, and the vector of predictor variables, x:
Based on this model, a predictor equation, y =g(x), is estimated from the Tier 2 sample.
The equation for the basic estimator, which is referred to as the general regression
estimator, is defined as
where 11 and S2 designate the universe of units and sample units at Tier 2, respectively.
The variance of this estimator is estimated by
y=g(x) + e, with Var(c) = h(x) .
(38)
T„= £y,+ £ w2. (y, — y,) ,
'2
(39)
var(T„) = £d? w2, (w2, - 1) + £ £d, (w2, w2j - w2(J)
(40)
34
-------
where d, = (y, — y,) (Sarndal et al., 1992, Equation 7.2.11). Our experience (Overton and
Stehman, 1993b) suggests that this equation slightly underestimates the variance; this result
is not unexpected because Equation 40 is based only on the variance of the second term of
Equation 39.
One model-based estimator of the distribution function of the proportion of numbers,
as established by Rao et al. (1990), is based on the general regression estimator and defined
as
F(y) = {E P[(y,+0
-------
2.2.2 Double Samples
As mentioned previously, the EMAP design is a double sample with Tier 1
representing the first stage (or phase) and Tier 2 the second stage. Through most of this
document, design-based methods are provided for the Tier 2 sample; these methods are
similar to those described for single-stage samples. However, where model-based methods
are used, double sampling formulae can be quite different from single-stage formulae. An
elementary discussion of double sampling with model-based methods is presented in Cochran
Existence of an auxiliary variable on the Tier 1 sample will enable model-based
double-sample methods at Tier 2. EMAP does not require a resource-specific frame, but it
does allow for acquisition of more detailed information for many resources. There is a Tier
1 sample for all resources, and for most resources, the Tier 2 sample is a subset of the Tier 1
sample, thus providing the basis for model-based double-sample methods.
The model specification follows the developments under the general prediction model
(Equation 38). The basic estimator, derived from the general regression estimator, is
defined as
where Sj and S2 define the sample at Tiers 1 a»id 2, respectively. The form of this estimator
allows equal or variable probability at Tier 1. The variance estimator for Equation 42
follows Sarndal et al. (1992, p.365, Eq. 9.7.28):
var(T y) = EE ("uwij " *!,>. Vz.i.j + EE (w2.i,w2.ij " w2.i.,)d.wi.Vu < <43)
S2 S2 S2 S2
where d,=(y,-y,).
(1977).
(42)
36
-------
The estimate of the distribution function of the proportion of numbers is developed as
an extension of Equation 41,
?(y)= £ P[(y,+f,)
-------
2.3 Other Issues
2.3.1 Missing Data
Two types of missing data are expected to arise in EMAP. One type is a missing
sampling unit, such as a missing lake. The other type of missing value occurs within a
sampling unit, such as a missing observation on a specific chemical variable or a missing
suite of chemical variables for a lake. In this situation, information is available on some,
but not all, indicators for a specific unit or site.
2.3.1.1 Missing Sampling Units
There appears to be no basis for imputation of a missing sampling unit where no Tier
2 information is available to predict that observation. Therefore, missing sampling units
should be considered as representing a subset of the subpopulation of interest that is
unavailable for measurement. All procedures outlined in this document accommodate data
sets that contain missing units. No adjustments to the weighting factors are necessary;
summation is over the observed portion of the sample, and the estimates produced apply to
the subpopulation represented by the sample analyzed. When Yates-Grundy estimation of
variance is used, it will be necessary to modify the equation; this requirement is the primary
reason for using the Horvitz-Thompson variance estimator when possible.
In a long-term program, this approach of classifying missing units with the
subpopulation not represented by the sample is clearly appropriate; such units can be
sampled in subsequent years without having to modify sample weights again. This
approach is also consistent with the practice of allowing sampling units to change
.subpopulation classes from time to time. Comparisons must take this into account, but
such class changes will always be a feature of long-term monitoring programs
A general problem remains when a substantial number of resource sites cannot be
measured; EMAP must find a way to provide indicator values for such sites. When the
38
-------
problem is severe, it might be possible to develop an alternate indicator suite that can be
obtained via aerial television or photography. Perhaps it will be possible to impose a higher
(lower resolution) sample level that will provide for model-based methods and predictors of
the indicator. (This option will be difficult because the predictor relation must be developed
specifically for the subpopulation of concern.) But whatever the solution, some method is
required to provide representation of these sites. Until then, it is appropriate for these to be
identified in the subpopulation for which no sample has been generated and about which
nothing is known.
2.3.1.2 Missing Values within Sampling Units
It is advantageous to use information collected on a specific sampling unit to impute
any missing observations for that sampling unit. To minimize error, a multivariate analysis
is suggested, utilizing the data collected for that particular unit. No specific procedure is
suggested for this analysis, because most standard analyses will impute similar values, and
because the method must be tailored to the circumstances. Some multivariate procedures
are discussed in statistics books that concentrate on imputation of missing values (cf., Little
and Rubin, 1987).
2.3.2 Censored Data
For certain measurements, values for indicators will be less than the identified
detection limit; exact values cannot be measured for such units or sites. This problem is not
uncommon and has been discussed frequently in the literature applying to water quality
management programs (cf., Porter et al., 1988). Caution is prescribed when characterizing
data that consist of many observations below the detection limit. Proper analysis and
reporting can prevent improper inference for these data; specifically, it must be noted that
although reliable values are not provided, a great deal is known about the site that has a
value at or below the detection limit.
-------
To guide the data analyst in the treatment.of the indicator that contains censored
observations, the proximity of the detection limit to the critical value of the indicators needs
to be considered. Indicators, such as chemical variables, that have detection limits near or
above the critical value should not be considered meaningful indicators; the information
supplied by such an indicator is too fuzzy to justify inferences. In such cases, the most
meaningful parameters are those whose estimates are not affected by censoring. Other
indicators have a detection limit well below the critical value. For these indicators, it is
suggested that values below the detection limit should be scored to the detection limit and
analyzed with the uncensored data.
The mean is a poorly defined statistic to describe censored data. However, the scored
mean can be interpreted, even though it is slightly biased. Another statistic, the scored
mean minus the detection limit, is unbiased for the mean in excess of the detection limit,
which is a well-defined population parameter. If the distribution below the detection limit is
modeled, and the mean value below the detection limit is calculated, then the scored mean
can be converted into an unbiased estimate of the true mean, given the model.
On the other hand, the median is less ambiguous than the mean and is more
appropriate for characterizing these indicators. Usually the median will not be affected by
scoring. Distribution functions also should not be described below the detection limit. This
restriction is another reason for scoring; standard analyses of the scored data yield the
desired distribution function, emphasizing that the shape of the curve below the detection
limit is unknown. Because the critical level changes with circumstance, it is desirable to
present the truncated (scored) distribution function, to be interpreted as the situation
dictates. In fact, the capacity to truncate the distribution function without impairing
inferences is one of the strong arguments for choosing this parameter to characterize these
data.
40
-------
Modeling the function below the detection limit is one method proposed in the
statistical literature to modify estimates from censored data (Cox and Oakes, 1984; Miller,
1981). However, a hypothetical distribution must be assumed to represent the censored
data. In EMAP, distributions are defined on real populations and are unlikely to follow any
distributional law. We propose that the distribution function reflect the data alone and that
the unsupported portion of the distribution function is not described. Use of the scored
mean is somewhat less justifiable, but generally consistent with this position.
2.3.3 Combining Strata
The strata that form the structure of the Tier 2 sample are established from classes of
resources identified at Tier 1, on the Tier 1 sample. The seven basic resources are the
foundation of this structure, but there is provision for further classification leading to several
strata for lakes, several for forests, and so on. These strata are referred to as resources in
this report.
Tier 2 selection is then stratum (resource) specific and independent among strata.
This structure is chosen to provide inferences within strata, with the thought that few
occasions will arise for inferences involving combined strata. For example, a distribution
function [F(y)] combining small and middle-sized lakes will be dominated by small lakes. If
the population of large lakes is of interest, it must be characterized separately. Further, a
wide range of sizes makes the frequency distribution less useful in characterizing the
population. Still, because there may be interest in a population consisting of the largest of
the small lakes and the smaller of the middle-sized lakes, analysis of combined strata is
needed.
41
-------
2.3.3.1 Discrete Resources
Samples are combined across strata to compute the Tier 2 estimates Weights will
not be uniform, so the Horvitz-Thompson algorithms using weights are needed. Estimation
of N0(y) and Fa(y) is identical to the estimation algorithms for a single stratum, but
estimation of variance requires modification. The basic formula for estimating variance is
also unchanged, only the w2(J must be modified. Specifically,
if i and j are from the same stratum, then
2n2w2.w2j " w2. - w2j
W2'-' " 2 (n2 - 1) ;
or if i and j are from different strata, then
w2ij = w1,jw2.1,w2.1j
where, if i and j are from different 40-hexes, then
2nlwl.wlj " wl. " wij
Wl,J" 2 (nj - 1)
or, if i and j are from the same 40-hex, then
(45)
(46)
(47)
wi»wU
wl,j = —W— ' (48)
where w is the weight associated with the basic Tier 1 area! sample.
In the case of the quasi-stratified design used for lakes and streams, the
recommendation is that the sample be conditioned on the realized sample sizes in the several
distinct classes having equal inclusion probabilities (within class). This approach leads to a
42
-------
post-stratified sample that can be analyzed exactly like the sample from a stratified design.
The gain in precision will carry over into analysis of combined strata in the manner
discussed in this section.
2.3.3.2 Extensive Resources
Procedures for combining strata for point samples in extensive resources are the same
as those outlined for discrete resources (Section 2.3.3.1). Methods to combine strata for
areal samples in the extensive resources are still under consideration and will be addressed at
a later time.
2.3.4 Additional Sources of Error
Other potential sources of error can be expected in the process of developing the
distribution function and confidence bounds Some of these have been discussed after
evaluation of the Eastern and Western Lake Surveys (Overton, 1989a, 1989b). These
additional sources of error add to the uncertainty and bias of the estimated distribution
function. Research is presently under way to investigate methods, such as deconvolution, to
correct for these added components of error and bias. Preliminary methods are
unsatisfactory, and two different approaches are being followed to improve results. These
methods will be introduced to EMAP analyse., as they become available.
The rounding of measurements reduces precision in quantiles and distribution
function estimation. Analyses of the National Surface Water Surveys suggested that
reporting data at two decimal places beyond the inherent accuracy of the indicator
satisfactorily reduces bias attributed to rounding error (Overton, 1989b). It is recommended
that additional decimal places be carried into the data set if they are provided by the
instrumentation. Additional rounding should be made only at the reporting step, and the
43
-------
rule for rounding should take into account gain in precision from averaging and other
statistical practices.
2.3.5 Supplementary Units or Sites
Supplementary units, in addition to the yearly EMAP grid points, have been selected
and measured or remeasured by some resource groups. For example, a set of supplementary
units can be selected as a subset of one of the interpenetrating replicates. The
remeasurement of these supplementary units is directed at specific issues, such as estimation
of variance, and the selection procedure is likely to be influenced by this purpose. If data
from supplementary probability samples are combined with the general EMAP sample, it is
necessary to use a protocol for combining two probability samples. If the supplementary
data are not from a probability sample, then it is necessary to use a protocol for combining
found data with probability sample data (Overton et al., 1993). Ordinarily, a good strategy
will be to use these supplementary data only for analyses initially intended. The effort
necessary to satisfactorily combine supplementary data within the general sample analysis,
such as the distribution functions, is sufficiently great that one should be reluctant to
attempt this combination. On the other hand, there will be certain circumstances in which
this effort is justifiable.
44
-------
SECTION 3
DISTRIBUTION FUNCTION ALGORITHMS
The types of distribution function algorithms, along with their associated conditions
for application, are presented in Table 1. The first part of this table (A) presents the
various cases yielding the distribution of numbers, N(y). Part B presents the various cases
discussed in this report yielding the distribution functions for the proportions of numbers.
The methods of obtaining the distribution functions for size-weighted statistics are presented
in Part C.
To explain the notation presented in the following algorithms, some terminology is
introduced. The target population size, N, is the size of the target subset of the universe of
units, defined as <11 The following algorithms are written to obtain estimates over a
particular subpopulation of interest. For a particular subpopulation (a), the distribution of
numbers is denoted as Na(y) and the distribution of the proportion of numbers is denoted as
Fa(y) NQ denotes the subpopulation size over the subpopulation, a. In addition, the n and
na refer to the sample size from the population and subpopulation, respectively.
The variance estimator discussed in Section 2 is based on the Horvitz-Thompson
theorem and is appropriate for both equal and variable probability sampling, independent of
a known population or subpopulation size. The confidence bounds using this variance
estimator are then based on the normal approximation. Therefore, for any condition, the
general Horvitz-Thompson algorithms for N0(y) and F„(y), as presented in the following
.subsections under variable probability sampling, are appropriate.
Estimation of these bounds simplifies under equal probability sampling when the size
of either the population or the subpopulation is known. For example, an exact confidence
bound for Fa(y) can be based on the hypergeometric distribution in the case of equal
45
-------
probability sampling when the subpopulation size is known. When the subpopulation size is
unknown, these bounds can be based on the binomial distribution.
It should be emphasized that there are no differences in the distribution functions
obtained from the alternative design-based approaches discussed in this report. Further, the
distribution functions obtained under the same conditions based on the Horvitz-Thompson,
the binomial, or the hypergeometric algorithm are the same. The differences occur in the
computation of the confidence bounds. Note, however, that model-based distribution
functions will be different from those obtained from design-based methods.
In all situations, the algorithms in this report provide two one-sided 95% confidence
bounds. The combined upper and lower confidence bounds enable two-sided 90% confidence
bounds on the distribution function The Horvitz-Thompson algorithm estimates standard
errors from which the confidence bound is based on a normal approximation. The
alternative methods directly provide confidence bounds based on the exact binomial or
hypergeometric distributions. All design-based methods suggested for discrete populations
assume the randomized model, as discussed in Section 2. Because exact methods are usually
preferred over approximate methods, the exact methods are suggested for those cases by
which the conditions justify their use.
A test data set was applied to the following algorithms. Any resource group
interested in comparing their versions of these algorithms to the ones provided in this report
are encouraged to contact the authors. A copy of the test data set will be provided in order
to compare results from other programs.
3.1 Discrete Resources
In this section, examples are provided for each of the possible approaches to obtaining
Na(y) and F0(y) for discrete resources. For each of these approaches, the conditions and
assumptions of the selection of the sampling units are defined. For quick reference, Table 1
46
-------
(Section 4) presents this information in condensed form. An interest in obtaining the
distribution function of numbers and proportion of numbers across the subpopulation is
expected for all resource groups. For example, the lakes and streams resource group can
compute the numbers or proportions of numbers of lakes with some attribute based on this
algorithm.
3.1.1 Estimation of Numbers
A number of algorithms are presented for computing the distribution function for
numbers. The choice of the algorithm is dependent on whether the units were chosen by
either equal or variable selection. The first three cases (algorithms) in this section derive the
distribution functions based on an equal probability selection of units and the latter two
cases (algorithms) are based on an unequal probability selection of units.
Equal Probability of Selection
In this subsection, three cases are provided based on information that is known or
unknown. For the first algorithm, N is either known or unknown and Na is known; this
algorithm produces confidence bounds based on the hypergeometric distribution. For the
second algorithm, N is known, but Nfl is unknown; this algorithm is also based on the
hypergeometric distribution. For the third algorithm, both N and Nfl can be either known
or unknown; this algorithm produces confidence bounds based on the Horvitz-Thompson
variance estimator and the normal approximation.
47
-------
(Case 1)
Case 1— Estimation of Nn(y): Discrete Resource, Equal Probabilities, Nu known, n=na.
Confidence Bounds by Hypergeometric Algorithm.
Conditions for approach
1. The frame population size, N, can be known or unknown.
2. The subpopulation size, N0, is known.
3. There is an equal probability of selection of units from the subpopulation.
4. Sample size condition: n=na.
Outline for Algorithm
Under the given conditions, the confidence bounds can be obtained by either the
exact hypergeometric distribution or by the normal approximation. This case provides the
confidence bounds for Na(y) by the hypergeometric distribution, when Na is known. The
normal approximation bounds are provided in the next subsection (see Examples 4 and 5).
This algorithm computes the confidence bounds for each point along the curve using
the hypergeometric distribution. In the following formula, N0 is the subpopulation size; nQ
is the sample size from the subpopulation; Na(y) refers to the number of units, u, in the
subpopulation, A, for which yu < y; and na(y) refers to the number of units in the sample
from Na, Sa, for which yu < y. Under these conditions, na(y) has the following
hypergeometric distribution. Let X represent the random variable for which n0(y) is a
realization:
The upper confidence bound is computed by obtaining the largest value of Nfl(y) for which
Prob[X < na(y)] > 0.05. The lower confidence bound is computed by obtaining the smallest
value of Nfl(y) for which Prob[X > n0(y)] > 0.05.
(49)
48
-------
(Case 1)
A GAUSS program is presented here that derives the confidence bounds based on the
hypergeometnc distribution. Comments in capital letters in braces explain the
programming steps. Under the conditions of Case 1, the upper and lower halves of the
confidence bounds are symmetric.
CALCULATION OF CONFIDENCE BOUNDS ON Nn(y) BY THE
HYPERGEOMETRIC DISTRIBUTION
load x[a,b] = data; {LOADS DATA FILE WHICH INCLUDES LABEL CODE AND
VARIABLE TO BE ANALYZED. HERE a DESIGNATES
THE SAMPLE SIZE, n0, AND b DESIGNATES THE
NUMBER OF COLUMN VECTORS}
x=x[.,2]; {IDENTIFIES THE VARIABLE OF INTEREST IN SECOND COLUMN}
nm=rows(x); {NUMBER OF ELEMENTS OF INTEREST IN SUBPOPULATION, nc
IN THIS ALGORITHM, n=nm=n0}
n=rows(x);
NN=Na; {DEFINES TOTAL SUBPOPULATION SIZE HERE, Na }
x=sortc(x,2); {SORTS VARIABLE OF INTEREST}
y=seqa(l,l,nm), {CREATES SEQUENCE OF NUMBERS}
x2=x[.,2]; {DEFINES VARIABLE OF INTEREST AS X2}
x=y~x2; {CREATES MATRIX x}
zz=x; {DEFINES MATRIX x as zz}
{THE FOLLOWING COMBINES RECORDS WITH DUPLICATE VALUES
OF THE VARIABLE}
xx=zeros(l,2);
q=0;
i=l;
do while i < nm;
if x[i,2]==x[i+l,2];
q=q-f 1; I;
else;
xx=xx|x[i,.];
endif;
i=i+l;
endo;
xx=xx|x[nm,.];
x=xx;
49
-------
(Case 1)
{THE FOLLOWING STEPS BEGIN CONFIDENCE BOUND ESTIMATION}
r=rows(x),
z=zeros(r,l);
xl=x[.,l],
x2=x[.,2];
x=x2"x 1*(N N »x 1 /nm )'z;
{THE FOLLOWING STEPS GENERATE THE UPPER CONFIDENCE BOUND}
i=i;
do while i <= r; {BEGINS INITIAL DO LOOP}
rr=x[i,2];
mm=trunc(NN*rr/nm);
if mm >= NN;
goto three,
endif;
one:,
mm=mm-fl;
if NN <= 160,
aa=n!*mm'/NN'*(NN-mm),»(NN-n)';
else,
aa=lnfact(n) + lnfact(mm) - Infact(NN) + lnfact(NN-mm) + lnfact(NN-n);
endif;
j=0;
if (NN-mm-n) < 0;
j=-(NN-mm-n);
endif,
s=0;
do while j <= rr,
if NN <= 160,
a=aa/j!/(mm-j)!/(n-j)!/(NN-mm-n+j)!;
else;
a=aa - lnfact(j) - Infact(mm-j) - lnfact(n-j) - lnfact(NN-mm-n+j);
a=exp(a);
endif,
s=s+a;
j=j+i;
endo;
if s>= .05;
goto one;
endif;
three:;
if mm>=NN;
x[i,4]=NN;
else;
x[i,4]=mm-l;
endif;
i=i+l;
ENDO; {ENDS INITIAL DO LOOP}
50
-------
(Case 1)
{THE FOLLOWING STEPS ADD AN EXTRA LINE TO DATA MATRIX NEEDED IN
CONFIDENCE BOUND ADJUSTMENT COMPUTED AT END OF ALGORITHM}
r=rows(x);
y=zeros(r,l)
x=x~y;
y=zeros(l,5);
y[l,2:4]=x[r,2:4];
x=x|y;
{THE FOLLOWING STEPS GENERATE THE LOWER CONFIDENCE BOUND}
r=rows(x);
i=l;
do while i <= r; (BEGINS SECOND DO LOOP}
rr=x[i,2];
mm=trunc(NN»rr/n);
if mm==0;
goto six;
endif;
four:;
mm—mm-1;
if NN <= 160;
aa=n!*mm!/NN!*(NN-mm)!*(NN-n)!;
else;
aa=lnfact(n) + lnfact(mm) - lnfact(NN) + lnfact(NN-mm) + lnfact(NN-n);
endif;
j=rr;
mnm=minc(n|mm);
s=0;
do while j < — mnm;
if NN <= 160;
a=aa/j!/(mm-j)!/(n-j)!/(NN-mm-n+j)!;
else;
a=aa- lnfact(j) - lnfact(mm-j) - Infact(n-j) - lnfact(NN-mm-n+j);
a=exp(a),
endif;
s=s+a;
j=j+i;
endo;
if 8>= .05;
goto four;
endif;
six:;
if mm==0;
x[i,5]=0;
else;
x[i,5]=mm+l;
endif;
i=i+l;
ENDO; {ENDS SECOND DO LOOP}
51
-------
(Case 1)
{ASSIGN LABELS}
"N= " NN n = " n;
x;
OUTPUT OFF;
{ADJUST Nfl(y) and CONFIDENCE BOUNDS - AVERAGE SUCCESSIVE VALUES}
r=rows(x);
xx=x;
i=2;
do while i <= r-1;
xx[i,3:5]=(x[i,3:5] + x[i-l,3:5])/2;
i=i+l;
endo;
{OUTPUT FILE AND PRINT MATRIX x}
OUTPUT FILE=NAME;
OUTPUT reset,
" x " " Sequence #" " F(x) " " F-lower(x) " " F-upper(x)
format /ml/rd 12,7;
print x;
OUTPUT OFF;
end;
52
-------
(Case 2)
Case 2— Estimation of N0(y): Discrete Resource, Equal Probabilities,
N known, Na unknown, n>nfl.
Confidence Bounds by Hypergeometric Algorithm.
Conditions for approach
1. The frame population size, N, is known.
2. The subpopulation size, Na, is unknown.
3. There is an equal probability of selection of units from the subpopulation.
4. Sample size condition: n>na.
Outline for Algorithm
Under the given conditions, the confidence bounds can be obtained by either the
exact hypergeometric distribution or by the normal approximation. This example provides
the confidence bounds for Na(y) by the hypergeometric distribution, when N is known, but
N0 is unknown. Normal approximation bounds are provided in the next subsection (see
Examples 4 and 5).
This algorithm computes the confidence bounds for each point along the curve using
the hypergeometric distribution. In the following formula, N is the frame population size; n
is the sample size from the frame population; Na(y) refers to the number of units, u, in the
subpopulation, J., for which yu < y; and nQ(y) refers to the number of units in the sample
from N0, Sa, for which yu < y. Under the conditions, n0(y) has the following hypergeometric
distribution. Let X represent the random variable for which na(y) is a realization. Note
that na(y) < n0 and that Nu(y) < Na < N.
(50)
53
-------
(Case 2)
The upper confidence bound is computed by obtaining the largest value of Na(y) for which
Prob[X < na(y)] > 0.05. The lower confidence bound is computed by obtaining the smallest
value of N0(y) for which Prob[X > na(y)] > 0.05.
To obtain the distribution function, the data file needs to be sorted on the indicator,
either in an ascending or descending order. When the data file is sorted in ascending order
on the indicator, the distribution function of numbers, N0(y), denotes the number of units in
the target population that have the value less than or equal to the specific y. Conversely, if
it is of interest to obtain bounds on the number of units in the target population with
indicator values greater than or equal to y, the data file must be sorted and analyzed in
descending order on this variable. The distribution function generated by the analysis in
descending order is [NQ - Nn(y)].
A GAUSS program provided in Case 1 derives the commence bounds based on the
hypergeometnc distribution. However, under the conditions discussed here, the sample size
and population sizes are defined as follows.
CALCULATION OF CONFIDENCE BOUNDS ON Na(y) BY THE
HYPERGEOMETRJC DISTRIBUTION
loa^ x[a,b] = data; {LOADS DATA FILE WHICH INCLUDES LABEL CODE AND
VARIABLE TO BE ANALYZED. HERE a DESIGNATES
THE OBSERVED SAMPLE SIZE, nQ, AND b DESIGNATES
THE NUMBER OF COLUMN VECTORS}
x=x[.,l];
nm=rows(x); {NUMBER OF ELEMENTS OBSERVED, nfl. IN THIS ALGORITHM,
n*na. }
n=#; {FULL SAMPLE SIZE}
NN=N; {DEFINES TOTAL POPULATION SIZE HERE }
REFER TO CASE 1 (AFTER LINE 13) FOR THE REMAINING STEPS
IN THIS PROGRAM.
54
-------
(Case 3)
Case 3- Estimation of Na(y): Discrete Resource, Equal Probabilities.
Confidence Bounds by Horvitz-Thompson Standard Error
and Normal Approximation.
Conditions for approach
1. The frame population size, N, can be known or unknown.
2. The subpopulation size, Na, can be known or unknown.
3. There is an equal probability of selection of units from the subpopulation.
[Note that this algorithm can also be applied to those cases presented in
Examples 1 and 2.]
Outline for Algorithm
The algorithm recommended, given the foregoing conditions, is based on the Horvitz-
Thompson formulae, which were discussed in Section 2. The algorithm presented for the
general case of variable probability of selection (the following subsection) is appropriate to
use given the foregoing conditions.
Equal probability selection is a special case of variable probability selection. In equal
probability of selection of units, the weighting factors are equal for all units, wt=w =w. If
the weights, w1( and w2 lt are appropriately identified, then the general algorithm presented
in Example 4 will not need any modification. The Tier 2 weight, w2|, computed by
Equation 4 is the same for all units.
55
-------
Variable Probability Selection
In this subsection, two examples are provided to demonstrate variable probability of
selection. For both cases, the frame population size can be known or unknown. In Case 4,
Na can be unknown or known and equal to Na. For Case 5, N0 is known and not equal to
N0. Both algorithms produce confidence bounds based on the Horvitz-Thompson variance
estimator and the normal approximation.
56
-------
(Case 4)
Case 4— Estimation of NQ(y): Discrete Resource, Variable Probabilities,
Na unknown or Na known and equal to NQ.
Confidence Bounds by by Horvita-Thompson Standard Error
and Normal Approximation.
Conditions for approach
1. The frame population size, N, can be known or unknown. ^
2. The subpopulation size, Na, is unknown or known and equal to N0
3. There is a variable probability of selection of units from the subpopulation.
Outline for Algorithm
The algorithm supplied for this example is based on the Horvitz-Thompson formulae,
which were discussed in Section 2. This algorithm is appropriate for a sample subset for any
subpopulation a that is of interest. It is useful to identify the estimator of Na from Tier 2.
The design-based estimator of Na is
where Sa is the portion of the sample from the subpopulation, A, over which the weighting
factors (w) are summed. The variance estimator for Na is presented in Equation 3b of
Section 2.
Calculation of confidence bounds on Nji) 6jr the Horritz-Thompaon formulae -
For each indicator, the following algorithm derives the distribution function and the
confidence bound for N(y) or N0(y). This algorithm is similar to the algorithm defined for
the National Surface Water Surveys (Overton, 1987a,b). The Horvitz-Thompson variance
estimator, discussed in Section 2.1, is used to compute the variance in this algorithm. The
confidence bounds are computed based on a normal approximation.
(51)
57
-------
(Case 4)
1. Data set
a. Unit identification code
b. Tier 1 weighting factor, wlt
c. Tier 2 conditional weighting factor, w2 lt
d. Indicator of interest (y)
2. Sorting of data
The data file needs to be sorted on the indicator, either in an ascending or
descending order. When the data file is sorted in ascending order on the indicator,
the distribution function of numbers, Nn(y), denotes the number of units in the
target population that have a value less than or equal to the y for a specific
indicator Conversely, if it is of interest to estimate the number of units in the
target population with indicator variables greater than or equal to y, the data file
would be sorted in descending order on this variable. The distribution function
generated by the analysis in descending order is [Na - Na(y)].
3. Computation of weighting factors
The Tier 1 and Tier 2 weights are included for each observation in the data
set. These weights are used to compute the total weight of selecting the Ith unit in
the Tier 2 sample. Compute the following weight for each observation:
w2. = wl,w2.1.
where w,( is the weighting factor for the ith unit in the Tier 1 sample (the inverse of
its Tier 1 inclusion probability) and w2 lt is the inverse of the conditional Tier 2
inclusion probability.
4 Algorithm for Na(y)
a. Define a matrix of q column vectors, which will be defined as the following.
There is one row for each data record and five statistics for each row.
qj = value of y variable for the record
q2 = Na(yJ
q3 = var[Na(y)]
q4 = upper confidence bound for N0(y)
q5 = lower confidence bound for Na(y)
b. Index rows using t from 1 to n; the i*'1 row will contain q-values
corresponding to the record in the file, as analyzed.
c. Read first observation (first row of data matrix), following with the
successive observations, one at a time. Accumulate the q-statistics as each
observation is read into file. Continue this loop until the end of file is
reached. At that time, store these vectors and go to d. This algorithm is
calculating the distribution for the number of units [Na(y)J in the
subpopulation. It is necessary to identify the records for which
58
-------
(Case 4)
>¦ qiH = y[«]
ii. q2[«] = q,[. - 1] + w2l
iii. Qaf.] = cy. - 1] + w2,*(w2l - 1) + 2 £ (w2lw2j - w2lJ)
3 < »
where, if neither w2 1( nor w2>1 =1:
2n2w2.w2j"w2.-w2j
W2'J- 2(n3-l)
where, if either w2 lt or w2= 1:
w2ij = w1«jw2.1iw2.1 j
where:
2niwi.wij - Wj, - Wjj
Wl,J~ 2(ni-l)
>v. q4[i] = q2[i] + 1.645^^/q^i]
v- Qst'l = ^21'] _ l-645*yq^[i]
Multiple observations with one y value create multiple records in the above
analysis for one distinct value of y. The last record for that y contains all
the information needed for N0(y). Therefore, at this stage of the analysis,
eliminate all but the last record for those y values that have multiple
records.
d. Output of interest
From the last entry of the row of q-vectors just computed:
i. q} = largest value of y (or smallest if analysis is descending)
"• <12 = N0 ^
iii. q3 = var(N0)
iv. Standard error of Na = ^/q^
From the q column vectors:
i. qj represents the ordered vector of distinct values of y
ii. q2 represents the estimated distribution function, N0(y),
corresponding to the values of y
iii. q4 represents the 95% one-sided upper confidence
bound of the distribution function, Na(y)
iv. q5 represents the 95% one-sided lower confidence
bound of the distribution function, N0(y)
59
-------
(Case 5)
Case 5— Estimation of Na(y): Discrete Resource, Variable Probabilities,
Na known and not equal to Na.
Confidence Bounds by by Horvits-Thompeon Standard Error
and Normal Approximation.
Conditions for approach
1. The frame population size, N, can be known or unknown.
2. The subpopulation size, Na, is known and not equal to Na
3. There is a variable probability of selection of units from the subpopulation.
Outline for Algorithm
The algorithm supplied in this section is based on the Horvitz-Thompson formulae,
which were discussed in Section 2. This algorithm is appropriate for a sample subset for any
subpopulation a that is of interest. The algorithm for the distribution function for the
proportion of numbers, F0(y), given exactly the same conditions listed above, is presented in
Case 8 To compute the distribution function of numbers, Na(y), first use the algorithm in
Case 8 to compute the distribution function with the corresponding confidence bounds for
the proportion of numbers Then, compute the following*
Na(y) =Ffl(y)»Na, (52)
where N0 is the known subpopulation size. To compute the confidence bounds for Na(y),
simply multiply the upper and lower confidence limits of FQ(y) by Na.
60
-------
3.1.2 Proportions of Numbers
A number of algorithms are presented to compute the distribution function for the
proportion of numbers. For any case in a resource group, the choice of the algorithm is first
determined by the method by which the units were selected. The first two algorithms in
this section derive the distribution functions based on an equal probability selection of units
and the latter two algorithms are based on an unequal probability selection of units.
Equal Probability of Selection
In this subsection, two examples are provided based on whether or not the
subpopulation size is known or unknown. For the first algorithm, Na can be known or
unknown; this algorithm produces confidence bounds based on the binomial distribution.
For the second algorithm, Na is known, this algorithm is based on the hypergeometric
distribution
61
-------
(Case 6)
Case 6— Estimation of Fa(y): Discrete Resource, Equal Probabilities,
Na known or unknown.
Confidence Bounds by Binomial Algorithm.
Conditions for approach
1. Tbe frame population size, N, can be known or unknown.
2. The subpopulation size, Na, can be known or unknown.
3. There is an equal probability of selection of units from the subpopulation.
Outline for Algorithm
Under the given conditions in which NQ may not be known, the confidence bounds
can be based on the binomial distribution In addition, Example 8 provides the normal
approximation approach to the confidence bound estimation.
A program, based on the binomial distribution and written in the GAUSS language,
is presented in this section. We assume X has the binomial distribution,
X ~ Binomial[n0,F*(y)], where na{y) is the observed realization of X, n0 represents the
N (y)
number of "trials", F^(y)= —rs— represents the true finite population proportion of
"successes*1, and F0(y) is the infinite population parameter. The estimated distribution
function is denoted as F (y)= where nfl is the sample size from the subpopulation and
a
na(y) refers to the number of units in the sample for which yu < y. The upper confidence
bound is computed by obtaining the largest value of Fa(y) tor which Prob[X < n0(y)] > 0.05.
The lower confidence bound is computed by obtaining the smallest value of F0(y) for which
Prob[X > na(y)] > 0.05. As written, the algorithm calculates the upper and lower confidence
•bounds to three decimal places.
Comments in capital letters in braces explain the programming steps. Under these
conditions, the upper and lower halves of the confidence bounds are symmetric.
62
-------
(Case 6)
CALCULATION OF CONFIDENCE BOUNDS ON Fa(y) BY THE
BINOMIAL DISTRIBUTION
load x[a,b] = data; {LOADS DATA FILE FOR THE TARGET SUBPOPULATION
WHICH INCLUDES LABEL CODE AND VARIABLE TO
BE ANALYZED. HERE a DESIGNATES THE SAMPLE
SIZE, na, AND b DESIGNATES THE NUMBER OF
COLUMN VECTORS}
n=rows(x); {SAMPLE SIZE IN TARGET SUBPOPULATION, na}
x=sortc(x,2); {SORTS VARIABLE OF INTEREST}
y=seqa(l,l,nm); {CREATES SEQUENCE OF NUMBERS}
x2=x[.,2]; {DEFINES VARIABLE OF INTEREST AS X2}
x=y"x2, {CREATES MATRIX x}
{THE FOLLOWING STEPS COMBINE RECORDS WITH COMMON y-VALUES}
xx=zeros(l,2);
q=0;
i=i;
do while i < n;
if x[i,2]==x[i+l,2];
q=q+l; I;
else; xx=xx|x[i,.];
endif;
i=i+l;
endo;
xx=xx|x[n,.j;
r=rows(xx);
x=xx;
{THE FOLLOWING STEPS FORM DATA MATRIX - x}
r=rows(x);
z=zeros(r,l);
xl=x[.,l];
x2=x[.,2];
x=x2"xr(xl/n)~z;
{THESE STEPS GENERATE BINOMIAL COMBINATION TERMS}
f=zeros(n+l,l);
i=0;
if n < 160;
do while i<=n;
fli+l,l]=n!/i!/(n-i)!;
63
-------
l—i+l;
endo,
else;
f(i+l,l]=lnfact(n) - lnfact(i) - infact(n-i);
endif;
{THE FOLLOWING STEPS GENERATE UPPER CONFIDENCE BOUND}
i=l;
do while i <= r; {BEGINS INITIAL DO LOOP}
rr=x[i,2];
p=(trunc(100*x[i,3]))/100;
if p==1.0;
p=p — .001,
goto three;
endif;
one:;
p=p+.01;
j=0,
s=0;
do while j <= rr,
a=f[j + l,l]*pj*(l-p)"(n-j);
s=s+a,
j=j+i;
endo,
if s >= .05;
goto one;
endif,
two.,
p=p — .001,
j=0;
s=0,
do while j <= rr;
a=f[j+l,l]»pj*(l-pHn-j);
8=s+a;
j=j+i;
endo;
if s <= .05;
goto two;
" endif;
three:;
x[i,4]=p+.001;
i=i+l;
ENDO; {ENDS INITIAL DO LOOP}
64
-------
(Case 6)
{THE FOLLOWING STEPS ADD AN EXTRA LINE TO DATA MATRIX NEEDED IN
CONFIDENCE BOUND ADJUSTMENT COMPUTED AT END OF ALGORITHM)
r=rows(x);
y=zeros(r,l);
x=x'y;
y=zeros(l,5);
y[l,2]=n;
y[l,3]=l;
y[l,4]=l;
x=x|y;
{THE FOLLOWING STEPS GENERATE LOWER CONFIDENCE BOUND}
r=rows(x);
i=l;
do while i <= r; {BEGINS SECOND DO LOOP}
rr=x[i,2],
p=(trunc( 100*x[i,3]))/100;
if p==0;
p=.001;
goto six;
endif;
four:;
p=p - .01;
if p<=0;
p=.001;
goto six;
endif;
j=rr,
s=0,
do while j <= n;
a=ftj+l,l]*p>(l-p)"(n-j);
s=s+a;
j=j+i;
endo;
if s >= .05;
goto four;
endif;
five:;
p=p+.001;
j=rr;
8=0;
do while j <= n;
a=flj+l,l]»PJ*(l-p)"(n-j);
s=s-fa;
j=j+i;
endo;
if s <= .05;
goto five;
endif;
65
-------
(Case 6)
six:;
x[i,5]=p — .001;
i=i+l;
ENDO; {ENDS SECOND DO LOOP}
{ADJUST Fa(y) and CONFIDENCE BOUND - AVERAGE SUCCESSIVE VALUES}
r=rows(x);
xx=x;
i=2'.
do while 1 <= r-1;
xx[i,3:5]=(x[i,3:5] + x[i-l,3:5])/2;
i=i+l;
endo;
{OUTPUT FILE AND PRINT MATRIX x}
OUTPUT FILE=NAME;
OUTPUT ON;
"x" "Sequence #" "F(x)n "F-upper(x)" "F-lower(x)" ;
format /ml/rd 12,7;
print x;
OUTPUT OFF,
end;
66
-------
(Case 7)
Case 7— Estimation of Fa(y): Discrete Resource, Equal Probabilities, Nfl known.
Confidence Bounds by Hypergeometric Algorithm.
Condiitona for approach
1. The frame population size, N, can be known or unknown.
2. The subpopulation size, N0, is known.
3. There is an equal probability of selection of units from the subpopulation.
Oithne for Algorithm
Under the given conditions, the confidence bounds can be based either on the
binomial or on the hypergeometric distribution. The binomial algorithm presented in
Example 6 is appropriate to use given the foregoing conditions. In addition, Example 9
provides the norma] approximation approach, which is also applicable, given the foregoing
conditions, to the confidence bound estimation.
To obtain confidence bounds for F(y) based on the hypergeometric distribution, refer
to the algorithm provided for the confidence bound calculation for N0(y) in Example 1.
Simply divide the lower and upper confidence bounds, and Na(y), by the known
subpopulation size, Na. No further changes are necessary to this algorithm to provide
confidence bounds for Fa(y) based on the hypergeometric distribution.
67
-------
Variable Probability Selection
In this subsection, two cases are provided to demonstrate variable probability of
selection. For both cases, the frame population size can be known or unknown. In Case 8,
Na can be unknown or known but not equal to Na; this algorithm produces confidence
bounds based on the Horvitz-Thompson ratio standard error and the normal approximation.
For Case 9, N0 is known and equal to N0; this algorithm produces confidence bounds based
on the Horvitz-Thompson variance estimator and the normal approximation.
68
-------
(Case 8)
Case 8— Estimation of F0(y): Discrete Resource, Variable Probabilities,
Na unknown or known and not equal to Na.
Confidence Bounds by Horvitz-Thompeon Ratio Standard Error
and Normal Approximation.
Conditions for approach
1. The frame population size, N, can be known or unknown.
2. The subpopulation size, Na, is known and not equal to Na.
3. There is a variable probability of selection of units from the subpopulation.
Outline for Algorithm
The algorithm supplied in this section is based on the Horvitz-Thompson formulae,
which were discussed in Section 2. This algorithm is appropriate for a sample subset for any
subpopulation a that is of interest.
Calculation of confidence bounds on Fa(y) by the Horvitz-Thompson formulae
For each indicator, the following algorithm derives the distribution function and the
confidence bound for N0(y) similar to that given in Example 4. In this section, however,
the interest is in obtaining a distribution function for proportions. Therefore, the variance
of a ratio estimator is used in this algorithm. The confidence bounds are computed based
on a normal approximation.
1. Data set
a. Unit identification code
b. Tier 1 weighting factor, wlt
c. Tier 2 conditional weighting factor, w2 lt
d. Indicator of interest (y)
e. The subset of data corresponding to the subpopulation of interest, indexed
by a.
2. Computation of weighting factors
This step does not have to be made with each use of the datum, as the
weights are permanent attributes of a sampling unit. The following details are
69
-------
(Case 8)
given for completeness.
The Tier 1 and Tier 2 weights are included for each record in the data set.
These weights are used to compute the total weight of selecting the Ith unit in the
Tier 2 sample. Compute the following weight for each record:
w2i = wi,w2.l,
where Wj, is the weighting factor for the t"1 unit in the Tier 1 sample (the inverse of
its Tier 1 inclusion probability) and w2 j, is the inverse of the conditional Tier 2
inclusion probability. The pairwise inclusion weight is defined below. The sample
size at Tier 2, n2, is not subpopulation specific.
3 Algorithm for F0(y) and Confidence Intervals
a. Sorting of data. The data file needs to be sorted on the indicator, either in
an ascending or descending order. When the data file is sorted iji ascending
order on the indicator, the distribution function of proportions, Fn(y),
denotes the proportion of units in the target population that have a value
less than or equal to the y for a specific indicator. Conversely, if it is of
interest to estimate the proportion of units in the target population with
indicator variables greater than or equal to y, the data file would be sorted
in descending order on this variable. The distribution function generated
by the analysis in descending order is [1 - Fa(y)].
b. First, compute Na = ^ w2i (^'s sums over data matrix).
S°
c. Define a matrix of q column vectors, which will be defined as the following.
There is one row for each data record and five statistics for each row.
qj = value of y variable for the record
qa = var[Fa(y)]
q4 = upper confidence bound for Fa(y)
q5 = lower confidence bound for Fa(y)
d. Index rows using i from 1 to n; the Ith row will contain q-values
corresponding to the i*'1 record in the file, as analyzed.
e. Read first observation (first row of data matrix), following with the
successive observations, one at a time. Accumulate and store the q
-statistics, below, as each observation is read into file. Continue this loop
until the end of file is reached.
>• qiM = y[«]
ii. q2[.] = q2[. - 1] + -^
a
Multiple observations with one y-value creates multiple records in the
preceding analysis for one distinct value of y. The last record for that y
70
-------
(Case 8)
contains all the information needed for Fa(y). Therefore, at this stage of
the analysis, eliminate from the q-file all but the last record for those y
values that have multiple records.
f. Entries in the first column (q,) of the q-matrix identifies the vector of y
-values for the remainder of the calculations. For each such y-value, y(,
make the following calculations. Note that this part of the algorithm is not
recursive; each calculation is made over the entire sample.
''»• qal'l = { Hdjw2j(w2j-J) + EEdA(V2t"V)J/^
J 3 k
* ^ J
where,
2n2w2jw2fc - w2j - w2k
and,
dj = i(y.,
-------
(Case 9)
Case 9— Estimation of Fa(y): Discrete Resource, Variable Probabilities,
Na known and equal to Nfl.
Confidence Bounds by Horvits-Thompeon Standard Error
and Normal Approximation.
Conditions for approach
1. The frame population size, N, can be known or unknown.
2. The subpopulation size, Nfl, is known and equal to N0.
3. There is a variable probability of selection of units from the subpopulation.
Outline for Algorithm
The algorithm supplied in this section is based on the Horvitz-Thompson formulae,
which were discussed in Section 2. This algorithm is appropriate for a sample subset for any
subpopulation a that is of interest.
Calculation of confidence bounds on Fa(i) 6j the Horoitz-Thompson formulae
For each indicator, the following algorithm derives the distribution function and the
confidence bounds for Nfl(y) exactly as given in Example 4. Because Na is known and equal
to N0, it is not necessary to use the ratio estimator applied in Case 8. The distribution
function of Fa(y) is obtained by dividing the distribution function, Nfl(y), and the associated
bounds, by N0. (These additional steps are included in this algorithm.) The Horvitz-
Thompson variance estimator, discussed in Section 2.1, is used to compute the variance in
this algorithm. The confidence bounds are computed based on a normal approximation.
1. Data set
a. Unit identification code
b. Tier 1 weighting factor, wlf
c. Tier 2 conditional weighting factor, w2
d. Indicator of interest (y)
72
-------
(Case 9)
2. Sorting of data
The data file needs to be sorted on the indicator, either in an ascending or
descending order. When the data file is sorted in ascending order on the indicator,
the distribution function of proportions, Fa(y), denotes the proportion of units in
the target population that have a value less than or equal to the y for a specific
indicator. Conversely, if it is of interest to estimate the proportion of units in the
target population with indicator variables greater than or equal to y, the data file
would be sorted in descending order on this variable. The distribution function
generated by the analysis in descending order is [1 - F0(y)].
3. Computation of weighting factors
For this step, refer to the program steps given in Example 4 to derive the
distribution function and the confidence bound for N0(y). Follow the^steps labeled
3 and 4. Additional steps, shown here, are needed to obtain F0(y) and its
corresponding confidence bounds. Proceed with the following steps after conducting
steps 3 and 4 from Example 4:
e. The operations that follow generate the q vectors to compute the estimated
distribution function and appropriate confidence bounds for Fa(y). These
are denoted by qg through q^. Each element of q6-qg is computed by
performing the following operations on the corresponding elements of q2, q4,
and q5.
i. q6 = Divide each element of q2 by the known subpopulation size
ii. q7 = Divide each element of q4 by the known subpopulation size
iii. qg = Divide each element of q5 by the known subpopulation size
From the q vectors:
i. q^ represents the estimated distribution function, Fa(y)
ii. q7 represents the 95% one-sided upper confidence
bound of the distribution function, Fa(y).
iii. qg represents the 95% one-sided lower confidence
bound of the distribution function, Fa(y).
73
-------
3.1.3 Rationales for Approaches
Justification for the variance estimators used in the algorithms in Sections 3.1.1 and
3.1.2 was presented in Section 2 of this report. The different choices proposed for confidence
bound estimation, under some conditions, were also discussed. For example, both the
hypergeometric and binomial approaches to the confidence bound calculation for F0(y),
when Na is known, were provided in the above cases. Choice of one of the approaches
presented to compute confidence bounds for Fa(y), when the subpopulation size is known,
depends in part on the available information and in part on the purpose of inference. The
bounds based on the hypergeometric distribution provide for inferences directed to the finite
population. For example, if data are available for every lake in a small population of lakes,
there is no uncertainty relative to this attribute for this population (in the absence of
measurement error). Bounds based on hypergeometric or on the normal approximation
approach will reduce to zero width as n"N, because of the finite population correction.
These bounds are more relevant for management purposes In contrast, those bounds based
on the binomial distribution provide for inferences directed to the superpopulation
parameter. In this situation, the entire population is considered as a sample from the
superpopulation. Statements about the set of high mountain lakes in New England are
finite, but general statements about high mountain lakes, based on those found in New
England, are relative to a hypothetical, infinite, superpopulation. Therefore, the confidence
bounds obtained by the binomial distribution are broader than those provided by the
hypergeometric distribution to account for this additional level of variability.
74
-------
3.1.4 Estimation of Size-Weigh ted Statistics
A few algorithms are presented to compute the distribution functions for size-
weighted totals and size-weighted proportions of totals. The following subsection describes
algorithms to compute the distribution function for size-weighted totals. The next
subsection presents algorithms to compute the distribution function for the proportions of
size-weighted totals.
Estimation of Size-Weighted Totals
In this subsection, two examples are provided based on information that is known or
unknown. For the first algorithm, the size-weight, Za, is unknown or known and equal to
ZQ; this algorithm produces confidence bounds based on the Horvitz-Thompson standard
error and the normal approximation. For the second algorithm, Zfl is known but not equal
to Za; this algorithm produces confidence bounds based on the Horvitz-Thompson ratio
standard error and the normal approximation.
75
-------
(Case 10)
Case 10- Estimation of Za(y): Discrete Resource, Siae-Weighted Estimate,
Equal or Variable Probabilities. Zu unknown or known and
equal to Za.
Confidence Bounds by Horvitz-Thompson Standard Error
and Normal Approximation.
Conditions for approach
1. The frame population size-weighted total, Z, can be known or unknown.
2. The subpopulation size-weighted total, Za, is unknown or known and
equal to Zn.
3. There can be an equal or variable probability selection of units from
the subpopulation.
Outline, for Algorithm
General formulae for Tier 1 estimates were provided in Section 2.1.1. The general
form of a size-weighted estimate in a subpopulation at Tier 1, denoted as Z0, is similar to
Equation 2 The y, in that equation refers to the size-weight value, now denoted as z,:
Z0=Hz,wi. ' (53)
Sa
where z, defines a size-weight, such as the area of a lake or the stream length in miles, and
w is the inverse of the inclusion probability at Tier 1. Using these same definitions, the
variance estimator for Za is similar to Equation 3a.
Estimation of Zjy) h the Horntz-Thompson formulae
For each indicator, the following algorithm derives the distribution function and the
confidence bound for Za(y). This algorithm is similar to the algorithm defined for the
National Surface Water Surveys (Overton, 1987a,b). The Horvitz-Thompson variance
estimator, discussed in Section 2.1, is used to compute the variance in this algorithm. The
76
-------
(Case 10)
confidence bounds are computed based on a normal approximation. This algorithm is
appropriate for a sample subset for any subpopulation a that is of interest.
1. Data set
a. Unit identification code
b. Tier 1 weighting factor, wlt
c. Tier 2 conditional weighting factor, w2 j,
d. Size-weighted value (z)
e. Indicator of interest (y)
2. Sorting of data
The data file needs to be sorted on the indicator, either in an ascending or
descending order. When the data file is sorted injiscending order on the indicator,
the distribution function of size-weighted totals, Za(y), denotes the size-weights in
the target population that have a value less than or equal to the y for a specific
indicator. Conversely, if it is of interest to estimate the size-weight in the target
population with indicator variables greater than or equal to y, the data file would
be sorted in descending order on this variable. The distribution function generated
by the analysis in descending order is [Za - Za(y)].
3. Computation of additional weighting factors
The Tier 1 and Tier 2 weights are included for each observation in the data
set. These weights are used to compute the total weight of selecting the i unit in
the Tier 2 sample. First, compute this weight for each observation:
w2, = wl.w2.1, -
where wlt is the weighting factor for the ith unit in the Tier 1 sample and w2 ,, is
the inverse of the conditional Tier 2 inclusion probability.
a
4. Algorithm for Z0(y)
a. Define a matrix of q column vectors, which will be defined as the following.
There is one row for each data record and four statistics for each row.
qj = value of y variable for the record
q2 = Z0(yJ
<13 = var[Z0(y)]
q4 = upper confidence bound for ZQ(y)
q5 = lower confidence bound for Za(y)
b. Index rows using i from 1 to n; the ith row will contain q-values
corresponding to the Ith record in the file, as analyzed.
c. Read first observation (first row of data matrix), following with the
successive observations, one at a time. Accumulate the q-statistics as each
observation is read into file. Continue this loop until the end of file is
reached. At that time, store these vectors and go to d. It is necessary, as
shown below for q4, to identify the records for which w2 lt=1.
77
-------
(Case 10)
'• QiW = y[»]
ii. q2[.] = q2[, - 1] + w2,»z,
iii.
-------
(Case 11)
Case 11- Estimation of Za(y): Discrete Resource, Size-Weighted Estimate,
Equal or Variable Probabilities. Zfl known and not
equal to ZB.
Confidence Bounds by Horvitg-Thompson Ratio Standard Error
and Normal Approximation.
Conditions for approach
1. The frame population size-weighted total, Z, can be known or unknown.
2. The subpopulation size-weighted total, Zfl, is known and not equal
za.
3. There can be an equal or variable probability selection of units from the
subpopulation.
Outline for Algorithm
The algorithm supplied in this section is based on the Horvitz-Thompson formulae,
which were discussed in Section 2. This algorithm is appropriate for a sample subset for any
subpopulation a that is of interest. The algorithm for the distribution function for the
proportion of numbers, Ga(y), given exactly the same conditions listed here, is presented in
Case 12. To compute the distribution function of size-weighted totals, Z0(y), first use the
algorithm in Case 12 to compute the distribution function with the corresponding confidence
bounds for the proportion of size-weighted totals. Then, compute the following:
Za(y) =Ga(y)*Za , (54)
where Za is the known size-weighted total. To compute the confidence bounds for ZQ(y),
simply multiply the upper and lower confidence limits of G0(y) by Za.
79
-------
(Case 11)
Estimation of Proportion of Sue-Weighted Totals
In this subsection, two examples are provided based on varying conditions. For the
first algorithm, the size-weight, Za, is unknown or known and not equal to Z0; this
algorithm produces confidence bounds based on the Horvitz-Thompson ratio standard error
and the normal approximation. For the second algorithm, Za is known and equal to Za;
this algorithm produces confidence bounds based on the Horvitz-Thompson standard error
and the normal approximation.
80
-------
(Case 12)
Case 12— Estimation of Ga(y): Discrete Resource, Size-Weighted Estimate,
Equal or Variable Probabilities. Za unknown or known and not
equal to Za.
Confidence Bounds by Horvitz-Thorn peon Ratio Standard Error
and Normal Approximation.
Conditions for approach
1. The frame population size-weighted total, Z, can be known or unknown.
2. The subpopulation size-weighted total, Za, is unknown or known and not
equal to Za.
3. There can be an equal or variable probability selection of units from the
subpopulation.
Outline for Algorithm
The algorithm supplied in this section is based on the Horvitz-Thompson formulae,
which were discussed in Section 2. This algorithm is appropriate for a sample subset for any
subpopulation a that is of interest. Another discussion of the formulae are presented in the
previous section, Estimation of Size-Weighted Totals.
Calculation of confidence bounds on Ga(j) if the Bormtz- Thompson formulae
For each indicator, the following algorithm derives the distribution function and the
confidence bound for Za(y) similar to that given in Case 10. Because Z0 is unknown or
known and not equal to ZQ in this example, however, the variance of a ratio estimator is
used in this algorithm. The confidence bounds are based on a normal approximation.
1. Data set
a. Unit identification code
b. Tier 1 weighting factor, w,(
c. Tier 2 conditional weighting factor, w2 lt
d. Size-weighted value (z)
e. Indicator of interest (y)
f. The subset of data corresponding to the subpopulation of interest, indexed
by a.
81
-------
(Case 12)
2. Computation of additional weighting factors
This step does not have to be made with each use of the datum, as the
weights are permanent attributes of a sampling unit. The following details are
given for completeness.
The Tier 1 and Tier 2 weights are included for each observation in the data
set. These weights are used to compute the total weight of selecting the t unit in
the Tier 2 sample. First, compute this weight for each observation:
w2i = wiiw2.li '
where wlt is the weighting factor for the t"1 unit in the Tier 1 sample and w21l is
the inverse of the conditional Tier 2 inclusion probability. The pairwise inclusion
weight is defined below. The sample size at Tier 2, n2, is not subpopulation
specific
3. Algorithm for Ga(y) and Confidence Intervals
a. Sorting of data. The data file needs to be sorted on the indicator, either in
an ascending or descending order When the data file is sorted in ascending
order on the indicator, the distribution function of size-weighted
proportions, Ga(y), denotes the proportion of size-weights in the target
population, such as stream miles, that have a value less than or equal to the
y for a specific indicator. Conversely, if it is of interest to estimate the
proportion of size-weights in the target population with indicator variables
greater than or equal to y, the data file would be sorted in descending order
on this variable The distribution function generated by the analysis in
descending order is [1 - Ga(y)].
b. Compute Za = w2l ~ z,
sa
c. Define a matrix of q column vectors, which will be defined as the following.
There is one row for each data record and five statistics for each row.
qj = value of y variable for the record
q2 = Ga(yJ
03 = var[Gu(y)]
q4 = upper confidence bound for G„(y)
q5 = lower confidence bound for Gn(y)
d. Index rows using l from 1 to n; the ith row will contain q-values
corresponding to the ith record in the file, as analyzed.
e. Read first observation (first row of data matrix), following with the
successive observations, one at a time. Accumulate and store the q
-statistics as each observation is read into file. Continue this loop until the
end of file is reached.
i- qj[0 = y[«]
ii. q2[.] = -i]
a
82
-------
(Case 12)
Multiple observations with one y-value create multiple records in the
preceding analysis for one distinct value^of y. The last record for that y
contains all the information needed for Ga(y). Therefore, at this stage of
the analysis, eliminate from the q-file all but the last record for those y
values that have multiple records.
f. Entries in the first column (qj) of the q-matrix identifies the vector of y
-values for the remainder of the calculations. For each such y-value, y,,
make the following calculations. Note that this part of the algorithm is not
recursive; each calculation is made over the entire sample.
iil. qaW = [Edjw2j(w2j"1) + E E djik(w2]W2k-w2Jk)]/Zl
} 1 k
fc t)
where,
2n2w2jw2Jt" w2j " w2fc
- 2(^Tj
and,
dj = [I(y, < y.) - G0(y,)] • z,
Similarly for dt.
iv. q4[i] = q2[i] + 1.645*v/q^[iJ
v q5I'] = ~ l-645*v/q^i]
g. Output of interest
From the q column vectors:
i. qj represents the ordered vector of distinct values of y.
ii. q2 represents the estimated distribution function, Ga(y),
corresponding to the values of y
iii. q^ represents the 95% one-sided upper confidence
bound of the distribution function, Ga(y)
iv. qs represents the 95% one-sided lower confidence
bound of the distribution function, Ga(y)
83
-------
(Case 13)
Case 13- Estimation of Ga(y): Discrete Resource, Size-Weighted Estimate,
Equal or Variable Probabilities. ZB known and equal to ZQ.
Confidence Bounds by Horvits-Thompson Standard Error
and Normal Approximation.
CondtUons for approach
1. The frame population size-weighted total, Z, can be known or unknown.
2. The subpopulation size-weighted total, Z0, is known and equal to Za.
3. There can be an equal or variable probability selection of units from the
subpopulation.
Outline for Algorithm
The algorithm supplied in this section is based on the Horvitz-Thompson formulae,
which were discussed in Section 2. This algorithm is appropriate for a sample subset for any
subpopulation a that is of interest.
Calculation of confidence hounds on Ga('$) by the Horvtiz-Thompson formulae
For each indicator, the following algorithm derives the distribution function and the
confidence bound for Z0(y) exactly as given in Case 10. Because Z0 is known and equal to
Za, it is not necessary to use the ratio estimator. The distribution function of Ga(y) is
obtained by dividing the distribution function, Za(y), and the associated confidence bounds
by Za. (These additional steps are included in this algorithm.) The Horvitz-Thompson
variance estimator, discussed in Section 2.1, is used to compute the variance in this
algorithm. The confidence bounds are computed based on a normal approximation.
1. Data set
a. Unit identification code
b. Tier 1 weighting factor, w,f
c. Tier 2 conditional weighting factor, w2 lt
d. Size-weighted value (z)
e. Indicator of interest (y)
84
-------
(Case 13)
2. Sorting of data
The data file needs to be sorted on the indicator, either in an ascending or
descending order. When the data file is sorted in ascending order on the indicator,
the distribution function of size-weighted proportions, Ga(y), denotes the proportion
of size-weights in the target population, such as lake area, that have a value less
than or equal to the y for a specific indicator. Conversely, if it is of interest to
estimate the proportion of size-weights in the target population with indicator
variables greater than or equal to y, the data file would be sorted in descending
order on this variable. The distribution function generated by the analysis in
descending order is [1 - Ga(y)].
3. Computation of weighting factors
For this step, refer to the program steps given in Case 10 to derive the
distribution function and the confidence bound for Za(y). Follow thejsteps labeled 3
and 4. Additional steps, shown here, are needed to obtain Ga(y) and its
corresponding confidence bounds. Proceed with the following steps after conducting
steps 3 and 4 from Case 10-
e. The operations that follow generate the q vectors to compute the estimated
distribution function and appropriate confidence bounds for Gfl(y). These
are denoted by qg through qg. Each element of qg-q$ is computed by
performing the following operations on the corresponding elements of q2, <14,
and q5.
1. q6 = Divide each element of q2 by the known subpopulation size
ii. q7 = Divide each element of q4 by the known subpopulation size
111. qg = Divide each element of q5 by the known subpopulation size
From the q vectors-
I. q6 represents the estimated distribution function, Ga(y)
II. q7 represents the 95% one-sided upper confidence
bound of the distribution function, Ga(y)
iii. represents the 95% one-sided lower confidence
bound of the distribution function, Ga(y)
85
-------
3.2 Extensive Resources
A detailed discussion of the formulae for obtaining area and proportion of areal
extent for continuous and extensive resources was presented in Section 2.1.2. Formulae were
presented for both areal and point samples.
3.2.1 Estimation of Proportion of Areal Extent
As discussed in Section 2.1.2, the confidence bounds for the proportion of areal extent
in continuous and extensive resources can be based on the binomial distribution. This
algorithm was presented in Section 3.1.2, Case 6, for discrete resources. No changes in this
algorithm are needed.
3.2.2 Estimation of Area
Formulae for the estimation of total areal extent of the surveyed resources was
proposed in Section 2.1.2. Proposed methods to compute areal extent for point and areal
samples are discussed in the following subsections.
Point Samples
Formulae for the estimation of areal extent based on point sample was presented in
Section 2.1.2.2. To obtain confidence bounds for A0(y) based on the binomial distribution,
refer to the algorithm provided for the confidence bound calculation for F0(y) in Section
3.1.2, Case 6. Simply multiply the lower and upper confidence bounds, and Fa(y), by the
known area or estimated area of the resource. No further changes are necessary to this
'algorithm to provide confidence bounds for Aa(y) based on the binomial distribution.
86
-------
Area! Samples
Formulae for the estimation of area) extent based on areal samples are still under
development However, some preliminary formulae are proposed in Section 2.1.2.1. Work
in this area is continuing and will be included in the next version of this report.
3.3 Estimation of Quantiles
Overton (1987a) defines the calculations for both the ascending and descending sorted
indicators. For the algorithm used in this report, it is not necessary to employ a different
definition for percentiles for an ascending or descending analysis, distributions are identical
as generated either way. The general algorithm computes the linear interpolation of the
distribution function for both types of analyses. In the following equation, let r represent
the proportion of the desired percentile. The fraction in this equation can be interpreted as
the slope of the line. The coefficient of this fraction interpolates to the value [Q(r)-a].
The lower bound, a, is added to this piece, [Q(r) - a], to obtain the quantile of interest.
Assuming an ascending analysis and that the generated distribution function is F(y):
where F(a) is the greatest value of F(y) < r and F(b) is the least value of F(y) > r.
For a descending analysis, the distribution function generated was F*(y)=[l - F(y)].
To obtain the percentiles, calculate F(y)=l - F*(y); the foregoing formula is appropriate for
the analysis.
(55)
87
-------
88
-------
SECTION 4
TABLES
Table 1. Reference to Distribution Function Algorithms
A. Distribution Functions for Numbers - Estimation of Na(y)
Equal Probability Selection:
Population Size
Subpopulation Size
Algorithm
Case
Known/unknown
Known
Hypergeometric1
1
HT-NA2
3
Known
Unknown
Hypergeometric
2
HT-NA
3
Variable Probability Selection:
Population Size
Subpopulation Size
Algorithm
Case
Known/unknown
Unknown or
HT-NA
4
known and = Na
Known/unknown
Known and ^ Na
HTR-NA3
5
1 Hypergeometric refers to the exact hypergeometric distribution algorithm
used to obtain confidence bounds.
2 HT-NA refers to Horvitz-Thompson variance with normal approximation
to obtain confidence bounds.
3 HTR-NA refers to Horvitz-Thompson ratio estimator of variance with
normal approximation to obtain confidence bounds.
4 Binomial refers to the exact binomial distribution algorithm
used to obtain confidence bounds.
89
-------
Table 1 - Continued.
B. Distribution Functions for Proportions of Numbers - Estimation of FQ(y)
Equal Probability Selection:
Population Size Subpopulation Size Algorithm Case
Known/unknown Known/unknown Binomial4 6
Known/unknown Known Hypergeometric 7
Variable Probability Selection:
Population Size Subpopulation Size Algorithm Case
Known/unknown Unknown or HTR-NA 8
known and ^ Na
Known/unknown Known and = Na HT-NA 9
1 Hypergeometric refers to the exact hypergeometric distribution algorithm
used to obtain confidence bounds.
2 HT-NA refers to Horvitz-Thompson variance with normal approximation
to obtain confidence bounds.
3 HTR-NA refers to Horvitz-Thompson ratio estimator of variance with
normal approximation to obtain confidence bounds.
4 Binomial refers to the exact binomial distribution algorithm
used to obtain confidence bounds.
90
-------
Table 1 - Continued.
C. Distribution Functions for Size-Weighted Statistics for Both Equal and
Variable Probability Selection
Population Size Subpopulation Size Algorithm Section
Estimation of Zn(y)
Known/unknown Unknown or HT-NA 10
known and = Za
Known/unknown Known and ^ Za HTR-NA 11
Estimation of Gn(y)
Known/unknown Unknown or HTR-NA 12
known and # Za
Known/unknown Known and = Za HT-NA 13
1 Hypergeometric refers to the exact hypergeometric distribution algorithm
used to obtain confidence bounds.
2 HT-NA refers to Horvitz-Thompson variance with normal approximation
to obtain confidence bounds.
3 HTR-NA refers to Horvitz-Thompson ratio estimator of variance with
normal approximation to obtain confidence bounds.
4 Binomial refers to the exact binomial distribution algorithm
used to obtain confidence bounds.
91
-------
Table 2. Summary of Notation Used in Formulae and Algorithms
Symbol
Definition
Populations:
N
Na
Distribution Functions:
Discrete Resources:
N(y)
F(y)
Z(y)
G(y)
Population size
Subpopulation size
Estimated distribution function for total number
Estimated distribution function for proportion of numbers
Estimated distribution function of size-weighted totals
Estimated distribution function for a size-weighted proportion
Continuous and Extensive Resources:
A(y) Estimated distribution function for areal extent
F(y) Estimated distribution function for proportion of areal extent
Inclusion Probabilities:
7r, Probability of inclusion of unit i
7rl} Probability that unit i and J are simultaneously included
7Tj, Probability of inclusion of unit i at Tier 1
7r2l Probability of inclusion of unit l at Tier 2
7r2 !, Conditional Tier 2 inclusion probability
Weights:
w Inverse of the above inclusion probabilities
(Same definitions apply with corresponding subscripts)
Sample Notation-
n General notation for sample size
nj Sample size at Tier 1
n2 Sample size at Tier 2
Sj Sample of units at Tier 1
S2 Sample of units at Tier 2
(These may be made specific for subpopulations or resources by appending
an a or r. For example: )
na Sample size for subpopulation a
nr| Sample size for a resource r at grid point i
Slr Sample of units at Tier 1 for resource r
S2r Sample of units at Tier 2 for resource r
92
-------
SECTION 5
REFERENCES
Chambers, R.L., and R. Dunstan. 1986. Estimating distribution functions from survey
data. Biometrika, 73, 597-604.
Cochran, W.G. 1977. Sampling Techniques, Third Edition. Wiley, New York.
Cordy, C.B. In press. An extension of the Horvitz-Thompson theorem to point sampling
from a continuous universe. Accepted in Probability in Statistics Letters.
Cox, D.R., and D. Oakes. 1984. Analysis of Survival Data. Chapman and Hall, New
York.
Hansen, M.H., W.G. Madow, and B.J. Tepping 1983. An evaluation of model-dependent
and probability-sampling inferences in sample surveys. J. Amer. Stat. Assoc. 78:
776-793.
Hartley, H.O., and J.N.K. Rao. 1962. Sampling with unequal probability and without
replacement. The Annals of Mathematical Statistics, 33, 350-374.
Horvitz, D.G., and D.J. Thompson. 1952. A generalization of sampling without
replacement from a finite universe. J. Amer. Stat. Assoc. 47: 663-685.
Hunsaker, C.T., and D.E. Carpenter, eds. 1990. Environmental Monitoring and
Assessment Program: Ecological Indicators. EPA/600/3-90/060. U.S.EPA, Office of
Research and Development, Washington, DC.
93
-------
Kaufman, P.R., A.T. Herlihy, J.W. Elwood, M.E. Mitch, W.S. Overton, M.J. Sale, K.A.
Cougan, D.V. Peck, K.H. Reckhow, A.J. Kinney, S.J. Christie, D.D. Brown, C.A. Hagley,
and H.I. Jager. 1988 Chemical Characteristics of Streams in the Mid-Atlantic and
Southeastern United States. Volume I: Population Descriptions and Physico-Chemical
Relationships. EPA/600/3-88/02 la. U.S. EPA, Washington, DC.
Landers, D.H., J.M. Eilers, D.F. Brakke, W.S. Overton, P.E. Kellar, M.E. Silverstein, R.D.
Schonbrod, R.E. Crowe, R.A. Linthurst, J.M. Omernik, S.A. Teague, and E.P. Meier.
1987. Characteristics of Lakes in the Western United States. Volume Ir Population
Descriptions and Physico-Chemical Relationships. EPA-600/3-86/054a. U.S. EPA,
Washington, DC.
Linthurst, R.A , D.H. Landers, J.M. Eilers, D.R Brakke, W.S.Overton, E.P. Meier, and
R.E. Crowe. 1986 Characteristics of Lakes in the Eastern United States, Volume I:
Population Descriptions and Physico-Chemical Relationships. EPA-600/4-86/007a.
U.S. EPA, Washington, DC
Little, R.J.A., and D.B. Rubin. 1987. Statistical Analysis with Missing Data. Wiley,
New York.
Messer, J.J., R.A. Linthurst, and W.S. Overton. 1991. An EPA Program for Monitoring
Ecological Status and Trends. Environ. Monit. and Assess. 17, 67-78.
Messer, J.J., C.W. Ariss, R. Baker, S.K. Drouse, K.N. Eshelman, P.R. Kaufmann, R.A.
Linthurst, J.M. Omernik, W.S. Overton, M.J. Sale, R.D. Schonbrod, S.M. Stambaugh,
and J.R. Tutshall, Jr. 1986. National Surface Water Survey: National Stream Survey,
Phase I - Pilot Survey. EPA/600/4-86/026. U.S. EPA, Washington, D.C.
94
-------
Miller, R.G. 1981. Survival Analysis. Wiley, New York.
Sarndal, C-E., B Swensson, and J. Wretman. 1992. Model Assisted Survey Sampling.
Springer-Verlag, New York.
Overton, W.S. 1987a. Phase II Analysis Plan, National Lake Survey, Working
Draft. Technical Report 115, Department of Statistics, Oregon State University.
Overton, W S. 1987b. A Sampling and Analysis Plan for Streams in the National
Surface Water Survey. Technical Report 117, Department of Statistics, Oregon State
University.
Overton, W.S. 1989a. Calibration Methodology for the Double Sample Structure of the
National Lake Survey Phase II Sample. Technical Report 130, Department of Statistics,
Oregon State University.
Overton, W.S. 1989b. Effects of Measurements and Other Extraneous Errors on Estimated
Distribution Functions in the National Surface Water Surveys. Technical Report 129,
Department of Statistics, Oregon State University.
Overton, W.S., and S.V. Stehman. 1987. An Empirical Investigation of Sampling and
Other Errors in National Stream Survey: Analysis of a Replicated Sample of Streams.
Technical Report 119, Department of Statistics, Oregon State University.
Overton, W.S., and S.V. Stehman. 1992. The Horvitz-Thompson theorem as a unifying
perspective for sampling. Proceedings of the Section on Statistical Education of the
American Statistical Association, pp. 182-187.
95
-------
Overton, W.S , and S V. Stehman. 1993a Properties of designs for sampling
continuous spatial resources from a triangular grid. Communications in Statistics -
Theory and Methods, 22, 2641-2660.
Overton, W.S., and S.V. Stehman. 1993b. Improvement of Performance of Variable
Probability Sampling Strategies Through Application of the Population Space and the
Fascimile Population Bootstrap. Technical Report 148, Department of Statistics, Oregon
State University
Overton, W.S., D. White, and D.L. Stevens. 1990 Design Report for EMAP,
Environmental Monitoring and Assessment Program. EPA/600/3-91/053.
U.S EPA, Washington, DC.
Overton, J M.. T C. Young, and W.S. Overton. 1993. Using found data to augment a
probability sample, procedure and case study. Environ. Monitoring and
Assessment, 26, 65-83.
Porter, P.S., R C. Ward, and H F. Bell. 1988. The detection limit. Environ. Sci.
Technol., 22, 856-861.
Rao, J.N.K., J.G. Kovar, and H.J. Mantel. 1990. On estimating distribution functions and
quantiles from survey data using auxiliary information. Biometrika, 77, 365-375.
Sarndal, C.E., B. Swensson, and J. Wretman. 1992. Model Assisted Survey Sampling.
Springer-Verlag, New York.
96
-------
Smith, TM.F 1976. The foundations of survey sampling: a review. J. of Roy. Stat.
Soc , A, 139, Part 2, 183-195
Stehman, S.V., and W.S. Overton. 1989. Pairwise Inclusion Probability Formulas in
Random-order, Variable Probability, Systematic Sampling. Technical Report 131,
Department of Statistics, Oregon State University.
Stehman, S.V., and W.S Overton. In press. Comparison of Variance Estimators of
the Horvitz-Thompson Estimator for Randomized Variable Probability Systematic
Sampling Jour, of Amer. Stat. Assoc
Stevens, D.L. In press. Implementation of a National Monitoring Program. Jour, of Envir.
Management.
Thomas, Dave. Oregon State University, Statistics Department, Corvallis, OR.
Wolter, K.M. 1985. Introduction to Variance Estimation, New York: Springer-Verlag.
97
-------
98
-------
SECTION 6
GLOSSARY OF COMMONLY USED TERMS
Continuous attribute: an attribute that is represented as a continuous surface over some
region. Examples are certain attributes of large bodies of water, such as chemical variables
of estuaries or lakes.
Discrete resource: resources consisting of discrete resource units, such as lakes or stream
reaches. Such a resource will be described as a finite population of such units.
Distribution function: a mathematical expression describing a random variable or a
population. For real-world finite populations, these distributions are knowable attributes
(parameters) of the population, and may be determined exactly by a census, or estimated
from a sample. The general Torm will be the proportion (or other measure, like numbers,
length, or area) of the resource having a value of an attribute equal to or less than a
particular value Proportions may also be of the different possible measures, like number
(frequency distributions), area (areal distributions), length, or volume.
Domain- a frame feature that includes the entire area within which a potential sample might
encounter the resource. The domain of any one resource can include other resources.
Extensive resource: resources without natural units. Examples of extensive resources are
grasslands or marshes.
40-hex: a term for the landscape description hexagon or areal sampling unit centered on
each of the grid points in the EMAP sampling grid. The area of each hexagon is
approximately 40 km2.
Inclusion probability (x,): the probability of including the ith sampling unit within a
sample.
Pairwise inclusion probability (t^): the probability that both element i and element j are
included in the sample.
99
-------
Population: often used interchangeably with the term universe to designate the total set of
entities addressed in a sampling effort. The term population is defined in this report to
designate any collection of units of a specific discrete resource, or any subset of a specific
extensive resource, about which inferences are desired or made.
Randomized model: a model invoked in analysis, assuming the population units have been
randomly arranged prior to sample selection In many cases, this is equivalent to assuming
simple random sampling.
Resource: an ecological entity that is identified as a target of sampling, description, and
analysis by EMAP. Such an entity will ordinarily be thought of and described as a
population. Two resource types, discrete and extensive, recognized in EMAP pose different
problems of sampling and representation. EMAP resources are ordinarily treated as strata
at Tier 2.
Resource class: a subset of a resource, represented as a subpopulation. For example, two
classes of substrate, sand and mud, can be defined in the Chesapeake Bay. Subpopulation
estimates require only that the classification be known on the sample.
Stratum: a stratum is a sampling structure that restricts sample randomization/selection to
a subset of the frame. Samples from different strata are independent. Inclusion
probabilities may or may not differ among strata.
Tierl/Tier2: these terms represent different phases of the EMAP program Relative to the
EMAP sample, they refer to the two phases (stages) of the EMAP double sample. The Tier
1 sample is common to all resources and provides for each a sample from which the Tier 2
sample is selected. The Tier 2 sample for any resource is a set of resource units or sites at
which field data will be obtained.
Weights: in a probability sample, the sample weights are inverses of the inclusion
probabilities; these are always known for a probability sample.
100
------- |