with

   φ = standard normal density function,

   G(t; X_i, σ²) = (1, −1, 1)(DᵀD)⁻¹DᵀY   (i.e., the least squares solution evaluated at λ = −1),

   D = the m × 3 matrix with jth row (1, λ_j, λ_j²),

   Y = (Φ((t − X_i)/(σ√λ_1)), …, Φ((t − X_i)/(σ√λ_m)))ᵀ,

and with

   g(t; X_i, σ²) = (1, −1, 1)(DᵀD)⁻¹DᵀY′,

   D as defined above and

   Y′ = the vector with jth entry φ((t − X_i)/(σ√λ_j)) · (X_i − t)/(2σ³√λ_j).
6. Procedure

6.1. Generate a sequence of k grid points, t_1 < t_2 < ⋯ < t_k, spanning the range of the observed data.

   E.g., suppose min{X_1, X_2, …, X_n} = 0 and max{X_1, X_2, …, X_n} = 25. We could let k = 51 and define t_1 = 0, t_2 = 0.5, t_3 = 1.0, t_4 = 1.5, …, t_50 = 24.5, t_51 = 25.0.

6.2. Generate a sequence of m values 0 < λ_1 < λ_2 < ⋯ < λ_m.

   See § 8.1 for more information.

6.3. For each grid point t_h, h = 1, …, k,
6.3.1. For each data value X_i, i = 1, …, n,

6.3.1.1. Calculate G(t_h; X_i, σ²) (or G(t_h; X_i, σ̂²) when σ² is estimated).

   G(t_h; X_i, σ²) = (1, −1, 1)(DᵀD)⁻¹DᵀY,

   where

   D = the m × 3 matrix with jth row (1, λ_j, λ_j²),

   Y = (Φ((t_h − X_i)/(σ√λ_1)), …, Φ((t_h − X_i)/(σ√λ_m)))ᵀ.

   (Note, Φ is the standard normal cumulative distribution function.)
6.3.1.2. If σ² is estimated, calculate g(t_h; X_i, σ̂²).

   g(t_h; X_i, σ̂²) = (1, −1, 1)(DᵀD)⁻¹DᵀY′,

   where

   D is defined in § 6.3.1.1 above,

   Y′ = the vector with jth entry φ((t_h − X_i)/(σ̂√λ_j)) · (X_i − t_h)/(2σ̂³√λ_j).

   (Note, φ is the standard normal density function.)
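Computationally, G(t_h; X_i, σ²) is a quadratic least-squares fit in λ evaluated at λ = −1. The following Python sketch illustrates the calculation under the reconstruction above; the form of the Y entries and all names are illustrative assumptions, not the authors' code.

   import numpy as np
   from scipy.stats import norm

   def G(t, x_i, sigma2, lams):
       # Quadratic least-squares extrapolation of the fitted values to lam = -1,
       # i.e., (1, -1, 1)(D'D)^{-1}D'Y with D rows (1, lam_j, lam_j^2).
       lams = np.asarray(lams, dtype=float)
       sigma = np.sqrt(sigma2)
       D = np.column_stack([np.ones_like(lams), lams, lams ** 2])
       Y = norm.cdf((t - x_i) / (sigma * np.sqrt(lams)))  # assumed Y entries (see above)
       beta = np.linalg.lstsq(D, Y, rcond=None)[0]        # least squares solution
       return np.array([1.0, -1.0, 1.0]) @ beta           # evaluate at lam = -1

   # Example: grid point t = 10, observation X_i = 9.2, sigma^2 = 1,
   # and m = 8 equally spaced lambda values on [0.05, 2.00] (see Section 8.1).
   print(G(10.0, 9.2, 1.0, np.linspace(0.05, 2.00, 8)))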
6.5. Apply isotonic regression to {F̂_{N,JK}(t_1), …, F̂_{N,JK}(t_k)} on {t_1, …, t_k} if {F̂_{N,JK}(t_h)} is not non-decreasing.
6.6. Restrict range of {F̂_{N,JK}(t_h)}, h = 1, …, k, to [0, 1].

   Set h = 1
   While F̂_{N,JK}(t_h) < 0: set F̂_{N,JK}(t_h) = 0 and let h = h + 1
   End of While

   Set h = k
   While F̂_{N,JK}(t_h) > 1: set F̂_{N,JK}(t_h) = 1 and let h = h − 1
   End of While
6.7. Apply isotonic regression to {L(t_1), …, L(t_k)} on {t_1, …, t_k} and restrict range of {L(t_h)}, h = 1, …, k, to [0, 1].

   See § 6.5. and § 6.6. above.
6.8. Apply isotonic regression to {U(t_1), …, U(t_k)} on {t_1, …, t_k} and restrict range of {U(t_h)}, h = 1, …, k, to [0, 1].

   See § 6.5. and § 6.6. above.
7. Associated Methods

A related method for estimating a cumulative distribution in the presence of measurement error is described in Estimation Method 2: The Simulation-Extrapolation Method. This method does not assume a particular sampling model nor does it require a finite population.
8. Notes
8.1. The algorithm given in § 6 requires specification of 0 < λ_1 < ⋯ < λ_m. Stefanski and Bay [3] propose taking equally spaced values over the interval [0.05, 2.00]. They also suggest that m > 5, although the exact number of values is not critical.
8.2. The algorithm in § 6 calculates estimates of the cumulative proportion. Estimates of the cumulative total may be obtained by multiplying the estimates {F̂_{N,JK}(t_j)}, j = 1, …, k, by N, the population size. The variance estimator for the cumulative total is equal to the variance estimator for the cumulative proportion times N². Confidence limits would need to be recalculated. Additionally, the range of the estimates of the cumulative total and its confidence limits would be [0, N] rather than [0, 1] as specified for the cumulative proportion.
8.3. This method of bias-adjustment is closely related to Quenouille's Jackknife. The usual Jackknife increases sampling variance by decreasing sample size. In this method measurement error variance is increased by adding pseudo-random errors to the observed data, achieving the same "variance-inflation" effect as in the Jackknife method. This is done by calculating

   X*_i(λ) = X_i + σ√λ Z_i,

where Z_i is a standard normal pseudo-random variable and λ > 0 is a constant. The variance of the additional error is λσ². The expectation of the resulting estimator, as a function of λ, is estimated by least squares regression of {F̂_{X,λ_j}(t)} on {λ_j}, j = 1, …, m, where 0 < λ_1 < ⋯ < λ_m are fixed constants. Extrapolating to λ = −1, we obtain the parametric jackknife estimator F̂_{N,JK}(t),
which may also be expressed as

   F̂_{N,JK}(t) = (1/N) Σᵢ G(t; X_i, σ²)/π_i,   with the sum over i = 1, …, n,

where

   π_i = inclusion probability for selecting the ith element in population U,

   G(t; X, σ²) = (1, −1, 1)(DᵀD)⁻¹DᵀY,

such that

   D = the m × 3 matrix with jth row (1, λ_j, λ_j²),

   Y = (Φ((t − X)/(σ√λ_1)), …, Φ((t − X)/(σ√λ_m)))ᵀ.

When σ² is known, the variance of F̂_{N,JK}(t) is estimated by the Horvitz-Thompson estimator [2, p. 43],

   V̂{F̂_{N,JK}(t)} = (1/N²) Σᵢ Σⱼ [(π_ij − π_i π_j)/π_ij] [G(t; X_i, σ²)/π_i] [G(t; X_j, σ²)/π_j],

with sums over i, j = 1, …, n and the convention π_ii = π_i, where

   π_i and G are given above,

   π_ij = joint inclusion probability for selecting elements i and j from population U.
When σ² is estimated, the Horvitz-Thompson estimator is still used to estimate the variance of the parametric Jackknife estimate. However, the additional variation due to estimating σ² must also be accounted for. Hence, when σ² is estimated, the variance estimator is given by

   V̂{F̂_{N,JK}(t)} = (1/N²) Σᵢ Σⱼ [(π_ij − π_i π_j)/π_ij] [G(t; X_i, σ̂²)/π_i] [G(t; X_j, σ̂²)/π_j]
                     + [(1/N) Σᵢ g(t; X_i, σ̂²)/π_i]² V̂ar{σ̂²},

where

   π_i, π_ij, and G are given above,

   g(t; X_i, σ̂²) = (1, −1, 1)(DᵀD)⁻¹DᵀY′, with Y′ as defined above.

See [3] for more detail.
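A minimal Python sketch of the estimator and the known-σ² variance formula above, assuming the values G(t; X_i, σ²) have already been computed (for instance with the sketch in § 6.3.1.1); all names are illustrative.

   import numpy as np

   def jackknife_cdf_and_variance(g_vals, pi, pij, N):
       # F_N,JK(t) = (1/N) sum_i G(t; X_i, sigma^2)/pi_i, with the Horvitz-Thompson
       # variance estimator; pij must carry pi_ii = pi_i on its diagonal.
       e = np.asarray(g_vals) / np.asarray(pi)    # expanded values G/pi
       F = e.sum() / N
       w = (pij - np.outer(pi, pi)) / pij         # (pi_ij - pi_i pi_j) / pi_ij
       V = e @ w @ e / N ** 2
       return F, V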
9. References

[1] Barlow, R. E., Bartholomew, D. J., Bremner, J. M., and Brunk, H. D. (1972), Statistical Inference under Order Restrictions, New York: John Wiley & Sons.

[2] Särndal, C. E., Swensson, B., and Wretman, J. (1992), Model Assisted Survey Sampling, New York: Springer-Verlag.

[3] Stefanski, L. A. and Bay, J. M. (1994), "Parametric Jackknife Deconvolution of Finite Population CDF Estimators," in review.
ESTIMATION METHOD 10: Estimation of Variance of the Cumulative Distribution
Function for the Proportion of an Extensive Resource; Horvitz-Thompson Variance Estimator
1 Scope and Application
This method calculates the estimated variance of the estimated cumulative distribution
function (CDF) for the proportion of an extensive resource that has an indicator value equal
to or less than a given indicator level. There are two variance estimators presented in this
method. An estimate can be produced for the entire population or for an arbitrary
subpopulation with known or unknown size. This size is the extent of the resource in the
subpopulation. The method applies to any probability sample and the variance estimate will
be produced at the supplied indicator levels of interest. This method does not include
estimators for the CDF. For information on CDF estimators, refer to Section 7.
2 Statistical Estimation Overview
A sample of size n_a units is selected from subpopulation a with known inclusion densities π = {π_1, …, π_i, …, π_n_a} and joint inclusion densities given by π_ij, where i ≠ j. The indicator is evaluated for each unit and represented by y = {y_1, …, y_i, …, y_n_a}. The inclusion densities are
design dependent and should be furnished with the design points. See Section 9 for further
discussion.
The Horvitz-Thompson variance estimator of the CDF, V̂[F̂_a(x_k)], is calculated for each value of the indicator levels of interest, x_k. There are two Horvitz-Thompson variance
estimators presented in this method. The first is a variance estimator of the Horvitz-
Thompson estimator of a proportion. The second is a variance estimator of a Horvitz-
Thompson ratio estimator. The former estimator calculates the variance of the Horvitz-
Thompson estimator of a total and divides this variance by the known subpopulation size squared, N_a². The latter estimator requires as input the CDF estimates produced using the
Horvitz-Thompson ratio estimator of the CDF for proportion.
The output consists of the estimated variance values.
3 Conditions Under Which This Method Applies
   Probability sample with known inclusion densities and joint inclusion densities
   Extensive resource
   Arbitrary subpopulation
   All units sampled from the subpopulation must be accounted for before applying this method
4 Required Elements
4.1 Input Data

   y_i = value of the indicator for the ith unit sampled from subpopulation a.

   π_i = inclusion density evaluated at the location of the ith sample point in subpopulation a.

   π_ij = joint inclusion density evaluated at the locations of the ith and jth sample points in subpopulation a.

   F̂_a(x_k) = estimated CDF (proportion) for indicator value x_k in subpopulation a.

4.2 Additional Components

   n_a = number of units sampled from subpopulation a.

   x_k = kth indicator level of interest.

   N_a = subpopulation size, if known.
5 Formulas and Definitions

The estimated variance of the estimated CDF (proportion) for indicator value x_k in subpopulation a, V̂[F̂_a(x_k)], with known subpopulation size, N_a; Horvitz-Thompson variance estimator of the Horvitz-Thompson estimator of a CDF is

   V̂[F̂_a(x_k)] = (1/N_a²) Σᵢ Σⱼ [(π_ij − π_i π_j)/π_ij] [I(y_i ≤ x_k)/π_i] [I(y_j ≤ x_k)/π_j],

with sums over i, j = 1, …, n_a and the convention π_ii = π_i.

The estimated variance of the estimated CDF (proportion) for indicator value x_k in subpopulation a, V̂[F̂_a(x_k)], with estimated subpopulation size, N̂_a; Horvitz-Thompson variance estimator of the Horvitz-Thompson ratio estimator of a CDF is

   V̂[F̂_a(x_k)] = (1/N̂_a²) Σᵢ Σⱼ [(π_ij − π_i π_j)/π_ij] (d_i/π_i)(d_j/π_j),

where d_i = I(y_i ≤ x_k) − F̂_a(x_k) and N̂_a = Σᵢ 1/π_i.
For these equations:

   F̂_a(x_k) = estimated CDF (proportion) for indicator value x_k in subpopulation a.

   I(y_i ≤ x_k) = 1 if y_i ≤ x_k; 0, otherwise.

   x_k = kth indicator level of interest.

   y_i = value of the indicator for the ith unit sampled from subpopulation a.

   π_i = inclusion density evaluated at the location of the ith sample point in subpopulation a.

   π_ij = joint inclusion density evaluated at the locations of the ith and jth sample points in subpopulation a.

   n_a = number of units sampled from subpopulation a.
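The two estimators can be written compactly in a few lines. The sketch below (illustrative Python, following the formulas as reconstructed above) uses the calcium data of Section 6 with the simple random sampling approximation of Section 6.3 for π_ij; because the tabled π_ij values are rounded to six decimals, results agree with Section 6.6 only up to that rounding.

   import numpy as np

   def ht_var_known_N(y, pi, pij, xk, N):
       # First formula of Section 5: HT variance of the indicator total over N_a^2.
       e = (y <= xk).astype(float) / pi
       w = (pij - np.outer(pi, pi)) / pij        # requires pi_ii = pi_i on the diagonal
       return e @ w @ e / N ** 2

   def ht_var_ratio(y, pi, pij, xk, F_hat):
       # Second formula of Section 5: residuals d_i = I(y_i <= xk) - F_a(xk),
       # subpopulation size estimated by sum(1/pi_i).
       d = ((y <= xk).astype(float) - F_hat) / pi
       w = (pij - np.outer(pi, pi)) / pij
       return d @ w @ d / np.sum(1.0 / pi) ** 2

   y = np.array([.7395, 1.2204, 1.5992, 1.5992, 1.5992, 2.0, 2.3707, 2.8196, 2.9399, 7.0])
   pi = np.array([.00375, .01406, .07734, .75, .0375, .75, .00375, .02227, .00586, .00375])
   pij = (len(y) - 1) / len(y) * np.outer(pi, pi)   # SRS approximation (Section 6.3)
   np.fill_diagonal(pij, pi)
   print(ht_var_known_N(y, pi, pij, xk=2.0, N=1130))     # approx. .054; Section 6.6 gives .054479
   print(ht_var_ratio(y, pi, pij, xk=2.0, F_hat=.3366))  # approx. .0465; Section 6.6 gives .046687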
6 Procedure
6.1 Enter Data
Input the sample data consisting of the indicator values, y_i, and their associated inclusion densities, π_i. For example,

   Calcium      Inclusion
   y_i          Density π_i
   1.5992       .07734
   2.3707       .00375
   1.5992       .75000
   2.0000       .75000
   7.0000       .00375
   2.8196       .02227
   1.2204       .01406
   1.5992       .03750
   2.9399       .00586
   .7395        .00375
6.2 Sort Data
Sort the sample data in nondecreasing order based on the y_i indicator values. Keep all occurrences of an indicator value to obtain correct results.

   Calcium      Inclusion
   y_i          Density π_i
   .7395        .00375
   1.2204       .01406
   1.5992       .07734
   1.5992       .75000
   1.5992       .03750
   2.0000       .75000
   2.3707       .00375
   2.8196       .02227
   2.9399       .00586
   7.0000       .00375
6.3 Compute or Input Joint Inclusion Densities
The required joint inclusion densities are displayed in the following table. For this example, they were computed by the formula π_ij = (n_a − 1) π_i π_j / n_a.
   Joint Inclusion Density π_ij = π_ji, π_ii = π_i

   i\j       1        2        3        4        5        6        7        8        9
    2    .000047
    3    .000261  .000979
    4    .002531  .009491  .052205
    5    .000127  .000475  .002610  .025313
    6    .002531  .009491  .052205  .506250  .025313
    7    .000013  .000047  .000261  .002531  .000127  .002531
    8    .000075  .000282  .001550  .015032  .000752  .015032  .000075
    9    .000020  .000074  .000408  .003955  .000198  .003955  .000020  .000117
   10    .000013  .000047  .000261  .002531  .000127  .002531  .000013  .000075  .000020
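This table can be generated directly from the sorted inclusion densities; a short, purely illustrative Python sketch:

   import numpy as np

   pi = np.array([.00375, .01406, .07734, .75, .0375, .75, .00375, .02227, .00586, .00375])
   n = len(pi)
   pij = (n - 1) / n * np.outer(pi, pi)   # off-diagonal entries of the SRS formula
   np.fill_diagonal(pij, pi)              # convention pi_ii = pi_i
   print(round(pij[5, 3], 6))             # 0.50625, the (i = 6, j = 4) entry above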
6.4 Obtain Subpopulation Size
Input N_a if using a known subpopulation size. N_a = 1130 for this data set.

Calculate N̂_a from the sample data only if using the variance of the Horvitz-Thompson ratio estimator of a CDF. Sum the reciprocals of the inclusion densities, π_i, for all units in the sample to obtain N̂_a.

   N̂_a = (1/.00375) + (1/.01406) + (1/.07734) + … + (1/.00375) = 1128.939 for this data set.
6.5 Input Indicator Levels of Interest and Estimated CDF Values
For this example data, the variance of the empirical CDF is of interest; x_k values = (.7395, 1.2204, 1.5992, 2, 2.3707, 2.8196, 2.9399, 7).
Input F̂_a(x_k) for each x_k if the Horvitz-Thompson ratio estimator was used to estimate the CDF.
   Calcium      CDF for Proportion,
   x_k          Ratio Estimator F̂_a(x_k)
   .7395        .2362
   1.2204       .2992
   1.5992       .3355
   2.0000       .3366
   2.3707       .5729
   2.8196       .6126
   2.9399       .7638
   7.0000       1
6.6 Compute Estimated Variance Values

Calculate V̂[F̂_a(x_k)] for each x_k using the formulas from Section 5. Compare each y_i to x_k. Set I(y_i ≤ x_k) = 1 if y_i ≤ x_k; otherwise set I(y_i ≤ x_k) = 0.
   Calcium      Estimated Variance of CDF        Estimated Variance of CDF
   x_k          for Proportion,                  for Proportion,
                Ratio Estimator V̂[F̂_a(x_k)]     N_a = 1130: V̂[F̂_a(x_k)]
   .7395        .044888                          .055690
   1.2204       .046211                          .056351
   1.5992       .046672                          .054565
   2.0000       .046687                          .054479
   2.3707       .052820                          .092531
   2.8196       .052442                          .089057
   2.9399       .044888                          .091322
   7.0000       0                                .106996
7 Associated Methods
An appropriate estimator for the estimated CDF for extensive resources may be found in
Method 1 (Horvitz-Thompson Estimator).
8 Validation Data
Actual data with results, EMAP Design and Statistics Dataset #10, are available for
comparing results from other versions of these algorithms.
9 Notes
Inclusion densities, π_i, and joint inclusion densities, π_ij, are determined by the design and should be furnished with the design points. In some instances, the joint inclusion densities may be calculated from a formula that uses the locations of the design points, or they may be approximated by a formula that assumes simple random sampling. This simple random sampling formula, π_ij = (n_a − 1) π_i π_j / n_a, is used in Section 6.3.
10 References
Cordy, C. B. 1993. An extension of the Horvitz-Thompson theorem to point sampling from a
continuous universe. Statistics & Probability Letters 18:353-362.
Lesser, V. M., and W. S. Overton. 1994. EMAP status estimation: Statistical procedures
and algorithms. EPA/620/R-94/008. Washington, DC: U.S. Environmental Protection
Agency.
Särndal, C. E., B. Swensson, and J. Wretman. 1992. Model assisted survey sampling. New
York: Springer-Verlag.
ESTIMATION METHOD 11: The Simulation-Extrapolation Method
Estimation of a Population Cumulative Distribution.
1. Scope and Application
This report describes an estimation procedure called Simulation-Extrapolation [2] used
to estimate a population cumulative distribution when sample units are measured with
error. Estimates obtained when the measurement error is ignored are biased and may
be misleading. The Simulation-Extrapolation (SIMEX) method reduces the bias
induced by measurement error by establishing a relationship between measurement
error-induced bias and the variance of the error. Extrapolating this relationship back to
the case of no measurement error, an estimator with smaller bias is produced. The
method assumes that the variance of the measurement error in the observed sample is
known or at least well estimated.
A variance estimator of the SIMEX estimator is also described.
2. Summary of Method
Let U = {U_1, U_2, …, U_n} be the true (unobserved) data subject to measurement error and X = {X_1, X_2, …, X_n} denote the observed data, where X_i is a measurement of U_i. A functional measurement error model with additive independent normal error is assumed. That is, X_i = U_i + σZ_i, for i = 1, …, n, where {Z_i}, i = 1, …, n, are mutually independent, independent of the random sampling, and identically distributed standard normal random variables. Hence, the measurement errors in the observed sample have mean zero and variance σ².
the data {X_i}, i = 1, …, n, and identically distributed standard normal pseudo-random variables. For λ fixed, the measurement error variance of the additional errors {σ√λ Z_{b,i}} is σ²λ. Therefore, the total measurement error in X_{b,i}(λ), for 1 ≤ i ≤ n and 1 ≤ b ≤ B, has variance σ²(1 + λ). The estimates F̂_{X,λ,b}(t) = T({X_{b,i}(λ)}) are then calculated for b = 1, …, B. The average of these estimates is used to estimate the expectation of F̂_{X,λ,b}(t) with respect to the distribution of the pseudo-random variates {Z_{b,i}}. This is the simulation step of the SIMEX method.

Next, the expectations F̂_{X,λ}(t) are extrapolated back to λ = −1, the case of no measurement error; this is the extrapolation step of the SIMEX method.
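A minimal Python sketch of the simulation step for one value of λ, using the empirical CDF for the estimator T(·); in practice T(·) is whatever CDF estimator is being deconvolved, and all names here are illustrative.

   import numpy as np

   def simex_simulation_step(x, sigma2, lam, B, t_grid, rng):
       # Average B pseudo-data CDF estimates at each grid point; the across-b
       # sample variance estimates Var{F_X,lam,b(t) - F_X,lam(t)} (see Section 5).
       F_b = np.empty((B, len(t_grid)))
       for b in range(B):
           z = rng.standard_normal(len(x))                 # pseudo-random errors
           x_b = x + np.sqrt(sigma2 * lam) * z             # pseudo-data X_b,i(lam)
           F_b[b] = (x_b[:, None] <= t_grid).mean(axis=0)  # empirical CDF as T(.)
       return F_b.mean(axis=0), F_b.var(axis=0, ddof=1)

   rng = np.random.default_rng(0)
   x = rng.normal(10.0, 3.0, 200) + rng.standard_normal(200)  # toy X = U + Z, sigma^2 = 1
   F_lam, s2_delta = simex_simulation_step(x, 1.0, 0.5, 100, np.linspace(0, 25, 51), rng)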
5. Definitions and Formulas

Let t denote a fixed argument in the following definitions and formulas. Define

   F̂_U(t) = estimator of the population cumulative distribution (CD) in the absence of measurement error.

   F̂_{X,λ,b}(t) = estimator based on the data {X_i + σ√λ Z_{b,i}}, i = 1, …, n, where {X_i} is the observed sample, {Z_{b,i}} are standard normal pseudo-random variables, σ² is the measurement error variance, and λ > 0 is a constant.

   τ̂²_b(λ) = estimator of the variance of F̂_{X,λ,b}(t).

   F̂_{X,λ}(t) = estimator of the expectation of F̂_{X,λ,b}(t) with respect to the distribution of the pseudo-random errors {Z_{b,i}}.

   τ̂²(λ) = estimator of the expectation of τ̂²_b(λ) with respect to the distribution of {Z_{b,i}}.

   s²_Δ(λ) = estimator of Var{F̂_{X,λ,b}(t) − F̂_{X,λ}(t)}.

   F̂_{X,ε,λ,b}(t) = estimator based on the data {X_i + √((σ̂² + ε)λ) Z_{b,i}}, i = 1, …, n, where {X_i} is the observed sample, {Z_{b,i}} are standard normal pseudo-random variables, σ̂² is the estimated measurement error variance, and ε > 0 (ε ≈ 0) and λ > 0 are constants.

   F̂_{X,ε,λ}(t) = estimator of the expectation of F̂_{X,ε,λ,b}(t) with respect to the distribution of {Z_{b,i}} only.

   F̂′_{X,λ}(t) = estimator of the derivative of F̂_{X,λ}(t) with respect to the measurement error variance σ².

   F̂_SIMEX(t) = SIMEX estimator.

   V̂ar{F̂_SIMEX(t)} = variance estimator of the SIMEX estimator.
   L(t) = lower 100(1 − α)% confidence limit for F̂_SIMEX(t).

   U(t) = upper 100(1 − α)% confidence limit for F̂_SIMEX(t).

The formulas for the above definitions are as follows.

   F̂_{X,λ}(t) = (1/B) Σᵦ F̂_{X,λ,b}(t),   τ̂²(λ) = (1/B) Σᵦ τ̂²_b(λ),

   s²_Δ(λ) = [1/(B − 1)] Σᵦ [F̂_{X,λ,b}(t) − F̂_{X,λ}(t)]²,

with sums over b = 1, …, B. F̂_SIMEX(t) and V̂ar{F̂_SIMEX(t)} are obtained by extrapolating these quantities to λ = −1; the form of V̂ar{F̂_SIMEX(t)} depends on whether σ² is known or estimated.

   L(t) = F̂_SIMEX(t) − z_{1−α/2} √V̂ar{F̂_SIMEX(t)},   U(t) = F̂_SIMEX(t) + z_{1−α/2} √V̂ar{F̂_SIMEX(t)},

where

   U = {U_i}, i = 1, …, n, = (true) unobserved data values,

   X = {X_i}, i = 1, …, n, = sample observed with error,

   {Z_{b,i}}, i = 1, …, n, b = 1, …, B, = independent and identically distributed standard normal pseudo-random variables,

   σ² = variance of the measurement error,

   z_{1−α/2} = 100(1 − α/2)th percentile of the standard normal distribution.
6. Procedure

6.1. Generate a sequence of k grid points, t_1 < t_2 < ⋯ < t_k, spanning the range of the observed data.

   E.g., suppose min{X_1, …, X_n} = 0 and max{X_1, …, X_n} = 25. We could let k = 51 and define t_1 = 0, t_2 = 0.5, t_3 = 1.0, …, t_50 = 24.5, and t_51 = 25.0.

6.2. Generate a sequence of m values 0 < λ_1 < λ_2 < ⋯ < λ_m.

   See § 8.1 for more information.
6.3. For each grid point t_h, h = 1, …, k,

6.3.1. For each λ_j, j = 1, …, m,

6.3.1.1. For b = 1, …, B,

6.3.1.1.1. Generate n standard normal pseudo-random variates {Z_{b,1}, …, Z_{b,n}}.

6.3.1.1.2. Calculate the pseudo-data set {X_{b,i}(λ_j)}:

   X_{b,i}(λ_j) = X_i + σ√λ_j Z_{b,i},   for i = 1, …, n.

6.3.1.1.3. Calculate F̂_{X,λ_j,b}(t_h) = T({X_{b,i}(λ_j)}).

6.3.1.1.4. Calculate τ̂²_b(λ_j).

6.3.1.1.5. If V̂ar{σ̂²} > 0,

6.3.1.1.5.1. Calculate the data set {X_{ε,b,i}(λ_j)}:

   X_{ε,b,i}(λ_j) = X_i + √((σ̂² + ε)λ_j) Z_{b,i},   for i = 1, …, n.

6.3.1.1.5.2. Calculate F̂_{X,ε,λ_j,b}(t_h).

6.3.1.2. Calculate F̂_{X,λ_j}(t_h) = (1/B) Σᵦ F̂_{X,λ_j,b}(t_h).

6.3.1.3. Calculate τ̂²(λ_j) = (1/B) Σᵦ τ̂²_b(λ_j).

6.3.1.4. Calculate s²_Δ(λ_j).

6.3.1.5. If V̂ar{σ̂²} > 0,

6.3.1.5.1. Calculate F̂_{X,ε,λ_j}(t_h) = (1/B) Σᵦ F̂_{X,ε,λ_j,b}(t_h).

6.3.1.5.2. Calculate F̂′_{X,λ_j}(t_h).
6.3.2. Calculate F̂_SIMEX(t_h) by extrapolating {F̂_{X,λ_j}(t_h)}, j = 1, …, m, to λ = −1.

6.3.3. Calculate V̂ar{F̂_SIMEX(t_h)} by combining the components τ̂² and s²_Δ defined above, extrapolated to λ = −1; when σ² is estimated, the additional variation due to estimating σ² must also be included (see § 5).

6.3.4. Calculate approximate 100(1 − α)% confidence limits, L(t_h) and U(t_h):

   L(t_h) = F̂_SIMEX(t_h) − z_{1−α/2} √V̂ar{F̂_SIMEX(t_h)},

   U(t_h) = F̂_SIMEX(t_h) + z_{1−α/2} √V̂ar{F̂_SIMEX(t_h)},

where z_{1−α/2} is the 100(1 − α/2)th percentile in the standard normal distribution.
6.4. Apply isotonic regression to {F̂_SIMEX(t_1), …, F̂_SIMEX(t_k)} on {t_1, …, t_k}.

   If {F̂_SIMEX(t_h)}, h = 1, …, k, is NOT non-decreasing
      Let i = 1 and j = 2
      While2 j ≤ k
         While3 F̂_SIMEX(t_i) > F̂_SIMEX(t_j)
            Let j = j + 1
         End of While3
         For h = i, …, j − 1,
            F̂_SIMEX(t_h) = average of F̂_SIMEX(t_i), …, F̂_SIMEX(t_{j−1})
         End of For
         Let i = j and j = j + 1
      End of While2
   End of If
6.5. Restrict range of {F̂_SIMEX(t_h)}, h = 1, …, k, to [0, 1].

   Set h = 1
   While F̂_SIMEX(t_h) < 0: set F̂_SIMEX(t_h) = 0 and let h = h + 1
   End of While

   Set h = k
   While F̂_SIMEX(t_h) > 1: set F̂_SIMEX(t_h) = 1 and let h = h − 1
   End of While

   (Note, isotonic regression simply ensures that F̂_SIMEX is a non-decreasing function on {t_1, …, t_k}.)
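The pooling loops of § 6.4 followed by the range restriction of § 6.5 amount to the pool-adjacent-violators algorithm of isotonic regression [1] plus clipping to [0, 1]; a compact, purely illustrative Python sketch:

   def isotonic_pava(f):
       # Pool adjacent violators: merge decreasing neighbors into blocks and
       # replace each block by its average, yielding a non-decreasing sequence.
       blocks = []                          # each block is [value, width]
       for v in f:
           blocks.append([float(v), 1])
           while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
               v2, w2 = blocks.pop()
               v1, w1 = blocks.pop()
               blocks.append([(v1 * w1 + v2 * w2) / (w1 + w2), w1 + w2])
       return [v for v, w in blocks for _ in range(w)]

   def clip01(f):
       # Section 6.5: for a non-decreasing sequence, the end-to-end while loops
       # are equivalent to clipping every value to [0, 1].
       return [min(1.0, max(0.0, v)) for v in f]

   print(clip01(isotonic_pava([-0.02, 0.10, 0.08, 0.30, 0.25, 0.60, 1.01])))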
6.6. Apply isotonic regression to {L(t_1), …, L(t_k)} on {t_1, …, t_k} and restrict range of {L(t_h)}, h = 1, …, k, to [0, 1].

   See § 6.4. and § 6.5. above.

6.7. Apply isotonic regression to {U(t_1), …, U(t_k)} on {t_1, …, t_k} and restrict range of {U(t_h)}, h = 1, …, k, to [0, 1].

   See § 6.4. and § 6.5. above.
7. Associated Methods
A similar procedure for estimating the cumulative distribution of a finite population in
the presence of measurement error is described in Estimation Method 1: The Parametric
Jackknife Estimator. This method assumes a particular sampling model which allows
for the expectation of sample cumulative distributions to be obtained analytically,
rather than by simulation as in the SIMEX method.
8. Notes
8.1. The procedure outlined in § 6 requires specification of 0 < λ_1 < ⋯ < λ_m. Cook and Stefanski [2] propose taking equally spaced values over the interval [0.05, 2.00]. They also suggest that m > 5, although the exact number of values is not critical.
8.2. The algorithm in § 6 is designed for calculating estimates of the cumulative proportion. A slight variation of this algorithm would allow for estimating the cumulative total. In this case we assume that F̂_U(t) = T(U) is an unbiased estimator of the cumulative total. The algorithm is modified by changing the upper bound of the SIMEX estimate and the confidence limits from one to the population size, if the population is finite, or ∞ if the population is infinite. This modification is required in § 6.5 through § 6.7.
9. References
[1] Barlow, R. E., Bartholomew, D. J., Bremner, J. M., and Brunk, H. D. (1972), Statistical Inference under Order Restrictions, New York: John Wiley & Sons.

[2] Cook, J. R. and Stefanski, L. A. (1994), "Simulation-Extrapolation Estimation in Parametric Measurement Error Models," Journal of the American Statistical Association, 89, 1314-1328.
ESTIMATION METHOD 12: Estimation of Variance of the Cumulative Distribution
Function for the Proportion of a Discrete or an Extensive Resource; Yates-Grundy Variance
Estimator
1 Scope and Application
This method calculates the estimated variance of the estimated cumulative distribution
function (CDF) for the proportion of a discrete or an extensive resource that has an indicator
value equal to or less than a given indicator level. There are two variance estimators
presented in this method. An estimate can be produced for the population with known or
unknown size. In the discrete case, this size is the number of units in the population. In the
extensive case, this size is the population extent. The method applies to any probability
sample with fixed sample size and the variance estimate will be produced at the supplied
indicator levels of interest. This method does not include estimators for the CDF. For
information on CDF estimators, refer to Section 7.
2 Statistical Estimation Overview
A sample of size n units is selected from population a with known inclusion probabilities π = {π_1, …, π_i, …, π_n} and joint inclusion probabilities given by π_ij, where i ≠ j. The indicator is evaluated for each unit and represented by y = {y_1, …, y_i, …, y_n}. When sampling an extensive resource, the inclusion probabilities are replaced by the inclusion density function evaluated at the sample locations. The inclusion probabilities are design dependent and should be furnished with the design points. See Section 9 for further discussion.
The Yates-Grundy variance estimator of the CDF, V̂[F̂_a(x_k)], is calculated for each value
of the indicator levels of interest, xk . There are two Yates-Grundy variance estimators
presented in this method. The first is a variance estimator of the Horvitz-Thompson estimator
of a proportion. The second is a variance estimator of a Horvitz-Thompson ratio estimator.
The former estimator calculates the variance of the Horvitz-Thompson estimator of a total and divides this variance by the known population size squared, N². The latter estimator requires
as input the CDF estimates produced using the Horvitz-Thompson ratio estimator of the CDF
for proportion.
The output consists of the estimated variance values.
3 Conditions Under Which This Method Applies
   Probability sample with known inclusion probabilities (or densities) and joint inclusion probabilities (or densities)
   Discrete or extensive resource
   Arbitrary population
   All units sampled from the population must be accounted for before applying this method
4 Required Elements
4.1 Input Data
   y_i = value of the indicator for the ith unit sampled from population a.

   π_i = For discrete resources, the inclusion probability for selecting the ith unit of population a. For extensive resources, the inclusion density evaluated at the location of the ith sample point in population a.

   π_ij = For discrete resources, the inclusion probability for selecting both the ith and jth units of population a. For extensive resources, the inclusion density evaluated at the locations of the ith and jth sample points in population a.

   F̂_a(x_k) = estimated CDF (proportion) for indicator value x_k in population a.

4.2 Additional Components

   n = number of units sampled from population a.

   x_k = kth indicator level of interest.

   N = population size, if known.
5 Formulas and Definitions

The estimated variance of the estimated CDF (proportion) for indicator value x_k in population a, V̂[F̂_a(x_k)], with known population size, N; Yates-Grundy variance estimator of the Horvitz-Thompson estimator of a CDF is

   V̂[F̂_a(x_k)] = (1/N²) Σᵢ Σ_{j>i} [(π_i π_j − π_ij)/π_ij] [I(y_i ≤ x_k)/π_i − I(y_j ≤ x_k)/π_j]².

The estimated variance with estimated population size, N̂ = Σᵢ 1/π_i; Yates-Grundy variance estimator of the Horvitz-Thompson ratio estimator of a CDF is

   V̂[F̂_a(x_k)] = (1/N̂²) Σᵢ Σ_{j>i} [(π_i π_j − π_ij)/π_ij] [d_i/π_i − d_j/π_j]²,

where d_i = I(y_i ≤ x_k) − F̂_a(x_k).
For these equations:

   F̂_a(x_k) = estimated CDF (proportion) for indicator value x_k in population a.

   I(y_i ≤ x_k) = 1 if y_i ≤ x_k; 0, otherwise.

   x_k = kth indicator level of interest.

   y_i = value of the indicator for the ith unit sampled from population a.

   π_i = For discrete resources, the inclusion probability for selecting the ith unit of population a. For extensive resources, the inclusion density evaluated at the location of the ith sample point in population a.

   π_ij = For discrete resources, the inclusion probability for selecting both the ith and jth units of population a. For extensive resources, the inclusion density evaluated at the locations of the ith and jth sample points in population a.

   n = number of units sampled from population a.
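A minimal Python sketch of the two Yates-Grundy estimators as reconstructed in Section 5; argument names are illustrative, and F_hat = None selects the known-size version.

   import numpy as np

   def yates_grundy_var(y, pi, pij, xk, N=None, F_hat=None):
       # e_i = I(y_i <= xk)/pi_i (known N), or [I(y_i <= xk) - F_hat]/pi_i with
       # N replaced by the estimate sum(1/pi_i) for the ratio estimator.
       ind = (y <= xk).astype(float)
       if F_hat is None:
           e = ind / pi
       else:
           e = (ind - F_hat) / pi
           N = np.sum(1.0 / pi)
       V = 0.0
       for i in range(len(y)):
           for j in range(i + 1, len(y)):
               V += (pi[i] * pi[j] - pij[i, j]) / pij[i, j] * (e[i] - e[j]) ** 2
       return V / N ** 2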
6 Procedure
6.1 Enter Data

Input the sample data consisting of the indicator values, y_i, and their associated inclusion probabilities (or densities), π_i. For example,

   Calcium      Inclusion
   y_i          Probability π_i
   1.5992       .07734
   2.3707       .00375
   1.5992       .75000
   2.0000       .75000
   7.0000       .00375
   2.8196       .02227
   1.2204       .01406
   1.5992       .03750
   2.9399       .00586
   .7395        .00375
6.2 Sort Data

Sort the sample data in nondecreasing order based on the y_i indicator values. Keep all occurrences of an indicator value to obtain correct results.

   Calcium      Inclusion
   y_i          Probability π_i
   .7395        .00375
   1.2204       .01406
   1.5992       .07734
   1.5992       .75000
   1.5992       .03750
   2.0000       .75000
   2.3707       .00375
   2.8196       .02227
   2.9399       .00586
   7.0000       .00375
6.3 Compute or Input Joint Inclusion Probabilities (or Densities)

The required joint inclusion probabilities are displayed in the following table. For this example, they were computed by the formula π_ij = 2(n − 1) π_i π_j / (2n − π_i − π_j).
   Joint Inclusion Probability π_ij = π_ji, π_ii = π_i

   i\j       1        2        3        4        5        6        7        8        9
    2    .000047
    3    .000262  .000983
    4    .002630  .009867  .054457
    5    .000127  .000476  .002625  .026350
    6    .002630  .009867  .054457  .547297  .026350
    7    .000013  .000047  .000262  .002630  .000127  .002630
    8    .000075  .000282  .001558  .015636  .000754  .015636  .000075
    9    .000020  .000074  .000410  .004111  .000198  .004111  .000020  .000118
   10    .000013  .000047  .000262  .002630  .000127  .002630  .000013  .000075  .000020
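The table can be generated from the sorted inclusion probabilities with Overton's approximation; a short, purely illustrative Python sketch:

   import numpy as np

   pi = np.array([.00375, .01406, .07734, .75, .0375, .75, .00375, .02227, .00586, .00375])
   n = len(pi)
   pij = 2 * (n - 1) * np.outer(pi, pi) / (2 * n - pi[:, None] - pi[None, :])
   np.fill_diagonal(pij, pi)              # convention pi_ii = pi_i
   print(round(pij[5, 3], 6))             # 0.547297, the (i = 6, j = 4) entry above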
6.4 Obtain Population Size

Input N if using a known population size. N = 1130 for this data set.

Calculate N̂ from the sample data only if using the variance of the Horvitz-Thompson ratio estimator of a CDF. Sum the reciprocals of the inclusion probabilities (or densities), π_i, for all units in the sample to obtain N̂.

   N̂ = (1/.00375) + (1/.01406) + (1/.07734) + … + (1/.00375) = 1128.939 for this data set.
6.5 Input Indicator Levels of Interest and Estimated CDF Values
For this example data, the variance of the empirical CDF is of interest; x_k values = (.7395, 1.2204, 1.5992, 2, 2.3707, 2.8196, 2.9399, 7).
Input F̂_a(x_k) for each x_k if the Horvitz-Thompson ratio estimator was used to estimate the
CDF.
   Calcium      CDF for Proportion,
   x_k          Ratio Estimator F̂_a(x_k)
   .7395        .2362
   1.2204       .2992
   1.5992       .3355
   2.0000       .3366
   2.3707       .5729
   2.8196       .6126
   2.9399       .7638
   7.0000       1
6.6 Compute Estimated Variance Values

Calculate V̂[F̂_a(x_k)] for each x_k using the formulas from Section 5. Compare each y_i to x_k. Set I(y_i ≤ x_k) = 1 if y_i ≤ x_k; otherwise set I(y_i ≤ x_k) = 0.
   Calcium      Estimated Variance of CDF        Estimated Variance of CDF
   x_k          for Proportion,                  for Proportion,
                Ratio Estimator V̂[F̂_a(x_k)]     N = 1130: V̂[F̂_a(x_k)]
   .7395        .044710                          .055482
   1.2204       .046005                          .056116
   1.5992       .046453                          .054400
   2.0000       .046467                          .054346
   2.3707       .052579                          .092363
   2.8196       .052209                          .088936
   2.9399       .044710                          .091247
   7.0000       0                                .106996
7 Associated Methods
An appropriate estimator for the estimated CDF for discrete or extensive resources may be
found in Method 1 (Horvitz-Thompson Estimator).
8 Validation Data
Actual data with results, EMAP Design and Statistics Dataset #12, are available for
comparing results from other versions of these algorithms.
9 Notes
Inclusion probabilities (or densities), π_i, and joint inclusion probabilities (or densities), π_ij, are determined by the design and should be furnished with the design points. In some instances, the joint inclusion probabilities may be calculated from a formula such as Overton's approximation, where π_ij = 2(n − 1) π_i π_j / (2n − π_i − π_j), which is used in Section 6.3. In some instances, the joint inclusion densities may be calculated from a formula that uses the locations of the design points, or they may be approximated by the formula π_ij = (n − 1) π_i π_j / n that assumes simple random sampling.
10 References
Cochran, W. G. 1977. Sampling techniques. 3rd Edition. New York: John Wiley & Sons.
Cordy, C. B. 1993. An extension of the Horvitz-Thompson theorem to point sampling from a
continuous universe. Statistics & Probability Letters 18:353-362.
Lesser, V. M., and W. S. Overton. 1994. EMAP status estimation: Statistical procedures
and algorithms. EPA/620/R-94/008. Washington, DC: U.S. Environmental Protection
Agency.
Overton, W. S., D. White, and D. L. Stevens Jr. 1990. Design report for EMAP,
Environmental Monitoring and Assessment Program. EPA 600/3-91/053. Corvallis, OR:
U.S. Environmental Protection Agency, Environmental Research Laboratory.
Särndal, C. E., B. Swensson, and J. Wretman. 1992. Model assisted survey sampling. New
York: Springer-Verlag.
Stevens, Jr., D. L. 1995. A family of designs for sampling continuous spatial populations. Environmetrics. Submitted.
ESTIMATION METHOD 13: Simplified Variance of the Cumulative Distribution Function
for Proportion (Discrete or Extensive) and for Total Number of a Discrete Resource, and
Variance of the Size-Weighted Cumulative Distribution Function for Proportion and Total of
a Discrete Resource; Simple Random Sample Variance Estimator
1 Scope and Application
This method calculates the estimated variance of the estimated cumulative distribution
function (CDF) for the proportion and total number of a discrete (or in the case of proportion,
extensive) resource that has an indicator value equal to or less than a given indicator level.
The method also calculates the estimated size-weighted versions of these CDFs for a discrete
resource. All of these CDFs are produced using Horvitz-Thompson estimators found in other
methods. An estimate can be produced for the entire population or for a geographic
subpopulation with unknown size. This size is the total number of units or extent in the
subpopulation.
The estimation algorithms have been simplified for use in spreadsheet software such as Lotus
1-2-3 and Quattro Pro; however, because of this simplification, the use of these variability
estimates is restricted. This method provides a mechanism for generating quick summaries of
indicators to assist in internal research and is distributed with the restriction that results for
inclusion in peer-reviewed documents or EPA reports should be cleared by EMAP
statisticians. The variance estimates will be produced at the supplied indicator levels of
interest. For information on the Horvitz-Thompson estimators of the CDF, refer to Section 7.
2 Statistical Estimation Overview
A sample of size n_a units is selected from subpopulation a with known inclusion probabilities π = {π_1, …, π_i, …, π_n_a} and, if applicable, the size-weight values w = {w_1, …, w_i, …, w_n_a}. The indicator is evaluated for each unit and represented by y = {y_1, …, y_i, …, y_n_a}. When sampling an extensive resource, the inclusion probabilities are replaced by the inclusion density function evaluated at the sample locations. The inclusion probabilities are design dependent and should be furnished with the design points. See Section 9 for further discussion.
The variance estimators of the CDF are calculated for each value of the indicator levels of interest, x_k. The units are assumed to come from an independent sampling design that reduces the usually required joint inclusion probabilities given by π_ij, where i ≠ j, to π_ij = (n_a − 1) π_i π_j / n_a. Under the independent random sampling model, the Horvitz-Thompson variance estimator simplifies to the usual simple random sampling variance estimator, s², applied to a cumulative total. This total differs depending upon whether the CDF is for proportion or for total number. In the case of proportion, the Horvitz-Thompson ratio estimator is used to calculate the CDF, and because both the numerator and denominator of the proportion are estimated, there is more variability in the estimate. As a result, the variance estimators of the CDF for proportion and the size-weighted CDF for proportion require as input the CDF estimates produced using the Horvitz-Thompson ratio estimator.
The output consists of the estimated variance values.
3 Conditions Under Which This Method Applies
   Independent random sample (IRS) with fixed sample size and known inclusion probabilities or densities
   Discrete resource (or extensive, in the case of proportion)
   Subpopulation is defined geographically, or the number of sites within the subpopulation of interest is known; examples: by ecoregion or first-order stream length
   All units sampled from the subpopulation must be accounted for before applying this method; missing values are excluded
3.1 Restrictions
Variability estimates of the CDF for non-geographic subpopulation estimates cannot be made
using the supplied calculation routines. For example, the supplied routine does not apply for
the estimates of variability of the percentage of lakes that are hypereutrophic and have ANC
< 200, or the estimated number of streams containing a species of fish for a subset of the
sample that is determined by a chemistry response. A more sophisticated variance estimator is
needed for these cases; contact EMAP Design and Statistics for assistance.
4 Required Elements
4.1 Input Data
   y_i = value of the indicator for the ith unit sampled from subpopulation a.

   π_i = inclusion probability for selecting the ith unit of subpopulation a.

   w_i = size-weight value for the ith unit sampled from subpopulation a. This applies to discrete resources only. An example would be the area of a lake.

4.2 Additional Components

   n_a = number of units sampled from subpopulation a.

   x_k = kth indicator level of interest.

   I(y_i ≤ x_k) = 1 if y_i ≤ x_k; 0, otherwise.
For the estimated variance of the estimated CDF for proportion, also input

   F̂_a(x_k) = [Σᵢ I(y_i ≤ x_k)/π_i] / N̂_a,

the estimated CDF for proportion for indicator value x_k in subpopulation a with estimated subpopulation size, N̂_a = Σᵢ 1/π_i.

For the estimated variance of the estimated size-weighted CDF for proportion, also input

   Ĝ_a(x_k) = [Σᵢ I(y_i ≤ x_k) w_i/π_i] / Ŵ_a,

the estimated size-weighted CDF for proportion for indicator value x_k in subpopulation a with estimated subpopulation size-weighted total, Ŵ_a = Σᵢ w_i/π_i.
5 Formulas and Definitions

The estimated variance of the estimated CDF (proportion) for indicator value x_k in subpopulation a, V̂[F̂_a(x_k)]; simple random sample variance estimator of the Horvitz-Thompson ratio estimator of a CDF is

   V̂[F̂_a(x_k)] = n_a s² / N̂_a²,   with r_i = [I(y_i ≤ x_k) − F̂_a(x_k)] / π_i.

The estimated variance of the estimated CDF (total number) for indicator value x_k in subpopulation a, V̂[N̂_a F̂_a(x_k)]; simple random sample variance estimator of the Horvitz-Thompson estimator of a CDF is

   V̂[N̂_a F̂_a(x_k)] = n_a s²,   with r_i = I(y_i ≤ x_k) / π_i.

The estimated variance of the estimated size-weighted CDF (proportion) for indicator value x_k in subpopulation a, V̂[Ĝ_a(x_k)]; simple random sample variance estimator of the Horvitz-Thompson ratio estimator of a CDF is

   V̂[Ĝ_a(x_k)] = n_a s² / Ŵ_a²,   with r_i = [I(y_i ≤ x_k) − Ĝ_a(x_k)] · w_i / π_i.

The estimated variance of the estimated size-weighted CDF (total) for indicator value x_k in subpopulation a, V̂[Ŵ_a Ĝ_a(x_k)]; simple random sample variance estimator of the Horvitz-Thompson estimator of a CDF is

   V̂[Ŵ_a Ĝ_a(x_k)] = n_a s²,   with r_i = I(y_i ≤ x_k) · w_i / π_i.

In each case, s² = Σᵢ (r_i − r̄)² / (n_a − 1).
For these equations:

   Ŵ_a = estimated subpopulation size-weighted total.

   N̂_a = estimated subpopulation size.

   F̂_a(x_k) = estimated CDF (proportion) for indicator value x_k in subpopulation a.

   Ĝ_a(x_k) = estimated size-weighted CDF (proportion) for indicator value x_k in subpopulation a.

   I(y_i ≤ x_k) = 1 if y_i ≤ x_k; 0, otherwise.

   x_k = kth indicator level of interest.

   y_i = value of the indicator for the ith unit sampled from subpopulation a.

   π_i = inclusion probability for selecting the ith unit of subpopulation a.

   w_i = size-weight value for the ith unit sampled from subpopulation a. This applies to discrete resources only. An example would be the area of a lake.

   s² = sample variance of r.

   n_a = number of units sampled from subpopulation a.

   r̄ = Σᵢ r_i / n_a, the arithmetic mean of r.
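All four estimators reduce to a sample variance of one expanded column; a minimal Python sketch of the formulas above (names are illustrative):

   import numpy as np

   def irs_cdf_variances(y, pi, xk, w=None, F_hat=None, G_hat=None):
       # Each estimator is n_a * s^2 of a column r, divided by the squared
       # estimated size (or size-weighted total) for the proportion versions.
       n = len(y)
       ind = (y <= xk).astype(float)
       out = {}
       if F_hat is not None:                          # CDF for proportion
           r = (ind - F_hat) / pi
           out["V_F"] = n * r.var(ddof=1) / np.sum(1.0 / pi) ** 2
       out["V_NF"] = n * (ind / pi).var(ddof=1)       # CDF for total number
       if w is not None:
           if G_hat is not None:                      # size-weighted, proportion
               r = (ind - G_hat) * w / pi
               out["V_G"] = n * r.var(ddof=1) / np.sum(w / pi) ** 2
           out["V_WG"] = n * (ind * w / pi).var(ddof=1)   # size-weighted, total
       return out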
6 Procedure
6.1 Enter Data

Input the sample data consisting of the indicator values, y_i, their associated inclusion probabilities, π_i, and, if applicable, the size-weight values, w_i, and CDF estimates. For this example data, the variance of the empirical CDF is of interest; x_k values are equal to y_i.
6.2 Sort Data

Sort the sample data in nondecreasing order based on the y_i indicator values. Keep all occurrences of an indicator value to obtain correct results. Our sample data is

   Indicator   Inclusion         Size-weight
   y_i         Probability π_i   (ex. area) w_i   F̂_a(x_k)
   (1)         (2)               (3)              (4)
   1.9         .042201           117.85           .1219
   6.0         .059245           147.30           .2087
   9.8         .023847           185.55           .4244
   10.9        .060562            55.55           .5093
   11.0        .037023           239.91           .6482
   11.8        .055115           165.09           .7415
   12.0        .102785           129.83           .7916
   12.3        .059545            51.42           .8779
   13.6        .084789           262.33           .9386
   14.2        .083752            74.58           1.0000
6.3 Compute Estimated Variance of the Estimated CDF for Proportion, V̂[F̂_a(x_k)], and for Total Number, V̂[N̂_a F̂_a(x_k)]

Create a table of 6 columns. Use columns (1) and (2) from the table in Section 6.2. In column (3), set I(y_i ≤ 1.9) = 1 if y_i ≤ 1.9. If this is not the case, set I(y_i ≤ 1.9) = 0. In the following table, I(y_i ≤ 1.9) is abbreviated as I(1.9).
   Indicator   Inclusion         I(1.9)   I(1.9) − F̂_a(1.9)   [I(1.9) − F̂_a(1.9)]/π_i   I(1.9)/π_i
   y_i         Probability π_i
   (1)         (2)               (3)      (4) = (3) − .1219    (5) = (4) ÷ (2)           (6) = (3) ÷ (2)
   1.9         .042201           1         .8781               20.8                      23.696
   6.0         .059245           0        −.1219               −2.1                      0
   9.8         .023847           0        −.1219               −5.1                      0
   10.9        .060562           0        −.1219               −2.0                      0
   11.0        .037023           0        −.1219               −3.3                      0
   11.8        .055115           0        −.1219               −2.2                      0
   12.0        .102785           0        −.1219               −1.2                      0
   12.3        .059545           0        −.1219               −2.0                      0
   13.6        .084789           0        −.1219               −1.4                      0
   14.2        .083752           0        −.1219               −1.5                      0
For estimating the variance of the estimated CDF for proportion for x_k = 1.9, calculate the sample variance, s², of column (5). (In Excel, use the VAR( ) function.) s² = 54.765.

   V̂[F̂_a(1.9)] = V̂[.1219] = n_a s² / N̂_a² = (10)(54.765) / (194.432)² = 0.0145.

For estimating the variance of the estimated CDF for total number for x_k = 1.9, calculate the sample variance, s², of column (6). s² = 56.15.

   V̂[N̂_a F̂_a(1.9)] = V̂[23.70] = n_a s² = (10)(56.15) = 561.5.
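The x_k = 1.9 calculations above can be reproduced, up to rounding, with a few lines of Python; this is a convenience sketch, not part of the method.

   import numpy as np

   y = np.array([1.9, 6.0, 9.8, 10.9, 11.0, 11.8, 12.0, 12.3, 13.6, 14.2])
   pi = np.array([.042201, .059245, .023847, .060562, .037023,
                  .055115, .102785, .059545, .084789, .083752])
   ind = (y <= 1.9).astype(float)           # column (3)
   N_hat = np.sum(1.0 / pi)                 # 194.432
   F_19 = np.sum(ind / pi) / N_hat          # .1219
   r5 = (ind - F_19) / pi                   # column (5)
   r6 = ind / pi                            # column (6)
   print(len(y) * r5.var(ddof=1) / N_hat ** 2)   # approx. 0.0145
   print(len(y) * r6.var(ddof=1))                # approx. 561.5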
Do this same procedure for the next xk value, xk = 6.0. The table now becomes
   Indicator   Inclusion         I(6.0)   I(6.0) − F̂_a(6.0)   [I(6.0) − F̂_a(6.0)]/π_i   I(6.0)/π_i
   y_i         Probability π_i
   (1)         (2)               (3)      (4) = (3) − .2087    (5) = (4) ÷ (2)           (6) = (3) ÷ (2)
   1.9         .042201           1         .7913               18.751                    23.696
   6.0         .059245           1         .7913               13.357                    16.879
   9.8         .023847           0        −.2087               −8.751                    0
   10.9        .060562           0        −.2087               −3.446                    0
   11.0        .037023           0        −.2087               −5.637                    0
   11.8        .055115           0        −.2087               −3.786                    0
   12.0        .102785           0        −.2087               −2.030                    0
   12.3        .059545           0        −.2087               −3.505                    0
   13.6        .084789           0        −.2087               −2.461                    0
   14.2        .083752           0        −.2087               −2.492                    0
For estimating the variance of the estimated CDF for proportion for x_k = 6.0, calculate the sample variance, s², of column (5). s² = 77.026.

   V̂[F̂_a(6.0)] = V̂[.2087] = n_a s² / N̂_a² = (10)(77.026) / (194.432)² = 0.0204.

For estimating the variance of the estimated CDF for total number for x_k = 6.0, calculate the sample variance, s², of column (6). s² = 75.75.

   V̂[N̂_a F̂_a(6.0)] = V̂[40.58] = n_a s² = (10)(75.75) = 757.5.

Repeat this process for the remaining x_k values.
6.4 Compute Estimated Variance of the Estimated Size-Weighted CDF for Proportion, V̂[Ĝ_a(x_k)], and for Total, V̂[Ŵ_a Ĝ_a(x_k)]

The procedure for calculating the variance estimates for the size-weighted CDFs is the same as the one used in Section 6.3. The only difference between the estimates is that w_i/π_i is substituted for 1/π_i in every part of the calculation. The following example is for x_k = 6.0.

Create a new table of 6 columns. Use column (1) from the table in Section 6.2 for the first column. In the second column, enter the result from dividing column (3) by column (2) of the table in Section 6.2. Insert the I(y_i ≤ x_k) values in column (3), where I(y_i ≤ x_k) = 1 if y_i ≤ 6.0. If this is not the case, set I(y_i ≤ x_k) = 0. In column (4), insert the difference between column (3) and the size-weighted CDF value corresponding to x_k = 6.0. This CDF value is .1786, from the table in Section 6.2. In column (5), put the result from multiplying column (4) by column (2). In column (6), put the result from multiplying column (3) by column (2). Results are in the following table; I(y_i ≤ 6.0) is abbreviated as I(6.0).
   Indicator   w_i/π_i      I(6.0)   I(6.0) − Ĝ_a(6.0)   [I(6.0) − Ĝ_a(6.0)]·w_i/π_i   I(6.0)·w_i/π_i
   y_i
   (1)         (2)          (3)      (4) = (3) − .1786    (5) = (4) × (2)               (6) = (3) × (2)
   1.9         2792.5879    1         .8214               2293.8317                     2792.5879
   6.0         2486.2858    1         .8214               2042.2351                     2486.2858
   9.8         7780.8529    0        −.1786               −1389.6603                    0
   10.9         917.2418    0        −.1786               −163.8194                     0
   11.0        6480.0259    0        −.1786               −1157.3326                    0
   11.8        2995.3733    0        −.1786               −534.9737                     0
   12.0        1263.1221    0        −.1786               −225.5936                     0
   12.3         863.5486    0        −.1786               −154.2298                     0
   13.6        3093.9155    0        −.1786               −552.5733                     0
   14.2         890.4862    0        −.1786               −159.0408                     0
For estimating the variance of the estimated size-weighted CDF for proportion for x_k = 6.0, calculate the sample variance, s², of column (5). s² = 1491256.3.

   V̂[Ĝ_a(6.0)] = V̂[.1786] = n_a s² / Ŵ_a² = (10)(1491256.3) / (29563.602)² = 0.0171.

For estimating the variance of the estimated size-weighted CDF for total for x_k = 6.0, calculate the sample variance, s², of column (6). s² = 1243723.68.

   V̂[Ŵ_a Ĝ_a(6.0)] = V̂[5280.06] = n_a s² = (10)(1243723.68) = 12,437,237.

Repeat this process for the remaining x_k values.
7 Associated Methods
An appropriate estimator for the estimated CDF for proportion for discrete or extensive
resources may be found in Method 1 (Horvitz-Thompson Estimator). For the estimated CDF
for total number, size-weighted CDF for proportion, and size-weighted CDF for total (which
apply only to discrete resources), see Methods 2, 3, and 4, respectively.
8 Validation Data
Actual data with results, EMAP Design and Statistics Dataset #13, are available for
comparing results from other versions of these algorithms.
9 Notes

Inclusion probabilities, π_i, are determined by the design and should be furnished with the design points.

Population estimates are calculated using inclusion probabilities or densities and differ by indicator. For example, in the 1993 stream sample, periphyton and full physical habitat (P-hab) were measured only on the 1X grid streams, requiring use of the 1X inclusion probabilities. Water chemistry measurements were taken on both 1X and 7X streams, and in this case, the 1X inclusion probabilities should be used. Reference/test sites (both lakes and streams) were hand picked and cannot be used to make population estimates. These restrictions apply to all sampling years.

If estimates across multiple years are required, responses for sites sampled in multiple years should only be included for the initial year, and the inclusion probabilities should be multiplied by the number of years of data.
10 References
Lesser, V. M., and W. S. Overton. 1994. EMAP status estimation: Statistical procedures
and algorithms. EPA/620/R-94/008. Washington, DC: U.S. Environmental Protection
Agency.
Overton, W. S., D. White, and D. L. Stevens Jr. 1990. Design report for EMAP,
Environmental Monitoring and Assessment Program. EPA 600/3-91/053. Corvallis, OR:
U.S. Environmental Protection Agency, Environmental Research Laboratory.
Stevens, Jr., D. L. 1995. A family of designs for sampling continuous spatial populations.
Environmetrics. Submitted.
ANSWERS TO COMMONLY ASKED QUESTIONS
ABOUT R-EMAP SAMPLING DESIGNS
AND DATA ANALYSES
Prepared for
Victor Serveiss
U.S. Environmental Protection Agency
Research Triangle Park, NC
Prepared by
Jon H. Volstad
Steve Weisberg
Versar, Inc.
Columbia, MD 21045
and
Douglas Heimbuch
Harold Wilson
John Seibel
Coastal Environmental Services, Inc.
Linthicum, MD
March 1995
ANSWERS TO COMMONLY ASKED QUESTIONS ABOUT
R-EMAP SAMPLING DESIGNS AND DATA ANALYSES
INTRODUCTION
The Environmental Monitoring and Assessment Program (EMAP) is an innovative, long-term research and monitoring program designed to measure the current and changing conditions of the nation's ecological resources.
EMAP achieves this goal by using statistical survey
methods that allow scientists to assess the condition of
large areas based on data collected from a representative
sample of locations. Statistical survey methods are very
efficient because they require sampling relatively few
locations to make valid scientific statements about the
condition of large areas (e.g., all wadable streams within
an EPA Region).
Regional-EMAP (R-EMAP) is a partnership between
EMAP, EPA Regional offices, states, and other federal
agencies to adapt EMAP's broad-scale approach to
produce ecological assessments at regional, state, and
local levels. R-EMAP is based on the same statistical
survey techniques used in EMAP, which have proven
successful in many disciplines of science. Applying
these techniques effectively requires recognizing several
key principles of survey sampling and using specialized,
although not difficult, data analysis methods.
This document provides a nontechnical overview of the
survey sampling and data analysis concepts underlying
R-EMAP projects. It is intended for regional resource
managers who have had little statistical training, but
who feel they would benefit from a better understanding
of the statistical and scientific underpinnings of R-EMAP.
Familiarity with these concepts is helpful for understand-
ing the kinds of information R-EMAP can provide and
appreciating the strengths of R-EMAP. Several addi-
tional documents are being prepared for scientists with
some statistical training who may become involved in
analyzing R-EMAP data.
This document is organized in two sections. The first
section explains the general principles of survey
sampling and its application to determining ecological
condition. Terms such as target population, sampling
frame, and random sampling are defined. The second
section addresses questions frequently asked about the
R-EMAP sampling design and data analysis methods.
Throughout the document, the concepts of survey
design are illustrated first with examples from everyday
life, and then with examples from a typical R-EMAP
study. The R-EMAP examples involve a stream study;
however, the concepts are equally applicable to assess-
ing the condition of other resources such as lakes,
estuaries, wetlands, or forests.
PRINCIPLES OF SURVEY DESIGN
There are two generally accepted data collection
schemes for studying the characteristics of a population.
The first is a census, which entails examining every unit
in the population of interest. For most ecological
studies, however, a census is impractical. For example,
measuring fish assemblages everywhere to assess condi-
tions within a watershed that has 1000 kilometers of
stream would be prohibitively expensive.
A more practical approach for studying an extensive
resource, such as a watershed, is to examine parts of it
through probability (or random) sampling. Studies based
on statistical samples rather than complete coverage (or enumeration) are referred to as sample surveys. Sample
surveys are highly cost-effective, and the principles
underlying such surveys are well developed and docu-
mented. The principles of survey design provide the
basis for (a) selecting a subset of sampling units from
which to collect data, and (b) choosing methods for
analyzing the data.
One example of a sample survey is an opinion poll to
estimate the percentage of eligible voters who plan to
vote Democratic in a presidential election. Such opinion
polls are based on interviews with only a small fraction
of all eligible voters. Nevertheless, by using statistically
sound survey methods, highly accurate estimates can be
obtained by interviewing a representative sample of only
around 1200 voters. If 700 of the polled voters plan to
vote Democratic, then the fraction 700/1200, or 58 per-
cent, is a reliable estimate of the percent of all voters
who plan to vote Democratic.
[Illustration: A target population of enrolled students at a university. Sampling unit = individual student.]

[Illustration: A target population of perennial, wadable streams in a watershed. Sampling unit = point location and associated plot.]
The approach used in conducting a R-EMAP stream
survey is basically the same as in an opinion poll.
Instead of collecting the opinions of a sample of people,
a R-EMAP project might collect data about fish assem-
blages from a representative sample of point locations
along the stream length of a watershed to determine the
percent of kilometers of streams in which ecological con-
ditions are degraded. If data are collected from plots of,
say, 40 times the stream width in length at each of 40
randomly selected sites, and 16 of the 40 sites exhibit
degraded conditions, then the estimated proportion of
degraded stream kilometers in the watershed would be
40% (i.e., 16/40).
STEPS FOR IMPLEMENTING A SAMPLE SURVEY
The survey design is a plan for selecting the sample
appropriately so that it provides valid data for developing
accurate estimates for the entire population or area of
interest. Planning and executing a sample survey
involves three primary steps: (1) creating a list of all
units of the target population from which to select the
sample, (2) selecting a random sample of units from this
list, and (3) collecting data from the selected units. The
same techniques used to select the sample of people to
interview in an opinion poll are used to select the sample
of sites from which to collect field data.
Developing a Sampling Frame
Before the sample survey can be conducted, a clear,
concise description of the target population is needed.
In statistical terminology the target population (often
shortened to "population") does not necessarily refer to
a population of people. It could be a population of
schools, area units of farm land, freshwater lakes, or
length-segments of streams.
The list or map that identifies every unit within the population of interest is the sampling frame. Such a list is needed so that every individual member of the population can be identified unambiguously. The individual members of the target population whose characteristics are to be measured are the sampling units.
[Illustration: A random sample of students from the target population. The poll results in "yes" or "no" responses.]
For example, if we were conducting a sample survey to
estimate the percentage of students at a university who
participate in intramural sports, the target population
would consist of all the enrolled students. The individual
students would be the sampling units, and the registrar's
office could provide a list of students to serve as the
sampling frame. We could draw a representative (ran-
dom) sample of students from this list and interview
them about their participation in sports. Their responses
would be "yes" or "no." The percentage of interviewed
students who participate in intramural sports would yield
an estimate of the "true" percentage for all students.
For a stream survey, the target population might be all
perennial, wadable streams in a watershed. .The sam-
pling unit is a point along the stream length, and an
associated plot, e.g. 40 times the stream width in
length. The response variable might be "degraded" or
"non-degraded" based on measures of water quality.
Conceptually, the collection of all possible point
locations along these streams serve as a sampling frame,
similar to the list of students in the previous example.
The sampling frame for streams typically would be
established by using U.S. Geological Survey stream
reach files through a geographic information system (GIS).

[Illustration: A random sample of locations from the target population.]
Selecting a Representative Sample
Survey sampling is intended to characterize the entire
population of interest; therefore, all members of the
target population must have a known chance of being
included in the sample. Conducting an election poll by
asking only your neighbors' opinions probably would not
enable you to predict the outcome of a national election
accurately.
Simple random selection ensures that the sample is
representative because all members of the population
have an equal chance of being selected. Random selec-
tion can be thought of as a kind of lottery drawing to
determine which stream reaches, for example, are in-
cluded in the sample. The selection is non-preferential
towards any particular reach or group of reaches. One
way to make a random selection would be to place
uniquely numbered ping-pong balls (one for each sam-
pling unit) into a drum, blindly mix the drum, and then
blindly pick one ball corresponding to each stream reach
(i.e., sampling unit) from which data are to be collected.
In practice, computers are used to make the random
selections. Either way, the result is a subset of sampling
units randomly selected from the sampling frame.
[Illustration: Students polled at the entrance to the gymnasium are not representative of all students on the university campus.]

[Illustration: A biased sample of locations from the target population of all streams in the shaded area.]
FREQUENTLY ASKED QUESTIONS
Upon thoughtful consideration of the sample survey
approach, several questions may come to mind. This
section answers several commonly asked questions.
Some of them concern survey sampling, and some of
them concern data analysis. These questions are
addressed in fairly general terms. As noted in the intro-
duction, additional technical detail will be available in a
series of methods manuals.
Why is it so important to select sampling sites randomly?
The way we select the sample (i.e., choose the units
from which to collect data) is crucial for obtaining
accurate estimates of population parameters. We clearly
would not get a good estimate of the percentage of all
students at a university who participate in intramural
sports if we polled students at the entrance to the
gymnasium. This preferential sample would, most likely,
include a much higher proportion of athletes than the
general population of students.
Similarly in a stream study, preferential sampling occurs
if the sample includes only sites downstream of sewage
outfalls in a watershed where sewage outfalls affect
only a small percentage of total stream length. This kind
of sampling program may provide useful information
about conditions downstream of sewage outfalls, but it
will not produce estimates that accurately represent the
condition of the whole watershed.
Preferential selection can be avoided by taking random
samples. Simple random sampling ensures that no par-
ticular portion of the sampling frame (i.e., groups of
students or kinds of river reaches) is favored. Within
streams, the chance of selecting a sampling unit that
has degraded ecological conditions would be proportional
to the number of sampling units within the target popu-
lation that have degraded conditions. For example, if
30% of the target population has degraded conditions,
then on average 30% of the (randomly selected) units in
the sample will exhibit degraded conditions. This pro-
perty of random sampling allows estimates (based only
on the sample) to be used to draw conclusions about the
target population as a whole.
For 305b reports, I need to estimate the total number of
stream miles in my EPA Region that are degraded. Can
I do this from sample survey data?
The number of degraded stream miles can be calculated
in two steps. First, the proportion of stream miles that
are degraded is calculated as illustrated earlier. Then,
that fraction is multiplied by the total number of stream
miles in the population. The total number of stream
miles is available from the sampling frame, which
delineates all members of the target population.
Defining "degraded" is an important part of the calculation, regardless of whether it is for percent or absolute number of stream miles. "Degraded" can be defined if a threshold value or goal for each measurement variable can be established. Most of the variables measured in stream surveys, such as pH, have continuous ranges of response (e.g., between 1 and 14 for pH). Calculating the proportion of stream miles that are degraded requires converting this continuous data into binary, or yes/no (e.g., degraded or not degraded), form. The question of how many stream miles are degraded, therefore, must be rephrased to include a threshold value for the relevant measurement variable. For pH, the question might be rephrased as "What is the total number of stream miles in my Region with pH below 6.5?"
I am accustomed to seeing estimates of average condition instead of estimates of proportion. Can R-EMAP data be used to estimate average condition?
Yes, estimates of average condition, such as the average
pH in a watershed, provide valuable information and can
be calculated with R-EMAP data as a simple mean.
The principles of survey sampling, particularly the
emphasis on selecting a representative sample, also
-------
apply to estimating a population mean. Just as an esti-
mate of the percent of stream miles in a Region in which
pH is below 6.5 is biased if data are collected only from
sites downstream of sewage outfalls, so is the estimate
of mean pH.
EMAP emphasizes estimating spatial extent (e.g., per-
cent of river miles) because it has several advantages
over estimating the mean. For instance, a Region with
an average stream pH of 7 might be composed entirely
of streams with a pH of 7; however, the same average
would occur if half the streams have a pH of 6 and the
other half a pH of 8. Estimating the spatial extent of the
resource that fails to meet some standard (e.g., pH of at
least 6.5) provides more information about the condition
of the resource and is consistent with EPA initiatives to
establish environmental goals and measure progress
toward meeting them.
Distribution of sampling locations along a transect for different sampling schemes: random, restricted random, and systematic.
Many EMAP documents refer to hexagons in describing
the sampling design. How are hexagons involved?
In geographic studies, such as a stream survey, it is
often desirable to-distribute samples throughout the
study area. Often this is accomplished using a syste-
matic design in which samples are placed at regular
intervals. In EMAP, this is accomplished by a special
kind of random sampling known as restricted random
sampling. This type of random sampling has a syste-
matic component. The systematic element causes the
selected sampling units to be spread out geographically.
The random element ensures that every sampling unit
has an equal chance of being selected. The illustration
at left compares the typical allocations of sampling units
along a transect for random, restricted random, and
systematic sampling designs.
In EMAP, hexagons are used to add the systematic ele-
ment to the design. The hexagonal grid is positioned
randomly on the map of the target resource, and sam-
pling units from within each grid cell are selected
randomly. The grid ensures spatial separation of
selected sampling units; randomization ensures that each
sampling unit has an equal chance of being selected.
-------
Target population: all eligible voters in all states. Area of special interest (stratum): voters in Rhode Island.
Target population: watershed with 1000 km
of streams. Area of special interest (stratum):
200 km of streams.
EMAP documents suggest that the sampling design is
"flexible to enhancement." What does this mean?
One goal of a sample survey may be to compare a sub-
population with the target population. For instance, an
opinion poll might be used to determine if a higher per-
centage of the people living in Rhode Island are likely to
vote Democratic than in the nation as a whole. Given its
small size, Rhode Island probably would receive very
little attention in a national poll if samples are allocated
randomly. One way to achieve a sample of people in
Rhode Island that is sufficient to make this comparison
is to increase sampling effort for the nation as a whole
until enough people from Rhode Island are included in
the randomly selected national sample. This option is
not very cost-effective because it requires considerable,
unnecessary sampling effort in other areas to achieve a
desired sample size in one small area.
Another, preferable, alternative would be to divide the
entire target population into two subpopulations, or
strata. Voters in the United States could be stratified
into (1) those living in Rhode Island, and (2) those living
elsewhere. A simple random sample of desired size
could then be selected from each of these groups. Stat-
isticians refer to this as stratified random sampling.
Stratified sampling designs can have any number of
strata with a different level of sampling effort in each.
Stratified sampling could be used in a stream survey to
enhance sampling effort in a watershed of special inter-
est so that its condition could be compared with that of
a larger area. In a study area with 1000 kilometers of
streams, for example, an area of special interest may
contain 200 kilometers of streams. If budget constraints
limit the size of the total sample to 60 sampling units,
30 could be randomly selected from the special interest
area, and 30 from the rest of the sampling frame. If
simple random sampling is used, the area of special
interest, which represents 20% of the area, will
contain only about 12 of the 60 selected sampling units.
A sample of 12 would be insufficient to estimate the
condition of the special interest area reliably.
-------
Doesn't enhancing the sampling intensity for an area of
special interest bias the overall estimate?
No. Sampling units inside an area of special interest
usually have a higher chance of being selected than sam-
pling units outside the special interest area. Within each
stratum, however, the chance of selecting any location
is equal; therefore, a separate (unbiased) estimate can
be computed for each stratum.
With stratified random sampling, estimates are generated
first for individual strata, then the stratum-specific
estimates are combined into an overall estimate for the
whole target population. Stratum-specific estimates are
combined by weighting each one by the fraction of all
sampling units that are within the stratum. For the
simple two-stratum example given above, the weights
would be 200/1000 for stratum 1 and 800/1000 for
stratum 2. So, if the stratum-specific estimates are 0.5
for stratum 1 and 0.25 for stratum 2, the overall esti-
mate is 0.30 [(0.5 x 2/10) + (0.25 x 8/10)]. This
approach ensures that the overall estimate is corrected
for the intentional selection emphasis within a particular
subpopulation.
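A minimal sketch of this weighted-average correction, in plain Python, using the stratum estimates and weights from the example above:

    # Combine stratum-specific estimates into an overall estimate.
    weights = [200 / 1000, 800 / 1000]   # fraction of all sampling units in each stratum
    estimates = [0.5, 0.25]              # stratum-specific estimated proportions

    overall = sum(w * p for w, p in zip(weights, estimates))
    print(round(overall, 2))             # 0.3, as in the worked example above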
EMAP's objectives state that estimates are made with
known confidence. What is "known confidence"?
An estimate of a population parameter is of limited value
without some indication of how confident one should be
in it. Scientists typically describe the appropriate level
of confidence in an estimate derived from a sample sur-
vey by defining confidence limits or margins of error. This description of statistical confidence is used frequently in reporting the results of opinion polls, using statements such as "this poll has a margin of error of ±4%." Provided random sampling is used, similar
statements can be made about estimates from biological
sample surveys.
Sample surveys provide estimates that are used to make
inferences about parameters for the population as a
whole. Two types of estimates are commonly provided:
the point estimate and the interval estimate. For ex-
ample, the estimated proportion of voters that support
a party is a point estimate. It is important to know how
likely it is that such a point estimate deviates from the
-------
Percent of Democratic voters estimated from a sample of 30; note the wide confidence interval. (Polled responses: 14 of 30 (47%); confidence interval: 29%-65%; margin of error: ±18%.)

A sample of 300 people produces a better estimate; the confidence interval is narrower. (Polled responses: 140 of 300 (47%); confidence interval: 41%-53%; margin of error: ±6%.)
true population parameter by no more than a given amount. An interval estimate for a parameter is defined by upper and lower limits estimated from the sample values. A confidence interval is constructed so that the probability of the interval containing the parameter of interest can be specified. We do not know with certainty whether an individual interval, specified as a sample estimate plus or minus a margin of error, includes the true population parameter. Over repeated sampling, however, the estimated 95% confidence intervals would include the true parameter 95% of the time. The length of the confidence interval is a measure of how precisely the parameter is estimated: a narrow interval signifies high precision. The margin of error is often used to define the upper and lower limits of the confidence interval; it is half the width of the confidence interval. Thus, if a poll estimates that 55% of the population will vote Democratic and the margin of error is ±4%, then the estimated 95% confidence interval ranges from 51% to 59%.
A great advantage of using a random sampling design is
that statisticians have developed procedures for calcu-
lating confidence intervals for the estimates. For most
R-EMAP projects, in which the goal is to estimate the
proportion of the resource that is degraded, a standard
probability distribution known as the binomial distri-
bution can be used to determine the upper and lower
bounds of confidence intervals.
What are the most important factors affecting the size of the confidence interval?
The sample size (# of sampling units collected) and the
proportion of yes answers are the primary factors affect-
ing the size of the confidence interval with binary
(yes/no) data. The effect of sample size can be illu-
strated with a pre-election poll of voters. If only 30
people are sampled, and 14 indicate that they will vote
Democratic, it would be unwise to predict the winner.
With such a small sample size, the margin of error would
be about ±18% for a 95% confidence interval. The
degree of confidence would be higher if 140 people out
of a sample of 300 say they will vote Democratic (47%
±6%), and higher still if 1400 people out of a sample
of 3000 say they will vote Democratic (47% ± 2%). In
this example, the estimated proportion of sampled voters
-------
A sample of 3000 people produces a very accurate estimate, with a narrow confidence interval. (Polled responses: 1400 of 3000 (47%); confidence interval: 45%-49%; margin of error: ±2%.)

Margin of error as a function of the percent yes responses for fixed sample sizes of 30 and 100 (90% confidence interval).

Plot of margin of error versus sample size when 20% of the population is in the YES category (P = 0.2).
who will vote Democratic stays the same (p = 47%), but
the width of the confidence interval decreases with
increasing sample size.
Confidence intervals for estimated percentages (p) are affected to a lesser degree by the proportion of yes answers (P) in the population. The widest confidence
interval occurs for P equal to 50%. For values of P
ranging from 20% to 80%, the margin of error will not
vary much with P; it will be determined mainly by the
sample size. The fact that there is a maximum margin
of error for binomial estimates of proportions is very
useful for planning a survey. If we plan for the worst
case (i.e., when half of the population is in the yes
category) we can select a sample size that ensures that
the confidence interval for P will be smaller than a
specified limit.
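These margins of error follow from the usual Normal-approximation formula, c·√(P(1 - P)/n). The sketch below is a minimal illustration, assuming the 95% Normal coefficient of 1.96 used elsewhere in this series and an arbitrary target margin of ±5% for the planning step:

    import math

    def margin_of_error(p, n, c=1.96):
        """Approximate 95% margin of error for an estimated proportion."""
        return c * math.sqrt(p * (1 - p) / n)

    # The poll examples above: 47% yes with increasing sample sizes.
    for n in (30, 300, 3000):
        print(n, round(margin_of_error(0.47, n), 2))   # ~0.18, ~0.06, ~0.02

    # Worst-case planning: P = 0.5 maximizes the margin of error, so
    # n = c^2 * 0.25 / E^2 guarantees a margin of at most E.
    E = 0.05                                   # assumed target margin
    print(math.ceil(1.96**2 * 0.25 / E**2))    # 385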
Doesn't the size of the target population affect
confidence in the estimates?
The size of the target population theoretically affects the
precision of the estimates. For most sample surveys,
however, the effect is negligible because the sampled
fraction of the target population is so small. When the
sampled fraction is small, the size of the sample rather
than the size of the target population determines the
precision of the estimate. Polling 1000 people in the
state of Rhode Island, for example, would yield as
precise an estimate as polling 1000 people in the state
of Texas. In both cases, a very small proportion of the
total population is polled.
If the sample includes a large proportion of the popu-
lation, in contrast, the accuracy of the estimate is
improved. For instance, if a local town has a population
of 1400 people, then a sample of 1200 people would
produce a substantially more accurate estimate than a
sample of 1200 people from a population of 100 million.
As the size of the sample approaches the size of the
population, statisticians adjust the confidence interval
using the finite population correction factor. In practice,
however, most sampling efforts don't sample a large
enough fraction of the population for this correction
factor to become important. That is why pollsters inter-
view approximately the same number of people for a
local election as for a presidential election.
-------
For R-EMAP projects, the fraction of the population that is sampled is generally very small. Fish assemblages, for example, are generally sampled from 100-meter segments. If 50 such samples are collected from a Region with 1000 miles of streams, the sampled fraction is about 0.0031 (50 segments x 100 m = 5 km sampled, out of roughly 1609 km of streams).
CLOSING COMMENTS
The approaches and concepts described in this overview
document are generally applicable to all R-EMAP
projects. They are appropriate whether the purpose of
sampling is to estimate the proportion of the number of
resource units (e.g., numbers of lakes), the proportion of
total length of a resource (e.g., miles of streams), the
proportion of area of a resource (e.g., square miles of an
estuary), or the proportion of volume of a resource (e.g.,
cubic meters of one of the Great Lakes). The approaches
and concepts can be applied without modification to
each of these situations.
This overview document purposefully was written non-
technically; it does not contain enough detail to help
someone analyze data. Three companion documents are
being prepared to provide additional technical detail
about recommended methods. These manuals describe
data analysis methods (1) for assessing status (e.g.,
proportion of area with degraded conditions), (2) for
assessing differences in proportions between two sub-
populations of interest (e.g., deep versus shallow areas,
two different states, two different stream orders), and
(3) for assessing long-term trends. The methods manu-
als are intended for scientists with some statistical
training. Technical documentation targeted for statis-
ticians is available from the EMAP Statistics and Design
Team in Corvallis, Oregon.
BIBLIOGRAPHY
Cochran, W. G. 1977. Sampling Techniques. 3rd ed.
John Wiley and Sons. New York.
Gilbert, R. O. 1987. Statistical Methods for
Environmental Pollution Monitoring. Van Nostrand Reinhold.
New York.
-------
Jessen, R. J. 1978. Statistical Survey Techniques. John
Wiley and Sons. New York.
Stuart, A. 1984. The Ideas of Sampling. MacMillan
Publishing Company. New York.
-------
R-EMAP
Data Analysis Approach for
Estimating the Proportion of Area that is Subnominal
Prepared for
Victor Serveiss
U.S. Environmental Protection Agency
Research Triangle Park, NC
Prepared by
Douglas Heimbuch
Coastal Environmental Services, Inc.
Linthicum, MD
Harold Wilson
Coastal Environmental Services, Inc.
Linthicum, MD
John Seibel
Coastal Environmental Services, Inc.
Linthicum, MD
Steve Weisberg
Versar, Inc.
Columbia, MD
April 1995
-------
TABLE OF CONTENTS
I. Introduction 1
II. Estimation of Proportion of Area that is Subnominal 2
II.A. The Resource, the Sample and the Estimate 3
II.B. Probability Distribution for Possible Values of the Estimate 5
II.C. Factors Affecting the Estimated Proportion 7
II.C.1. The True Proportion Subnominal 7
II.C.2. Sample Size and Variance 9
III. Construction of Confidence Limits 10
III.A. What are Confidence Limits? 11
III.B. Factors Affecting Width of the Confidence Interval 14
III.C. How to Compute Confidence Limits 16
III.C.1. Standard Graphs and Tables for Confidence Limits 16
III.C.2. Normal Approximation 17
IV. Data Analysis for Stratified Random Sampling 19
V. Closing Comments 21
-------
I. Introduction
The Environmental Monitoring
and Assessment Program (EMAP) is an
innovative, long-term research and
monitoring program that is designed to
measure the current and changing
conditions of the nation's ecological
resources. EMAP achieves this goal by
utilizing sample survey approaches
which allow scientific statements to be
made for large areas based on
measurements taken at a sample of
locations. Regional-EMAP (R-EMAP) is
a partnership among EMAP, EPA
Regional offices, other federal
agencies, and states. R-EMAP is
adapting EMAP's broad-scale approach
to produce ecological assessments at
regional, state, and local levels.
The sample survey approaches
utilized by R-EMAP are very efficient in
terms of the (small) number of
locations that need to be sampled in
order to make valid scientific
statements about the condition of a
large area (e.g., all estuarine waters
within a Region). This efficiency
carries with it a small additional cost,
however. Specialized data analysis
methods must be applied to ensure that
the results are scientifically valid.
This document is the first in a
series of methods manuals being
prepared to assist the R-EMAP partners
in implementing EMAP's sampling
approach. These manuals build upon
basic concepts that were addressed in
the document "Answers to Commonly
Asked Questions About the EMAP
Sampling Design" by providing a more
thorough discussion of specific topics.
The intended audience of the manuals
are scientists without extensive
statistical training who may become
involved in analysis of R-EMAP data.
Technical documentation, written for
statisticians, is also being prepared.
This manual describes two data
analysis methods for assessing the
status of ecological condition. One
primary measure of ecological condition
addressed by EMAP and R-EMAP is the
proportion of area that has subnominal
(i.e., not meeting some environmental
criterion) conditions. This manual
describes methods for:
o Estimation of the proportion of
area that has subnominal
conditions, and
o Construction of confidence
intervals for the estimates of
proportion of area.
These methods are equally applicable
to any type of proportion including
proportion of numbers (e.g., numbers
of lakes), proportion of length (e.g.,
miles of streams), proportion of area
(e.g., square miles of estuaries), or
proportion of volume (e.g., cubic
meters of a lake) that has subnominal
conditions. The methods are
appropriate only for a sampling
program in which 1) every location
within the resource of interest has the
same chance of being selected for
sampling, and 2) the selection of any
one location does not affect the chance
of selection for any other location. The
-------
methods can also be applied to data
from stratified sampling if these two
conditions are satisfied within each
defined stratum.
These methods are described
along with an in-depth discussion of
underlying concepts. Underlying
concepts (rather than 'cook-book'
instructions) are emphasized for two
reasons. The first is that proper
interpretation of the results from the
data analyses requires an
understanding of the underlying
concepts. Correct interpretation of the
results of data analyses is a key link
between quality data and sound
resource management decisions. The
second reason is that each of the R-
EMAP projects is unique and may
require custom application of the data
analysis methods. Thoughtful
application of these methods cannot
occur without an understanding of the
underlying concepts. Furthermore, a
solid understanding of the underlying
concepts can be a great help when
defending results and conclusions.
II. Estimation of Proportion of
Area that is Subnominal
In this section, the recommended
method for estimating proportion of
area that is subnominal is described
and the rationale for the method is
presented. Also, properties of the
estimates are discussed. To make this
information easier to understand, the
distribution pattern of the response
variable within the resource of interest
is treated as if it is known with
certainty (i.e., a map of the response
variable for the entire resource of
interest is presented). Clearly,
complete information of this kind is
never available in practice; if it were,
there would be no need to sample!
Although the recommended method
will provide an estimate of the
proportion of area that is subnominal, it
does not provide an estimate of the
location of the subnominal areas. The
only locations that are truly known to
be subnominal are the specific sampled
locations. Other analysis approaches
may be used to map the actual
subnominal areas.
This section is organized into three
parts. The first section contains a
discussion of the general relationship
between a) the true distribution pattern
of the response variable, b) the sample
of response variables from selected
locations, and c) the estimate of
proportion of area that is subnominal.
Next, the probability of observing
different estimated values, based on
which (randomly selected) locations are
included in the sample, is discussed.
Finally, the last section contains a
discussion of factors that affect the
estimate of the proportion of area that
is subnominal, and the kinds of effects
generated by these factors.
-------
II.A. The Resource, the Sample and the
Estimate
Figure 1. Hypothetical resource with a
subnominal area proportion of 0.2.
Figure 2. Hypothetical resource with 15
sampling locations.
For the purposes of assessing the
proportion of area that is subnominal, a
simple map of the resource of interest can
be envisioned in which areas that are
subnominal are shaded and all other areas
are left unshaded (Figure 1). The resource
depicted in Figure 1 is a hypothetical estuary
with the upstream and downstream limits of
the resource of interest marked with dotted
lines. The shaded areas, in this case, might
represent areas with concentrations of
metals in sediments that are in excess of
some standard. For the example depicted in
Figure 1, a total of 200 square miles are
subnominal and the total area of the
resource of interest (i.e., shaded plus non-
shaded areas) is 1000 square miles. The
true proportion of area that is subnominal is
the ratio of a) the extent of shaded areas
(e.g., 200 square miles) divided by b) the
entire extent of the resource of interest
(e.g., 1000 square miles). Therefore, 0.2 is
the true proportion of area that is
subnominal for this example. The true
proportion is often referred to (e.g., in
textbooks) as the 'population parameter' P.
Now suppose that a sample of 15
locations within the resource of interest are
selected at random. Furthermore, suppose
the random selection of locations is made so
that a) every location within the resource of
interest has an equal chance of being
selected, and b) after each selection of a
location, all locations are again equally likely
to be chosen as the next selected location
(Figure 2). This would be like blindly
throwing a dart at the map 15 times, each
-------
time ignoring where the previous darts
landed.
&>mp*n« Point
- In nen-cubnwnlnil «rai
- In lubnomlnil «r»i
Figure 3. Hypothetical resource with 3 of 1 5
samples in subnominal areas.
Sunpivg Point
- In norv-cubnemlnil »r»«
- Ir ubnomlnil ir*a
Figure 4. Hypothetical resource with 5 of 15
samples in subnominal areas.
After the locations are selected, the
condition of the resource (e.g., whether or
not the metals concentration exceeds the
standard) is recorded. In practice this might
be accomplished by sending a field crew to
the location to collect a sample for
laboratory analysis. For this example, a
selected location is designated subnominal if
it is in a shaded area of the map (Figure 3).
Next, the total number of selected locations with subnominal condition is recorded (call this number 'x'), and the total sample size (call this number 'n') is noted. These two numbers, x and n, are all that are needed to estimate the proportion of area that is subnominal, and to construct the confidence limits for the estimated proportion.

The estimate of the proportion of area that is subnominal (referred to as p̂) is simply the ratio of x divided by n:

p̂ = x / n .

For the sample depicted in Figure 3, x = 3 and n = 15. Accordingly, the estimated proportion for this example is 3 divided by 15, which is equal to 0.2. In this case, the estimate is the same as the true population proportion. This will not always occur. For example, the randomly selected locations could have produced 5 samples that were subnominal instead of 3 (Figure 4). In this case the estimate would have been 5 divided by 15, which is equal to 0.33. This estimate is not equal to the true proportion. In fact, the estimated value can be any one of 16 numbers from 0 to 1 (i.e., 0/15, 1/15, 2/15, ..., and 15/15). However, it is much more likely for the estimate to be near 0.2 (i.e., the true proportion) than any other value.
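As a minimal illustration, the estimate is nothing more than the ratio x/n; the snippet below (plain Python) reproduces the two outcomes discussed above.

    # Estimated proportion of subnominal area: p-hat = x / n.
    n = 15                          # total sample size
    for x in (3, 5):                # subnominal counts from Figures 3 and 4
        print(x, round(x / n, 2))   # 0.2 and 0.33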
-------
II.B. Probability Distribution of Possible Values of the Estimate

Figure 5. Frequency distribution of p̂ based on one realization of 1000 random selections (15 samples, P = 0.2).
Figure 6. Probability distribution of p̂ based on the exact Binomial distribution (15 samples, P = 0.2).
The chance of observing each of the
possible values of the estimate can be
summarized in what is referred to as a
probability distribution (or sampling
distribution) of the estimate. The probability
distribution depicts the likelihood of each
possible outcome (i.e., estimate of the
proportion) of the random sampling compared
to all other possible outcomes and provides a
basis for assessing the reliability of the
estimate.
The probability distribution of possible values of the estimate can be approximated by repeating the random selection of locations over and over. For each random selection of 15 locations, the estimate p̂ is recorded and a cumulative tally is kept of the number of times each possible value of p̂ is observed. For example, with 1000 random selections (of 15 locations) the frequency distribution depicted in Figure 5 is produced. Because each of the 1000 random selections (of 15 locations) is equally likely, the value of p̂ with the highest tally (or frequency) is the most likely value of p̂. Furthermore, the probability of observing any particular value of p̂ is the limit (as the number of random selections goes to infinity) of the ratio of a) the number of selections producing that value of p̂, to b) the total number of selections of 15 locations (Figure 6). Accordingly, the y-axis in Figure 6 has a possible range from 0 to 1.
In practice, it is not necessary to repeatedly sample the resource in order to construct the probability distribution of the estimated values. The probability distribution of estimates of proportion (based on data from the type of random sampling addressed in this document) can always be described by a standard distribution called the Binomial distribution. The Binomial distribution is fully defined by only two parameters: the true proportion and the sample size. Therefore, the probability distribution of possible values of the proportion of area can be constructed by plugging a value for the true proportion (assumed to be known in this section of the manual) and the sample size into the formula for the Binomial distribution.

Figure 7. Average value of p̂ (15 samples, P = 0.2).
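Readers who wish to reproduce a distribution such as the one in Figure 6 can evaluate the Binomial probabilities directly. A minimal sketch, assuming the SciPy library is available:

    from scipy.stats import binom

    n, P = 15, 0.2                     # sample size and true proportion
    for x in range(n + 1):
        p_hat = x / n                  # a possible value of the estimate
        prob = binom.pmf(x, n, P)      # probability of observing x subnominal locations
        print(f"{p_hat:.2f}  {prob:.4f}")

    # The mean of the distribution of p-hat equals the true proportion:
    print(binom.mean(n, P) / n)        # 0.2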
Also, it is worth noting that the average of all possible values of p̂ is equal to the true proportion. This can be seen by envisioning the probabilities (Figure 6) as weights on a board. The average is the center of mass for the weights (Figure 7) and is equal to the true proportion. In general, if the mean of the probability distribution of an estimate is equal to the parameter being estimated (in this case the true proportion), then the method of estimation is referred to as unbiased. Because this condition is satisfied for the recommended method of estimation, this method is unbiased.
Figure 8. Hypothetical resource with scattered
subnominal areas (P = 0.2).
Another important property of this method of estimation is that the specific pattern of subnominal areas on the map does not affect the estimates (provided that the true proportion remains unchanged). For example, with the shaded areas more scattered (Figure 8), the probability distribution of values of p̂ is exactly the same as the probability distribution (Figure 6) associated with the map in Figure 1. Therefore the same sampling design can be used regardless of the underlying spatial pattern, which generally is unknown. The specific pattern of the response variable within the resource of interest does not affect the probability distribution of the estimate (p̂). However, as described in the next part of this section, the probability distribution of p̂ is affected by the true proportion and by the sample size (n).
-------
Figure 9. Hypothetical resource with a subnominal area proportion of 0.3.

II.C. Factors Affecting the Estimated Proportion

II.C.1. The True Proportion Subnominal
There is a different probability distribution of values of p̂ for every possible value of the true proportion. For example, if the total area that is subnominal is 300 square miles (Figure 9), the true proportion is 0.3 (i.e., 300 square miles divided by 1000 square miles). The probability distribution for estimated values of the proportion can be generated from the formula for the Binomial distribution. In this case the probability distribution is as depicted in Figure 10. Notice that the distribution has shifted to the right. The most likely values are near 0.3 and the mean of the distribution is exactly 0.3 (Figure 11).
Figure 10. Probability distribution of p̂ (15 samples, P = 0.3).

Figure 11. Average value of p̂ (15 samples, P = 0.3).
-------
Figure 12. Hypothetical resource with
subnominal area proportion of 0.4.
The same exercise can be conducted
for a map in which 400 square miles are
subnominal (Figure 12). In this case the true
proportion is 0.4 (i.e., 400 square miles
divided by 1000 square miles). The
probability distribution for a true proportion
equal to 0.4 is depicted in Figure 13. Now
the most likely values are near 0.4 and the
mean of the probability distribution is exactly
0.4 (Figure 14).
The mean of the probability distribution of values of p̂ is always exactly equal to the true proportion. This is true for any value of the true proportion (from 0.0 to 1.0), and is true regardless of the spatial pattern of subnominal areas within the resource of interest. Also, the most likely values of p̂ are always near the true proportion. Furthermore, by increasing the sample size, the probability that the estimate will be very close to the true proportion can be increased.
Figure 13. Probability distribution of p̂ (15 samples, P = 0.4).

Figure 14. Average value of p̂ (15 samples, P = 0.4).
-------
II.C.2. Sample Size and Variance
Figure 15. Hypothetical resource with 30
samples, subnominal area proportion of 0.2.
Figure 16. Probability distribution of p̂ (30 samples, P = 0.2).
Intuitively, it seems that estimates based on larger samples should be more reliable than estimates based on just a few locations. The effect of sample size on the probability distribution of values of p̂ supports this position. As can be seen from the following examples, the effect of sample size on the probability distribution of values of p̂ can be quite dramatic.
First consider a random sample of 30 locations from a resource with a true proportion of subnominal area equal to 0.2 (Figure 15). The probability distribution of values of p̂ in this case is depicted in Figure 16. The probability distribution is much more concentrated around the true proportion of 0.2. Also, notice that instead of 16 possible values of p̂, there are now 31 possible values (i.e., 0/30, 1/30, 2/30, ..., and 30/30). This provides a finer scale of resolution for the estimates.

Now consider a random sample of 50 locations from the same resource (Figure 17). The probability distribution of values of p̂ is extremely concentrated around the true proportion of 0.2 (Figure 18). The scale of resolution for the estimates is improved as well. There are now 51 possible values of p̂ (i.e., 0/50, 1/50, 2/50, ..., and 50/50) with a step size between possible values of only 0.02 (i.e., 1/50).
The benefits of increased sample size
are readily apparent from these examples.
The scale of resolution of the estimates is
improved and the spread or dispersion of the
probability distribution is reduced with
increased sample size. More specifically, the dispersion of the probability distribution of possible values of p̂ (as measured by the variance) is inversely proportional to the sample size. For example, if the true proportion is 0.2 and the sample size is 30, then the variance is equal to 0.0053 (i.e., [0.2 x 0.8] / 30), whereas if the sample size is 50, then the variance is equal to only 0.0032 (i.e., [0.2 x 0.8] / 50).

Figure 17. Hypothetical resource with 50 samples, subnominal area proportion of 0.2.

Figure 18. Probability distribution of p̂ (50 samples, P = 0.2).
The variance is a useful summary
measure of the spread of a probability
distribution. It can also be used in
constructing confidence limits in some
cases. For example, if the probability
distribution of the estimate is known to be
approximately Normal in shape (Figure 19),
then knowledge of the variance can be used
to construct confidence limits. However, in general, the Normal approximation may not be accurate. This is particularly true for
estimates based on relatively small sample
sizes. The following section describes a
method for constructing confidence limits
that produces accurate limits for any sample
size. Confidence limits based on the Normal
approximation are also described, and
conditions under which the Normal
approximation is appropriate are discussed.
III. Construction of Confidence Limits
Figure 19. Probability distribution of p̂ that is approximately Normal.
After reviewing Section II (Estimation of Proportion of Area that is Subnominal) it should be clear why confidence limits, in addition to the estimate (p̂), are sometimes needed. Although values of p̂ near the true proportion are the most likely values, it is possible to observe a value of p̂ that is as small as 0.0 or as large as 1.0, regardless of
the value of the true proportion. This raises
the question: even though you have an
-------
estimate of the proportion of area that is
subnominal, how can you be sure that the
true proportion isn't some other number
entirely? Unfortunately, you can't be sure.
However, you can put confidence limits
around the estimate.
In this section, the recommended
method for constructing confidence limits is
described and the rationale for using the
method is presented. For the purposes of
this section, the pattern of the response
variable within the resource of interest is
treated as if it is not known. This is in
contrast to the previous section, and
represents the more realistic situation.
This section is organized into three
parts. In the first part, the concept of
confidence limits for estimates of
proportions is discussed, and the
recommended approach for constructing
confidence limits is presented. Next, factors
that affect the width of confidence limits are
described. In the final part of this section,
standard graphs and tables for exact
confidence limits, and the use of an
approximation (the Normal approximation)
are discussed.
III.A. What are Confidence Limits?
Confidence limits are bounds around
the estimate that are determined such that
there is a known probability that the bounds
will bracket the true proportion. For
example, 90% confidence limits have the
property that over many replications of
sample selection and confidence interval
calculation, 90% of the resulting intervals
will cover the true value. Therefore, with
symmetric confidence limits there is a 5% chance that the lower limit will be greater than the true proportion and there is a 5% chance that the upper limit will be less than the true proportion.

Figure 20. Probability distribution of p̂ (30 samples, P = 0.3); the shaded portion is prob{p̂ ≥ 0.3}.

Figure 21. Construction of the lower 5% confidence bound (panels show the distribution of p̂ for hypothesized true proportions of 0.19, 0.18, 0.17, and 0.16).
The approach for constructing
confidence limits for estimates of proportion
may be understood by considering a simple
example. Suppose 30 locations are randomly
selected and measurement at these locations
generates an estimate of the proportion of
area that is subnominal equal to 0.3. As is
almost universally the case, the true
proportion subnominal is not known. To
place a lower bound on the estimate we can
start by asking the question: If the true proportion was 0.20, what would be the probability of observing an estimate of 0.3 or larger? The answer to the question can be found in the probability distribution of values of p̂ for the case of a true proportion equal to 0.20 and a sample size of 30 (Figure 16). The answer is the sum of the probabilities from 0.3 through 1.0 (Figure 20), which for this example is 0.13. This means that even if the true proportion was as low as 0.2 there would be a 13% chance of observing an estimate of 0.3 or larger. Therefore, 0.2 can be taken as the lower 13% confidence bound.

If some pre-determined probability level (e.g., 5%) is desired, a range of hypothesized values of the true proportion could be evaluated. For example, the cumulative probabilities (for values of p̂ of 0.3 through 1.0) could be computed for cases of the true proportion being 0.19, 0.18, 0.17 and 0.16 (Figure 21). The cumulative probabilities of greater values of p̂ for these four scenarios are 0.10, 0.08, 0.06, and 0.04. Therefore, the lower 5% confidence bound is between 0.17 and 0.16 (further evaluation can show that, to two decimal places, the bound is 0.17).
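This scan over hypothesized true proportions can be carried out directly from the Binomial distribution. In a sample of 30, an estimate of 0.3 corresponds to x = 9 subnominal locations, and prob{p̂ ≥ 0.3} is the upper tail P(X ≥ 9). A minimal sketch, assuming SciPy is available:

    from scipy.stats import binom

    n, x = 30, 9     # sample size and observed subnominal count (p-hat = 0.3)
    for P in (0.20, 0.19, 0.18, 0.17, 0.16):
        tail = binom.sf(x - 1, n, P)   # P(X >= x), i.e., prob{p-hat >= 0.3}
        print(P, round(tail, 2))
    # The tail probabilities should be close to those quoted above
    # (about 0.13, 0.10, 0.08, 0.06, and 0.04).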
Figure 22. Probability distribution of p̂ (30 samples, P = 0.45); the shaded portion is prob{p̂ ≤ 0.3}.

Figure 23. Construction of the upper 5% confidence bound (panels show the distribution of p̂ for hypothesized true proportions of 0.46, 0.47, 0.48, and 0.49).
Similarly, to place an upper bound on the estimate, we can start by asking the question: If the true proportion was 0.45, what would be the probability of observing an estimate of 0.3 or smaller? In this case we need to examine the probability distribution of p̂ for a sample size of 30 and assuming the true proportion is equal to 0.45. The answer in this case is the sum of probabilities from 0.0 through 0.3 (Figure 22), which for this example is 0.07. Therefore, 0.45 can be taken as the upper 7% confidence bound.

Again, if a pre-determined probability level (e.g., 5%) is desired, a range of hypothesized values of the true proportion could be evaluated. In this case, cumulative probabilities (for values of p̂ of 0.0 through 0.3) could be computed for true proportion values of 0.46, 0.47, 0.48 and 0.49 (Figure 23), for example. The cumulative probabilities for these four scenarios are 0.06, 0.04, 0.04 and 0.03. Therefore, the upper 5% confidence bound is between 0.46 and 0.47 (further evaluation can show that, to two decimal places, the bound is 0.47).
The upper 5% confidence bound and
the lower 5% confidence bound form the
90% confidence limits for the proportion of
area that is subnominal. For this example
(i.e., with a sample size of 30 and an
observed value of p̂ equal to 0.3), the 90%
confidence limits are 0.17 and 0.47. There
is a 90% chance that confidence limits
constructed in this manner will bracket
the true proportion, and a 10% chance
that the limits will miss the true proportion.
This result can be demonstrated by repeating
-------
the random selection of 30 locations from a
known pattern of subnominal areas (as was
discussed in Section II.A) over and over. For
each of the random selections, the value of p̂
can be computed and the corresponding
confidence limits determined. In 90 out of
100 iterations (on average), the confidence
limits will bracket the true proportion,
regardless of the value of the true proportion.
III.B. Factors Affecting Width of the
Confidence Interval
The interval between the lower and
upper 90% confidence limits is the 90%
confidence interval. In the example
discussed above, the width of the 90%
confidence interval is 0.30 (0.47 minus
0.17). Two factors (given a specific
estimate, p̂) affect the width of the
confidence interval: the confidence level
(e.g., 90%), and the sample size. If a higher
level of confidence had been specified, say
95%, then the confidence interval would
have been wider. On the other hand, if the
sample size had been larger, then the
confidence interval would have been
narrower.
The fact that the width of a confidence
interval increases as the confidence level
increases is intuitively appealing. This is
consistent with being more confident about
making general statements (wide intervals)
and being less confident about making more
specific statements (narrow intervals). The
reason for this effect of confidence level on
confidence intervals is clear from the way the
confidence limits are determined. For
the example discussed above, if a higher
-------
Figure 24. Probability distribution of p̂ (30 samples, P = 0.15); the shaded portion is prob{p̂ ≥ 0.3}.
confidence level is desired (e.g., 95% rather than 90%), then the upper and lower 2.5% confidence bounds would be used. The true proportion would have to be 0.15 in order for the cumulative probability (from 0.3 through 1.0) to equal only 2.5% (Figure 24). Similarly, the true proportion would have to be 0.49 in order for the cumulative probability (from 0.0 through 0.3) to equal only 2.5% (Figure 25). The effect of varying confidence levels from 75% to 99% on the size of the confidence intervals is summarized in Figure 26 (for a sample size of 30 and an observed p̂ of 0.3).
Figure 25. Probability distribution of p̂ (30 samples, P = 0.49); the shaded portion is prob{p̂ ≤ 0.3}.

Figure 26. Confidence interval width as a function of confidence level (30 samples, p̂ = 0.3).
-------
Figure 27. Confidence interval width as a function of sample size (90% confidence, p̂ = 0.3).
The fact that the width of the confidence interval decreases as the sample size increases is also intuitively appealing. Increased sample size, as discussed in Section II, increases the reliability of estimates and should allow more detailed statements to be made without diminishing the level of confidence. For example, with a sample size of 60 and assuming the true proportion is 0.17 (the lower 5% bound for a sample size of 30), there is only a 1% chance that the observed value of p̂ would be 0.3 or greater. The 5% lower confidence bound in this case is 0.20. Similar effects are exhibited with the upper confidence bound. The effect of varying sample size on the size of confidence intervals is summarized in Figure 27 (for 90% confidence and an observed p̂ of 0.3) for a range of sample sizes from 10 to 100.
III.C. How to Compute Confidence Limits
III.C.1. Standard Graphs and Tables for Confidence Limits
No special calculations are needed to
determine confidence limits for the situations
described above. The required confidence
limits are tabulated and summarized in
standard graphs in many textbooks and
handbooks on statistics (e.g., see W.H. Beyer
[ed.] 1976. CRC Handbook of Tables for
Probability and Statistics. CRC Press). The
confidence limits are referred to as
"Confidence Limits for Proportions" for the
"Binomial Distribution". Separate tables are
published for different confidence levels
(usually 90%, 95% and 99%). The tables are
read by specifying x (referred to as the
-------
numerator or the number of successes)
and n (referred to as the denominator
or the sample size). The corresponding
table entries are the lower and upper
confidence limits. This information is
also summarized in graphs that depict
the upper and lower confidence limits
on the y-axis and the estimated
proportion on the x-axis.
III.C.2. Normal Approximation
An alternative approach to the
one based on the Binomial distribution
(described above) is to construct the
confidence limits based on the Normal
distribution. Construction of
confidence limits based on the Normal
distribution provides a greater degree
of flexibility which can be
advantageous for more complex
sampling designs (e.g., stratified
random designs discussed in the next
section).
As discussed in the previous
section the Binomial distribution exactly
describes the probability distribution of
possible values of the estimate.
However, under certain circumstances,
the Normal distribution is a good
approximation to the Binomial
distribution. In these cases, confidence
limits based on the Normal distribution
may be used instead of those based on
the Binomial distribution.
Approximate confidence limits,
based on the Normal distribution, are
computed from a simple formula.
Therefore, confidence limits do not
have to be restricted to confidence
levels and sample sizes listed in
standard tables and graphs, and
interpolations between tabulated values
are not required. For example, a
standard table of Binomial confidence
limits may list confidence limits for
confidence levels of 95% and 99%,
and may list sample sizes in steps of
10 (e.g., 10, 20, 30, etc.). By using
the Normal approximation, any
combination of confidence level and
sample size can be addressed directly
(e.g., 85% confidence and a sample
size of 53). The Normal approximation
requires information on only two
quantities: a coefficient based on the
Normal distribution, and the variance of
the probability distribution of possible
values of the estimate (p̂).
The variance of the probability distribution of possible values of the estimate can be estimated as the product of the estimated proportion times one minus the estimated proportion, all divided by the sample size:

p̂ (1 - p̂) / n .

This is the same as the expression for the variance presented in Section II.C.2, except that the true proportion (which in practice is unknown) is replaced by the estimate of the proportion (p̂). For example, if p̂ is 0.4 and the sample size is 50, then the estimate of the variance is 0.0048 (i.e., [0.4 x 0.6] / 50).
-------
The required Normal coefficients are tabulated in most introductory statistics textbooks as well as in advanced texts. Generally the tabulations are presented in steps of 1% or less (e.g., 90%, 91%, 92%, etc.). For the standard confidence levels of 90%, 95% and 99%, the Normal coefficients (c) are 1.65, 1.96, and 2.58, respectively.

The lower confidence limit based on the Normal approximation is simply the estimated value minus a quantity equal to the product of the Normal coefficient (c) for the desired confidence level and the square root of the estimated variance:

p̂ - c √( p̂ (1 - p̂) / n ) .

Similarly, the upper confidence limit based on this approximation is the estimated value plus that same quantity:

p̂ + c √( p̂ (1 - p̂) / n ) .

For a 95% confidence interval based on the example in the previous paragraph, the lower confidence limit is 0.26,

0.4 - 1.96 √0.0048 ,

and the upper confidence limit is 0.54,

0.4 + 1.96 √0.0048 .
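A minimal sketch of this calculation in plain Python:

    import math

    p_hat, n, c = 0.4, 50, 1.96           # estimate, sample size, 95% Normal coefficient
    half_width = c * math.sqrt(p_hat * (1 - p_hat) / n)
    print(round(p_hat - half_width, 2))   # 0.26
    print(round(p_hat + half_width, 2))   # 0.54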
As noted previously, the Normal
approximation is not always accurate. In
particular, it is not accurate if the sample
size is too small. Working rules have been
-------
established (e.g., see W. G. Cochran. 1977. Sampling Techniques. John Wiley and Sons) regarding the minimum sample size that is required when using the Normal approximation. The required minimum sample size is larger for estimates of the proportion (values of p̂) close to 0.0 or 1.0, and is smallest if the estimate is 0.5 (Table 1). The required minimum sample size for all situations is never less than 30 and is as large as 1400 when the estimated proportion is 0.05 (or 0.95). Clearly, the Normal approximation must be applied with caution. Whenever possible, the exact confidence limits based on the Binomial distribution should be used.

Table 1. Minimum sample sizes required for the Normal approximation.

    p̂ (or 1 - p̂)    minimum n
    0.5              30
    0.4              50
    0.3              80
    0.1              600
    0.05             1400
IV. Data Analysis for Stratified
Random Samples
The discussion of data analysis methods
up to this point has assumed that all locations
within the resource of interest have an equal
chance of being selected for sampling. This
may not always be the case. In some
situations, areas of special interest (strata)
may be identified and additional sampling
effort expended in these areas. Although
every location within a stratum may have an
equal chance of being selected for sampling,
a location within a special interest area would
have a higher chance of being selected than
a location outside the special interest area.
To ensure that estimates are unbiased, the
analysis of data from this type of sampling
design (referred to as a stratified random
sampling design) must account for the
different levels of sampling effort in the
different strata.
The recommended method for analyzing
data from stratified sampling designs is
straightforward and intuitively appealing.
-------
The method is illustrated in the following
paragraphs with a hypothetical example for
the case of two strata. The method,
however, is not limited to two strata and can
be extended to analyze data from a stratified
random sample with any number of strata.
Figure 28. Hypothesized resource divided into two strata.

Figure 29. Stratified resource with 20 samples in Stratum 1 and 10 samples in Stratum 2.
Suppose the resource being studied is
divided into two strata: a 200 square mile
area of special interest (labeled Stratum 1 in
Figure 28) and the remaining 800 square
miles of the resource (labeled Stratum 2).
The intention is to be able to characterize
the entire resource but also to characterize
just the area of special interest. For this
reason, suppose that most samples are
allocated to stratum 1. For this example,
the total sample size of 30 is split between
the two strata with 20 samples going to
stratum 1 and 10 samples going to stratum
2. Within each stratum, the sampling
locations are selected randomly (Figure 29).
Two steps are needed to estimate the
proportion of the total area (i.e., the entire
resource) that is subnominal in this case.
First, a separate estimate is computed for
each of the two strata (say p̂₁ and p̂₂) using
the methods described in the foregoing
sections. For this example, the estimate for
stratum 1 is based on 20 samples, and the
estimate for stratum 2 is based on 10
samples. The second step is to compute a
weighted average of these two estimates.
The weight associated with each stratum is
the ratio of the area of the stratum divided
by the total area of the resource. For this
example the weight for stratum 1 is 0.2 (i.e.,
200/1000) and the weight for stratum 2 is
-------
0.8 (i.e., 800/1000). Therefore, the weighted average (p̂) is:

p̂ = 0.2 p̂₁ + 0.8 p̂₂ .

The resulting estimate for the entire resource is unbiased.

A confidence interval for the overall estimate can be calculated based on the Normal approximation. Using the previous example, the estimated variance of the estimate from stratum 1 is [p̂₁(1 - p̂₁)] / 20, while the estimated variance of the estimate from stratum 2 is [p̂₂(1 - p̂₂)] / 10. The estimated variance of the weighted average is the weighted sum of the variances from the two strata:

Var(p̂) = (0.2)² [p̂₁(1 - p̂₁)/20] + (0.8)² [p̂₂(1 - p̂₂)/10] .

Each weight used to compute the overall variance is simply the square of the corresponding weight that was used to compute the overall proportion (p̂).

The lower limit of the confidence interval is the weighted average proportion minus the quantity equal to the product of the Normal coefficient (c) for the desired confidence level and the square root of the variance of the weighted average:

p̂ - c √Var(p̂) .

Similarly, the upper confidence limit is the weighted average proportion plus that same quantity. As with any use of the Normal approximation, adequate sample sizes in each stratum are necessary for reliable results.
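The full stratified calculation can be sketched as follows. The stratum weights and sample sizes are those of the example above; the stratum-specific estimates (0.5 and 0.25) are assumed here purely for illustration, and the 90% Normal coefficient is 1.65:

    import math

    weights = [0.2, 0.8]     # stratum area divided by total area
    p = [0.5, 0.25]          # assumed stratum-specific estimates (illustrative only)
    n = [20, 10]             # number of samples in each stratum
    c = 1.65                 # Normal coefficient for 90% confidence

    p_overall = sum(w * ph for w, ph in zip(weights, p))
    var_overall = sum(w**2 * ph * (1 - ph) / m
                      for w, ph, m in zip(weights, p, n))
    half_width = c * math.sqrt(var_overall)
    print(round(p_overall, 3))                 # overall estimate
    print(round(p_overall - half_width, 3),    # 90% confidence limits
          round(p_overall + half_width, 3))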
V. Closing Comments
The methods described in this
document are appropriate for analyzing
data from simple random samples and
stratified random samples. However,
some R-EMAP programs use neither
simple random nor stratified random
sampling designs. In these cases, the
described methods should be used only as a last resort, and will produce only approximations. A more general method of data analysis should be used that is consistent with the complexity of the sampling design. This more general approach is conceptually similar to the described methods, but is more involved and requires that the probability of selecting each location and the probability of selecting every pair of locations be known. This additional information may not always be available. If any doubt exists regarding which method to use, the EMAP Statistics and Design Team at the EPA Corvallis Laboratory can make the determination.
The hypothetical example referred
to throughout this document is
intended simply as an instructional tool.
The methods described are not limited
to analyzing data from estuaries. They
are appropriate whether the purpose of
-------
sampling is to estimate the proportion
of the number of resource units (e.g.,
numbers of lakes), the proportion of
total length of a resource (e.g., miles of
streams), the proportion of area of a
resource (e.g., square miles of an
estuary), or the proportion of volume of
a resource (e.g., cubic meters of one of
the Great Lakes). The methods can be
applied without modification to each of
these situations. More detailed,
technical documentation on the
methods described in this document is
available from the EMAP Statistics and
Design Team in Corvallis.
-------
EMAP Status Estimation:
Statistical Procedures and Algorithms
V.M. LESSER AND W.S. OVERTON
Department of Statistics, Oregon State University,
Corvallis, Oregon
Project Officer
Anthony R. Olsen
U.S. Environmental Protection Agency
Environmental Research Laboratory
200 SW 35th Street, Corvallis, Oregon
The information in this document has been funded wholly or in part by the U.S.
Environmental Protection Agency under cooperative agreement CR-816721 with Oregon
State University at Corvallis. It has been subject to the agency's peer and administrative
review. It has been approved for publication as an EPA document.
-------
CONTENTS
INTRODUCTION 1
1.1 Overall Design 2
1.2 Resources 2
1.3 Response Variables of Interest 3
1.4 Statistical Methods 4
1.5 Use of this Manual 6
GENERAL THEORETICAL DEVELOPMENT 9
2.1 Design-Based Estimation Methods 9
2.1.1 Discrete Resources 9
2.1.1.1 General Estimator and Assumptions 10
2.1.1.2 Tiered Structure 14
2.1.2 Extensive Resources 22
2.1.2.1 Areal Samples 23
2.1.2.2 Point Samples 25
2.1.2.3 Alternative Variance Estimators 29
2.2 Model-Based Estimation Methods 33
2.2.1 Prediction Estimator 34
2.2.2 Double Samples 36
2.2.3 Calibration 37
2.3 Other Issues 38
2.3.1 Missing Data 38
2.3.1.1 Missing Sampling Units 38
2.3.1.2 Missing Values within Sampling Units 39
2.3.2 Censored Data 39
2.3.3 Combining Strata 41
2.3.3.1 Discrete Resources 42
2.3.3.2 Extensive Resources 43
2.3.4 Additional Sources of Error 43
2.3.5 Supplementary Units or Sites 44
DISTRIBUTION FUNCTION ALGORITHMS 45
3.1 Discrete Resources 46
3.1.1 Estimation of Numbers 47
Equal probability sampling 47
Case 1 - N known/unknown, Na known 48
Case 2 - N known, Na unknown 53
Case 3 - N known/unknown, Na known/unknown 55
Variable probability sampling 56
Case 4 - Na unknown or Na known and equal Na 57
Case 5 - Na known and not equal Na 60
3.1.2 Proportions of Numbers 61
Equal probability sampling 61
Case 6 - Na known/unknown 62
Case 7 - Na known 67
Variable probability sampling 68
Case 8 - Na unknown or Na known and not equal Na 69
Case 9 - Na known and equal Na 72
3.1.3 Rationales for the Algorithms in Sections 3.1.1 and 3.1.2 74
SECTION 1
INTRODUCTION
The Environmental Protection Agency (EPA) has initiated a program to monitor
ecological status and trends and to establish baseline environmental conditions against
which future changes can be monitored (Messer et al., 1991). The objective of this
environmental program, referred to as EMAP (Environmental Monitoring and Assessment
Program), is to assess the status of a number of different ecological resources, including
surface waters, wetlands, Great Lakes, near-coastal waters, forests, arid lands, and
agroecosystems.
A design plan and a number of support documents have been prepared to guide
design development for EMAP (Overton et al., 1990; Overton and Stehman, 1990; Stehman
and Overton, in press; Stevens, in press). The statistical methods outlined in earlier
documents, such as those analyzing the EPA National Surface Water Surveys, are also
relevant to EMAP (Linthurst et al., 1986; Landers et al., 1987; Messer et al., 1986).
This report presents statistical procedures collected from other EMAP documents, as
well as from Oregon State University technical reports describing data analyses for other
EPA designs. By integrating this information, this manual and the EMAP design report
will serve as reference sources for statisticians who implement an ecological monitoring
program based on the EMAP design framework. Spatial and temporal analyses of EMAP
data are not covered in this version of the report. A brief discussion of the four-point
moving average, which combines data over the interpenetrating sample, is presented in
Overton et al. (1990; Section 4.3.7). Algorithms listed in this report cover most design
options discussed in the EMAP design report. It is expected that any further realizations of
the EMAP design will also include documentation of corresponding variance estimators.
1.1 Overall Design
The EMAP design is a probability sample of resource units or sites that is based on
two tiers of sampling. The first tier (Tier 1) primarily supports landuse characterization
and description of population structure, while the second tier (Tier 2) supports status
assessment by the indicators. The second tier sample is a probability subsample of the first
tier sample; such a sample is referred to as a double sample. Across the ecological resource
groups, it is expected that discrete, continuous, and extensive populations will be monitored.
The statistical methods outlined in this report address these different population types at
both sampling tiers. A description of the sampling design is presented in Overton et al.
(1990).
1.2 Resources
EMAP is designed to provide the capability of sampling any ecological resource. To
achieve this objective, explicit design plans must be specific for a particular resource and all
resources to be characterized must be identified. Currently, the resources to be sampled
within EMAP include: surface waters, wetlands, forests, agroecosystems, arid lands, Great
Lakes, and near-coastal wetlands. These resources are further divided by major classes to
represent the specific 'resource' that will be addressed by the sampling effort. For example,
surface waters can be partitioned into classes such as very small lakes, intermediate-sized
lakes, very large lakes, very small streams, intermediate-sized streams, rivers, and very large
rivers. Because each class potentially generates different sampling issues, each would be
considered a different entity. The design structure meets this condition by identifying each
such class as a resource, thereby resulting in 6 to 12 surface water resources. Each major
resource group may also have as many divisions.
Most resources will be sampled via the basic EMAP grid and associated structures.
However, other resources, such as very large lakes and very large rivers, represent unique
ecological entities and cannot be treated as members of a population of entities to be
described via a sample of the set. For example, Lake Superior and the Mississippi River are
unique, although the tributaries of the Mississippi might be treated as members of a wider
class of tributaries.
Resources sampled by the EMAP grid will be associated with an explicit domain in
space, within which the resource is confined. This domain should be established early in the
design process. Within the defined domain, it is not expected that the resource will occupy
all space or that no other resource will occur. Domains of different resources will overlap,
but the domain of a particular resource is an important parameter of its design. For
purposes of nomenclature, the resource domain is a region containing the resource. The
resource universe is either a point set with one point for each resource unit (for a discrete
resource) or the continuous space actually occupied by the resource (for an extensive resource).
A resource class will be a subset of its universe. Such a class may or may not be treated as
a sampling stratum and may or may not have an identified subdomain.
1.3 Response Variables of Interest
The term response variable is used generally for the measured characteristic of
interest in the sample survey. In EMAP, a special class of response variables is referred to
as indicators, such as indicators of ecological status (Hunsaker and Carpenter, 1990). These
indicators are the environmental and ecological variables measured in the field on resource
units or at resource sites; they may be measured directly or modified via formulae or
analytic protocols.
The term indicators should not be applied to the structural variables defined at Tier
1. The Tier 1 variables are used to estimate population structure and to partition the Tier
1 sample into the necessary structural parts for Tier 2. Then the indicator variables are
determined on the Tier 2 sample. When the Tier 2 sample includes the entire Tier 1
sample, it is still appropriate to make the distinction between indicator and structural
variables, both of which are response variables. Because of this distinction, it is sometimes
appropriate to distinguish Tier 1 and Tier 2 in terms of the variables, rather than strictly in
terms of a subsample.
1.4 Statistical Methods
The primary statistic used to summarize population characteristics is the estimated
distribution function. This distribution function estimates the number or proportion of
units or sites for which the value of the indicator is equal to or less than y. For discrete
resources, the estimated distribution function for numbers is designated as N(y), while the
estimated distribution function for the proportion of numbers is designated as F(y). The
estimated distribution function of size-weighted totals in discrete resources is designated as
Z(y), while a size-weighted proportion is designated as G(y). Examples of size-weights are
lake area, stream miles, and stream discharge. There are no distribution functions
comparable to Z(y) and G(y) in the continuous and extensive populations because there are
no objects in these resources. Therefore, there are no object sizes to use as weights. In
extensive resources, the estimated distribution function representing actual area! extent for
which the value of the indicator is equal to or less than y is designated as A(y), while the
proportion of area! extent is designated as F(y). Thus A(y) is analogous to N(y); A is the
size of an extensive resource and N is the size of a finite resource.
A number of estimates of interest, which can be obtained from the distribution
function, have been used quite successfully in the National Surface Water Surveys (NSWS)
(Linthurst et al., 1986; Landers et al., 1987; Kaufmann et al., 1988). For example, any
quantile, including the median of the distribution, can be interpolated easily from the
distribution function. In addition, the distribution function can be supplemented with
tables of means, quantiles, or any other statistics of particular interest, providing greater
accuracy than can be obtained from the plotted distribution function.
The basic formula for estimated distribution functions is

\hat{F}_a(y) = \frac{1}{\hat{N}_a} \sum_{S_a(y)} w_i , \qquad (1)

where S is the sample of units representing the universe (U) and the variable y represents
any response variable. The subscript a denotes a subpopulation of interest; S_a is the portion
of the sample in subpopulation a, and S_a(y) is the portion of the sample in subpopulation a
having values ≤ y. We associate the inclusion probability, π_i, with each ith sampling unit.
Each sampling unit is a representation of a subset of the universe, and the weight (w_i = 1/π_i)
accounts for the size of the subset. N̂_a denotes the estimated population size for the
subpopulation a. F̂_a(y) is calculated for each value of y appearing in the sample.
As given, F̂_a(y) is a step function not suitable to general EMAP needs; a smoothed
version is desirable. Thus, we propose the following method. In this method, F̂_a(y) is
replaced by [F̂_a(y) + F̂_a(y')]/2, where y' is the next lesser value of y appearing in the
sample. For the minimum value of y, F̂_a(y) is replaced by F̂_a(y)/2. A linear interpolation
is then made between these points
to generate the plot or to determine quantiles. For each of the distribution function
algorithms provided in this report, two successive values are averaged in this manner and
used to develop an interpolated distribution function. Confidence bounds are constructed on
F0(y) and then averaged and interpolated in the same manner. We rest justification for this
procedure on the interpretation of the initial and final values of the resulting distribution
function. The initial value is our best estimate of the proportion of the population below
the minimum observed value, and similarly, one minus the last point is our best estimate of
the proportion of the population above the maximum observation.
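As a concrete illustration, here is a minimal sketch of the step function of Equation 1 and its smoothed, interpolated version, assuming numpy arrays of observed values and inclusion-probability weights with distinct observed values; function names are illustrative only.

import numpy as np

def weighted_edf(y, w):
    """Step-function estimate: cumulative weight with value <= y,
    divided by the estimated subpopulation size (sum of all weights).
    Assumes distinct observed values for simplicity."""
    order = np.argsort(y)
    y_sorted = y[order]
    cum_w = np.cumsum(w[order])
    return y_sorted, cum_w / cum_w[-1]

def smoothed_edf(y, w):
    """Average each F value with the F value at the next lesser y;
    the minimum y gets F/2. Linear interpolation between these points
    yields the plotted distribution function."""
    y_sorted, F = weighted_edf(y, w)
    F_prev = np.concatenate(([0.0], F[:-1]))  # F at next lesser value; 0 below minimum
    return y_sorted, (F + F_prev) / 2.0

# Quantiles can then be interpolated, e.g. the median:
# y_pts, F_pts = smoothed_edf(y, w); median = np.interp(0.5, F_pts, y_pts)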
Computation of the distribution function and the associated confidence bounds differs
slightly for specific resource groups, reflecting differences in detail of the sampling design.
For example, in some cases simplifications of the computing algorithms result from equal
probability designs. Some algorithms are presented in this report to accommodate the range
of conditions and objectives anticipated across the resource groups. These algorithms have
been previously discussed in greater detail in other documents; references are given for those
requiring a more in-depth approach. Table 1 provides an outline of the distribution
functions, which are presented in greater detail throughout this report. Table 2 provides a
table of notation used throughout this document.
1.5 Use of this Manual
The body of this manual has been separated into two sections. Section 2 includes the
general theoretical development upon which the algorithms are based. Formulae for
discrete, continuous, and extensive resources are presented; the mathematical notation is
introduced and defined; and both design-based and model-based approaches for computing
the distribution functions are discussed. Other issues relevant to the analysis of EMAP
data, such as handling of missing data, are also discussed in this section.
Section 3 includes the algorithms used to produce the distribution functions, the
conditions that provide for the application of the algorithms, and the rationales that support
the choice and derivation of the algorithms presented. References will be made to the
general formulae (discussed in Section 2) used to develop these algorithms.
The following list outlines a step-by-step sequence for obtaining the distribution
functions:
1. Determine whether the data represent a discrete or extensive resource.
2. Determine the type of distribution function to compute. For example, for discrete
resources, the distribution of numbers and/or proportions of numbers will be
of interest.
3. Determine whether the sampling units were collected with equal or variable
probability of selection. The inclusion probabilities, π_i and π_ij, discussed in
Section 2.1.1, are to be a permanent part of a datum record, as are the
identification code of the sampling unit and the variable of interest. In some
cases, it is also necessary to identify the grid point, which can be included as
part of the identification code.
4. Determine whether the size of the subpopulation of interest is known or unknown.
The subpopulation is the group of population units about which one wishes to
draw inference.
5. Using the conditions from steps 1-4, refer to the example of that specific algorithm
in Section 3.
6. Optional, but suggested: Refer to the formulae referred to in Section 2 for a
description of the formulae and for clarification of any notational problems.
7. Optional: An algorithm to obtain specific quantiles is presented in Section 3.
This manual is expected to be updated as research continues in the development of
statistical procedures for EMAP, as EMAP adapts to changing concerns and orientation,
and as EMAP makes and accumulates more in-depth frame materials. For example, efforts
to date have been focused on design-based approaches to confidence bound estimation;
therefore this version reflects a fairly in-depth approach to design-based estimation over all
resources. Further discussion of model-based approaches currently under development is
expected in future versions of this manual.
SECTION 2
GENERAL THEORETICAL DEVELOPMENT
Two approaches are commonly used to draw inferences from a sample survey relative
to a real population. In the design-based approach, described in Section 2.1, the
probabilities of selection are accounted for by the estimators and the properties of inferences
are derived from the design and analytical protocol. In contrast, the model-based approach
(Section 2.2) assumes a model and requires knowledge of auxiliary variables for inference.
Properties of model-based inference are derived from the assumed model and analytical
protocol. A model-based estimator takes into account only model features, while a model-
assisted estimator takes into account both model and design features. For a discussion of
the relationship between these two approaches and the way they are used together, refer to
Hansen et al. (1983). The paper by Smith (1976) also provides useful insight.
2.1 Design-Based Estimation Methods
2.1.1 Discrete Resources
A population of natural units readily identified as objects is defined as a discrete
resource. For example, lakes, stream segments, farms, and prairie potholes are all
considered discrete resources. Populations of a large number of discrete resource units that
can be described by a sample are considered for EMAP representation. It is suggested, for
example, that lakes less than 2,000 hectares be characterized as populations of discrete
resources. Distribution functions of the numbers of units or proportions of these numbers
may be of interest. On the other hand, very large lakes are unique, and less usefully
characterized as members of populations of lakes.
2.1.1.1 General Estimator and Approximations of Design-Based Formulae
Because the EMAP design is based on a probability sample, design-based estimators,
which account for this structure, are applicable. The Horvitz-Thompson theorem (Horvitz
and Thompson, 1952) provides general estimators of the population attributes for general
probability samples and for estimators of variance of these estimators (Overton and
Stehman, 1993a; Overton et al., 1990).
In Horvitz-Thompson estimation, the probability of inclusion, π_i, is associated with
the ith sampling unit. Each sampling unit is a representation of a subset of the universe,
and the weight (w_i = 1/π_i) accounts for the size of the subset. Therefore, estimates of
population parameters, such as totals or means, simply sum the variables collected over the
sampling units, expanding them by the sampling weights. The Horvitz-Thompson estimator
proposed for EMAP is unbiased for the population (and subpopulation) totals and means, if
π_i > 0 for all units in the population.
The general form of the Horvitz-Thompson estimator is

\hat{T}_y = \sum_{S} w_i y_i , \qquad (2)

where S is the sample of units representing the universe (U), w_i is the weight, and the
variable y represents any response variable. The total of y on the universe is defined as
T_y = \sum_{U} y_i and is generally referred to as the population total. This estimator (Equation 2)
yields estimates of many parameters simply by the definition of y. For example, if y_i = 1 for
all units in the population, then T_y = N, the population size; it follows that \hat{N} = \sum_{S} w_i .
Suppose further that we are interested in a subpopulation, a. The portion of the
sample, S_a, that came from this subpopulation is also a probability sample from this
subpopulation. To obtain parameter estimates for a subpopulation, Equation 2 is simply
summed over the subpopulation sample,

\hat{T}_{ya} = \sum_{S_a} w_i y_i . \qquad (3)
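In code, Equations 2 and 3 are one weighted sum, with subpopulation estimates obtained by subsetting; a minimal sketch assuming numpy arrays (names illustrative):

import numpy as np

def ht_total(y, w):
    """Horvitz-Thompson estimate of the population total T_y (Equation 2)."""
    return np.sum(w * y)

def ht_subpop_total(y, w, in_a):
    """Equation 3: the same estimator summed over the subpopulation
    sample S_a, identified here by a boolean mask in_a."""
    return ht_total(y[in_a], w[in_a])

# Setting y = 1 estimates population size: N_hat = ht_total(np.ones(n), w)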
The Horvitz-Thompson theorem also provides for an unbiased estimator of the
variance of these estimators under the condition that π_ij > 0 for all pairs of units in the
population. The quantity π_ij is the probability that unit i and unit j are simultaneously
included in the sample. The estimator of the variance is designated in lower case as var,
and w_ij is the inverse of the pairwise inclusion probability. The variance of T̂_ya or N̂_a is
obtained by the choice of y_i:

var(\hat{T}_{ya}) = \sum_{S_a} y_i^2 w_i (w_i - 1) + \sum_{S_a} \sum_{j \neq i} y_i y_j (w_i w_j - w_{ij}) \qquad (4a)

var(\hat{N}_a) = \sum_{S_a} w_i (w_i - 1) + \sum_{S_a} \sum_{j \neq i} (w_i w_j - w_{ij}) \qquad (4b)
This presentation shows that the form of the variance estimator does not change when
estimating variance for the estimator based on a full sample or a subset of the sample. The
subsetting device, with summation over the appropriate subset of the sample, will always
represent the appropriate estimator. The principal reason for using the Horvitz-Thompson
form (Equation 4) is its subsetting capability; the commonly used form for the Yates-
Grundy variance estimator does not permit the convenience of subsetting.
The EMAP design is based on a systematic sampling scheme. The Horvitz-
Thompson theorem does not provide a design-unbiased estimator of variance based on this
design, because some pairwise inclusion probabilities are zero. The following sections include
a discussion of assumptions and approximations applied to Equation 4 in order to apply this
variance estimator in EMAP.
Systematic sampling
Because EMAP units are selected by a systematic sampling design, many of the
pairwise inclusion probabilities (TT^) equal zero and an unbiased variance estimator is not
available. However, it has been established that in many cases the variance of a systematic
design can be satisfactorily approximated by the variance that applies to a sample taken on
a randomly ordered list (cf., Wolter, 1985). A common systematic sample selected on a
randomly ordered list is a simple random sample. Therefore, a simple random sample is an
approximate model for an equiprobable systematic sample. The randomized model
proposed here provides approximate variance estimation for a systematic variable
probability design.
A modification of the randomized sampling model provides only for 'local'
randomization of the position of the population units, rather than global randomization.
Good behavior of the variance estimator results from this assumption (Overton and
Stehman, 1993b). As a consequence, we can justify use of the suggested pairwise inclusion
probability with less restriction as compared with the global randomization assumption.
We will refer to the local randomization model as the weak randomization assumption.
The Horvitz-Thompson estimator of variance, Equation 4, is thus proposed for
EMAP indicators under the weak randomization assumption. The simulation studies
conducted on the behavior of this estimator suggested that this assumption was adequate in
most situations expected for EMAP (Overton, 1987a; Overton and Stehman, 1987; Overton
and Stehman, 1992; Stehman and Overton, in press). In a few situations the estimator
overestimated the true variance, thus providing for a conservative estimate of variance. In
certain circumstances, as discussed in Section 2.1.2.3, it is appropriate to modify the
estimation methods to account for the spatial patterns.
Pairwise inclusion probability
Approximations for the pairwise inclusion probability under the randomized model
have been proposed in the literature (Hartley and Rao, 1962). A major disadvantage with
these approximations, as discussed by Stehman and Overton (1989), is the requirement that
all inclusion probabilities for the population must be known. For large populations such as
those studied in EMAP, it is practically impossible to obtain inclusion probabilities for all
units in the populations. Another approximation for this pairwise inclusion probability
requires that the inclusion probabilities be known only on the sample (Overton, 1987b).
The formula for the inverse of this pairwise inclusion probability is

w_{ij} = \frac{2 n w_i w_j - w_i - w_j}{2(n - 1)} , \qquad (5)

where n is sample size.
Investigation of this approximation indicates that it performs at least as well as other
commonly recommended approximations (Overton and Stehman, 1992; Overton, 1987a).
Therefore, this pairwise inclusion probability will be used in the approximation of the
variance estimator for the population parameter estimates collected in EMAP, for those
circumstances in which the randomization assumption is justified.
This variance estimator (Equation 4) accommodates variable probability of selection,
but it is also appropriate for equal probability designs. The approximation for the pairwise
weight given in Equation 5 is also appropriate for randomized equal probability designs. As
a consequence, Equation 4 with 5 is valid for either equal or variable probability selection,
under the weak assumption of a randomized model, as discussed above under systematic
sampling.
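A sketch of the variance computation of Equation 4 using the pairwise weights of Equation 5, assuming numpy arrays for the sample (names illustrative):

import numpy as np

def ht_variance(y, w):
    """Horvitz-Thompson variance estimator (Equation 4), with the
    pairwise weight approximation of Equation 5:
    w_ij = (2 n w_i w_j - w_i - w_j) / (2 (n - 1))."""
    n = len(y)
    var = np.sum(y**2 * w * (w - 1.0))
    for i in range(n):
        for j in range(n):
            if i != j:
                w_ij = (2 * n * w[i] * w[j] - w[i] - w[j]) / (2 * (n - 1))
                var += y[i] * y[j] * (w[i] * w[j] - w_ij)
    return var

# For var(N_hat), call with y = np.ones_like(w).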
When the randomization model is not acceptable, alternative variance estimators,
based on the mean square successive difference, have been developed for use with an equal
probability systematic design and regular spacing (Overton and Stehman, 1993a). The
conditions and assessment of these and other variance estimators are presently under
investigation; subsequent versions of this document will discuss these alternate methods.
Extension must be made to account for irregular spacing. It should also be noted that in
some circumstances the methods of spatial statistics may provide adequate assessment of
variance.
The confidence bounds obtained using the Horvitz-Thompson variance estimator
(Equation 4) are based on normal approximation. This approximation may be inadequate
for estimating confidence bounds at the tails of the distribution, even for moderate sample
sizes. In the special case of equal probability of selection and the randomization
assumption, confidence bounds can be obtained by exact methods (see Section 3). However,
exact methods may also yield inadequate confidence bounds at the tails of the distribution
(also discussed in Section 3).
2.1.1.2 Tiered Structure
The following description of the tiered structure was summarized in the EMAP design
report (Overton et al., 1990).
The Tier 1 sample
The EMAP sample design partitions the area of the United States into hexagons,
each comprising approximately 635 square kilometers (Overton et al., 1990), and selects a
point at an identical position in each hexagon; selection of this one position is random
(equiprobable) over the hexagon. This method results in a triangular grid of equally spaced
points. An areal sample of a 40-km2 hexagon (40-hex) is imposed on each point, with the
sampled hexagonal area containing 1/16 of the total area of the larger hexagon. This fraction,
1/16, therefore represents a constant inclusion probability, π, and 16 represents a constant
weight, w, to be applied to each fixed-size areal sample. Because other enhancements of the
grid are expected, possibly with different sized areal samples, the following formulae will
incorporate general notation.
No detailed characterization of indicators is collected at Tier 1, so no distribution
functions will be computed based on the Tier 1 data. It is of interest, however, to estimate
the total number of discrete resource units in specific populations at Tier 1. This estimation
is possible for any resource class for which units can be uniquely located by a position point.
The following formula can be used to estimate the total number of units for a particular
resource (r) at Tier 1:
\hat{N}_r = w \sum_{S_{D_r}} n_{ri} , \qquad (6)

where D_r is the domain for resource r. A domain of a resource is a feature of the spatial
frame that delineates the entire area within which a sample might encounter the resource
(Section 1.1). In these formulae, the quantity n_ri represents the number of units for the
particular resource at grid point i. The variance can be estimated using Equation 4b, as
follows:

var(\hat{N}_r) = \frac{w(w-1)}{n_r - 1} \Big\{ \sum_{S_{D_r}} n_{ri}^2 - \big( \sum_{S_{D_r}} n_{ri} \big)^2 / n_r \Big\} , \qquad (7)
where nr is the number of grid points for which the areal sample hexagon includes part of
the domain of the resource. It is worth noting again that the estimates of variance are often
expected to slightly overestimate variance if the systematic design results in greater precision
than would a randomized design, thus providing for a conservative estimate of variance.
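For the equal-probability Tier 1 grid (w = 16 for the base design), Equations 6 and 7 reduce to sums over the grid points whose areal samples meet the resource domain; a minimal sketch assuming a numpy array of per-hexagon unit counts n_ri (names illustrative):

import numpy as np

def tier1_count_estimate(n_ri, w=16.0):
    """Equation 6: estimated number of resource units, N_r = w * sum(n_ri)."""
    return w * np.sum(n_ri)

def tier1_count_variance(n_ri, w=16.0):
    """Equation 7: equal-probability simplification of the
    Horvitz-Thompson variance estimator (Equation 4b)."""
    n = len(n_ri)  # grid points whose areal sample meets the domain
    s = np.sum(n_ri)
    return w * (w - 1.0) / (n - 1.0) * (np.sum(n_ri**2) - s**2 / n)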
The reduced Tier 1 sample
In preparation for selecting the Tier 2 sample, resource classes are identified. Some of
these classes will be treated as sampling strata, and hence be designated as 'resources'. The
Tier 1 sample for such a 'resource' is reduced so that it contains only one unit at each 40-
hex at which that resource is represented. This condition effectively changes the sample
from a set of systematic areal samples to a spatially well-distributed subset of units from the
population of units for the particular resource.
A consequence of this sample reduction step is the introduction of variable inclusion
probabilities in the Tier 1 sample, reflecting the scheme used to reduce n_ri to 1. For
example, if a random sample of size 1 is selected from the n_ri units of hexagon i, then the
selected unit will have π_{1ri} = π/n_ri. A consequence of this is that \hat{N}_r = \sum_{S_{1r}} w_{1ri} = w \sum_{S_{1r}} n_{ri} ,
where S_1r is now the sample of points for which n_ri > 0; at each of these points, the sample
now consists of one unit of resource r. Because this estimate, N̂_r, is identical to the original
Tier 1 estimate, it has the same variance. This sample, S_1r, is then subsampled to generate
the Tier 2 sample, S2r- Again, note that it is now a resource-specific sample of units, not
the original areal sample.
The Tier 2 sample
The Tier 2 sample, S2, is a probability subsample, a double sample, of the Tier 1
sample of resource units. At this tier, a specific resource has been identified and the reader
should remember that subsequent equations are for specific resources. The reader should
also be aware that the subscript i will now index a resource unit, not the grid point. All
Tier 2 samples for discrete resources consist of individual units from the universe of discrete
resource units.
With these changes, the estimator presented in Equation 2 is appropriate for the
sample collected at the second tier. The indicator values are summed over the samples
surveyed at the second tier, weighted by the assigned weights. The inclusion probabilities account for
the probability structure of this double sample. Overton et al. (1990) identified the
probability of the inclusion of the ith unit in the Tier 2 sample as the product of the Tier 1
inclusion probability and the conditional Tier 2 inclusion probability. The conditional Tier
2 inclusion probability is defined as the probability of inclusion at Tier 2, given that the
unit was included at Tier 1. This product is still conditional on the Tier 1 sample and leads
to conditional Horvitz-Thompson estimation.
In subsequent equations, the subscripts 1 and 2 represent the first and second tiers,
respectively. The weighting factor for unit i at Tier 2 is defined as

w_{2ri} = w_{1ri} w_{2 \cdot 1ri} , \qquad (8)

where w_{1ri} is the weighting factor for the ith unit in the Tier 1 reduced sample and w_{2·1ri} is
the inverse of the conditional Tier 2 inclusion probability for resource r.
Selection of the Tier 2 sample from the reduced Tier 1 sample and calculation of the
conditional Tier 2 inclusion probabilities are discussed in Section 4.0 of the EMAP design
report (Overton et al., 1990). This procedure generates a list in a specific order, based on
spatial clusters. Clusters of 40-hexes are arbitrarily constructed with uniform size of the
initial Tier 1 sample of the specific resource. The reduced Tier 1 sample is sorted at random
within clusters, and then the clusters are arranged in an arbitrary order. A subsample of
fixed size, n2r, is selected from Slr by ordered variable probability systematic sampling from
this list. The purpose of this elaborate procedure is to generate a spatially well-distributed
Tier 2 sample.
The Tier 2 conditional inclusion probabilities are proportional to the weights at Tier
1:

\pi_{2 \cdot 1ri} = \frac{n_{2r} w_{1ri}}{\hat{N}_r} , \qquad (9)

where N̂_r was defined for a specific resource in Equation 6. However, for some units
w_{1ri} > N̂_r / n_{2r}. To obtain conditional inclusion probabilities ≤ 1, these units are placed into an
artificial 'certainty' stratum, all having π_{2·1ri} = 1. This step takes place prior to the cluster
formation. For the remaining units, the selection protocol and achieved probabilities are
modified to adjust for the number of units having probability 1. These remaining units now
have conditional inclusion probabilities:

\pi_{2 \cdot 1ri} = \frac{n'_{2r} w_{1ri}}{\sum_{S'_{1r}} w_{1ri}} , \qquad (10)

where n'_{2r} equals n_{2r} less the number of units entering S_2r with probability 1, and S'_1r equals
S_1r less these same units that were included with probability 1.
Note that this selection protocol is designed to create Tier 2 inclusion probabilities as
nearly equal as possible:

\pi_{2ri} = \pi_{1ri} ,  if i is in the artificial stratum with \pi_{2 \cdot 1ri} = 1 ;
\pi_{2ri} = n'_{2r} \Big/ \sum_{S'_{1r}} w_{1ri} ,  otherwise, \qquad (11)

and if no units are in the artificial stratum,

\pi_{2ri} = n_{2r} / \hat{N}_r , \qquad (12)

where N̂_r is the Tier 1 estimate of the total number of population units in resource r. For
generality, we will retain the variable probability notation, but ideally the sample will now
be equal probability. If there is great deviation from equiprobability, then consideration
should be given to enhancement of the grid, perhaps by reducing the size of the Tier 1 areal
sample, in order to better achieve the goal of equiprobability.
The variance estimator presented in Equation 4 is also appropriate for estimating
variance from the Tier 2 sample, using inclusion probabilities defined above for Tier 2.
When no units enter with π_{2·1ri} = 1, then

w_{2rij} = \frac{2 n_{2r} w_{2ri} w_{2rj} - w_{2ri} - w_{2rj}}{2(n_{2r} - 1)} . \qquad (13)

However, if unit i enters with π_{2·1ri} = 1, then

w_{2rij} = \Big[ \frac{2 n_{1r} w_{1ri} w_{1rj} - w_{1ri} - w_{1rj}}{2(n_{1r} - 1)} \Big] w_{2 \cdot 1rj} . \qquad (14)

Because the term in the bracket equals w_{1rij}, Equation 14 simplifies to w_{2rij} = w_{1rij} w_{2·1rj}.
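A sketch of the conditional Tier 2 probabilities of Equations 9 and 10, iterating the certainty-stratum assignment until no remaining probability exceeds 1 — one way to realize the adjustment described above; names are illustrative:

import numpy as np

def tier2_conditional_weights(w1, n2):
    """Conditional Tier 2 inclusion probabilities (Equations 9-10),
    placing units with probability >= 1 into a 'certainty' stratum and
    renormalizing the remainder. Returns w_{2.1} = 1 / pi_{2.1}."""
    pi = np.zeros(len(w1))
    certain = np.zeros(len(w1), dtype=bool)
    while True:
        rest = ~certain
        n2_prime = n2 - certain.sum()
        pi[rest] = n2_prime * w1[rest] / w1[rest].sum()
        new = rest & (pi >= 1.0)   # units forced into the certainty stratum
        if not new.any():
            break
        certain |= new
    pi[certain] = 1.0
    return 1.0 / pi                # w_{2.1ri}

# Tier 2 weights (Equation 8): w2 = w1 * tier2_conditional_weights(w1, n2)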
Special case: The Tier 2 point sample of lakes
We assume a stratified design with equiprobable selection within strata. If a quasi-
stratified design is used instead, appropriate analysis can condition on the realized sample
sizes in the classes and use post-stratification.
Special case: The Tier 2 point sample of streams
A point sample of streams at Tier 2, rather than a sample of stream reaches, has
been proposed. With a few simple changes, that point sample will be a rigorous
equiprobable point sample of the stream population with a very simple estimation
algorithm. A probability sample of stream reaches, on which the sample points are
represented and from which other estimates of population structure can be obtained, will
also be provided. The protocol provided will apply to the sample of stream reaches and the
point sample design proposed to us.
We assume a stratified design with equiprobable selection within strata. If a quasi-
stratified design is used instead (as has been proposed), appropriate analysis can condition
on the realized sample sizes in the classes and use post-stratification.
S_1r is the Tier 1 collection of reaches in a resource stratum identified via the 40-
hexes. S_2r is generated by selecting n_2r points from this set using the frame representation
of stream length. This process results in (1) the selection of n_2r frame reaches with
probability proportional to frame length, and (2) the random selection of 0, 1, 2, ... points
in each selected reach with inclusion density, given reach selection, inversely proportional to
length. The resultant point sample is equiprobable on the population of stream reaches.
Then, in terms of the sample of reaches,

\hat{L}_r = \frac{w D_r}{n_{2r}} \sum_{S_{2r}} \sum_{j=1}^{k_{ri}} \frac{l_{rij}}{l^*_{ri}} = \frac{w D_r}{n_{2r}} \sum_{S_{2r}} \frac{z_{ri}}{l^*_{ri}} \qquad (15a)

estimates the total length of population reaches, where for resource r, l_rij represents the
length of the jth actual reach in the ith sampling unit, l*_ri represents the length of the ith
frame reach, and z_ri represents the sum of length of all reaches in the ith sampling unit.
Recall that a sampling unit is a frame reach. Also, D_r is the total frame reach length in the
Tier 1 sample of resource r and L̂* = wD_r is the Tier 1 estimate of L*. Because L* is known
on the frame, wD_r is replaced by L*, resulting in:

\hat{L}_r = L^* \hat{R} , \qquad (15b)

where \hat{R} = \frac{1}{n_{2r}} \sum_{S_{2r}} \frac{z_{ri}}{l^*_{ri}} . Also,

\hat{N}_r = \frac{w D_r}{n_{2r}} \sum_{S_{2r}} \frac{k_{ri}}{l^*_{ri}} , \qquad (16)

where k_ri represents the number of actual reaches in the ith sampling unit. Again, wD_r can
be replaced by L*, which is known for the frame, resulting in:

\hat{N}_r = \frac{L^*}{n_{2r}} \sum_{S_{2r}} \frac{k_{ri}}{l^*_{ri}} . \qquad (17)
For these estimates, the variance estimators for L̂_r and N̂_r are given by L*² var(R̂),
where the variance of the ratio can be approximated by:

var(\hat{R}) = \frac{1}{n_{2r}(n_{2r} - 1)} \sum_{S_{2r}} (u_{ri} - \hat{R})^2 ,

where u_ri = z_ri / l*_ri when computing var(L̂_r), or u_ri = k_ri / l*_ri when computing var(N̂_r). Note
that this formula is different from most ratio variance estimators in this report.
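A sketch of Equations 15b and 17 with the ratio variance above, assuming numpy arrays of per-reach summed lengths z, reach counts k, and frame lengths, with L_star the known total frame length (names illustrative; the variance form follows the reconstruction given above):

import numpy as np

def stream_ratio_estimates(z, k, l_star_i, L_star):
    """Ratio estimates of total stream length (15b) and reach count (17),
    with the simple mean-of-ratios form for var(R)."""
    n2 = len(l_star_i)
    u_L = z / l_star_i                     # ratios for length estimation
    u_N = k / l_star_i                     # ratios for count estimation
    R_L, R_N = u_L.mean(), u_N.mean()
    var_R_L = np.sum((u_L - R_L)**2) / (n2 * (n2 - 1))
    var_R_N = np.sum((u_N - R_N)**2) / (n2 * (n2 - 1))
    return {
        "L_hat": L_star * R_L, "var_L_hat": L_star**2 * var_R_L,
        "N_hat": L_star * R_N, "var_N_hat": L_star**2 * var_R_N,
    }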
The distribution function is estimated from the data collected at the sample points,
not from the set of sample reaches, as in the above estimation of N and L. Recall that each
selected frame reach will have an associated sample point; this may result in 0, 1, 2, or more
sample points for the actual streams represented by the frame reach. Let:

\hat{F}_r(y) = \frac{n_r(y)}{n_r} , \qquad (18)

where n_r is the total number of sample points in the resource, as realized in stream reaches,
and n_r(y) is the number of these for which the observed indicator value is less than y;
n_r(y) = \sum_i \sum_j I(y_{rij} \le y), with summation over all frame reaches, i, and all points, j, for each
frame reach. Then, rewrite n_r(y) = \sum_i z_{ri}(y), so that \hat{F}_r(y) = \sum_i z_{ri}(y) / n_r .
For F̂_r(y), under the randomization assumption, it is appropriate to treat n_r(y) as
conditionally binomial(n_r, F_r(y)) and to use the binomial algorithm. Alternatively, the
variance of F̂_r(y) can be estimated in the manner of ratio estimators. For the equiprobable
point sample, this is:

var(\hat{F}_r(y)) = \Big[ \sum_{S_{2r}} d_{ri}^2 w_{2i}(w_{2i} - 1) + \sum_{S_{2r}} \sum_{j \neq i} d_{ri} d_{rj} (w_{2i} w_{2j} - w_{2ij}) \Big] \Big/ n_r^2 , \qquad (19a)
where d_ri(y) = (z_ri(y) − F̂_r(y) x_ri), w_{2i} = \frac{w D_r}{n_{2r}} , and w_{2ij} = \frac{w^2 D_r^2}{n_{2r}(n_{2r} - 1)} . Here, x_ri equals the
number of sample points for the ith frame reach. This then simplifies to:

var(\hat{F}_r(y)) = \frac{w^2 D_r^2}{n_{2r}(n_{2r} - 1)} \sum_{S_{2r}} d_{ri}^2 \Big/ n_r^2 . \qquad (19b)

Then it is necessary to estimate L_r(y) as a product, L̂_r(y) = L̂_r F̂_r(y), with variance
estimator, var(L̂_r(y)) = L̂_r² var(F̂_r(y)) + F̂_r²(y) var(L̂_r).
This analysis presumes that there are no strata for stream reaches. For two strata
(1st and 2nd order), simple modification of these formulae will suffice. The numbers and
length of reaches in the cross-classified strata are estimated and then combined. For F̂,
sample points from units in the wrong stratum are simply combined with the correct
stratum. If more than these two strata are desired, the general method of frame correction
via sample unit correction is not feasible, and the method prescribed here is not appropriate.
2.1.2 Extensive Resources
The universe of an extensive resource is a continuous spatial region. If the domain is
correctly identified, the universe of the resource will be a subset of the domain and may be
fragmented over that domain. Extensive resources may have populations of two kinds,
continuous or discontinuous. Because these discontinuous populations are defined on a
continuous universe, they are referred to as extensive resources. Continuous populations are
referred to as extensive resources as well. Section 3.3.4 of the EMAP design report (Overton
et al., 1990) describes two methods for sampling extensive populations, via a point or areal
sample. For each resource, the design provides for the classification of a large areal sample
(40-hex) at each grid point; these areal samples are also subject to subsampling via points or
areal subsamples.
At Tier 2, two distinct directions are available, depending on the nature of the
resource. Specifically, if the domain of the resource is well known from existing materials,
as are boundaries of the Chesapeake Bay or Lake Superior, then the Tier 1 areal sample is of
little value either in estimating extent or in obtaining a sample at Tier 2. In these cases,
the domain should correspond to the universe. Conversely, if the spatial distribution or
pattern of a resource is poorly known, as it will be for certain arid land types or for certain
types of wetlands, then the Tier 1 areal sample may provide the best basis for obtaining a
well-distributed sample at Tier 2. Other factors, such as size of the domain and degree of
correspondence of universe and domain, will influence the sampling design. In the first
circumstance, the Tier 2 sample will be selected from the areal sample obtained at Tier 1.
In the other, the Tier 2 sample will be selected from the known universe by a higher
resolution point sample that contains the base Tier 1 sample.
2.1.2.1 Areal Samples
Tier 1
All Tier 1 areal samples are expected to be collected with equal probability.
Enhancement of the grid may be made for any resource, but any resource should have
uniform grid density over its domain. Further, the areal sample imposed on the grid points
will be of the same size for any resource, so that algorithms are presented only for equal
probability sampling. The following formula estimates the total areal extent of a particular
resource (r) over its domain D_r:

\hat{A}_r = w \sum_{S_{D_r}} a_{ri} , \qquad (20)

where the domain was discussed in Section 2.1.1.2. The value a_ri defines the area of
resource r in the areal sample at grid point i, and w is the inverse of the density of the grid
divided by the size of the areal sample. For equal probability sampling, the variance
estimator is

var(\hat{A}_r) = \frac{w(w-1)}{n-1} \Big\{ \sum_{S_{D_r}} a_{ri}^2 - \big( \sum_{S_{D_r}} a_{ri} \big)^2 / n \Big\} , \qquad (21)

where n is the number of whole or partial areal sample hexagons located in D_r. As with the
discrete resources, even though the sample is selected by a systematic grid, we assume, in
order to estimate variance, that the sample was taken from a locally randomized scheme.
The justification of this assumption is similar to that for discrete resources. Alternate
procedures are available when the assumption is questioned (see Section 2.1.2.3).
Tier 2
At the second stage of sampling for extensive resources, the distribution function for a
particular resource is estimated. To identify the objective of Tier 2 sampling, we can write
estimating equations as though a complete census were made at Tier 2. The general
conceptual formula for the distribution function of areal extent for a specific resource (r)
over its domain D_r is

\hat{A}_r(y) = w \sum_{S_{D_r}} a_{ri}(y) , \qquad (22)

where a_ri(y) is the area of the resource in areal sample i such that the value of the indicator
is less than y. The estimated variance follows Equation 21 as

var(\hat{A}_r(y)) = \frac{w(w-1)}{n_r - 1} \Big\{ \sum_{S_{D_r}} a_{ri}^2(y) - \big( \sum_{S_{D_r}} a_{ri}(y) \big)^2 / n_r \Big\} . \qquad (23)
The estimate of areal proportion for an extensive population divides Equation 22 by
the estimate of total areal extent:
\hat{F}_r(y) = \hat{A}_r(y) / \hat{A}_r . \qquad (24)

In the rare instance in which A_r is known, an improved estimator of A_r(y) is given by

\hat{A}_r(y) = A_r \hat{F}_r(y) . \qquad (25)
Ordinarily, these distribution functions will be calculated at each distinct value of y
appearing in the sample. The variance associated with the areal proportion is the general
form for a ratio estimator (Sarndal et al., 1992, Equation 7.2.11). In writing this
expression, it is necessary to identify the specific value, y_i, at which F̂_r(y) is being assessed.

var(\hat{F}_r(y_i)) = \Big[ \sum_j d_j^2 w_j (w_j - 1) + \sum_j \sum_{k \neq j} d_j d_k (w_j w_k - w_{jk}) \Big] \Big/ \hat{A}_r^2 , \qquad (26)

where d_j = [a_{rj}(y_i) - a_{rj} \hat{F}_r(y_i)], a_rj is the area of sample j in resource r, w_jk is defined as
in Equation 5, and Â_r² is replaced by A_r² when A_r is known.
In practice, the Tier 2 assessment will be based on a subsample of some kind, and the
above ideal estimation will not be available. The only method proposed for subsampling is
use of a Tier 2 point sample.
2.1.2.2 Point Samples
Two methods of directly sampling objects from the grid points are discussed in the
EMAP design report (Overton et al., 1990, Section 4.3.3.2). A Tier 1 reduced sample of
discrete resource units can be selected by choosing the units into whose areas of influence the
points fall; this method is not currently scheduled for use, but it is a viable method for
several discrete resources. The same procedure can be used to select areal sampling units
from an arbitrary spatial partitioning of the United States. The agroecosystem component
of EMAP provides such an example: the units selected for the sample are secondary
sampling units of the National Agricultural Statistics Service (NASS) frame, and estimates
are of totals over subsets of the frame units. Each selected unit is a mosaic of fields and
other land use structures. These structures are then classified and sampled to provide
ecological indicators for characterizing the sampling unit. Essentially, this areal sample is
analyzed exactly like the 40-hex fixed areal unit discussed in the previous section, with the
exception that inclusion probabilities are now proportional to the size of the unit, and the
general formulae (e.g., Equations 2-4) must be used.
An alternate use of the point sample can be applied to an extensive resource, with the
ecological indicators of the resource measured at the grid points. For continuous
populations, such as temperature or pH, the response can be measured exactly at a selected
point. For other populations, it is necessary to make observations on a support region
surrounding the point, like a quadrat. For example, the wetlands resource group could
obtain an indicator, such as plant diversity, from a quadrat sample centered on a grid point.
The indicator measured in the quadrat can be treated like a point measurement. A cluster
of quadrats centered on the grid point provides yet another method for sampling extensive
resources.
This point sample will be applied at Tier 2 in either of two ways. For resources that
depend on the Tier 1 areal sample to provide a sample frame, a high-resolution sample of
points is to be imposed on each 40-hex containing the resource; this arrangement will
generate an equiprobable point sample of the areal fragments of all resources that were
identified at Tier 1. For a resource in which the universe is clearly identified, such as Lake
Superior, a better spatial pattern of sample points will be obtained by imposing an enhanced
grid on the entire universe. In the latter case, the universe is known, whereas in the former
case, the Tier 1 sample provides a sample of the universe, which is then sampled by a Tier 2
point sample.
In either case, an equiprobable sample of points is obtained from which resource
indicators will be measured, and the estimation equations will differ only by the weights.
Variance estimators will differ, as one is a single-stage sample and the other is a double
sample.
Point sample for a universe with well-defined boundaries
For a resource in which the universe is known (e.g., the Chesapeake Bay), the general
formula for equiprobable point samples for a resource class is presented. A resource class is
defined as a subset of the resource. For example, two classes of substrate, sand and mud,
can be defined in the Chesapeake Bay. The distribution function of the proportion of a
specific class of a specific resource (rc) having the indicator ≤ y reduces to

\hat{F}_{rc}(y) = \frac{n_{rc}(y)}{n_{rc}} , \qquad (27)

where n_rc(y) is the number of points in resource class rc with the indicator equal to or less
than a specific value, y, and n_rc is the total number of sample points in the resource class
rc. Under the randomization assumption, the conditional distribution of n_rc(y), given n_rc,
is Binomial(n_rc, F_rc(y)), so that confidence bounds are readily set by the binomial
algorithm in those instances in which spatial patterns indicate adequacy of the
randomization model (Overton et al., 1990, Section 4.3.5). Alternate protocols are available
when the randomization model cannot be assumed (Section 2.1.2.3).
Estimation of the area occupied by an extensive resource class is provided by

\hat{A}_{rc} = A_r \frac{n_{rc}}{n_r} = A_r \hat{p}_{rc} , \qquad (28)

where n_r is the number of grid points falling into the domain of the resource, and A_r is the
area of the resource. Under the randomization assumption, n_rc, conditional on n_r, is a
binomial random variable; bounds on p̂_rc are again set by the binomial algorithm, as are
bounds on Â_rc.
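One standard realization of the binomial algorithm uses exact (Clopper-Pearson) limits from the beta distribution, sketched below with scipy; the function name and interface are illustrative:

from scipy.stats import beta

def binomial_bounds(k, n, confidence=0.95):
    """Exact (Clopper-Pearson) confidence bounds for a binomial
    proportion, e.g. F_rc(y) = n_rc(y) / n_rc with k = n_rc(y), n = n_rc."""
    alpha = 1.0 - confidence
    lower = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lower, upper

# Bounds for the class area A_rc = A_r * p_rc scale the p_rc bounds by A_r.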
Point sample for universe with poorly defined boundaries
When the universe of the resource is not known and one must use the Tier 1 areal
sample as a base for the Tier 2 sample, then Equation 20 provides the estimates of Â_r at
Tier 1. Then the Tier 2 sample is an equiprobable sample of points selected from the area
of the resource class contained in the 40-hexes. This procedure is implemented as a
tessellation stratified sample in each 40-hex, with k = 1 to 6 sample points per 40-hex. With
only 1 point per 40-hex, the binomial algorithm will be appropriate under the randomization
assumption; multiple points per 40-hex will require an explicit design-based expression for
variance. In all cases,

\hat{A}_{rc} = \hat{A}_r \frac{n_{rc}}{n_r} = \hat{A}_r \hat{p}_{rc} , \qquad (29a)

\hat{F}_{rc}(y) = \frac{n_{rc}(y)}{n_{rc}} , \qquad (29b)

\hat{A}_{rc}(y) = \hat{A}_{rc} \hat{F}_{rc}(y) = \hat{A}_r \frac{n_{rc}(y)}{n_r} = \hat{A}_r \hat{R} . \qquad (29c)
It should be recognized that Equation 29a is a special case of Equation 29c.
When k > 1, the following variance formula is appropriate:

var(\hat{F}_{rc}(y)) = \sum_i \frac{k}{k-1} \sum_j d_{ij}^2 \Big/ n_{rc}^2 . \qquad (30)

The outside summation is over the 40-hexes and d_ij = (I(rc, y_ij ≤ y) − F̂_rc(y) I(rc)). This
expression is derived from the general Horvitz-Thompson formula used with ratio
estimators. The formula can be recognized as the usual stratified random sample variance
formula, applied to d_ij.
In addition,

var(\hat{A}_{rc}(y)) = var(\hat{A}_r) \hat{R}^2 + \hat{A}_r^2 var(\hat{R}) , \qquad (31a)

where var(R̂) follows the same stratified form as Equation 30,

var(\hat{R}) = \sum_i \frac{k}{k-1} \sum_j d_{ij}^2 \Big/ n_r^2 , \qquad (31b)

where d_ij = [I(rc, y_ij ≤ y) − I(r) R̂].
Note that var(Â_r(y)) = var(Â_r) F̂_r²(y) + Â_r² var(F̂_r(y)); F̂ replaces R̂ in Equation 31a as well
as in Equation 29c.
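A small sketch of Equations 29c and 31a, propagating variance through the product of estimated extent and estimated ratio (names illustrative; var_A_r and var_R as estimated above):

def class_area_distribution(A_r, var_A_r, R, var_R):
    """Equation 29c: A_rc(y) = A_r * R, with R = n_rc(y) / n_r.
    Equation 31a: variance of the product of the two estimates."""
    A_rc_y = A_r * R
    var_A_rc_y = var_A_r * R**2 + A_r**2 * var_R
    return A_rc_y, var_A_rc_y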
2.1.2.3 Alternative Variance Estimators
Confidence bounds for distribution functions based on point samples of continuous
and extensive populations can be computed by several methods. The choice of a method is
determined by the pattern of the resource area. First, the binomial approach is suggested
for fragmented area distributed randomly across the domain. When this condition has been
met, the randomization assumption holds and the binomial model is appropriate for
computing confidence bounds.
If the area, Ar(y), is in an entire block, rather than fragmented, then the binomial
algorithm will overestimate variance, and alternative estimators will be needed. Other
methods allow for a nonfragmented area and the randomization assumption is not required.
The mean square successive difference (MSSD) is suggested for a strict systematic sampling
scheme. Another method, the probability sampling method using the Yates-Grundy
variance estimator, requires that the design have all positive pairwise inclusion probabilities.
One such design that provides this structure is the two-stage tessellation stratified model.
The MSSD is discussed by Overton and Stehman (1993a) and the probability estimator is
discussed by Cordy (in press). Methods of spatial statistics are also available for estimating
this variance.
Mean square successive difference estimator
The variance estimator based on the mean square successive difference is intended to
provide an estimate of variance for the mean of values either from a set of points on a
triangular grid or from a random positioning of the tessellation cells of the
hexagonal dual to the triangular grid. In the latter case, the data are analyzed as though
the values were taken from the center of the tessellation cell. The data set consists of all
points falling in the target resource. The MSSD has not been developed for this tessellation
formed by triangular decomposition of the hexagons.
Smoothing
Smoothing often results in improved variance estimation (Overton and Stehman,
1993a). The following method is from that report. For each datum, y, calculate a
'smoothed' value, y*, as a weighted average of the datum and its immediate neighbors (i.e.,
at a distance of one sampling interval). Weighting for this procedure is provided below. As a
result, two new statistics are generated at each point: y* and A.
Number of Neighbors     y* value                A value
6                       (6y_i + Σy_j)/12        7/24
5                       (7y_i + Σy_j)/12        5/24
4                       (8y_i + Σy_j)/12        5/36
3                       (9y_i + Σy_j)/12        1/12
2                       (10y_i + Σy_j)/12       1/24
1                       (11y_i + Σy_j)/12       1/72
0                       y_i                     0
Given these smoothed values, summing over all data points,

\hat{\sigma}^2 = \sum_i (y_i - y_i^*)^2 \Big/ \sum_i A_i . \qquad (32)
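A sketch of the smoothing step and, under the reconstruction of Equation 32 given above, the computation of σ̂²; the neighbor lists and function names are illustrative:

A_VALUES = {6: 7/24, 5: 5/24, 4: 5/36, 3: 1/12, 2: 1/24, 1: 1/72, 0: 0.0}

def smooth_point(y_i, neighbors):
    """Weighted average of a datum and its immediate neighbors
    (one sampling interval away), per the weighting table above."""
    k = len(neighbors)
    if k == 0:
        return y_i, 0.0
    y_star = ((12 - k) * y_i + sum(neighbors)) / 12.0
    return y_star, A_VALUES[k]

def sigma2_hat(y, neighbor_lists):
    """Equation 32 (as reconstructed): smoothing-residual variance."""
    pairs = [smooth_point(yi, nb) for yi, nb in zip(y, neighbor_lists)]
    num = sum((yi - ys)**2 for yi, (ys, _) in zip(y, pairs))
    den = sum(a for _, a in pairs)
    return num / den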
Mean Square Successive Difference
Identify the data along the three axes of the triangular grid; each point will appear
once in the analyses of each axis. Analyze the y*, not the original y. For each axis,
calculate

s^2 = \sum (y_j^* - y_k^*)^2 , \qquad (33)

where y*_j and y*_k represent members of a pair of adjacent points, and where the summation
is over all adjacent pairs identified on this axis. Also, calculate for each axis,

d = \sum (y_j^* - y_k^*) , \qquad (34)

where it is necessary that all pair differences be taken in the same direction. From these
statistics, calculate for each axis,

v = \frac{s^2 - d^2/m}{2(m - 1)} \qquad (35a)

and

\Delta = \frac{s^2 - \hat{\sigma}^2}{2(m - 1)} , \qquad (35b)

where m denotes the number of pairs in the above summations.
These statistics are then combined over the three axes, where m_k is the number of
successive pairs in the kth axis:

v_1 = \frac{\sum_k (m_k - 1) v_k}{\sum_k (m_k - 1)} , \qquad (36a)

v_2 = \frac{\sum_k (m_k - 1) \Delta_k}{\sum_k (m_k - 1)} . \qquad (36b)
Lastly, the following are computed to provide estimates of the variance of the mean values:

var(\bar{y}^*) = v_1 + \frac{\hat{\sigma}^2}{n_r} \qquad (37a)

and

var(\bar{y}) = v_2 + \frac{\hat{\sigma}^2}{n_r} , \qquad (37b)

where n_r equals the number of sample points in resource r. This method has not been
extended to distribution functions, but the extension is straightforward.
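As a point of orientation, the per-axis computation sketched above (Equations 33-35a, as reconstructed here) can be coded as follows; the exact forms should be verified against Overton and Stehman (1993a):

import numpy as np

def mssd_axis(y_star):
    """Per-axis MSSD statistic (Equations 33-35a, as reconstructed):
    y_star holds smoothed values ordered along one grid axis (m > 1 pairs)."""
    diffs = np.diff(y_star)            # adjacent-pair differences, one direction
    m = len(diffs)
    s2 = np.sum(diffs**2)              # Equation 33
    d = np.sum(diffs)                  # Equation 34
    return (s2 - d**2 / m) / (2.0 * (m - 1))   # Equation 35a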
Yates-Grundy variance estimator for tessellation stratified probability samples
Investigation of this variance estimator is continuing. The method will be included
in the next version of this manual.
2.2 Model-Based Estimation Methods
The previous section was devoted to design-based methods used to derive population
estimates, distribution functions, and confidence bounds. Model-based estimation is another
common approach to computing population estimates. In this approach, certain
assumptions with regard to the underlying model are made, and the information provided
by auxiliary variables often provides greater precision of the estimates.
Within EMAP, these model-based methods have not been developed to the same
degree as the design-based methods. No algorithms for confidence bounds of distribution
functions using model-based methods are presented in this report, although they are under
development. The purpose of including this section is to provide a brief description of
currently available model-based methods. Further, application of the model-based methods
has so far been restricted to discrete populations. Investigation of the applicability of these
methods in continuous and extensive populations is under way.
Three ways in which model-based methods can be used within EMAP are discussed:
(1) data collected on the full frame across the population can be incorporated into the
estimation process using prediction estimators to improve precision; (2) because the EMAP
design is a double sample (Section 2.1.2.2), auxiliary variables on the first-stage sample can
be used to improve the precision at the second stage; and (3) a calibration method is
described for modifying an indicator variable to adjust for changes in instrumentation or
protocol - such methods are needed to maintain the viability of a long-term monitoring
program.
The strategy is to begin with the basic design-based methods and to incorporate
model-based methods as the opportunity to do so becomes apparent and the necessary frame
materials are developed. The design-based methodology will be enhanced by the use of
models whenever feasible.
2.2.1 Prediction Estimator
If auxiliary data that can be used to predict certain indicators are available on the
entire frame, model-based prediction techniques can be used to obtain predictions of the
response variable for the population. These predictions then become the base for population
inference.
These methods require a vector of predictor variables defined on the frame, while the
response variable is measured on the Tier 2 sample. A model is postulated for the
relationship between the response variable, y, and the vector of predictor variables, x:
y = g(x) + ε, with Var(ε) = h(x) . \qquad (38)

Based on this model, a predictor equation, ŷ = ĝ(x), is estimated from the Tier 2 sample.
The equation for the basic estimator, which is referred to as the general regression
estimator, is defined as

\hat{T}_{\hat{y}} = \sum_{U} \hat{y}_i + \sum_{S_2} w_{2i} (y_i - \hat{y}_i) , \qquad (39)

where U and S_2 designate the universe of units and sample units at Tier 2, respectively.
The variance of this estimator is estimated by

var(\hat{T}_{\hat{y}}) = \sum d_i^2 w_{2i} (w_{2i} - 1) + \sum \sum_{j \neq i} d_i d_j (w_{2i} w_{2j} - w_{2ij}) , \qquad (40)
where d_i = (y_i − ŷ_i) (Sarndal et al., 1992, Equation 7.2.11). Our experience (Overton and
Stehman, 1993b) suggests that this equation slightly underestimates the variance; this result
is not unexpected because Equation 40 is based only on the variance of the second term of
Equation 39.
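A sketch of the general regression estimator of Equation 39, assuming predictions are available for every frame unit and residuals for the Tier 2 sample (names illustrative; the fit of ĝ, e.g. a least squares regression of y on x, is omitted):

import numpy as np

def general_regression_total(y_hat_frame, y_obs, y_hat_obs, w2):
    """Equation 39: sum of frame predictions plus a weighted
    correction from the Tier 2 sample residuals."""
    return np.sum(y_hat_frame) + np.sum(w2 * (y_obs - y_hat_obs))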
One model-based estimator of the distribution function of the proportion of numbers,
as established by Rao et al. (1990), is based on the general regression estimator and defined
as

\hat{F}(y) = \frac{1}{N} \Big[ \sum_{U} I(\hat{y}_i \le y) + \sum_{S_2} w_{2i} \big( I(y_i \le y) - I(\hat{y}_i \le y) \big) \Big] , \qquad (41)

where N is the target population size, and I(y_i ≤ y) is an indicator function equal to 1 when
y_i ≤ y and 0 otherwise.
2.2.2 Double Samples
As mentioned previously, the EMAP design is a double sample with Tier 1
representing the first stage (or phase) and Tier 2 the second stage. Through most of this
document, design-based methods are provided for the Tier 2 sample; these methods are
similar to those described for single-stage samples. However, where model-based methods
are used, double sampling formulae can be quite different from single-stage formulae. An
elementary discussion of double sampling with model-based methods is presented in Cochran
(1977).
Existence of an auxiliary variable on the Tier 1 sample will enable model-based
double-sample methods at Tier 2. EMAP does not require a resource-specific frame, but it
does allow for acquisition of more detailed information for many resources. There is a Tier
1 sample for all resources, and for most resources, the Tier 2 sample is a subset of the Tier 1
sample, thus providing the basis for model-based double-sample methods.
The model specification follows the developments under the general prediction model
(Equation 38). The basic estimator, derived from the general regression estimator, is
defined as

\hat{T}_{\hat{y}} = \sum_{S_1} w_{1i} \hat{y}_i + \sum_{S_2} w_{2i} (y_i - \hat{y}_i) , \qquad (42)

where S_1 and S_2 define the sample at Tiers 1 and 2, respectively. The form of this estimator
allows equal or variable probability at Tier 1. The variance estimator for Equation 42
follows Sarndal et al. (1992, p. 365, Eq. 9.7.28):

var(\hat{T}_{\hat{y}}) = \sum_{S_2} \sum_{S_2} (w_{1i} w_{1j} - w_{1ij}) \hat{y}_i \hat{y}_j w_{2 \cdot 1ij} + \sum_{S_2} \sum_{S_2} (w_{2 \cdot 1i} w_{2 \cdot 1j} - w_{2 \cdot 1ij}) d_i w_{1i} d_j w_{1j} , \qquad (43)
where d_i = (y_i − ŷ_i).
The estimate of the distribution function of the proportion of numbers is developed as
an extension of Equation 41,

\hat{F}(y) = \frac{1}{N} \Big[ \sum_{S_1} w_{1i} I(\hat{y}_i \le y) + \sum_{S_2} w_{2i} \big( I(y_i \le y) - I(\hat{y}_i \le y) \big) \Big] . \qquad (44)

When N is unknown, N̂ is a suitable replacement. Smoothed versions of Equations 41 and
44, along with confidence bound algorithms, are under development.
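A sketch of Equation 44, assuming Tier 1 predictions with weights and Tier 2 observations with weights (names illustrative; pass N̂ = sum of the Tier 1 weights when N is unknown):

import numpy as np

def model_based_cdf(y_hat_tier1, w1, y_obs, y_hat_obs, w2, y, N):
    """Equation 44: model-based distribution function for a double sample."""
    term1 = np.sum(w1 * (y_hat_tier1 <= y))
    term2 = np.sum(w2 * ((y_obs <= y).astype(float) - (y_hat_obs <= y).astype(float)))
    return (term1 + term2) / N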
2.2.3 Calibration
Calibration is defined as the replacement of one variable in the data set by a function
of that variable representing another variable. For example, in a long-term monitoring
program such as EMAP, it is expected that some laboratory or data management protocols
will change over time. Using this analytical tool, data from old protocols can be calibrated
to represent data from the new protocols, thereby allowing assessment of trends across the
transition.
Overton (1987a, 1987b, 1989a) described the application of calibration issues for the
National Surface Water Surveys. In that instance, protocols were unchanged but the
extensive data of 1984 were calibrated to the same variable in 1986 to take advantage of the
strong predictive relationship through the double sample. The algorithms are provided for
this calibration in Technical Report 130 (Overton, 1989a). Tailoring of these methods to
the specific needs of EMAP will be required in certain instances. However, each application
is likely to present some unique issues and properties, so that general development does not
appear feasible.
-------
2.3 Other Issues
2.3.1 Missing Data
Two types of missing data are expected to arise in EMAP. One type is a missing
sampling unit, such as a missing lake. The other type of missing value occurs within a
sampling unit, such as a missing observation on a specific chemical variable or a missing
suite of chemical variables for a lake. In this situation, information is available on some,
but not all, indicators for a specific unit or site.
2.3.1.1 Missing Sampling Units
There appears to be no basis for imputation of a missing sampling unit where no Tier
2 information is available to predict that observation. Therefore, missing sampling units
should be considered as representing a subset of the subpopulation of interest that is
unavailable for measurement. All procedures outlined in this document accommodate data
sets that contain missing units. No adjustments to the weighting factors are necessary;
summation is over the observed portion of the sample, and the estimates produced apply to
the subpopulation represented by the sample analyzed. When Yates-Grundy estimation of
variance is used, it will be necessary to modify the equation; this requirement is the primary
reason for using the Horvitz-Thompson variance estimator when possible.
In a long-term program, this approach of classifying missing units with the
subpopulation not represented by the sample is clearly appropriate; such units can be
sampled in subsequent years without having to modify sample weights again. This
approach is also consistent with the practice of allowing sampling units to change
subpopulation classes from time to time. Comparisons must take this into account, but
such class changes will always be a feature of long-term monitoring programs.
A general problem remains when a substantial number of resource sites cannot be
measured; EMAP must find a way to provide indicator values for such sites. When the
-------
problem is severe, it might be possible to develop an alternate indicator suite that can be
obtained via aerial television or photography. Perhaps it will be possible to impose a higher
(lower resolution) sample level that will provide for model-based methods and predictors of
the indicator. (This option will be difficult because the predictor relation must be developed
specifically for the subpopulation of concern.) But whatever the solution, some method is
required to provide representation of these sites. Until then, it is appropriate for these to be
identified in the subpopulation for which no sample has been generated and about which
nothing is known.
2.3.1.2 Missing Values within Sampling Units
It is advantageous to use information collected on a specific sampling unit to impute
any missing observations for that sampling unit. To minimize error, a multivariate analysis
is suggested, utilizing the data collected for that particular unit. No specific procedure is
suggested for this analysis, because most standard analyses will impute similar values, and
because the method must be tailored to the circumstances. Some multivariate procedures
are discussed in statistics books that concentrate on imputation of missing values (cf., Little
and Rubin, 1987).
2.3.2 Censored Data
For certain measurements, values for indicators will be less than the identified
detection limit; exact values cannot be measured for such units or sites. This problem is not
uncommon and has been discussed frequently in the literature applying to water quality
management programs (cf., Porter et al., 1988). Caution is prescribed when characterizing
data that consist of many observations below the detection limit. Proper analysis and
reporting can prevent improper inference for these data; specifically, it must be noted that
although reliable values are not provided, a great deal is known about the site that has a
value at or below the detection limit.
-------
To guide the data analyst in the treatment of the indicator that contains censored
observations, the proximity of the detection limit to the critical value of the indicators needs
to be considered. Indicators, such as chemical variables, that have detection limits near or
above the critical value should not be considered meaningful indicators; the information
supplied by such an indicator is too fuzzy to justify inferences. In such cases, the most
meaningful parameters are those whose estimates are not affected by censoring. Other
indicators have a detection limit well below the critical value. For these indicators, it is
suggested that values below the detection limit should be scored to the detection limit and
analyzed with the uncensored data.
The mean is a poorly defined statistic to describe censored data. However, the scored
mean can be interpreted, even though it is slightly biased. Another statistic, the scored
mean minus the detection limit, is unbiased for the mean in excess of the detection limit,
which is a well-defined population parameter. If the distribution below the detection limit is
modeled, and the mean value below the detection limit is calculated, then the scored mean
can be converted into an unbiased estimate of the true mean, given the model.
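The scoring arithmetic is simple; the following Python fragment illustrates it on hypothetical data, with no claim that these steps reproduce any particular EMAP program.

import numpy as np

# "Scoring" censored observations to the detection limit, then forming the
# scored mean and the mean in excess of the detection limit.
DL = 0.5
reported = np.array([0.2, 0.7, 0.4, 1.3, 0.9])   # two values below DL (hypothetical)
scored = np.where(reported < DL, DL, reported)   # score to the detection limit
scored_mean = scored.mean()                      # interpretable, slightly biased
excess_mean = scored_mean - DL                   # unbiased for the mean excess over DL
print(scored_mean, excess_mean)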
On the other hand, the median is less ambiguous than the mean and is more
appropriate for characterizing these indicators. Usually the median will not be affected by
scoring. Distribution functions also should not be described below the detection limit. This
restriction is another reason for scoring; standard analyses of the scored data yield the
desired distribution function, emphasizing that the shape of the curve below the detection
limit is unknown. Because the critical level changes with circumstance, it is desirable to
present the truncated (scored) distribution function, to be interpreted as the situation
dictates. In fact, the capacity to truncate the distribution function without impairing
inferences is one of the strong arguments for choosing this parameter to characterize these
data.
-------
Modeling the function below the detection limit is one method proposed in the
statistical literature to modify estimates from censored data (Cox and Oakes, 1984; Miller,
1981). However, a hypothetical distribution must be assumed to represent the censored
data. In EMAP, distributions are defined on real populations and are unlikely to follow any
distributional law. We propose that the distribution function reflect the data alone and that
the unsupported portion of the distribution function is not described. Use of the scored
mean is somewhat less justifiable, but generally consistent with this position.
2.3.3 Combining Strata
The strata that form the structure of the Tier 2 sample are established from classes of
resources identified at Tier 1, on the Tier 1 sample. The seven basic resources are the
foundation of this structure, but there is provision for further classification leading to several
strata for lakes, several for forests, and so on. These strata are referred to as resources in
this report.
Tier 2 selection is then stratum (resource) specific and independent among strata.
This structure is chosen to provide inferences within strata, with the thought that few
occasions will arise for inferences involving combined strata. For example, a distribution
function [F(y)] combining small and middle-sized lakes will be dominated by small lakes. If
the population of large lakes is of interest, it must be characterized separately. Further, a
wide range of sizes makes the frequency distribution less useful in characterizing the
population. Still, because there may be interest in a population consisting of the largest of
the small lakes and the smaller of the middle-sized lakes, analysis of combined strata is
needed.
-------
2.3.3.1 Discrete Resources
Samples are combined across strata to compute the Tier 2 estimates. Weights will
not be uniform, so the Horvitz-Thompson algorithms using weights are needed. Estimation
of Na(y) and Fa(y) is identical to the estimation algorithms for a single stratum, but
estimation of variance requires modification. The basic formula for estimating variance is
also unchanged; only the pairwise weights w2ij must be modified. Specifically,
if i and j are from the same stratum, then

w_{2ij} = \frac{2 n_2 w_{2i} w_{2j} - w_{2i} - w_{2j}}{2(n_2 - 1)} ;   (45)

or if i and j are from different strata, then

w_{2ij} = w_{1ij}\, w_{2.1i}\, w_{2.1j} ,   (46)

where, if i and j are from different 40-hexes, then

w_{1ij} = \frac{2 n_1 w_{1i} w_{1j} - w_{1i} - w_{1j}}{2(n_1 - 1)} ;   (47)

or, if i and j are from the same 40-hex, then

(48)
where w is the weight associated with the basic Tier 1 areal sample.
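The following Python sketch collects the pairwise-weight cases of Equations 45 through 47 into two helper functions; the function names are ours, the inputs are hypothetical, and the same-40-hex case (Equation 48) is not reproduced here.

def pairwise_w2(same_stratum, w2i, w2j, n2, w1ij=None, w21i=None, w21j=None):
    # Same stratum: within-stratum pairwise weight (Equation 45).
    if same_stratum:
        return (2.0 * n2 * w2i * w2j - w2i - w2j) / (2.0 * (n2 - 1))
    # Different strata: Tier 1 pairwise weight times the two conditional
    # Tier 2 weights (Equation 46).
    return w1ij * w21i * w21j

def pairwise_w1(same_hex, w1i, w1j, n1):
    # Different 40-hexes (Equation 47); the same-40-hex case (Equation 48)
    # is left to the report and is not reproduced in this sketch.
    if not same_hex:
        return (2.0 * n1 * w1i * w1j - w1i - w1j) / (2.0 * (n1 - 1))
    raise NotImplementedError("same-40-hex case (Equation 48)")

# Example: two units in the same stratum, n2 = 20, both with weight 5.
print(pairwise_w2(True, 5.0, 5.0, 20))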
In the case of the quasi-stratified design used for lakes and streams, the
recommendation is that the sample be conditioned on the realized sample sizes in the several
distinct classes having equal inclusion probabilities (within class). This approach leads to a
-------
post-stratified sample that can be analyzed exactly like the sample from a stratified design.
The gain in precision will carry over into analysis of combined strata in the manner
discussed in this section.
2.3.3.2 Extensive Resources
Procedures for combining strata for point samples in extensive resources are the same
as those outlined for discrete resources (Section 2.3.3.1). Methods to combine strata for
areal samples in the extensive resources are still under consideration and will be addressed at
a later time.
2.3.4 Additional Sources of Error
Other potential sources of error can be expected in the process of developing the
distribution function and confidence bounds. Some of these have been discussed after
evaluation of the Eastern and Western Lake Surveys (Overton, 1989a, 1989b). These
additional sources of error add to the uncertainty and bias of the estimated distribution
function. Research is presently under way to investigate methods, such as deconvolution, to
correct for these added components of error and bias. Preliminary methods are
unsatisfactory, and two different approaches are being followed to improve results. These
methods will be introduced to EMAP analyses as they become available.
The rounding of measurements reduces precision in quantiles and distribution
function estimation. Analyses of the National Surface Water Surveys suggested that
reporting data at two decimal places beyond the inherent accuracy of the indicator
satisfactorily reduces bias attributed to rounding error (Overton, 1989b). It is recommended
that additional decimal places be carried into the data set if they are provided by the
instrumentation. Additional rounding should be made only at the reporting step, and the
-------
rule for rounding should take into account gain in precision from averaging and other
statistical practices.
2.3.5 Supplementary Units or Sites
Supplementary units, in addition to the yearly EMAP grid points, have been selected
and measured or remeasured by some resource groups. For example, a set of supplementary
units can be selected as a subset of one of the interpenetrating replicates. The
remeasurement of these supplementary units is directed at specific issues, such as estimation
of variance, and the selection procedure is likely to be influenced by this purpose. If data
from supplementary probability samples are combined with the general EMAP sample, it is
necessary to use a protocol for combining two probability samples. If the supplementary
data are not from a probability sample, then it is necessary to use a protocol for combining
found data with probability sample data (Overton et al., 1993). Ordinarily, a good strategy
will be to use these supplementary data only for analyses initially intended. The effort
necessary to satisfactorily combine supplementary data within the general sample analysis,
such as the distribution functions, is sufficiently great that one should be reluctant to
attempt this combination. On the other hand, there will be certain circumstances in which
this effort is justifiable.
-------
SECTION 3
DISTRIBUTION FUNCTION ALGORITHMS
The types of distribution function algorithms, along with their associated conditions
for application, are presented in Table 1. The first part of this table (A) presents the
various cases yielding the distribution of numbers, N(y). Part B presents the various cases
discussed in this report yielding the distribution functions for the proportions of numbers.
The methods of obtaining the distribution functions for size-weighted statistics are presented
in Part C.
To explain the notation presented in the following algorithms, some terminology is
introduced. The target population size, N, is the size of the target subset of the universe of
units, defined as U. The following algorithms are written to obtain estimates over a
particular subpopulation of interest. For a particular subpopulation (a), the distribution of
numbers is denoted as Na(y) and the distribution of the proportion of numbers is denoted as
Fa(y). Na denotes the subpopulation size over the subpopulation, a. In addition, n and
na refer to the sample size from the population and subpopulation, respectively.
The variance estimator discussed in Section 2 is based on the Horvitz-Thompson
theorem and is appropriate for both equal and variable probability sampling, independent of
a known population or subpopulation size. The confidence bounds using this variance
estimator are then based on the normal approximation. Therefore, for any condition, the
general Horvitz-Thompson algorithms for Na(y) and Fa(y), as presented in the following
subsections under variable probability sampling, are appropriate.
Estimation of these bounds simplifies under equal probability sampling when the size
of either the population or the subpopulation is known. For example, an exact confidence
bound for Fa(y) can be based on the hypergeometric distribution in the case of equal
-------
probability sampling when the subpopulation size is known. When the subpopulation size is
unknown, these bounds can be based on the binomial distribution.
It should be emphasized that there are no differences in the distribution functions
obtained from the alternative design-based approaches discussed in this report. Further, the
distribution functions obtained under the same conditions based on the Horvitz-Thompson,
the binomial, or the hypergeometric algorithm are the same. The differences occur in the
computation of the confidence bounds. Note, however, that model-based distribution
functions will be different from those obtained from design-based methods.
In all situations, the algorithms in this report provide two one-sided 95% confidence
bounds. The combined upper and lower confidence bounds enable two-sided 90% confidence
bounds on the distribution function. The Horvitz-Thompson algorithm estimates standard
errors from which the confidence bound is based on a normal approximation. The
alternative methods directly provide confidence bounds based on the exact binomial or
hypergeometric distributions. All design-based methods suggested for discrete populations
assume the randomized model, as discussed in Section 2. Because exact methods are usually
preferred over approximate methods, the exact methods are suggested for those cases in
which the conditions justify their use.
A test data set was applied to the following algorithms. Any resource group
interested in comparing its versions of these algorithms to the ones provided in this report
is encouraged to contact the authors. A copy of the test data set will be provided in order
to compare results from other programs.
3.1 Discrete Resources
In this section, examples are provided for each of the possible approaches to obtaining
Na(y) and Fa(y) for discrete resources. For each of these approaches, the conditions and
assumptions of the selection of the sampling units are defined. For quick reference, Table 1
-------
(Section 4) presents this information in condensed form. An interest in obtaining the
distribution function of numbers and proportion of numbers across the subpopulation is
expected for all resource groups. For example, the lakes and streams resource group can
compute the numbers or proportions of numbers of lakes with some attribute based on this
algorithm.
3.1.1 Estimation of Numbers
A number of algorithms are presented for computing the distribution function for
numbers. The choice of the algorithm is dependent on whether the units were chosen by
either equal or variable selection. The first three cases (algorithms) in this section derive the
distribution functions based on an equal probability selection of units and the latter two
cases (algorithms) are based on an unequal probability selection of units.
Equal Probability of Selection
In this subsection, three cases are provided based on information that is known or
unknown. For the first algorithm, N is either known or unknown and Na is known; this
algorithm produces confidence bounds based on the hypergeometric distribution. For the
second algorithm, N is known, but Na is unknown; this algorithm is also based on the
hypergeometric distribution. For the third algorithm, both N and Na can be either known
or unknown; this algorithm produces confidence bounds based on the Horvitz-Thompson
variance estimator and the normal approximation.
-------
(Case 1)
Case 1 - Estimation of Na(y): Discrete Resource, Equal Probabilities, Na known, n = na.
Confidence Bounds by Hypergeometric Algorithm.
Conditions for approach
1. The frame population size, N, can be known or unknown.
2. The subpopulation size, Na, is known.
3. There is an equal probability of selection of units from the subpopulation.
4. Sample size condition: n = na.
Outline for Algorithm
Under the given conditions, the confidence bounds can be obtained by either the
exact hypergeometric distribution or by the normal approximation. This case provides the
confidence bounds for Na(y) by the hypergeometric distribution, when Na is known. The
normal approximation bounds are provided in the next subsection (see Examples 4 and 5).
This algorithm computes the confidence bounds for each point along the curve using
the hypergeometric distribution. In the following formula, Nfl is the subpopulation size; na
is the sample size from the subpopulation; NQ(y) refers to the number of units, u, in the
subpopulation, J., for which yu < y; and nQ(y) refers to the number of units in the sample
from NO, Sa, for which yu 0.05. The lower confidence bound is computed by obtaining the smallest
value of N (y) for which Prob[X > n (y)] > 0.05.
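As an informal cross-check of the search just described (the report's own implementation, in GAUSS, follows), the next Python fragment performs the same two searches with scipy's hypergeometric distribution; the data values are hypothetical, and a plain linear search is chosen for clarity rather than speed.

from scipy.stats import hypergeom

def case1_bounds(Na, na, na_y, alpha=0.05):
    # X ~ Hypergeometric(M=Na, K=candidate Na(y), n=na); scipy's argument
    # order is (M, number of successes, number of draws).
    upper = max(K for K in range(Na + 1)
                if hypergeom.cdf(na_y, Na, K, na) > alpha)    # P[X <= na(y)]
    lower = min(K for K in range(Na + 1)
                if hypergeom.sf(na_y - 1, Na, K, na) > alpha)  # P[X >= na(y)]
    return lower, upper

# Example: Na = 120 units, na = 25 sampled, 10 with values at or below y.
print(case1_bounds(120, 25, 10))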
-------
(Case 1)
A GAUSS program is presented here that derives the confidence bounds based on the
hypergeometric distribution. Comments in capital letters in braces explain the
programming steps. Under the conditions of Case 1, the upper and lower halves of the
confidence bounds are symmetric.
CALCULATION OF CONFIDENCE BOUNDS ON Na(y) BY THE
HYPERGEOMETRIC DISTRIBUTION
load x[a,b] = data;  {LOADS DATA FILE WHICH INCLUDES LABEL CODE AND
                      VARIABLE TO BE ANALYZED. HERE a DESIGNATES
                      THE SAMPLE SIZE, na, AND b DESIGNATES THE
                      NUMBER OF COLUMN VECTORS}
x=x[.,1 2];          {KEEPS LABEL CODE AND THE VARIABLE OF INTEREST (SECOND COLUMN)}
nm=rows(x);          {NUMBER OF ELEMENTS OF INTEREST IN SUBPOPULATION, na.
                      IN THIS ALGORITHM, n=nm=na}
n=rows(x);
NN=Na;               {DEFINES TOTAL SUBPOPULATION SIZE HERE, Na}
x=sortc(x,2);        {SORTS VARIABLE OF INTEREST}
y=seqa(1,1,nm);      {CREATES SEQUENCE OF NUMBERS}
x2=x[.,2];           {DEFINES VARIABLE OF INTEREST AS x2}
x=y~x2;              {CREATES MATRIX x}
zz=x;                {DEFINES MATRIX x AS zz}
{THE FOLLOWING COMBINES RECORDS WITH DUPLICATE VALUES
OF THE VARIABLE}
xx=zeros(1,2);
q=0;
i=1;
do while i < nm;
  if x[i,2]==x[i+1,2];
    q=q+1;
  else;
    xx=xx|x[i,.];
  endif;
  i=i+1;
endo;
xx=xx|x[nm,.];
-------
(Case 1)
{THE FOLLOWING STEPS BEGIN CONFIDENCE BOUND ESTIMATION}
r=rows(x);
z=zeros(r,1);
x1=x[.,1];
x2=x[.,2];
x=x2~x1~(NN*x1/nm)~z;
{THE FOLLOWING STEPS GENERATE THE UPPER CONFIDENCE BOUND}
i=1;
do while i <= r;   {BEGINS INITIAL DO LOOP}
  rr=x[i,2];
  mm=trunc(NN*rr/nm);
  if mm >= NN;
    goto three;
  endif;
one:;
  mm=mm+1;
  if NN <= 160;
    aa=n!*mm!/NN!*(NN-mm)!*(NN-n)!;
  else;
    aa=lnfact(n) + lnfact(mm) - lnfact(NN) + lnfact(NN-mm) + lnfact(NN-n);
  endif;
  j=0;
  if (NN-mm-n) < 0;
    j=-(NN-mm-n);
  endif;
  s=0;
  do while j <= rr;
    if NN <= 160;
      a=aa/(j!*(mm-j)!*(n-j)!*(NN-mm-n+j)!);
    else;
      a=aa - lnfact(j) - lnfact(mm-j) - lnfact(n-j) - lnfact(NN-mm-n+j);
      a=exp(a);
    endif;
    s=s+a;
    j=j+1;
  endo;
  if s >= .05;
    goto one;
  endif;
three:;
  if mm >= NN;
    x[i,4]=NN;
  else;
    x[i,4]=mm-1;
  endif;
  i=i+1;
ENDO;   {ENDS INITIAL DO LOOP}
-------
(Case 1)
{THE FOLLOWING STEPS ADD AN EXTRA LINE TO DATA MATRIX NEEDED IN
CONFIDENCE BOUND ADJUSTMENT COMPUTED AT END OF ALGORITHM}
r=rows(x);
y=zeros(r,1);
x=x~y;
y=zeros(1,5);
y[1,2:4]=x[r,2:4];
x=x|y;
{THE FOLLOWING STEPS GENERATE THE LOWER CONFIDENCE BOUND}
r=rows(x);
i=1;
do while i <= r;   {BEGINS SECOND DO LOOP}
  rr=x[i,2];
  mm=trunc(NN*rr/n);
  if mm==0;
    goto six;
  endif;
four:;
  mm=mm-1;
  if NN <= 160;
    aa=n!*mm!/NN!*(NN-mm)!*(NN-n)!;
  else;
    aa=lnfact(n) + lnfact(mm) - lnfact(NN) + lnfact(NN-mm) + lnfact(NN-n);
  endif;
  j=rr;
  mnm=minc(n|mm);
  s=0;
  do while j <= mnm;
    if NN <= 160;
      a=aa/(j!*(mm-j)!*(n-j)!*(NN-mm-n+j)!);
    else;
      a=aa - lnfact(j) - lnfact(mm-j) - lnfact(n-j) - lnfact(NN-mm-n+j);
      a=exp(a);
    endif;
    s=s+a;
    j=j+1;
  endo;
  if s >= .05;
    goto four;
  endif;
six:;
  if mm==0;
    x[i,5]=0;
  else;
    x[i,5]=mm+1;
  endif;
  i=i+1;
ENDO;   {ENDS SECOND DO LOOP}
-------
(Case 1)
{ASSIGN LABELS}
"N = " NN ", n = " n;
x;
OUTPUT OFF;
{ADJUST Na(y) AND CONFIDENCE BOUNDS - AVERAGE SUCCESSIVE VALUES}
r=rows(x);
xx=x;
i=2;
do while i <= r-1;
  xx[i,3:5]=(x[i,3:5] + x[i-1,3:5])/2;
  i=i+1;
endo;
{OUTPUT FILE AND PRINT MATRIX x}
OUTPUT FILE=NAME;
OUTPUT reset;
" x " " Sequence # " " F(x) " " F-lower(x) " " F-upper(x) ";
format /m1/rd 12,7;
print xx;
OUTPUT OFF;
end;
-------
(Case 2)
Case 2 - Estimation of Na(y): Discrete Resource, Equal Probabilities,
N known, Na unknown, n > na.
Confidence Bounds by Hypergeometric Algorithm.
Conditions for approach
1. The frame population size, N, is known.
2. The subpopulation size, Na, is unknown.
3. There is an equal probability of selection of units from the subpopulation.
4. Sample size condition: n > na.
Outline for Algorithm
Under the given conditions, the confidence bounds can be obtained by either the
exact hypergeometric distribution or by the normal approximation. This example provides
the confidence bounds for Na(y) by the hypergeometric distribution, when N is known, but
Na is unknown. Normal approximation bounds are provided in the next subsection (see
Examples 4 and 5).
This algorithm computes the confidence bounds for each point along the curve using
the hypergeometric distribution. In the following formula, N is the frame population size; n
is the sample size from the frame population; Na(y) refers to the number of units, u, in the
subpopulation for which yu ≤ y; and na(y) refers to the number of units in the sample
from Na, Sa, for which yu ≤ y. Under these conditions, na(y) has the following hypergeometric
distribution. Let X represent the random variable for which na(y) is a realization. Note
that na(y) ≤ n and that Na(y) ≤ Na ≤ N.

\Pr(X = x) = \binom{N_a(y)}{x} \binom{N - N_a(y)}{n - x} \bigg/ \binom{N}{n}   (50)
-------
(Case 2)
The upper confidence bound is computed by obtaining the largest value of Na(y) for which
Prob[X ≤ na(y)] > 0.05. The lower confidence bound is computed by obtaining the smallest
value of Na(y) for which Prob[X ≥ na(y)] > 0.05.
To obtain the distribution function, the data file needs to be sorted on the indicator,
either in an ascending or descending order. When the data file is sorted in ascending order
on the indicator, the distribution function of numbers, Na(y), denotes the number of units in
the target population that have a value less than or equal to the specific y. Conversely, if
it is of interest to obtain bounds on the number of units in the target population with
indicator values greater than or equal to y, the data file must be sorted and analyzed in
descending order on this variable. The distribution function generated by the analysis in
descending order is [Na - Na(y)].
A GAUSS program provided in Case 1 derives the confidence bounds based on the
hypergeometric distribution. However, under the conditions discussed here, the sample size
and population sizes are defined as follows.
CALCULATION OF CONFIDENCE BOUNDS ON N0(y) BY THE
HYPERGEOMETRIC DISTRIBUTION
load x[a,b] = data;  {LOADS DATA FILE WHICH INCLUDES LABEL CODE AND
                      VARIABLE TO BE ANALYZED. HERE a DESIGNATES
                      THE OBSERVED SAMPLE SIZE, na, AND b DESIGNATES
                      THE NUMBER OF COLUMN VECTORS}
x=x[.,1 2];          {KEEPS LABEL CODE AND THE VARIABLE OF INTEREST}
nm=rows(x);          {NUMBER OF ELEMENTS OBSERVED, na. IN THIS ALGORITHM,
                      n > na}
n=#;                 {FULL SAMPLE SIZE}
NN=N;                {DEFINES TOTAL POPULATION SIZE HERE}
REFER TO CASE 1 (AFTER LINE 13) FOR THE REMAINING STEPS
IN THIS PROGRAM.
-------
(Case 3)
Case 3 - Estimation of Na(y): Discrete Resource, Equal Probabilities.
Confidence Bounds by Horvitz-Thompson Standard Error
and Normal Approximation.
Conditions for approach
1. The frame population size, N, can be known or unknown.
2. The subpopulation size, Na, can be known or unknown.
3. There is an equal probability of selection of units from the subpopulation.
[Note that this algorithm can also be applied to those cases presented in
Examples 1 and 2.]
Outline for Algorithm
The algorithm recommended, given the foregoing conditions, is based on the Horvitz-
Thompson formulae, which were discussed in Section 2. The algorithm presented for the
general case of variable probability of selection (the following subsection) is appropriate to
use given the foregoing conditions.
Equal probability selection is a special case of variable probability selection. In equal
probability of selection of units, the weighting factors are equal for all units, wi = wj = w. If
the weights, w1i and w2.1i, are appropriately identified, then the general algorithm presented
in Example 4 will not need any modification. The Tier 2 weight, w2i, computed by
Equation 4 is the same for all units.
-------
Variable Probability Selection
In this subsection, two examples are provided to demonstrate variable probability of
selection. For both cases, the frame population size can be known or unknown. In Case 4,
Na is unknown, or known and equal to its estimate N̂a. For Case 5, Na is known and not
equal to N̂a. Both algorithms produce confidence bounds based on the Horvitz-Thompson
variance estimator and the normal approximation.
-------
(Case 4)
Case 4 - Estimation of Na(y): Discrete Resource, Variable Probabilities,
Na unknown, or Na known and equal to N̂a.
Confidence Bounds by Horvitz-Thompson Standard Error
and Normal Approximation.
Conditions for approach
1. The frame population size, N, can be known or unknown.
2. The subpopulation size, Na, is unknown, or known and equal to N̂a.
3. There is a variable probability of selection of units from the subpopulation.
Outline for Algorithm
The algorithm supplied for this example is based on the Horvitz-Thompson formulae,
which were discussed in Section 2. This algorithm is appropriate for a sample subset for any
subpopulation a that is of interest. It is useful to identify the estimator of Na from Tier 2.
The design-based estimator of Na is

\hat{N}_a = \sum_{S_a} w_{2i} ,   (51)

where Sa is the portion of the sample from the subpopulation over which the weighting
factors (w2i) are summed. The variance estimator for N̂a is presented in Equation 3b of
Section 2.
Calculation of confidence bounds on Na(y) by the Horvitz-Thompson formulae
For each indicator, the following algorithm derives the distribution function and the
confidence bound for N(y) or Na(y). This algorithm is similar to the algorithm defined for
the National Surface Water Surveys (Overton, 1987a,b). The Horvitz-Thompson variance
estimator, discussed in Section 2.1, is used to compute the variance in this algorithm. The
confidence bounds are computed based on a normal approximation.
-------
(Case 4)
1. Data set
a. Unit identification code
b. Tier 1 weighting factor, w1i
c. Tier 2 conditional weighting factor, w2.1i
d. Indicator of interest (y)
2. Sorting of data
The data file needs to be sorted on the indicator, either in an ascending or
descending order. When the data file is sorted in ascending order on the indicator,
the distribution function of numbers, Na(y), denotes the number of units in the
target population that have a value less than or equal to the y for a specific
indicator. Conversely, if it is of interest to estimate the number of units in the
target population with indicator variables greater than or equal to y, the data file
would be sorted in descending order on this variable. The distribution function
generated by the analysis in descending order is [Na - Na(y)].
3. Computation of weighting factors
The Tier 1 and Tier 2 weights are included for each observation in the data
set. These weights are used to compute the total weight of selecting the ith unit in
the Tier 2 sample. Compute the following weight for each observation:

w_{2i} = w_{1i}\, w_{2.1i} ,

where w1i is the weighting factor for the ith unit in the Tier 1 sample (the inverse of
its Tier 1 inclusion probability) and w2.1i is the inverse of the conditional Tier 2
inclusion probability.
4. Algorithm for Na(y)
a. Define a matrix of q column vectors, which will be defined as the following.
There is one row for each data record and five statistics for each row.
q1 = value of y variable for the record
q2 = N̂a(y)
q3 = var[N̂a(y)]
q4 = upper confidence bound for N̂a(y)
q5 = lower confidence bound for N̂a(y)
b. Index rows using i from 1 to n; the ith row will contain q-values
corresponding to the ith record in the file, as analyzed.
c. Read first observation (first row of data matrix), following with the
successive observations, one at a time. Accumulate the q-statistics as each
observation is read into file. Continue this loop until the end of file is
reached. At that time, store these vectors and go to d. This algorithm is
calculating the distribution for the number of units [Na(y)} in the
subpopulation. It is necessary to identify the records for which w2.1i = 1.
-------
(Case 4)
i. q1[i] = y[i]
ii. q2[i] = q2[i-1] + w_{2i}
iii. q3[i] = q3[i-1] + w_{2i}(w_{2i} - 1) + 2 \sum_{j<i} (w_{2i} w_{2j} - w_{2ij})
where, if neither w2.1i nor w2.1j = 1:

w_{2ij} = \frac{2 n_2 w_{2i} w_{2j} - w_{2i} - w_{2j}}{2(n_2 - 1)} ;

where, if either w2.1i or w2.1j = 1:

w_{2ij} = w_{1ij}\, w_{2.1i}\, w_{2.1j} ;

where:

w_{1ij} = \frac{2 n_1 w_{1i} w_{1j} - w_{1i} - w_{1j}}{2(n_1 - 1)} .

iv. q4[i] = q2[i] + 1.645 \sqrt{q_3[i]}
v. q5[i] = q2[i] - 1.645 \sqrt{q_3[i]}
Multiple observations with one y value create multiple records in the above
analysis for one distinct value of y. The last record for that y contains all
the information needed for Na(y). Therefore, at this stage of the analysis,
eliminate all but the last record for those y values that have multiple
records.
d. Output of interest
From the last entry of the row of q-vectors just computed:
i. q1 = largest value of y (or smallest if analysis is descending)
ii. q2 = N̂a
iii. q3 = var(N̂a)
iv. Standard error of N̂a = \sqrt{q_3}
From the q column vectors:
i. q1 represents the ordered vector of distinct values of y
ii. q2 represents the estimated distribution function, N̂a(y),
corresponding to the values of y
iii. q4 represents the 95% one-sided upper confidence
bound of the distribution function, N̂a(y)
iv. q5 represents the 95% one-sided lower confidence
bound of the distribution function, N̂a(y)
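The recursion in step c can be checked with a few lines of Python; the sketch below assumes the pairwise-weight case in which neither conditional weight equals 1, and the data are hypothetical.

import numpy as np

y  = np.array([1.2, 2.0, 2.0, 3.5])   # indicator values, sorted ascending
w2 = np.array([4.0, 5.0, 4.0, 6.0])   # overall Tier 2 weights w2i
n2 = len(y)

def pair_w(wi, wj):
    # Pairwise weight when neither conditional weight equals 1 (step c.iii)
    return (2 * n2 * wi * wj - wi - wj) / (2 * (n2 - 1))

N_y, var_y, last = 0.0, 0.0, {}
for i in range(n2):
    N_y += w2[i]                                   # q2 accumulation
    var_y += w2[i] * (w2[i] - 1)                   # q3, single-sum term
    var_y += 2 * sum(w2[i] * w2[j] - pair_w(w2[i], w2[j]) for j in range(i))
    se = var_y ** 0.5
    last[y[i]] = (N_y, N_y - 1.645 * se, N_y + 1.645 * se)

# Only the last record per distinct y is kept, as the algorithm directs.
for yy in sorted(last):
    print(yy, last[yy])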
-------
(Case 5)
Case 5 - Estimation of Na(y): Discrete Resource, Variable Probabilities,
Na known and not equal to N̂a.
Confidence Bounds by Horvitz-Thompson Standard Error
and Normal Approximation.
Conditions for approach
1. The frame population size, N, can be known or unknown.
2. The subpopulation size, Na, is known and not equal to N̂a.
3. There is a variable probability of selection of units from the subpopulation.
Outline for Algorithm
The algorithm supplied in this section is based on the Horvitz-Thompson formulae,
which were discussed in Section 2. This algorithm is appropriate for a sample subset for any
subpopulation a that is of interest. The algorithm for the distribution function for the
proportion of numbers, Fa(y), given exactly the same conditions listed above, is presented in
Case 8. To compute the distribution function of numbers, Na(y), first use the algorithm in
Case 8 to compute the distribution function with the corresponding confidence bounds for
the proportion of numbers. Then, compute the following:

\hat{N}_a(y) = \hat{F}_a(y) \cdot N_a ,   (52)

where Na is the known subpopulation size. To compute the confidence bounds for N̂a(y),
simply multiply the upper and lower confidence limits of F̂a(y) by Na.
-------
3.1.2 Proportions of Numbers
A number of algorithms are presented to compute the distribution function for the
proportion of numbers. For any case in a resource group, the choice of the algorithm is first
determined by the method by which the units were selected. The first two algorithms in
this section derive the distribution functions based on an equal probability selection of units
and the latter two algorithms are based on an unequal probability selection of units.
Equal Probability of Selection
In this subsection, two examples are provided based on whether the
subpopulation size is known or unknown. For the first algorithm, Na can be known or
unknown; this algorithm produces confidence bounds based on the binomial distribution.
For the second algorithm, Na is known; this algorithm is based on the hypergeometric
distribution.
-------
(Case 6)
Case 6 - Estimation of Fa(y): Discrete Resource, Equal Probabilities,
Na known or unknown.
Confidence Bounds by Binomial Algorithm.
Conditions for approach
1. The frame population size, N, can be known or unknown.
2. The subpopulation size, Na, can be known or unknown.
3. There is an equal probability of selection of units from the subpopulation.
Outline for Algorithm
Under the given conditions in which Na may not be known, the confidence bounds
can be based on the binomial distribution. In addition, Example 8 provides the normal
approximation approach to the confidence bound estimation.
A program, based on the binomial distribution and written in the GAUSS language,
is presented in this section. We assume X has the binomial distribution,
X ~ Binomial[na, Fa(y)], where na(y) is the observed realization of X, na represents the
number of "trials", F_N(y) = N_a(y)/N_a represents the true finite population proportion of
"successes", and Fa(y) is the infinite population parameter. The estimated distribution
function is denoted as \hat{F}_a(y) = n_a(y)/n_a, where na is the sample size from the
subpopulation and na(y) refers to the number of units in the sample for which yu ≤ y. The upper confidence
bound is computed by obtaining the largest value of Fa(y) for which Prob[X ≤ na(y)] > 0.05.
The lower confidence bound is computed by obtaining the smallest value of Fa(y) for which
Prob[X ≥ na(y)] > 0.05. As written, the algorithm calculates the upper and lower confidence
bounds to three decimal places.
Comments in capital letters in braces explain the programming steps. Under these
conditions, the upper and lower halves of the confidence bounds are symmetric.
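Before the GAUSS program, an informal Python rendering of the same search rule follows, using scipy's binomial distribution; the grid scan to three decimal places mirrors the report's description, and the data values are hypothetical.

from scipy.stats import binom

def case6_bounds(na, na_y, alpha=0.05, step=0.001):
    # Upper: largest p with P[X <= na(y)] > alpha;
    # lower: smallest p with P[X >= na(y)] > alpha, for X ~ Binomial(na, p).
    grid = [i * step for i in range(int(round(1 / step)) + 1)]
    upper = max(p for p in grid if binom.cdf(na_y, na, p) > alpha)
    lower = min(p for p in grid if binom.sf(na_y - 1, na, p) > alpha)
    return lower, upper

# Example: na = 25 sample units, 10 with values at or below y.
print(case6_bounds(25, 10))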
-------
(Case 6)
CALCULATION OF CONFIDENCE BOUNDS ON Fa(y) BY THE
BINOMIAL DISTRIBUTION
load x[a,b] = data; {LOADS DATA FILE FOR THE TARGET SUBPOPULATION
WHICH INCLUDES LABEL CODE AND VARIABLE TO
BE ANALYZED. HERE a DESIGNATES THE SAMPLE
SIZE, na, AND b DESIGNATES THE NUMBER OF
COLUMN VECTORS}
n=rows(x);          {SAMPLE SIZE IN TARGET SUBPOPULATION, na}
x=sortc(x,2);       {SORTS VARIABLE OF INTEREST}
y=seqa(1,1,n);      {CREATES SEQUENCE OF NUMBERS}
x2=x[.,2];          {DEFINES VARIABLE OF INTEREST AS x2}
x=y~x2;             {CREATES MATRIX x}
{THE FOLLOWING STEPS COMBINE RECORDS WITH COMMON y-VALUES}
xx=zeros(1,2);
q=0;
i=1;
do while i < n;
  if x[i,2]==x[i+1,2];
    q=q+1;
  else;
    xx=xx|x[i,.];
  endif;
  i=i+1;
endo;
xx=xx|x[n,.];
r=rows(xx);
x=xx;
{THE FOLLOWING STEPS FORM DATA MATRIX x}
r=rows(x);
z=zeros(r,1);
x1=x[.,1];
x2=x[.,2];
x=x2~x1~(x1/n)~z;
{THESE STEPS GENERATE BINOMIAL COMBINATION TERMS}
f=zeros(n+1,1);
i=0;
do while i<=n;
  f[i+1,1]=exp(lnfact(n) - lnfact(i) - lnfact(n-i));
  i=i+1;
endo;
{THE FOLLOWING STEPS GENERATE UPPER CONFIDENCE BOUND}
i=1;
do while i <= r;   {BEGINS INITIAL DO LOOP}
  rr=x[i,2];
  p=(trunc(100*x[i,3]))/100;
  if p==1.0;
    p=p-.001;
    goto three;
  endif;
one:;
  p=p+.01;
  j=0;
  s=0;
  do while j <= rr;
    a=f[j+1,1]*p^j*(1-p)^(n-j);
    s=s+a;
    j=j+1;
  endo;
  if s >= .05;
    goto one;
  endif;
two:;
  p=p-.001;
  j=0;
  s=0;
  do while j <= rr;
    a=f[j+1,1]*p^j*(1-p)^(n-j);
    s=s+a;
    j=j+1;
  endo;
  if s <= .05;
    goto two;
  endif;
three:;
  x[i,4]=p+.001;
  i=i+1;
ENDO;   {ENDS INITIAL DO LOOP}
"64
-------
(Case 6)
{THE FOLLOWING STEPS ADD AN EXTRA LINE TO DATA MATRIX NEEDED IN
CONFIDENCE BOUND ADJUSTMENT COMPUTED AT END OF ALGORITHM}
r=rows(x);
y=zeros(r,1);
x=x~y;
y=zeros(1,5);
y[1,2]=n;
x=x|y;
{THE FOLLOWING STEPS GENERATE LOWER CONFIDENCE BOUND}
r=rows(x);
i=1;
do while i <= r;   {BEGINS SECOND DO LOOP}
  rr=x[i,2];
  p=(trunc(100*x[i,3]))/100;
  if p==0;
    p=.001;
    goto six;
  endif;
four:;
  p=p-.01;
  if p<=0;
    p=.001;
    goto six;
  endif;
  j=rr;
  s=0;
  do while j <= n;
    a=f[j+1,1]*p^j*(1-p)^(n-j);
    s=s+a;
    j=j+1;
  endo;
  if s >= .05;
    goto four;
  endif;
five:;
  p=p+.001;
  j=rr;
  s=0;
  do while j <= n;
    a=f[j+1,1]*p^j*(1-p)^(n-j);
    s=s+a;
    j=j+1;
  endo;
  if s <= .05;
    goto five;
  endif;
-------
(Case 6)
six:;
  x[i,5]=p-.001;
  i=i+1;
ENDO;   {ENDS SECOND DO LOOP}
{ADJUST Fa(y) AND CONFIDENCE BOUNDS - AVERAGE SUCCESSIVE VALUES}
r=rows(x);
xx=x;
i=2;
do while i <= r-1;
  xx[i,3:5]=(x[i,3:5] + x[i-1,3:5])/2;
  i=i+1;
endo;
{OUTPUT FILE AND PRINT MATRIX x}
OUTPUT FILE=NAME;
OUTPUT ON;
" x " " Sequence # " " F(x) " " F-upper(x) " " F-lower(x) ";
format /m1/rd 12,7;
print xx;
OUTPUT OFF;
end;
-------
(Case 7)
Case 7 - Estimation of Fa(y): Discrete Resource, Equal Probabilities, Na known.
Confidence Bounds by Hypergeometric Algorithm.
Conditions for approach
1. The frame population size, N, can be known or unknown.
2. The subpopulation size, Na, is known.
3. There is an equal probability of selection of units from the subpopulation.
Outline for Algorithm
Under the given conditions, the confidence bounds can be based either on the
binomial or on the hypergeometric distribution. The binomial algorithm presented in
Example 6 is appropriate to use given the foregoing conditions. In addition, Example 9
provides the normal approximation approach, which is also applicable, given the foregoing
conditions, to the confidence bound estimation.
To obtain confidence bounds for Fa(y) based on the hypergeometric distribution, refer
to the algorithm provided for the confidence bound calculation for Na(y) in Example 1.
Simply divide the lower and upper confidence bounds, and Na(y), by the known
subpopulation size, Na. No further changes are necessary to this algorithm to provide
confidence bounds for Fa(y) based on the hypergeometric distribution.
-------
Variable Probability Selection
In this subsection, two cases are provided to demonstrate variable probability of
selection. For both cases, the frame population size can be known or unknown. In Case 8,
Na can be unknown, or known and not equal to N̂a; this algorithm produces confidence
bounds based on the Horvitz-Thompson ratio standard error and the normal approximation.
For Case 9, Na is known and equal to N̂a; this algorithm produces confidence bounds based
on the Horvitz-Thompson variance estimator and the normal approximation.
-------
(Case 8)
Case 8 - Estimation of Fa(y): Discrete Resource, Variable Probabilities,
Na unknown, or known and not equal to N̂a.
Confidence Bounds by Horvitz-Thompson Ratio Standard Error
and Normal Approximation.
Conditions for approach
1. The frame population size, N, can be known or unknown.
2. The subpopulation size, Na, is unknown, or known and not equal to N̂a.
3. There is a variable probability of selection of units from the subpopulation.
Outline for Algorithm
The algorithm supplied in this section is based on the Horvitz-Thompson formulae,
which were discussed in Section 2. This algorithm is appropriate for a sample subset for any
subpopulation a that is of interest.
Calculation of confidence bounds on Fa(y) by the Horvitz-Thompson formulae
For each indicator, the following algorithm derives the distribution function and the
confidence bounds, similar to those given for Na(y) in Example 4. In this section, however,
the interest is in obtaining a distribution function for proportions. Therefore, the variance
of a ratio estimator is used in this algorithm. The confidence bounds are computed based
on a normal approximation.
1. Data set
a. Unit identification code
b. Tier 1 weighting factor, w1i
c. Tier 2 conditional weighting factor, w2.1i
d. Indicator of interest (y)
e. The subset of data corresponding to the subpopulation of interest, indexed
by a.
2. Computation of weighting factors
This step does not have to be made with each use of the datum, as the
weights are permanent attributes of a sampling unit. The following details are
-------
(CaseS)
given for completeness.
The Tier 1 and Tier 2 weights are included for each record in the data set.
These weights are used to compute the total weight of selecting the ith unit in the
Tier 2 sample. Compute the following weight for each record:

w_{2i} = w_{1i}\, w_{2.1i} ,

where w1i is the weighting factor for the ith unit in the Tier 1 sample (the inverse of
its Tier 1 inclusion probability) and w2.1i is the inverse of the conditional Tier 2
inclusion probability. The pairwise inclusion weight is defined below. The sample
size at Tier 2, n2, is not subpopulation specific.
3. Algorithm for Fa(y) and Confidence Intervals
a. Sorting of data. The data file needs to be sorted on the indicator, either in
an ascending or descending order. When the data file is sorted in ascending
order on the indicator, the distribution function of proportions, Fa(y),
denotes the proportion of units in the target population that have a value
less than or equal to the y for a specific indicator. Conversely, if it is of
interest to estimate the proportion of units in the target population with
indicator variables greater than or equal to y, the data file would be sorted
in descending order on this variable. The distribution function generated
by the analysis in descending order is [1 - Fa(y)].
b. First, compute \hat{N}_a = \sum_{S_a} w_{2i} (this sums over the data matrix).
c. Define a matrix of q column vectors, which will be defined as the following.
There is one row for each data record and five statistics for each row.
q1 = value of y variable for the record
q2 = F̂a(y)
q3 = var[F̂a(y)]
q4 = upper confidence bound for F̂a(y)
q5 = lower confidence bound for F̂a(y)
d. Index rows using i from 1 to n; the ith row will contain q-values
corresponding to the ith record in the file, as analyzed.
e. Read first observation (first row of data matrix), following with the
successive observations, one at a time. Accumulate and store the q
-statistics, below, as each observation is read into file. Continue this loop
until the end of file is reached.
i. q1[i] = y[i]
ii. q2[i] = q2[i-1] + w_{2i}/\hat{N}_a
Multiple observations with one y-value create multiple records in the
preceding analysis for one distinct value of y. The last record for that y
-------
(Case 8)
contains all the information needed for F̂a(y). Therefore, at this stage of
the analysis, eliminate from the q-file all but the last record for those y
values that have multiple records.
f. Entries in the first column (q1) of the q-matrix identify the vector of y
-values for the remainder of the calculations. For each such y-value, yi,
make the following calculations. Note that this part of the algorithm is not
recursive; each calculation is made over the entire sample.

iii. q_3[i] = \frac{1}{\hat{N}_a^2} \Big[ \sum_j d_j^2\, w_{2j}(w_{2j} - 1) + \sum_j \sum_{k \ne j} d_j d_k\, (w_{2j} w_{2k} - w_{2jk}) \Big]

where,

w_{2jk} = \frac{2 n_2 w_{2j} w_{2k} - w_{2j} - w_{2k}}{2(n_2 - 1)}

and,

d_j = I(y_j \le y_i) - \hat{F}_a(y_i) .

Similarly for d_k.
iv. q4[i] = q2[i] + 1.645 \sqrt{q_3[i]}
v. q5[i] = q2[i] - 1.645 \sqrt{q_3[i]}
Output of interest
From the q column vectors:
i. qj represents the ordered vector of distinct values of y.
ii. q2 represents the estimated distribution function, F̂a(y),
corresponding to the values of y.
iii. q4 represents the 95% one-sided upper confidence
bound of the distribution function, F̂a(y).
iv. q5 represents the 95% one-sided lower confidence
bound of the distribution function, F̂a(y).
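A compact Python sketch of the ratio calculation for a single y-value follows; it assumes the pairwise-weight form in which neither conditional weight equals 1, and all data values are hypothetical.

import numpy as np

y  = np.array([1.2, 2.0, 2.7, 3.5])   # indicator values (hypothetical)
w2 = np.array([4.0, 5.0, 4.0, 6.0])   # overall Tier 2 weights w2i
n2 = len(y)
Na_hat = w2.sum()                      # step b

def pair_w(wj, wk):
    return (2 * n2 * wj * wk - wj - wk) / (2 * (n2 - 1))

y0 = 2.0                               # evaluate Fhat_a at this y-value
ind = (y <= y0).astype(float)
F_hat = np.sum(w2 * ind) / Na_hat      # q2
d = ind - F_hat                        # residuals d_k
var = np.sum(d**2 * w2 * (w2 - 1))
var += sum(d[j] * d[k] * (w2[j] * w2[k] - pair_w(w2[j], w2[k]))
           for j in range(n2) for k in range(n2) if j != k)
var /= Na_hat**2                       # q3
print(F_hat, F_hat - 1.645 * var**0.5, F_hat + 1.645 * var**0.5)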
-------
(Case 9)
Case 9 - Estimation of Fa(y): Discrete Resource, Variable Probabilities,
Na known and equal to N̂a.
Confidence Bounds by Horvitz-Thompson Standard Error
and Normal Approximation.
Conditions for approach
1. The frame population size, N, can be known or unknown.
2. The subpopulation size, Na, is known and equal to N̂a.
3. There is a variable probability of selection of units from the subpopulation.
Outline for Algorithm
The algorithm supplied in this section is based on the Horvitz-Thompson formulae,
which were discussed in Section 2. This algorithm is appropriate for a sample subset for any
subpopulation a that is of interest.
Calculation of confidence bounds on Fa(y) by the Horvitz-Thompson formulae
For each indicator, the following algorithm derives the distribution function and the
confidence bounds for Na(y) exactly as given in Example 4. Because Na is known and equal
to N̂a, it is not necessary to use the ratio estimator applied in Case 8. The distribution
function Fa(y) is obtained by dividing the distribution function, N̂a(y), and the associated
bounds, by Na. (These additional steps are included in this algorithm.) The Horvitz-
Thompson variance estimator, discussed in Section 2.1, is used to compute the variance in
this algorithm. The confidence bounds are computed based on a normal approximation.
1. Data set
a. Unit identification code
b. Tier 1 weighting factor, w1i
c. Tier 2 conditional weighting factor, w2.1i
d. Indicator of interest (y)
-------
(Case 9)
2. Sorting of data
The data file needs to be sorted on the indicator, either in an ascending or
descending order. When the data file is sorted in ascending order on the indicator,
the distribution function of proportions, Fa(y), denotes the proportion of units in
the target population that have a value less than or equal to the y for a specific
indicator. Conversely, if it is of interest to estimate the proportion of units in the
target population with indicator variables greater than or equal to y, the data file
would be sorted in descending order on this variable. The distribution function
generated by the analysis in descending order is [1 - Fa(y)].
3. Computation of weighting factors
For this step, refer to the program steps given in Example 4 to derive the
distribution function and the confidence bound for Na(y). Follow the steps labeled
3 and 4. Additional steps, shown here, are needed to obtain Fa(y) and its
corresponding confidence bounds. Proceed with the following steps after conducting
steps 3 and 4 from Example 4:
e. The operations that follow generate the q vectors to compute the estimated
distribution function and appropriate confidence bounds for Fa(y). These
are denoted by q6 through q8. Each element of q6 to q8 is computed by
performing the following operations on the corresponding elements of q2, q4,
and q5.
i. q6 = Divide each element of q2 by the known subpopulation size
ii. q7 = Divide each element of q4 by the known subpopulation size
iii. q8 = Divide each element of q5 by the known subpopulation size
From the q vectors:
i. q6 represents the estimated distribution function, F̂a(y)
ii. q7 represents the 95% one-sided upper confidence
bound of the distribution function, F̂a(y).
iii. q8 represents the 95% one-sided lower confidence
bound of the distribution function, F̂a(y).
-------
3.1.3 Rationales for Approaches
Justification for the variance estimators used in the algorithms in Sections 3.1.1 and
3.1.2 was presented in Section 2 of this report. The different choices proposed for confidence
bound estimation, under some conditions, were also discussed. For example, both the
hypergeometric and binomial approaches to the confidence bound calculation for Fa(y),
when Na is known, were provided in the above cases. Choice of one of the approaches
presented to compute confidence bounds for Fa(y), when the subpopulation size is known,
depends in part on the available information and in part on the purpose of inference. The
bounds based on the hypergeometric distribution provide for inferences directed to the finite
population. For example, if data are available for every lake in a small population of lakes,
there is no uncertainty relative to this attribute for this population (in the absence of
measurement error). Bounds based on the hypergeometric distribution or on the normal
approximation approach will reduce to zero width as n approaches N, because of the finite
population correction.
These bounds are more relevant for management purposes. In contrast, those bounds based
on the binomial distribution provide for inferences directed to the superpopulation
parameter. In this situation, the entire population is considered as a sample from the
superpopulation. Statements about the set of high mountain lakes in New England are
finite, but general statements about high mountain lakes, based on those found in New
England, are relative to a hypothetical, infinite, superpopulation. Therefore, the confidence
bounds obtained by the binomial distribution are broader than those provided by the
hypergeometric distribution to account for this additional level of variability.
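The contrast can be seen numerically; in the hypothetical Python example below, with a large sampling fraction, the hypergeometric upper bound on the proportion is tighter than the binomial one.

from scipy.stats import binom, hypergeom

Na, na, na_y = 60, 25, 10   # hypothetical: large sampling fraction
hyp_upper = max(K for K in range(Na + 1)
                if hypergeom.cdf(na_y, Na, K, na) > 0.05) / Na
bin_upper = max(p / 1000.0 for p in range(1001)
                if binom.cdf(na_y, na, p / 1000.0) > 0.05)
print(hyp_upper, bin_upper)   # the finite-population bound is the tighter one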
-------
3.1.4 Estimation of Size-Weighted Statistics
A few algorithms are presented to compute the distribution functions for size-
weighted totals and size-weighted proportions of totals. The following subsection describes
algorithms to compute the distribution function for size-weighted totals. The next
subsection presents algorithms to compute the distribution function for the proportions of
size-weighted totals.
Estimation of Size-Weighted Totals
In this subsection, two examples are provided based on information that is known or
unknown. For the first algorithm, the size-weighted total, Za, is unknown, or known and
equal to its estimate Ẑa; this algorithm produces confidence bounds based on the Horvitz-
Thompson standard error and the normal approximation. For the second algorithm, Za is
known but not equal to Ẑa; this algorithm produces confidence bounds based on the
Horvitz-Thompson ratio standard error and the normal approximation.
-------
(Case 10)
Case 10 - Estimation of Za(y): Discrete Resource, Size-Weighted Estimate,
Equal or Variable Probabilities. Za unknown, or known and
equal to Ẑa.
Confidence Bounds by Horvitz-Thompson Standard Error
and Normal Approximation.
Conditions for approach
1. The frame population size-weighted total, Z, can be known or unknown.
2. The subpopulation size-weighted total, Za, is unknown, or known and
equal to Ẑa.
3. There can be an equal or variable probability selection of units from
the subpopulation.
Outline for Algorithm
General formulae for Tier 1 estimates were provided in Section 2.1.1. The general
form of a size-weighted estimate in a subpopulation at Tier 1, denoted as Ẑa, is similar to
Equation 2. The yi in that equation refers to the size-weight value, now denoted as zi:

\hat{Z}_a = \sum_{S_a} w_i\, z_i ,

where zi defines a size-weight, such as the area of a lake or the stream length in miles, and
w is the inverse of the inclusion probability at Tier 1. Using these same definitions, the
variance estimator for Ẑa is similar to Equation 3a.
Estimation of Za(y) by the Horvitz-Thompson formulae
For each indicator, the following algorithm derives the distribution function and the
confidence bound for Za(y). This algorithm is similar to the algorithm defined for the
National Surface Water Surveys (Overton, 1987a,b). The Horvitz-Thompson variance
estimator, discussed in Section 2.1, is used to compute the variance in this algorithm. The
-------
(Case 10)
confidence bounds are computed based on a normal approximation. This algorithm is
appropriate for a sample subset for any subpopulation a that is of interest.
1. Data set
a. Unit identification code
b. Tier 1 weighting factor, w1i
c. Tier 2 conditional weighting factor, w2.1i
d. Size-weighted value (z)
e. Indicator of interest (y)
2. Sorting of data
The data file needs to be sorted on the indicator, either in an ascending or
descending order. When the data file is sorted in ascending order on the indicator,
the distribution function of size-weighted totals, Za(y), denotes the size-weights in
the target population that have a value less than or equal to the y for a specific
indicator. Conversely, if it is of interest to estimate the size-weight in the target
population with indicator variables greater than or equal to y, the data file would
be sorted in descending order on this variable. The distribution function generated
by the analysis in descending order is [Za - Za(y)].
3. Computation of additional weighting factors
The Tier 1 and Tier 2 weights are included for each observation in the data
set. These weights are used to compute the total weight of selecting the ith unit in
the Tier 2 sample. First, compute this weight for each observation:

w_{2i} = w_{1i}\, w_{2.1i} ,

where w1i is the weighting factor for the ith unit in the Tier 1 sample and w2.1i is
the inverse of the conditional Tier 2 inclusion probability.
4. Algorithm for Za(y)
a. Define a matrix of q column vectors, which will be defined as the following.
There is one row for each data record and five statistics for each row.
q1 = value of y variable for the record
q2 = Ẑa(y)
q3 = var[Ẑa(y)]
q4 = upper confidence bound for Ẑa(y)
q5 = lower confidence bound for Ẑa(y)
b. Index rows using i from 1 to n; the ith row will contain q-values
corresponding to the ith record in the file, as analyzed.
c. Read first observation (first row of data matrix), following with the
successive observations, one at a time. Accumulate the q-statistics as each
observation is read into file. Continue this loop until the end of file is
reached. At that time, store these vectors and go to d. It is necessary, as
shown below for q3, to identify the records for which w2.1i = 1.
-------
(Case 10)
i. q1[i] = y[i]
ii. q2[i] = q2[i-1] + w_{2i}\, z_i
iii. q3[i] = q3[i-1] + z_i^2\, w_{2i}(w_{2i} - 1) + 2 \sum_{j<i} z_i z_j\, (w_{2i} w_{2j} - w_{2ij})
where, if neither w2.1i nor w2.1j = 1:

w_{2ij} = \frac{2 n_2 w_{2i} w_{2j} - w_{2i} - w_{2j}}{2(n_2 - 1)} ;

where, if either w2.1i or w2.1j = 1:

w_{2ij} = w_{1ij}\, w_{2.1i}\, w_{2.1j} ;

and where:

w_{1ij} = \frac{2 n_1 w_{1i} w_{1j} - w_{1i} - w_{1j}}{2(n_1 - 1)} .

iv. q4[i] = q2[i] + 1.645 \sqrt{q_3[i]}
v. q5[i] = q2[i] - 1.645 \sqrt{q_3[i]}
Multiple observations with one y value create multiple records in the
preceding analysis for one distinct value of y. The last record for that y
contains all the information needed for Za(y). Therefore, at this stage of
the analysis, eliminate all but the last record for those y values that have
multiple records.
d. Output of interest
From the last entry of the row of q-vectors just computed:
i. q1 = largest value of y (or smallest if analysis is descending).
ii. q2 = Ẑa
iii. q3 = var(Ẑa)
iv. Standard error of Ẑa = \sqrt{q_3}
From the q column vectors:
i. q1 represents the ordered vector of distinct values of y
ii. q2 represents the estimated distribution function, Ẑa(y),
corresponding to the values of y.
iii. q4 represents the 95% one-sided upper confidence
bound of the distribution function, Ẑa(y).
iv. q5 represents the 95% one-sided lower confidence
bound of the distribution function, Ẑa(y).
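The size-weighted recursion differs from Case 4 only in carrying zi through each term; the following hypothetical Python sketch illustrates steps c.i through c.v under the pairwise-weight case in which neither conditional weight equals 1.

import numpy as np

y  = np.array([1.2, 2.0, 2.7, 3.5])   # indicator values, ascending
z  = np.array([10.0, 4.0, 7.0, 2.0])  # size-weights (e.g., lake surface area)
w2 = np.array([4.0, 5.0, 4.0, 6.0])   # overall Tier 2 weights w2i
n2 = len(y)

def pair_w(wi, wj):
    return (2 * n2 * wi * wj - wi - wj) / (2 * (n2 - 1))

Z_y, var_y = 0.0, 0.0
for i in range(n2):
    Z_y += w2[i] * z[i]                              # q2 accumulation
    var_y += z[i]**2 * w2[i] * (w2[i] - 1)           # q3, single-sum term
    var_y += 2 * sum(z[i] * z[j] * (w2[i] * w2[j] - pair_w(w2[i], w2[j]))
                     for j in range(i))
    se = var_y ** 0.5
    print(y[i], Z_y, Z_y - 1.645 * se, Z_y + 1.645 * se)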
-78"
-------
(Case 11)
Case 11 - Estimation of Za(y): Discrete Resource, Size-Weighted Estimate,
Equal or Variable Probabilities. Za known and not
equal to Ẑa.
Confidence Bounds by Horvitz-Thompson Ratio Standard Error
and Normal Approximation.
Conditions for approach
1. The frame population size-weighted total, Z, can be known or unknown.
2. The subpopulation size-weighted total, Za, is known and not equal
to Ẑa.
3. There can be an equal or variable probability selection of units from the
subpopulation.
Outline for Algorithm
The algorithm supplied in this section is based on the Horvitz-Thompson formulae,
which were discussed in Section 2. This algorithm is appropriate for a sample subset for any
subpopulation a that is of interest. The algorithm for the distribution function for the
proportion of size-weighted totals, Ga(y), given exactly the same conditions listed here, is
presented in Case 12. To compute the distribution function of size-weighted totals, Za(y),
first use the algorithm in Case 12 to compute the distribution function with the
corresponding confidence bounds for the proportion of size-weighted totals. Then, compute
the following:

\hat{Z}_a(y) = \hat{G}_a(y) \cdot Z_a ,   (54)

where Za is the known size-weighted total. To compute the confidence bounds for Ẑa(y),
simply multiply the upper and lower confidence limits of Ĝa(y) by Za.
-------
(Case 11)
Estimation of Proportion of Size-Weighted Totals
In this subsection, two examples are provided based on varying conditions. For the
first algorithm, the size-weighted total, Za, is unknown, or known and not equal to Ẑa; this
algorithm produces confidence bounds based on the Horvitz-Thompson ratio standard error
and the normal approximation. For the second algorithm, Za is known and equal to Ẑa;
this algorithm produces confidence bounds based on the Horvitz-Thompson standard error
and the normal approximation.
-------
(Case 12)
Case 12 - Estimation of Ga(y): Discrete Resource, Size-Weighted Estimate,
Equal or Variable Probabilities. Za unknown, or known and not
equal to Ẑa.
Confidence Bounds by Horvitz-Thompson Ratio Standard Error
and Normal Approximation.
Conditions for approach
1. The frame population size-weighted total, Z, can be known or unknown.
2. The subpopulation size-weighted total, Za, is unknown or known and not
equal to Ẑa.
3. There can be an equal or variable probability selection of units from the
subpopulation.
Outline for Algorithm
The algorithm supplied in this section is based on the Horvitz-Thompson formulae,
which were discussed in Section 2. This algorithm is appropriate for a sample subset for any
subpopulation a that is of interest. Another discussion of the formulae is presented in the
previous section, Estimation of Size-Weighted Totals.
Calculation of confidence bounds on Ga(y) by the Horvitz-Thompson formulae
For each indicator, the following algorithm derives the distribution function and the
confidence bound for Ga(y), similar to that given for Za(y) in Case 10. Because Za is
unknown or known and not equal to Ẑa in this example, the variance of a ratio estimator is
used in this algorithm. The confidence bounds are based on a normal approximation.
1. Data set
a. Unit identification code
b. Tier 1 weighting factor, w1i
c. Tier 2 conditional weighting factor, w2|i
d. Size-weighted value (z)
e. Indicator of interest (y)
f. The subset of data corresponding to the subpopulation of interest, indexed
by a.
2. Computation of additional weighting factors
This step does not have to be repeated with each use of the data, as the
weights are permanent attributes of a sampling unit. The following details are
given for completeness.
The Tier 1 and Tier 2 weights are included for each observation in the data
set. These weights are used to compute the total weight of selecting the ith unit in
the Tier 2 sample. First, compute this weight for each observation:
     w2i = w1i * w2|i ,
where w1i is the weighting factor for the ith unit in the Tier 1 sample and w2|i is
the inverse of the conditional Tier 2 inclusion probability. The pairwise inclusion
weight is defined below. The sample size at Tier 2, n2, is not subpopulation
specific.
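A minimal sketch of this step, assuming each record carries the Tier 1 weight and the conditional Tier 2 weight; the names are illustrative.

    def overall_weight(w1i, w2_cond_i):
        # w2i = w1i * w2|i: inverse of the overall Tier 2 inclusion
        # probability for unit i
        return w1i * w2_cond_i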
3. Algorithm for Ga(y) and Confidence Intervals
a. Sorting of data. The data file needs to be sorted on the indicator, either in
an ascending or descending order. When the data file is sorted in ascending
order on the indicator, the distribution function of size-weighted
proportions, Ga(y), denotes the proportion of size-weights in the target
population, such as stream miles, that have a value less than or equal to the
value y for a specific indicator. Conversely, if it is of interest to estimate the
proportion of size-weights in the target population with indicator variables
greater than or equal to y, the data file would be sorted in descending order
on this variable. The distribution function generated by the analysis in
descending order is [1 - Ga(y)].
b. Compute Ẑa = SUM(i in Sa) w2i * zi
c. Define a matrix of q column vectors, as follows. There is one row for
each data record and five statistics for each row.
q1 = value of y variable for the record
q2 = Ga(y)
q3 = var[Ga(y)]
q4 = upper confidence bound for Ga(y)
q5 = lower confidence bound for Ga(y)
d. Index rows using i from 1 to n; the ith row will contain q-values
corresponding to the ith record in the file, as analyzed.
e. Read the first observation (first row of the data matrix), followed by the
successive observations, one at a time. Accumulate and store the
q-statistics as each observation is read from the file. Continue this loop
until the end of the file is reached.
i.  q1[i] = y[i]
ii. q2[i] = q2[i-1] + w2i*zi/Ẑa
Multiple observations with one y-value create multiple records in the
preceding analysis for one distinct value of y. The last record for that y
contains all the information needed for Ga(y). Therefore, at this stage of
the analysis, eliminate from the q-file all but the last record for those y
values that have multiple records.
f. Entries in the first column (q1) of the q-matrix identify the vector of
y-values for the remainder of the calculations. For each such y-value, yi,
make the following calculations. Note that this part of the algorithm is not
recursive; each calculation is made over the entire sample.
iii. q3[i] = SUM(j) dj^2*w2j*(w2j - 1) + 2 * SUM(k<j) dj*dk*(w2j*w2k - w2jk)
     where,
          w2|jk = [2*n2*w2|j*w2|k - w2|j - w2|k] / [2*(n2 - 1)]
     and w2jk = w1jk * w2|jk, with w1jk as in Case 10, and,
          dj = zj*[uj - q2[i]] / Ẑa ,  uj = 1 if yj <= yi and uj = 0 otherwise.
     Similarly for dk.
iv.  q4[i] = q2[i] + 1.645*sqrt(q3[i])
v.   q5[i] = q2[i] - 1.645*sqrt(q3[i])
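A minimal sketch of step f.iii under the reconstruction above, reusing the Overton pairwise-weight approximation sketched in Case 10. Here sample holds (y, z, w1i, w2|i) tuples for the subpopulation, and all names are illustrative.

    def pairwise_weight(wa, wb, n):
        # joint weight is the product when either unit is certain (weight 1)
        if wa == 1 or wb == 1:
            return wa * wb
        return (2 * n * wa * wb - wa - wb) / (2 * (n - 1))

    def ratio_var_ga(sample, yi, ga_yi, za_hat, n1, n2):
        # residuals dj = zj*(uj - Ga(yi))/Za_hat, uj = 1 if yj <= yi else 0
        d = [z * ((1.0 if y <= yi else 0.0) - ga_yi) / za_hat
             for y, z, _, _ in sample]
        w2 = [w1 * w2c for _, _, w1, w2c in sample]
        var = sum(dj * dj * w2j * (w2j - 1) for dj, w2j in zip(d, w2))
        for j in range(len(sample)):
            for k in range(j):
                w2jk = (pairwise_weight(sample[j][2], sample[k][2], n1)
                        * pairwise_weight(sample[j][3], sample[k][3], n2))
                var += 2 * d[j] * d[k] * (w2[j] * w2[k] - w2jk)
        return var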
g. Output of interest
From the q column vectors:
i.   q1 represents the ordered vector of distinct values of y.
ii.  q2 represents the estimated distribution function, Ga(y),
     corresponding to the values of y.
iii. q4 represents the 95% one-sided upper confidence
     bound of the distribution function, Ga(y).
iv.  q5 represents the 95% one-sided lower confidence
     bound of the distribution function, Ga(y).
-------
Case 13 - Estimation of Ga(y): Discrete Resource, Size-Weighted Estimate,
Equal or Variable Probabilities. Za known and equal to Ẑa.
Confidence Bounds by Horvitz-Thompson Standard Error
and Normal Approximation.
Conditions for approach
1. The frame population size-weighted total, Z, can be known or unknown.
2. The subpopulation size-weighted total, Za, is known and equal to Ẑa.
3. There can be an equal or variable probability selection of units from the
subpopulation.
Outline for Algorithm
The algorithm supplied in this section is based on the Horvitz-Thompson formulae,
which were discussed in Section 2. This algorithm is appropriate for a sample subset for any
subpopulation a that is of interest.
Calculation of confidence bounds on Ga(y) by the Horvitz-Thompson formulae
For each indicator, the following algorithm derives the distribution function and the
confidence bound for Za(y) exactly as given in Case 10. Because Za is known and equal to
Ẑa, it is not necessary to use the ratio estimator. The distribution function Ga(y) is
obtained by dividing the distribution function, Za(y), and the associated confidence bounds
by Za. (These additional steps are included in this algorithm.) The Horvitz-Thompson
variance estimator, discussed in Section 2.1, is used to compute the variance in this
algorithm. The confidence bounds are computed based on a normal approximation.
1. Data set
a. Unit identification code
b. Tier 1 weighting factor, w1i
c. Tier 2 conditional weighting factor, w2|i
d. Size-weighted value (z)
e. Indicator of interest (y)
2. Sorting of data
The data file needs to be sorted on the indicator, either in an ascending or
descending order. When the data file is sorted in ascending order on the indicator,
the distribution function of size-weighted proportions, Ga(y), denotes the proportion
of size-weights in the target population, such as lake area, that have a value less
than or equal to the value y for a specific indicator. Conversely, if it is of interest to
estimate the proportion of size-weights in the target population with indicator
variables greater than or equal to y, the data file would be sorted in descending
order on this variable. The distribution function generated by the analysis in
descending order is [1 - Ga(y)].
3. Computation of weighting factors
For this step, refer to the program steps given in Case 10 to derive the
distribution function and the confidence bound for Za(y). Follow the steps labeled 3
and 4. Additional steps, shown here, are needed to obtain Ga(y) and its
corresponding confidence bounds. Proceed with the following steps after conducting
steps 3 and 4 from Case 10:
e. The operations that follow generate the q vectors to compute the estimated
distribution function and appropriate confidence bounds for Ga(y). These
are denoted by q6 through q8. Each element of q6-q8 is computed by
performing the following operations on the corresponding elements of q2, q4,
and q5, as sketched below.
i.   q6 = Divide each element of q2 by the known size-weighted total, Za
ii.  q7 = Divide each element of q4 by the known size-weighted total, Za
iii. q8 = Divide each element of q5 by the known size-weighted total, Za
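A minimal sketch of step e, assuming q_rows holds (y, q2, q3, q4, q5) rows from the Case 10 algorithm; the names are illustrative.

    def case13_ga(q_rows, za_known):
        # q6-q8: divide q2, q4, and q5 by the known size-weighted total Za
        return [(y, q2 / za_known, q4 / za_known, q5 / za_known)
                for y, q2, _, q4, q5 in q_rows]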
From the q vectors:
i.   q6 represents the estimated distribution function, Ga(y)
ii.  q7 represents the 95% one-sided upper confidence
     bound of the distribution function, Ga(y)
iii. q8 represents the 95% one-sided lower confidence
     bound of the distribution function, Ga(y)
"85,
-------
3.2 Extensive Resources
A detailed discussion of the formulae for obtaining area and proportion of areal
extent for continuous and extensive resources was presented in Section 2.1.2. Formulae were
presented for both areal and point samples.
3.2.1 Estimation of Proportion of Areal Extent
As discussed in Section 2.1.2, the confidence bounds for the proportion of areal extent
in continuous and extensive resources can be based on the binomial distribution. This
algorithm was presented in Section 3.1.2, Case 6, for discrete resources. No changes in this
algorithm are needed.
3.2.2 Estimation of Area
Formulae for the estimation of total areal extent of the surveyed resources were
proposed in Section 2.1.2. Proposed methods to compute areal extent for point and areal
samples are discussed in the following subsections.
Point Samples
Formulae for the estimation of areal extent based on point samples were presented in
Section 2.1.2.2. To obtain confidence bounds for Aa(y) based on the binomial distribution,
refer to the algorithm provided for the confidence bound calculation for Fa(y) in Section
3.1.2, Case 6. Simply multiply the lower and upper confidence bounds, and Fa(y), by the
known area or estimated area of the resource. No further changes are necessary to this
algorithm to provide confidence bounds for Aa(y) based on the binomial distribution.
-------
Areal Samples
Formulae for the estimation of areal extent based on areal samples are still under
development. However, some preliminary formulae are proposed in Section 2.1.2.1. Work
in this area is continuing and will be included in the next version of this report.
3.3 Estimation of Quantiles
Overton (1987a) defines the calculations for both the ascending and descending sorted
indicators. For the algorithm used in this report, it is not necessary to employ a different
definition of percentiles for an ascending or descending analysis; the distributions are
identical as generated either way. The general algorithm computes the linear interpolation
of the distribution function for both types of analyses. In the following equation, let r
represent the proportion of the desired percentile. The fraction in this equation can be
interpreted as the slope of the line. The coefficient of this fraction interpolates to the
value [Q(r) - a]. The lower bound, a, is added to this piece, [Q(r) - a], to obtain the
quantile of interest. Assuming an ascending analysis and that the generated distribution
function is F(y):
     Q(r) = a + [r - F(a)] * { (b - a) / [F(b) - F(a)] } ,                 (55)
where F(a) is the greatest value of F(y) < r and F(b) is the least value of F(y) > r.
For a descending analysis, the distribution function generated was F*(y) = [1 - F(y)].
To obtain the percentiles, calculate F(y) = 1 - F*(y); the foregoing formula is then
appropriate for the analysis.
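A minimal sketch of equation (55), assuming the distribution function is supplied as parallel ascending vectors ys (distinct y values) and Fs (estimated F at each y), with Fs[0] <= r < Fs[-1]. For convenience the sketch folds equality into F(a), so an r exactly equal to an estimated F(y) returns that y; names are illustrative.

    def quantile(ys, Fs, r):
        # a: greatest y with F(y) <= r; b: least y with F(y) > r
        lo = max(i for i, f in enumerate(Fs) if f <= r)
        hi = min(i for i, f in enumerate(Fs) if f > r)
        a, b = ys[lo], ys[hi]
        return a + (r - Fs[lo]) * (b - a) / (Fs[hi] - Fs[lo])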
-------
SECTION 4
TABLES
Table 1. Reference to Distribution Function Algorithms

A. Distribution Functions for Numbers - Estimation of Na(y)

Equal Probability Selection:
Population Size    Subpopulation Size    Algorithm          Case
Known/unknown      Known                 Hypergeometric1    1
Known/unknown      Unknown               HT-NA2             3
Known              Known                 Hypergeometric     2
Known              Unknown               HT-NA              3

Variable Probability Selection:
Population Size    Subpopulation Size    Algorithm          Case
Known/unknown      Unknown or            HT-NA              4
                   known and = N̂a
Known/unknown      Known and ≠ N̂a        HTR-NA3            5
1 Hypergeometric refers to the exact hypergeometric distribution algorithm
used to obtain confidence bounds.
2 HT-NA refers to Horvitz-Thompson variance with normal approximation
to obtain confidence bounds.
3 HTR-NA refers to Horvitz-Thompson ratio estimator of variance with
normal approximation to obtain confidence bounds.
4 Binomial refers to the exact binomial distribution algorithm
used to obtain confidence bounds.
-------
Table 1 Continued.
B. Distribution Functions for Proportions of Numbers - Estimation of Fa(y)

Equal Probability Selection:
Population Size    Subpopulation Size    Algorithm          Case
Known/unknown      Known/unknown         Binomial4          6
Known/unknown      Known                 Hypergeometric     7

Variable Probability Selection:
Population Size    Subpopulation Size    Algorithm          Case
Known/unknown      Unknown or            HTR-NA             8
                   known and ≠ N̂a
Known/unknown      Known and = N̂a        HT-NA              9
1 Hypergeometric refers to the exact hypergeometric distribution algorithm
used to obtain confidence bounds.
2 HT-NA refers to Horvitz-Thompson variance with normal approximation
to obtain confidence bounds.
3 HTR-NA refers to Horvitz-Thompson ratio estimator of variance with
normal approximation to obtain confidence bounds.
4 Binomial refers to the exact binomial distribution algorithm
used to obtain confidence bounds.
-------
Table 1 Continued.
C. Distribution Functions for Size-Weighted Statistics for Both Equal and
Variable Probability Selection

Estimation of Za(y):
Population Size    Subpopulation Size    Algorithm    Case
Known/unknown      Unknown or            HT-NA        10
                   known and = Ẑa
Known/unknown      Known and ≠ Ẑa        HTR-NA       11

Estimation of Ga(y):
Population Size    Subpopulation Size    Algorithm    Case
Known/unknown      Unknown or            HTR-NA       12
                   known and ≠ Ẑa
Known/unknown      Known and = Ẑa        HT-NA        13
1 Hypergeometric refers to the exact hypergeometric distribution algorithm
used to obtain confidence bounds.
2 HT-NA refers to Horvitz-Thompson variance with normal approximation
to obtain confidence bounds.
3 HTR-NA refers to Horvitz-Thompson ratio estimator of variance with
normal approximation to obtain confidence bounds.
4 Binomial refers to the exact binomial distribution algorithm
used to obtain confidence bounds.
-------
Table 2. Summary of Notation Used in Formulae and Algorithms

Symbol      Definition

Populations:
N           Population size
Na          Subpopulation size

Distribution Functions:
Discrete Resources:
N(y)        Estimated distribution function for total number
F(y)        Estimated distribution function for proportion of numbers
Z(y)        Estimated distribution function of size-weighted totals
G(y)        Estimated distribution function for a size-weighted proportion

Continuous and Extensive Resources:
A(y)        Estimated distribution function for areal extent
F(y)        Estimated distribution function for proportion of areal extent

Inclusion Probabilities:
πi          Probability of inclusion of unit i
πij         Probability that units i and j are simultaneously included
π1i         Probability of inclusion of unit i at Tier 1
π2i         Probability of inclusion of unit i at Tier 2
π2|i        Conditional Tier 2 inclusion probability

Weights:
w           Inverse of the above inclusion probabilities
            (Same definitions apply with corresponding subscripts)

Sample Notation:
n           General notation for sample size
n1          Sample size at Tier 1
n2          Sample size at Tier 2
S1          Sample of units at Tier 1
S2          Sample of units at Tier 2

(These may be made specific for subpopulations or resources by appending
an a or r. For example:)
na          Sample size for subpopulation a
nri         Sample size for a resource r at grid point i
S1r         Sample of units at Tier 1 for resource r
S2r         Sample of units at Tier 2 for resource r
-------
SECTION 5
REFERENCES
Chambers, R.L., and R. Dunstan. 1986. Estimating distribution functions from survey
data. Biometrika, 73, 597-604.
Cochran, W.G. 1977. Sampling Techniques, Third Edition. Wiley, New York.
Cordy, C.B. In press. An extension of the Horvitz-Thompson theorem to point sampling
from a continuous universe. Statistics & Probability Letters.
Cox, D.R., and D. Oakes. 1984. Analysis of Survival Data. Chapman and Hall, New
York.
Hansen, M.H., W.G. Madow, and B.J. Tepping. 1983. An evaluation of model-dependent
and probability-sampling inferences in sample surveys. J. Amer. Stat. Assoc. 78:
776-793.
Hartley, H.O., and J.N.K. Rao. 1962. Sampling with unequal probability and without
replacement. The Annals of Mathematical Statistics, 33, 350-374.
Horvitz, D.G., and D.J. Thompson. 1952. A generalization of sampling without
replacement from a finite universe. J. Amer. Stat. Assoc. 47: 663-685.
Hunsaker, C.T., and D.E. Carpenter, eds. 1990. Environmental Monitoring and
Assessment Program: Ecological Indicators. EPA/600/3-90/060. U.S.EPA, Office of
Research and Development, Washington, DC.
-------
Kaufman, P.R., A.T. Herlihy, J.W. Elwood, M.E. Mitch, W.S. Overton, M.J. Sale, K.A.
Cougan, D.V. Peck, K.H. Reckhow, A.J. Kinney, S.J. Christie, D.D. Brown, C.A. Hagley
and H.I. Jager. 1988. Chemical Characteristics of Streams in the Mid-Atlantic and
Southeastern United States. Volume I: Population Descriptions and Physico-Chemical
Relationships. EPA/600/3-88/021a. U.S. EPA, Washington, DC.
Landers, D.H., J.M. Eilers, D.F. Brakke, W.S. Overton, P.E. Kellar, M.E. Silverstein, R.D.
Schonbrod, R.E. Crowe, R.A. Linthurst, J.M. Omernik, S.A. Teague, and E.P. Meier.
1987. Characteristics of Lakes in the Western United States. Volume I: Population
Descriptions and Physico-Chemical Relationships. EPA-600/3-86/054a. U.S. EPA,
Washington, DC.
Linthurst, R.A., D.H. Landers, J.M. Eilers, D.F. Brakke, W.S. Overton, E.P. Meier, and
R.E. Crowe. 1986. Characteristics of Lakes in the Eastern United States, Volume I:
Population Descriptions and Physico-Chemical Relationships. EPA-600/4-86/007a.
U.S. EPA, Washington, DC.
Little, R.J.A., and D.B. Rubin. 1987. Statistical Analysis with Missing Data. Wiley,
New York.
Messer, J.J., R.A. Linthurst, and W.S. Overton. 1991. An EPA Program for Monitoring
Ecological Status and Trends. Environ. Monit. and Assess. 17, 67-78.
Messer, J.J., C.W. Ariss, R. Baker, S.K. Drouse, K.N. Eshelman, P.R. Kaufmann, R.A.
Linthurst, J.M. Omernik, W.S. Overton, M.J. Sale, R.D. Schonbrod, S.M. Stambaugh,
and J.R. Tutshall, Jr. 1986. National Surface Water Survey: National Stream Survey,
Phase 1 Pilot Survey. EPA/600/4-86/026. U.S. EPA, Washington, DC.
-------
Miller, R.G. 1981. Survival Analysis. Wiley, New York.
Overton, W.S. 1987a. Phase II Analysis Plan, National Lake Survey, Working
Draft. Technical Report 115, Department of Statistics, Oregon State University.
Overton, W.S. 1987b. A Sampling and Analysis Plan for Streams in the National
Surface Water Survey. Technical Report 117, Department of Statistics, Oregon State
University.
Overton, W.S. 1989a. Calibration Methodology for the Double Sample Structure of the
National Lake Survey Phase II Sample. Technical Report 130, Department of Statistics,
Oregon State University.
Overton, W.S. 1989b. Effects of Measurements and Other Extraneous Errors on Estimated
Distribution Functions in the National Surface Water Surveys. Technical Report 129,
Department of Statistics, Oregon State University.
Overton, W.S., and S.V. Stehman. 1987. An Empirical Investigation of Sampling and
Other Errors in National Stream Survey: Analysis of a Replicated Sample of Streams.
Technical Report 119, Department of Statistics, Oregon State University.
Overton, W.S., and S.V. Stehman. 1992. The Horvitz-Thompson theorem as a unifying
perspective for sampling. Proceedings of the Section on Statistical Education of the
American Statistical Association, pp. 182-187.
-------
Overton, W.S., and S.V. Stehman. 1993a. Properties of designs for sampling
continuous spatial resources from a triangular grid. Communications in Statistics -
Theory and Methods, 22, 2641-2660.
Overton, W.S., and S.V. Stehman. 1993b. Improvement of Performance of Variable
Probability Sampling Strategies Through Application of the Population Space and the
Facsimile Population Bootstrap. Technical Report 148, Department of Statistics, Oregon
State University.
Overton, W.S., D. White, and D.L. Stevens. 1990. Design Report for EMAP,
Environmental Monitoring and Assessment Program. EPA/600/3-91/053.
U.S. EPA, Washington, DC.
Overton, J.M., T.C. Young, and W.S. Overton. 1993. Using found data to augment a
probability sample: procedure and case study. Environ. Monitoring and
Assessment, 26, 65-83.
Porter, P.S., R.C. Ward, and H.F. Bell. 1988. The detection limit. Environ. Sci.
Technol., 22, 856-861.
Rao, J.N.K., J.G. Kovar, and H.J. Mantel. 1990. On estimating distribution functions and
quantiles from survey data using auxiliary information. Biometrika, 77, 365-375.
Sarndal, C.E., B. Swensson, and J. Wretman. 1992. Model Assisted Survey Sampling.
Springer-Verlag, New York.
-------
Smith, T.M.F. 1976. The foundations of survey sampling: a review. J. Roy. Stat.
Soc., A, 139, Part 2, 183-195.
Stehman, S.V., and W.S. Overton. 1989. Pairwise Inclusion Probability Formulas in
Random-order, Variable Probability, Systematic Sampling. Technical Report 131,
Department of Statistics, Oregon State University.
Stehman, S.V., and W.S. Overton. In press. Comparison of Variance Estimators of
the Horvitz-Thompson Estimator for Randomized Variable Probability Systematic
Sampling. J. Amer. Stat. Assoc.
Stevens, D.L. In press. Implementation of a National Monitoring Program. J. Environ.
Management.
Thomas, Dave. Oregon State University, Statistics Department, Corvallis, OR.
Wolter, K.M. 1985. Introduction to Variance Estimation. Springer-Verlag, New York.
-------
SECTION 6
GLOSSARY OF COMMONLY USED TERMS
Continuous attribute: an attribute that is represented as a continuous surface over some
region. Examples are certain attributes of large bodies of water, such as chemical variables
of estuaries or lakes.
Discrete resource: resources consisting of discrete resource units, such as lakes or stream
reaches. Such a resource will be described as a finite population of such units.
Distribution function: a mathematical expression describing a random variable or a
population. For real-world finite populations, these distributions are knowable attributes
(parameters) of the population, and may be determined exactly by a census, or estimated
from a sample. The general form will be the proportion (or other measure, like numbers,
length, or area) of the resource having a value of an attribute equal to or less than a
particular value. Proportions may also be of the different possible measures, like number
(frequency distributions), area (areal distributions), length, or volume.
Domain: a frame feature that includes the entire area within which a potential sample might
encounter the resource. The domain of any one resource can include other resources.
Extensive resource: resources without natural units. Examples of extensive resources are
grasslands or marshes.
40-hex: a term for the landscape description hexagon or areal sampling unit centered on
each of the grid points in the EMAP sampling grid. The area of each hexagon is
approximately 40 km².
Inclusion probability (πi): the probability of including the ith sampling unit within a
sample.
Pairwise inclusion probability (πij): the probability that both element i and element j are
included in the sample.
-------
Population: often used interchangeably with the term universe to designate the total set of
entities addressed in a sampling effort. The term population is defined in this report to
designate any collection of units of a specific discrete resource, or any subset of a specific
extensive resource, about which inferences are desired or made.
Randomized model: a model invoked in analysis, assuming the population units have been
randomly arranged prior to sample selection. In many cases, this is equivalent to assuming
simple random sampling.
Resource: an ecological entity that is identified as a target of sampling, description, and
analysis by EMAP. Such an entity will ordinarily be thought of and described as a
population. Two resource types, discrete and extensive, recognized in EMAP pose different
problems of sampling and representation. EMAP resources are ordinarily treated as strata
at Tier 2.
Resource class: a subset of a resource, represented as a subpopulation. For example, two
classes of substrate, sand and mud, can be defined in the Chesapeake Bay. Subpopulation
estimates require only that the classification be known on the sample.
Stratum: a stratum is a sampling structure that restricts sample-randomization/selection to
a subset of the frame. Samples from different strata are independent. Inclusion
probabilities may or may not differ among strata.
Tier 1/Tier 2: these terms represent different phases of the EMAP program. Relative to the
EMAP sample, they refer to the two phases (stages) of the EMAP double sample. The Tier
1 sample is common to all resources and provides for each a sample from which the Tier 2
sample is selected. The Tier 2 sample for any resource is a set of resource units or sites at
which field data will be obtained.
Weights: in a probability sample, the sample weights are inverses of the inclusion
probabilities; these are always known for a probability sample.
-------