Estimation Method 4: Estimation of the Size-Weighted Cumulative Distribution Function for Total of a Discrete Resource; Horvitz-Thompson Estimator, Normal Approximation

1 Scope and Application

This method calculates the estimate of the size-weighted cumulative distribution function (CDF) for
the total of a discrete resource that has an indicator value equal to or less than a given indicator
level. The size-weight value is a measurement of the discrete resource such as area of a lake. The
method applies to any probability sample and presents two estimators. An estimate can be
produced for the entire population or for an arbitrary subpopulation with known or unknown size,
where this size is the size-weighted total in the subpopulation. Suggestions for estimating the CDF
over the range of the indicator are included. Alternatively, the CDF can be calculated at the
indicator levels found in the probability sample. The method uses the Normal approximation to
provide confidence bounds or intervals for the true cumulative distribution function. This method
does not include variance estimators for the estimated CDF. For information on appropriate
variance estimators, refer to Section 7.

This method has been applied in:

The 1991 Surface Waters Pilot Report

2 Statistical Estimation Overview

A sample of $n_a$ units is selected from subpopulation $a$ with known inclusion probabilities
$\pi = \{\pi_1, \ldots, \pi_i, \ldots, \pi_{n_a}\}$ and size-weight values
$w = \{w_1, \ldots, w_i, \ldots, w_{n_a}\}$. The indicator is evaluated for each unit and
represented by $y = \{y_1, \ldots, y_i, \ldots, y_{n_a}\}$.

Estimates of the cumulative distribution function are obtained for the indicator levels of interest,
$x = \{x_1, \ldots, x_k, \ldots, x_m\}$. Several alternatives are available for choosing $x$. The
recommended alternative is the use of equally spaced values across the range of the indicator.
Ideally, this range is known a priori and extends beyond the range of any particular data set. A
second alternative is to use the set of unique values in the data set. This alternative gives the
classical empirical cumulative distribution function. A third alternative is to use the midpoints of
adjacent ordered values in $y$ for the levels $x$.

To obtain the estimated size-weighted cumulative distribution function, $\hat{F}_a(x_k)$, the
Horvitz-Thompson estimator of a cumulative total is calculated for each $x_k$ by summing, over the
sampled units whose indicator value is less than or equal to $x_k$, the size-weight values $w_i$
divided by their inclusion probabilities $\pi_i$. Alternatively, when the subpopulation size
(size-weighted total) is known, first form the Horvitz-Thompson ratio estimator by dividing this
cumulative total by the estimated subpopulation size, $\hat{W}_a$, and then multiply this ratio by
the known subpopulation size, $W_a$, to obtain $\hat{F}_a(x_k)$.

The Horvitz-Thompson ratio estimator may perform better than the estimator that does not use the
known subpopulation size, $W_a$. Some of the conditions under which this ratio estimator is
recommended are given in Section 9. This ratio estimator cannot be used in the case of missing
data.

Confidence limits for $F_a(x_k)$ are produced by assuming a Normal distribution. These limits may
be used to construct either a lower confidence bound, an upper confidence bound, or a confidence
interval for $F_a(x_k)$. Computation of these limits requires an estimated variance of
$\hat{F}_a(x_k)$, which is not provided in this method. Details for computing a suitable estimated
variance of $\hat{F}_a(x_k)$ are found in other methods referenced in Section 7.

The output consists of the estimated cumulative distribution function values with either a one-sided
confidence bound (upper or lower) or a confidence interval for $F_a(x_k)$.

3 Conditions Under Which This Method Applies

• Probability sample with known inclusion probabilities

• Discrete resource

• Arbitrary subpopulation

• All units sampled from the subpopulation must be accounted for before applying this method

• When the indicator value is missing, exclude this missing value and the corresponding
inclusion probability and size-weight; use the Horvitz-Thompson estimator of a total

4 Required Elements

4.1 Input Data

$y_i$ = value of the indicator for the $i$th unit sampled from subpopulation $a$.

$\pi_i$ = inclusion probability for selecting the $i$th unit of subpopulation $a$.

$w_i$ = size-weight value for the $i$th unit sampled from subpopulation $a$.

4.2 Additional Components

$n_a$ = number of units sampled from subpopulation $a$.

$x_k$ = $k$th indicator level of interest.

$W_a$ = subpopulation size (size-weighted total), if known.

4.3 Graphical Display Considerations

Two issues should be resolved before graphing the CDF: 1) how many points to use and 2) what are
the first and last points on the plot. The following are guidelines for the three alternatives
mentioned in Section 2. In all three approaches, the plotted points are connected by line segments.
The sample $y$ is understood to be in ascending order for this discussion.

If the empirical CDF is chosen, the number of points plotted is at most $n_a+2$. The first plotted
point is (0,0) when the indicator takes on only positive values. Otherwise, choose a point smaller
than $y_1$ as the abscissa and assign zero as the ordinate. Where there is more than one occurrence
of an indicator level in the data set, plot only one point using the largest cumulative distribution
function value associated with this level as the ordinate.

If the midpoints of adjacent values in $y$ are used for the levels $x$, at most $n_a+1$ points are
plotted. To determine the first plotted point, calculate the distance between $y_1$ and $y_2$. Take
half this distance and subtract it from $y_1$ to obtain the abscissa. If this abscissa is a negative
number and the indicator can never be negative, instead assign zero as the abscissa. Use zero as the
ordinate. Similarly, to determine the last plotted point, calculate the distance between the two
largest $y$ values, $y_{n_a-1}$ and $y_{n_a}$. Halve this distance, add it to $y_{n_a}$, and plot
this abscissa using the cumulative size-weighted total associated with $y_{n_a}$ as its ordinate.

The recommended approach uses equally spaced levels across the potential range of the indicator.
The levels used should be potential real values that the indicator could attain. In this case of
discrete data, integer values should be used. As mentioned previously, ideally this range is known a
priori and extends beyond the range of any particular data set. If an informed guess cannot be made
for this range, one suggested range would be to use the midpoint approach for obtaining the first
and last plot points as explained in the previous paragraph. How many points to use is a subjective
decision that should take into account the chosen range, the size of the data set, and sometimes
the data distribution itself. The following suggestions are given to help decide how many points to
use.

In most cases, using the same number of points as used in the empirical distribution, $n_a+2$
points, will be sufficient for plotting the CDF. Extreme outliers in a particular data set may have
a great influence on the graph. In this case, more points may be needed to achieve greater
resolution within the body of the data. In the case of large data sets, plotting fewer than $n_a+2$
points should be adequate. Begin by using 100 points for these larger data sets. The range of the
indicator will have a part in determining if this is an adequate number of points. Trying the plots
with differing numbers of points may be useful to see if the graph changes significantly.

The y-axis (CDF) should range in values from zero to either the known or estimated subpopulation
size, depending on the estimator used. This size will be the cumulative size-weighted total
associated with $y_{n_a}$. This method may result in confidence limits which drop below zero or
exceed the applicable subpopulation size. These limits should not appear on the plot. Instead,
truncate the plotted upper limit at $\hat{F}_a(y_{n_a})$. Truncate the plotted lower limit at zero.

5 Formulas and Definitions

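The estimated size-weighted CDF (total) for indicator value $x_k$ in subpopulation $a$,
$\hat{F}_a(x_k)$; Horvitz-Thompson estimator of a total is

$$\hat{F}_a(x_k) = \sum_{i=1}^{n_a} \frac{w_i}{\pi_i} \, I(y_i \le x_k) .$$

The estimated size-weighted CDF (total) for indicator value $x_k$ in subpopulation $a$,
$\hat{F}_a(x_k)$, with known subpopulation size, $W_a$, and estimated subpopulation size,
$\hat{W}_a$; Horvitz-Thompson ratio estimator is

$$\hat{F}_a(x_k) = W_a \, \frac{\sum_{i=1}^{n_a} \frac{w_i}{\pi_i} \, I(y_i \le x_k)}{\hat{W}_a} ,
\qquad \hat{W}_a = \sum_{i=1}^{n_a} \frac{w_i}{\pi_i} .$$

Following the procedure in Section 6.6, the one-sided lower and upper confidence bounds,
$B_L(x_k)$ and $B_U(x_k)$, and the two-sided confidence interval, $C(x_k)$, are

$$B_L(x_k) = \hat{F}_a(x_k) - z_\alpha \sqrt{\hat{V}[\hat{F}_a(x_k)]} ,$$

$$B_U(x_k) = \hat{F}_a(x_k) + z_\alpha \sqrt{\hat{V}[\hat{F}_a(x_k)]} ,$$

$$C(x_k) = \left( \hat{F}_a(x_k) - z_{\alpha/2} \sqrt{\hat{V}[\hat{F}_a(x_k)]} ,\;
\hat{F}_a(x_k) + z_{\alpha/2} \sqrt{\hat{V}[\hat{F}_a(x_k)]} \right) .$$

For these equations:

$I(y_i \le x_k)$ = 1 if $y_i \le x_k$, and 0 otherwise.
$\hat{V}[\hat{F}_a(x_k)]$ = estimated variance of $\hat{F}_a(x_k)$ (see Section 7).
$x_k$ = $k$th indicator level of interest.
$y_i$ = value of the indicator for the $i$th unit sampled from subpopulation $a$.
$\pi_i$ = inclusion probability for selecting the $i$th unit of subpopulation $a$.
$w_i$ = size-weight value for the $i$th unit sampled from subpopulation $a$.
$n_a$ = number of units sampled from subpopulation $a$.
$z_\alpha$ = z-score from the standard Normal distribution.
$\alpha$ = level of significance.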

6 Procedure
6.1 Enter Data

Input the sample data consisting of the indicator values, $y_i$, their associated inclusion
probabilities, $\pi_i$, and their size-weights, $w_i$. For example,

Calcium    Inclusion      Lake Area
           Probability
$y_i$      $\pi_i$        $w_i$

1.5992     .07734         24.249
2.3707     .00375         92.251
1.5992     .75000         28.018
2.0000     .75000         52.953
7.0000     .00375         362.254
2.8196     .02227         140.671
1.2204     .01406         7.758
1.5992     .03750         29.702
2.9399     .00586         149.276
.7395      .00375         1.081

6.2 Sort Data

Sort the sample data in nondecreasing order based on the $y_i$ indicator values. Keep all
occurrences of an indicator value to obtain correct results.

Calcium    Inclusion      Lake Area
           Probability
$y_i$      $\pi_i$        $w_i$

.7395      .00375         1.081
1.2204     .01406         7.758
1.5992     .07734         24.249
1.5992     .75000         28.018
1.5992     .03750         29.702
2.0000     .75000         52.953
2.3707     .00375         92.251
2.8196     .02227         140.671
2.9399     .00586         149.276
7.0000     .00375         362.254
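The sort in this step can be sketched in Python as follows. This is only an illustration, not part of the original method: the tuple layout (indicator, inclusion probability, size-weight) and the names `sample` and `sorted_sample` are assumptions made here.

```python
# Example data from Section 6.1 as (y_i, pi_i, w_i) tuples.
# The tuple layout and variable names are illustrative assumptions.
sample = [
    (1.5992, 0.07734, 24.249),
    (2.3707, 0.00375, 92.251),
    (1.5992, 0.75000, 28.018),
    (2.0000, 0.75000, 52.953),
    (7.0000, 0.00375, 362.254),
    (2.8196, 0.02227, 140.671),
    (1.2204, 0.01406, 7.758),
    (1.5992, 0.03750, 29.702),
    (2.9399, 0.00586, 149.276),
    (0.7395, 0.00375, 1.081),
]

# Sort in nondecreasing order of the indicator value only.
# sorted() is stable, so tied indicator values keep their input order
# and every occurrence is retained.
sorted_sample = sorted(sample, key=lambda unit: unit[0])
```

Sorting on the indicator alone, rather than on the whole tuple, keeps each inclusion probability and size-weight attached to its own unit.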

6.3 Obtain Subpopulation Size (Size-Weighted Total)

If using the Horvitz-Thompson ratio estimator, input $W_a$ and calculate $\hat{W}_a$ from the
sample data. Divide each $w_i$ by the inclusion probability, $\pi_i$, for all units in the sample
from subpopulation $a$. Sum these quantities to obtain $\hat{W}_a$.

$\hat{W}_a$ = (1.081/.00375) + (7.758/.01406) + (24.249/.07734) + . . . + (362.254/.00375) =
155045.265 and $W_a$ = 156000 for this data set.
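As a check on the arithmetic above, the estimated subpopulation size can be computed with a short Python sketch. The variable names (`w`, `pi`, `W_hat`, `W_known`) are assumptions chosen for this illustration.

```python
# Size-weights (lake area) and inclusion probabilities from Section 6.2,
# listed in the same unit order. Names are illustrative assumptions.
w = [1.081, 7.758, 24.249, 28.018, 29.702,
     52.953, 92.251, 140.671, 149.276, 362.254]
pi = [0.00375, 0.01406, 0.07734, 0.75000, 0.03750,
      0.75000, 0.00375, 0.02227, 0.00586, 0.00375]

# Estimated subpopulation size: sum of w_i / pi_i over all sampled units.
W_hat = sum(w_i / pi_i for w_i, pi_i in zip(w, pi))

# Known subpopulation size given in the text for this data set.
W_known = 156000.0

# Factor applied to each cumulative sum by the ratio estimator.
ratio_adjustment = W_known / W_hat
```

With these inputs `W_hat` agrees with the value 155045.265 quoted in the text to within rounding.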

6.4 Input Indicator Levels of Interest

Assign indicator levels of interest, x, based on graphical display considerations. Choose one of the
three methods previously discussed in Section 4.3.

6.4.1 The Recommended Approach — Levels of Interest

Form an expected range of the indicator before looking at the data. Next, examine the data set to
see if the estimated range encompasses all $y$ values. If not, increase the range to encompass the
outlying $y$ values. If there are large outliers, more points than $n_a+2$ may be needed to retain
good resolution in the body of the plot. Determine evenly spaced $x$ values across the chosen range.

For this example, the estimated range was .5 to 9.5 mg/L. The range does not have to be adjusted
because it includes the observed $y$ values. The point spacing interval for $x$ is
$x_{int} = (x_{max} - x_{min})/(n_a - 1) = (9.5 - .5)/(10 - 1) = 9/9 = 1.0$. The 10 $x$ values
$= (x_{min}, x_{min} + 1.0, x_{min} + 2(1.0), \ldots)$ = (.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5,
8.5, 9.5). Try obtaining the cumulative distribution function first with these $x$ values and then
again with an increased number of $x$ values spaced closer together. More points across the range
may be needed because all but one of the $y_i$ values are less than 3.0.
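The spacing calculation above can be sketched in Python. The function name `equally_spaced_levels` and its parameters are hypothetical, introduced only for this illustration.

```python
def equally_spaced_levels(x_min, x_max, n_points):
    """Return n_points evenly spaced levels from x_min to x_max inclusive."""
    if n_points < 2:
        raise ValueError("n_points must be at least 2")
    # Spacing interval: (x_max - x_min) / (n_points - 1)
    x_int = (x_max - x_min) / (n_points - 1)
    return [x_min + k * x_int for k in range(n_points)]


# Example from the text: range .5 to 9.5 mg/L with 10 levels.
levels = equally_spaced_levels(0.5, 9.5, 10)
```

Calling the same function with a larger `n_points` gives the more closely spaced levels suggested in the text for a second trial.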

6.4.2	The Empirical CDF — Levels of Interest

For the empirical CDF, $x$ values = (.7395, 1.2204, 1.5992, 2, 2.3707, 2.8196, 2.9399, 7).
Duplicate values in the data set, such as 1.5992, do not have to be repeated when forming $x$.
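One way to form these levels in Python is to take the unique indicator values in ascending order. The names `y` and `empirical_levels` are assumptions for this sketch.

```python
# Indicator values from Section 6.1 (unsorted, with repeats).
y = [1.5992, 2.3707, 1.5992, 2.0000, 7.0000,
     2.8196, 1.2204, 1.5992, 2.9399, 0.7395]

# Unique values in ascending order; set() drops the repeated 1.5992.
empirical_levels = sorted(set(y))
```

Only the levels are de-duplicated here. All ten sampled units still contribute to the cumulative sums in Section 6.5.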

6.4.3	The Midpoint Approach — Levels of Interest

Calculate the midpoints of each pair of adjacent $y_i$ values to form $x$. The first $x$ value is
(.7395+1.2204)/2 = .9800. In this particular data set, there are three occurrences of 1.5992. As a
result, there are two midpoints of 1.5992. Regardless of how many times a midpoint is repeated,
include it only once in $x$. The $x$ values = (.9800, 1.4098, 1.5992, 1.7996, 2.1854, 2.5952,
2.8798, 4.9700).
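The midpoint levels can be sketched in Python as below. The names `y_sorted`, `midpoints`, and `midpoint_levels` are assumptions made for this illustration, and the values are left unrounded (the text reports them to four decimals).

```python
# Sorted indicator values from Section 6.2.
y_sorted = [0.7395, 1.2204, 1.5992, 1.5992, 1.5992,
            2.0000, 2.3707, 2.8196, 2.9399, 7.0000]

# Midpoint of every pair of adjacent ordered values.
midpoints = [(lo + hi) / 2 for lo, hi in zip(y_sorted, y_sorted[1:])]

# Keep each midpoint only once. The two midpoints between the tied
# 1.5992 values are exactly equal, so set() removes the repeat.
midpoint_levels = sorted(set(midpoints))
```

Ten sampled values give nine midpoints, and removing the repeated 1.5992 leaves the eight levels listed in the text.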

6.5 Compute Cumulative Distribution Function Values

Calculate $\hat{F}_a(x_k)$ for each element in $x$ using the formulas from Section 5.

To calculate $\hat{F}_a(x_1)$, compare each $y_i$ to $x_1$. If $y_i$ is less than or equal to
$x_1$, then $w_i/\pi_i$ is added to the computation of $\hat{F}_a(x_1)$ until $y_i$ exceeds $x_1$
(when using sorted data). Multiply the cumulative sum of these $w_i/\pi_i$'s by $W_a/\hat{W}_a$ to
obtain $\hat{F}_a(x_1)$, if using the Horvitz-Thompson ratio estimator. Otherwise, this cumulative
sum is $\hat{F}_a(x_1)$, if using the Horvitz-Thompson estimator of a total.

Similarly, to calculate $\hat{F}_a(x_2)$, compare each $y_i$ to $x_2$, add the $w_i/\pi_i$'s until
$y_i$ exceeds $x_2$, and then multiply this sum by $W_a/\hat{W}_a$ if applicable.

Do this for every value in $x$.

Below is an example for obtaining the cumulative sum for each $\hat{F}_a(x_k)$. Complete results
for the example data are in Section 6.7.

Calcium    Inclusion      Lake Area    Indicator Level    Cumulative Sum for
           Probability                 of Interest
$y_i$      $\pi_i$        $w_i$        $x_k$              $\hat{F}_a(x_k)$

.7395      .00375         1.081        .7395              1.081/.00375
1.2204     .01406         7.758        1.2204             1.081/.00375 + 7.758/.01406
1.5992     .07734         24.249       1.5992             1.081/.00375 + 7.758/.01406 + 24.249/.07734 +
                                                          28.018/.75000 + 29.702/.03750
1.5992     .75000         28.018
1.5992     .03750         29.702
2.0000     .75000         52.953       2.0000             1.081/.00375 + 7.758/.01406 + 24.249/.07734 +
                                                          28.018/.75000 + 29.702/.03750 + 52.953/.75000
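The cumulative sums in this step can be sketched in Python. The function `size_weighted_cdf` and its parameter names are hypothetical, written only to illustrate the two estimators from Section 5. Passing `known_size` selects the ratio estimator; omitting it gives the Horvitz-Thompson estimator of a total.

```python
def size_weighted_cdf(y, pi, w, levels, known_size=None):
    """Estimate the size-weighted CDF (total) at each level.

    y, pi, w   : indicator values, inclusion probabilities, size-weights
    levels     : indicator levels of interest, x_k
    known_size : known subpopulation size W_a, or None if unknown
    """
    expanded = [w_i / pi_i for w_i, pi_i in zip(w, pi)]  # w_i / pi_i
    scale = 1.0
    if known_size is not None:
        scale = known_size / sum(expanded)  # W_a / W_hat_a
    return [
        scale * sum(e for y_i, e in zip(y, expanded) if y_i <= x_k)
        for x_k in levels
    ]


# Example data from Section 6.2.
y = [0.7395, 1.2204, 1.5992, 1.5992, 1.5992,
     2.0000, 2.3707, 2.8196, 2.9399, 7.0000]
pi = [0.00375, 0.01406, 0.07734, 0.75000, 0.03750,
      0.75000, 0.00375, 0.02227, 0.00586, 0.00375]
w = [1.081, 7.758, 24.249, 28.018, 29.702,
     52.953, 92.251, 140.671, 149.276, 362.254]
levels = [0.7395, 1.2204, 1.5992, 2.0, 2.3707, 2.8196, 2.9399, 7.0]

ht_total = size_weighted_cdf(y, pi, w, levels)
ratio = size_weighted_cdf(y, pi, w, levels, known_size=156000.0)
```

This version rescans the sample for every level, which is simple but not the running-sum shortcut described for sorted data. For the ten-unit example the difference is immaterial, and the ratio values agree with the table in Section 6.7.2 to within rounding.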

6.6 Compute Confidence Limits

Calculate the confidence bound (upper or lower) or confidence interval for each $F_a(x_k)$ using
the formulas from Section 5.

Estimate the variance of $\hat{F}_a(x_k)$ using an applicable method listed in Section 7. Next,
take the square root of the variance and multiply this square root by the z-score from the standard
Normal distribution corresponding to the desired confidence level. Add this quantity to
$\hat{F}_a(x_k)$ to obtain the upper bound, $B_U(x_k)$. Subtract this quantity from
$\hat{F}_a(x_k)$ to obtain the lower bound, $B_L(x_k)$. For the confidence interval, obtain both
$B_L(x_k)$ and $B_U(x_k)$. For example, 1.645 would be the $z_\alpha$ for a one-sided 95% upper or
lower confidence bound, and the $z_{\alpha/2}$ for a two-sided 90% confidence interval. A two-sided
95% confidence interval would use 1.96 for $z_{\alpha/2}$.
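A minimal Python sketch of this calculation follows. The function `normal_limits` and its parameters are assumptions for illustration. The z-score is passed in explicitly (1.645 or 1.96 as in the text), and the variance must come from one of the methods in Section 7. The truncation at zero and at an optional upper cap follows the plotting guidance in Section 4.3.

```python
import math


def normal_limits(estimate, variance, z, upper_cap=None):
    """Return (lower, upper) Normal-approximation confidence limits.

    estimate  : estimated CDF value, F_hat_a(x_k)
    variance  : estimated variance of F_hat_a(x_k)
    z         : z-score for the desired confidence level
    upper_cap : optional value at which to truncate the upper limit
    """
    half_width = z * math.sqrt(variance)
    lower = max(0.0, estimate - half_width)  # limits below zero are set to zero
    upper = estimate + half_width
    if upper_cap is not None:
        upper = min(upper, upper_cap)
    return lower, upper


# First non-appended row of the table in Section 6.7.2
# (hypothetical variance, as noted in the text).
lower, upper = normal_limits(290.0, 133932.0, 1.645)
```

Using the same `z` for both limits gives either pair of one-sided 95% bounds or a two-sided 90% interval, which is why those columns share values in the tables of Section 6.7.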

6.7 Output Results

Output the indicator levels of interest, the associated size-weighted CDF value, and either a
confidence bound (upper or lower) or a confidence interval for $F_a(x_k)$. If the output generated
will be used for graphing the CDF, append the first and last graph points to this output as directed
for the three methods below. The tables in Sections 6.7.1 - 6.7.3 contain results for the ratio
estimator applied to the example data. A hypothetical variance is used in confidence bound and
interval calculations.

Lower bounds less than zero are set equal to zero.

6.7.1 The Recommended Approach — Results

Append the point (0,0) to the output file for graphing purposes. Since $x_{max}$, 9.5, exceeds the
maximum $y$ value, 7, no other points are appended.

          Size-Weighted                   One-sided 95%  One-sided 95%
          CDF, Ratio        Hypothetical  Lower Conf.    Upper Conf.    Two-sided 90%
Calcium   Estimator         Variance      Bound          Bound          Conf. Interval
$x_k$     $\hat{F}_a(x_k)$  $\hat{V}$     $B_L(x_k)$     $B_U(x_k)$     $C(x_k)$

0*        0*                0*            0*             0*             (0,0)*
0.5       0                 0             0              0              (0,0)
1.5       845               775618        0**            2294           (0,2294)
2.5       26818             797663932     0**            73278          (0,73278)
3.5       58804             2048346898    0**            133255         (0,133255)
4.5       58804             2048346898    0**            133255         (0,133255)
5.5       58804             2048346898    0**            133255         (0,133255)
6.5       58804             2048346898    0**            133255         (0,133255)
7.5       156000            0             156000         156000         (156000,156000)
8.5       156000            0             156000         156000         (156000,156000)
9.5       156000            0             156000         156000         (156000,156000)

*appended    **set to 0

6.7.2 The Empirical CDF — Result

Append the point (0,0) to the output file for graphing purposes.

          Size-Weighted                   One-sided 95%  One-sided 95%
          CDF, Ratio        Hypothetical  Lower Conf.    Upper Conf.    Two-sided 90%
Calcium   Estimator         Variance      Bound          Bound          Conf. Interval
$x_k$     $\hat{F}_a(x_k)$  $\hat{V}$     $B_L(x_k)$     $B_U(x_k)$     $C(x_k)$

0*        0*                0*            0*             0*             (0,0)*
0.7395    290               133932        0**            892            (0,892)
1.2204    845               775618        0**            2294           (0,2294)
1.5992    1995              3128985       0**            4905           (0,4905)
2.0000    2066              3270403       0**            5041           (0,5041)
2.3707    26818             797663932     0**            73278          (0,73278)
2.8196    33174             954063204     0**            83984          (0,83984)
2.9399    58804             2048346898    0**            133255         (0,133255)
7.0000    156000            0             156000         156000         (156000,156000)

*appended    **set to 0

6.7.3 The Midpoint Approach — Results

Determine the first plotted point by calculating the distance between the first two $y_i$ values,
.7395 and 1.2204. Take half this distance and subtract it from .7395 to obtain
.7395 - [(1.2204 - .7395)/2] = .4991. Append to the output (.4991,0) as the first plotted point. If
a negative number were obtained and the indicator can never be negative, append (0,0) as the first
plotted point. Similarly, to determine the last plotted point, calculate the distance between the
two largest $y_i$ values, 2.9399 and 7. Take half this distance and add it to 7 to obtain
7 + [(7 - 2.9399)/2] = 9.0301. Because the distance between these last two $y_i$ values is
relatively large, choosing the last point slightly above 7 with an ordinate of $\hat{F}_a(7)$ may
be preferable over appending (9.0301, $\hat{F}_a(7)$) to the output. For this example,
(7.5,156000) was appended.
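The two end-point abscissas described above can be sketched in Python. The function `midpoint_end_abscissas` and its `nonnegative` flag are hypothetical names for this illustration. The choice of 7.5 instead of 9.0301 in the example is a judgment call that the sketch does not automate.

```python
def midpoint_end_abscissas(y_sorted, nonnegative=True):
    """Return (first, last) abscissas for the midpoint approach.

    y_sorted    : indicator values in ascending order (at least two)
    nonnegative : True if the indicator can never be negative
    """
    # Half the gap between the two smallest values, subtracted from the smallest.
    first = y_sorted[0] - (y_sorted[1] - y_sorted[0]) / 2
    if nonnegative and first < 0:
        first = 0.0
    # Half the gap between the two largest values, added to the largest.
    last = y_sorted[-1] + (y_sorted[-1] - y_sorted[-2]) / 2
    return first, last


y_sorted = [0.7395, 1.2204, 1.5992, 1.5992, 1.5992,
            2.0000, 2.3707, 2.8196, 2.9399, 7.0000]
first_x, last_x = midpoint_end_abscissas(y_sorted)
```

The first point is plotted with an ordinate of zero and the last with the cumulative size-weighted total associated with the largest indicator value.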

          Size-Weighted                   One-sided 95%  One-sided 95%
          CDF, Ratio        Hypothetical  Lower Conf.    Upper Conf.    Two-sided 90%
Calcium   Estimator         Variance      Bound          Bound          Conf. Interval
$x_k$     $\hat{F}_a(x_k)$  $\hat{V}$     $B_L(x_k)$     $B_U(x_k)$     $C(x_k)$

.4991*    0*                0*            0*             0*             (0,0)*
.9800     290               133932        0**            892            (0,892)
1.4098    845               775618        0**            2294           (0,2294)
1.5992    1995              3128985       0**            4905           (0,4905)
1.7996    1995              3128985       0**            4905           (0,4905)
2.1854    2066              3270403       0**            5041           (0,5041)
2.5952    26818             797663932     0**            73278          (0,73278)
2.8798    33174             954063204     0**            83984          (0,83984)
4.9700    58804             2048346898    0**            133255         (0,133255)
7.5000*   156000*           0*            156000*        156000*        (156000,156000)*

*appended    **set to 0

7 Associated Methods

An appropriate variance estimator for this estimated size-weighted CDF for discrete resources may
be found in Method 8 (Horvitz-Thompson Variance Estimator).

8 Validation Data

Actual data with results, EMAP Design and Statistics Dataset #4, are available for comparing
results from other versions of these algorithms.

9 Notes

The method which uses the ratio estimator may perform better under certain conditions and may be
used only if the subpopulation size is known. Sampling done with variable probability and variable
sample size, $n_a$, are two of these conditions. The ratio estimator remains stable in these cases
and tends to have smaller variance than the other estimator because the numerator and denominator
tend to be positively correlated.

In the case of missing data, the ratio estimator cannot be used because the size of the subpopulation
is not known. All graphs should be labeled as applying only to the population that was sampled and
not to the original target population.

10 References

U.S. Environmental Protection Agency (EPA). 1993. Surface waters 1991 pilot report.
EPA/620/R-93/003. Washington, DC: U.S. Environmental Protection Agency.

Lesser, V. M., and W. S. Overton. 1994. EMAP status estimation: Statistical procedures and
algorithms. EPA/620/R-94/008. Washington, DC: U.S. Environmental Protection Agency.

Overton, W. S. 1987. A sampling and analysis plan for streams in the National Surface Water
Survey. Technical Report 117. Corvallis, OR: Oregon State University, Department of Statistics.
