R-EMAP Data Analysis Approach for Estimating the Proportion of Area that is Subnominal


R-EMAP
Data Analysis Approach for
Estimating the Proportion of Area that is Subnominal
Prepared for
Victor Serveiss
U.S. Environmental Protection Agency
Research Triangle Park, NC
Prepared by
Douglas Heimbuch
Coastal Environmental Services, Inc.
Linthicum, MD
Harold Wilson
Coastal Environmental Services, Inc.
Linthicum, MD
John Seibel
Coastal Environmental Services, Inc.
Linthicum, MD
Steve Weisberg
Versar, Inc.
Columbia, MD
April 1995

-------
TABLE OF CONTENTS
I. Introduction
II. Estimation of Proportion of Area that is Subnominal
II.A. The Resource, the Sample and the Estimate
II.B. Probability Distribution for Possible Values of the Estimate
II.C. Factors Affecting the Estimated Proportion
II.C.1. The True Proportion Subnominal
II.C.2. Sample Size and Variance
III. Construction of Confidence Limits
III.A. What are Confidence Limits?
III.B. Factors Affecting Width of the Confidence Interval
III.C. How to Compute Confidence Limits
III.C.1. Standard Graphs and Tables for Confidence Limits
III.C.2. Normal Approximation
IV. Data Analysis for Stratified Random Sampling
V. Closing Comments

-------
I. Introduction
The Environmental Monitoring
and Assessment Program (EMAP) is an
innovative, long-term research and
monitoring program that is designed to
measure the current and changing
conditions of the nation's ecological
resources. EMAP achieves this goal by
utilizing sample survey approaches
which allow scientific statements to be
made for large areas based on
measurements taken at a sample of
locations. Regional-EMAP (R-EMAP) is
a partnership among EMAP, EPA
Regional offices, other federal
agencies, and states. R-EMAP is
adapting EMAP's broad-scale approach
to produce ecological assessments at
regional, state, and local levels.
The sample survey approaches
utilized by R-EMAP are very efficient in
terms of the (small) number of
locations that need to be sampled in
order to make valid scientific
statements about the condition of a
large area (e.g., all estuarine waters
within a Region). This efficiency
carries with it a small additional cost,
however. Specialized data analysis
methods must be applied to ensure that
the results are scientifically valid.
This document is the first in a
series of methods manuals being
prepared to assist the R-EMAP partners
in implementing EMAP's sampling
approach. These manuals build upon
basic concepts that were addressed in
the document "Answers to Commonly
Asked Questions About the EMAP
Sampling Design" by providing a more
thorough discussion of specific topics.
The intended audience of the manuals
is scientists without extensive
statistical training who may become
involved in analysis of R-EMAP data.
Technical documentation, written for
statisticians, is also being prepared.
This manual describes two data
analysis methods for assessing the
status of ecological condition. One
primary measure of ecological condition
addressed by EMAP and R-EMAP is the
proportion of area that has subnominal
(i.e., not meeting some environmental
criterion) conditions. This manual
describes methods for:
o Estimation of the proportion of
area that has subnominal
conditions, and
o Construction of confidence
intervals for the estimates of
proportion of area.
These methods are equally applicable
to any type of proportion including
proportion of numbers (e.g., numbers
of lakes), proportion of length (e.g.,
miles of streams), proportion of area
(e.g., square miles of estuaries), or
proportion of volume (e.g., cubic
meters of a lake) that has subnominal
conditions. The methods are
appropriate only for a sampling
program in which 1) every location
within the resource of interest has the
same chance of being selected for
sampling, and 2) the selection of any
one location does not affect the chance
of selection for any other location. The
methods can also be applied to data
from stratified sampling if these two
conditions are satisfied within each
defined stratum.
These methods are described
along with an in-depth discussion of
underlying concepts. Underlying
concepts (rather than 'cook-book'
instructions) are emphasized for two
reasons. The first is that proper
interpretation of the results from the
data analyses requires an
understanding of the underlying
concepts. Correct interpretation of the
results of data analyses is a key link
between quality data and sound
resource management decisions. The
second reason is that each of the R-
EMAP projects is unique and may
require custom application of the data
analysis methods. Thoughtful
application of these methods cannot
occur without an understanding of the
underlying concepts. Furthermore, a
solid understanding of the underlying
concepts can be a great help when
defending results and conclusions.
II. Estimation of Proportion of
Area that is Subnominal
In this section, the recommended
method for estimating proportion of
area that is subnominal is described
and the rationale for the method is
presented. Also, properties of the
estimates are discussed. To make this
information easier to understand, the
distribution pattern of the response
variable within the resource of interest
is treated as if it is known with
certainty (i.e., a map of the response
variable for the entire resource of
interest is presented). Clearly,
complete information of this kind is
never available in practice; if it were,
there would be no need to sample.
Although the recommended method
will provide an estimate of the
proportion of area that is subnominal, it
does not provide an estimate of the
location of the subnominal areas. The
only locations that are truly known to
be subnominal are the specific sampled
locations. Other analysis approaches
may be used to map the actual
subnominal areas.
This section is organized into three
parts. The first part contains a
discussion of the general relationship
between a) the true distribution pattern
of the response variable, b) the sample
of response variables from selected
locations, and c) the estimate of
proportion of area that is subnominal.
Next, the probability of observing
different estimated values, based on
which (randomly selected) locations are
included in the sample, is discussed.
Finally, the last part contains a
discussion of factors that affect the
estimate of the proportion of area that
is subnominal, and the kinds of effects
generated by these factors.
II.A. The Resource, the Sample and the
Estimate
Figure 1. Hypothetical resource with a
subnominal area proportion of 0.2.
Figure 2. Hypothetical resource with 15
sampling locations.
For the purposes of assessing the
proportion of area that is subnominal, a
simple map of the resource of interest can
be envisioned in which areas that are
subnominal are shaded and all other areas
are left unshaded (Figure 1). The resource
depicted in Figure 1 is a hypothetical estuary
with the upstream and downstream limits of
the resource of interest marked with dotted
lines. The shaded areas, in this case, might
represent areas with concentrations of
metals in sediments that are in excess of
some standard. For the example depicted in
Figure 1, a total of 200 square miles are
subnominal and the total area of the
resource of interest (i.e., shaded plus non-
shaded areas) is 1000 square miles. The
true proportion of area that is subnominal is
the ratio of a) the extent of shaded areas
(e.g., 200 square miles) divided by b) the
entire extent of the resource of interest
(e.g., 1000 square miles). Therefore, 0.2 is
the true proportion of area that is
subnominal for this example. The true
proportion is often referred to (e.g., in
textbooks) as the 'population parameter' P.
Now suppose that a sample of 15
locations within the resource of interest is
selected at random. Furthermore, suppose
the random selection of locations is made so
that a) every location within the resource of
interest has an equal chance of being
selected, and b) after each selection of a
location, all locations are again equally likely
to be chosen as the next selected location
(Figure 2). This would be like blindly
throwing a dart at the map 15 times, each
time ignoring where the previous darts
landed.
Figure 3. Hypothetical resource with 3 of 15
samples in subnominal areas.
After the locations are selected, the
condition of the resource (e.g., whether or
not the metals concentration exceeds the
standard) is recorded. In practice this might
be accomplished by sending a field crew to
the location to collect a sample for
laboratory analysis. For this example, a
selected location is designated subnominal if
it is in a shaded area of the map (Figure 3).
Next, the total number of selected locations
with subnominal condition is recorded (call
this number 'x'), and the total sample size
(call this number 'n') is noted. These two
numbers, x and n, are all that are needed to
estimate the proportion of area that is
subnominal, and to construct the confidence
limits for the estimated proportion.
Figure 4. Hypothetical resource with 5 of 15
samples in subnominal areas.
The estimate of the proportion of area
that is subnominal (referred to as p̂) is simply
the ratio of x divided by n:
$$\hat{p} = x / n$$
For the sample depicted in Figure 3, x = 3
and n = 15. Accordingly, the estimated
proportion for this example is 3 divided by
15 which is equal to 0.2. In this case, the
estimate is the same as the true population
proportion. This will not always occur. For
example, the randomly selected locations
could have produced 5 samples that were
subnominal instead of 3 (Figure 4). In this
case the estimate would have been 5 divided
by 15 which is equal to 0.33. This estimate
is not equal to the true proportion. In fact,
the estimated value can be any one of 16
numbers from 0 to 1 (i.e., 0/15, 1/15, 2/15,
... , and 15/15). However, it is much more
likely for the estimate to be near 0.2 (i.e.,
the true proportion) than any other value.
II.B. Probability Distribution of Possible
Values of the Estimate
Figure 5. Frequency distribution of p̂ based on
one realization of 1000 random selections (15
samples, P = 0.2). X-axis: estimated proportion
of subnominal area.
Figure 6. Probability distribution of p̂ based on
exact Binomial distribution (15 samples, P =
0.2). X-axis: estimated proportion of subnominal
area.
The chance of observing each of the
possible values of the estimate can be
summarized in what is referred to as a
probability distribution (or sampling
distribution) of the estimate. The probability
distribution depicts the likelihood of each
possible outcome (i.e., estimate of the
proportion) of the random sampling compared
to all other possible outcomes and provides a
basis for assessing the reliability of the
estimate.
The probability distribution of possible
values of the estimate can be approximated
by repeating the random selection of locations
over and over. For each random selection of
15 locations, the estimate p̂ is recorded and a
cumulative tally is kept of the number of times
each possible value of p̂ is observed. For
example, with 1000 random selections (of 15
locations) the frequency distribution depicted
in Figure 5 is produced. Because each of the
1000 random selections (of 15 locations) is
equally likely, the value of p̂ with the highest
tally (or frequency) is the most likely value of
p̂. Furthermore, the probability of observing
any particular value of p̂ is the limit (as the
number of random selections goes to infinity)
of the ratio of a) the number of selections
producing that value of p̂, to b) the total
number of selections of 15 locations (Figure
6). Accordingly, the y-axis in Figure 6 has a
possible range from 0 to 1.
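The repeated-selection exercise described above can be imitated on a computer. The following Python sketch is illustrative only and is not part of the original analysis: it assumes that each selected location falls in a subnominal area with probability equal to the true proportion, independently of the other selections, and the function name and random seed are hypothetical choices.

```python
import random
from collections import Counter


def simulate_tally(true_p, n, repeats, seed=0):
    """Tally the number of subnominal locations (x) over repeated samples."""
    rng = random.Random(seed)
    tally = Counter()
    for _ in range(repeats):
        # Each of the n selected locations is subnominal with
        # probability true_p, independently of the others.
        x = sum(rng.random() < true_p for _ in range(n))
        tally[x] += 1
    return tally


n, true_p, repeats = 15, 0.2, 1000
tally = simulate_tally(true_p, n, repeats)

# Relative frequency of each possible value of the estimate x / n.
for x in range(n + 1):
    print(f"p_hat = {x / n:.2f}   relative frequency = {tally[x] / repeats:.3f}")
```

Increasing `repeats` should bring the relative frequencies closer to the probabilities depicted in Figure 6.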
In practice, it is not necessary to
repeatedly sample the resource in order to
construct the probability distribution of the
estimated values. The probability distribution
of estimates of proportion (based on data
from the type of random sampling addressed
in this document) can always be described by
a standard distribution called the Binomial
distribution. The Binomial distribution is fully
defined by only two parameters: the true
proportion and the sample size. Therefore,
the probability distribution of possible values
of the proportion of area can be constructed
by plugging a value for the true proportion
(assumed to be known in this section of the
manual) and the sample size into the formula
for the Binomial distribution.
Figure 7. Average value of p̂ (15 samples, P =
0.2).
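The Binomial formula itself is not written out in this manual. The following Python sketch uses the standard textbook form, in which the probability of observing exactly x subnominal locations out of n is C(n, x) P^x (1 - P)^(n - x); the function and variable names are illustrative assumptions.

```python
from math import comb


def binomial_pmf(x, n, p):
    """Probability that exactly x of n sampled locations are subnominal."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


n, true_p = 15, 0.2

# Map each possible estimate x / n to its probability.
distribution = {x / n: binomial_pmf(x, n, true_p) for x in range(n + 1)}

for p_hat, prob in distribution.items():
    print(f"p_hat = {p_hat:.2f}   probability = {prob:.4f}")
```

The probabilities sum to 1, and for this example the largest one belongs to the estimate 3/15 = 0.2.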
Also, it is worth noting that the average
of all possible values of p̂, with each value
weighted by its probability, is equal to the true
proportion. This can be seen by envisioning
the probabilities (Figure 6) as weights on a
board. The average is the center of mass for
the weights (Figure 7) and is equal to the true
proportion. In general, if the mean of the
probability distribution of an estimate is equal
to the parameter being estimated (in this case
the true proportion), then the method of
estimation is referred to as being unbiased.
Because this condition is satisfied for the
recommended method of estimation, this
method is unbiased.
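The unbiasedness property can also be checked algebraically from the standard Binomial formula. In the sketch below, the expectation notation E[·] is introduced here for convenience and is not used elsewhere in this manual; x, n and P are as defined in Section II.A.

```latex
\begin{aligned}
E[\hat{p}] &= \sum_{x=0}^{n} \frac{x}{n} \binom{n}{x} P^{x} (1-P)^{n-x} \\
           &= \frac{1}{n} \sum_{x=1}^{n} n \binom{n-1}{x-1} P^{x} (1-P)^{n-x} \\
           &= P \sum_{x=1}^{n} \binom{n-1}{x-1} P^{x-1} (1-P)^{(n-1)-(x-1)} \\
           &= P \, (P + 1 - P)^{n-1} = P .
\end{aligned}
```

The second line uses the identity x C(n, x) = n C(n-1, x-1), and the last line uses the fact that Binomial probabilities for a sample of size n - 1 sum to 1.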
Figure 8. Hypothetical resource with scattered
subnominal areas (P = 0.2).
Another important property of this
method of estimation is that the specific
pattern of subnominal areas on the map does
not affect the estimates (providing that the
true proportion remains unchanged). For
example, with the shaded areas more
scattered (Figure 8), the probability
distribution of values of p̂ is exactly the same
as the probability distribution (Figure 6)
associated with the map in Figure 1.
Therefore the same sampling design can be
used regardless of the underlying spatial
pattern, which generally is unknown. The
specific pattern of the response variable
within the resource of interest does not affect
the probability distribution of the estimate (p̂).
However, as described in the next part of this
section, the probability distribution of p̂ is
affected by the true proportion and by the
sample size (n).
II.C. Factors Affecting the Estimated
Proportion
Figure 9. Hypothetical resource with a
subnominal area proportion of 0.3.
Figure 10. Probability distribution of p̂ (15
samples, P = 0.3). X-axis: estimated proportion
of subnominal area.
II.C.1. The True Proportion
Subnominal
There is a different probability
distribution of values of p̂ for every possible
value of the true proportion. For example, if
the total area that is subnominal is 300
square miles (Figure 9), the true proportion
is 0.3 (i.e., 300 square miles divided by
1000 square miles). The probability
distribution for estimated values of the
proportion can be generated from the
formula for the Binomial distribution. In this
case the probability distribution is as
depicted in Figure 10. Notice that the
distribution has shifted to the right. The
most likely values are near 0.3 and the mean
of the distribution is exactly 0.3 (Figure 11).
Figure 11. Average value of p̂ (15 samples, P
= 0.3).
Figure 12. Hypothetical resource with
subnominal area proportion of 0.4.
The same exercise can be conducted
for a map in which 400 square miles are
subnominal (Figure 12). In this case the true
proportion is 0.4 (i.e., 400 square miles
divided by 1000 square miles). The
probability distribution for a true proportion
equal to 0.4 is depicted in Figure 13. Now
the most likely values are near 0.4 and the
mean of the probability distribution is exactly
0.4 (Figure 14).
The mean of the probability distribution
of values of p̂ is always exactly equal to the
true proportion. This is true for any value of
the true proportion (from 0.0 to 1.0), and is
true regardless of the spatial pattern of
subnominal areas within the resource of
interest. Also, the most likely values of p̂
are always near the true proportion.
Furthermore, by increasing the sample size,
the probability that the estimate will be very
close to the true proportion can be
increased.
Figure 13. Probability distribution of p̂ (15
samples, P = 0.4).
Figure 14. Average value of p̂ (15 samples, P
= 0.4).
II.C.2. Sample Size and Variance
Figure 15. Hypothetical resource with 30
samples, subnominal area proportion of 0.2.
Figure 16. Probability distribution of p̂ (30
samples, P = 0.2). X-axis: estimated proportion
of subnominal area.
Intuitively, it seems that estimates
based on larger samples should be more
reliable than estimates based on just a few
locations. The effect of sample size on the
probability distributions of values of p̂
supports this position. As can be seen from
the following examples, the effect of sample
size on the probability distribution of values
of p̂ can be quite dramatic.
First consider a random sample of 30
locations from a resource with a true
proportion of subnominal area equal to 0.2
(Figure 15). The probability distribution of
values of p̂ in this case is depicted in Figure
16. The probability distribution is much
more concentrated around the true
proportion of 0.2. Also, notice that instead
of 16 possible values of p̂, there are now 31
possible values (i.e., 0/30, 1/30, 2/30, ... ,
and 30/30). This provides a finer scale of
resolution for the estimates.
Now consider a random sample of 50
locations from the same resource (Figure
17). The probability distribution of values of
p̂ is extremely concentrated around the true
proportion of 0.2 (Figure 18). The scale of
resolution for the estimates is improved as
well. There are now 51 possible values of p̂
(i.e., 0/50, 1/50, 2/50, ... , and 50/50) with
a step size between possible values of only
0.02 (i.e., 1/50).
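The effect of sample size on the spread of the distribution can be put in numbers. The following Python sketch assumes the variance expression that Section III.C.2 attributes to this section, namely the true proportion times one minus the true proportion, divided by the sample size; the function name is an illustrative assumption.

```python
from math import sqrt


def estimate_variance(true_p, n):
    """Variance of the estimate for a given true proportion and sample size."""
    return true_p * (1 - true_p) / n


true_p = 0.2
for n in (15, 30, 50):
    var = estimate_variance(true_p, n)
    # The step size 1 / n is the scale of resolution of the estimate.
    print(f"n = {n:2d}   step = {1 / n:.3f}   "
          f"variance = {var:.5f}   std. dev. = {sqrt(var):.3f}")
```

Both the step size and the standard deviation shrink as the sample size grows, which matches the narrowing seen from Figure 6 to Figure 16.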
The benefits of increased sample size
are readily apparent from these examples.
The scale of resolution of the estimates is
improved and the spread or dispersion of the
probability distribution is reduced with
increased sample size. More specifically, the
dispersion of the probability distribution of
-------
III. Construction of Confidence Limits
estimate of the proportion of area that is
subnominal, how can you be sure that the
true proportion isn't some other number
entirely? Unfortunately, you can't be sure.
However, you can put confidence limits
around the estimate.
In this section, the recommended
method for constructing confidence limits is
described and the rationale for using the
method is presented. For the purposes of
this section, the pattern of the response
variable within the resource of interest is
treated as if it is not known. This is in
contrast to the previous section, and
represents the more realistic situation.
This section is organized into three
parts. In the first part, the concept of
confidence limits for estimates of
proportions is discussed, and the
recommended approach for constructing
confidence limits is presented. Next, factors
that affect the width of confidence limits are
described. In the final part of this section,
standard graphs and tables for exact
confidence limits, and the use of an
approximation (the Normal approximation)
are discussed.
III.A. What are Confidence Limits?
Confidence limits are bounds around
the estimate that are determined such that
there is a known probability that the bounds
will bracket the true proportion. For
example, 90% confidence limits have the
property that over many replications of
sample selection and confidence interval
calculation, 90% of the resulting intervals
will cover the true value. Therefore, with
symmetric confidence limits there is a 5%
chance that the lower limit will be greater
than the true proportion and there is a 5%
chance that the upper limit will be less than
the true proportion.
Figure 20. Probability distribution of p̂ (30
samples, P = 0.3), with prob{p̂ ≥ 0.3} marked.
X-axis: estimated proportion of subnominal area.
Figure 21. Construction of lower 5%
confidence bound (one panel per hypothesized
true proportion).
The approach for constructing
confidence limits for estimates of proportion
may be understood by considering a simple
example. Suppose 30 locations are randomly
selected and measurement at these locations
generates an estimate of the proportion of
area that is subnominal equal to 0.3. As is
almost universally the case, the true
proportion subnominal is not known. To
place a lower bound on the estimate we can
start by asking the question: If the true
proportion was 0.20, what would be the
probability of observing an estimate of 0.3 or
larger? The answer to the question can be
found in the probability distribution of values
of p̂ for the case of a true proportion equal to
0.20 and a sample size of 30 (Figure 16).
The answer is the sum of the probabilities
from 0.3 through 1.0 (Figure 20), which for
this example is 0.13. This means that even
if the true proportion was as low as 0.2 there
would be a 13% chance of observing an
estimate of 0.3 or larger. Therefore, 0.2 can
be taken as the lower 13% confidence
bound.
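The tail probability in this example can be checked directly from the standard Binomial formula. The following Python sketch is illustrative only; the function names are assumptions, and the estimate of 0.3 is expressed as 9 subnominal locations out of 30.

```python
from math import comb


def binomial_pmf(x, n, p):
    """Probability that exactly x of n sampled locations are subnominal."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def prob_at_least(x_obs, n, p):
    """Probability of observing x_obs or more subnominal locations."""
    return sum(binomial_pmf(x, n, p) for x in range(x_obs, n + 1))


n, x_obs = 30, 9  # 9 / 30 = 0.3
tail = prob_at_least(x_obs, n, 0.20)
print(round(tail, 2))  # agrees with the 13% quoted in the text
```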
If some pre-determined probability level
(e.g., 5%) is desired, a range of hypothesized
values of the true proportion could be
evaluated. For example, the cumulative
probabilities (for values of p̂ of 0.3 through
1.0) could be computed for cases of the true
proportion being 0.19, 0.18, 0.17 and 0.16
(Figure 21). The cumulative probabilities of
greater values of p̂ for these four scenarios
are 0.10, 0.08, 0.06, and 0.04. Therefore,
the lower 5% confidence bound is between
0.17 and 0.16 (further evaluation can show
that, to two decimal places, the bound is
0.17).
Figure 22. Probability distribution of p̂ (30
samples, P = 0.45), with prob{p̂ ≤ 0.3} marked.
X-axis: estimated proportion of subnominal area.
Figure 23. Construction of upper 5%
confidence bound (one panel per hypothesized
true proportion).
Similarly, to place an upper bound on
the estimate, we can start by asking the
question: If the true proportion was 0.45,
what would be the probability of observing
an estimate of 0.3 or smaller? In this case
we need to examine the probability
distribution of p̂ for a sample size of 30 and
assuming the true proportion is equal to
0.45. The answer in this case is the sum of
probabilities from 0.0 through 0.3 (Figure
22), which for this example is 0.07.
Therefore, 0.45 can be taken as the upper
7% confidence bound.
Again, if a pre-determined probability
level (e.g., 5%) is desired, a range of
hypothesized values of the true proportion
could be evaluated. In this case, cumulative
probabilities (from values of p̂ of 0.0 through
0.3) could be computed for true proportion
values of 0.46, 0.47, 0.48 and 0.49 (Figure
23), for example. The cumulative
probabilities for these four scenarios are
0.06, 0.04, 0.04 and 0.03. Therefore, the
upper 5% confidence bound is between
0.46 and 0.47 (further evaluation can show
that, to two decimal places, the bound is
0.47).
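The trial-and-error search over hypothesized true proportions can be automated. The following Python sketch uses bisection, which is one possible search strategy and not necessarily the one used to prepare this manual; all function names are illustrative assumptions.

```python
from math import comb


def upper_tail(x_obs, n, p):
    """Probability of x_obs or more subnominal locations out of n."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x)
               for x in range(x_obs, n + 1))


def solve_increasing(f, target, steps=60):
    """Bisection on [0, 1] for f(p) = target, where f increases with p."""
    lo, hi = 0.0, 1.0
    for _ in range(steps):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_limits(x_obs, n, level=0.90):
    """Lower and upper Binomial confidence bounds for x_obs out of n."""
    tail = (1 - level) / 2
    # Lower bound: true proportion at which P(X >= x_obs) equals the tail.
    lower = 0.0 if x_obs == 0 else solve_increasing(
        lambda p: upper_tail(x_obs, n, p), tail)
    # Upper bound: P(X <= x_obs) = tail, i.e. P(X >= x_obs + 1) = 1 - tail.
    upper = 1.0 if x_obs == n else solve_increasing(
        lambda p: upper_tail(x_obs + 1, n, p), 1 - tail)
    return lower, upper


lower, upper = exact_limits(9, 30, level=0.90)  # 9 / 30 = 0.3
print(round(lower, 2), round(upper, 2))
```

For 9 subnominal locations out of 30, the result should fall within the ranges found by the step-by-step evaluation above (between 0.16 and 0.17, and between 0.46 and 0.47).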
The upper 5% confidence bound and
the lower 5% confidence bound form the
90% confidence limits for the proportion of
area that is subnominal. For this example
(i.e., with a sample size of 30 and an
observed value of p̂ equal to 0.3), the 90%
confidence limits are 0.17 and 0.47. There
is a 90% chance that confidence limits
constructed in this manner will bracket
the true proportion, and a 10% chance
that the limits will miss the true proportion.
This result can be demonstrated by repeating
the random selection of 30 locations from a
known pattern of subnominal areas (as was
discussed in Section II.A) over and over. For
each of the random selections, the value of p̂
can be computed and the corresponding
confidence limits determined. In 90 out of
100 iterations (on average), the confidence
limits will bracket the true proportion,
regardless of the value of the true proportion.
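The bracketing behavior can be examined without simulation by adding up the Binomial probabilities of every outcome whose limits bracket a chosen true proportion. The following Python sketch is illustrative only; the function names and the bisection search are assumptions. Because p̂ can take only a limited set of values, limits built this way generally bracket the true proportion at least 90% of the time, and often somewhat more.

```python
from math import comb


def binomial_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def upper_tail(x_obs, n, p):
    """Probability of x_obs or more subnominal locations out of n."""
    return sum(binomial_pmf(x, n, p) for x in range(x_obs, n + 1))


def solve_increasing(f, target, steps=60):
    """Bisection on [0, 1] for f(p) = target, where f increases with p."""
    lo, hi = 0.0, 1.0
    for _ in range(steps):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_limits(x_obs, n, level=0.90):
    tail = (1 - level) / 2
    lower = 0.0 if x_obs == 0 else solve_increasing(
        lambda p: upper_tail(x_obs, n, p), tail)
    upper = 1.0 if x_obs == n else solve_increasing(
        lambda p: upper_tail(x_obs + 1, n, p), 1 - tail)
    return lower, upper


def coverage(true_p, n, level=0.90):
    """Probability that the limits bracket true_p, summed over outcomes."""
    total = 0.0
    for x in range(n + 1):
        lower, upper = exact_limits(x, n, level)
        if lower <= true_p <= upper:
            total += binomial_pmf(x, n, true_p)
    return total


for true_p in (0.2, 0.3, 0.4):
    print(true_p, round(coverage(true_p, 30), 3))
```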
III.B. Factors Affecting Width of the
Confidence Interval
The interval between the lower and
upper 90% confidence limits is the 90%
confidence interval. In the example
discussed above, the width of the 90%
confidence interval is 0.30 (0.47 minus
0.17). Two factors (given a specific
estimate, p̂) affect the width of the
confidence interval: the confidence level
(e.g., 90%), and the sample size. If a higher
level of confidence had been specified, say
95%, then the confidence interval would
have been wider. On the other hand, if the
sample size had been larger, then the
confidence interval would have been
narrower.
The fact that the width of a confidence
interval increases as the confidence level
increases is intuitively appealing. This is
consistent with being more confident about
making general statements (wide intervals)
and being less confident about making more
specific statements (narrow intervals). The
reason for this effect of confidence level on
confidence intervals is clear from the way the
confidence limits are determined. For
the example discussed above, if a higher
confidence level is desired (e.g., 95% rather
than 90%), then the upper and lower 2.5%
confidence bounds would be used. The true
proportion would have to be 0.15 in order for
the cumulative probability (from 0.3 through
1.0) to equal only 2.5% (Figure 24).
Similarly, the true proportion would have to
be 0.49 in order for the cumulative
probability (from 0.0 through 0.3) to equal
only 2.5% (Figure 25). The effect of varying
confidence levels from 75% to 99% on the
size of the confidence intervals is summarized
in Figure 26 (for a sample size of 30 and an
observed p̂ of 0.3).
Figure 24. Probability distribution of p̂ (30
samples, P = 0.15). X-axis: estimated proportion
of subnominal area.
Figure 25. Probability distribution of p̂ (30
samples, P = 0.49), with prob{p̂ ≤ 0.3} marked.
X-axis: estimated proportion of subnominal area.
Figure 26. Confidence interval widths as a
function of confidence level (30 samples, p̂ =
0.3).
Figure 27. Confidence interval width as a
function of sample size (90% confidence, p̂ =
0.3).
The fact that the width of the confidence
interval decreases as the sample size
increases is also intuitively appealing.
Increased sample size, as discussed in Section
II, increases the reliability of estimates and
should allow more detailed statements to be
made without diminishing the level of
confidence. For example, with a sample size
of 60 and assuming the true proportion is
0.17 (the lower 5% bound for a sample size
of 30) there is only a 1% chance that the
observed value of p̂ would be 0.3 or greater.
The 5% lower confidence bound in this case
is 0.20. Similar effects are exhibited with the
upper confidence bound. The effect of
varying sample size on the size of confidence
intervals is summarized in Figure 27 (for 90%
confidence and an observed p̂ of 0.3) for a
range of sample sizes from 10 to 100.
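The comparison between sample sizes of 30 and 60 can be reproduced from the standard Binomial formula. The following Python sketch is illustrative only; the function name is an assumption, and an estimate of 0.3 is expressed as 9 out of 30 or 18 out of 60 subnominal locations.

```python
from math import comb


def prob_at_least(x_obs, n, p):
    """Probability of x_obs or more subnominal locations out of n."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x)
               for x in range(x_obs, n + 1))


true_p = 0.17
tail_30 = prob_at_least(9, 30, true_p)   # estimate of 0.3 or greater, n = 30
tail_60 = prob_at_least(18, 60, true_p)  # estimate of 0.3 or greater, n = 60

print(f"n = 30: {tail_30:.3f}")
print(f"n = 60: {tail_60:.3f}")
```

The same hypothesized true proportion becomes much harder to reconcile with an estimate of 0.3 once the sample size is doubled, which is why the lower bound moves up toward the estimate.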
III.C. How to Compute Confidence Limits
III.C.1. Standard Graphs and Tables
for Confidence Limits
No special calculations are needed to
determine confidence limits for the situations
described above. The required confidence
limits are tabulated and summarized in
standard graphs in many textbooks and
handbooks on statistics (e.g., see W.H. Beyer
[ed.] 1976. CRC Handbook of Tables for
Probability and Statistics. CRC Press). The
confidence limits are referred to as
"Confidence Limits for Proportions" for the
"Binomial Distribution". Separate tables are
published for different confidence levels
(usually 90%, 95% and 99%). The tables are
read by specifying x (referred to as the
numerator or the number of successes)
and n (referred to as the denominator
or the sample size). The corresponding
table entries are the lower and upper
confidence limits. This information is
also summarized in graphs that depict
the upper and lower confidence limits
on the y-axis and the estimated
proportion on the x-axis.
III.C.2. Normal Approximation
An alternative approach to the
one based on the Binomial distribution
(described above) is to construct the
confidence limits based on the Normal
distribution. Construction of
confidence limits based on the Normal
distribution provides a greater degree
of flexibility which can be
advantageous for more complex
sampling designs (e.g., stratified
random designs discussed in the next
section).
As discussed in the previous
section the Binomial distribution exactly
describes the probability distribution of
possible values of the estimate.
However, under certain circumstances,
the Normal distribution is a good
approximation to the Binomial
distribution. In these cases, confidence
limits based on the Normal distribution
may be used instead of those based on
the Binomial distribution.
Approximate confidence limits,
based on the Normal distribution, are
computed from a simple formula.
Therefore, confidence limits do not
have to be restricted to confidence
levels and sample sizes listed in
standard tables and graphs, and
interpolations between tabulated values
are not required. For example, a
standard table of Binomial confidence
limits may list confidence limits for
confidence levels of 95% and 99%,
and may list sample sizes in steps of
10 (e.g., 10, 20, 30, etc.). By using
the Normal approximation, any
combination of confidence level and
sample size can be addressed directly
(e.g., 85% confidence and a sample
size of 53). The Normal approximation
requires information on only two
quantities: a coefficient based on the
Normal distribution, and the variance of
the probability distribution of possible
values of the estimate (p̂).
The variance of the probability
distribution of possible values of the
estimate can be estimated as the
product of the estimated proportion
times one minus the estimated
proportion, all divided by the sample
size:
$$[\hat{p}\,(1-\hat{p})] / n$$
This is the same as the expression for
the variance presented in Section
II.C.2, except that the true proportion
(which in practice is unknown) in this
case is replaced by the estimate of the
proportion (p̂). For example, if p̂ is 0.4
and the sample size is 50, then the
estimate of the variance is 0.0048 (i.e.,
[0.4 x 0.6] / 50).
The required Normal coefficients are
tabulated in most introductory statistics
textbooks as well as in advanced texts.
Generally the tabulations are presented in
steps of 1% or less (e.g., 90%, 91%, 92%,
etc.). For the standard confidence levels of
90%, 95% and 99%, the Normal coefficients
(c) are 1.65, 1.96, and 2.58 respectively.
The lower confidence limit based on the
Normal approximation is simply the estimated
value minus a quantity equal to the product of
the Normal coefficient (c) for the desired
confidence level 3nd the square root of the
estimated variance:
p - C yj p (1-p) / n
Similarly, the upper confidence limit based on
this approximation is the estimated value plus
that same quantity:
$\hat{p} + c\sqrt{\hat{p}(1-\hat{p})/n}$
For a 95% confidence interval based on the
example in the previous paragraph, the lower
confidence limit is 0.26,
$0.4 - 1.96\sqrt{0.0048}$,
and the upper confidence limit is 0.54,
$0.4 + 1.96\sqrt{0.0048}$.
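The calculation above can be sketched in Python as follows. This is a minimal illustration rather than part of the original method description: the function name normal_approx_limits is hypothetical, and the Normal coefficient is taken from the standard library's statistics.NormalDist instead of a printed table, so the 95% coefficient is about 1.96 rather than exactly the rounded tabulated value.

```python
from math import sqrt
from statistics import NormalDist


def normal_approx_limits(p_hat, n, confidence=0.95):
    """Approximate confidence limits for a proportion (Normal approximation).

    p_hat      -- estimated proportion subnominal
    n          -- sample size
    confidence -- confidence level as a fraction (e.g., 0.95)
    """
    # Two-sided Normal coefficient c (about 1.96 for 95% confidence).
    c = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    # Estimated variance of the estimate: p_hat * (1 - p_hat) / n.
    variance = p_hat * (1.0 - p_hat) / n
    half_width = c * sqrt(variance)
    return p_hat - half_width, p_hat + half_width


# Worked example from the text: p_hat = 0.4, n = 50, 95% confidence.
lower, upper = normal_approx_limits(0.4, 50, 0.95)
print(f"{lower:.2f} to {upper:.2f}")  # 0.26 to 0.54
```

Because the coefficient is computed rather than looked up, the same function handles combinations that standard tables omit, for example normal_approx_limits(0.4, 53, 0.85).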
As noted previously, the Normal
approximation is not always accurate. In
particular, it is not accurate if the sample
size is too small. Working rules have been
established (e.g., see W.G. Cochran. 1977.
Sampling Techniques. John Wiley and Sons)
regarding the minimum sample size that is
required when using the Normal
approximation. The required minimum sample
size is larger for estimates of the proportion
(values of $\hat{p}$) close to 0.0 or 1.0, and is
smallest if the estimate is 0.5 (Table 1). The
required minimum sample size for all
situations is never less than 30 and is as large
as 1400 when the estimated proportion is
0.05 (or 0.95). Clearly, the Normal
approximation must be applied with caution.
Whenever possible, the exact confidence
limits based on the Binomial distribution
should be used.
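For readers without access to Binomial tables, exact limits can also be computed numerically. The sketch below assumes the Clopper-Pearson construction, which is one common definition of "exact" Binomial limits; this document does not state which construction its tables and graphs use, so results may differ slightly from tabulated values. The function names binom_cdf and exact_binomial_limits are hypothetical.

```python
from math import comb


def binom_cdf(x, n, p):
    """Return P(X <= x) for X ~ Binomial(n, p).

    Uses exact integer binomial coefficients; adequate for modest sample
    sizes, but very large n can overflow when converted to floating point.
    """
    return sum(comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(x + 1))


def exact_binomial_limits(x, n, confidence=0.95):
    """Clopper-Pearson limits for x subnominal samples out of n (assumed method)."""
    tail = (1.0 - confidence) / 2.0

    def bisect(p_too_small):
        # Narrow [lo, hi] until it pins down the p where the test flips.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2.0
            if p_too_small(mid):
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2.0

    # Lower limit: the p at which P(X >= x) equals the tail probability.
    lower = 0.0 if x == 0 else bisect(lambda p: 1.0 - binom_cdf(x - 1, n, p) < tail)
    # Upper limit: the p at which P(X <= x) equals the tail probability.
    upper = 1.0 if x == n else bisect(lambda p: binom_cdf(x, n, p) > tail)
    return lower, upper


# Same data as the Normal-approximation example: 20 of 50 samples subnominal.
lower, upper = exact_binomial_limits(20, 50, 0.95)
print(f"{lower:.3f} to {upper:.3f}")
```

Unlike the Normal approximation, these limits are not forced to be symmetric about the estimate, and they remain within the range 0 to 1 even for small samples or estimates near 0.0 or 1.0.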
IV. Data Analysis for Stratified
Random Samples
The discussion of data analysis methods
up to this point has assumed that all locations
within the resource of interest have an equal
chance of being selected for sampling. This
may not always be the case. In some
situations, areas of special interest (strata)
may be identified and additional sampling
effort expended in these areas. Although
every location within a stratum may have an
equal chance of being selected for sampling,
a location within a special interest area would
have a higher chance of being selected than
a location outside the special interest area.
To ensure that estimates are unbiased, the
analysis of data from this type of sampling
design (referred to as a stratified random
sampling design) must account for the
different levels of sampling effort in the
different strata.
The recommended method for analyzing
data from stratified sampling designs is
straightforward and intuitively appealing.

Table 1. Minimum sample sizes required
for use of the Normal approximation.

  $\hat{p}$    Minimum sample size
  0.5          30
  0.4          50
  0.3          80
  0.2          200
  0.1          600
  0.05         1400

The method is illustrated in the following
paragraphs with a hypothetical example for
the case of two strata. The method,
however, is not limited to two strata and can
be extended to analyze data from a stratified
random sample with any number of strata.

Figure 28. Hypothesized resource divided into
two strata.
Figure 29. Stratified resource with 20 samples
in Stratum 1 and 10 samples in Stratum 2.
(Legend: sampling points in non-subnominal
area; sampling points in subnominal area.)
Suppose the resource being studied is
divided into two strata: a 200 square mile
area of special interest (labeled Stratum 1 in
Figure 28) and the remaining 800 square
miles of the resource (labeled Stratum 2).
The intention is to be able to characterize
the entire resource but also to characterize
just the area of special interest. For this
reason, suppose that most samples are
allocated to stratum 1. For this example,
the total sample size of 30 is split between
the two strata with 20 samples going to
stratum 1 and 10 samples going to stratum
2. Within each stratum, the sampling
locations are selected randomly (Figure 29).
Two steps are needed to estimate the
proportion of the total area (i.e., the entire
resource) that is subnominal in this case.
First, a separate estimate is computed for
each of the two strata (say $\hat{p}_1$ and $\hat{p}_2$) using
the methods described in the foregoing
sections. For this example, the estimate for
stratum 1 is based on 20 samples, and the
estimate for stratum 2 is based on 10
samples. The second step is to compute a
weighted average of these two estimates.
The weight associated with each stratum is
the ratio of the area of the stratum divided
by the total area of the resource. For this
example the weight for stratum 1 is 0.2 (i.e.,
200/1000) and the weight for stratum 2 is
0.8 (i.e., 800/1000). Therefore, the
weighted average ($\hat{p}$) is:
$\hat{p} = 0.2\,\hat{p}_1 + 0.8\,\hat{p}_2$
The resulting estimate for the entire
resource is unbiased.
A confidence interval for the
overall estimate can be calculated
based on the Normal approximation.
Using the previous example, the
estimated variance of the estimate from
stratum 1 is $[\hat{p}_1(1-\hat{p}_1)] / 20$, while the
estimated variance of the estimate from
stratum 2 is $[\hat{p}_2(1-\hat{p}_2)] / 10$. The
estimated variance of the weighted
average is the weighted sum of the
variances from the two strata:
$\mathrm{Var}(\hat{p}) = (0.2)^2 [\hat{p}_1(1-\hat{p}_1)/20] + (0.8)^2 [\hat{p}_2(1-\hat{p}_2)/10]$
Each weight used to compute the
overall variance is simply the square of
the corresponding weight that was
used to compute the overall proportion.
The lower limit of the confidence
interval is the weighted average
proportion minus the quantity equal to
the product of the Normal coefficient
(c) for the desired confidence level and
the square root of the variance of the
weighted average:
$\hat{p} - c\sqrt{\mathrm{Var}(\hat{p})}$
Similarly, the upper confidence limit is
the weighted average proportion plus
that same quantity. As with any use of
the Normal approximation, adequate
sample sizes in each stratum are
necessary for reliable results.
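The stratified calculation (area-weighted estimate, variance with squared weights, and Normal-approximation limits) can be sketched in Python as follows. The function name stratified_limits and the counts of subnominal samples (12 of 20 in stratum 1, 3 of 10 in stratum 2) are hypothetical, chosen only to illustrate the arithmetic; the areas and sample sizes are those of the example in the text.

```python
from math import sqrt
from statistics import NormalDist


def stratified_limits(strata, confidence=0.95):
    """Weighted estimate and Normal-approximation limits for stratified data.

    strata -- list of (area, n_samples, n_subnominal) tuples, one per stratum
    Returns (p_hat, lower, upper).
    """
    total_area = sum(area for area, _, _ in strata)
    # Two-sided Normal coefficient c (about 1.96 for 95% confidence).
    c = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    p_hat = 0.0
    variance = 0.0
    for area, n, x in strata:
        weight = area / total_area      # stratum area / total area
        p_h = x / n                     # estimate within this stratum
        p_hat += weight * p_h           # weighted average of the estimates
        variance += weight**2 * p_h * (1.0 - p_h) / n  # squared weights
    half_width = c * sqrt(variance)
    return p_hat, p_hat - half_width, p_hat + half_width


# Areas and sample sizes from the text; subnominal counts are hypothetical.
example = [(200, 20, 12), (800, 10, 3)]
p_hat, lower, upper = stratified_limits(example, 0.95)
print(f"{p_hat:.2f} ({lower:.2f} to {upper:.2f})")  # 0.36 (0.13 to 0.59)
```

The interval in this illustration is wide because stratum 2 carries 80% of the weight but has only 10 samples. A sample that small is well below the minimum sizes in Table 1, so in practice the Normal approximation would be questionable for data like these.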
V. Closing Comments
The methods described in this
document are appropriate for analyzing
data from simple random samples and
stratified random samples. However,
some R-EMAP programs use neither
simple random nor stratified random
sampling designs. In these cases, the
described methods should only be used
as a last resort, and will only produce
approximations. A more general
method of data analysis should be used
that is consistent with the complexity
of the sampling design. This more
general approach is conceptually similar
to the described methods, but is more
involved and requires that the
probability of selecting each location
and the probability of selecting every
pair of locations are known. This
additional information may not always
be available. If any doubt exists
regarding which method to use, the
EMAP Statistics and Design Team at
the EPA Corvallis Laboratory can make
the determination.
The hypothetical example referred
to throughout this document is
intended simply as an instructional tool.
The methods described are not limited
to analyzing data from estuaries. They
are appropriate whether the purpose of
sampling is to estimate the proportion
of the number of resource units (e.g.,
numbers of lakes), the proportion of
total length of a resource (e.g., miles of
streams), the proportion of area of a
resource (e.g., square miles of an
estuary), or the proportion of volume of
a resource (e.g., cubic meters of one of
the Great Lakes). The methods can be
applied without modification to each of
these situations. More detailed,
technical documentation on the
methods described in this document is
available from the EMAP Statistics and
Design Team in Corvallis.